Sequence-function Relationships in Phage-encoded Bacterial Cell Wall Lytic Enzymes and their Implications for Phage-derived Products Design

Phage (endo)lysins are thought to be a viable alternative to usual antibiotic chemotherapy to fight resistant bacterial infections. However, a landscape view of lysins’ structure and properties regarding their function, with an applied focus, is somewhat lacking. Current literature suggests that specific features typical of lysins from phages infecting Gram-negative bacteria (G−) (higher net charge, amphipathic helices) are responsible for an improved interaction with G− envelope. Such antimicrobial peptide (AMP)-like elements are also of interest for antimicrobial molecules design. Thus, this study aims to provide an updated view on the primary structural landscape of phage lysins to clarify the evolutionary importance of several sequence-predicted properties, particularly for the interaction with the G− surface. A database of 2,182 lysin sequences was compiled, containing relevant information such as domain architectures, data on the phages’ host bacteria and sequence-predicted physicochemical properties. Based on such classifiers, an investigation on the differential appearance of certain features was conducted. Such analyses revealed different lysin architectural variants that are preferably found in phages infecting certain bacterial hosts. Particularly, some physicochemical properties (higher net charge, hydrophobicity, hydrophobic moment and aliphatic index) were associated to G− phage lysins, appearing specifically at their C-terminal end. Evidences on the remarkable genetic specialization of lysins regarding the features of the bacterial hosts have been provided, specifically supporting the nowadays common hypothesis that lysins from G− usually contain AMP-like regions. IMPORTANCE Phage-encoded lytic enzymes, also called lysins, are one of the most promising alternatives to common antibiotics. The lysins potential as novel antimicrobials to tackle antibiotic-resistant bacteria not only arises from features such as a lower chance to provoke resistance, but also from their versatility as synthetic biology parts. Functional modules derived from lysins are currently being used for the design of novel antimicrobials with desired properties. This study provides a view of the lysins diversity landscape by examining a set of phage lysin genes. This way, we have uncovered the fundamental differences between the lysins from phages that infect bacteria with different superficial architectures, and, thus, also the reach of their specialization regarding cell wall structures. These results provide clarity and evidences to sustain some of the common hypothesis in current literature, as well as make available an updated and characterized database of lysins sequences for further developments.

Since the antibiotic pipeline started drying out, a worrying increase in the antibiotic 58 resistant fraction of bacterial populations has been reported (1, 2), and highly antibiotic-59 resistant population percentages have maintained (3,4). Thus, if the situation is set to 60 continue, the cost, both economic and in human lives, will be enormous due to the lack 61 of effective treatments (5, 6). This has prompted the interest in novel antimicrobials 62 development by many public health actors, such as international overseeing 63 organizations (7), public health and disease control agencies (3, 4), governments (8), 64 researchers, and several companies (9). Some of the current efforts to gather a new 65 antimicrobial armamentarium have led science towards bacteriophages (phages) (10, 66 11). 67 normal microbiota to be harmed (16,17), or, conversely, the possibility to have broad-93 range lysins, if needed (18); b) a lower chance to provoke the appearance of resistant 94 bacteria, which is speculated to be because of the essential nature of the highly 95 conserved peptidoglycan (this is, changes in its structure lead to a decreased fitness 96 and/or virulence) (19); c) neither adverse immune responses nor production of 97 neutralizing antibodies are expected, possibly due to the usual presence of phages -and 98 their products-among the normal cohabitating microbial populations in humans (20). 99 Moreover, lysins are amenable for protein engineering strategies (18,(21)(22)(23)(24). Typically,  However, a number of lysins also encompass intrinsic bactericidal activity on G-118 (32-34). This activity was first noticed for the T4 phage lysozyme (35) and several 119 Pseudomonas aeruginosa phage lysins (36). Such unexpected property was attributed to 120 non-enzymatic mechanisms, previously described in partially denatured hen egg-white 121 lysozyme (37), and relies on the presence of antimicrobial peptide (AMP)-like 122 subdomains within such lysins, usually at a C-terminal position (32,38). Recently, it 123 has been suggested that such AMP-like elements are widespread among lysins from 124 phages infecting G−, and that they might cooperate to host lysis by providing an 125 additional affinity towards the cell wall, because of their high net charge (28,33, 41). Since most lysins from G− are assumed to be monomodular, such AMP-like 127 elements are thought to be an alternative to the CWBDs found in multimodular G+ 128 phage lysins for substrate binding. However, it has not been yet properly examined how 129 widespread this trait would actually be, and, therefore, its true functional and 130 evolutionary implications are largely unknown. Of note, such AMP-like elements have 131 been successfully used to design AMPs active on their own (36,39,42).

132
To uncover the actual evolutionary relevance of these AMP-like elements, as well as

152
Outline. A total of 9,539 genomes were prospectively obtained from the National 153 Center for Biotechnology Information (NCBI) database (retrieved on April, 2020).

154
After a careful curation process (for details, see Methods), the final database contained 155 2,182 proteins and a total of 3,303 Pfam (PF) hits (Table S1 in the supplemental 156 material). Each of these sequences was associated with a bacterial genus corresponding 157 to its described host, for which data on its Gram group and peptidoglycan chemotype 158 was added (Table S2 in the supplemental material). In total, our database comprised 159 phage lysins from 47 bacterial genera, accounting for up to a total of 2,179 sequences, 160 plus three lysin sequences from PRD1-like phages that infect several enterobacteria.

161
Taking into account all of the identical sequences, the 2,182 different sequences of our 162 data set correspond, in fact, to 36,365 entries in the NCBI Reference Sequence database 163 (RefSeq; release 202) (44).

164
General differences among lysins. For 1,512 out of 2,182 sequences (69.3%), only 165 one significant PF hit could be predicted (Fig. 1A). This was especially relevant for 166 lysins from phages infecting G-, given that 90.6% of these proteins were predicted to 167 contain a single functional domain. Near 60% of the lysins from phages infecting G+ 168 (for the sake of this work, mycobacteria and their relatives like Rhodococcus or 169 Corynebacterium were included among G+), harboured only one functional domain.

170
Few lysins appear to contain  4 PF hits (Fig. 1A). However, these figures should be

184
As a whole, however, the differential relative amount of single and multiple PF hits 185 sequences between G and G+ phage lysins ( Fig. 1A and E) can be taken into account, 186 in accordance with the usual proposal that G-lysins are typically monomodular, while 187 G+ ones are multimodular (46). This is further supported by the evident difference in 188 protein length distributions (Fig. 1B), where G+ phage lysins tend to be larger (median 189 = 317 aa residues) than G-ones (median = 164 aa residues); and also by the differential 190 distribution of sequence lengths before and after the predicted EADs ( Fig. 1C and 1D).  (Table 1). From the total 3,303 PF hits analysed, 2,460 corresponded to 212 phages infecting G+ bacteria. 2,243 (1,477 G+; 766 G-), 1,054 (982 G+; 72 G-), and 6 213 (G-) corresponded to EADs, CWBDs, and structural domains, respectively (the sources 214 for domains classification as EAD, CWBD or structural, can be consulted at Table S3 in 215 the supplemental material). When the differential Gram group classification of each PF 216 hit was analyzed, it was found that EADs like Amidase_5, Glyco_hydro_25, other genera-specific traits, as discussed above. Some specificities could be found 288 though (e.g., Amidase02_C appears only in phages that infect A1 bacteria or 289 PG_binding_3 only in A1), and some CWBDs that are widespread among different 290 chemotypes could also be observed (PG_binding_1, LysM, SH3_5). In general, 291 however, it cannot be stated that peptidoglycan composition is a major determinant for 292 CWBD specificity, except for some cases such as, for example, ZoocinA_TRD domains, 293 which has been proposed to bind A3 with two Ala residues at the cross-link (60). The 294 poor performance of chemotype as an a priori predictor of the CWBD PF family ligand 295 is more clearly evident if we consider the CWBD types which appeared widespread   Additional information could be drawn from this analysis when applied to the 316 different catalytic activities detected (Fig. 5C). First of all, NAM-amidases were the 317 most represented type of domains and also those that appeared among more different 318 taxonomic groups and chemotypes, even more so than lysozymes. Indeed, Amidase_2, 319 the most abundant PF domain in our dataset (638 hits), appeared both in lysins from G+ 320 and G− phages. The SSN in Fig. S3 in the supplemental material shows, however, that 321 although Amidase_2 seems a rather diverse group, with various observable similarity  Another interesting remark is that peptidase activities were more common amongst 331 lysins from phages infecting bacteria with subgroup A1 peptidoglycans which, in turn, 332 display the simplest cross-linkage of all types, lacking an interpeptide bridge. Thus, 333 peptidases were not uncommon among G-, and were also present in A1 phages from 334 G+ (especially mycobacteriophages, but also listeriophages and phages from 335 Clostridium, Bacillus or Corynebacterium). On the other hand, amidase/peptidases, 336 which is the label given to CHAP domains (Table S3 in (Fig. 6C), NCPR was the most relevant variable to distinguish 362 between G+ and G-, followed by average hydrophobic moment and aliphatic index and, 363 finally, hydrophobicity. In general, these results suggest that lysins from phages that 364 infect G+ and G-can in fact be differentiated by their physicochemical properties in a 365 relatively efficient manner. For visualization of the differences between G+ and G−, a 366 multidimensional scaling (MDS) plot based on the proximity matrix from the random 367 forest model was drawn (Fig. 6D). Such plot showed the clustering of G− lysins within 368 the 2-dimensional space based on the physicochemical variables, while G+ ones seemed 369 to be more dispersed. A qualitative interpretation of this result may reflect the 370 aforementioned wide diversity of functional modules and architectures of G+ lysins 371 (Fig. 1D, Fig. 4)   Moreover, the average prediction of local net charge suggested that such difference is 378 mainly located at the C-terminal part of G− lysins (Fig. 7B). A more thorough

385
Hydrophobicity was also higher in G-lysins, but the difference regarding G+ ones is 386 smaller (ζ = 0.36). This might be related to the rather inconsistent pattern shown by 387 average local hydrophobicity and sequence quartiles comparison (Fig. 7BC). G-lysins 388 tended to have a more hydrophobic N-terminal part, whereas at the C-terminal moiety 389 the tendency was reversed, something that can be explained by the relative abundance 390 of positively charged residues shown before for G-. It is at the third quartile (Q3), 391 immediately before the high positive net charge patch described above, where the 392 difference was statistically more relevant (p ≤ 0.001; ES = 0.59), with higher values in 393 G-s. There was also a statistically significant difference in the average hydrophobic 394 moment distributions between G+ and G-phage lysins. For the G-group, the local plot 395 (Fig. 7B) showed a higher tendency to present greater hydrophobic moments along the 396 whole protein length but the N-terminal part. Analysis of sequence quartiles confirmed 397 a statistically significant superiority of average hydrophobic moment for G-except at 398 N-terminal. The aliphatic index was also significantly higher in G-, although G+ 399 showed an aliphatic index peak at their C-terminal part that surpassed that of G-400 (coincidental with G-basic aa peak, which, understandably, would lower both 401 hydrophobicity and aliphatic index at Q4) (Fig. 7C).

402
Taking all these observations together with the results thrown by the random forest 403 prediction, we can conclude that the physicochemical difference between lysins from 404 phages that infect G+ or G-bacteria is specified as a higher positive net charge of G-, 405 particularly at C-terminal end, combined with a greater propensity in incorporating 406 aliphatic aa and likely resulting in amphiphilic structures.

407
A closer examination of net charge (and C-terminal net charge) of lysins from G-408 infecting phages indicated that the high positive patch trait seems specific to some 409 domain families. As a whole, a statistically significant higher NPRC value was found in 410 lysins bearing Phage_lysozyme, Hydrolase_2 and Glyco_hydro_19 domain families 411 (Fig. 8). At the C-terminal part, higher NCPR was found in lysins bearing the same 412 domains mentioned above, but also in SLT and Muramidase. The average local net 413 charge tendency showed for each EAD group (Fig. S4 in the supplemental material) 414 confirmed that a local high positive charge peak appears in the protein part immediately 415 before the C-terminal apex. 416 Interestingly, all of the aforementioned domains that present a higher, positive charge 417 patches at their C-terminal part were preferentially present in lysins from phages that 418 infect G-bacteria (Table 1). This observation provides a basis to argue a generalized 419 evolutionary tendency in G-infecting phages towards developing AMP-like  c) The differential appearance of EAD families within phages that infect bacteria 475 with a certain chemotype, which suggests an adaptation of the enzyme to the 476 structure of the specific peptidoglycan it has to degrade. This is notable in the 477 case of peptidases. The somewhat wide range of peptidases identified within our 478 data set is mainly distributed among phages infecting bacteria with subtype A1 479 peptidoglycan. In phages that infect subtype A3 bacteria, the most common 480 EAD is CHAP, which has been shown to function either as NAM-amidase or as 481 endopeptidase, in any case, specific for A3 peptidoglycan.  screened for gene products whose annotations could suggest them to be lytic enzymes. Therefore, 508 keywords such as 'lysin', 'lysozyme', 'murein', 'amidase', 'cell wall hydrolase', 'peptidase' or 509 'peptidoglycan' were used as inclusion criteria, while 'structural ', 'tail', 'holin', 'baseplate' or 'virion 510 protein' were used as exclusion terms to try and avoid misidentifications. Associated information such as 511 taxon of the bacterial host, aa sequence, annotations, phage denomination, and protein/genome unique 512 identifiers were also added into the database.

513
Curation included: 1) a sequence length cutoff, established with a minimum of 50 and a maximum of 514 550 aa residues; 2) a sequence identity cutoff using CD-HIT (68) with default parameters and a 98%  Table S1 in the supplemental material and at Digital.CSIC (71).

520
Physicochemical properties prediction and analysis. Prediction of physicochemical properties (net 521 charge, aliphatic index, hydrophobicity, hydrophobic moment) based on the aa sequences retrieved were 522 performed using the R package 'Peptides' implementation (72). Dawson's pK a scale was used for 523 prediction of net charge assuming pH = 7.0 (73); hydrophobicity scale was that proposed by Kyte and 524 Doolittle (74) and hydrophobic moment was calculated as previously proposed (75)

530
A random forest algorithm was used to check the ability of physicochemical properties to predict lysin 531 sequences as from a G+ or G-infecting phage. R package 'caret' was employed for creating, fitting and 532 testing the random forest, and further analyses on the model (ROC curve, MDS plotting) were performed 533 using packages 'pROC' and 'randomForest'. The dataset was randomly partitioned into a training subset 534 (75% of all entries) and a testing subset. The training subset was used to fit the random forest parameters 535 (namely, the randomly selected variables for each node, which was fixed in 4) by a 5-fold cross-536 validation with 3 repeats. Then the constructed random forest was validated using the previously defined 537 testing subset.

538
Sequence similarity networks. SSNs were generated for visually assessing the similarity clustering 539 of sequence sets. For this purpose, the Enzyme Similarity Tool from the Enzyme Function Initiative 540 server (EFI-EST) was employed (76). Briefly, this tool performs a local alignment from which every 541 possible pair of sequences receives a score similar to the E-value obtained from a typical BLAST 542 analysis. A threshold score value was selected for each SSN so that below such threshold sequence pairs 543 were considered nonsimilar and, therefore, the pair would not be connected in the resulting 544 representation. Scores were selected so that sequence pairs whose similarity was below 30-40% were 545 deemed non-similar. The SSN graphs were produced Cytoscape 3 with yFiles organic layout (77).   Vázquez was the recipient of a predoctoral fellowship from CIBERES.

575
The authors gratefully acknowledge Guillermo Padilla for his fundamental advice in statistical data 576 treatment and representation.

577
The authors have declared that no competing interests exist.

578
Authors contribution statement: RV and PG conceptualization; RV performed the analysis and 579 constructed the database; RV and EG data curation; RV wrote the original draft of the paper; all authors 580 read, edited and approved the final manuscript.  Menéndez M, García P. 2013. 624 Improving the lethal effect of Cpl-7, a pneumococcal phage lysozyme with broad 625  Table 1). In 815 distribution charts (B, C, D) Y-axis shows an estimation of the distribution density.