Bias of library preparation for virome characterization in untreated and treated 1 wastewaters

1 The use of metagenomics for virome characterization and its implementation for 2 wastewater analyses, including wastewater-based epidemiology, has increased in the 3 last years. However, the lack of standardized methods can led to highly different results. 4 The aim of this work was to analyze virome profiles in upstream and downstream 5 wastewater samples collected from four wastewater treatment plants (WWTPs) using 6 two different library preparation kits. Viral particles were enriched from wastewater 7 concentrates using a filtration and nuclease digestion procedure prior to total nucleic 8 acid (NA) extraction. Sequencing was performed using the ScriptSeq v2 RNA-Seq (LS) 9 and the NEBNext® Ultra™ II RNA (NB) library preparation kits. Cleaned reads and 10 contigs were annotated using a curated in-house database composed by reads assigned 11 to viruses at NCBI. Significant differences in viral families and in the ratio of detection 12 were shown between the two library kits used. The use of LS library showed 13 Virgaviridae , Microviridae and Siphoviridae as the most abundant families; while 14 Ackermannviridae and Helleviridae were highly represented within the NB library. 15 Additionally, the two sequencing libraries produced outcomes that differed in the 16 detection of viral indicators. These results highlighted the importance of library


Introduction
The reuse of water, including for irrigation, cooling, and other non-potable applications is an emerging topic due to climate change and water scarcity.Treatment and regeneration of household sewage water in urban regions are usually performed by wastewater treatment plants (WWTPs); however, they are not always able to completely eliminate the microbiological risks present in treated wastewaters (Chalmers et al., 2010;Randazzo et al., 2019;Sano et al., 2016).Fecal bacteria have traditionally been used as indicators for the presence of pathogenic microorganisms even though they fail to detect the presence of human pathogenic enteric viruses (Eslamian, 2016;Gerba et al., 2013;Kitajima et al., 2014).Thus, several viruses (i.e.crAssphage, Pepper mild mottle virus, adenovirus, polyomavirus, …) have been proposed as indicators because of their similarity to pathogenic viruses in terms of environmental stability and resistance to wastewater sanitation treatments (Farkas et al., 2020).The presence of human enteric viruses in treated wastewaters has been well documented (Gerba et al., 2018;Randazzo et al., 2019;Sano et al., 2016), posing public health risk-related concerns also because of their stability into the environment.Thus far, nearly one hundred different types of human enteric viruses are known, which cause a variety of illnesses and diseases in humans (Fong and Lipp, 2005), primarily gastroenteritis and hepatitis, and new pathogenic strains and species continue to be discovered.Among others, the viruses most commonly detected in untreated and treated wastewaters include human norovirus, adenovirus (AdV), enterovirus (EV), sapovirus (SaV), astrovirus (HAstV), rotavirus A (RV), and hepatitis A and E viruses (HAV and HEV) (Haramoto et al., 2018).Surveillance of human enteric viruses in untreated and treated wastewaters is performed by molecular procedures (e.g., real time PCR (qPCR) or digital PCR (dPCR)) (Haramoto et al., 2018).Currently, a wastewater-based epidemiology surveillance has been globally implemented to monitor COVID-19 disease, caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) with notable implications for public health response in local settings (Bivins et al., 2020;Polo et al., 2020).These approaches require reference sequences for primer and probe design which limit the number and variety of viruses to be analyzed.
Alternatively, recent shotgun or untargeted metagenomic approaches enable the simultaneous identification of viral sequences from a sample, referred to as 'virome', which is a diverse community of mainly eukaryotic RNA and DNA eukaryotic viruses and bacteriophages.Virome characterization in wastewater provides a potential solution to the challenges associated with the traditional surveillance of viruses in sewage (Nieuwenhuijse and Koopmans, 2017).
In this pilot study, we have used metagenomics analyses using two different library preparation kits for metagenomic sequencing to characterize the virome composition in influent and effluent samples from four different WWTPs.Thus, the objectives of this study were to: 1) evaluate different sequencing libraries for virome characterization; and 2) investigate virome distribution and diversity in influent and effluent samples.

Sample processing
Five-hundred mL of influent (IW) and effluent (EW) grab samples from four different WWTPs were collected in November 2018 in Valencia (Spain).Treatment plants differed in the number of equivalent population, the volume of treated wastewater and the disinfection treatments (Table S1).Escherichia coli counts, expressed as Most Probably Number (MPN), were performed using the Colilert ® kit (IDDEX Laboratories, Spain) following the ISO 9308-2:2012 standard on the same sampling day.Samples were kept for further analyses at -20°C and thawed for 12 h at approximately 20 °C before processing.After thawing, 200 mL of each sample were inoculated with 7 log PCRU/L of mengovirus (MgV) vMC 0 (CECT 100000), used as a process control.Samples were processed using the aluminum-based precipitation protocol described elsewhere (AAVV, 2018;Randazzo et al., 2019).Briefly, 200 mL of sample was adjusted to pH 6.0.The Al(OH) 3 precipitate was performed mixing 1 part of AlCl 3 0.9N per 100 parts of sample and the solution was mixed at 150 rpm for 15 min.Then, samples were centrifuged at 1,700 x g for 20 min and the pellet was resuspended in 10 mL of 3% beef extract (pH 7.4) and shaken at room temperature (RT) for 10 min at 150 rpm.Finally, samples were centrifuged for 30 min at 1,900 x g and the resulting pellet was resuspended in 1 mL phosphate saline buffer (PBS, pH 7.4) and stored at -80°C.

Sample processing for metagenomics
Viral particles were enriched from sample concentrates (n=8) following the NetoVIR protocol, which includes both filtration and nuclease digestion steps (Conceição-Neto et al., 2015).In brief, 500 µL of concentrates were homogenized using the MP FastPrep24 5G (MP Biomedicals, Spain) for 40 seconds at a speed of 6.0.The homogenate was centrifuged at 16,000 × g for 3 min and 200 µL of the supernatant was filtered through a 0.8 μm PES filter (Sartorius, UK) to remove large particles.The filtrate was incubated with benzonase (Millipore, Spain) and microccocal nuclease (New England Biolabs, USA) enzymes at 37°C for 2 h to degrade free nucleic acids.Capsid protected viral nucleic acids were extracted with the NucleoSpin ® RNA virus kit (Macherey-Nagel GmbH & Co., Germany), according to the manufacturer's instructions, without adding carrier RNA.Thus, both DNA and RNA viral nucleic acids were concomitantly extracted.Nucleic acids were eluted in 50 μL RNase-free water.Libraries were generated from 1 to 50 ng of a DNA-RNA sample using two different library preparation kits.The first library preparation kit was the ScriptSeq v2 RNA-Seq Library Preparation Kit (Illumina, USA), referenced as LS, with slight modifications.An initial denaturation step (95 °C for 5 min) was added to the protocol, and PCR cycles were increased to 20 to obtain enough library concentration to sequence.Additionally, the RT enzyme from the original library preparation kit was substituted by the AMV Reverse Transcriptase (Promega, Spain).The second library preparation kit was the NEBNext® Ultra™ II RNA Library Prep Kit (New England BioLabs Inc., Ipswich, UK) (referenced as NB) following manufacturer's instructions.The two libraries compared in this study differ in terms of fragmentation times, enzymes, cDNA synthesis conditions, primers used in the PCR, as well as the conditions for aforesaid amplification (Table S2).
Libraries were normalized, pooled, and sequenced using the NextSeq™ 500 platform (Illumina), following the manufacturer's protocol, with a configuration of 150 cycles paired-end reads.Sequencing was performed by Lifesequencing S.L. (Valencia, Spain).

Data analyses
Obtained reads were cleaned for adaptor removal using cutadapt software (Martin, 2011) with a minimum overlap of 5 nucleotides between read and adaptor and a maximum error rate of 0.1.Reads were cleaned with the reformat.shscript from BBMap software (sourceforge.net/projects/bbmap/) in order to remove nucleotides from both ends with Phred scores lower than 20 and reads shorter than 50 bp.Cleaned reads were merged in to single reads with FLASH v1.2.11 (Magoč and Salzberg, 2011) allowing outies.Additionally, cleaned reads were assembled with Ray 2.3.1.(Boisvert et al., 2012) using 31-mers.
Merged reads and contigs were taxonomically annotated using BLASTn algorithm (Boratyn et al., 2013) with a manually curated in-house database constructed with all the viral sequences (NCBI:txid10239; release May, 5 2020) available at GenBank (https://www.ncbi.nlm.nih.gov/nuccore/?term=viruses%5Borganism%5D).For the BLASTn analysis of viral reads against this curated in-house database, a cut off of 70% of query sequence coverage and 80% of identity was used, respectively.Rarefaction curves and diversity indexes Shannon and Simpson were calculated with R package vegan v2.5-6.

Virus quantification
For virus quantification an optimized viability RT-qPCR was applied as previously described (Randazzo et al., 2019).In brief, 150 μL sample concentrates were added to 50 μM PMAxx (Biotium, USA) and 0.5% Triton 100-X (Thermo Fisher Scientific, Spain) and incubated in the dark at RT for 10 min at 150 rpm.Then, samples were exposed to photo-activation using a photo-activation system (Led-Active Blue, GenIUL, Spain) for 15 min.RNA was extracted using the NucleoSpin® RNA virus kit (Macherey-Nagel GmbH & Co.) according to the manufacturer's instructions including the Plant RNA Isolation Aid (Ambion, Spain) pretreatment.Primers, probes and RT-qPCR conditions for norovirus GI, norovirus GII, RV, HAV, HEV, mengovirus and HAstV quantification have been previously reported (Randazzo et al., 2019, Cuevas-Ferrando et al., 2020).
For crAssphage quantification by qPCR, the primer set CPQ_064 described by Stachler et al. (2017) was used.PCR conditions were an initial denaturation step of 30 seconds at 95˚C followed by 45 cycles of 5s at 95˚C and 30s at 60˚C.The Premix Ex Taq master mix for probe-based real-time PCR kit (Takara, France) was used for the reaction.For the crAssphage quantification, the standard curve was performed with a customized gBlock® fragment (Integrated DNA Technologies, Spain) of 228 bp that contained the crAssphage sequences used for amplification.
Limit of quantification, qPCR efficiency and standard curve R 2 values for all the tested genes are displayed in Table S3.For all RT-qPCR assays, undiluted and ten-fold diluted RNA was tested to check for RT-qPCR inhibitors.

Correlation and similarity analyses
Correlation analyses were carried out between data sets obtained by both libraries at family level, and between metagenomics and RT-qPCR results using the R package Hmisc v4.2-0 (https://CRAN.R-project.org/package=Hmisc) and applying the Spearman method (ρ).Significance was set at 0.05.Representation of correlation matrix values was performed with the R library corrplot v0.84 (https://CRAN.Rproject.org/package=corrplot).
For each individual sample, the Jaccard index was used to analyze the similarity among results obtained with both libraries.Calculations were performed using R package betapart v1.5.2 (Baselga, 2010) taking into account the beta.JAC values representing the overall beta diversity for each sample pair.

Overview of bias due to library preparation
Each concentrated sample was sequenced using two sequencing libraries: the ScriptSeq v2 RNA-Seq Library Preparation Kit (LS) and the NEBNext® Ultra™ II RNA Library Prep Kit (NB).The average number of reads was 3.2 and 11.5 million for LS and NB libraries, respectively.Rarefaction analyses showed that 5 out of 8 samples sequenced by the LS library reached the plateau, while 2 out of 8 samples sequenced by NB library reached it.Despite that, remaining samples were close to stabilization with both libraries (Fig. S1).Merged viral reads were annotated through a BLASTn comparison with the curated in-house database that comprised all the viral sequences (CDS and complete genomes) available at GenBank.For the LS library, the percentage of viral reads ranged from 0.6% to 2.4% in influent and from 0.4% to 4.4% in effluent samples.
For NB library, the BLASTn analysis showed a high number of sequences ascribed to the same taxon, suggesting an overrepresentation due to sequencing bias, representing between 33 and 60% of the total viral reads.For that reason, the relative calculations of subsequent analyses were made also taking into account this overrepresentation.These corrected calculations will be called NB-corrected.For the NB library, viral reads ranged from 38% to 58% in influent and from 14% to 24% in effluent samples.Taking into account the results of NB-corrected, these percentages ranged from 9% to 12% and from 7% to 12% in influent and effluent samples, respectively.Shannon and Simpson diversity indexes were calculated for each type of sample (influent or effluent) and for each library (LS or NB) (Fig. 1).Shannon indexes were higher in influent samples sequenced with LS library (mean values of 3.85±0.33for LS and 1.30±0.12for NB); however, for effluent samples both indexes showed similar means (1.81±1.21for LS and 1.90±0.2for NB), being effluent samples sequenced with LS library more variable (0.36-3.07).Similar results were obtained for Simpson index, even though the mean values for influent samples were highly different (0.93±0.02 for LS and 0.62±0.05for NB).
Raw data was deposited at SRA under the Bioproject PRJNA67378 with the following accession numbers: SAMN16633937-SAMN16633944 for ScriptSeq v2 RNA-Seq Library Preparation Kit samples and SAMN16634071-SAMN16634078 for NEBNext® Ultra™ II RNA Library Prep Kit samples.

Mengovirus recovery
Mengovirus (MgV) was used as a process control to analyze the performance of each library to recover reads and the entire genome of MgV.Its recovery, represented as the percentage of viral reads and the percentage of MgV isolate M genome (L22089.1)obtained for each sample with each library, was different depending on the library used.
For LS library, the percentage of viral reads of MgV ranged from 0.05% to 0.79% in influent and from 0.35% to 3.68% in effluent samples.For NB libraries, these values ranged from 0.01% to 0.16% in influent samples, and from 0.63% to 5.77% in effluent samples.However, the percentage of MgV reads with the NB-corrected values were higher in effluent samples, ranging from 1.38% to 11.38%.For the analysis of the recovery of MgV genome, assembled contigs belonging to this species were compared with the genome of Mengovirus isolate M (L22089.1).LS library genome recovery ranges from 6.0% to 95.1%.The highest recovery was obtained in the sample IW3.On the other hand, the coverage of this genome by the NB library ranged from 98.4% to 100% (EW3).

Virome comparison
Regarding the virome composition for each library at family levels, results showed high differences between the two approaches (Fig. 2).While the most represented families with the LS library were Virgaviridae, Microviridae and Siphoviridae; the most abundant families with the NB library were Ackermannviridae and Helleviridae.These differences allowed the detection of some viral families depending on the library used.
For example, families as Rhabdoviridae, Pospiviroidae or Mitoviridae were only detected when the NB library was used for sequencing.Also the taxa uncultured human fecal virus (NCBI:txid239364) and uncultured marine virus (NCBI:txid186617) were only detected with the NB approach.Regarding the families detected with both libraries, only Genomoviridae showed total correlation (ρ=1) between values obtained with both libraries in influent and effluent samples.For influent wastewater samples, families Nairoviridae and Virgaviridae showed total correlations; however, this correlation was only observed for Parvoviridae in effluent samples.Other families showed high correlations (ρ=0.8) in influent wastewaters, as Peribunyaviridae and Picornaviridae; while Podoviridae, Poxviridae, Reoviridae and Virgaviridae families showed high correlations (ρ=0.8) in effluent samples.Jaccard indexes showed similarities between the same sample sequenced with each library that ranged from 0.76 (IW1) to 0.91 (IW2), with mean values of 0.83±0.09for IWs and 0.81±0.04for EWs.

Analyses of viral fecal indicators by NGS and correlation with enteric viruses detected by RT-qPCR
Each of the libraries used in this study showed different power for detecting fecal indicators.Similarly, influent and effluent samples showed different detections rates (Fig. 3A).For example, LS library detected CrAssphage with read percentages higher than 1% but these percentages decreased to less than 0.01% with NB library.Most importantly, LS library was unable to detect the fecal indicator adenovirus in effluent wastewaters, while the NB library detected adenoviruses in percentages between 0.1-1%.The same scenario was observed for the Picobirnavirus indicator.However, indicators families as Inoviridae, Microviridae, Myoviridae and Podoviridae showed a better detection with the LS library.Siphoviridae family detection did not show differences in its detection capacity between the two different tested libraries, with the exception of sample EW3 (Fig. 3A).
Correlation between the number of reads of proposed viral indicators obtained with both libraries and the quantifications obtained by RT-qPCR for enteric viruses along with the E. coli counts were calculated.Figure 3B  Interestingly, the proposed indicator AdV showed negative correlations with all the enteric viruses analyzed (ρ>-0.7),with the exception of HAV and HEV.Results obtained by NGS for the pepper mild mottle virus (PMMoV), also proposed as viral indicator, showed no correlation with any of the enteric viruses.

Virome comparison between influent and effluent wastewaters
Clean reads obtained from each library and each sample were assembled and contigs longer than 200 bp were used for taxonomical classification by using BLASTn algorithm and the in-house database.Due to the different results observed at reads level for each library, contigs classification obtained with both libraries for each sample were merged for results representation and virome analysis.The relative abundances of different taxa are shown in Fig. 4. As observed in the heatmap graphic, the most abundant viruses were bacteriophages, as Dickeya phage or Listeria phage WIL-3, even with higher percentages in effluent samples.The higher detection of these phages in treated samples, as occurred with other species (i.e.cucumber green mottle mosaic virus, EBPR podovirus 2, PMMoV, Stealth virus 1, or Tobacco and tomato mosaic viruses), can be due to the decrease of other viruses after wastewater treatment that allows the its detection.Similarly, this effect could be the responsible of the detection of some viruses in effluent samples that were not detected in influent samples, as the case of human adenovirus and human gammaherpesvirus.Some viruses were only found in high percentages in influent samples, as the indicator crAssphage, some Aeromonas phages, Escherichia phages or viruses belonging to the Microviridae family.
Wastewater treatments could be the factor that produce this decrease; however, more studies along time from the same WWTPs must be performed in order to ensure the effect of performed treatments.

Discussion
The virome of wastewaters have been previously characterized from samples collected around the world (Adriaenssens et al., 2018;Aw et al., 2014;Cantalupo et al., 2011;Fernandez-Cassi et al., 2018;Furtak et al., 2016;Nieuwenhuijse et al., 2020;Rusiñol et al., 2020;Strubbia et al., 2019aStrubbia et al., , 2019b;;Wang et al., 2018); however, much less is known about the virome of effluent samples as only one study has analyzed two effluent samples collected in the UK (Adriaenssens et al., 2018).As far as we know, this is the first study that concomitantly analyzes the RNA and DNA viruses present in influent and effluent samples besides providing a comparison of viruses profiles detected using different sequencing library kits.
Results obtained in our study showed high differences regarding not only viruses, but also the power of detection of viral fecal indicators.Both aspects are important for the use of random metagenomics as tool for specific detection.Our results evidenced the influence of the library used for virome studies together with their variability.Additionally, by using MgV as process control for both metagenomic and RT-qPCR analyses, we further assessed the sensitivity of each library, being higher when using NB library.Recoveries of MgV complete genome were between 6.0 and 95.1% for LS library and between 98.4 and 100% for NB library.In contrast, in a recent study, MgV reads were not recovered from spiked water and sediment samples (Adriaenssens et al., 2020).According to the authors, this was likely due to an inclusion of an inactivation step of the DNase at 75°C, which potentially exacerbated the effect of the RNase step (Adriaenssens et al., 2018).The use of models of a virus of interest when comparing sequencing libraries can be an excellent tool for the library selection.
For the analysis of the virome of influent and effluent wastewaters, results obtained by both libraries were merged.Phages as crAssphage, Aeromonas phages, Escherichia phages or viruses belonging to the Microviridae family were found in high percentages in influent wastewaters.The absence of these viruses in effluent samples can be due to the sanitation treatments applied in WWTPs, even though further analysis that includes a wider sampling design needs to be performed.These results are in line with previous studies showing a high abundance of bacteriophages families (Aw et al., 2014;Cantalupo et al., 2011;Fernandez-Cassi et al., 2018;Rusiñol et al., 2020;Wang et al., 2018) in influent sewage samples.Nevertheless, other studies showed Virgaviridae as the most represented viruses (Furtak et al., 2016).Differences in virome profiling with other studies might be due to the influence of library sequencing and the intrinsic characteristics of the virome related to the sample itself and the area of study.On the other hand, the higher presence of some viruses or even its detection only in effluent samples could be produced by the decrease of other viruses that allowed its detection.
The presence of pathogenic viruses is an important aspect for defining the final use of treated waters as it may be the case of irrigation.Due to their high environmental resistance, the presence of human enteric viruses has been reported in treated wastewaters (Adriaenssens et al., 2018).However, some of these pathogenic viruses are not always detected by metagenomics analyses.For instance, in the study by Fernández-Cassi et al. (2018), human adenoviruses (HAdV) reads were not detected in samples concentrated from 10 liters of wastewater.Adenoviridae was also not detected in the study of Adrianssens et al. (2018), in which the sample was concentrated from 1 liter of wastewater.In our study, concentrating 200 mL of effluent samples, we were able to detect HAdV in percentages between 0.16% and 0.35%.In contrast, percentages of HAdV in influent wastewaters were lower than 0.01%.Overall, the majority of the annotated virome belonged to bacteriophages.This indicates that metagenomics is poor in sensitivity when used to detect a low abundance of viral pathogens against a large background of bacteriophages, as occurred for the enteric viruses detected by viability RT-qPCR.For example, in the present study, norovirus genomes could not be retrieved from the reads as reported elsewhere (Adriaenssens et al., 2018;Fernández-Cassi et al., 2018;Strubbia et al., 2019b).In the current study, the number of generated paired reads per sample was 3.2 and 11.5 million for LS and NB, respectively; while Adriaenssens et al., (2018) reported between 10 and 110 million, increasing significantly the probability to retrieve full or partial viral genomes.Alternatively, methods to detect and characterize specific viruses have been described and rely on the selection of target RNA prior to library preparation through a capture using VirCapSeq-VERT target enrichment, as reported for norovirus (Strubbia et al., 2019b).

Conclusion
The use of metagenomics for virome characterization and its implementation for wastewater analyses has increased in the last years ( Nieuwenhuijse and Koopmans, 2017).However, the major problem of this approach is the lack of standardized procedures and the substantial differences among studies; thus, available data must be interpreted with caution.The present study showed a procedure that allows the detection and the characterization of viral populations in untreated and treated wastewater samples.Overall, this study sheds light on the diversity of the viral communities in untreated and treated wastewaters providing valuable information also in terms of viral fecal indicators.The study also evidences the bias on virome profiles obtained by tested sequencing libraries.Our results underline the need for further studies to elucidate the influence of sequencing procedures in virome profiles in wastewater matrices in order to improve the knowledge of the virome in the environment.

Figure legends
shows the Spearman values of correlation (ρ) calculated with 95% confidence.Norovirus GI and GII showed high correlation values (ρ>0.8) with the indicators crAssphage, Picobirnavirus and Inoviridae.The highest correlation values (ρ=0.7) between RV and indicators reads were crAssphage, Inoviridae and Microviridae.For HAstrV, high correlations (ρ>0.8)only occurred with crAssphage and Myoviridae.HAV and HEV did not correlate with any of the indicators.

Figure 1 :
Figure 1: Shannon and Simpson diversity indexes for viral species in influent (IW) and effluent (EW) samples processed by using ScriptSeq v2 RNA-Seq Library Preparation kit (LS) and NEBNext® Ultra™ II RNA Library Prep Kit libraries (NB) for metagenomics characterization.

Figure 2 .
Figure 2. Relative abundance at family level of the viral population detected in influent and effluent samples from four different WWTPs by metagenomics with ScriptSeq v2 RNA-Seq Library Preparation kit (LS, green) and NEBNext® Ultra™ II RNA Library Prep Kit (NB, orange).

Figure 3 .
Figure 3. Viral indicators analysis in influent (IW) and effluent (EW) samples.Panel A, Detection of viral indicators with ScriptSeq v2 RNA-Seq Library Preparation kit (LS) and NEBNext® Ultra™ II RNA Library Prep Kit (NB and NB-corrected).Panel B, Correlation matrix between the reads of viral indicators obtained by NGS and the load of enteric viruses (RT-qPCR) and E. coli counts.

Figure 4 .
Figure 4. Heatmap showing the virome composition at species level obtained by merging the results of ScriptSeq v2 RNA-Seq Library Preparation kit (LS) and NEBNext® Ultra™ II RNA Library Prep Kit (NB).Only species with percentages higher than 1% are shown.