A validated workflow for rapid taxonomic assignment and monitoring of a national fauna of bees (Apiformes) using high throughput DNA barcoding

Improved taxonomic methods are needed to quantify declining populations of insect pollinators. This study devises a high‐throughput DNA barcoding protocol for a regional fauna (United Kingdom) of bees (Apiformes), consisting of reference library construction, a proof‐of‐concept monitoring scheme, and the deep barcoding of individuals to assess potential artefacts and organismal associations. A reference database of cytochrome oxidase c subunit 1 (cox1) sequences including 92.4% of 278 bee species known from the UK showed high congruence with morphological taxon concepts, but molecular species delimitations resulted in numerous split and (fewer) lumped entities within the Linnaean species. Double tagging permitted deep Illumina sequencing of 762 separate individuals of bees from a UK‐wide survey. Extracting the target barcode from the amplicon mix required a new protocol employing read abundance and phylogenetic position, which revealed 180 molecular entities of Apiformes identifiable to species. An additional 72 entities were ascribed to nuclear pseudogenes based on patterns of read abundance and phylogenetic relatedness to the reference set. Clustering of reads revealed a range of secondary operational taxonomic units (OTUs) in almost all samples, resulting from traces of insect species caught in the same traps, organisms associated with the insects including a known mite parasite of bees, and the common detection of human DNA, besides evidence for low‐level cross‐contamination in pan traps and laboratory procedures. Custom scripts were generated to conduct critical steps of the bioinformatics protocol. The resources built here will greatly aid DNA‐based monitoring to inform management and conservation policies for the protection of pollinators.


| INTRODUC TI ON
Widespread declines in pollinator populations are causing concern about the future of global biodiversity and agricultural productivity (Garibaldi et al., 2013;Hallmann et al., 2017;Lever, van Nes, Scheffer, & Bascompte, 2014), driven by the combined effects of habitat loss, introduction of non-native and invasive species, pathogens and parasites, and various other factors contributing to environmental change (Vanbergen et al., 2013). Landscape effects on pollination of crops through agricultural intensification, particularly the use of monoculture crops, have led to significant changes in pollinator communities (Kennedy et al., 2013;Ricketts et al., 2008), with obvious economic implications for the agricultural sector and pollination services worth hundreds of millions of pounds in the United Kingdom alone (Potts et al., 2010). However, these trends in species distribution and abundance are difficult to quantify, unless solid methodologies for monitoring at regional levels can be implemented.
Thus, there is an urgent need to develop strategies for large-scale and long-term systematic monitoring of pollinator populations, to better understand the impacts of declines on pollination services to crops and wild plants, and inform policy decisions and conservation efforts.

Current evidence of change in pollinator populations in the United
Kingdom comes primarily from records of species occurrence submitted by volunteer recorders (e.g., Biesmeijer et al., 2006;Powney et al., 2019). While these allow for the analysis of large-scale changes in species distributions, they provide no information on abundance or local population size, and are known to be temporally and spatially biased (Isaac & Pocock, 2015). Instead, pan traps have been proposed as the most effective method for systematic monitoring of bee diversity in European agricultural and grassland habitats (Westphal et al., 2008).
Species identification is usually performed by expert taxonomists, but there is a growing need for alternative methods, in particular because the great species diversity and large quantity of specimens from mass trapping make them challenging and costly to identify (Lebuhn et al., 2013).
This impediment may be overcome through the use of high throughput sequencing (HTS) techniques to identify species by their DNA "barcode" at the individual or community level. A suitable HTS-based approach for largescale DNA barcoding could assay thousands of specimens potentially generated in the course of a pollinator monitoring scheme. The great sequencing depth of such high-throughput barcoding (HT barcoding) methodology may also reveal DNA from organisms internally or externally associated with a target specimen or as a carry-over from other specimens in the trap. Similarly, the more recent approach of '"metabarcoding", by which entire trap catches are subjected to high throughput amplicon sequencing in bulk, produces community-level species incidence data based on the mixed sequence read profile (Yoccoz et al., 2012). Current HTS protocols can maintain the individual information of thousands of samples using unique tags in the initial PCR prior to pooling for Illumina sequencing with secondary tags, which allows sequences to be traced back to the associated specimen (Arribas, Andujar, Hopkins, Shepherd, & Vogler, 2016;Shokralla et al., 2015). However, crucial to these approaches is a comprehensive and validated reference set of DNA sequence data from target species to provide accurate and verifiable molecular identification.
This study lays the groundwork for the use of HTS techniques to assess diversity and abundance of mass-trapped samples of a regional-scale pollinator fauna, focused on the bees (Hymenoptera: Apiformes) of the United Kingdom. The first step in this process was the generation of a well curated reference database for the 278 species of bees known from the UK using the cytochrome oxidase c subunit I (cox1) barcode marker (Hebert, Cywinska, Ball, & DeWaard, 2003), which provides good species discrimination in Hymenoptera (Smith et al., 2008). This reference set was then used to identify HTSobtained short barcode reads from 762 bee specimens gathered as part of a pilot study for a national monitoring scheme. Morphological identifications performed in parallel provided comparisons of molecular identification against conventional methodology and allowed further refinement of the database. In addition, the deep-sequencing approach allowed the assessment of organisms associated with the target specimens, as well as cross-contamination from other species present in the traps or from specimen handling and laboratory procedures.

| Building a regional reference database
A cox1 reference database was generated from DNA barcoding of bee species known to occur in the UK according to the list of Falk and Lewington (2015) and notes from various sources maintained by coauthor DGN. Most specimens were caught by hand netting and identified by DGN, using the latest keys available at the time (Amiet, Herrmann, Müller, & Neumeyer, 2001, 2004, Amiet, Herrmann, Müller, & Neumeyer, 2010Amiet, Müller, & Neumeyer, 2014;Bogusch & Straka, 2012;Falk & Lewington, 2015;Benton, 2006;Müller, 2016). Identifications had to draw on these various references because the comprehensive key of Falk and Lewington (2015) became available only part way through the study, while some identifications were also cross-checked between different publications. Addit ional specimens were obtained using pan traps from the survey described below. The reference set included all available unique UK species as determined by morphology, with multiple specimens per species where available. These within-species replicates allowed inclusion of specimens from across the geographical range of species, identified by different taxonomists, or belonging to species complexes. Specimen data for morphological vouchers are available at the Natural History Museum Data Portal (data.nhm. ac.uk; http://dx.doi.org/10.5519/0002965).
DNA was extracted from a hind leg using a Qiagen DNeasy Blood and Tissue Kit, after incubation at 56°C in extraction buffer (ATL and Proteinase K) overnight in a shaking incubator at 75 rpm.
The complete cox1 "barcode region" (658 bp) was amplified using primers (BEEf TWYTCWACWAAYCATAAAGATATTGG and BEEr TAWACTTCWGGRTGWCCAAAAAATCA) newly designed based on an alignment of 84 mitochondrial genomes from 22 genera of Apiformia. PCR and sequencing using ABI dye terminator technology followed standard procedures (Appendix S1). Sequences were deposited in BOLD (Barcode of Life Datasystem) under the project BEEEE, along with Syrphidae barcodes that were sequenced at the same time.

| Generating a test data set from field caught samples using HTS
The reference database was used for identification of specimens obtained through the National Pollinator and Pollination Monitoring Framework (NPPMF; Carvell et al., 2016). Mixed samples were collected with pan traps consisting of sets of water-filled bowls (painted UV-yellow, white and blue; after Westphal et al., 2008) from 14 sites across the UK, and further specimens were collected by netting along standardised transects running 200 m from each set of pan traps (Figure 1a; see Carvell et al., 2016 and Appendix S1 for a full description of the sampling protocol). Bees (Apiformes) were separated from other taxa in the field, stored in 99% ethanol, and transferred to -20°C as soon as possible after collection. Specimens were identified morphologically by taxonomists offering commercial identification services. In total, 762 bee specimens were processed and individually sequenced. All specimens were stored in 99% ethanol and deposited as voucher specimens in the Molecular Collection Facility at the NHMUK.
DNA was extracted from individual specimens by piercing the abdomen and submerging the whole specimen in lysis solution consisting 200 µl ATL/Proteinase K buffer for 12 hr on a 56°C shaking incubator. DNA extractions were performed using either the Qiagen BioSprint 96 DNA Blood Kit or DNeasy Blood and Tissue kits applied to the lysate. Each DNA extract was PCR amplified for a 418 bp portion of the cox1 barcode region (Andújar et al., 2018; Appendix S1).
Amplicons for each individual were tagged using a "double dual" PCR protocol (Shokralla et al., 2015) to generate unique tag combinations for each bee specimen, following the procedures of Arribas et al. (2016). Tags were added in the initial PCR by amplification using cox1 primers with different six bp tags with a Hamming distance of F I G U R E 1 Specimens and species used in this study. (a) The number of bee species of each family, data set and geographical source from which sequences were compiled to form the reference collection. Column colours denote whether species from each data set comprised any UK specimens, and numbers above bars give totals, The BEEEE columns denote the species sequenced as part of this study (165) three, with a total of eight different tagged primer sets. In all reactions, forward and reverse primers used the same tag, so that the products of tag jumping could be detected (Schnell, Bohmann, & Gilbert, 2015 were discarded. Read quality was reviewed using FASTQC (Andrews et al., 2012). Following demultiplexing, the NAPmerge script was used to generate a set of full-length reads for further analysis. The script invokes cutadapt (Martin, 2011), PEAR (Zhang, Kobert, Flouri, & Stamatakis, 2014) and USEARCH-fastq_filter (Edgar, 2010) to remove primer sequences, assemble read pairs, and perform quality filtering respectively. Any reads not containing a correct primer sequence, and their mates, were discarded, and any merged reads with one or more expected errors were removed with fastq_filter; otherwise, wrapped software used default parameters. This process generated a pool of complete cox1 amplicon sequences for each of the specimens.

| Testing the utility of the reference data set
From the set of reads obtained for each specimen a single putative "high-throughput barcode" (HT barcode) sequence was designated to represent the cox1 gene of that specimen. Three methods were used to identify this HT barcode from the read mixture.
Firstly, we generated operational taxonomic units (OTU) clusters and centroid sequences using USEARCH cluster_otus. This was performed using a metabarcoding pipeline, implemented in the NAPcluster script, which includes standard functions from the USEARCH suite (Edgar, 2010), starting with the data output from The third method also finds the most highly represented sequence among reads in an amplicon mix, but limits the selection to reads of the expected amplicon length of 418 bp, and validates this selection statistically and taxonomically in order to avoid these variants. The process starts with the length filtering (in this case, rejecting any sequence not 418 bp), and groups sequences by identity (i.e., dereplication), recording the abundance of reads representing each unique sequence. Starting with the most abundant, each unique sequence is assessed for the significance of the difference in read abundance by bootstrapping. Given the total number of reads in the sample and the number of unique sequences, the probability of a sequence occurring as frequently as the most abundant sequence by chance alone is estimated using 10,000 bootstrap iterations (pvalue). A p-value of 0 designates a sequence as significantly more frequent with high confidence, and <0.5 for low confidence, above which the entire sample is disregarded because a most abundant barcode sequence for the target specimen is not clearly defined.
Finally the most abundant sequences revealed by this procedure are subjected to a BLASTn search against the NCBI nt database and the hits assessed for the focal taxon (in this case, Hymenoptera). If the most abundant read matches a different taxon, the sample was removed from further consideration. If a sequence fails the bootstrapping test, it is merged with the next most frequent sequence if their similarity is above a given threshold (99% was used here). If this merge occurs, the process restarts; if they are not sufficiently similar, the sequence is output with "low confidence" if it passes the BLAST test, or discarded and the process restarted if not. Sequences passing both tests are output as "high confidence". The method was implemented in a purpose-built tool, NAPselect, and the process is visualised in Figure S1.
The success of these three methods, and the accuracy of the sequences they output, was tested by identifying the HT barcodes against the BEEEE reference collection generated above using BLASTn with default parameters. Only matches with > 95% identity and overlap with the reference sequences of > 400 bp were retained, and the match with the highest similarity was selected, using bitscore to break ties. The taxon identity of this hit was compared against the morphological identifications supplied by taxonomists.
For each HT barcode selection method and taxonomic level, the number of correct molecular identifications at the genus and species levels was tallied.

| Exploration of concomitant DNA in the testing data set
The OTUs generated with the NAPcluster script (see above) allowed the exploration of co-amplified DNA from each bee specimen other than the primary cox1 sequence, including contaminants.
Specifically, the OTUs that did not match the NAPselect HT barcode sequence for the target specimen were designated as "secondary OTUs". These OTUs were searched against the NCBI nt database using BLASTn, followed by taxonomic binning using MEGAN6 Community Edition with the weighted Lowest Common Ancestor algorithm (Huson et al., 2016). Any OTUs assigned to Apiformes were additionally identified using BLASTn (default parameters) against the BEEEE reference collection and the NAPselect HT barcodes (as above As a control, the rate of contamination from all possible sources together was also scored, i.e., the proportion of secondary OTUs in a sample that matched any HT barcodes, out of the total number of unique HT barcodes. One-sample t tests were used to assess if the mean contamination rate for each source or source combination was significantly greater than zero. To compare between sources against the control, the effect of source on contamination rate was fitted in a quasibinomial ANOVA, setting the control as the reference level.

| A reference database of UK bees
A total of 355 bee specimens were newly sequenced for the COI bar- Halictidae (4 of 62).
De novo species delimitation from the DNA sequences using the GMYC method were based on phylogenetic trees generated for each genus (see Figure 2 for the genus Nomada). In most cases of incongruence, the GMYC either split (42 cases) or lumped (14 cases) an existing nominal species, but in rare cases the patterns of splitting and lumping were more complex (Figure 3)

| Testing HTS data against the reference library
Illumina sequencing of 762 specimens from the UK survey resulted in an average of 9,025 read pairs per amplicon pool after demultiplexing (SD = 10,615; range = 18-88,241; 95% >1,000). After read merging and stringent quality filtering, this was reduced to an average of 5,851 cox1 sequences per specimen (SD = 6,921; range = 7-56,781; 87% >1,000). Three methods were used to designate a HT barcode from these sequences for each specimen (see Materials and Methods). The NAPselect method, which validates barcode selection by statistical significance of read abundance and taxonomy, obtained a barcode for 749 individuals, failing to do so for 13 specimens (Table 1a). The latter mainly comprised libraries with very low read numbers, which were removed based on the taxonomic (no matches to Hymenoptera) or statistical (low discrimination among top abundant reads) filtering (see Appendix S1). Given this result, we were able to leverage the wide variation in read numbers to explore the effectivity of NAPselect at different read values per sample. Figure 4 shows that while performance is poor below 500 reads per sample, the percentage of libraries producing HT barcodes based on the taxonomically validated top read reaches 90% at around 1,000 reads.
Out of the barcodes chosen by NAPselect, 734 (99.7%) produced a match to sequences in the BEEEE reference set ( To account for other causes of the inconsistencies in morphological versus molecular identifications, all of the 731 NAPselect HT barcodes were combined with the 335 sequences of the reference set and subjected to phylogenetic analysis ( Figure S3). The resulting tree generally grouped sequences in small clusters of close relatives that mostly correspond well with the Linnaean taxon names (similar to Figure 2), but a total of 136 NAPselect sequences were at least partially in conflict with the names assigned to sequences in each cluster. We found evidence for problems with both the molecular and morphological identifications that may account for most of the observed discrepancies. Library contamination at either the PCR level (within a plate of sequences) or secondary tagging (within a well) may be recognisable from a) identical sequences or clusters grouping distantly related species, or b) from the recovery of secondary OTUs with the correct molecular identification. This former inconsistency is observed in around 30 individuals, including a no-

| Exploration of concomitant DNA in the testing data set
When OTU clustering was carried out on the entire data set of reads from all 762 samples, USEARCH within NAPcluster generated 498 OTUs, of which 263 were identified as Apiformes using BLAST/  Parasitiformes. Finally, Homo sapiens DNA was detected in numerous samples. The Appendix S1 and Figure S4 report further details of secondary OTU community composition.
The high incidence of NAPselect barcode sequences (i.e., Apiformes) occurring as secondary OTUs raised the question about the origin of these nontarget specimens in the barcoding mix.
Potential sources of DNA may be carryover from the traps, mixing of specimens during handling for taxonomic identification, errors in various DNA laboratory procedures, and errors in tag sequencing.
In general, the level of direct contamination with DNA sequences that were the HT barcode in another sample was low, but significantly greater than zero for most sources and source combinations (Table S1). Altogether, 132 of the 180 Apiformes OTUs were recognised as secondary OTUs in at least one sample. Compared with the control, i.e., the background level of cross-contamination from any source, there was a significant increase in contamination rate for within-plate contamination and within-plate and trap contamination ( Figure 6, Table S1), indicating that the greatest rate of contamination may have been during primary PCR. The level of cross-contamination was much lower from those samples in the same well, i.e., from secondary PCR at the library construction stage.
The low level of contamination was reflected in the pattern of cross-contamination of individual species. OTUs identified to 23 different species were each found as secondary OTUs in at least one other sample of a different species from the same trap. The most frequent of these was Lasioglossum malachurum, of which there were 37 specimens in the study from 21 traps. We HT barcoded 63 specimens of other species from these 21 traps, and L. malachurum was found in 13 of these, a rate of 20%. At trap level, the average rate for the 23 species was 7.6% (SD = 4.5). The same analysis for plates and wells showed that L. calceatum was the most common cross-contamination here, being found in 7% of samples of other species sharing a plate (PCR tag) with specimens of L. calceatum, and 5% of samples of other species sharing a well (MID) with L. calceatum. 45 species cross-contaminated within plates, with a mean rate of 2.2% (SD = 1.7), and 13 species cross-contaminated within wells (mean = 2.5%, SD = 1.1).

| Sample collection methodology
Cost-effective species-level identifications of bees and other insect pollinators are required to provide robust evidence for population changes and to inform land use management and conservation (Gill et al., 2016). This study used specimens of bees obtained through mass-collection with pan traps, which was successful in providing a wide range of species for the generation of a reference database and for testing. It should be mentioned that in the wider context of pollinator declines (Powney et al., 2019), and invertebrate declines in general (Hallmann et al., 2017), careful consideration of the use of broad-target collection methods with high collateral catches should be made (Drinkwater, Robinson, & Hart, 2019), although Gezon, Wyman, Ascher, Inouye, and Irwin (2015) show that in the case of pan traps in particular, reasonable sampling does not affect long-term community structure. Our study protocol used a relatively short pan trap exposure period designed to sample sufficient individuals for long-term monitoring whilst minimising catch sizes (Carvell et al., 2016). In this study, we demonstrate that bulk-collection methods may generate unwanted levels of cross-contamination for downstream molecular analysis, although robust bioinformatic methods can minimise the impact. More broadly, the growing use of F I G U R E 6 Box and whisker plot showing the mean and 95% confidence range of recovery rates possible cross-contamination OTUs from different sources of cross contamination. X axis shows the different sources of cross contamination, and y axis shows the proportion of possible of possible cross-contamination OTUs recovered from that source. The rate of shared OTU recovery is significantly higher when considering samples from the same plate and same plate and trap compared with a background rate of cross contamination (all possible) Cross−contamination rate metabarcoding as a tool for arthropod community studies allows us to take fuller advantage of the depth of data produced by mass-collecting than ever before, including the "bycatch" of numerous other insect pollinators, mostly in the Diptera, which are taxonomically difficult and thus have not been part of conventional monitoring.

| The reference database
We conducted this analysis in two stages, by first building the reference database using Sanger technology, which was then trialled for species identification using high-throughput sequencing of samples from a proof-of-concept monitoring scheme. The almost all bee species in the UK this set is sufficiently discriminatory. In the remaining cases the molecular analysis lumped the Linnaean species, as evident in the de novo species delimitation using the GMYC method, while an even greater proportion were shown to be split into additional GMYC groups which, however, were not incongruent with the Linnaean species.
The overall reference database comprises a mixture of UK and non-UK sequences, as many species are more widely distributed in Europe and North America. We found generally high congruence of molecular groups with the Linnaean species, which shows that the mitochondrial "gene trees" are a good reflection of the species-level entities, as both morphological diagnostics and mitochondrial markers corroborate the species hypotheses (DeSalle, Egan, & Siddall, 2005). Species discrimination may be even clearer if performed with UK samples only, as the specieslevel differences tend to be exacerbated in local subsets (Bergsten et al., 2012). The UK sample contributed many new haplotypes that may add to the power of species discrimination locally. The high congruence with the BOLD database also suggests that the identifications have been correct, in some cases after secondary inspection of specimens. However, some problems remained with the reference database, which was apparent from the 136 HTS sequences with inconsistent morphological and molecular identifications. We attributed most of this failure to either contamination or misidentification of the specimens by the NPPMF taxonomists, rather than an issue with the reference set or with the NAPselect pipeline. While low rates of cross-contamination are certainly observed across the HTS data set, there is little evidence that this was substantially higher in the mismatched specimens, observing the correct sequences as secondary reads in only a small minority of cases and many of these at a read abundance no higher than the background rate. It appears most likely that most mismatches are due to incorrect or unclear morphological identifications, based on our small-scale re-identification. In part, these problems affected the known cases of taxonomically problematic groups, in particular in Lasioglossum, Nomada and Bombus, to which this study added a large number of sequences that may be useful for a refinement of the database. We note that in none of these cases did the HT barcodes add new sequence clusters beyond those already represented in the reference set. It is clear that the 86% rate of correct molecular species identifications for NAPselect (Table 1b) is an underestimate, and that most of the 136 "mismatched" HT barcodes can in fact be correctly identified through comparison to the reference set, as shown by our reassessment of a subset of these specimens. After removal of contaminants and correction for taxonomic misidentifications, the true rate of correct identification against the reference set is closer to 97%.
The reference includes DNA clusters (established by the BIN or GMYC methods) that lumped or (mostly) split the Linnaean taxa. The molecular data failed to separate a small number of species in four of the 27 genera studied ("lumped" in Figure 3). In some instances, such as the Colletes succinctus species group, which shared haplotypes with C. hederae, morphological identification of three named species is reliable, if challenging, now that there is a key covering all UK species (Falk & Lewington, 2015), and there are biological and distributional differences while cox1 sequences are not sufficient to delineate these species (Kuhlmann et al., 2007. Similarly, the separation of the Nomada goodeniana-succincta group relies on subtle colour variants (Falk & Lewington, 2015) and they were not separable in our analysis ( Figure 2). Additional genetic markers may be useful; e.g., the three recognised Colletes species lumped in cox1 exhibit fixed differences in EF-1a and ITS (Kuhlmann et al., 2007). Vice versa, divergent cox1 entities (splitting) may indicate the existence of hitherto unrecognised species. For example, a divergent haplotype in Dasypoda hirtipes has now been associated with a morphologically differentiated, continentally European species, although it is not part of the UK fauna (Schmidt et al., 2015). We have already curated the cox1 database extensively, in particular to remove morphological identification errors, and the remaining problems affect mostly a few species of Lasioglossum that also accounted for most of the inconsistencies of molecular and morphological identification (Appendix S1 text and Figure S3). In addition, the newly detected clusters may lead to the discovery of separate entities within the Linnaean species and may provide fertile ground for future morphological work. Since DNA extraction destroyed only one leg, morphological vouchers can be re-examined to refine the reference database.

| Generating high-throughput barcodes
High-throughput barcoding ("HT barcoding") was then used to identify species from a survey of pan traps. The methodology has great potential for sequencing mixed samples (metabarcoding) but was here applied on individual specimens to test the efficacy of this approach and our ability to confidently recover a sequence for the target specimen. We employed three methods for designating this sequence from a pool of anonymous amplicons. The most intuitive approach was to undertake a standard metabarcoding analysis using the USEARCH pipeline to designate the centroid sequence of the most highly represented OTU in each sample as the HT barcode.
However, the sequence obtained with this method did not produce a BLAST hit to the reference database in 27% of cases. An alternative method was to simply select the most frequent unique sequence in the amplicon pool, analogous to the sequence that would be generated by Sanger sequencing. However, while this method also designates a barcode for every sample, these sequences are only marginally more likely to find a match to the reference database (23% did not produce a BLAST hit).
The third method, implemented in the NAPselect script, also selects the top-abundant read, but requires that this read matches a specific taxonomic group (Hymenoptera), and that the read frequency is significantly greater than frequencies of other reads, besides the requirement for exact matching the predicted sequence length. If these conditions are not met, NAPselect discards the top read and checks other reads according to descending abundance.
This pipeline did not output a sequence for 13 specimens due to low read numbers or low differentiation among other abundant reads, although NAPselect generally worked very well at reasonable read numbers. The great majority of NAPselect sequences matched the reference database, and only 3.7% of specimens did not produce a sequence with a BLAST hit -a substantial improvement over the other methods (Table 1). The key improvement introduced by this script probably was that NAPselect conducts BLAST searches against GenBank and assesses the taxonomy of the hits.
This method is clearly very effective, with error rates determined largely by sequencing depth issues rather than an inability to select the correct sequence.

| Exploration of concomitant DNA in the testing data set
Unlike standard metabarcoding conducted on mixed samples, the current analysis permits a precise determination of amplicons derived from single specimens. A surprising finding was the high proportion of reads attributable to secondary OTUs, and their taxonomic diversity. Specimens from the monitoring program were not substantially different from those used in Sanger sequencing to build the reference database, which produced clean base calls consistent with a single predominant PCR product. However, the primers for Illumina sequencing were designed for broad amplification of arthropods (Arribas et al., 2016) and probably have a wider taxonomic amplitude than the Hymenoptera-specific primers used to amplify the standard barcode region. Besides coamplification of a broader range of associated species, this may also increase the potential for sequencing of nuclear mitochondrial DNA regions (Numts). Out of 509 OTUs recovered from all samples combined, 263 were identified as Apiformes initially, which greatly exceeds the number of species expected in this survey. Numts diverge without the constraints of coding regions and thus may deviate in length, but length filtering for the expected 418 bp fragment could not avoid these artefacts sufficiently. We therefore implemented a further filter based on the distribution of low-abundance OTUs that are codistributed with the true mitochondrial copies. We only removed OTUs that form a clade with the presumed true copy (close matches to the reference database), under the assumption that Numts are of limited evolutionary persistence (Pons & Vogler, 2005). Based on these criteria a total of 72 OTUs were identified as mitochondrial Numts. This method (and the removal of several other OTUs whose incorrect assignment was revealed with the phylogeny) reduced the total number of Apiformes OTU to 180, which is closer to the 154 species identified morphologically, in particular if OTU splitting (Figure 3) is taken into account. The procedure for identifying OTUs likely derived from Numts is a novel step in the metabarcoding filtering process; however it is dependent on the availability of "true" cox1 reference haplotypes and high variation in read abundance between target cox1 OTUs and their putative Numt(s): both situations that are common in HT barcoding studies but potentially less so in metabarcoding. Here, it proved to be a critical step preventing the overestimate of species richness frequently seen in metabarcoding studies.
Other secondary OTUs were assignable to a wide range of dis- In addition, widely observed "unknown" OTUs to which MEGAN could not confidently assign an identity may be members of taxa that were poorly represented in GenBank, or they may be chimeras or sequencing errors that escaped filtering. Yet, most secondary OTUs are plausible as true associates of the target specimens and the wider pollinator community. Thus, associated DNA can be used to detect local community composition and ecological associations, including parasites, symbionts and diet of the target (Lucas et al., 2018).
Cross contamination in the traps may also explain the large number of secondary OTUs assigned to Apiformes (beyond the Numts). The potential for DNA mixing was further increased as specimens from the same pan trap were stored together prior to morphological identification and DNA extraction. However, we find that the greatest rate of contamination may have been within a single plate, i.e., between samples with the same primer index but different library indices, which could be either due to physical mixing in the laboratory, tag-jumping (in the Nextera XT indices, not the PCR tags), or errors in index sequencing. Trap-level contamination may add to the problem, as the combined model (plate x trap) shows only marginally higher levels of contamination (Table S1). Because the contamination within the wells was much lower, we conclude that the primary PCR using 13 different primer tags before being combined in a single Nextera XT library was not greatly affected by these problems, indicating that our approach of using the same unique primer tags on forward and reverse strands can largely eliminate the problem of misassignment of PCR fragments. In addition, some types of contamination were less likely to be introduced during molecular lab processing, given the precautions with specimen handling and the strict protocols of the sequencing facility, in particular regarding the widely found human DNA, present in virtually every one of the specimens. As scientists using morphological and molecular methods work together, greater awareness of these issues is needed and the steps to avoid DNA contamination should be understood and implemented, such as the use of clean pans, bee nets and storage bottles, and use of latex gloves for specimen handling during morphological identification.

| CON CLUS IONS
High-throughput sequencing can greatly change the approach to monitoring of pollinators, through mass identification of sequence reads against reference databases verified by taxonomic specialists (Ji et al., 2013;Tang et al., 2015). We first established the power of the cox1 marker for species discrimination, which only left about 5% of UK species without an unequivocal identification at species level. The subsequent utilization of the database for UK bees monitoring shows high consistency with morphological identifications conducted in parallel, and accounting for the observed morphological identification errors the correct molecular identification rates exceeds 97%. However, the deep sequencing of single specimens also revealed the various pitfalls of metabarcoding. We detected surprisingly high levels of apparent mixing with other specimens from the same and other traps. In addition, we found numerous OTUs apparently contributed by Numts, which greatly inflate estimates of the total species diversity; they can be filtered out efficiently as their distribution "trails" the actual mitochondrial copies, which should be a routine part of the read filtering procedure. Lastly, the widely used OTU clustering may not produce the most accurate species detection, as shown by a comparison of OTU analyses against the most abundant read in each sample (after adequate taxonomic and numerical filtering), which revealed a full identification of the target specimen in approximately 25% more samples. Yet, if applied under stringent quality filtering, it is possible to use HTS data at the read level, i.e., to establish genotypic variation or for assignment to particular subgroups within the Linnaean species, and thus use them in the same way as data from Sanger sequencing, but scaled up by orders of magnitude. The method thus greatly increases the accuracy and speed of taxonomic identification in pollinator monitoring, at reduced cost, while also providing further information on species interactions and ecosystem composition through the secondary OTUs. The bioinformatics methodology and comprehensive barcode database can now be rolled out for the study of much larger number of specimens typically obtained by passive pan traps and can be extended to studies of pollinators in other parts of the world.

DATA AVA I L A B I L I T Y S TAT E M E N T
Sequence data available at BOLD under the BEEEE label. Perl scripts used for the sequence clustering and barcode selection are available at https ://github.com/tjcre edy/NAPtime. Additional methods are detailed in the Appendix S1.