Unmasking new intra-species diversity through K-mer count analysis

Pérez Cantalapiedra, Carlos; Contreras-Moreira, Bruno; Casas Cendoya, Ana María; Igartua Arregui, Ernesto

Please use this identifier to cite or link to this item: http://hdl.handle.net/10261/162254


Title:	Unmasking new intra-species diversity through K-mer count analysis
Authors:	Pérez Cantalapiedra, Carlos; Contreras-Moreira, Bruno; Casas Cendoya, Ana María; Igartua Arregui, Ernesto
Keywords:	Copy Number Variations (CNV); Gene families; Genotyping; Barley; Presence-Absence Variation; Sequencing; Plant Genomics; NBS-LRR; K-mer Analysis; Pentatricopeptide; Pangenomics; Exome Capture
Publication date:	Mar-2018
Citation:	EUCARPIA Cereal Section / IWW2 Meetings (Polydome, Clermont-Ferrand, France, 19-22 March 2018)
Abstract:	High-throughput sequencing is often used to examine intra-species diversity. Most studies focus on calling and genotyping SNPs. Other kinds of genomic variation, such as copy-number variation (CNV), are exploited more rarely, despite literature reports linking them to phenotypic differences. For some loci, it is difficult to identify reliable SNPs. For instance, reads from closely related sequences (e.g. paralogous genes) will often map stacked onto the same location if some of those loci are absent from the reference sequence. Such piled-up mappings produce abundant spurious heterozygous SNPs, and have therefore been called apparent heterozygous mappings (AHMs). To avoid wrong conclusions from false-positive calls, SNPs from AHMs are often discarded, either early in the analysis (e.g. in samples expected to be homozygous) or in downstream steps (e.g. when incoherent haplotype blocks are identified). This leads to information loss at certain loci. AHMs can be seen as a kind of CNV specific to non-identical copies. Unmasking such variation could help to i) assess the completeness of a genome or pan-genome reference, ii) confirm results from other CNV genotyping methods when the copies originate from non-identical loci, iii) provide hints about the history and behavior of duplicating DNA loci, and iv) reveal novel intra-species genetic diversity. Here we present a software pipeline, kmeleon, available at https://github.com/eead-csic-compbio/kmeleon, designed to identify regions harboring AHMs. kmeleon is based on mappings, and thus can be used for both homozygous and heterozygous samples. First, the different k-mers (sequences of length k) mapping to a single locus are identified and counted. Then, loci are classified based on the presence or absence of AHMs. From the resulting intervals, it is straightforward to compare genotypes or to transfer existing annotation to the regions with AHMs.
We used exome capture data to detect AHMs in a set of barley accessions. We included the cultivar Morex, the genotype of the reference genome, as a control sample. As expected, it had the lowest number of AHMs, although some were still detectable. In all accessions, AHMs were found in both intergenic and intragenic loci. Enrichment analysis showed that NBS-LRR proteins were overrepresented at AHMs, whereas PPR proteins were depleted. We will also show that AHMs can be used to infer phylogenetic trees congruent with those produced by SNP-based approaches, supporting the value of this hidden variability for describing genetic relationships.
Description:	1 .pdf copy (3 Figs.) of the authors' original poster. Creative Commons License Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)
URI:	http://hdl.handle.net/10261/162254
Appears in collections:	(EEAD) Comunicaciones congresos
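
The two steps named in the abstract (counting the different k-mers that map to each locus, then classifying loci by the presence or absence of AHMs) can be illustrated with a minimal Python sketch. This is not kmeleon's code or interface. The function names, the representation of reads as ungapped (start, sequence) pairs, and the `min_support` and `max_kmers` thresholds are all assumptions made for illustration only.

```python
from collections import defaultdict


def kmer_counts_per_position(reads, k):
    """Count k-mers by the reference position where they start.

    reads: iterable of (start, sequence) pairs, where start is the 0-based
    reference position of an ungapped alignment (a simplifying assumption).
    Returns {position: {kmer: count}}.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for start, seq in reads:
        for offset in range(len(seq) - k + 1):
            counts[start + offset][seq[offset:offset + k]] += 1
    return counts


def flag_positions(counts, min_support=2, max_kmers=1):
    """Return sorted positions carrying more distinct k-mers than expected.

    A k-mer is considered only if seen at least min_support times, and a
    position is flagged if more than max_kmers such k-mers start there.
    Both thresholds are hypothetical, chosen for a homozygous sample.
    """
    flagged = []
    for pos in sorted(counts):
        supported = [kmer for kmer, n in counts[pos].items() if n >= min_support]
        if len(supported) > max_kmers:
            flagged.append(pos)
    return flagged


def merge_intervals(positions):
    """Merge sorted, consecutive positions into closed (start, end) intervals."""
    intervals = []
    for pos in positions:
        if intervals and pos == intervals[-1][1] + 1:
            intervals[-1] = (intervals[-1][0], pos)
        else:
            intervals.append((pos, pos))
    return intervals


# Toy data: two read variants stacked on the same location, differing at index 4.
reads = [(0, "ACGTACGT")] * 2 + [(0, "ACGTTCGT")] * 2
counts = kmer_counts_per_position(reads, k=3)
intervals = merge_intervals(flag_positions(counts))
print(intervals)  # prints [(2, 4)]
```

In this toy case the single mismatch affects every 3-mer that overlaps it, so three consecutive start positions are flagged and merged into one candidate interval. Intervals of this kind could then be compared between genotypes or intersected with an existing annotation, as the abstract describes.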