Statistical Analysis and Tokenization of Epitopes to Construct Artificial Neoepitope Libraries

López-Martínez, Elena; Manteca, Aitor; Ferruz, Noelia; Cortajarena, Aitziber L.

Please use this identifier to cite or link to this item: http://hdl.handle.net/10261/351661

Title:	Statistical Analysis and Tokenization of Epitopes to Construct Artificial Neoepitope Libraries
Authors:	López-Martínez, Elena; Manteca, Aitor; Ferruz, Noelia; Cortajarena, Aitziber L.
Keywords:	Epitope analysis; Library design; Tokenization; Natural language processing; Byte pair encoding
Publication date:	2023
Publisher:	American Chemical Society
Citation:	ACS Synthetic Biology 12(10): 2812-2818 (2023)
Abstract:	Epitopes are specific regions on an antigen’s surface that the immune system recognizes. They are usually protein regions on foreign immune-stimulating entities such as viruses and bacteria, although in some cases endogenous proteins may act as antigens. Identifying epitopes is crucial for accelerating the development of vaccines and immunotherapies. However, mapping epitopes in pathogen proteomes is challenging with conventional methods. Screening artificial neoepitope libraries against antibodies can overcome this issue. Here, we applied conventional sequence analysis and methods inspired by natural language processing to reveal specific sequence patterns in the linear epitopes deposited in the Immune Epitope Database (www.iedb.org) that can serve as building blocks for the design of universal epitope libraries. Our results reveal that amino acid frequency in annotated linear epitopes differs from that in the human proteome. Aromatic residues are overrepresented, while cysteine is almost absent from epitopes. Byte pair encoding tokenization shows high frequencies of tryptophan in tokens of 5, 6, and 7 amino acids, corroborating the findings of the conventional sequence analysis. These results can be applied to reduce the diversity of linear epitope libraries by orders of magnitude.
Publisher's version:	https://doi.org/10.1021/acssynbio.3c00201
URI:	http://hdl.handle.net/10261/351661
DOI:	10.1021/acssynbio.3c00201
E-ISSN:	2161-5063
Appears in collections:	(IBMB) Artículos
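
The abstract names two analyses: comparing amino acid frequencies in epitopes against a background proteome, and byte pair encoding (BPE) tokenization of epitope sequences. The following is a minimal Python sketch of both ideas, not the authors' pipeline. The function names (`residue_frequencies`, `enrichment`, `merge_pair`, `learn_bpe`) and the toy peptide strings are hypothetical and chosen only for illustration; no IEDB or human proteome data are used.

```python
from collections import Counter


def residue_frequencies(sequences):
    """Relative frequency of each amino acid letter across all sequences."""
    counts = Counter("".join(sequences))
    total = sum(counts.values())
    return {aa: n / total for aa, n in counts.items()}


def enrichment(epitopes, background):
    """Epitope frequency divided by background frequency, per residue.

    Values above 1 mean the residue is overrepresented in the epitope set;
    a residue absent from the epitope set gets 0.0.
    """
    fe = residue_frequencies(epitopes)
    fb = residue_frequencies(background)
    return {aa: fe.get(aa, 0.0) / fb[aa] for aa in fb}


def merge_pair(tokens, left, right):
    """Replace each adjacent (left, right) pair in tokens with one merged token."""
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
            merged.append(left + right)
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def learn_bpe(sequences, num_merges):
    """Learn up to num_merges BPE merges, starting from single residues.

    Stops early when no adjacent pair occurs at least twice.
    Returns the ordered merge list and the tokenized sequences.
    """
    corpus = [list(seq) for seq in sequences]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for tokens in corpus:
            pairs.update(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (left, right), count = pairs.most_common(1)[0]
        if count < 2:
            break
        merges.append((left, right))
        corpus = [merge_pair(tokens, left, right) for tokens in corpus]
    return merges, corpus


if __name__ == "__main__":
    # Toy peptides, invented for illustration only.
    toy_epitopes = ["WYWYA", "GWYWY", "WYKLW"]
    toy_background = ["ACGKLWYA", "GCKLAAGK"]
    print(enrichment(toy_epitopes, toy_background))
    merges, tokenized = learn_bpe(toy_epitopes, num_merges=10)
    print(merges)
    print(tokenized)
```

On real data, the tokens learned by `learn_bpe` could be grouped by length to inspect which residues dominate tokens of a given size, which is the kind of observation the abstract reports for tryptophan in tokens of 5 to 7 amino acids. The stopping threshold of two occurrences is an assumption of this sketch.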

Files in this item:

File	Description	Size	Format
StatisticalAnalysisand-Tokenization_López_Art_2023.pdf		2.88 MB	Adobe PDF

SCOPUS citations: 1 (checked on 24 Apr 2024)
Page views: 7 (checked on 27 Apr 2024)
Downloads: 1 (checked on 27 Apr 2024)
