Please use this identifier to cite or link to this item: http://hdl.handle.net/10261/269887
Title: CLARA-MeD corpus

Authors: Campillos-Llanos, Leonardo; Terroba Reinares, Ana Rosa; Zakhir Puig, Sofía; Valverde Mateos, Ana; Capllonch Carrión, Adrián
Keywords: Comparable corpus; Parallel sentences; Medical text simplification; Biomedical natural language processing
Issue Date: 19-May-2022
Publisher: DIGITAL.CSIC
Citation: Campillos-Llanos, Leonardo; Terroba Reinares, Ana Rosa; Zakhir Puig, Sofía; Valverde Mateos, Ana; Capllonch Carrión, Adrián; 2022; CLARA-MeD corpus [Dataset]; DIGITAL.CSIC; https://doi.org/10.20350/digitalCSIC/14644
Abstract: A collection of 24 298 pairs of professional and simplified texts (>96 million tokens) for automatic medical text simplification in Spanish. A parallel corpus with a subset of 3800 sentence pairs of professional and laymen variants (149 862 tokens) is released as a benchmark for medical text simplification. This dataset was collected in the CLARA-MeD project, which aims to simplify medical texts in Spanish and to reduce the language barrier to patients' informed decision-making. In particular, the project aims to develop linguistic resources for automatic medical term simplification in Spanish and to conduct experiments in automatic text simplification.
Description: A collection of 24 298 pairs of professional and simplified texts (>96 million tokens): 1) drug leaflets and summaries of product characteristics (10 211 pairs of texts, >82M words); 2) cancer-related information summaries (201 pairs of texts, >3M tokens); and 3) clinical trial announcements (5748 pairs of texts, 451 690 tokens). The dataset also contains a parallel corpus with a subset of 3800 sentence pairs of professional and laymen variants (149 862 tokens), which serves as a benchmark for medical text simplification. The source files were last downloaded in February 2022.
URI: http://hdl.handle.net/10261/269887
DOI: https://doi.org/10.20350/digitalCSIC/14644
References: Leonardo Campillos-Llanos, Ana Rosa Terroba Reinares, Sofía Zakhir Puig, Ana Valverde-Mateos, and Adrián Capllonch-Carrión (2022). "Building a comparable corpus and a benchmark for Spanish medical text simplification". Procesamiento del lenguaje natural, no. 69. http://hdl.handle.net/10261/269888
Appears in Collections: (CCHS-ILLA) Conjuntos de datos

Files in This Item:
File                  Description  Size       Format
CLARA-MeD-corpus.zip  -            196.13 MB  zip
README.txt            -            8.1 kB     Text

Page view(s): 68 (checked on Aug 18, 2022)
Download(s): 15 (checked on Aug 18, 2022)

WARNING: Items in Digital.CSIC are protected by copyright, with all rights reserved, unless otherwise indicated.