List of existing resources useful for natural language processing in the pharmacovigilance domain
- source: type of source, e.g., scientific paper, abstract, drug leaflet, patient forum, tweet, etc.
- lang = languages: comma-separated list of 2-letter ISO codes
- description: short characterization of the corpus
- noteworthiness: any specific feature of this dataset
- NER: are entities annotated, and for what types of entities
- linking: is entity linking provided, and to what ontologies
- REL: are relations annotated
- IE = information extraction style: between entity instances (one per pair of entity spans),
- KB = knowledge-base style: between entities (one per text and pair of [linked] entities),
- CL = text classification style: presence of a relation between entity types (one per text and pair of entity types); if only one type of relation is considered, this is a binary text classification task
- REL list: if REL is non null, list of annotated relations
- format: CONLL, BRAT, etc.
- size: number of language units such as documents, sentences, words (please no megabytes)
- publication: reference to a publication (peer-reviewed rather than preprint)
- URL: URL where the dataset can be downloaded or is described
name | source | lang | description | noteworthiness | NER | linking | REL | REL list | format | size | publication | URL | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
TLC | patient forum | de | dataset annotated with layman expressions: Fachterm, Laienbegriff, Abkürzung | layman terms, including their associated technical terms; technical term with a rather layman term | no | no | BRAT | 4000 documents | https://www.aclweb.org/anthology/2020.lrec-1.759/ | http://macss.dfki.de/data/LREC2020/TLC_v01.tar.gz |