CAS: corpus of clinical cases in French

Background Textual corpora are extremely important for various NLP applications as they provide information necessary for creating, setting and testing those applications and the corresponding tools. They are also crucial for designing reliable methods and reproducible results. Yet, in some areas, such as the medical area, due to confidentiality or to ethical reasons, it is complicated or even impossible to access representative textual data. We propose the CAS corpus built with clinical cases, such as they are reported in the published scientific literature in French. Results Currently, the corpus contains 4,900 clinical cases in French, totaling nearly 1.7M word occurrences. Some clinical cases are associated with discussions. A subset of the whole set of cases is enriched with morpho-syntactic (PoS-tagging, lemmatization) and semantic (the UMLS concepts, negation, uncertainty) annotations. The corpus is being continuously enriched with new clinical cases and annotations. The CAS corpus has been compared with similar clinical narratives. When computed on tokenized and lowercase words, the Jaccard index indicates that the similarity between clinical cases and narratives reaches up to 0.9727. Conclusion We assume that the CAS corpus can be effectively exploited for the development and testing of NLP tools and methods. Besides, the corpus will be used in NLP challenges and distributed to the research community.


Background
Textual corpora are central for various NLP applications as they provide information necessary for creating, setting, testing and validating these applications, the corresponding tools, and the results. Yet, in some areas, due to confidentiality or to ethical reasons, it is complicated or even impossible to access representative textual data typically created and used by the actors of these areas. For instance, medical and legal areas are concerned with these issues: in the legal area, information on lawsuits and trials remains confidential, while in the medical area, medical confidentiality must be respected by the medical staff. In both situations, personal data cannot be made publicly available, which prevents corpora from *Correspondence: natalia.grabar@univ-lille.fr † Natalia Grabar, Clé Dalloux and Vincent Claveau contributed equally to this work. 1 CNRS, UMR 8163, F-59000 Lille, France 2 Univ. Lille, UMR 8163 -STL -Savoirs Textes Langage, F-59000 Lille, France Full list of author information is available at the end of the article being released and makes experiments non-reproducible by other researchers and with other methods. To face such situations, Natural Language Processing (NLP) proposes specific methods and tools. Hence, for several years now, anonymization and de-identification methods and tools have been made available and provide competitive and reliable results [1][2][3][4] reaching up to 90% precision and recall. But it may still be difficult to access de-identified documents and use them for research. One reason is that there is a risk of re-identification of people, and more particularly of patients [5,6] because medical histories can be unique. In consequence, the application of deidentification tools on personal data often does not permit to make the data freely available and usable within the research context. Yet, there is a real need for the development of methods and tools for several applications suited for such restricted areas. For instance, in the medical area, it is important to design suitable tools for information retrieval and extraction, for recruiting patients for clinical trials, for performing several other important tasks such as indexing, study of temporality, negation, etc. [7][8][9][10][11][12][13]. Another important issue is related to the reliability of tools and to the reproducibility of study results across similar data from different sources. The scientific research and clinical communities are indeed increasingly coming under criticism for the lack of reproducibility in the biomedical area [14][15][16], but notice that, for instance, psychology is concerned with this issue as well [17][18][19]. The first step towards the reproducibility of results is the availability of freely usable tools and corpora. In the current contribution, we are mainly concerned with the construction of freely available corpora for the medical domain. Yet, we are aware that sharing tools and methods is also important. We assume that availability of corpora may boost the design and dissemination of other resources, methods and tools for biomedical tasks and applications.
The purpose of our work is to introduce the CAS corpus, that contains clinical cases in French such as those published in scientific literature or used in the education and training of medical students. In what follows, we first present some existing studies on medical corpora creation ("Existing work: freely available clinical corpora"), highlighting corpora which are freely available for research. We then present the methods used for building, annotation and analysis of the CAS corpus with clinical cases in French ("Methods"). The results are presented in "Results" and discussed in "Discussion". We conclude with some directions for future work ("Conclusion" sections). The work presented in this article is an extended and updated version of our previous publication [20].

Existing work: freely available clinical corpora
Within the medical area, we can distinguish two main types of medical corpora: scientific and clinical.
• Scientific corpora are issued from scientific publications and reporting. Such corpora are becoming increasingly available to researchers thanks to recent and less recent initiatives dedicated to open publication, such as those promoted by the NLM (National Library of Medicine) through the PUBMED portal 1 and specifically dedicated to the biomedical area, and by the HAL 2 and ISTEX 3 initiatives, which provide generic portals for accessing scientific publications from various areas, including medicine. Such corpora contain scientific publications that describe research studies: motivation, methods, results and issues on precise research questions. Other portals may also provide access to scientific literature aimed at specific purposes, namely indexing reliable literature, such as proposed by HON [21], CISMEF [22], and other similar initiatives [23]. Some existing scientific corpora also provide annotations and categorizations, such as PoS-tagging [24] and negation [25]. These are often built for the purposes of shared tasks [26,27]. • Clinical corpora are related to hospital and clinical events of patients. Such corpora typically contain documents that describe medical history of patients and the medical care they are undergoing. This kind of corpora is typically created and used in clinical context as part of the healthcare process. Even after de-identification, it is complicated to obtain free access to this kind of medical data and, for this reason, there are very few clinical corpora freely available for research.
In our work, we are mainly interested in clinical corpora: the proposed literature review of the existing work is aimed at clinical corpora that are freely available for research. We present here the main existing clinical corpora: • MIMIC (Medical Information Mart for Intensive Care), now available in its third version, provides the largest available set of structured and unstructured clinical data in English. MIMIC III is a single-center database comprising information pertaining to patients admitted in critical care units at a large tertiary care hospital. Those data include vital signs, medications, laboratory measurements, observations and notes charted by care providers, fluid balance, procedure codes, diagnostic codes, imaging reports, hospital length of stay, survival data, and more. The database supports applications including academic and industrial research, quality improvement initiatives, and higher education coursework [28]. Those data are widely used by researchers, for instance for predicting mortality [29,30], for diagnosis identification and encoding [31,32], for studies on temporality [33] or for identifying similar clinical notes [34], to cite just a few existing studies. Data from these corpora are also used in challenges, such as i2b2, n2c2 and CLEF-eHEALTH. • i2b2 (Informatics for Integrating Biology and the Bedside) 4 is an NIH-funded initiative promoting the development and test of NLP tools for English-language documents with the purpose of healthcare improvement. In order to enhance the ability of NLP tools to process fine-grained information from clinical records, i2b2 challenges provide sets of fully de-identified clinical notes enriched with specific annotations [9,11,35], such as: de-identification, smoking status, medication-related information, semantic relations between entities, or temporality. The clinical corpora and their annotations built for the i2b2 NLP challenges are available now for general research purposes. Finally, medical data, close to those handled in the clinical context, can be found in clinical trials protocols. One example is the corpus of clinical trials annotated with information on numerical values in English [36], and on negation in French and Brazilian Portuguese [37,38].

Methods
We first describe the specificity of the sources and clinical cases from which the CAS corpus was created ("Building the corpus"), then the annotation rationale ("Annotation of the corpus"), and the principles of its comparison with similar clinical narratives from Rennes University Hospital ("Comparison with clinical narratives" sections).

Building the corpus
The CAS corpus in French contains clinical cases as published in scientific literature, legal or training material. Hence, it is built using material freely available in online sources. The collected clinical cases are published in different journals and websites from Frenchspeaking countries in various continents. Those clinical cases are related to various medical specialties (e.g. cardiology, urology, oncology, obstetrics, pulmonology, gastroenterology...). The purpose of clinical cases is to describe clinical situations for real de-identified or fake patients. Common clinical cases are typically part of education programs used for training medical students, while rare cases are usually shared through scientific publications to illustrate less common clinical situations. As for clinical cases which can be found in legal sources, they usually report on situations which became complicated due to various reasons emanating from different healthcare levels: medical doctor, healthcare team, institution, health system and their interactions.
Similarly to clinical documents, the content of clinical cases depends on the clinical situations that are illustrated, and on the disorders, but also on the purpose of the presented cases: description of diagnoses, treatments or procedures, evolution, family history, adverse-drug reactions, expected audience, etc.
Data in published clinical cases are de-identified by the authors prior to their publication. Besides, publication is usually done with the written permission of patients. The case reports can be related to any medical situation (diagnosis, treatment, procedure, follow-up...), to any specialty and to any disorder. The typical structure of scientific publications with clinical cases starts by introducing the clinical situation, then one or more clinical cases are presented to support the situation. Schemes, imaging, examination results, patient history, lab results, clinical evolution, treatment, etc. can also be provided for the illustration of clinical cases. Finally, those clinical cases are discussed. Hence, such cases may present an extensive description of medical problems. Such publications gather medical information related to clinical discourse (clinical cases) and to scientific discourse (introduction and discussion). The related scientific references are also provided. Figure 1 shows one example of clinical case published in English in Archive of Clinical Cases 9 . We can see that this case first describes the patient involved and the reason of the consultation (complaints of the patient). Then it indicates the examination results and the history of the patient. After the diagnosis is done, the patient undergoes medical procedures. Finally, the issue of the intervention is indicated. As one can see, the clinical part of publications on clinical cases may be very similar to real clinical documents. Nevertheless, misspellings, which are quite frequent in clinical documents, may be less frequent in publications with clinical cases.

Annotation of the corpus
The purpose of the annotations is to enrich the corpus with morpho-syntactic and semantic information. Annotation at the morpho-syntactic level is performed with a tool developed in-house and freely available as an online web-service at https://allgo.inria.fr/app/tagex. Indeed, we take advantage of the clinical cases corpus to develop a PoS-tagger specific to the biomedical and clinical French texts. The purpose of this PoS-tagger is to improve how biomedical terminology is handled and to take into account idiomatic expressions with specific syntactic roles, such as nominal (point de vue (point of view)), adverbial (tout à fait (fully, completely), de temps en temps (sometimes)) or prepositional (en faveur de (in favor of), en l'absence de (in absence of) groups. PoS-tagging is done by training a CRF model [39] and adopting an Active Learning framework with a previously proposed strategy [40] to select new examples. The process starts with a small amount of manually annotated data which is used to train a first CRF model; this set is then used to annotate new data, which are partly manually corrected and used to train a second CRF model, etc. In this way, it provides a large set of annotated sentences with minimal human supervision. Lemmatization is done by learning rewriting rules relying on the final substring of the word, the size of the substring, and the PoS tag of the word. The application of the rules depends on the PoS tag and the size of the substring. For instance, for a word with a given PoS tag, if there exists a rule matching its 7 last letters, that rule is applied. If no such rule exists, a rule matching the 6 last letters is searched, etc.
In addition, several layers of semantic annotation, shown in Table 1, are performed automatically.
• Concept Unique Identifiers (CUI). The CUI tagging corresponds to the French terms from the UMLS [41]. This tagging is done for single and multi-word will be preferred to vitamine B12 (C0042845); • Negation. Negation indicates whether a given disorder, procedure or treatment is present or not in the medical history and care of a given patient. Therefore, detecting negation in biomedical texts has become one of the unavoidable pre-requisites in many information extraction tasks. As presented in [38], 200 clinical cases (87,487 word occurrences) from the CAS corpus were manually annotated by two annotators. In this subset of the CAS corpus, out of 3,811 sentences, 804 sentences contain at least one instance of negation, which corresponds to 21% of negated sentences. The inter-annotator agreement [42] is 0.90 for negation cues and 0.81 for negation scope. Besides, additional 6,601 sentences from another corpus (corpus ESSAI built with clinical trial protocols) were annotated as well to find 1,025 more negative sentences. Those manually annotated data were then exploited for training several supervised learning models. We first train a CRF model for negation cue detection, and secondly, we train a bidirectional long short-term memory neural network with a CRF prediction layer (BiLSTM-CRF) for negation scope detection. The results presented in [38] go up to 0.97 F-measure for cue detection and 0.91 for scope detection on sentences from the CAS corpus; • Uncertainty. Uncertainty is also an integral part of medical discourse and must be taken into account for a more precise computing of the status of disorders, procedures and treatments. Uncertainty cues correspond to simple and complex lexical cues like probablement, certainement (probably, certainly) and morphological cues like conditional verbs indiquerait, proviendrait (should indicate, may be caused by). Similarly to negation, 200 clinical cases have been annotated by one human annotator for marking up the uncertainty cues and their scope. Overall, out of 3,811 sentences, 226 sentences contain uncertainty, which corresponds to 6% of uncertain sentences. An additional 6,601 sentences from the ESSAI corpus have also been annotated to find 631 uncertain sentences. Similarly to negation, these annotated data have been used to train several supervised learning models for the detection of uncertainty cues and their scope. Hence, we obtain up to 91.30 F-measure for cue detection and 86.73 for scope detection in sentences from the CAS corpus [38].
Since there may be several markers of negation and uncertainty in a sentence, they are numbered with their scopes accordingly: the scope of each detected cue is processed independently by the model.

Comparison with clinical narratives
The purpose of the comparison between clinical cases and clinical narratives is to assess the degree of similarity Clinical cases are related to specific clinical situations and describe precise situation (diagnosis, surgery, chemotherapy...) for one patient mainly, but sometimes for more than one patient, and the clinical issue for that patient (improvement, stable, death...). The clinical narratives at our disposal come from Rennes University Hospital. Each narrative is related to a patient's hospitalization and can contain multiple notes depending on the length of stay and the medical specialties involved. We should keep this specificity in mind, because it may cause some statistical differences between the two corpora. Before being compared, the documents are split into sentences and words, the words are converted into lowercase and numerical sequences are removed. Thus, we expect to better capture the lexical similarity and variety of the two corpora.

Results
In this section, we first describe the content of the CAS corpus ("Content of the corpus"), then the annotation output and statistics ("Annotation of the corpus" sections).

Content of the corpus
Currently, the corpus contains 4,900 clinical cases in French, totaling nearly 1.7M word occurrences. This gives 350 word occurrences per clinical case on average. This corpus is continuously updated, as we are periodically adding new, non-annotated clinical cases. In the next section, we present some of the annotations performed on the version of this corpus containing 4,268 clinical cases (over 1.5M word occurrences).

Annotation of the corpus
The corpus contains several layers of morpho-syntactic and semantic annotations. The tags follow the IOB (Inside-Outside-Begin) format.
In Table 1, the second and third columns give the PoStagging and lemmatization for the sentence L'adolescent paraît triste et ne parle pas (The teenager seems to be sad and does not speak). As the Table shows, we have chosen to use explicit PoS-tags (determiner, common_noun, adjective, etc.). When a given tag corresponds to one word, the tag is prefixed with B for Beginning position. In the next columns, the words receive the CUI, negation and uncertainty annotations, when relevant, as well as the scope of negation and uncertainty. Table 3 summarizes the statistics of the corpus. We indicate the size of the corpus (number of clinical cases and words). When the clinical cases are issued from scientific publications, they are accompanied by their discussions. 2,626 such discussions have been collected together with the 4,268 cases. They contain over 1.8M word occurrences. We also indicate the average number of units automatically recognized in clinical cases for each category (CUIs, and uncertainty and negation cues).
The PoS-tagger has been evaluated on a 3,000 token excerpt from the CAS corpus and compared to Tree-Tagger [43], a commonly used PoS-tagger. Three human annotators independently annotated the tagger output in terms of errors of sentence segmenting and tokenization, errors of PoS and errors of lemmas. The annotation errors were then discussed to reach a consensus. Table 4 presents the results. We can see that our tagger seems to be more suitable for the processing of medical texts as it provides results with fewer errors.  In order to evaluate the automatic annotation of CUIs, we manually developed the reference annotation on a subset of the corpus concerning the anatomy concepts. For this, two human annotators independently annotated a 30,000-token excerpt of the corpus. Annotations were then discussed to reach a consensual annotation. This reference annotation was then used to evaluate our automatic CUI annotation, considering of course only the CUIs related to the ANATOMY semantic type of the UMLS. Table 5 sums up the performance obtained in terms of precision and recall. Note that most of the anatomy missed by our tool are in fact due to the absence of the corresponding concept in the UMLS (like cavités pyélo-calicielles (pyelocaliceal cavities), parenchyme (parenchyma), avant-pieds (metatarsal bones), or canal carpien (carpal tunnel)).

Discussion
In this Section, we first discuss the comparison of the CAS corpus with similar clinical corpus (clinical narratives from Rennes University Hospital ("Comparison of the CAS corpus with similar clinical narratives")) and then discuss the utility of this kind of corpus ("Utility of the corpus" sections). Table 6 provides with a global comparison between clinical cases and clinical narratives. For the same number of documents and patients (4,268 and 951), it indicates the number of sentences, the number of words (occurrences and types), the average number of sentences per document, and the average number of words per sentence. One can see that narratives are longer than clinical cases in terms of sentences, and word occurrences and types. As we explained, this is mainly due to the fact that each medical specialty involved produces at least one note. Depending on the length of stay, many notes may be added to the patient's record. Hence, we observe a greater variance of vocabulary and a higher number of sentences in  clinical narratives. Another difference, which may be due to the source of the corpora, is that the average number of words per sentence is higher in clinical cases. There are three reasons for this:

Comparison of the CAS corpus with similar clinical narratives
• clinical narratives may indeed contain table-like presentation of examinations, prescriptions and lab results, while in clinical cases this information is usually presented as part of sentences; • because clinical cases are part of scientific articles, the sentences they contain are usually well formed and provide exhaustive information about a given issue on patients; • clinical narratives contain many administrative information (address, dates, department contacts, politeness expressions, etc.) which are not present in clinical cases.
In Table 7, we indicate the textual similarity between the two corpora compared. Here again, the values are provided for the same number of documents and patients (4,268 and 951). The similarity is computed with the Jaccard index [44]. This index is a statistic measure used for evaluating the similarity and diversity of sample sets. When applied to textual data, it evaluates the textual and lexical similarity of the corpora compared. The index is defined as the size of the intersection divided by the size of the union of the corpora: J A,B = A∩B A+B−(A∩B) , where A and B are the two corpora compared. The values of the Jaccard index are between 0 (no intersection) and 1 (same content). Hence, the closer the value to 1 the higher the similarity of the two corpora. As can be seen in Table 7, the similarity values are high (0.6943 at lowest) and they We also computed the out of vocabulary (OOV) words in clinical cases. The purpose is to discover what the specificity of the vocabulary is, whether there are misspellings and what the types of those misspellings are. The comparison is done with Lexique380, 10 a lexicon created by psycholinguists. It contains over 135,000 entries, among which inflections for nearly 35,000 lemmas of nouns, adjectives and verbs. This lexicon can be considered as reference lexicon representing the average language performance in French. Among the OOV words (i.e. out of Lexique380), we can find technical medical terms, sequences with segmentation problems (léquipe (the team), dopplerhépatique (hepatic doppler) or maines from humaines (human)), missing or excessive diacritics (eruption (eruption), realisés (done), brulure (burn), révèlé (shown), acidóse (acidosis), téte (head)), and other misspellings (rhytme (rythm), diabèt (diabetes), cliniquemen (clinically), agricultureil (agricultural), infarctuse (infarction), éxtrimité (extremety)). The misspellings we find in the clinical cases corpus are comparable with the existing typologies [45,46] and fall into the proposed types: insertion, omission, substitution, transposition, multiple/mixed. Yet, a more exhaustive analysis 10 http://www.lexique.org/ and comparison of misspellings and their prevalence in French clinical narratives and clinical cases has yet to be done.

Utility of the corpus
We assume that this kind of corpus is useful for the development and test of automatic tools dedicated for the processing of clinical and medical documents. In addition, the fact that this corpus is freely available for research, will help the sustainability of automatic tools and reproducibility of their results. It may also encourage the competition and robustness of the proposed tools and methods. For instance, the CAS corpus has been used in the French NLP challenges DEFT 2019 11 and DEFT 2020. 12 Specific manual annotations have been prepared for this challenge (gender and age of patients, consultation reasons, healthcare outcome in 2019, procedures, signs and symptoms, anatomy, medication and other fine-grained information in 2020). Several teams have been attracted by the challenge and participated on tasks dedicated to information retrieval and extraction. This corpus may be exploited in other NLP challenges.
As it becomes more difficult to access clinical documents, corpora with published clinical cases, which are freely available online, may help researchers to work on this type of data. Let's also mention the existence of a similar corpus with clinical cases in German [47], whose purpose is also to help working on clinical and medical texts.

Conclusion
We presented a new corpus in French, called the CAS corpus, which provides medical data close to those produced in the clinical context: description of clinical cases of real or fake patients and their discussion. Overall, the corpus currently contains 4,900 clinical cases, totaling nearly 1.7M word occurrences. A subset of this corpus is currently annotated with several layers of information: morpho-syntactic (PoS-tagging, lemmas) and semantic (the UMLS concepts, uncertainty, negation and their scopes). The corpus is being enriched with more clinical cases published. Besides, other annotation layers are being added and their correctness cross-validated by human annotators.
This corpus has been compared with similar data from Rennes University Hospital. Our analysis showed that there is a strong lexical similarity between these two corpora, which increases when longer sentences are considered. An analysis of out of vocabulary words seems to indicate that misspellings are similar to those proposed in the existing typologies.
The very purpose of our work is to create annotated corpora with clinical cases in French and to make them freely available for research. We expect that this may encourage the development of robust NLP tools for medical narrative documents.