De-identifying Spanish medical texts - named entity recognition applied to radiology reports

Background Medical texts such as radiology reports or electronic health records are a powerful source of data for researchers. Anonymization methods must be developed to de-identify documents containing personal information from both patients and medical staff. Although several anonymization strategies already exist for the English language, they are language-dependent. Here, we introduce a named entity recognition strategy for Spanish medical texts that is translatable to other languages. Results We tested 4 neural networks on our radiology reports dataset, achieving a recall of 97.18% of the identifying entities. In addition, we developed a randomization algorithm to substitute the detected entities with new ones from the same category, making it virtually impossible to differentiate real data from synthetic data. The three best architectures were tested with the MEDDOCAN challenge dataset of electronic health records as an external test, achieving a recall of 69.18%. Conclusions The proposed strategy, combining named entity recognition with randomization of entities, is suitable for Spanish radiology reports. It does not require a big training corpus, so it could be easily extended to other languages and medical texts, such as electronic health records.

These records and reports contain personal data that can compromise patient confidentiality and privacy, as provided for in the European Regulation on the protection of personal data [3]. All words that could identify a patient must be removed or de-identified before data analysts start their research, and even more so before the dataset is published.
From a legal point of view, Regulation (EU) 2016/679 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data [3] provides the regulatory framework in the European Union. Although its application is mandatory for all member states, its concrete implementation varies among them. In Spain, Organic Law 3/2018 [4] establishes the legal framework for data protection in biomedical research. Reuse of personal data for medical research needs to be approved by an ethics committee, and data must be at least pseudonymized before researchers get access to them.
Legal issues regarding data privacy are not the only source of concern. Direct consequences for patients are also a very important factor to be carefully considered. It is crucial to protect the private health details of a patient from any third party's access, and avoid exposing identifiable personal data such as identifier numbers or addresses. De-identification is therefore essential to ensure patient privacy and comply with legal requirements.
From a data management point of view, the de-identification methodology needs both high precision and high recall. Precision minimizes the data loss of the de-identification process, preserving the semantic meaning of the radiology report; recall ensures the best possible de-identification, avoiding leaving any identifiable information in the text [5].
Even though several de-identification or anonymization methodologies have been proposed for English, legislation differs at a national level worldwide and language-specific problems can arise; hence, a different method must be developed for each language. These difficulties extend to any Natural Language Processing (NLP) implementation. In the biomedical field, NLP has been applied successfully in English, including for de-identification purposes [6], but many of these strategies rely on language-specific resources and are not extensible to other languages [7]. Apart from English, this problem has been assessed in French, where different strategies ranging from machine learning to the use of dictionaries and lists have been proposed, along with protocols for corpus development [8,9]. In other languages such as German, Swedish, Dutch or Chinese some strategies and methodologies have also been proposed [5,[10][11][12][13], but so far there have been rather limited attempts at automatic de-identification of Spanish medical texts [14][15][16], such as the MEDDOCAN task [16]. To give an insight into the different approaches proposed by these authors, the datasets used and the performance of each work, we have summarized this information in Table 1.
Most of the works around text de-identification are based on pattern matching or machine learning, or even a combination of both. Whereas pattern matching does not account for the context of a word and is unaware of typographical errors, machine learning techniques require a large corpus of annotated text [17]. Since our radiology reports were mostly free text with sensitive data outside headers, we opted for annotating our own corpus and developing a Named Entity Recognition (NER) based de-identification method.
NER is a sequence-tagging task within the field of NLP that focuses on assigning tokens or words to specific predefined classes, such as persons, dates or organizations. NLP tasks are usually based on recurrent neural networks (RNNs), and NER approaches tend to employ long short-term memory units (LSTM) [18] combined with conditional random fields (CRF) [19,20]. LSTMs are variants of RNNs that can cope with long-distance dependencies in the text, and for many applications it is beneficial to access the left and right context of the sentence through bi-directional LSTMs [20,21]. Moreover, the reference model for several state-of-the-art NER implementations in English is the bidirectional LSTM (BiLSTM)-CRF model by Lample et al. [22][23][24]. Some implementations combine LSTM units with convolutional layers [24,25], and other architectures such as Bidirectional Encoder Representations from Transformers (BERT) [26] have been proposed for several NLP tasks, including NER. Although some contests and projects have been organized to exploit the content of unstructured clinical records in Spanish using NLP tools, they are not focused on de-identification. For example, Cantemist (Cancer Text Mining Shared Task) is a project held to gather a community effort to create tools and models for text mining in oncological records [27]. The best performing models in this contest were based on BiLSTM with CRF. Nevertheless, regarding the de-identification of clinical text for secondary use, the MEDDOCAN (Medical Document Anonymization) task was organized in 2019. The most successful models in this task employ deep learning-based methodologies to perform a NER detection task; for instance, the winning model presented by Lange et al. [28] used a network based on BiLSTM-CRF and achieved a recall and F1-score of 0.974. The second-best model for the de-identification task was designed by Jiang et al. [29], based on BERT and Flair embeddings, and achieved a recall of 0.962 and an F1-score of 0.968. The third proposed model used a spaCy NER model, achieving a recall of 0.953 and an F1-score of 0.960 [30].
Bearing in mind that the best NER approaches for Spanish and in the general literature are based on RNNs with LSTM units and CRF, we decided to focus our work on these architectures. Nevertheless, automatic de-identification approaches do not achieve a perfect recall score, meaning that sensitive information could be leaked. To address this issue, we have proposed and developed a methodology that combines NER with the replacement of the recognized named entities with synthetic data.

Methods
The proposed methodology is based on a combination of NER and the substitution of the detected sensitive words with others randomly sampled from databases. The approach started with the definition of the named entities that contain sensitive information and the annotation of the corpus (Fig. 1a). Then, a randomizer script was created based on publicly available databases to create a synthetic corpus by substituting the manually annotated words by new ones extracted from the databases (Fig. 1b). This corpus was then fed to different NER neural networks to assess their performance and select the most suitable model for the desired application (Fig. 1c). Lastly, when a new radiology record needs to be de-identified, the trained model detects the named entities and the randomizer script substitutes them with random words of the same category (Fig. 1d).
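The final step of this pipeline, in which a trained model and the randomizer script are chained to de-identify a new report, can be sketched as follows. This is a minimal illustration, not the authors' implementation: `model_predict` and `randomize` are hypothetical placeholders for the trained NER network and the database-backed randomizer script.

```python
def deidentify(report, model_predict, randomize):
    """De-identify one report: the NER model labels character spans and the
    randomizer substitutes each span with a synthetic value of the same
    category. Spans are replaced from the end of the text backwards so that
    earlier character offsets remain valid after each substitution."""
    spans = sorted(model_predict(report), reverse=True)  # (start, end, label)
    for start, end, label in spans:
        synthetic = randomize(report[start:end], label)
        report = report[:start] + synthetic + report[end:]
    return report
```

A usage example with stub functions: a predictor that tags the first word as a NAME and a randomizer that always returns "Luis" would turn "Juan ingresó ayer" into "Luis ingresó ayer".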

Named entities
Given that there is no specific guidance in the Spanish legal system on what information has to be removed to de-identify medical texts, we decided to assess the presence in our corpus of the Protected Health Information (PHI) categories defined by the Health Insurance Portability and Accountability Act (HIPAA) in the United States of America [31]. After manual inspection of the data and considering the scope of this work, we performed a subselection of PHI categories and finally grouped them into 6 Named Entities (NEs), as shown in Table 2. Some NEs included other information that should be protected to preserve the privacy of patients or doctors but was not included in the PHI categories, such as digital signatures or healthcare centres. The named entities selected were:
• NAME (name): names and surnames of any person mentioned in the radiology record, typically patients or medical staff.
• DIR (address): geographic data in the form of full addresses, including streets and zip codes.
• LOC (locations): geographic data referring only to cities, villages and other populated areas. This is differentiated from the DIR named entity because a city can be mentioned outside the context of a full address, for example, next to a date as in "14 de Abril, Valencia".
• NUM (numbers): any number or alphanumeric string that might identify a person, such as patient record identification numbers, medical license numbers, digital signatures, fiscal identification numbers and others.
• FECHA (dates): any date appearing in the report, either numeric or written.
• INST (institutions): any healthcare facility or institution mentioned in the radiology record that could be used to narrow down the location of a patient or medical staff.
Header sections (CAB) were included as a seventh NE to ensure that they were not removed from the final text. These headers are necessary for further analysis, as they are key to extracting the most relevant information from a radiology report.
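The seven labels above can be encoded for a sequence-tagging model using the common BIO scheme (B- for the first token of an entity, I- for continuation tokens, O for everything else). The following sketch is illustrative only; the `to_bio_tags` helper is a hypothetical stand-in for the authors' annotation tooling.

```python
# The seven labels used in this work: six identifying NEs plus the CAB
# header label (headers are kept, not removed).
NE_LABELS = {
    "NAME": "names and surnames of patients or medical staff",
    "DIR": "full addresses, including streets and zip codes",
    "LOC": "cities, villages and other populated areas",
    "NUM": "identifying numbers or alphanumeric strings",
    "FECHA": "dates, numeric or written",
    "INST": "healthcare facilities or institutions",
    "CAB": "header sections, preserved in the final text",
}

def to_bio_tags(tokens, spans):
    """Convert character-level annotations into token-level BIO tags.

    `tokens` is a list of (text, start_offset) pairs; `spans` a list of
    (start, end, label) annotations over the same text.
    """
    tags = []
    for text, start in tokens:
        end = start + len(text)
        tag = "O"
        for s, e, label in spans:
            if start >= s and end <= e:
                tag = ("B-" if start == s else "I-") + label
                break
        tags.append(tag)
    return tags
```

For instance, the tokens of "Juan Pérez" covered by a single NAME span would be tagged `B-NAME`, `I-NAME`.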

Corpus construction
The de-identification corpus consists of brain imaging radiology reports randomly extracted from the Medical Imaging Databank of the Valencian Region (BIMCV) database [32,33], distributed among 17 health departments of the Valencian Region (Fig. 2). A total of 7848 records were initially retrieved and automatically pre-annotated using the Spanish National Statistics Institute name and surname database [34], which includes those names with a frequency higher than or equal to 20 in Spain, and a list of the hospital names in the Valencian Region. To ensure the presence of personal information in our corpus, a subset of reports with at least two "NAME" tags was extracted. This filter left out reports containing no sensitive information but including words like "cabeza", which appears in the text as an anatomical part although it can also be a surname. One-third of those reports were randomly selected to be manually corrected and annotated, yielding a final corpus of 692 records. The annotations were manually reviewed by three annotators, finally including all the NE tags.

Fig. 1 b Selection of databases to develop a randomizer script; the script is used to create the synthetic corpus. c Training and testing of different neural networks to select the best performing model. d When a new report needs to be de-identified, the selected model labels the words that belong to one of the defined named entities. Finally, the randomizer script creates a de-identified report with synthetic information.
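A minimal sketch of this pre-annotation and filtering step is shown below. A plain Python set stands in for the INE name/surname database, and the functions are hypothetical illustrations rather than the authors' actual script.

```python
import re

def preannotate(text, name_set):
    """Tag every token found in the frequency-filtered name/surname list.

    `name_set` stands in for the INE database of names and surnames with
    frequency >= 20. Returns (start, end, label) character spans.
    """
    tags = []
    for m in re.finditer(r"\w+", text):
        if m.group(0).capitalize() in name_set:
            tags.append((m.start(), m.end(), "NAME"))
    return tags

def keep_report(text, name_set, min_names=2):
    """Keep a report only if it contains at least two NAME pre-annotations,
    mirroring the filter used to guarantee the presence of personal data."""
    return len(preannotate(text, name_set)) >= min_names
```

Note how this filter discards a report containing only the word "cabeza" (one ambiguous NAME match) while keeping one that mentions a full name such as "Juan Pérez" (two matches).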
Radiology reports were not pre-processed, so that apart from the identifying information they remain unchanged after the de-identification. Although our radiology reports were mostly free-text sections preceded by headers, the 7th health department lacked headers and had an increased number of entities entirely out of context: that is, a name or a surname with no more text, in an independent line, as shown in Fig. 3. With this in mind, we divided our dataset into three sets: training, validation, and a test set formed by the reports from the 7th health department. To assess the performance of our final model with external data, we decided to incorporate 100 randomly selected clinical records from the MEDDOCAN task [16]. These records have a different structure (Fig. 4) and are not related to radiology.
Whereas both the training and validation sets present a similar distribution of NEs (Table 3), the test set shows an increase in addresses, locations and institutions. Having a separate test set for department 7 allows us to check the performance of our method on highly unstructured data, with a distribution of NEs different from training. As shown in Table 3, addresses and locations are the NEs with the lowest sample size.

NE randomization
We developed a methodology to randomize the PHIs found in a text, and applied it to the manually labelled dataset, obtaining a synthetic corpus. This methodology applies a set of rules depending on the NE associated with each tagged word. It is based on the substitution of tagged entities with new words randomly extracted from different databases available online:
• Spanish National Statistics Institute name and surname database [34], weighted by frequency. This database includes foreign names and surnames, such as Xiaojing, Steven, Abdul or Harrison.
• Spanish National Statistics Institute municipal register database [35], weighted by population in 2019.

Fig. 2 Data curation process and corpus preparation workflow. a 7848 radiology reports in total were retrieved from the BIMCV database. b We used a custom Python script to automatically annotate the names, surnames and hospital names from radiology reports. c A subset of records was made meeting the condition that more than one 'name' tag was present, leaving 2214 reports. d Another subsetting was performed to randomly select one-third of reports to be manually annotated and corrected by three annotators. After the manual revision, 692 reports remained. e The ground-truth dataset was divided into 3 subsets: the training set included 447 reports, validation 213, and test 32 reports from healthcare department number 7.

Fig. 3 Partial examples of radiology reports from validation and test. The validation set (a) has clearly defined metadata headers. In turn, the test set (b) has metadata headers in Valencian language and metadata information detached from these headers by a line break. Both structures include identifiable information in new lines without metadata headers. Any name, surname, address, identification number or date presented in the figure is fictitious.
With the aim of avoiding the leakage of sensitive personal data, this methodology also checks that the randomly chosen word or number is not the same as the original one.
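The core of such a randomizer can be sketched as follows. This is an illustration under stated assumptions, not the authors' script: the candidate list and weights stand in for the public databases (INE names weighted by frequency, municipalities weighted by 2019 population), and the re-sampling loop implements the check that the synthetic value differs from the original.

```python
import random

def randomize_entity(original, candidates, weights=None, rng=random):
    """Replace a detected entity with a random same-category value.

    `candidates` and `weights` stand in for a database such as the INE
    name list (weighted by frequency) or the municipal register (weighted
    by population). Re-samples until the synthetic value differs from the
    original, so the real datum never leaks through unchanged.
    """
    if all(c == original for c in candidates):
        raise ValueError("no alternative candidate available")
    while True:
        choice = rng.choices(candidates, weights=weights, k=1)[0]
        if choice != original:
            return choice
```

For example, asking for a replacement of "Valencia" from a weighted list of municipalities is guaranteed to return a different city, regardless of the random draw.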

Networks
A variety of neural networks were tested and evaluated, all of them designed for NER tasks. Three network architectures were based on Bidirectional Long Short-Term Memory (BiLSTM) layers, obtained from Guillaume Genthial's GitHub repository [39]:
• LSTM-CRF: GloVe vectors, BiLSTM and Conditional Random Fields (CRF), based on the work of Huang et al. [20].
• LSTM-LSTM-CRF: GloVe vectors, character embeddings, BiLSTM for character embeddings, BiLSTM and CRF, based on the work of Lample et al. [22].
• Conv-LSTM-CRF: GloVe vectors, character embeddings with 1D convolution and max pooling, BiLSTM and CRF, based on the work of Ma and Hovy [40].
These networks were trained with and without an Exponential Moving Average (EMA) of the weights. We also trained a spaCy [24] NER model, based partly on the work of Lample et al. [22], with Bloom embeddings along with Convolutional Neural Networks (CNNs) with an attention mechanism.

Evaluation metrics
To assess the performance of the different models trained, we computed precision, recall and F1-score. These metrics can be defined as:

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 = 2 · Precision · Recall / (Precision + Recall)

where TP, FP and FN denote true positives, false positives and false negatives, respectively. To compute the amount of de-identification achieved by a model, we did not only apply these metrics to each NE, but also to the whole set of words that should have been labelled as an identifying NE. With this approach, we obtained quantitative indicators of global de-identification.
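These two levels of evaluation can be sketched as follows. The per-tag formulas are standard; the global variant, as described above, counts a word as correct if it received any identifying label, regardless of which one. The tag names and the treatment of CAB as non-identifying follow the entity definitions in this work; the function names are illustrative.

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 from true/false positive and false
    negative counts: P = TP/(TP+FP), R = TP/(TP+FN), F1 = 2PR/(P+R)."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def global_deid_metrics(gold, pred):
    """Global de-identification metrics over aligned token tag sequences:
    a token counts as a true positive if both gold and predicted tags are
    identifying NEs, even when the categories disagree. 'O' tokens and
    CAB headers are not identifying."""
    identifying = lambda tag: tag not in ("O", "CAB")
    tp = sum(1 for g, p in zip(gold, pred) if identifying(g) and identifying(p))
    fp = sum(1 for g, p in zip(gold, pred) if not identifying(g) and identifying(p))
    fn = sum(1 for g, p in zip(gold, pred) if identifying(g) and not identifying(p))
    return prf(tp, fp, fn)
```

Under this global view, predicting FECHA where the gold tag is NAME still counts as successful de-identification, since the word is removed from the text either way.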

Results
First, models for each neural network were trained and then evaluated. Table 4 shows the mean global results of the different networks, averaged over three replicates of each one.
Recall is one of the most relevant evaluation metrics in any de-identification process [5], as it reflects the ability to avoid the leakage of sensitive information. Taking this into account, LSTM-LSTM-CRF with EMA shows the highest recall in test, and Conv-LSTM-CRF with EMA in validation. Although these are the two best-performing networks in both sets, we decided to also include spaCy in the further analysis and to leave out the worst-performing architecture, LSTM-CRF.
The performance of each NE for LSTM-LSTM-CRF with EMA, Conv-LSTM-CRF with EMA and spaCy is displayed in Tables 5, 6 and 7. Whereas in the training set spaCy outperforms the other networks in every NE except CAB, in the validation and test sets the results are more contested. Evaluating the F1-score in validation, LSTM-LSTM-CRF classifies dates, locations, names and numbers better, while spaCy stands out with institutions. On the other hand, Conv-LSTM-CRF performs better with addresses and shows a higher recall in names than LSTM-LSTM-CRF. When analysing the results for the test set, the spaCy model shows better metrics in dates and better recall in institutions, whereas LSTM-LSTM-CRF has a higher F1-score in institutions, locations and names. Conv-LSTM-CRF again performs better with addresses, but also with numbers, and shows the highest recall in locations and names. When applying the models to the MEDDOCAN dataset there is a drop in performance, although spaCy keeps higher recall rates in addresses, dates, institutions and names, whilst Conv-LSTM-CRF outperforms it in locations and numbers.
Given that our aim was not to correctly classify NEs, but to completely remove sensitive information from the text, global de-identification metrics were computed (Table 8). Conv-LSTM-CRF with EMA shows better recall in the validation and test sets (Fig. 5), whilst LSTM-LSTM-CRF has a higher F1-score in test. On the MEDDOCAN data, the model that best maintains recall and F1-score is LSTM-LSTM-CRF (Fig. 5, Table 8). To assess the performance of our models against external systems, we also wanted to apply the models generated for MEDDOCAN to our data. Only one of the participants made their models available [30], one of them implemented with spaCy. Their spaCy model achieved a precision of 87.89% and 80.31%, a recall of 42.66% and 26.54%, and an F1-score of 57.44% and 39.89% in our validation and test sets, respectively (Table 8).

Discussion
This work has defined and evaluated a methodology based on NER to de-identify radiology reports in Spanish. In comparison with traditional approaches based on regular expressions, NLP with neural networks does not underperform in the presence of human misspellings or the absence of a clear, repeated structure. Neural networks are also context-dependent, so words like Cabeza (head), a common surname in Spanish that also refers to an anatomical part, will be detected as a "NAME" entity when used as a surname but left unchanged when used as a medical word, avoiding the loss of meaningful information.
The main drawback of this methodology is the requirement of a learning corpus of de-identified reports, which is not necessary for regular-expression-based strategies. Although the curation of a corpus is a tedious and methodical task, there is no need for a big dataset: with a training set of 447 texts, we achieved a suitable performance. Neural networks should be trained with a corpus diverse in structure to avoid overfitting. Machine learning models tend to learn the structure or format of the text, finding the position of words containing sensitive data when performing de-identification. If a model is trained on a corpus with a fixed structure, it will only be able to de-identify similarly formatted texts. By comparing our spaCy model with the spaCy model retrieved from MEDDOCAN [30], we show the high impact that text structure has on the outcome. The MEDDOCAN training set was similar in size to ours (500 and 447 texts, with a median of 20 and 22 lines per text, respectively), but their text structure was highly defined and invariant (texts from both datasets are compared in Fig. 4). With a training set diverse in its structure we can obtain higher recall and precision on external data, generating a de-identification model better prepared to deal with new data. Figure 3 illustrates the structural and format diversity of radiological reports between the health departments included in our dataset. Considering that the recall metric assesses a model's capability to avoid the leakage of sensitive information, we propose LSTM-LSTM-CRF with EMA as the best neural network to address a de-identification task based on NER. This neural network showed a higher F1-score in test and MEDDOCAN, and its recall in the validation and test sets is comparable to that obtained with Conv-LSTM-CRF with EMA. Furthermore, its recall on MEDDOCAN outperforms that obtained by the other networks.
Thus, we expect LSTM-LSTM-CRF with EMA to behave optimally when presented with new data. Although its recall reaches 99.29%, when new radiology reports from the Valencian Region are included in the BIMCV database, the 97.18% recall in the test set means that almost 3% of identifying words will remain in the text. This might not be enough to re-identify the patient: only a surname, a city name, or a part of an address could be left. In fact, the de-identification methodology proposed in this work was applied to the COVID-19 image dataset described by de la Iglesia Vayá et al. [41], which needed to be reused for research due to the medical emergency situation in 2020. The radiology records in this dataset were revised by radiologists, who found in 28 out of 11558 (0.24%) reports enough sensitive information to identify patients or medical staff. This included names, patient record identification numbers, birthdates or healthcare centre names. To ensure that the identity of a patient is not recoverable, a final check of the texts by an authorized person remains necessary. Nevertheless, we propose a randomization strategy to change the identified NEs for synthetic ones of the same category. This strategy masks the identifying words left by the neural network with synthetic information, making it more difficult to discern between real and synthetic identifying words than simply erasing words would (Fig. 6). Further efforts are needed to validate whether this strategy makes the original information irretrievable.

Fig. 5 Global de-identification metrics for the three best performing architectures. Precision (a), recall (b) and F1-score (c) for the three best performing architectures, LSTM-LSTM-CRF with EMA (blue), Conv-LSTM-CRF with EMA (yellow) and spaCy (grey), by data subset.

Conclusions
Medical texts hold great potential for research, but legal and privacy concerns arise with their use, even more so when institutions external to the hospital are involved. Real-world medical texts tend to be semi-structured, with free text that includes sensitive information, so classical de-identification approaches based on regular expressions are not good enough. We propose a robust and flexible framework based on NER for Spanish medical texts, tested on radiology reports from the Valencian Region. This framework is generic and relatively simple, and can be easily generalized to other Spanish medical texts by re-training the network with additional data. However, the applicability of the de-identification methodology to other languages needs to be evaluated. We consider that our approach can be replicated in other Romance languages, following the training of a BiLSTM-CRF network with suitable data and the application of the randomization strategy. The easiest network to implement for teams not specialized in deep learning would be spaCy, although it is not the best performing. The proposed de-identification methodology still missed identifiers after training, thus a final check of the texts by an authorized person remains necessary. Nevertheless, we believe a combination of NER with the generation of synthetic data will make it virtually impossible to extract real identifying words from the text. Further efforts are needed to assess and test this hypothesis.

Fig. 6 Anonymization strategies. When applying word elimination (a), errors are easily detectable, whereas with synthetic substitution (b) any mistake is hidden with randomized synthetic information. Any name, surname, address, identification number or date presented in the figure is fictitious.
Abbreviations NLP: Natural language processing; NER: Named entity recognition; PHI: Protected health information; HIPAA: Health insurance portability and accountability act; NE: Named entity; BIMCV: Medical imaging databank of the Valencian Region; BiLSTM: Bidirectional long short-term memory; CRF: Conditional random fields; EMA: Exponential moving average; CNN: Convolutional neural network