De-identifying free text of Japanese electronic health records

Background
Recently, more electronic data sources are becoming available in the healthcare domain. Electronic health records (EHRs), with their vast amounts of potentially available data, can greatly improve healthcare. Although EHR de-identification is necessary to protect personal information, automatic de-identification of Japanese-language EHRs has not been studied sufficiently. This study was conducted to improve de-identification performance for Japanese EHRs using classic machine learning, deep learning, and rule-based methods, depending on the dataset.

Results
Using three datasets, we implemented de-identification systems for Japanese EHRs and compared the de-identification performance of rule-based, Conditional Random Field (CRF), and Long Short-Term Memory (LSTM)-based methods. Gold-standard tags for de-identification were annotated manually for age, hospital, person, sex, and time. We used different combinations of our datasets to train and evaluate the three methods. Our best F1-scores were 84.23, 68.19, and 81.67 points, respectively, for evaluations on the MedNLP dataset, a dummy EHR dataset written virtually by a medical doctor, and a Pathology Report dataset. Our LSTM-based method performed best, except on the MedNLP dataset, for which the rule-based method was best. The LSTM-based method still achieved a good score of 83.07 points on the MedNLP dataset, which differs by only 1.16 points from the best score obtained using the rule-based method. These results suggest that LSTM adapted well to the different characteristics of our datasets. Our LSTM-based method outperformed our CRF-based method by a 7.41 point F1-score when applied to our Pathology Report dataset. This report is the first describing a study that applies an LSTM-based method to a de-identification task for Japanese EHRs.
Conclusions
Our LSTM-based machine learning method was able to extract named entities to be de-identified with generally better performance than that of our rule-based method. However, machine learning methods remain inadequate for processing expressions with low occurrence frequency. Our future work will specifically examine combining LSTM and rule-based methods to achieve better performance. Our currently achieved level of performance is sufficiently high compared with that of publicly available Japanese de-identification tools. Therefore, our system will be applied to actual de-identification tasks in hospitals.

In their earlier study [5], Grouin and Névéol obtained better F1-scores for a cardiology corpus (exact match, 0.883; overlap match, 0.887) using conditional random fields (CRF) than using their rule-based method (exact match, 0.843; overlap match, 0.847). However, their rule-based method performed better on the photopathology corpus (exact match, 0.681; overlap match, 0.693) than their CRF-based method (exact match, 0.638; overlap match, 0.638) because that corpus contained fewer data than the cardiology corpus.
Grouin and Névéol [6] discussed annotation guidelines for French clinical records. After collecting 170,000 documents of 1000 patient records from five hospitals, they first prepared a rule-based system and the CRF-based system from their earlier study [5], described above. Their rule-based system relies on 80 patterns designed specifically to process the training corpus, along with lists gathered from existing resources on the internet. They randomly selected 100 documents (Set 1) from their dataset and applied both systems. For each document, they randomly showed one of the two systems' outputs to the annotators for revision. They applied their rule-based system to another set of 100 documents (Set 2), which were further reviewed and revised by a human annotator. They re-trained their CRF-based system using the revised Set 2 annotations, which was then applied to a further set of 100 documents (Set 3). Annotators reviewed these annotations in subsets for different agreement analyses. The study also compared human revision times among the different annotation sets, which was a main objective of their study. In Set 1, they annotated 99 address tags, 101 zip_code tags, 462 date tags, 47 e-mail tags, 224 hospital tags, 59 identifier tags, 871 last_name tags, 750 first_name tags, 383 telephone tags, and 218 city tags. They reported that their rule-based method was better (0.813) in terms of F1-score than their CRF-based method (0.519) when evaluated with 50 documents of Set 1. When trained with Set 2, a corpus of the same domain, their CRF-based system performed better, yielding F1-scores of 0.953 for Set 3 and 0.888 for Set 1.
From the Stockholm EPR [7], a Swedish database of more than one million patient records from two thousand clinics, Dalianis and Velupillai [8] extracted 100 patient records to create a gold standard for automatic de-identification based on HIPAA. They annotated 4423 tags, including 56 age tags, 710 date_part tags, 500 full_date tags, 923 last_name tags, 1021 health_care_unit tags, 148 location tags, and 136 phone_number tags. They pointed out that Swedish morphology is more complex than that of English: it includes more inflections, making the de-identification task in Swedish more difficult.
Jian et al. [9] compiled a dataset of 3000 documents in Chinese, comprising 1500 hospitalization records, 1000 summaries, 250 consulting records, and 250 death records. They randomly extracted 300 documents from this dataset and discussed a mode of de-identification with lower annotation cost. They annotated these 300 documents with their tags (kappa = 0.76 between two annotators on a 100-document subset). They then applied their pattern-matching module to these 300 documents, yielding a dense set of 201 sentences that include PHI (Protected Health Information). These 201 sentences included 141 name tags, 51 address tags, and 22 hospital tags.
Du et al. [10] conducted de-identification experiments using 14,719 discharge summaries in Chinese, for which two students annotated 25,403 tags. This dataset includes 6403 institution tags, 11,301 date tags, 33 age tags, 2078 patient_name tags, 3912 doctor_name tags, 326 province tags, 310 city tags, 774 country tags, 917 street tags, 277 admission_num tags, 21 pathological_num tags, 23 x-ray_num tags, 263 phone tags, 420 doctor_num tags, and 13 ultrasonic_num tags (inter-annotator agreement was 96%, kappa = 0.826). Their experiments demonstrated that a method combining rules and CRF performed best, yielding a 98.78 F1-score. The Chinese language shares an issue with the Japanese language: both require tokenization because no spaces exist between words. This issue makes de-identification tasks more difficult than in other languages.
The reports described above present a range of different evaluation scores. However, they adopted different annotation criteria, which makes direct comparison difficult. For instance, Grouin and Névéol used more detailed annotations than those used by Jian et al., as follows. Jian et al. introduced Doctor and Patient tags, but evaluated both simply as Name. Grouin and Névéol introduced ZipCode, Identifier, Telephone, and City tags, none of which was annotated in the work of Jian et al. Additionally, they assigned Last Name and First Name tags, where performance for First Name was better than for Last Name by around 10 points. However, both are worse than the results reported by Jian et al., probably because Jian et al. applied their pattern-matching algorithm to filter their training data. Regarding Address tags, Jian et al. obtained a 94.2 point F-score, whereas the CRF method of Grouin and Névéol obtained scores of less than 10 points. As Grouin and Névéol suggested, eliminating City tags in street names can greatly improve their results: their rule-based method then yielded an 86 point F-score.
Unfortunately, automatic de-identification of EHRs has not been studied sufficiently for the Japanese language. A de-identification shared task for Japanese EHRs was held as a task of MedNLP-1 [11]. Named entity extraction was then attempted in the MedNLP-2 [12] tasks using datasets similar to those of MedNLP-1. We designate MedNLP-1 simply as MedNLP hereinafter because we specifically examine the de-identification task, not the other tasks in the MedNLP shared task series.
Regarding machine learning methods, Support Vector Machines (SVM) [13] and CRF [14] were often used in earlier Named Entity Recognition (NER) tasks, in addition to rule-based methods. Recent deep learning methods include Long Short-Term Memory (LSTM) [15] with character embedding and word embedding [16], which performed best on the CoNLL 2002 [17] (Spanish and Dutch) and CoNLL 2003 [18] (English and German) NER shared task data; these tasks require detection of "personal", "location", "organization", and "other" tag types. Another LSTM model, similar to that earlier work [16], was also applied to NER on Japanese newspapers [19]. Although deep neural network models have recently shown better results, rule-based methods are still often better than machine learning methods, especially when insufficient annotated data are available.
To evaluate the effectiveness of such different methods for the Japanese language, we implemented two EHR de-identification systems for Japanese in our earlier work [20]. We used the MedNLP shared task dataset and our own dummy EHR dataset, written as a virtual database by medical professionals who hold medical doctor certification. Based on this earlier work, we added a new dataset of pathology reports to this study and annotated it with the following tags. De-identification tags of age, hospital, sex, time, and person are annotated manually in all these datasets, following the annotation standard of the MedNLP shared task to facilitate comparison with earlier studies. We treat these annotations as the gold standard for our de-identification task. To these three datasets, we applied a rule-based method, a CRF-based method, and an LSTM-based method. Additionally, three annotators annotated the three datasets with our own tags to calculate inter-annotator agreement, and we examined the coherence of the original annotations of the datasets. Overall, this study differs from our earlier work [20] in that we added a new pathology dataset and its annotations, trained and evaluated our machine learning models using the new dataset, and evaluated the results using newly created annotations by three annotators to observe characteristics of the original and our own annotations.

Datasets
Our datasets were derived from three sources: MedNLP, dummy EHRs, and pathology reports. Irrespective of the source, de-identification tags of five types are annotated manually: age (numerical expressions of subjects' ages, including their numerical classifiers), hospital (hospital names), sex (male or female), time (subject-related time expressions, including their numerical classifiers), and person (person names). Characteristics of these datasets are presented in Table 1. It is noteworthy that the texts of the MedNLP and dummy EHR datasets are not actual texts; they were written by medical professionals, each of whom holds medical doctor certification. Nevertheless, the characteristics of the descriptions differ between these two sources, probably because of differences between the writers. The number of annotators is not described for the MedNLP dataset, whereas a single annotator created the annotations of the dummy EHR dataset and of the Pathology Report dataset, individually.

MedNLP shared task dataset
We used the MedNLP de-identification task dataset for comparison with earlier studies that used the same dataset. This dataset includes dummy EHRs (discharge summaries) of 50 patients. Although training and test datasets were provided by the shared task organizers, the test dataset of the formal run is no longer publicly available. It is therefore not possible to compare results directly with earlier works in the MedNLP shared task formal run (Tables 2 and 3 show the formal run results). However, both training and test datasets were originally parts of a single dataset, so we can discuss their characteristics in comparison with those reported in earlier works using the training dataset only. We calculated inter-annotator agreement among three annotators for the training dataset: the average F1-score of the three pairs was 86.1 for 500 sentences of this dataset.

Dummy EHRs
Another source is our original dummy EHRs. We built our own dummy EHRs of 32 patients, assuming that the patients were hospitalized. Documents of our dummy EHRs were written by medical professionals (doctors). We added manual annotations for de-identification following the guidelines of the MedNLP shared task. These annotations were originally assigned by a single annotator. Additionally, we added annotations by three annotators to calculate inter-annotator agreement.

Pathology reports
The other source is a dataset of 1000 short pathology reports, which differ greatly from the EHRs above. Pathology reports describe pathological findings in which personal information (names of patients, doctors, and hospitals, and time expressions) frequently appears, but in which sex and age tags rarely appear. Personal names, hospital names, and dates were manually de-identified beforehand by the dataset provider and replaced with special characters. To support realistic training and evaluation of the machine learning methods, we replaced these special characters with randomly assigned real entity names as follows. For the hospital names, we used a list of 96,167 hospital names published by the Japanese government, which covers most Japanese hospital names. For the person names, we manually created 20 dummy family names and 20 dummy first names; each person name uses either a family name alone or a combination of a family name and a first name. Additionally, we calculated inter-annotator agreement among three annotators: the average F1-score of the three pairs was 80.2 for 500 sentences of this dataset. This Pathology Report dataset is the only real (not dummy) dataset among our three datasets. Because we received the manually de-identified version of the original real pathology reports, no ethical review was necessary.
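The surrogate substitution described above can be sketched as follows. The placeholder tokens, name lists, and function below are illustrative assumptions (the dataset's actual special characters are not specified here); the person-name logic follows the paper's rule of using a family name alone or a family name plus a first name.

```python
import random

def fill_surrogates(text, hospital_names, family_names, first_names, rng):
    """Replace de-identification placeholders with randomly chosen surrogate
    names. The bracketed placeholder tokens are assumptions for illustration."""
    out = text
    while "[HOSPITAL]" in out:
        out = out.replace("[HOSPITAL]", rng.choice(hospital_names), 1)
    while "[PERSON]" in out:
        # A family name alone, or family + first name, as described in the text.
        family = rng.choice(family_names)
        person = family if rng.random() < 0.5 else family + rng.choice(first_names)
        out = out.replace("[PERSON]", person, 1)
    return out

rng = random.Random(0)  # fixed seed for a reproducible example
text = "[PERSON]様は[HOSPITAL]より紹介。"
filled = fill_surrogates(text, ["静岡総合病院"], ["田中"], ["太郎"], rng)
```

In practice the hospital list would be the 96,167 government-published names, so that the machine learning methods see realistic entity surface forms rather than placeholder symbols.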

Methods
We used a Japanese morphological analyzer, Kuromoji,1 for tokenization and part-of-speech (POS) tagging. We registered a customized dictionary, derived from Wikipedia entry names and entries of the Japanese Standard Disease-code master [21], with this morphological analyzer in addition to the analyzer's default dictionary. We implemented rule-based, CRF-based, and LSTM-based methods.

Rule-based method
Unfortunately, the implementation of the best system for the MedNLP-1 de-identification task [22] is not publicly available. We implemented our own rule-based program based on the descriptions in their paper, replicating their system to the greatest extent possible. We present their rules below for a target word x for each tag type.
Age If x's detailed POS is "numeral", then apply the rules in Table 4.

Hospital
If one of the following keywords appears in x, then mark x as hospital: 近医 (a nearby clinic or hospital), 当院 (this clinic or hospital), or 同院 (the same clinic or hospital).
If x's POS is "noun" and its detailed POS is not "non-autonomous word", or if x is "•", "◯", "▲", or "■" (these symbols are used for manual de-identification because the datasets are dummy EHRs), and if the suffix of x is one of the hospital-related keywords, then mark it as hospital.
Sex If x is 男性 (man), 女性 (woman), or one of "men", "women", "man", or "woman" (in English), then mark it as sex.
Time If x's detailed POS is "numeral" and x consists of four digits + slash + one or two digits (corresponding to "yyyy/mm") or one or two digits + slash + one or two digits (corresponding to "mm/dd"), then mark it as time. If x's detailed POS is "numeral" and x is followed by 歳 (years old), 才 (years old), or 代 ('s), then mark it as time.
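The date-pattern rules above can be sketched as regular expressions. This is an illustrative reimplementation, not the authors' code, and it already folds in the workaround mentioned in the Results section of restricting months to 1-12 and days to 1-31 to reduce confusion with clinical inspection values written in an "xx/yy" format.

```python
import re

YYYY_MM = re.compile(r"^(\d{4})/(\d{1,2})$")  # corresponds to "yyyy/mm"
MM_DD = re.compile(r"^(\d{1,2})/(\d{1,2})$")  # corresponds to "mm/dd"

def looks_like_date(token):
    """Return True if a numeral token matches the yyyy/mm or mm/dd rule,
    with month and day restricted to plausible ranges."""
    m = YYYY_MM.match(token)
    if m:
        return 1 <= int(m.group(2)) <= 12
    m = MM_DD.match(token)
    if m:
        mm, dd = int(m.group(1)), int(m.group(2))
        return 1 <= mm <= 12 and 1 <= dd <= 31
    return False
```

Even with the range check, a token such as "3/15" is ambiguous between a date and an inspection value, which is why the Results section concludes that contextual information, not just rules, is needed.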

CRF-based method
We implemented a CRF-based system because many participants in the MedNLP-1 de-identification task used CRFs, including the second-best team and the baseline system. The best participant used a rule-based system, as described previously. We used the MALLET2 library for the CRF implementation. We defined five training features for each token3: part-of-speech (POS), detailed POS, character type (Hiragana, Katakana, Kanji, or Number), a binary feature indicating whether a token is included in our user dictionary, and another binary feature indicating whether a token begins its sentence.
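The character-type feature can be computed from Unicode block ranges. The sketch below is a minimal illustration; the function name and the "Other" fallback (for Latin letters, symbols, etc.) are our assumptions, since the paper lists only Hiragana, Katakana, Kanji, and Number.

```python
def char_type(token):
    """Classify a token's character type from its first character,
    one of the five CRF features described in the text."""
    ch = token[0]
    if "0" <= ch <= "9" or "０" <= ch <= "９":   # ASCII or full-width digits
        return "Number"
    if "\u3041" <= ch <= "\u3096":               # Hiragana block
        return "Hiragana"
    if "\u30a1" <= ch <= "\u30fa":               # Katakana block
        return "Katakana"
    if "\u4e00" <= ch <= "\u9fff":               # CJK Unified Ideographs (kanji)
        return "Kanji"
    return "Other"
```

A token-level feature vector would then combine this value with the POS, detailed POS, and the two binary dictionary/sentence-start features before being passed to the CRF trainer.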

LSTM-based method
Our LSTM-based method combines a bidirectional LSTM (bi-LSTM) and a CRF, using character-based and word-based embeddings (Fig. 1), following earlier work reported as successful for other languages [16].
For the word-based embedding, we used an existing Word2Vec [23] model trained on Japanese Wikipedia.4 We used a bi-LSTM to embed characters; we then concatenated these two embeddings. The concatenated output was fed to another bi-LSTM and then to a CRF to output IOB tags.
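As a shape-level sketch of this pipeline: the word-embedding dimension of 200 used below is an assumption (it is not stated here), while the character and LSTM hidden sizes follow the parameter settings of Table 5.

```python
def forward_shapes(n_tokens, word_dim=200, char_hidden=100, lstm_hidden=300):
    """Trace per-token representation sizes through the bi-LSTM-CRF model.

    Returns the shape of the concatenated (word + character) embedding and
    the shape of the main bi-LSTM output that is fed to the CRF layer.
    """
    char_repr = 2 * char_hidden        # character bi-LSTM: forward + backward
    concat = word_dim + char_repr      # word embedding + character representation
    bilstm_out = 2 * lstm_hidden       # main bi-LSTM output per token
    return (n_tokens, concat), (n_tokens, bilstm_out)

# Under these assumptions, a 4-token sentence yields a (4, 400) concatenated
# input and a (4, 600) bi-LSTM output before the CRF layer.
```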
Our implementation is publicly available on GitHub.5 Table 5 presents the parameter settings.

Experiment settings and evaluation metrics
We followed the evaluation metrics of the MedNLP-1 shared task, using IOB2 tagging [24]. We used four-fold cross validation (the rule-based method requires no training data). We prepared five datasets: MedNLP (MedNLP), dummy EHRs (dummy), pathology reports (pathology), MedNLP + dummy EHRs (MedNLP + dummy), and one comprising all three datasets (all). For each dataset, we applied cross validation. The CRF and LSTM were trained with four patterns of training data: the target dataset only, one of the other datasets only, MedNLP + dummy, and all.
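Under IOB2 tagging, the first token of every entity is tagged B-&lt;type&gt; and subsequent tokens I-&lt;type&gt;, with O for everything else. A minimal encoder sketch (the token text, span representation, and function name are illustrative):

```python
def to_iob2(tokens, spans):
    """Encode entity spans as IOB2 tags: 'B-' on the first token of every
    entity, 'I-' inside, 'O' elsewhere. Spans are (start, end, label) with
    end exclusive."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = "B-" + label
        for i in range(start + 1, end):
            tags[i] = "I-" + label
    return tags

tokens = ["静岡", "病院", "に", "入院"]
tags = to_iob2(tokens, [(0, 2, "hospital")])
# tags == ["B-hospital", "I-hospital", "O", "O"]
```

The B-/I- distinction lets the evaluator recover exact entity boundaries, which is what the strict-match metric below depends on.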
Our evaluation uses strict matching of named entity spans, calculating F1-score, precision, and recall. Table 6 presents the evaluation results.
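Strict matching means a predicted entity counts as correct only when both its span boundaries and its tag type equal a gold entity exactly. A minimal sketch of this metric (the (start, end, label) span representation is an assumption):

```python
def strict_prf(gold_spans, pred_spans):
    """Strict-match precision, recall, and F1 over entity spans.

    A predicted span is a true positive only if an identical
    (start, end, label) triple exists in the gold annotations.
    """
    gold, pred = set(gold_spans), set(pred_spans)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# A boundary error (5, 7) vs. gold (5, 6) scores zero under strict matching,
# even though the spans overlap.
p, r, f = strict_prf({(0, 2, "hospital"), (5, 6, "time")},
                     {(0, 2, "hospital"), (5, 7, "time")})
```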

Results obtained using the MedNLP dataset
In this MedNLP dataset, the total number of sex tags is very small, and the number of person tags is zero. The rule-based system performed best in terms of F1-score because its rules were originally tuned to this very MedNLP dataset. LSTM performed best for age and time, probably because these tags exhibit typical patterns with little variation. LSTM is superior to the rule-based method, except for sex and hospital. Regarding sex, we observe better performance when LSTM uses more training data; therefore, the data size is likely the reason why LSTM performed poorly for sex.
Results obtained using the dummy EHR dataset
LSTM (M + d) performed best in terms of F1-score. CRF performed better when trained with the M + d dataset than with the target dataset only. This performance increase consists of a decrease for age and increases for all other tags, suggesting that these two datasets differ in their age tag annotation scheme.
The overall performance on this dummy EHR dataset is worse than on the MedNLP dataset, suggesting that the dummy EHR dataset is more difficult to de-identify.

Results obtained using the pathology report dataset
The LSTM-based method was better (81.67) than the CRF-based method (74.26), a difference of 7.41 F1 points, when applied to our Pathology Report dataset.
Our rule-based system achieved very high recall but very low precision for time, with a difference of 38 points. The pathology reports include many clinical inspection values written in an "xx/yy" format, which can be confused with dates expressed in an "mm/dd" format. We applied a workaround restricting matches to 1 ≤ mm ≤ 12 and 1 ≤ dd ≤ 31, but it was insufficient: contextual information is needed, not just rules. Results for hospital are better than those for time, with a smaller difference (15 points) between precision and recall.
When trained with the Pathology Report dataset only, both machine learning methods perform better than our rule-based system. When trained with the M + d dataset, which does not contain the pathology dataset, neither CRF nor LSTM performs well because the pathology reports differ greatly in their styles of description and named entities.

Discussion
These results suggest that our datasets have quite different characteristics in the contexts and forms in which their named entities appear, but that LSTM adapted well to these differences. Adding the Pathology Report dataset to the training data seems to degrade system performance on the other target test datasets because of the different dataset characteristics (examples are presented in Table 1). For example, when trained with the Pathology Report dataset, the hospital tags of the MedNLP dataset show lower performance because of the different descriptions of hospital names in these two datasets. The Pathology Report dataset has full hospital names such as "Shizudai Dermatology Clinic," whereas the other two datasets have more casual descriptions such as "近医 (hospital nearby)" and "当院 (our hospital)". The Pathology Report dataset also has different contextual patterns that the machine learning methods could have learned, such as "院外標本 (ex-hospital sample)" appearing immediately before hospital tags, and suffixes such as "xx hospital" or "xx clinic". These words, "hospital" and "clinic", might have been learned as semantically similar by Word2Vec.
[Table 5. Parameter settings: character embedding size, 100; character hidden layer, 100; LSTM hidden layer, 300; learning rate, 0.001.]
[Table 6. Evaluation results for each tag and in total, for different methods (rule, CRF, LSTM) and different evaluation datasets (MedNLP, dummy EHR, and pathology reports). M, d, and P respectively denote training data of MedNLP, dummy EHR, and pathology reports; M + d denotes MedNLP + dummy EHR; "all" denotes all three datasets; other machine learning rows use the target evaluation dataset as training data. Each cell shows F1-score, precision, and recall (values multiplied by 100). The best scores for each tag type and metric are in bold. All evaluations used four-fold cross validation.]
Another difference among the datasets is the coherence of their annotations. We compared the original annotations of the datasets with our own new annotations, created for this study by three annotators. These new annotations were created to calculate inter-annotator agreement, as described in the Datasets section. The original-versus-new inter-annotator agreement (and the inter-annotator agreement among the three annotators), in average F1-scores, was 0.566 (0.861), 0.342 (0.761), and 0.772 (0.802), respectively, for the MedNLP, dummy EHR, and Pathology Report datasets. As these scores strongly suggest, the original annotations were insufficiently coherent. By contrast, our new annotations are much more coherent because we included more detailed annotation guidelines; for example, our guidelines specify the handling of prefixes, suffixes, and classifiers. Annotating larger datasets with this coherent guideline is anticipated as a subject for future work. It is particularly interesting that our system performance was better than the inter-annotator agreement on the Pathology Report dataset. One likely reason is the remaining vagueness in the guideline, such as whether to include particles when assigning named entities. We applied the automatic tagger for pre-annotation; human annotators then reviewed the results. However, annotators sometimes depend excessively on automatically annotated parts-of-speech without considering context and semantics; alternatively, the part-of-speech tagger can simply fail. Therefore, an annotation guideline including precise part-of-speech specifications will be required.
An earlier study that applied a similar LSTM-based method to de-identify English medical data [25] found lower F1-scores for LOCATION and NAME tags on the i2b2 2014 dataset and the MIMIC-III dataset [26], which includes records of 61,532 patients in an intensive care unit (ICU); the performance of a naïve CRF was very low. Their LOCATION tag corresponds to our hospital tag, exhibiting similar characteristics across different languages. The LSTM-based method can therefore be regarded as effective for Japanese medical de-identification tasks as well. If a larger dataset were available, it would likely yield better performance. Japanese-specific difficulties include the following: Japanese (like Chinese) has no spaces between tokens, which makes tokenization much more difficult and ambiguous; the number of letter types is much greater than in other languages, including tens of thousands of kanji, 50 hiragana, 50 katakana, numerals, and Latin letters; and the language has more synonyms than many other languages.
Our system performance nearly reaches the inter-annotator agreement, which can be regarded as an upper bound of system performance. The current performance is sufficiently high compared with other publicly available Japanese de-identification tools. Therefore, we plan to apply our system to actual de-identification tasks in hospitals.

Conclusions
We implemented three de-identification methods for Japanese EHRs and applied them to three datasets, derived from two dummy EHR sources and one real Pathology Report source. These datasets have manually annotated de-identification tags, following the MedNLP shared task annotation guideline.
Our best F1-scores over all tag types are 84.23 (rule-based), 68.19 (LSTM), and 81.67 (LSTM) points, respectively, for the MedNLP dataset, the dummy EHR dataset, and the Pathology Report dataset. Our LSTM-based method performed best on two datasets, whereas our rule-based method performed best on the MedNLP dataset. However, our LSTM-based method also achieved a good score of 83.07 points on the MedNLP dataset, which differs by only 1.16 points from the best score of the rule-based method. Our results demonstrate that the bi-LSTM-based method with character embedding and word embedding tends to work better than the other methods, exhibiting more robustness than CRF across different data sources. The LSTM-based method was better than the CRF-based method, with a 7.41 point F1-score difference, when applied to our Pathology Report dataset. This report is the first describing a study that applies this LSTM-based method to a de-identification task for Japanese EHRs.
Machine learning methods can extract named entities for de-identification with performance comparable to a rule-based method that is tuned manually to specific target data. However, machine learning methods remain inadequate for expressions with low occurrence frequency. Probably because of insufficient data size, our methods yielded worse evaluation scores than those reported for other languages on the i2b2 task and MIMIC-III. Combining LSTM and rule-based methods is left as a subject for future work.
The current performance is sufficiently high among publicly available Japanese de-identification tools. Therefore, we plan to apply our system to actual de-identification tasks in hospitals. Although it remains difficult to make real EHRs publicly available, we can use the large amount of EHRs inside our hospitals. Increasing the size of annotated datasets for such internal use is left as another subject for future work.
Abbreviations NLP: natural language processing; LSTM: Long Short-Term Memory, a kind of recurrent neural network; CRF: Conditional Random Field, a kind of machine learning method; POS: part-of-speech; EHR: electronic health record