Recently, more electronic data sources have become available in the healthcare domain. Utilization of electronic health records (EHRs), with their vast amounts of potentially useful data, is therefore an important task. New legislation in Japan has addressed the treatment of medical data. The “Act on the Protection of Personal Information [1]” was revised in 2017 to stipulate that developers de-identify “special care-required personal information.” This legislation further restricts the use of personal identification codes, including individual numbers (e.g. health insurance card numbers, driver’s license numbers, and governmental personnel numbers), biometric information (e.g. fingerprints, DNA, voice, and appearance), and information related to disability. This legislation can be compared with the “Health Insurance Portability and Accountability Act (HIPAA) [2]” of the United States: the Japanese Act of 2017 covers additional codes, uses abstract specifications such as “you should strive not to discriminate or impose improper burdens,” and excludes birth dates and criminal histories, which HIPAA stipulates. Another related act of Japanese legislation, the “Act on Anonymously Processed Medical Information to Contribute to Medical Research and Development [3],” was established in 2018. This legislation allows specific third-party institutes to handle EHRs, thereby promoting wider utilization of medical data.
De-identification of structured data in EHRs, such as numerical tables, is comparatively easy because de-identification methods can be applied to such data in a straightforward manner. De-identification of unstructured data in EHRs is also necessary, but it is virtually impossible to de-identify the huge number of documents manually.
Several earlier works have examined EHR de-identification. The Informatics for Integrating Biology & the Bedside (i2b2) task [4] in 2006 was intended for automatic de-identification of clinical records to satisfy HIPAA requirements [2]. An earlier study prepared 889 EHRs, comprising 669 EHRs for training and 220 EHRs for testing. Their annotations included 929 patient tags, 3751 doctor tags, 263 location tags, 2400 hospital tags, 7098 date tags, 4809 id tags, 232 phone_number tags, and 16 age tags. The best-performing method in i2b2 incorporated diverse features, such as lexical features, part-of-speech information, word frequencies, and dictionaries, and learned using the ID3 decision-tree algorithm.
Grouin and Zweigenbaum [5] prepared 312 cardiovascular EHRs in French, with 3142 tags annotated by two annotators (kappa = 0.87). Their tags include 238 date tags, 205 last_name tags, 109 first_name tags, 43 hospital tags, 22 town tags, 8 zip_code tags, 8 address tags, 8 phone tags, 8 med_device tags, and 3 serial_number tags. Of the person tags, 75% were replaced with other French person names; the remaining 25% were replaced with international names. They also collected 10 foetopathology documents, for which a single annotator assigned 29 date tags, 68 last_name tags, 53 first_name tags, 17 hospital tags, 17 town tags, 13 zip_code tags, 14 address tags, 1 phone tag, 1 med_device tag, and 7 serial_number tags. For the cardiology corpus, they performed de-identification experiments using 250 documents as training data and 62 documents as test data. They obtained better F1-scores (exact match, 0.883; overlap match, 0.887) using conditional random fields (CRF) than using their rule-based method (exact match, 0.843; overlap match, 0.847). However, their rule-based method was better for the foetopathology corpus (exact match, 0.681; overlap match, 0.693) than their CRF-based method (exact match, 0.638; overlap match, 0.638) because far fewer data were available than for the cardiology corpus.
Grouin and Névéol [6] discussed annotation guidelines for French clinical records. After collecting 170,000 documents from 1000 patient records at five hospitals, they first prepared a rule-based system and the CRF-based system from their earlier study [5], described above. Their rule-based system relies on 80 patterns specifically designed to process the training corpus and on lists gathered from existing internet resources. They randomly selected 100 documents (Set 1) from their dataset and applied both systems. For each document, they randomly showed the output of one of the two systems to the annotators for revision. They applied their rule-based system to another set of 100 documents (Set 2), which were further reviewed and revised by a human annotator. They re-trained their CRF-based system using the revised Set 2 annotations and applied it to yet another set of 100 documents (Set 3). Annotators reviewed these annotations in subsets for different agreement analyses. The study also compared human revision times among the different annotation sets, which was a main objective of their study. In Set 1, they annotated 99 address tags, 101 zip_code tags, 462 date tags, 47 e-mail tags, 224 hospital tags, 59 identifier tags, 871 last_name tags, 750 first_name tags, 383 telephone tags, and 218 city tags. They reported that their rule-based method was better (0.813) in terms of F1-score than their CRF-based method (0.519) when evaluated with 50 documents from Set 1. When trained with Set 2, a corpus of the same domain, their CRF-based system performed better, yielding F1-scores of 0.953 on Set 3 and 0.888 on Set 1.
From the Stockholm EPR [7], a Swedish database of more than one million patient records from two thousand clinics, Dalianis and Velupillai [8] extracted 100 patient records to create a gold standard for automatic de-identification based on HIPAA. They annotated 4423 tags, including 56 age tags, 710 date_part tags, 500 full_date tags, 923 last_name tags, 1021 health_care_unit tags, 148 location tags, and 136 phone_number tags. They pointed out that Swedish morphology is more complex than that of English, with richer inflection, which makes the de-identification task in Swedish more difficult.
Jian et al. [9] compiled a dataset of 3000 documents in Chinese comprising 1500 hospitalization records, 1000 summaries, 250 consulting records, and 250 death records. They randomly extracted 300 documents from this dataset and discussed a mode of de-identification with lower annotation cost. They annotated these 300 documents with their tags (kappa = 0.76 between two annotators on a 100-document subset). They then applied their pattern-matching module to these 300 documents, yielding a dense set of 201 sentences that include PHI (Protected Health Information). These 201 sentences included 141 name tags, 51 address tags, and 22 hospital tags.
Du et al. [10] conducted de-identification experiments using 14,719 discharge summaries in Chinese, for which two students annotated 25,403 tags. This dataset includes 6403 institution tags, 11,301 date tags, 33 age tags, 2078 patient_name tags, 3912 doctor_name tags, 326 province tags, 310 city tags, 774 country tags, 917 street tags, 277 admission_num tags, 21 pathological_num tags, 23 x-ray_num tags, 263 phone tags, 420 doctor_num tags, and 13 ultrasonic_num tags (inter-annotator agreement was 96%, kappa = 0.826). Their experiments demonstrated that their method combining rules and CRF performed best, yielding an F1-score of 98.78. The Chinese language shares some issues with the Japanese language: both require tokenization because no spaces exist between words, which makes de-identification more difficult than in languages with explicit word boundaries.
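As an illustration of this tokenization requirement, the following minimal sketch (in Python) segments a hypothetical Japanese clinical sentence into tokens before any labeling can take place. It assumes the MeCab morphological analyzer and the mecab-python3 binding are installed; the sample sentence and tag names are illustrative assumptions, not material from the cited studies.

```python
# Minimal sketch of Japanese word segmentation prior to de-identification,
# assuming the mecab-python3 binding and a standard MeCab dictionary are installed.
import MeCab

# "-Owakati" makes MeCab output tokens separated by spaces.
tagger = MeCab.Tagger("-Owakati")

# Hypothetical clinical sentence: "Yamada Taro (65) visited Tokyo Hospital on April 1."
text = "山田太郎（65歳）は4月1日に東京病院を受診した。"

tokens = tagger.parse(text).split()
print(tokens)
# The resulting tokens (e.g. 山田 / 太郎 / 65 / 歳 / ...) become the units to which
# de-identification labels such as person, age, time, and hospital are assigned.
```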
The reports described above present a range of different evaluation scores. However, they adopted different annotation criteria, which makes direct comparison difficult. For instance, Grouin and Névéol used more detailed annotations than Jian et al., as follows. Jian et al. introduced Doctor and Patient tags but evaluated both simply as Name. Grouin and Névéol introduced ZipCode, Identifier, Telephone, and City tags, none of which was annotated in the work of Jian et al. Additionally, they assigned Last Name and First Name tags, where performance on First Name was better than on Last Name by around 10 points. However, both were worse than the results reported by Jian et al., probably because Jian et al. applied their pattern-matching algorithm to filter their training data. Regarding Address tags, Jian et al. obtained a 94.2-point F-score, whereas the Grouin and Névéol CRF method obtained scores of less than 10 points. As Grouin and Névéol suggested, eliminating City tags in street names can greatly improve their results: their rule-based method yielded an 86-point F-score.
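To illustrate why differing tag granularities hinder direct comparison, the sketch below collapses fine-grained tags into a coarser common scheme before any scoring. The mapping is an illustrative assumption for exposition, not a procedure used in the cited studies.

```python
# Illustrative sketch: mapping fine-grained annotation schemes onto a coarser
# common tag set so that scores become comparable. The mapping is an assumption
# for illustration only.
COARSE_MAP = {
    "doctor": "name", "patient": "name",
    "last_name": "name", "first_name": "name",
    "zip_code": "address", "city": "address", "street": "address",
}

def normalize(tag: str) -> str:
    """Collapse a fine-grained tag into its coarse equivalent (or keep it as-is)."""
    return COARSE_MAP.get(tag, tag)

# Example: outputs annotated as first_name/last_name and a gold standard annotated
# only as name can now be evaluated under the same label.
print(normalize("first_name"))  # -> "name"
print(normalize("hospital"))    # -> "hospital" (unchanged)
```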
Unfortunately, automatic de-identification of EHRs has not been studied sufficiently for the Japanese language. A de-identification shared task for Japanese EHRs was held as part of MedNLP-1 [11]. Named entity extraction was subsequently attempted in the MedNLP-2 [12] tasks using datasets similar to those of MedNLP-1. We designate MedNLP-1 simply as MedNLP hereinafter because we specifically examine de-identification tasks but not the other tasks held in the MedNLP shared task series.
Regarding machine learning methods, Support Vector Machines (SVMs) [13] and CRFs [14] have often been used in earlier Named Entity Recognition (NER) tasks in addition to rule-based methods. Recent deep learning methods include Long Short-Term Memory (LSTM) [15] networks with character embeddings and word embeddings [16], which performed best on the CoNLL 2002 [17] (Spanish and Dutch) and CoNLL 2003 [18] (English and German) NER shared task data; these tasks require detection of “person”, “location”, “organization”, and “miscellaneous” tag types. Another LSTM model, similar to that of earlier work [16], was also applied to NER on Japanese newspaper text [19]. Although deep neural network models have recently shown better results, rule-based methods are still often better than machine learning methods, especially when insufficient annotated data are available.
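As a concrete illustration of such an architecture, the following simplified sketch (PyTorch) combines word embeddings with a character-level BiLSTM in the spirit of [16]. The vocabulary sizes, dimensions, random initialization, and the omission of a CRF output layer are simplifying assumptions rather than the exact setup of the cited work.

```python
# Simplified sketch (PyTorch) of an NER tagger combining word and character
# embeddings, in the spirit of [16]. Hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn

class CharWordTagger(nn.Module):
    def __init__(self, n_words, n_chars, n_tags,
                 word_dim=100, char_dim=25, char_hidden=25, hidden=100):
        super().__init__()
        self.word_emb = nn.Embedding(n_words, word_dim)
        self.char_emb = nn.Embedding(n_chars, char_dim)
        # Character-level BiLSTM builds a spelling-sensitive word representation.
        self.char_lstm = nn.LSTM(char_dim, char_hidden,
                                 batch_first=True, bidirectional=True)
        # Word-level BiLSTM over [word embedding; character representation].
        self.word_lstm = nn.LSTM(word_dim + 2 * char_hidden, hidden,
                                 batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_tags)

    def forward(self, word_ids, char_ids):
        # word_ids: (n_tokens,), char_ids: (n_tokens, max_chars)
        w = self.word_emb(word_ids)                 # (n_tokens, word_dim)
        c = self.char_emb(char_ids)                 # (n_tokens, max_chars, char_dim)
        _, (h, _) = self.char_lstm(c)               # h: (2, n_tokens, char_hidden)
        c_repr = torch.cat([h[0], h[1]], dim=-1)    # (n_tokens, 2*char_hidden)
        x = torch.cat([w, c_repr], dim=-1).unsqueeze(0)
        y, _ = self.word_lstm(x)                    # (1, n_tokens, 2*hidden)
        return self.out(y.squeeze(0))               # per-token tag scores

# Hypothetical usage: a sentence of 5 tokens, each padded to 8 characters.
model = CharWordTagger(n_words=5000, n_chars=3000, n_tags=11)
scores = model(torch.randint(0, 5000, (5,)), torch.randint(0, 3000, (5, 8)))
print(scores.shape)  # torch.Size([5, 11])
```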
To evaluate the effectiveness of such different methods for the Japanese language, we implemented two EHR de-identification systems for Japanese in our earlier work [20]. We used the MedNLP shared task dataset and our own dummy EHR dataset, which was written as a virtual database by medical professionals who hold medical doctor certification. Building on this earlier work, we added a new dataset of pathology reports to this study. De-identification tags of age, hospital, sex, time, and person were annotated manually in all these datasets, following the annotation standard of the MedNLP shared task to facilitate comparison with earlier studies. We regard these annotations as the gold standard for our de-identification task. To these three datasets, we applied a rule-based method, a CRF-based method, and an LSTM-based method. Additionally, three annotators annotated the three datasets with our own tags so that we could calculate inter-annotator agreement and examine the consistency of the original annotations. Overall, this study differs from our earlier work [20] in that we added a new pathology dataset and its annotations, trained and evaluated our machine learning models using the new dataset, and evaluated the results against the newly created annotations by three annotators to observe characteristics of the original and our own annotations.
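As an illustration of how token-level agreement between annotators can be quantified, the sketch below computes pairwise Cohen's kappa over hypothetical label sequences; with three annotators, pairwise values can be averaged (or Fleiss' kappa used). This is an illustrative sketch under those assumptions, not the exact agreement procedure reported later in this paper.

```python
# Minimal sketch of pairwise inter-annotator agreement (Cohen's kappa) over
# token-level de-identification labels. The label sequences are hypothetical.
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability that both annotators pick the same label at random.
    expected = sum(freq_a[t] * freq_b[t] for t in set(labels_a) | set(labels_b)) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical token-level annotations from two annotators ("O" = outside any tag).
ann1 = ["person", "person", "O", "time", "O", "hospital", "O", "age"]
ann2 = ["person", "O",      "O", "time", "O", "hospital", "O", "age"]
print(round(cohen_kappa(ann1, ann2), 3))  # -> 0.83
```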