Refinement of a Gold Standard
We have previously started the process of creating a Gold standard for de-identification of the Stockholm EPR Corpus (Appendix). Three annotators annotated 100 patient records containing both free text and structured information, encompassing a total of 380 000 tokens. Identifiable instances were defined for the 18 Protected Health Information (PHI) classes given in , with some changes. In total, 40 annotation classes were defined, including four nested classes and some additional classes; however, only 28 of the 40 defined classes were actually used for annotation (see Additional file 2, Table S2 for the 28 classes used).
The creation of the Gold standard, the annotation guidelines and the resulting set of annotation classes are described in .
The average Inter-Annotator Agreement (IAA) result for all instances of the annotation classes on the Gold standard was 0.65 F-score. Some classes showed higher agreement than others, and the total number of annotations differed between the annotators. The approach taken for the creation of the Gold standard was deliberately coarse and loosely defined, in order to get an initial idea of what types of identifiable instances the EPRs actually contain. The Gold standard has been further analysed in the work presented here, and used for the creation of two refined consensus sets.
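Pairwise IAA in terms of F-score can be computed by treating one annotator's spans as the reference set. The sketch below assumes exact span-and-class matches over (start, end, class) tuples; the function name and the example annotations are illustrative, not taken from the corpus.

```python
def pairwise_iaa_f1(spans_a, spans_b):
    """F-score between two annotators' span sets, treating spans_a as reference.

    Spans are (start, end, label) tuples; only exact matches count.
    """
    a, b = set(spans_a), set(spans_b)
    if not a or not b:
        return 0.0
    matched = len(a & b)
    precision = matched / len(b)
    recall = matched / len(a)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical annotations: (start, end, class)
ann1 = {(0, 4, "First_Name"), (10, 18, "Health_Care_Unit")}
ann2 = {(0, 4, "First_Name"), (20, 24, "Full_Date")}
print(pairwise_iaa_f1(ann1, ann2))  # 0.5
```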
Automatic Consensus Gold Standard
Our first approach to refining the Gold standard was to automatically create a union of all three annotation sets. For evaluating a de-identification system, high recall is preferable to high precision; we therefore took the union of all annotations. Whenever a mismatch was found, the majority decision was prioritized. If two annotations covered almost the same instance, the longer span was chosen.
Moreover, as many classes were mismatched, discrepancies that could not be resolved by majority decision were resolved semi-automatically. For example, if an instance was annotated by only two annotators, one as Clinician_First_Name and the other as First_Name, the instance was annotated as Clinician_First_Name. Rules for resolving such cases were written after analyzing common mismatches for all annotation classes. All instances annotated by only one annotator were also included in the final set of annotation instances. This process resulted in a total of 6 170 annotation instances.
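The merging procedure above (union of all annotators, majority decision on class, longest span on boundary mismatches, singletons kept) can be sketched as follows. This is a simplified illustration under assumed data structures, not the authors' implementation; class-specific resolution rules such as the Clinician_First_Name example would replace the simple tie-breaking here.

```python
from collections import Counter

def merge_annotations(annotator_spans):
    """Union-based consensus over several annotators' span sets (a sketch).

    annotator_spans: list of lists of (start, end, label) tuples, one list
    per annotator. Overlapping spans are grouped; the majority label wins
    (ties keep the first label seen), and the widest boundaries are used.
    Spans marked by only one annotator are kept, favouring recall.
    """
    all_spans = sorted(s for spans in annotator_spans for s in spans)
    consensus = []
    i = 0
    while i < len(all_spans):
        # Collect the run of spans that overlap the current group.
        start, end = all_spans[i][0], all_spans[i][1]
        labels = [all_spans[i][2]]
        j = i + 1
        while j < len(all_spans) and all_spans[j][0] < end:
            end = max(end, all_spans[j][1])       # longest span wins
            labels.append(all_spans[j][2])
            j += 1
        label = Counter(labels).most_common(1)[0][0]  # majority decision
        consensus.append((start, end, label))
        i = j
    return consensus

# Two annotators agree on the longer, more specific annotation:
a = [(0, 5, "First_Name")]
b = [(0, 7, "Clinician_First_Name")]
c = [(0, 7, "Clinician_First_Name")]
print(merge_annotations([a, b, c]))  # [(0, 7, 'Clinician_First_Name')]
```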
As many of the annotation classes are conceptually similar, several variants were also created in which similar classes were merged (and some infrequent classes removed). This was done in order to evaluate whether the automatic classifier would perform better on more general, merged annotation classes.
Manual Consensus Gold Standard
By creating pairwise matrices covering the total number of annotations for each annotator, as well as an agreement table , covering all annotated instances and their number of assigned class judgments, a better overview of the class distributions, annotation instances and annotator judgements was obtained. In total, over 7 000 instances were annotated. However, the total number of annotations per annotator could differ by over 1 000 instances. Many of these differences were due to boundary discrepancies and class mismatches.
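A pairwise class matrix of the kind described can be assembled with a few lines of code. The sketch below assumes annotations keyed by character offsets; off-diagonal counts then expose class mismatches (such as First_Name versus Clinician_First_Name) directly. The data layout and function name are illustrative assumptions.

```python
from collections import Counter

def pairwise_class_matrix(spans_a, spans_b):
    """Count (class_a, class_b) pairs for spans with identical boundaries.

    spans_a, spans_b: dicts mapping (start, end) -> class label.
    Diagonal entries are agreements; off-diagonal entries are
    class mismatches between the two annotators.
    """
    matrix = Counter()
    for pos, label_a in spans_a.items():
        if pos in spans_b:
            matrix[(label_a, spans_b[pos])] += 1
    return matrix

# Hypothetical annotations from two annotators:
spans_a = {(0, 4): "First_Name", (10, 18): "Health_Care_Unit"}
spans_b = {(0, 4): "Clinician_First_Name", (10, 18): "Health_Care_Unit"}
print(pairwise_class_matrix(spans_a, spans_b))
```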
In general, the distribution of annotation instances was very skewed. The annotation class Health_Care_Unit contained, by far, the largest number of annotation instances. Some of the HIPAA classes, such as Social_Security_Number and Medical_Record_Number, were not present at all in the data set. Only 28 of the 40 defined annotation classes were used for annotation. IAA was highest for the Name classes, see .
The analysis of the pairwise matrices and the agreement tables resulted in the identification of some differences in the interpretation of the guidelines. In particular, the use of the annotation class Health_Care_Unit differed greatly with a very low IAA, see . These discrepancies were discussed jointly by the group of annotators and resulted in a more refined set of guidelines.
The main changes to the guidelines were the following:
• An instance should never be sub-tokenized by the annotator. For example, 34-årig (Aged 34) should be annotated in its entirety.
• The Relation and Ethnicity classes were deleted. The annotators judged that these classes did not pose a high risk of identifying individual patients.
• The classes Street_Address, Town, Municipality, Country and Organization were merged into the more general class Location. Many of these classes were confused in the individual sets of annotations but covered the same instances. Moreover, the largest possible span should always be used for such instances. An address such as Storgatan 1, 114 44 Stockholm should be annotated in its entirety.
• Dates should never include weekdays. The division between Date_Part and Full_Date should be kept.
• Health care units should be annotated with the largest possible span, and should only be annotated if they denote a specific unit.
• General units that are not identifiable in themselves should not be annotated. A general unit such as Geriatriken (the Geriatrics department) should not be annotated if it was not specified by its hospital.
As stated above, the class Health_Care_Unit was the most problematic. In the EPRs, these instances could be mentioned in a variety of ways. Moreover, in the Stockholm area, many health care units have names that include their location. Karolinska Universitetssjukhuset (Karolinska University Hospital), for example, is located both in Huddinge and Solna, and the respective locations are included in their names. In the EPRs, these hospitals (and clinics within these hospitals) could be mentioned as, for example:
Karolinska Univ. Sjukh, Huddinge
Avd. 11 på Karolinska
Moreover, in some cases, the hospital was referred to as Karolinska i Solna (Karolinska in Solna), where Solna in this case denotes a Location. Following the new guidelines, the longest possible span should always cover the instance, but only if the referred unit is specific. The definition of a general unit has, however, not been specified in detail but is left to the judgment of the annotators. Such instances may still be a source of error.
A new, refined Gold standard was created semi-automatically after resolving these differences. Many annotations in the initial Gold standard did not conform to the new guidelines (weekdays annotated as Date_Part and generic health care units, for instance) and were deleted. This resulted in a total of 4 423 annotation instances.
Using the Consensus Gold Standards with a CRF Classifier
We have used the two created Consensus Gold standards to train and evaluate a Conditional Random Fields (CRF) classifier. As discussed above, such classifiers have shown promising results for de-identification classification tasks.
We have used the Stanford Named Entity Recognizer  with the default settings for all experiments.
All experiments have been evaluated with four-fold cross-validation , where the total set was split into four equally sized sub-sets used for training and evaluation. Four-fold cross-validation was chosen to keep the processing time reasonable.
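The fold construction described above can be sketched as follows. The helper name and the round-robin slicing scheme are illustrative assumptions; any partition into four equally sized sub-sets would serve the same purpose.

```python
def k_fold_splits(items, k=4):
    """Split items into k roughly equal folds; yield (train, test) pairs.

    Each fold serves once as the evaluation set while the remaining
    k-1 folds form the training set.
    """
    folds = [items[i::k] for i in range(k)]  # round-robin assignment
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

# Hypothetical record IDs split into four folds:
for train, test in k_fold_splits(list(range(8)), k=4):
    print(len(train), len(test))  # 6 2, four times
```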
Seven experiments using the automatic Consensus Gold standard are reported, each with a different merging of the annotation classes into more general classes, along with two experiments using the manual Consensus Gold standard, one of which was evaluated with ten-fold cross-validation. No nested annotation classes were used.