Text mining brain imaging reports

Background With the improvements to text mining technology and the availability of large unstructured Electronic Healthcare Records (EHR) datasets, it is now possible to extract structured information from raw text contained within EHR at reasonably high accuracy. We describe a text mining system for classifying radiologists' reports of CT and MRI brain scans, assigning labels indicating occurrence and type of stroke, as well as other observations. Our system, the Edinburgh Information Extraction for Radiology reports (EdIE-R) system, was developed and tested on a collection of radiology reports.
Methods The work reported in this paper is based on 1168 radiology reports from the Edinburgh Stroke Study (ESS), a hospital-based register of stroke and transient ischaemic attack patients. We manually created annotations for this data in parallel with developing the rule-based EdIE-R system to identify phenotype information related to stroke in radiology reports. This process was iterative, and domain expert feedback was considered at each iteration to adapt and tune the EdIE-R text mining system, which identifies entities, negation and relations between entities in each report and determines report-level labels (phenotypes).
Results The inter-annotator agreement (IAA) for all types of annotations is high at 96.96 for entities, 96.46 for negation, 95.84 for relations and 94.02 for labels. The equivalent system scores on the blind test set are similarly high at 95.49 for entities, 94.41 for negation, 98.27 for relations and 96.39 for labels for the first annotator and 96.86, 96.01, 96.53 and 92.61, respectively, for the second annotator.
Conclusion Automated reading of such EHR data at such high levels of accuracy opens up avenues for population health monitoring and audit, and can provide a resource for epidemiological studies. We are in the process of validating EdIE-R in separate larger cohorts in NHS England and Scotland.
The manually annotated ESS corpus will be available for research purposes on application.

Example of entity, relation and negation mark-up
[...] into two relations, one with a temporal modifier and one with a location modifier, while the second ischaemic stroke entity is in a relation with a temporal modifier. These latter two entities are marked as negative (crossed out) because they are in the scope of the negative word No. Annotations such as these are output by the text mining system and are then used as the basis for the assignment of labels to the reports.
In order to develop named entity recognition (NER) and relation extraction (RE) components, decisions need to be made about which entities and which relations the system should identify. These decisions are best made through dialogue between the domain experts, who know what information they would ideally like to access, and text mining experts, who can judge which pieces of information can be identified with sufficient accuracy. In addition, manually annotated subsets of the data are needed to train and develop the components as well as to evaluate their performance.
In building EdIE-R, we used the process of annotation as a means to focus the radiologist/text miner dialogue at the same time as developing the prototype system. We used an agile development methodology where iterations of system development were interleaved with annotation iterations. After initial scoping, automatic annotations from the system were presented to the domain experts for correction using the BRAT annotation tool [2]. The system and manual annotations were compared and disagreements were resolved either by adjusting the manual annotation or by improving the system. We iterated over the process a number of times with both system and manual annotation improving in each cycle. This method has several advantages. First, it allows both teams to work simultaneously, unlike methods where all the annotation is done in advance of system development. Second, discussion of the system and manual disagreements allows the text miners to come to a much clearer understanding of the meaning of the domain language and the domain specialists to understand the limitations of the technology. Through negotiation, several changes to the annotation scheme were made during the iterative process. Third, doing annotation as correction tends to reduce insignificant differences between manual and system annotation.

Related work
Named entity recognition is a well-established task in NLP. The CoNLL shared-task evaluations [3] established benchmarks for NER evaluation and prompted research into supervised machine learning methods for NER, for example, the Stanford NER tagger [4]. Rule-based techniques are also still used for NER: see e.g. the ANNIE NER tagger which is part of GATE [5]. Relation extraction is often included as a subtask in text mining applications [6] with approaches to it ranging from rule-based through supervised to unsupervised machine learning.
Text mining technology for the biomedical domain has been a subject of research for two decades with several community initiatives to provide data and a forum for shared tasks, such as BioCreative [7] and BioNLP [8]. Both of these organised shared tasks in NER and RE: see [9,10] for our contributions. More recently the shared task approach has been used for electronic health records (EHRs) by the LOUHI workshops, e.g. LOUHI'17 [11] or LOUHI'18 [12]. There are many individual studies applying information extraction to EHRs; see [13] for a review of some of these. Negation detection has been recognised as an important step, particularly in medical text mining, with the NegEx algorithm [14] being frequently used.
Several researchers have applied NLP and text mining approaches to radiology reports. Pons et al. (2016) provide a useful systematic review of NLP in radiology [15]. They include 67 different studies which they group according to 5 distinct purposes, namely diagnostic surveillance, cohort building for epidemiological studies, query-based case retrieval, quality assessment of radiologic practice, and clinical support services. Conditions targeted by the systems are various and include appendicitis, pneumonia, renal cysts, pulmonary embolism, liver conditions and general metastases, to name but a few. Across all these application areas the NLP systems surveyed tend to have the same broad structure, and a flow diagram of their individual components would look much like our diagram of the EdIE-R system shown in Fig. 3 below.
Two recent studies by Hassanpour and Langlotz (2016) and by Cornegruta et al. (2016) describe machine learning methods for entity recognition from radiology reports [16,17]. Hassanpour and Langlotz [16] tested two existing feature-based machine learning classifiers for this task. Their annotation scheme contains four broad types of named entities (Anatomy, Anatomy modifier, Observation and Observation modifier) as well as strings expressing Uncertainty. They used NegEx to identify negation in the text as a feature feeding into their models. Both machine learning classifiers achieve an average F1-score of 85% in 10-fold cross-validation on a data set containing 150 manually annotated radiology reports from three different institutions.
Cornegruta et al. [17] describe work on analysing a large corpus of historical chest X-ray reports. The system they describe is similar to ours in the way the report text is annotated with named entity and negation mark-up, although the entity list (Body Location, Descriptor, Clinical Finding, Medical Device) is both smaller and more complex in that disjoint entities are permitted. No relation extraction is performed. The NER method uses a bidirectional LSTM (BiLSTM) neural network architecture, which is contrasted with a baseline system which uses string matching look-up against RadLex [18] and Medical Subject Headings (MeSH) [19] concepts combined with parsing, plus NegEx for negation detection. The BiLSTM NER tagger significantly outperforms the baseline but it is worth noting that, in general, rule-based and machine learning approaches attain similar levels of performance on NER if the rule-based system uses more sophisticated techniques than string matching, as ours does.
There has also been some work on summarising radiology reports. Most recently, Zhang et al. [20] proposed a state-of-the-art neural network-based approach to summarisation of radiology impressions. An impression is the "Conclusion" section of a radiology report, summarised by the radiologist after dictating or writing down the findings presented in the image. Automating this step is an extremely useful task that can save radiologists a lot of effort and time. Two different radiology reports describing similar symptoms and conditions, however, are not guaranteed to result in the same summary text. The output of summarisation therefore does not lend itself to large-scale data analysis in the same way as classification of symptoms and conditions does, for example, for identifying patients with the same findings for epidemiological studies.
With a specific focus on stroke, Flynn et al. (2010) [21] developed a system for analysis of brain scan radiology reports from Tayside, Scotland, i.e. EHR reports which are very similar to those in the ESS data set [22]. Their aim was to improve on the coding of the reports, which were frequently given generic 'stroke' codes even when a more precise code could be determined by looking at the report. Their method used a keyword matching step looking for affirmative or negative uses of key words from a stroke lexicon. They report results which were acceptably accurate in identifying ischaemic stroke (94.7% positive predictive value (precision)) on a dataset of 150 reports manually classified as ischaemic stroke. Their method performed less reliably in identifying intracerebral haemorrhage (76.7% positive predictive value) on a dataset of 150 reports manually classified as intracerebral haemorrhage. The paper does not report sensitivity (recall) scores as the data only contains positive examples of either type.
To the best of our knowledge, EdIE-R is the first system that performs named entity extraction, negated entity detection, relation extraction and document-level labelling with the goal of classifying radiology reports with types of stroke, tumours and other information. The extracted entities (positive or negative) and relations are all used in the final classification (labelling) step. The information captured in and about the reports includes a comprehensive set of entities and labels. We provide a detailed evaluation of EdIE-R for all the steps it is designed to perform using standard natural language processing evaluation metrics, including precision, recall and F1-score. In contrast to the previous study [21], we test on an unseen test set of random radiology reports which contain both positive and negative examples of the information EdIE-R is designed to extract and label.

Annotation scheme
There are four aspects to the annotation of brain scan reports in our data: entities, relations, negation mark-up, and labels. These are all illustrated in Fig. 2, a screen grab of an annotated report loaded into the BRAT tool. As shown, each report is preceded by a list of all possible labels but only those that have been marked as selected are labels for the report. Entities, relations and negation have been annotated within the textual body of the report.
Entities are of two types, observations or modifiers. The full set of observation entities is: ischaemic stroke, haemorrhagic stroke, stroke (unknown type), tumour:meningioma, tumour:metastasis, tumour:glioma, tumour, subdural haematoma, small vessel disease, [...] Relations link a subset of observation entities, namely stroke and microbleed entities, with modifier entities. Strokes may be associated with both a location and a time, while microbleeds are associated only with location. Some words or phrases, such as POCI (Posterior Circulation Infarct) in Fig. 2, carry both observation and modifier meaning and in these cases nested entities are used. Here there is a mod-loc relation between the loc:cortical entity and the ischaemic stroke entity but we do not require this to be made explicit in the annotation since the nesting implies it.
There is a close relationship between the entity and relation names and the labels. For example, the label Ischaemic stroke, cortical, old has been chosen and this clearly relates to the two occurrences of an ischaemic stroke entity in a relation with both a loc:cortical and a time:old modifier. The annotators are instructed not to select labels unless there is explicit linguistic evidence to support the choice. Occasionally they will be able to infer labels from implicit information but they are asked not to annotate these cases as the aim is to model linguistically explicit information not human expertise.
Proper identification of negation and its scope is essential to achieving high accuracy. We model negation in the annotation as an attribute on entities, which is visualized in BRAT as a crossing out. Wherever the text contains negation scoping over entities, the annotators must add the negative attribute. The negative example in Fig. 2, No acute haemorrhage, masses or extra-axial collections, is a clear and simple case but syntactically more complex cases occur, e.g. cases where the negation marker is distant from the entities within its scope. There are cases where the radiologist is unable to positively identify or exclude an observation, as for example in a small focus of acute infarct cannot be completely excluded. The annotators are asked to mark these cases as negative, as only clearly positive observations should contribute to the labels assigned to the reports.

The EdIE-R system
EdIE-R is a rule-based text mining system which we developed in tandem with manual data annotation in the form of correction of the system output. The presentation of the data in the BRAT tool, as illustrated in Fig. 2, is the view that the annotators see, but this is a format that has been derived from the data structure which the system manipulates and outputs, which is an XML data structure. We have developed the system's text analysis components using the LT-XML2 programs, which are the core of our XML rule-based text mining software [23]. Our most recent software release, the Edinburgh Geoparser [24], contains all of our general-purpose components, such as the tokeniser, NER tagger and chunker, which we have adapted to the brain scan report domain in EdIE-R.
As shown in Fig. 3, the EdIE-R system has a pipeline architecture. Scan reports are converted from their original format into an initial XML format and subsequent components incrementally add annotations to the XML structure, with each stage making computations over the annotations of previous stages. The document zoning step segments the reports into sections including clinical details, the report itself and the radiologist's conclusion. It also adds metadata which includes all of the possible labels that can be assigned; by the final stage of the pipeline an attribute on each label indicates whether that label has been selected. An example of a report in XML after document zoning is shown in Fig. 4. We combine NER and label mark-up in this way so that manual annotation of all levels of analysis can be done at the same time.
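The pipeline idea described above, where each stage reads the XML produced by earlier stages and adds its own annotations, can be illustrated with a minimal Python sketch. This is not the LT-XML2 implementation: the element and attribute names (`record`, `labels`, `label`, `w`, `ent`, `selected`), the two-label inventory, the one-word lexicon and the whitespace tokeniser are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical label inventory; the real system has a larger, fixed set.
POSSIBLE_LABELS = ["Small vessel disease", "Atrophy"]


def zone(report_text):
    """Create the initial XML record: label metadata plus the report text."""
    doc = ET.Element("record")
    labels = ET.SubElement(doc, "labels")
    for name in POSSIBLE_LABELS:
        # Every possible label is listed up front, initially unselected.
        ET.SubElement(labels, "label", name=name, selected="no")
    ET.SubElement(doc, "report").text = report_text
    return doc


def tokenise(doc):
    """Replace the report text with one <w> element per whitespace token."""
    report = doc.find("report")
    words = (report.text or "").split()
    report.text = None
    for i, word in enumerate(words):
        ET.SubElement(report, "w", id=f"w{i}").text = word
    return doc


def tag_entities(doc):
    """Toy NER stage: add an 'ent' attribute to tokens found in a lexicon."""
    lexicon = {"atrophy": "atrophy"}  # hypothetical one-entry lexicon
    for w in doc.iter("w"):
        ent = lexicon.get(w.text.lower().strip(".,"))
        if ent:
            w.set("ent", ent)
    return doc


def select_labels(doc):
    """Final stage: flip 'selected' on labels supported by an entity."""
    found = {w.get("ent") for w in doc.iter("w") if w.get("ent")}
    for lab in doc.iter("label"):
        if lab.get("name").lower() in found:
            lab.set("selected", "yes")
    return doc


PIPELINE = [tokenise, tag_entities, select_labels]


def run(report_text):
    """Zone the report, then pass the same XML tree through every stage."""
    doc = zone(report_text)
    for stage in PIPELINE:
        doc = stage(doc)
    return doc
```

Calling `run("Generalised atrophy noted.")` returns a tree in which only the `Atrophy` label carries `selected="yes"`; `ET.tostring(doc, encoding="unicode")` can be used to inspect the accumulated mark-up.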
Subsequent steps of the pipeline do linguistic processing. The tokeniser splits textual content into paragraphs, sentences and word tokens, with punctuation characters also treated as tokens. The C&C POS tagger [25] labels each word with its syntactic category. The default C&C model has been trained on modern U.S. newspaper text and although it performs well on most text types, it is not wholly suitable for the medical text in our reports. For this reason, we also use a model trained on the Genia biomedical corpus [26]. After running the POS tagger with each of the models we apply a correction stage to moderate disagreements between them. After POS tagging, we apply the morpha lemmatiser [27] to analyse inflected nouns and verbs and compute their lemma (morphological stem). The output of POS tagging and lemmatisation is stored in attribute values on word token elements.
The fifth step in the pipeline is the NER component, which incorporates lexical lookup. From examples in the development set we manually curated two lexicons, one for observations (e.g. the atrophy entity inter-cerebral volume loss and the ischaemic stroke entity lacunar event) and one for modifiers (e.g. the time:old entities old, previous and established); see Fig. 5. The process of lexical lookup results in the addition of further attributes to the word tokens of matching words and phrases. The lexicons are applied one after the other, first the observations lexicon and then the modifiers, so that some words or phrases can be marked as both observation and modifier to achieve the nested entity mark-up described above.
The next stage of processing performs a shallow syntactic analysis using our chunker [28] to segment sentences into phrases or word groups, i.e. syntactic structures headed by nouns (noun groups), verbs (verb groups) etc. The purpose of doing this is to create a useful data structure for dealing with nested entities and coordinations of entities as well as to define the scope of negation markers in terms of structure rather than just word sequences. At this stage complex negative noun groups such as No acute haemorrhage, masses or extra-axial collections have an appropriate structure to allow information from the negative article No to be propagated through the group so that all three observation entities (haemorrhage, masses, extra-axial collections) are marked as negative.
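As a rough illustration of the two ideas in this part of the pipeline, the following Python sketch applies two lexicons in sequence to a single noun group and then propagates negation from a leading negative determiner to every entity in the group. It is a simplification under stated assumptions: the noun group is given as a pre-tokenised list rather than produced by a chunker, and the entity types assigned to haemorrhage, masses and extra-axial collections are guesses made for the example, not entries from the EdIE-R lexicons.

```python
# Observation lexicon: the first two entries follow examples in the text,
# the remaining types are assumptions made for this illustration.
OBSERVATIONS = {
    "lacunar event": "ischaemic stroke",
    "poci": "ischaemic stroke",
    "haemorrhage": "haemorrhagic stroke",             # assumed type
    "masses": "tumour",                               # assumed type
    "extra-axial collections": "subdural haematoma",  # assumed type
}
# Modifier lexicon; "poci" appears in both lexicons to give nested mark-up.
MODIFIERS = {
    "old": "time:old",
    "previous": "time:old",
    "established": "time:old",
    "poci": "loc:cortical",
}
NEGATIVE_DETERMINERS = {"no"}


def lookup(tokens, lexicon):
    """Longest-match phrase lookup; returns (start, end, type) spans."""
    spans = []
    lowered = [t.lower() for t in tokens]
    max_len = max(len(key.split()) for key in lexicon)
    i = 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(lowered[i:i + n])
            if phrase in lexicon:
                spans.append((i, i + n, lexicon[phrase]))
                i += n
                break
        else:
            i += 1  # no lexicon entry starts at this token
    return spans


def annotate_noun_group(tokens):
    """Apply both lexicons in turn, then propagate group-level negation."""
    negated = bool(tokens) and tokens[0].lower() in NEGATIVE_DETERMINERS
    entities = []
    for lexicon in (OBSERVATIONS, MODIFIERS):  # observations first
        for start, end, etype in lookup(tokens, lexicon):
            entities.append({
                "text": " ".join(tokens[start:end]),
                "type": etype,
                "negative": negated,
            })
    return entities
```

For the token list `["No", "acute", "haemorrhage", ",", "masses", "or", "extra-axial", "collections"]` the sketch finds three observation entities and marks all of them negative, while `["Old", "POCI"]` yields a time:old entity plus an ischaemic stroke and a loc:cortical entity over the same token, none of them negative.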
Relation Extraction is the final stage of the text mining part of the system. In this component some pairs of entities are linked in relations held as structures in standoff XML mark-up as illustrated in Fig. 6. There are two possible relations, location and time, which hold between stroke entities (ischaemic, haemorrhagic or unknown type) and modifiers. In addition, a microbleed entity can be in a relation with a location modifier.
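The constraints on which entity pairs may be related, and the idea of holding relations as standoff structures that point at entity identifiers, can be sketched as follows. The XML vocabulary used here (`ent`, `relations`, `relation`, `argument`, `ref`) and the entity type name `loc:deep` are assumptions for illustration and are not taken from the EdIE-R output format; the function also only records a relation for a given pair and does not implement the rules that decide which pairs to link.

```python
import xml.etree.ElementTree as ET

# Modifier families each observation type may be related to (from the text):
# strokes take location and time, microbleeds take location only.
ALLOWED = {
    "ischaemic stroke": {"loc", "time"},
    "haemorrhagic stroke": {"loc", "time"},
    "stroke": {"loc", "time"},  # stroke of unknown type
    "microbleed": {"loc"},
}


def add_relation(doc, obs_id, mod_id):
    """Add a standoff relation between two entities if the pairing is allowed.

    Returns the new <relation> element, or None if the pairing is not allowed.
    """
    ents = {e.get("id"): e for e in doc.iter("ent")}
    obs_type = ents[obs_id].get("type")
    family = ents[mod_id].get("type").split(":")[0]  # "loc" or "time"
    if family not in ALLOWED.get(obs_type, set()):
        return None
    relations = doc.find("relations")
    if relations is None:
        # Standoff section: relations live apart from the text and refer
        # to entities by id only.
        relations = ET.SubElement(doc, "relations")
    rel = ET.SubElement(relations, "relation", type=f"mod:{family}")
    ET.SubElement(rel, "argument", ref=obs_id)
    ET.SubElement(rel, "argument", ref=mod_id)
    return rel
```

With a document containing a microbleed entity, a location modifier and a time modifier, the sketch accepts the microbleed/location pair and rejects the microbleed/time pair.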
Negation arising from the verb particle not, for example in Very acute infarction may not be visible on CT, is handled as part of the relation extraction module because rules linking not with the entities it scopes over are similar to the other relation rules. The result, however, is not an explicit relation but an attribute on the negated entities (acute and infarction, in this case). This is the same format as for noun group negation detected during chunking.
The final labelling step of the pipeline uses the information from the previous steps to compute which labels should be associated with a record. Because the mark-up coming from the text mining is very detailed, the labelling rules can be fairly simple. For example, to choose the Small vessel disease label the rules need only to check that there is a non-negative small vessel disease entity in either the report or conclusions part of the report. To choose the label Ischaemic stroke, cortical, recent there needs to be a non-negative ischaemic stroke entity which is in a location relation (mod:loc) with a cortical location entity (loc:cortical) and in a time relation (mod:time) with a time:recent entity. There are a few added complexities to these rules, for example, a deep ischaemic stroke which is not in an explicit relationship with a time modifier is assumed to be old.
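The three labelling rules spelled out above can be written down compactly. The sketch below is a hypothetical Python rendering rather than the system's rule set: it assumes entities are held in a dictionary keyed by id with `type`, `negative` and `zone` fields, that relations are `(kind, observation_id, modifier_id)` triples, and that the deep location type is named `loc:deep`. It covers only Small vessel disease and the Ischaemic stroke labels.

```python
def assign_labels(entities, relations):
    """Compute report-level labels from entity and relation mark-up.

    entities:  id -> {"type": str, "negative": bool, "zone": str}
    relations: list of (kind, observation_id, modifier_id) triples
    """
    labels = set()
    for eid, ent in entities.items():
        if ent["negative"]:
            continue  # only clearly positive observations contribute
        if (ent["type"] == "small vessel disease"
                and ent["zone"] in {"report", "conclusion"}):
            labels.add("Small vessel disease")
        elif ent["type"] == "ischaemic stroke":
            # Collect the modifier types this stroke entity is related to.
            mods = {entities[m]["type"] for _, o, m in relations if o == eid}
            locs = sorted(t.split(":")[1] for t in mods if t.startswith("loc:"))
            times = sorted(t.split(":")[1] for t in mods if t.startswith("time:"))
            if not times and "deep" in locs:
                times = ["old"]  # default for deep strokes with no time modifier
            for loc in locs:
                for time in times:
                    labels.add(f"Ischaemic stroke, {loc}, {time}")
    return labels
```

Given a non-negative ischaemic stroke entity related to loc:cortical and time:recent entities, the function returns the label Ischaemic stroke, cortical, recent; a second ischaemic stroke entity related only to a deep location yields Ischaemic stroke, deep, old through the default.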

Evaluation
In order to evaluate system performance, we annotated development and test data as discussed in the "Annotation scheme" section. For this we used 1168 reports from the Edinburgh Stroke Study (ESS) [22]. We reserved the first 500 reports as the development set and the remainder as the test set. ESS contains MRI, CT and Doppler Ultrasound reports but we used only the CT and MRI reports. We also discarded a few reports which contained non-brain results, e.g. combined brain and neck, chest, or abdomen scans. In total the annotated development set contains 322 CT and 42 MRI reports. We have annotated a random subset of the test set containing 238 CT and 28 MRI reports. Manual annotation of the development data was accomplished in six tranches, where annotation was correction of the system output. The system was modified and improved between the tranches. Table 1 provides information on the sizes of the data subsets. The first three tranches were doubly annotated by the radiology experts so that IAA could be monitored. For these three tranches only, disagreements between the annotators were reconciled to produce an agreed gold standard. The remaining development data was singly annotated. The test data was doubly annotated in three tranches but not reconciled. Table 2 provides details of the annotators and annotations in all the data sets.

Results
Following standard practice we measure both IAA and system performance using precision, recall and F1. Note that IAA represents an upper bound for system performance, as an automatic method would not be expected to outperform human capabilities. The overall results for IAA on the test data are shown in Table 3. Note that IAA measures for relations are only computed for those relations where the two annotators agree on both entities linked by the relation. Overall the IAA results are very high, which indicates that the annotation task is well-defined.
Tables 4, 5 and 6 provide a more detailed breakdown of the IAA results per type on the entities, relations and labels across the three test sets. The majority of lower IAA scores for entity types are for low frequency ones, for example subarachnoid haemorrhage. This pattern is mirrored in the IAA scores for labels, for example for Haemorrhagic transformation and Microbleed. However, since these types are very infrequent their low IAA scores do not have a serious effect on the overall figures.
Table 7 shows evaluation results for the EdIE-R system on the two annotators' versions of the test set. For labels and relations, the system agrees more with Annotator 1 than with Annotator 2, while the pattern is reversed for entities and negation. We would expect system scores to be lower than IAA (see final column), which is the case for entities and negation for Annotator 1, and for all but relations for Annotator 2. We speculate that these differences indicate that Annotator 1 focused more on entity mark-up and spotted and corrected more system entity errors while Annotator 2 focused more on the labels and made more corrections there. To improve the accuracy of the evaluation we would ideally arbitrate the annotators' disagreements and produce a consensus test set. Nevertheless, the overall evaluation results are reassuringly high, indicating that this method of labelling radiology reports is highly effective.
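For readers unfamiliar with the measures, precision, recall and F1 over two sets of annotations can be computed as in the following Python sketch, which also shows the restriction applied to relation IAA (only relations whose two entities both annotators agree on are compared). The representation is an assumption made for the example: entities are hashable tuples such as `(start, end, type)` and relations are `(kind, observation_entity, modifier_entity)` triples. When used for IAA, one annotator is treated as the reference; F1 is symmetric in the two annotators.

```python
def prf(reference, predicted):
    """Precision, recall and F1 of `predicted` against `reference`."""
    reference, predicted = set(reference), set(predicted)
    true_positives = len(reference & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def relation_iaa(rels_a, rels_b, ents_a, ents_b):
    """Score relations only where both annotators agree on both entities."""
    shared = set(ents_a) & set(ents_b)

    def keep(rels):
        return {r for r in rels if r[1] in shared and r[2] in shared}

    return prf(keep(rels_a), keep(rels_b))
```

For example, if the reference contains four entities and the other annotation contains two of them and nothing else, precision is 1.0, recall is 0.5 and F1 is about 0.667.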
In Table 8 we provide a breakdown of system performance for the labelling task as compared with Annotator 2. This shows the comparative frequency of the different labels. Small vessel disease and Atrophy are the most frequent and the system performs well on both. The presence of these labels boosts the total precision, recall and F1 into the low 90s. With the exception of Ischaemic stroke, deep, old and Haemorrhagic stroke, deep, recent, performance is generally slightly lower for both Ischaemic and Haemorrhagic stroke labels than the total entity score. The comparative frequency of these labels (Ischaemic more frequent than Haemorrhagic) does not appear to make a difference in Table 8, but it may be that the number of Haemorrhagic stroke instances is too low for the sample to be representative. Similarly, other labels are so infrequent that their results may not be interpretable and it would be useful to acquire and annotate more data to improve the robustness of the evaluation results.

Conclusion
We have described the development and evaluation of the EdIE-R system on brain imaging radiology reports from the Edinburgh Stroke Study. The evaluation results are encouraging and the system is sufficiently accurate that we believe it can be used for its intended purpose of data provision for epidemiological studies. To that end, we are currently testing and revising the system on a dataset of over 150,000 routine brain scans from NHS Tayside collected between 1994 and 2015. We are also in the process of evaluating whether the system can reliably identify cases of intracerebral haemorrhage in patients in Greater Manchester.
The evaluation of EdIE-R against these larger datasets will show how robust it is against new data. The disadvantage of a rule-based system such as EdIE-R is that it takes time to write the rules. However, we found that with the help of the domain expert input we were able to get a first prototype running fairly quickly. For a small dataset such as ESS, we found this to work very well as we did not have any training data available at the start to test machine learning methods. Now that we have the annotated data ready we are evaluating machine learning approaches in parallel to investigate if we can obtain better results using them.
[...] dataset. We also received permission from the NHS Tayside Caldicott Guardian to use the anonymised brain imaging reports for this work.