Multi-task transfer learning for the prediction of entity modifiers in clinical text: application to opioid use disorder case detection

Background: The semantics of entities extracted from a clinical text can be dramatically altered by modifiers, including entity negation, uncertainty, conditionality, severity, and subject. Existing models for determining modifiers of clinical entities involve regular expressions or feature weights that are trained independently for each modifier.
Methods: We develop and evaluate a multi-task transformer architecture design where modifiers are learned and predicted jointly using the publicly available SemEval 2015 Task 14 corpus and a new Opioid Use Disorder (OUD) data set that contains modifiers shared with SemEval as well as novel modifiers specific to OUD. We evaluate the effectiveness of our multi-task learning approach versus previously published systems and assess the feasibility of transfer learning for clinical entity modifiers when only a portion of clinical modifiers are shared.
Results: Our approach achieved state-of-the-art results on the ShARe corpus from SemEval 2015 Task 14, showing an increase of 1.1% on weighted accuracy, 1.7% on unweighted accuracy, and 10% on micro F1 scores.
Conclusions: We show that learned weights from our shared model can be effectively transferred to a new partially matched data set, validating the use of transfer learning for clinical text modifiers.
Supplementary Information: The online version contains supplementary material available at 10.1186/s13326-024-00311-4.


Background
Modifier prediction can be considered a subset of relation extraction where a limited set of modifying relations or attributes are predicted for a fixed entity type. Modifier prediction is a critical task in information extraction since the context of a particular mention can radically alter its meaning. In clinical text, the majority of frequently used clinical observations are "negated" [1], a problem which has led to the development of algorithms such as NegEx [2]. This problem is not unique to English and additional implementations in languages besides English have been developed [3,4]. Besides negation, algorithms have been developed to determine historical and hypothetical modifiers of clinical entities, as well as determine if the modified entity is referring to the patient or some other person. A well known example is the ConText algorithm [5], for which there are multiple implementations [4-7].
Within clinical text, algorithms for identifying modifier types associated with a particular class of clinical entities have also been developed, such as extraction of disease or medication modifiers. These algorithms can identify modifiers separately from the recognition of the clinical named entities or jointly, often with separate tracks at shared tasks to evaluate both approaches [8,9]. Modifiers for medication mentions were included as part of the MADE 1.0 shared task and data set [8] and included dosage, route, duration, and frequency. Disease or problem specific modifiers have included modifiers for severity, a generic text, conditionality, and anatomical location. All of these modifiers were included in the SemEval 2015 Task 14 data set [9], and subsets of them can be extracted by tools such as MedLEE [10] (negation, uncertainty, and severity), ConText implementations [4-7] and cTAKES (negation, body site, severity) [11]. What is not known is how effective transfer learning is on the detection of these modifiers. Given the wide range of modifiers that could be applied to an entity such as disease, an ideal transfer learning solution would be able to leverage previous data sets to identify existing and novel disease modifiers in a new data set.

Modifier Classification Problem
A more formal definition is as follows. Let $X$ be a string sequence consisting of $n$ tokens $x_1, x_2, \ldots, x_n$. Let $E = \{e_1, e_2, \ldots, e_s\}$ be the set of entity mentions within $X$. Let $M$ be the set of modifiers, where each modifier $m_i \in M$ has its own set of predefined modifier types (for brevity, $m_i$ also denotes that set). The input to the problem is the combination of an entity mention from $E$ and the context around it in $X$, denoted as $\tilde{X}$. The modifier classification problem is then, for every input sequence $\tilde{X}$, to predict a modifier type $y_{m_i}(\tilde{X}) \in m_i$ for every modifier $m_i \in M$.

Related Work
Like other areas of clinical natural language processing (NLP), work on identifying modifiers of clinical entities has been hindered by the lack of readily available public data sets. Despite this, researchers have studied a variety of methods for solving the problem of entity modifier identification, primarily using the publicly available SemEval 2015 Task 14 data set [9], released as part of the ShARe corpus, which contains modifiers for clinical problems. As discussed in the Background section, early work on modifier identification focused only on rule-based systems [6,7,10,11]. Work in this area has continued, improving the speed and efficiency of modifier identification through systems such as FastContext [7] and applying this to clinical text in other languages such as French [4]. This system had an average F1 score of 0.86 on electronic clinical records for negation, temporality, and experiencer. However, the trend has been to move away from an exclusively rule-based approach. For example, an early version of the open source Clinical Text Analysis and Knowledge Extraction System (cTAKES) [11] used exclusively a rule-based approach to predict clinical entity modifiers, but this capability of cTAKES was expanded when Dligach et al. [12] implemented in cTAKES an SVM-based approach to identify disease severity and body location.
The first machine learning approach that was applied to predict multiple modifiers for a preidentified clinical entity was by Xu et al. [13] when the ShARe corpus [9] was released. The authors systematically extracted multiple features, namely context of up to 8 words before and after an entity, dependency relations, section names, and other lexicon features, to train an individual SVM classifier for each modifier. They achieved state-of-the-art (SOTA) results in identifying modifiers based on gold disorders. More recently, researchers have applied deep learning techniques such as long short-term memory (LSTM) to detect modifiers of medical entities in a clinical text [14,15]. A bi-directional LSTM and a conditional random field (CRF) were used for Named Entity Recognition (NER) of modifiers by Xu et al. [14]. They used randomly initialized word embeddings and position embeddings because pre-trained word embeddings like GloVe did not show much improvement in their experiments. If a sentence has more than one entity, they generate multiple training samples from the same sentence, one for each target entity, and label all the modifiers of that entity using the BIO scheme. The authors used accuracy as the metric to evaluate the performance of their model. While this novel approach overcomes the problem of omitted annotations, its performance is much lower than the previous SOTA model.
A multi-task approach of extracting entities and their modifiers was implemented by Shi et al. They trained a Bi-LSTM-CRF model to extract clinical entities and modifiers and a Bi-LSTM to predict the relation between every entity and modifier within the clinical text. Additionally, they applied some constraints to limit relations between entities and modifiers when a modifier cannot be applied to an entity [15].
The advent of transformer models [16] has revolutionized various aspects of NLP through the application of transfer learning. A recent contribution in this field is the work by Khandelwal and Britto [17], who utilized transformer-based models for predicting negation and speculation modifiers in a multitask learning framework. Their methodology involved the fine-tuning of three different pre-trained transformer encoders: BERT [18], XLNet [19], and RoBERTa [20]. This process was coupled with the use of a single classification head and the injection of special tokens into the input text, guiding the model toward the specific task at hand, either negation or speculation detection. While their approach contributed to the broader understanding of multitask learning in NLP, their research did not extend to exploring these methods in the context of clinical texts.

Contribution
In contrast to Khandelwal and Britto's [17] approach, our research extends the application of transfer learning and multi-task training to the specific area of entity modifier classification within clinical texts. We apply transfer learning to two diverse data sets, outperforming existing benchmarks on a publicly available data set. Our methodology differs notably from that of Khandelwal and Britto; we do not rely on the injection of special tokens into our input data. Instead, we employ a distinct classification head for each modifier, with simultaneous training of all heads to enhance model efficiency. This study also showcases the versatility of multi-task training in transferring weights across modifier heads from different data sets, even with only partial modifier overlap. Additionally, through ablation studies, we identify the most effective architectural configurations for our approach. A practical real-world application of our method is demonstrated in addressing the identification of modifiers for clinical entities specific to Opioid Use Disorder (OUD).

Data
We utilized two data sets for the evaluation of modifier detection in clinical text: the published ShARe corpus from SemEval 2015 Task 14 [9] and an unpublished corpus from the University of Alabama at Birmingham that was annotated for OUD-related entities (including modifiers). An overview of both corpora, including the number of documents, entities, modifier types, and counts, is shown in Table 1.

ShARe Data Set
Task 14 of SemEval-2015 [9] provided a data set for two tasks: clinical disorder named entity recognition and template slot filling. It consists of 531 de-identified clinical notes. For this publication, we focus only on the template slot filling task. This task requires the identification of negation, severity, course, subject, uncertainty, conditional, and generic modifiers of the clinical disease entity. The assigned training and development set are combined to build our final model, and the test set results are reported.

Architecture
We created a single-task and a multi-task architecture as shown in Figure 1, where the single-task architecture uses a single classification head for all modifiers whereas the multi-task architecture has a head for each clinical modifier. Both architectures use BioBERT [21] as the base model. We chose BioBERT over other variants like BERT [18], ClinicalBERT [22], and PubMedBERT [23] since it performed better for detecting modifiers in initial experiments. No additional pre-training is performed using text from either the ShARe corpus or the OUD corpus.

Models
We name our single-task fine-tuned BioBERT model ST and our multi-task fine-tuned model MT. For our transfer learning experiments, models have a dash-separated suffix indicating which data set they were fine-tuned on. For example, SHR or OUD reference fine-tuning on the ShARe corpus or OUD corpus, respectively. Fine-tuning operations are ordered from left (most distant) to right (most recent) based on the order in which they occurred. See Figure 2 for reference. The MT-SHR-OUD and MT-OUD-SHR models have 5 and 7 classification heads, respectively. We also perform an experiment that combines the two data sets, called MT-BOTH, with a total of 9 distinct classification heads representing all clinical modifiers from both data sets. For all models we use the final hidden vector corresponding to the [CLS] token ($h_{cls} \in \mathbb{R}^{H}$) generated from BioBERT($\tilde{X}$), where $\tilde{X}$ is the input sequence, as the common feature vector that is passed to each modifier classification head, which is a linear layer with learnable parameter $W_i \in \mathbb{R}^{|m_i| \times H}$. Formally, the probability distribution of a modifier type is:
$$P(y_{m_i} \mid \tilde{X}) = \mathrm{softmax}(W_i h_{cls})$$
where $y_{m_i}$ is a label of the modifier $m_i$. We train the model using cross-entropy loss for each classifier:
$$\mathcal{L}_{m_i} = -\frac{1}{k} \sum_{j=1}^{k} \log P\big(y_{m_i}^{(j)} \mid \tilde{X}^{(j)}\big)$$
where $k$ is the length of the batch. The final loss is the average of all classifiers:
$$\mathcal{L} = \frac{1}{n} \sum_{i=1}^{n} \mathcal{L}_{m_i}$$
where $n$ represents the number of modifier heads in the model. Additionally, we experiment with the use of focal loss [24] on our model, a type of loss function commonly used in deep learning for tasks involving imbalanced data sets.
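The per-head classification and the averaged loss described above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the names `head_probs`, `head_loss`, and `multitask_loss` are hypothetical, the heads have no bias term (the text mentions only the weight matrix), the batch loss is assumed to be a mean, and the focal variant omits the optional class-balancing factor.

```python
import math
from typing import Dict, List

Vector = List[float]
Matrix = List[List[float]]  # one row of H weights per modifier class


def softmax(logits: Vector) -> Vector:
    """Numerically stable softmax."""
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def head_probs(W: Matrix, h_cls: Vector) -> Vector:
    """Class distribution of one modifier head: softmax(W h_cls)."""
    logits = [sum(w * h for w, h in zip(row, h_cls)) for row in W]
    return softmax(logits)


def head_loss(W: Matrix, batch_h: List[Vector], gold: List[int],
              gamma: float = 0.0) -> float:
    """Mean loss of one head over a batch of k examples.

    gamma = 0 gives plain cross-entropy; gamma > 0 gives focal loss
    (assumption: no class-balancing alpha factor).
    """
    total = 0.0
    for h_cls, y in zip(batch_h, gold):
        p_true = head_probs(W, h_cls)[y]
        total += -((1.0 - p_true) ** gamma) * math.log(p_true)
    return total / len(batch_h)


def multitask_loss(heads: Dict[str, Matrix], batch_h: List[Vector],
                   gold: Dict[str, List[int]], gamma: float = 0.0) -> float:
    """Average of the per-head losses over the n modifier heads."""
    losses = [head_loss(W, batch_h, gold[name], gamma)
              for name, W in heads.items()]
    return sum(losses) / len(losses)


# Toy example with H = 2 and two hypothetical heads.
heads = {
    "negation": [[1.0, 0.0], [0.0, 1.0]],              # 2 classes
    "severity": [[0.5, 0.5], [0.0, 1.0], [1.0, 0.0]],  # 3 classes
}
batch_h = [[2.0, 0.0], [0.0, 2.0]]  # stand-ins for [CLS] vectors
gold = {"negation": [0, 1], "severity": [2, 1]}
loss = multitask_loss(heads, batch_h, gold)
focal = multitask_loss(heads, batch_h, gold, gamma=2.0)
```

Because every head reads the same [CLS] vector, a single encoder pass serves all modifiers, and setting `gamma` to zero recovers the cross-entropy objective.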

Feature Extraction
We have adopted the question-answering input format described in the original BERT [18] to fine-tune BioBERT and adapt it to modifier prediction. As illustrated in Figure 1, the model gets two sequences as input. The left context sequence is a disorder mention and its context. We chose the context to be 200 characters before the mention and 50 characters after, an empirically driven hyper-parameter choice that achieved better performance than word-based and sentence-based context. We attribute this to the fact that sentence boundaries are not well-formed in clinical text [25]. For instance, some clinical notes only use commas throughout whole paragraphs without any periods. The right context sequence is the string of the entity itself. If the entity is discontiguous, only the strings that represent the entity are used. An example of what is passed to the model is: [CLS] The patent[sic] was found to be in fulminant liver failure. There she was having hallucinations, suicidal ideations and . . . [SEP] hallucinations. A second example would pair the same first sequence with "suicidal ideations" as the second sequence. The goal of this design is to help direct the model's attention to the desired entity when extracting its modifiers. Specifically, the entity is contextualized with the surrounding words in the first sequence, which generally include the modifiers. In addition, the second sequence redirects the attention of the aggregate sequence representation [CLS] to the entity under consideration ("hallucinations" in the first example).
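The two-sequence input described above can be sketched as follows. This is a minimal sketch under stated assumptions: `build_input` and `render` are hypothetical helper names, character offsets are assumed to be available for each mention, the window for a discontiguous mention is assumed to run from its first span to its last, and the trailing [SEP] follows the usual BERT sentence-pair layout rather than anything stated in the text.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def build_input(note: str, spans: List[Span],
                before: int = 200, after: int = 50) -> Tuple[str, str]:
    """Return (context, entity) for one entity mention.

    `spans` holds one offset pair for a contiguous mention and several
    pairs for a discontiguous one.
    """
    spans = sorted(spans)
    start = max(0, spans[0][0] - before)
    end = min(len(note), spans[-1][1] + after)
    context = note[start:end]
    # Only the text inside the spans represents the entity.
    entity = " ".join(note[s:e] for s, e in spans)
    return context, entity


def render(context: str, entity: str) -> str:
    """Show the pair in the usual BERT sentence-pair layout."""
    return f"[CLS] {context} [SEP] {entity} [SEP]"


note = "There she was having hallucinations, suicidal ideations"
first = note.index("hallucinations")
second = note.index("suicidal ideations")
ctx_a, ent_a = build_input(note, [(first, first + len("hallucinations"))])
ctx_b, ent_b = build_input(note, [(second, second + len("suicidal ideations"))])
```

Both mentions share the same context here because the note is shorter than the window; only the second sequence tells the model which entity is being classified.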

Model Training
For training, we follow standard procedures and use the curated training data set for both data sources to develop our models. Hyper-parameters are optimized using the designated development set. We trained our model for 10 epochs with an empirically derived early stop. The maximum sequence length is 144, the learning rate is 2e-5, the weight decay is 1e-2, and the batch size is 64. AdamW is used as our optimizer. Similar to Xu et al. [13], the training and development set are combined to build our final model. We report the results on their respective test sets. We used a single Tesla P100 GPU with 16GB memory to run all experiments. The model will be made available upon the acceptance of the publication.

Experiments
We conducted the following experiments:
• To assess the performance of transfer learning and multi-task training for clinical modifiers, we evaluate MT on the OUD corpus and the ShARe corpus. We compare our results to previously reported results for the ShARe corpus and against a generalized clinical modifier model that combines all training examples from both corpora.
• To evaluate the feasibility of domain adaptation for clinical modifiers when only a portion of clinical modifiers match between the target and source domains, we perform bidirectional fine-tuning between the OUD and ShARe corpora. We fine-tune on the source domain, then perform an additional round of fine-tuning on the target domain to create 2 models, MT-SHR-OUD and MT-OUD-SHR, that are fine-tuned first on the ShARe corpus or OUD corpus, respectively. We include a classification head for each modifier from both data sets for a total of # heads.
• We performed 2 ablation studies on both the OUD and ShARe corpora by (1) removing the disorder mention after the [SEP] or (2) replacing the MT model heads with a single-headed (ST) model, where a model is trained separately for each modifier.
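One way to set up the classification heads for the second round of fine-tuning when modifiers only partially overlap is sketched below. The text does not state whether shared head weights are reused or only the encoder is carried over, so the reuse rule here is an assumption; `new_head` and `heads_for_target` are hypothetical names, and the class counts are placeholders rather than the corpora's real label sets.

```python
import random
from typing import Dict, List

Matrix = List[List[float]]


def new_head(num_classes: int, hidden: int, rng: random.Random) -> Matrix:
    """Freshly initialized linear head with small random weights."""
    return [[rng.gauss(0.0, 0.02) for _ in range(hidden)]
            for _ in range(num_classes)]


def heads_for_target(source_heads: Dict[str, Matrix],
                     target_classes: Dict[str, int],
                     hidden: int, seed: int = 0) -> Dict[str, Matrix]:
    """Reuse heads shared with the source model, create the others."""
    rng = random.Random(seed)
    heads: Dict[str, Matrix] = {}
    for name, num_classes in target_classes.items():
        src = source_heads.get(name)
        if src is not None and len(src) == num_classes:
            # Shared modifier: start from the source weights (copied).
            heads[name] = [row[:] for row in src]
        else:
            # Modifier unseen in the source data: start from scratch.
            heads[name] = new_head(num_classes, hidden, rng)
    return heads


HIDDEN = 4  # toy hidden size for illustration
init = random.Random(1)
# Placeholder class counts for a ShARe-like source model.
source_classes = {"negation": 2, "severity": 4, "course": 5, "subject": 3,
                  "uncertainty": 2, "conditional": 2, "generic": 2}
source_heads = {name: new_head(c, HIDDEN, init)
                for name, c in source_classes.items()}
# Placeholder class counts for an OUD-like target with two novel modifiers.
target_classes = {"negation": 2, "subject": 3, "uncertainty": 2,
                  "doctime": 4, "illicit_drug_use": 2}
target_heads = heads_for_target(source_heads, target_classes, HIDDEN)
```

In this sketch the encoder weights would be carried over unchanged in both cases; only the treatment of the heads differs between shared and novel modifiers.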

Evaluation
To evaluate a system for identifying rare values of different modifiers, the original challenge for the ShARe corpus used weighted accuracy, where the prevalence of different values for each of the modifiers is considered. Specifically, for each modifier $m_i$, a weight is calculated for each of its classes $k$, as described in the task description paper [9]. We used the evaluation script from the challenge organizers to compare with the previous state-of-the-art system [13]. We have also used the standard unweighted accuracy and micro-averaged F1 to compare against other later works (Table 2). For the OUD data set, we used the standard unweighted accuracy, macro-averaged F1 (including the null class), and micro-averaged F1 for possible future comparison.
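The unweighted metrics named above can be computed for a single modifier as in the following sketch. It is illustrative only: `f1_scores` and its `exclude` parameter are hypothetical, the labels are made up, and the weighted accuracy of the challenge is not reproduced here because it is defined by the organizers' evaluation script.

```python
from collections import Counter
from typing import Iterable, Sequence, Tuple


def accuracy(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Unweighted accuracy: fraction of exact label matches."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def f1_scores(gold: Sequence[str], pred: Sequence[str],
              exclude: Iterable[str] = ()) -> Tuple[float, float]:
    """Return (micro_f1, macro_f1) over all labels not in `exclude`.

    Assumes at least one label remains after exclusion.
    """
    labels = sorted((set(gold) | set(pred)) - set(exclude))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label gains a false positive
            fn[g] += 1  # gold label gains a false negative

    def f1(t: int, f_p: int, f_n: int) -> float:
        denom = 2 * t + f_p + f_n
        return 2 * t / denom if denom else 0.0

    micro = f1(sum(tp[l] for l in labels),
               sum(fp[l] for l in labels),
               sum(fn[l] for l in labels))
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)
    return micro, macro


gold = ["null", "null", "yes", "yes", "null"]
pred = ["null", "yes", "yes", "null", "null"]
acc = accuracy(gold, pred)
micro_all, macro_all = f1_scores(gold, pred)
micro_no_null, macro_no_null = f1_scores(gold, pred, exclude=["null"])
```

When every label is counted, micro-averaged F1 for a single-label task coincides with accuracy, so the two only diverge when some class, such as the default value, is left out of the average.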

Chi-square test
We perform a Chi-square test [26] to assess the statistical significance of the difference between our model results and previous results by comparing the number of correct and incorrect predictions. When these counts were not available, we derived them from the reported accuracy and the total number of examples in the test set.
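A comparison of this kind can be sketched as a 2x2 test on correct and incorrect counts. The sketch assumes a Pearson chi-square without continuity correction (the text does not say which variant was used); the function names and the accuracies in the usage example are hypothetical and are not results from this paper.

```python
import math
from typing import Tuple


def counts_from_accuracy(acc: float, n: int) -> Tuple[int, int]:
    """Recover (correct, incorrect) counts from a reported accuracy."""
    correct = round(acc * n)
    return correct, n - correct


def chi_square_2x2(a: int, b: int, c: int, d: int) -> Tuple[float, float]:
    """Pearson chi-square for the table [[a, b], [c, d]].

    Returns (statistic, p_value) with one degree of freedom.
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        raise ValueError("a row or column of the table is empty")
    stat = n * (a * d - b * c) ** 2 / denom
    # Chi-square survival function for 1 dof: P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Hypothetical accuracies on a hypothetical test set of 1000 examples.
ours = counts_from_accuracy(0.95, 1000)
previous = counts_from_accuracy(0.92, 1000)
stat, p_value = chi_square_2x2(ours[0], ours[1], previous[0], previous[1])
```

Rows are the two systems and columns are correct versus incorrect predictions, so a small p-value indicates that the two accuracies are unlikely to differ by chance alone.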

Results
Multi-task training and transfer learning performance are shown in Table 2 for the ShARe corpus and in Table 3 for the OUD corpus. Detailed explanations of the results are in the next subsections.

Comparative Performance of Multi-Task Training
We compare our results to previously reported results using the same metrics as the originally reported result. For the ShARe corpus, the top portion of Table 2 contains the results from using the rule-based cTAKES 4.0.0.1 system, the previous SOTA model [13], and our multi-task (MT-SHR) model on the ShARe corpus using the weighted accuracy measure. We also report the micro-averaged F1 score and unweighted accuracy in the middle and bottom portions of Table 2, respectively. Our multi-task model (MT-SHR) performs better in all modifiers except the conditional modifier based on weighted accuracy and is on par with the SOTA model for the negation modifier. Our model shows a statistically significant improvement in all other modifiers by at least 1%. For example, the uncertainty modifier scored the highest improvement of 4% among other modifiers. On average, our model improved the performance by slightly over 1%. The low performance on the conditional modifier had a disproportionate impact on the final average results. In both corpora, training with cross-entropy loss was superior to focal loss despite class imbalance in modifier distribution.

Domain Adaptation
Results for domain adaptation are shown in Table 2 for the ShARe corpus (MT-OUD-SHR) and in Table 3 for the OUD corpus (MT-SHR-OUD). Initial training on one data set followed by additional training on another data set (MT-OUD-SHR in Table 2 and MT-SHR-OUD in Table 3) with partially similar modifiers led to at least similar or even better performance compared to training on a single data set. On micro average F1 score, the performance improved by 3.3% compared to training only on the ShARe data set. However, merging the two data sets (MT-BOTH) decreased the average performance by 2.6%.

Ablation Study
Results for our ablation studies are shown for the ShARe corpus and OUD corpus in Table 4 and Table 5, respectively. Removal of the mention after the [SEP], or "no hint", decreases the performance of the model for both data sets, as has been seen in similar research experiments [27]. The second-to-last row in Table 4 and Table 5 shows results evaluating the effectiveness of multi-task training (MT) versus standard fine-tuning (ST) on each modifier separately. For the ShARe data set, the performance dropped by an average of 2% compared with MT-SHR. It is even worse than the previous SVM-based SOTA, by 1% on average.

Comparison to Previous Work
The adoption of transfer learning and multi-task training (MT) yields an overall improvement of 10 points (MT-SHR) on the micro-F1 score when compared to previous work by Xu et al. [14] using a Bi-LSTM architecture, as shown in Table 2. This improvement rises to 14 points when an initial round of fine-tuning is done on the OUD corpus (MT-OUD-SHR). This increase occurs even when there is only partial alignment of modifier types between the two data sets. Combining multiple data sets (MT-BOTH), similar to the work of Khandelwal and Britto [17], yields a model that can predict all modifiers from both input data sets at the expense of performance.

Transfer Learning on Modifiers Common between Data Sets
Entity modifiers for negation, subject, severity and uncertainty are shared between the ShARe and OUD data sets. For these common entity modifiers, our results indicate that the target data set can benefit from a previous round of fine-tuning on the source data set. This benefit is more pronounced when the modifiers shared between data sets have sufficient training instances. For example, micro F1 performance was improved on the ShARe corpus for the shared negation and subject modifiers, both of which have a proportionally higher number of training examples. We are uncertain as to why the same benefit was not shown on the OUD corpus for these high frequency common modifiers, but the additional round of fine-tuning did not decrease performance. For low frequency modifiers like the uncertainty modifier, results improved when transferring from ShARe to OUD (MT-SHR-OUD) but decreased when transferring from OUD to ShARe (MT-OUD-SHR) compared to MT, since OUD has few examples with annotated uncertainty. Severity modifier instances are too infrequent in the OUD corpus to be evaluated for this transfer learning experiment, so we show performance on the ShARe and combined corpus only.

Transfer Learning on Modifiers Uncommon between Data Sets
One surprising result of our approach is that transfer learning performance was improved on modifiers found only in the target data set, even when they were not present in the source data set. We can see this in the results for course, generic, and conditional in MT-OUD-SHR and for DT and IDU in MT-SHR-OUD. This could be the result of the initial round of training, where the model has seen more examples to learn the task. However, combining the two data sets for training one model (MT-BOTH) had mixed effects on the two data sets. This may be the result of relative class imbalance for entity modifier types between source and target data sets.

Efficiency of Transfer Learning between Data Sets
Multi-task training gave better performance compared to single-task training (see Table 4), and it was more efficient.

Error Analysis
In our study, we carefully selected 10 errors from each modifier type for a total of 70 errors for the ShARe and 50 for the OUD test sets. This gave us a diverse range of errors to examine closely. We used the predictions of MT-SHR and MT-OUD for our analysis. A key finding from our analysis was that in 44% of cases from ShARe, our model made correct predictions, but these appear to have been annotated incorrectly in the data sets. For example, in the sentence "There was no rebound or guarding," our model correctly identified the negation, but this was not reflected in the gold data set. Another instance involved a discrepancy in severity grading: our model classified a case as moderate, while the data set labeled it as severe for the example "Mild degenerative changes are seen throughout the spine". This kind of inconsistency was also present in the initial version of the OUD data set, but we have since rectified these issues, elevating its reliability to at least match, if not exceed, that of the ShARe data set.
In the context of modifier evaluations, false negatives occur when the system predicts the 'null', 'unmarked' or default class. This occurred in 46% of our examples in the ShARe data set and 58% in the OUD data set. The default case covers an average of 92% of instances in both data sets; this class imbalance means our model tends to overlook or incorrectly classify under-represented modifiers. For instance, in "He had slight decreased sensation in his right upper extremity," the model failed to identify 'slight' as the severity level, opting instead for 'unmarked'. Within the OUD data set, the model struggles to identify the negation present in the clinical note section for the psychological state. It also struggles with abbreviations or alternatives for negations, like 'NEG', '0', 'Zero', and 'None'. For instance, "Family History: 0 suicide attempts" has a 0 that was not recognized by the model as a negation for suicide attempts. For the DocTime modifier, the model has some trouble recognizing the current annotated text as a part of the history section of the clinical note, deeming all information included as 'Before'.
False positives accounted for about 54% of ShARe and 42% of OUD errors. These errors often stemmed from confusing contexts. For example, in the phrase "Per her daughter she has been having shortness of breath," the model misinterpreted 'shortness of breath' as referring to the daughter, not the patient. Similarly, in the OUD data set, the model was confused by drug/lab test contexts. In sentences like "Lab Results: U Methadone Negative U Opiates Positive U Oxycodone Negative," the presence of the word 'Negative' misled the model for the affirmed mention 'U Opiates'. Lastly, our system made unexpected errors about 10% of the time. On occasion, it failed to recognize terms such as "denies" as negations. This may reflect the varied contextual language and inconsistent use of "denies" by physicians, where physician-suspected but patient-denied conditions are marked as denies whereas annotation guidelines impose consistency on a more nuanced note. Other errors were due to fine distinctions between different classifications, such as failing to differentiate between 'increased' and 'worsened' in symptom descriptions. For instance, the model predicted that the fatigue increased while it was annotated as worsened in the example "Three days of progressive fatigue." For more examples and a detailed look at these points, please refer to Table 6.

Limitations
To our knowledge, the only published corpus containing clinical modifiers of entities to date is the ShARe corpus from SemEval 2015 Task 14 [9]. This data set, in conjunction with the OUD data set, leaves only two data sets for evaluation. We do not evaluate large language models (LLMs) in this work, but do not believe this is needed, given that our task is an information extraction task, fine-tuning was done, and recent work suggests that domain models are capable of outperforming LLMs in this domain [28]. Additionally, we do not evaluate the anatomy modifier of the ShARe corpus since our approach requires training data for all class types and more than one-third of the classes in the ShARe test set are not in the training set. We are exploring synthetic data to address this issue. Finally, for the ShARe corpus, the vast majority of the examples have a single clinical entity mention within the chosen context. However, a duplicate clinical entity mention occurs in the same context window in 4% of the examples, and in 2% of the examples the same clinical entity mention occurs 3 or more times. This can cause ambiguity since the clinical entity modifiers can only be distinguished by slightly different ends to their context window.

Conclusion
Our results indicate that multi-task training can be beneficial for modifier identification and we show state-of-the-art performance on the ShARe corpus. Additionally, our experiments suggest that an additional round of fine-tuning on a similar data set can be more effective and efficient than training a transformer model on a combined data set, even if modifiers from the two data sets only partially overlap.

Fig. 1
Overview of our modifier prediction model. The multi-task architecture contains a classification head for each distinct modifier type. The single-task architecture has only a single head for the classification of one of the modifiers.

Fig. 2
Overview of transfer learning. Thin arrows indicate training data flow, color-coded for the data source. Thick arrows indicate fine-tuning operations.

Table 1
Statistics of the ShARe and OUD Corpus. Default modifier values are not reported. Ents: entities, Neg: negation, Sev: severity, Cou: course, Sub: subject, Unc: uncertainty, Con: conditional, Gen: generic, DT: DocTime, IDU: Illicit Drug Use.

OUD Data Set
After training by WC (OUD research co-coordinator), annotators created a corpus consisting of 3295 clinical notes from 59 patients (23 controls) from physician case referrals between 2016 and 2021. Annotation of 25478 OUD entity mentions and modifiers was done using BRAT 1.3 software. Annotators modified entities for negation and subject and assigned a DocTime value of before, overlaps, or after. Additionally, annotators annotated mentions of substance and opioid use, OUD and Substance Use Disorder (SUD) as illicit. To the best of our knowledge, illicitDrugUse is a unique event modifier in our data set. We split the data set into 80% for training, 10% for development, and 10% for testing based on entities, not documents. The training and development set are combined to build our final model, and the test set results are reported. We plan to de-identify and release this data set as part of a future shared task.

Table 2
Model Performance on the ShARe corpus. Bold font means the best performance. Underline means statistically significant p-value relative to the model in the previous line. Ep: epochs, Neg: negation, Sev: severity, Cou: course, Sub: subject, Unc: uncertainty, Con: conditional, Gen: generic, Avg: average, fl: focal loss. * For micro averages of MT, MT-BOTH and MT-OUD-SHR, the 'other' class of the subject modifier is excluded due to having only 4 examples.

Table 3
Model Performance on the OUD Corpus Modifiers.
Bold font means the best performance. Underline means statistically significant p-value relative to the model in the previous line. Ep: epochs, Neg: negation, Sev: severity, Sub: subject, Unc: uncertainty, DT: DocTime, IDU: Illicit Drug Use.

Table 4
Ablation study results on the ShARe Corpus

Table 5
Ablation study results for the OUD corpus.

Compared to single-task training (see Table 4), multi-task training was also more efficient. On average, it took 6 minutes to train our MT model for one epoch, while it took 5 minutes to train an ST model, which had to be repeated for each modifier. Overall, the training time and resource cost for the MT model was reduced by at least 60% compared to training all ST models. Multi-task training and transfer learning from one data set to the other was also more efficient than combining the two data sets (MT-BOTH). Training MT-BOTH required 20 epochs for both data sets, whereas two consecutive rounds of multi-task transfer learning (MT-OUD-SHR and MT-SHR-OUD) reduced that cost by almost 50%. We noticed that MT-BOTH needed significantly more time to learn the uncommon modifiers compared to learning the common modifiers. While combining the data sets is less efficient, it allowed MT-BOTH to overcome the problem of very few examples in a data set, such as the severity modifier in the OUD data set.
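The per-epoch figures quoted above can be turned into a rough saving estimate. This back-of-the-envelope sketch assumes that MT and ST models train for the same number of epochs, which the text does not state and which early stopping may violate; the function name `relative_saving` is hypothetical, and the head counts of 7 and 5 are taken from the model descriptions.

```python
def relative_saving(mt_minutes_per_epoch: float,
                    st_minutes_per_epoch: float,
                    num_modifiers: int) -> float:
    """Fraction of per-epoch time saved by one MT model versus
    training one ST model per modifier."""
    st_total = st_minutes_per_epoch * num_modifiers
    return 1.0 - mt_minutes_per_epoch / st_total


saving_seven_heads = relative_saving(6, 5, 7)  # seven ShARe modifiers
saving_five_heads = relative_saving(6, 5, 5)   # five OUD heads
```

Under these assumptions both estimates fall above the reported lower bound of 60%; the gap between this estimate and the reported bound may come from differences in epochs per model, which are not modeled here.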

Table 6
Error Analysis for the ShARe and OUD Data Set. Mention in each example is underlined. Longer context is omitted due to space limits. False negatives in this context are when the default modifier class is predicted. False positives are when the wrong modifier class is predicted.
Examples recovered from the table: "His extremities also showed greater swelling in his left leg."; "During interview she stated she has not been using illicit substances since D/C, but after UDS came back pos for amphetamines and benzos, she admitted to using these illicitly about a week ago due to her high anxiety."