Multi-task transfer learning for the prediction of entity modifiers in clinical text: application to opioid use disorder case detection



The semantics of entities extracted from a clinical text can be dramatically altered by modifiers, including entity negation, uncertainty, conditionality, severity, and subject. Existing models for determining modifiers of clinical entities involve regular expressions or feature weights that are trained independently for each modifier.


We develop and evaluate a multi-task transformer architecture design where modifiers are learned and predicted jointly using the publicly available SemEval 2015 Task 14 corpus and a new Opioid Use Disorder (OUD) data set that contains modifiers shared with SemEval as well as novel modifiers specific for OUD. We evaluate the effectiveness of our multi-task learning approach versus previously published systems and assess the feasibility of transfer learning for clinical entity modifiers when only a portion of clinical modifiers are shared.


Our approach achieved state-of-the-art results on the ShARe corpus from SemEval 2015 Task 14, showing an increase of 1.1% on weighted accuracy, 1.7% on unweighted accuracy, and 10% on micro F1 scores.


We show that learned weights from our shared model can be effectively transferred to a new partially matched data set, validating the use of transfer learning for clinical text modifiers.


Relation extraction is a powerful method for identifying and categorizing relationships between entities within text [1,2,3]; however, it requires explicit annotation of both entities and the relationships (or modifiers) between them. Unfortunately, few clinical data sets meet these criteria [4]. Modifier prediction can be considered a subset of relation extraction where a limited set of modifying relations or attributes are predicted for a fixed entity type. Modifier prediction is a critical task in information extraction since the context of a particular mention can radically alter its meaning. In clinical text, the majority of frequently used clinical observations are “negated” [5], a problem which has led to the development of algorithms such as NegEx [6]. This problem is not unique to English, and implementations for other languages have been developed [7, 8]. Besides negation, algorithms have been developed to determine historical and hypothetical modifiers of clinical entities, as well as to determine whether the modified entity refers to the patient or some other person. A well-known example is the ConText algorithm [9], for which there are multiple implementations [8,9,10,11].

Within clinical text, algorithms for identifying modifier types associated with a particular class of clinical entities have also been developed, such as extraction of disease or medication modifiers. These algorithms can identify modifiers separately from the recognition of the clinical named entities or jointly, often with separate tracks at shared tasks to evaluate both approaches [12, 13]. Modifiers for medication mentions were included as part of the MADE 1.0 shared task and data set [12] and included dosage, route, duration, and frequency. Disease or problem specific modifiers have included modifiers for severity, a generic text, conditionality, and anatomical location. All of these modifiers were included in the SemEval 2015 Task 14 data set [13], and subsets of them can be extracted by tools such as MedLEE [14] (negation, uncertainty, and severity), ConText implementations [8,9,10,11] and cTAKES (negation, body site, severity) [15]. What is not known is how effective transfer learning is in the detection of these modifiers. Given the wide range of modifiers that could be applied to an entity such as disease, an ideal transfer learning solution would be able to leverage previous data sets to identify existing and novel disease modifiers in a new data set.

Modifier classification problem

A more formal definition is as follows. Let X be a string sequence consisting of n tokens \(x_1, x_2, \dots , x_n\). Let \(E=\{e_1, e_2, \dots , e_s\}\) be the set of entity mentions within X. Let M be the set of modifiers, where each modifier \({m_i} \in {M}\) has a predefined set of modifier types. The input to the problem is the combination of an entity mention from E and the context around it in X, denoted as \(\widehat{X}\). The modifier classification problem is then, for every input sequence \(\widehat{X}\), to predict a modifier type \(y_{m_i}(\widehat{X}) \in {m_i}\) for every modifier \({m_i} \in {M}\) that exists for it.

Related work

Like other areas of clinical natural language processing (NLP), work on identifying modifiers of clinical entities has been hindered by the lack of readily available public data sets. Despite this, researchers have studied a variety of methods for solving the problem of entity modifier identification, primarily using the publicly available SemEval 2015 Task 14 data set [13], released as part of the ShARe corpus, which contains modifiers for clinical problems. As discussed in the Background section, early work on modifier identification focused only on rule-based systems [10, 11, 14, 15]. Work in this area has continued, improving the speed and efficiency of modifier identification through systems such as FastContext [11] and applying this to clinical text in other languages such as French [8]. This system had an average F1 score of 0.86 on electronic clinical records for negation, temporality, and experiencer. However, the trend has been to move away from an exclusively rule-based approach. For example, an early version of the open source Clinical Text Analysis and Knowledge Extraction System (cTAKES) [15] used an exclusively rule-based approach to predict clinical entity modifiers, but this capability of cTAKES was expanded when Dligach et al. [16] implemented in cTAKES an SVM-based approach to identify disease severity and body location.

The first machine learning approach applied to predict multiple modifiers for a preidentified clinical entity was by Xu et al. [17] when the ShARe corpus [13] was released. The authors systematically extracted multiple features, namely context of up to 8 words before and after an entity, dependency relation, section names, and other lexicon features, to train an individual SVM classifier for each modifier. They achieved state-of-the-art (SOTA) results in identifying modifiers based on gold disorders. More recently, researchers have applied deep learning techniques such as long short-term memory (LSTM) to detect modifiers of medical entities in clinical text [18, 19]. A bi-directional LSTM and a conditional random field (CRF) were used for Named Entity Recognition (NER) of modifiers by Xu et al. [18]. They used randomly initialized word embeddings and position embeddings because pre-trained word embeddings like GloVe did not show much improvement in their experiments. If a sentence has more than one entity, they generate multiple training samples from the same sentence, one for each target entity, and label all the modifiers of that entity using the BIO scheme. The authors used accuracy as the metric to evaluate the performance of their model. While this novel approach overcomes the problem of omitted annotations, its performance is much lower than the previous SOTA model.

A multi-task approach of extracting entities and their modifiers was implemented by Shi et al. Multi-task learning works “by learning tasks in parallel while using a shared representation” such that “what is learned for each task can help other tasks be learned better” [20]. Shi et al. trained a Bi-LSTM-CRF model to extract clinical entities and modifiers and Bi-LSTM to predict the relation between every entity and modifier within the clinical text. Additionally, they have applied some constraints to limit relations between entities and modifiers when a modifier cannot be applied to an entity [19].

The advent of transformer models [21] has revolutionized various aspects of NLP through the application of transfer learning. A recent contribution in this field is the work by Khandelwal and Britto [22], who innovatively utilized transformer-based models for predicting negation and speculation modifiers in a multi-task learning framework. Their methodology involved the fine-tuning of three different pre-trained transformer encoders: BERT [23], XLNet [24], and RoBERTa [25]. This process was coupled with a unique approach of using a single classification head and injecting special tokens into the input text, guiding the model toward the specific task at hand, either negation or speculation detection. While their approach contributed to the broader understanding of multi-task learning in NLP, their research did not extend to exploring these methods in the context of clinical texts, an area ripe for further investigation.


In contrast to Khandelwal and Britto’s [22] approach, our research extends the application of transfer learning and multi-task training to the specific area of entity modifier classification within clinical texts. We apply transfer learning to two diverse data sets, outperforming existing benchmarks on a publicly available data set. Our methodology differs notably from that of Khandelwal and Britto; we do not rely on the injection of special tokens into our input data. Instead, we employ a distinct classification head for each modifier, with simultaneous training of all heads to enhance model efficiency. This study also showcases the versatility of multi-task training in transferring weights across modifier heads from different data sets, even with only partial modifier overlap. Additionally, through ablation studies, we identify the most effective architectural configurations for our approach. A practical real-world application of our method is demonstrated in addressing the identification of modifiers for clinical entities specific to Opioid Use Disorder (OUD).



We utilized two data sets for the evaluation of modifier detection in clinical text: the published ShARe corpus from SemEval 2015 Task 14 [13] and an unpublished corpus from the University of Alabama at Birmingham that was annotated for OUD-related entities (including modifiers). An overview of both corpora, including the number of documents, entities, modifier types, and counts, is shown in Table 1.

ShARe data set

Task 14 of SemEval-2015 [13] provided a data set for two tasks: clinical disorder named entity recognition and template slot filling. It consists of 531 de-identified clinical notes. For this publication, we focus only on the template slot filling task. This task requires the identification of negation, severity, course, subject, uncertainty, conditional, and generic modifiers of the clinical disease entity. The assigned training and development sets are combined to build our final model, and the test set results are reported.

OUD data set

After training by WC (OUD research co-coordinator), annotators created a corpus consisting of 3295 clinical notes from 59 patients (23 controls) from physician case referrals between 2016 and 2021. Annotation of 25478 OUD entity mentions and modifiers was done using BRAT 1.3 software. Annotators modified entities for negation and subject, and assigned a DocTime value of before, overlaps, or after. Additionally, annotators annotated mentions of substance and opioid use, OUD and Substance Use Disorder (SUD) as illicit. To the best of our knowledge, illicitDrugUse is a unique event modifier in our data set. We split the data set into 80% for training, 10% for development, and 10% for testing based on entities, not documents. The training and development sets are combined to build our final model, and the test set results are reported. We plan to de-identify and release this data set as part of a future shared task.

Table 1 Statistics of the ShARe and OUD corpus


We created a single-task and a multi-task architecture as shown in Fig. 1, where the single-task architecture uses a single classification head for each modifier trained separately, whereas the multi-task architecture has a head for each clinical modifier trained jointly. Both architectures use BioBERT [26] as the base model. We chose BioBERT over other variants like BERT [23], ClinicalBERT [27], and PubMedBERT [28] since it performed better for detecting modifiers in initial experiments. No additional pre-training is performed using text from either the ShARe corpus or the OUD corpus.

Fig. 1

Overview of our modifier prediction model. The multi-task (MT) architecture contains a classification head for each distinct modifier type. The single-task (ST) architecture has a single head for the classification of each modifier that is trained separately


We name our single-task fine-tuned BioBERT model ST, our multi-task fine-tuned model MT. For our transfer learning experiments, models have a dash separated suffix indicating which data set they were fine-tuned on. For example, SHR or OUD reference fine-tuning on the ShARe corpus or OUD corpus respectively. Fine-tuning operations are ordered from left (most distant) to right (most recent) based on the order in which they occurred. See Fig. 2 for reference. The MT-SHR-OUD and MT-OUD-SHR have 5 and 7 classification heads, respectively. We also perform an experiment by combining the two data sets called MT-BOTH with a total of 9 distinct classification heads representing all clinical modifiers from both data sets.

Fig. 2

Overview of transfer learning process. Thick arrows indicate aggregation of training data. Thin arrows indicate the training data used for supervised fine-tuning. Medium width grey arrows indicate the use of a previously fine-tuned model. Model names are prefixed with the architecture type (ST or MT) and postfixed with the most recent training dataset the model has been fine-tuned on
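The head counts quoted for the transfer models follow from the overlap between the two modifier inventories. The snippet below is a minimal sketch of that bookkeeping. It assumes the OUD modifiers are negation, subject, uncertainty, DocTime, and illicit drug use (inferred from the data set description and the list of shared modifiers), and it assumes that heads for shared modifiers are the ones whose weights carry over. The identifiers and the `plan_heads` helper are hypothetical and are not taken from the released software.

```python
# Modifier inventories as described in the text (identifier spellings are illustrative).
SHARE_MODIFIERS = {"negation", "severity", "course", "subject",
                   "uncertainty", "conditional", "generic"}
OUD_MODIFIERS = {"negation", "subject", "uncertainty",
                 "doctime", "illicit_drug_use"}

# Modifiers present in both corpora, and in either corpus (the MT-BOTH setting).
SHARED = SHARE_MODIFIERS & OUD_MODIFIERS
BOTH = SHARE_MODIFIERS | OUD_MODIFIERS


def plan_heads(source: set, target: set) -> dict:
    """Label each target head as reusing a source head or starting fresh.

    Assumption: a head is reusable only when the same modifier exists
    in the source data set.
    """
    return {m: ("transferred" if m in source else "new") for m in sorted(target)}


# Heads of a model fine-tuned on ShARe first and OUD second (MT-SHR-OUD).
mt_shr_oud = plan_heads(SHARE_MODIFIERS, OUD_MODIFIERS)
# Heads of a model fine-tuned on OUD first and ShARe second (MT-OUD-SHR).
mt_oud_shr = plan_heads(OUD_MODIFIERS, SHARE_MODIFIERS)
```

Under these assumptions the target models end up with 5 and 7 heads, three of which are shared, and the union gives the 9 heads of MT-BOTH.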

For all models we use the final hidden vector corresponding to the [CLS] token (\(\mathbf {h_{cls}} \in \mathbb {R}^H\)) generated from BioBERT(\(\widehat{X}\)) as the common feature vector that is passed to each modifier classification head, which is a linear layer with learnable parameter \(\textbf{W}_i \in \mathbb {R}^{m_i \times H}\). Formally, the probability distribution of a modifier type is:

$$\begin{aligned} P\left(\hat{y}_{m_i}|\widehat{X}\right) = \textrm{softmax}(\textbf{W}_i \textbf{h}_{cls} + \textbf{b}_i) \end{aligned}$$

where \(y_{m_i}\) is a label of the modifier \(m_i\). We train the model using cross-entropy loss for each classifier:

$$\begin{aligned} L_i = -\sum \limits _{1}^{k} P(y_{m_i}) \log P\left(\hat{y}_{m_i}|\widehat{X}\right) \end{aligned}$$

where k is the length of the batch. The final loss is the average over all classifiers:

$$\begin{aligned} L = \frac{1}{n} \sum \limits _{i=1}^{n} L_i \end{aligned}$$

where n represents the number of modifier heads in the model. Additionally, we experiment with the use of focal loss [29] on our model, a type of loss function commonly used in deep learning for tasks involving imbalanced data sets.
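To make the loss concrete, the following is a minimal pure-Python sketch of the per-head softmax, the per-head batch loss, and the average over heads. It uses the conventional negative log-likelihood form of cross-entropy with one-hot gold labels. The focal loss variant omits class weighting, and its default `gamma=2.0` is an assumption, not a value reported here. All function names are hypothetical, and a real implementation would operate on tensors, not Python lists.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one head's logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, gold):
    """Negative log-probability of the gold class index."""
    return -math.log(probs[gold])


def focal_loss(probs, gold, gamma=2.0):
    """Cross-entropy down-weighted for confidently correct predictions."""
    p_t = probs[gold]
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


def head_loss(batch_logits, batch_gold, loss_fn=cross_entropy):
    """L_i: sum of the per-example losses of one head over a batch."""
    return sum(loss_fn(softmax(z), y) for z, y in zip(batch_logits, batch_gold))


def multitask_loss(logits_by_head, gold_by_head, loss_fn=cross_entropy):
    """L: average of the per-head losses across all n heads."""
    per_head = [head_loss(logits_by_head[h], gold_by_head[h], loss_fn)
                for h in logits_by_head]
    return sum(per_head) / len(per_head)


# Toy batch of one example with a binary head and a three-class head.
logits = {"negation": [[0.0, 0.0]], "severity": [[0.0, 0.0, 0.0]]}
gold = {"negation": [0], "severity": [1]}
loss = multitask_loss(logits, gold)
```

With `gamma=0` the focal loss reduces to cross-entropy, which makes the two settings easy to compare on the same batch.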

Feature extraction

We have adopted the question-answering input format described in the original BERT [23] to fine-tune BioBERT and adapt it to modifier prediction. As illustrated in Fig. 1, the model gets two sequences as input. The leftcontext sequence is a disorder mention and its context and the rightcontext sequence is the string of the entity itself. We chose the context to be 200 characters before the mention and 50 characters after, an empirically driven hyper-parameter choice that achieved better performance than word-based and sentence-based context. This choice was made after experimenting with 50, 100, 150, and 200 character offset combinations before and after the disorder mention. We did not experiment with sentence boundary offsets because sentence boundaries are not well-formed in clinical text [30]; for instance, some clinical notes contain paragraphs that use commas instead of periods. Additionally, OUD modifier annotations were not restricted to sentence boundaries. For discontiguous entities, only the strings that represent the entity are used. An example of what is passed to the model is: [CLS] The patent[sic] was found to be in fulminant liver failure. There she was having hallucinations, suicidal ideations and ... [SEP] hallucinations. A second example would pair the same first sequence with suicidal ideations as the second sequence.

The goal of this design is to help direct the model’s attention to the desired entity in order to extract its modifiers. Specifically, the entity is contextualized with the surrounding words in the first sequence, which generally include the modifiers. In addition, the second sequence redirects the attention of the aggregate sequence representation ([CLS]) to the entity under consideration (hallucinations in the first aforementioned example).
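The input construction described above can be sketched as follows. This is an illustrative sketch, not the released code: the function name and the sample note are hypothetical, the window defaults mirror the 200 and 50 character offsets, and the special tokens are written out as plain text only to mirror the example. In practice a BERT tokenizer would add them when given the two sequences as a pair.

```python
def build_model_input(note: str, start: int, end: int,
                      before: int = 200, after: int = 50) -> str:
    """Format one entity mention, note[start:end], as a two-sequence input.

    The first sequence is the mention with up to `before` characters of
    preceding context and `after` characters of following context; the
    second sequence is the mention string itself.
    """
    left = max(0, start - before)        # clamp at the start of the note
    right = min(len(note), end + after)  # clamp at the end of the note
    context = note[left:right]
    entity = note[start:end]
    return f"[CLS] {context} [SEP] {entity}"


# Hypothetical note text, used only to show the output shape.
note = "Patient denies chest pain but reports shortness of breath."
start = note.index("chest pain")
example = build_model_input(note, start, start + len("chest pain"))
```

Because the window is character-based, two mentions that sit close together receive almost identical first sequences and are distinguished mainly by the second sequence.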

Model training

For training, we follow standard procedures and use the curated training data set for both data sources to develop our models. Hyper-parameters are optimized using the designated development set. We have trained our model for 10 epochs with an empirically derived early stop. Details are in Fig. 1 of the Supplementary Materials. The maximum sequence length is 144, the learning rate is 2e-5, the weight decay is 1e-2, and the batch size is 64. AdamW is used as our optimizer. Similar to Xu et al. [17], the training and development set are combined to build our final model. We report the results on their respective test sets. We used a single Tesla P100 GPU with 16GB memory to run all experiments. The model will be made available upon the acceptance of the publication.


We conducted the following experiments:

  • To assess the performance of transfer learning and multi-task training for clinical modifiers, we evaluate MT on the OUD corpus and the ShARe corpus. We compare our results to previously reported results for the ShARe corpus and against a generalized clinical modifier model that combines all training examples from both corpora.

  • To evaluate the feasibility of domain adaptation for clinical modifiers when only a portion of clinical modifiers match the target and source domain, we perform bidirectional fine-tuning between the OUD and ShARe corpus. We fine-tune on the source domain, then perform an additional round of fine-tuning on the target domain to create 2 models (MT-SHR-OUD) and (MT-OUD-SHR) that are fine-tuned first on the ShARe corpus or OUD corpus respectively. We include a classification head for each modifier from both data sets for a total of # heads.

  • We performed 2 ablation studies on both the OUD and ShARe corpus by (1) removing the disorder mention after the [SEP] or (2) replacing the MT model heads with a single-headed (ST) model, where a model is trained separately for each modifier.


To evaluate a system for identifying rare values of different modifiers, the original challenge for the ShARe corpus used weighted accuracy, where the prevalence of different values for each of the modifiers is considered. Specifically, for each modifier \(m_i\) the weights are calculated as follows

$$\begin{aligned} weight \left( m_i^k \right) =1-prevalence \left( m_i^k\right) \end{aligned}$$

where k represents the different classes of the modifier \(m_i\) as described in the task description paper [13]. We used the evaluation script from the challenge organizers to compare with the previous state-of-the-art system [17]. We have also used the standard unweighted accuracy and micro-averaged F1 to compare against other later works (Table 2). For the OUD data set, we used the standard unweighted accuracy and macro-averaged F1 (including the null class) and micro-averaged F1 for possible future comparison.
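As an illustration of the weighting formula, the sketch below computes class weights from prevalence and a weighted accuracy for one modifier. It assumes that prevalence is measured on the gold labels being scored and that each correct prediction earns the weight of its gold class, normalized by the total weight. The official challenge script may aggregate differently, so this should not be read as a reimplementation of it. The function names are hypothetical.

```python
from collections import Counter


def class_weights(gold):
    """weight(class) = 1 - prevalence(class), with prevalence from the gold labels."""
    counts = Counter(gold)
    n = len(gold)
    return {label: 1.0 - count / n for label, count in counts.items()}


def weighted_accuracy(gold, pred):
    """Share of the total class weight that was predicted correctly."""
    weights = class_weights(gold)
    total = sum(weights[g] for g in gold)
    if total == 0:
        # Only one class present: every weight is zero, so use plain accuracy.
        return sum(g == p for g, p in zip(gold, pred)) / len(gold)
    earned = sum(weights[g] for g, p in zip(gold, pred) if g == p)
    return earned / total


# Toy example: a default class covering 90% of instances.
gold = ["unmarked"] * 9 + ["slight"]
always_default = ["unmarked"] * 10
score = weighted_accuracy(gold, always_default)
```

In this toy case a system that always predicts the default class has an unweighted accuracy of 0.9 but a weighted accuracy of about 0.5, which is why the measure is informative for rare modifier values.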

Chi-square test

We perform a Chi-square test [31] to compute the statistical difference between our model results and previous results through comparing the number of correct and incorrect predictions. We used accuracy and the total number of examples in the test set if these numbers were not available.
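A comparison of this kind can be sketched as a chi-square test on a 2x2 table of correct and incorrect predictions for two systems. The sketch below assumes Pearson's statistic without continuity correction and one degree of freedom; whether a correction was applied is not stated in the text. The function names are hypothetical, and the counts in the usage lines are made up for illustration.

```python
import math


def counts_from_accuracy(accuracy: float, n: int):
    """Recover (correct, incorrect) counts when only accuracy and n are known."""
    correct = round(accuracy * n)
    return correct, n - correct


def chi_square_2x2(a: int, b: int, c: int, d: int):
    """Pearson chi-square for the table [[a, b], [c, d]] and its p-value.

    Rows are systems, columns are correct and incorrect predictions.
    For one degree of freedom the survival function is erfc(sqrt(x / 2)).
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        # A row or column is empty, so the table carries no contrast.
        return 0.0, 1.0
    chi2 = n * (a * d - b * c) ** 2 / denom
    return chi2, math.erfc(math.sqrt(chi2 / 2.0))


# Hypothetical systems scored on the same 100 test examples.
ours = counts_from_accuracy(0.90, 100)
theirs = counts_from_accuracy(0.80, 100)
chi2, p_value = chi_square_2x2(*ours, *theirs)
```

Reconstructing counts from a rounded accuracy introduces a small error, so results near the significance threshold should be treated with caution.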


Multi-task training and transfer learning performance are shown in Table 2 for the ShARe corpus and in Table 3 for the OUD corpus. Detailed explanations of the results are in the next subsections.

Table 2 Model Performance on the ShARe corpus

Comparative performance of multi-task training

We compare our results to previously reported results using the same metrics as the originally reported result. For the ShARe corpus, the top portion of Table 2 contains the results from using the rule-based cTAKES system, the previous SOTA model [17] and our multi-task (MT-SHR) model on the ShARe corpus using the weighted accuracy measure. We also report the micro-averaged F1 score and unweighted accuracy in the middle and bottom portions of Table 2, respectively. Our multi-task model (MT-SHR) performs better in all modifiers except the conditional modifier based on weighted accuracy and is on par with the SOTA model for the negation modifier. Our model shows a statistically significant improvement in all other modifiers by at least 1%. For example, the uncertainty modifier scored the highest improvement of 4% among other modifiers. On average, our model improved the performance by slightly over 1%. The low performance on the conditional modifier had a disproportionate impact on the final average results. In both corpora, training with cross-entropy loss was superior to focal loss despite class imbalance in modifier distribution.

Domain adaptation

Results for domain adaptation are shown in Table 2 for the ShARe corpus (MT-OUD-SHR) and in Table 3 for the OUD corpus (MT-SHR-OUD). Initial training on one data set followed by additional training on another data set (MT-OUD-SHR in Table 2 and MT-SHR-OUD in Table 3) with partially similar modifiers led to similar or better performance compared to training on a single data set. On the micro-averaged F1 score, the performance improved by 3.3% compared to training only on the ShARe data set. However, merging the two data sets (MT-BOTH) decreased the average performance by 2.6%.

Table 3 Model Performance on the OUD Corpus Modifiers

Ablation study

Results from our ablation studies are presented in Table 2 for the ShARe corpus and in Table 3 for the OUD corpus, at the end of the first section of each table. Removal of the mention after the [SEP] (“no hint”) decreases the performance of the model for both data sets, as has been seen in similar research experiments [32]. The second to last row of the first section in Tables 2 and 3 shows results evaluating the effectiveness of multi-task training (MT) versus standard fine-tuning (ST) on each modifier separately. For the ShARe data set, the performance of ST dropped by an average of 2% compared with MT-SHR. It is even worse than the previous SOTA, which is based on SVM, by 1% on average.

Error analysis

In our study, we carefully selected 10 errors from each modifier type for a total of 70 errors for the ShARe and 50 for the OUD test sets. This gave us a diverse range of errors to examine closely. We used the predictions of MT-SHR and MT-OUD for our analysis. A key finding from our analysis was that in 44% of cases from the ShARe corpus, our model made correct predictions, but these appear to have been annotated incorrectly in the data set. For example, in the sentence “There was no rebound or guarding,” our model correctly identified the negation, but this was not reflected in the gold data set. This kind of inconsistency was also present in the initial version of the OUD data set, but we have since rectified these issues, elevating its reliability to at least match, if not exceed, that of the ShARe data set.

In the context of modifier evaluations, false negatives occur when the system predicts the ‘null’, ‘unmarked’ or default class. This occurred in 46% of our examples in the ShARe data set and 58% in the OUD data set. The default case covers an average of 92% of instances in both data sets; this class imbalance means our model tends to overlook or incorrectly classify under-represented modifiers. For instance, in “He had slight decreased sensation in his right upper extremity,” the model failed to identify ‘slight’ as the severity level, opting instead for ‘unmarked’. Within the OUD data set, the model struggles to identify the negation present in the clinical note section for the psychological state. It also struggles with abbreviations or alternatives for negations, like ‘NEG’, ‘0’, ‘Zero’, and ‘None’. For instance, “Family History: 0 suicide attempts” has a 0 that was not recognized by the model as a negation for suicide attempts. For the DocTime modifier, the model has some trouble recognizing the current annotated text as a part of the history section of the clinical note - deeming all information included as ‘Before’.

False positives accounted for about 54% of ShARe errors and 42% of OUD errors. These errors often stemmed from confusing contexts. For example, in the phrase “Per her daughter she has been having shortness of breath,” the model misinterpreted ‘shortness of breath’ as referring to the daughter, not the patient. Similarly, in the OUD data set, the model was confused by drug/lab test contexts. In sentences like “Lab Results: U Methadone Negative U Opiates Positive U Oxycodone Negative,” the presence of the word ‘Negative’ misled the model for the affirmed mention ‘U Opiates’.

Lastly, our system made unexpected errors about 10% of the time. On occasion, it failed to recognize terms such as “denies” as negations. This may reflect the varied contextual language and inconsistent use of “denies” by physicians: conditions that the physician suspects but the patient denies are recorded as “denies”, whereas annotation guidelines impose consistency on a more nuanced note. Other errors were due to fine distinctions between different classifications, such as failing to differentiate between ‘increased’ and ‘worsened’ in symptom descriptions. For instance, the model predicted that the fatigue increased while it was annotated as worsened in the example “Three days of progressive fatigue.”

Finally, for the ShARe corpus, almost all of the examples have a single clinical entity mention within the chosen context. However, a duplicate clinical entity mention occurs in the same context window in 4% of the examples and in 2% of the examples the same clinical entity mention occurs 3 or more times. This can cause ambiguity since the clinical entity modifiers can only be distinguished by slightly different ends to their context window. For more examples and a detailed look at these points, please refer to Table 4.

Table 4 TEST Error Analysis for the ShARe and OUD Data Set


Multi-task transfer learning

Comparison to previous work

The adoption of transfer learning and multi-task training (MT) yields an overall improvement of 10 points (MT-SHR) on the micro-F1 score when compared to previous work by Xu et al. [18] using a Bi-LSTM architecture, as shown in Table 2. This improvement rises to 14 points when an initial round of fine-tuning is done on the OUD corpus (MT-OUD-SHR) and occurs although there is only partial alignment of modifier types between the two data sets. Combining multiple data sets (MT-BOTH), similar to the work of Khandelwal and Britto [22], yields a model that can predict all modifiers from both input data sets. This flexibility comes with a slight, but statistically significant, overall performance penalty on the OUD data set, as seen in the bottom row of Table 3.

Transfer learning on modifiers common between data sets

Entity modifiers for negation, subject, and uncertainty are shared between the ShARe and OUD data sets. For these common entity modifiers, our results indicate that the target data set can benefit from a previous round of fine-tuning on the source data set. This benefit is more pronounced when the set of modifiers shared between data sets have sufficient training instances. For example, micro F1 performance was improved on the ShARe corpus for the shared negation and subject modifiers, both of which have a proportionally higher number of training examples. We are uncertain as to why the same benefit was not shown on the OUD corpus for these high frequency common modifiers, but the additional round of fine-tuning did not decrease performance. For low frequency modifiers like the uncertainty modifier, results improved when transferring from ShARe to OUD (MT-SHR-OUD) but decreased when transferring from OUD to ShARe (MT-OUD-SHR) compared to MT since OUD has few examples with annotated uncertainty.

Transfer learning on modifiers uncommon between data sets

One surprising result of our approach is that transfer learning performance was improved on modifiers found only in the target data set, even when they were not present in the source data set. We can see this for the result of course, generic, and conditional in MT-OUD-SHR and DT and IUD in MT-SHR-OUD. This could be the result of the initial round of training where the model has seen more examples to learn the task. However, combining the two data sets for training one model (MT-BOTH) had mixed effects on the two data sets. This may be the result of relative class imbalance for entity modifier types between source and target data sets.

Efficiency of transfer learning between data sets

Multi-task training gave better performance compared to single-task training (see Table 2) and it was more efficient. On average, it took 6 minutes to train our MT model for one epoch, while it took 5 minutes to train an ST model, which had to be repeated for each modifier. Overall, the training time and resource cost for the MT model was reduced by at least 60% compared to training all ST models. Multi-task training and transfer learning from one data set to the other was also more efficient than combining the two data sets (MT-BOTH). Training MT-BOTH required 20 epochs for both data sets whereas two consecutive rounds of multi-task transfer learning (MT-OUD-SHR and MT-SHR-OUD) reduced that cost by almost 50%. We noticed that MT-BOTH needed significantly more time to learn the uncommon modifiers compared to learning the common modifiers.
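The reported saving can be checked with a short calculation from the per-epoch timings. The sketch below assumes that the MT model and every ST model train for the same number of epochs and that the 6 and 5 minute figures apply to both corpora; neither assumption is stated explicitly in the text. The function name is hypothetical.

```python
def mt_saving(n_modifiers: int, mt_minutes: float = 6.0,
              st_minutes: float = 5.0) -> float:
    """Fraction of per-epoch training time saved by one MT model.

    The baseline is one ST model per modifier, each costing `st_minutes`.
    """
    st_total = n_modifiers * st_minutes
    return 1.0 - mt_minutes / st_total


# Seven ShARe modifiers and five OUD modifiers, per the head counts in the text.
share_saving = mt_saving(7)
oud_saving = mt_saving(5)
```

Under these assumptions the saving is about 83% for the seven ShARe modifiers and 76% for the five OUD modifiers, and it reaches 60% as soon as three modifiers are trained jointly, which is consistent with the "at least 60%" figure.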


To date the only corpus containing clinical modifiers of entities that has been published to our knowledge is the ShARe corpus from SemEval 2015 Task 14 [13]. This data set, in conjunction with the OUD data set, leaves only two data sets for evaluation. We do not evaluate large language models (LLMs) in this work, but do not believe this is needed, given our task is an information extraction task, fine-tuning was done and recent work suggests that domain models are capable of outperforming LLMs in this domain [33]. Additionally, we do not evaluate the anatomy modifier of the ShARe corpus since our approach requires training data for all class types and more than one-third of the classes in the ShARe test set are not in the training set. We are exploring synthetic data to address this issue.


Our results indicate that multi-task training can be beneficial for modifier identification and we show state-of-the-art performance on the ShARe corpus. Finally, our experiments suggest that an additional round of fine-tuning on a similar data set can be more effective and efficient than training a transformer model on a combined data set, even if modifiers from the two data sets only partially overlap.

Availability of data and materials

The ShARe corpus is available upon request from its distributors, and we will release the OUD corpus after de-identification is complete. The corpus will be released through PhysioNet [34] under the PhysioNet Credentialed Health Data License 1.5.0, using the same de-identification methodology [35] used previously. Software to replicate results is available at entity modifiers.

BERT: Bidirectional Encoder Representations from Transformers
BiLSTM: Bidirectional LSTM
BioBERT: Biomedical BERT
CRF: Conditional Random Field
FL: Focal Loss
IDU: Illicit Drug Use
LLM: Large Language Model
LSTM: Long-Short-Term Memory
NER: Named Entity Recognition
NLP: Natural Language Processing
OUD: Opioid Use Disorder
RoBERTa: Robustly Optimized BERT Pre-training Approach
ShARe: Shared Annotated Resources
SHR: ShARe data set
SVM: Support Vector Machine
XLNet: Generalized Autoregressive Pretraining for Language Understanding

  1. Zhong Z, Chen D. A frustratingly easy approach for entity and relation extraction. 2020. arXiv preprint arXiv:2010.12812.

  2. Wadden D, Wennberg U, Luan Y, Hajishirzi H. Entity, relation, and event extraction with contextualized span representations. 2019. arXiv preprint arXiv:1909.03546.

  3. Soares LB, FitzGerald N, Ling J, Kwiatkowski T. Matching the blanks: distributional similarity for relation learning. 2019. arXiv preprint arXiv:1906.03158.

  4. Fraile Navarro D, Ijaz K, Rezazadegan D, Rahimi-Ardabili H, Dras M, Coiera E, Berkovsky S. Clinical named entity recognition and relation extraction using natural language processing of medical free text: A systematic review. Int J Med Inform. 2023;177:105122.

  5. Chapman WW, Bridewell W, Hanbury P, Cooper GF, Buchanan BG. Evaluation of negation phrases in narrative clinical reports. In: Proceedings of the AMIA Symposium. American Medical Informatics Association; 2001. p. 105.

  6. Chapman WW, Bridewell W, Hanbury P, Cooper GF, Buchanan BG. A simple algorithm for identifying negated findings and diseases in discharge summaries. J Biomed Inform. 2001;34(5):301–10.


  7. Chapman WW, Hilert D, Velupillai S, Kvist M, Skeppstedt M, Chapman BE, et al. Extending the NegEx lexicon for multiple languages. Stud Health Technol Inform. 2013;192:677.


  8. Mirzapour M, Abdaoui A, Tchechmedjiev A, Digan W, Bringay S, Jonquet C. French FastContext: A publicly accessible system for detecting negation, temporality and experiencer in French clinical notes. J Biomed Inform. 2021;117:103733.


  9. Chapman W, Dowling J, Chu D. ConText: An algorithm for identifying contextual features from clinical text. In: BioNLP 2007: Biological, translational, and clinical language processing. Prague: Association for Computational Linguistics; 2007. pp. 81–88.

  10. Harkema H, Dowling JN, Thornblade T, Chapman WW. ConText: an algorithm for determining negation, experiencer, and temporal status from clinical reports. J Biomed Inform. 2009;42(5):839–51.


  11. Shi J, Hurdle JF. Trie-based rule processing for clinical NLP: A use-case study of n-trie, making the ConText algorithm more efficient and scalable. J Biomed Inform. 2018;85:106–13.


  12. Jagannatha A, Liu F, Liu W, Yu H. Overview of the first natural language processing challenge for extracting medication, indication, and adverse drug events from electronic health record notes (MADE 1.0). Drug Saf. 2019;42:99–111.

  13. Elhadad N, Pradhan S, Gorman S, Manandhar S, Chapman W, Savova G. SemEval-2015 Task 14: Analysis of Clinical Text. In: Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015). Denver: Association for Computational Linguistics; 2015. pp. 303–310.

  14. Friedman C, Hripcsak G, et al. Natural language processing and its future in medicine. Acad Med. 1999;74(8):890–5.


  15. Savova GK, Masanz JJ, Ogren PV, Zheng J, Sohn S, Kipper-Schuler KC, et al. Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES): architecture, component evaluation and applications. J Am Med Inform Assoc. 2010;17(5):507–13.


  16. Dligach D, Bethard S, Becker L, Miller T, Savova GK. Discovering body site and severity modifiers in clinical texts. J Am Med Inform Assoc. 2014;21(3):448–54.


  17. Xu J, Zhang Y, Wang J, Wu Y, Jiang M, Soysal E, et al. UTH-CCB: The Participation of the SemEval 2015 Challenge – Task 14. In: Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015). Denver: Association for Computational Linguistics; 2015. pp. 311–314.

  18. Xu J, Li Z, Wei Q, Wu Y, Xiang Y, Lee HJ, et al. Applying a deep learning-based sequence labeling approach to detect attributes of medical concepts in clinical text. BMC Med Inform Decis Mak. 2019;19(5):1–8.


  19. Shi X, Yi Y, Xiong Y, Tang B, Chen Q, Wang X, et al. Extracting entities with attributes in clinical text via joint deep learning. J Am Med Inform Assoc. 2019;26(12):1584–91.


  20. Caruana R. Multitask learning. Mach Learn. 1997;28:41–75.


  21. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I. Attention is all you need. In: Advances in Neural Information Processing Systems 30. 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA; 2017.

  22. Khandelwal A, Britto BK. Multitask Learning of Negation and Speculation using Transformers. In: Proceedings of the 11th International Workshop on Health Text Mining and Information Analysis. Online: Association for Computational Linguistics; 2020. pp. 79–87.

  23. Devlin J, Chang MW, Lee K, Toutanova K. Bert: Pre-training of deep bidirectional transformers for language understanding. 2018. arXiv preprint arXiv:1810.04805.

  24. Yang Z, Dai Z, Yang Y, Carbonell J, Salakhutdinov RR, Le QV. XLNet: Generalized autoregressive pretraining for language understanding. Adv Neural Inf Process Syst. 2019;32.

  25. Liu Z, Lin W, Shi Y, Zhao J. A robustly optimized BERT pre-training approach with post-training. In: China National Conference on Chinese Computational Linguistics. Springer; 2021. pp. 471–484.

  26. Lee J, Yoon W, Kim S, Kim D, Kim S, So CH, et al. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics. 2020;36(4):1234–40.


  27. Alsentzer E, Murphy JR, Boag W, Weng WH, Jin D, Naumann T, et al. Publicly available clinical BERT embeddings. 2019. arXiv preprint arXiv:1904.03323.

  28. Gu Y, Tinn R, Cheng H, Lucas M, Usuyama N, Liu X, et al. Domain-specific language model pretraining for biomedical natural language processing. ACM Trans Comput Healthc (HEALTH). 2021;3(1):1–23.


  29. Lin TY, Goyal P, Girshick R, He K, Dollár P. Focal loss for dense object detection. In: Proceedings of the IEEE International Conference on Computer Vision. New York: IEEE; 2017. pp. 2980–8.

  30. Griffis D, Shivade C, Fosler-Lussier E, Lai AM. A quantitative and qualitative evaluation of sentence boundary detection for the clinical domain. AMIA Summits Transl Sci Proc. 2016;2016:88.


  31. Pearson K. On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. Lond Edinb Dublin Philos Mag J Sci. 1900;50(302):157–75.

  32. Webson A, Pavlick E. Do Prompt-Based Models Really Understand the Meaning of Their Prompts? In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Seattle: Association for Computational Linguistics; 2022. pp. 2300–2344.

  33. Lehman E, Hernandez E, Mahajan D, Wulff J, Smith MJ, Ziegler Z, et al. Do we still need clinical language models? 2023. arXiv preprint arXiv:2302.08091.

  34. Goldberger AL, Amaral LA, Glass L, Hausdorff JM, Ivanov PC, Mark RG, et al. PhysioBank, PhysioToolkit, and PhysioNet: components of a new research resource for complex physiologic signals. Circulation. 2000;101(23):e215–20.


  35. Osborne JD, Booth JS, O’Leary T, Mudano A, Rosas G, Foster PJ, et al. Identification of gout flares in chief complaint text using natural language processing. In: AMIA Annual Symposium Proceedings, vol. 2020. American Medical Informatics Association; 2020. p. 973.



We would like to acknowledge Tobias O’Leary for OUD data set management and UAB Research Computing for use of their hardware for these experiments.


Funding for this work was provided by the Alabama Department of Mental Health OUD Center of Excellence, including support for SF, WC, AIA and JO. LW was funded by grant H79TI081609 from the Substance Abuse and Mental Health Services Administration (SAMHSA), and WB was funded by grant T32HS013852 from the Agency for Healthcare Research and Quality. Prior to OUD Center of Excellence funding, JO received funding from NIH grant P30AR072583 BIGDATA.

Author information




OUD project aspects and modifiers were conceived by LW, EE, and JO. The application of transfer and multi-task learning methods to these data sets was conceived by AIA and JO, who also wrote the original draft of the manuscript with AIA. Substantial edits were performed by SF, EE and WB. Project oversight was provided by JO and SF. Source code development was done by AIA and AA, overseen by JO. WC, CC, EC, ZD, and JH performed annotation work and revised annotation guidelines, with help from AIA, WB, and JO who were responsible for revisions. Error analysis was conducted by AIA, WC, and JH. All authors reviewed and approved the final manuscript.

Corresponding authors

Correspondence to Abdullateef I. Almudaifer or John D. Osborne.

Ethics declarations

Ethics approval and consent to participate

Consent to participate was waived under IRB-121114001, “Using Text Mining to Extract Information from Text Documents in the Electronic Health Record”.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. The Creative Commons Public Domain Dedication waiver applies to the data made available in this article, unless otherwise stated in a credit line to the data.


About this article


Cite this article

Almudaifer, A.I., Covington, W., Hairston, J. et al. Multi-task transfer learning for the prediction of entity modifiers in clinical text: application to opioid use disorder case detection. J Biomed Semant 15, 11 (2024).
