In this section, we describe our system in more detail. In our previous work, we used a dictionary-based approach to obtain a set of synonyms for a given term, and then obtained their frequencies from a large collection of texts in order to propose the most frequent synonym as the simplest one. The novelty of this paper lies in using word embeddings to find the simplest synonym for a given term. This approach overcomes the main limitation of our previous work because it does not depend on the existence of a dictionary of synonyms for a given domain and language.
To evaluate our approach, we use the EasyDPL (Easy Drug Package Leaflets) corpus [13]. This corpus consists of 306 leaflets written in Spanish and manually annotated with 1400 adverse drug effects and their simplest synonyms.
As illustrated in Fig. 1, the overall architecture of our system comprises three separate components. First, the leaflets are processed and their adverse drug effects are annotated using a dictionary-based approach. Second, for each identified effect, we obtain its vector from a pre-trained word embedding model. Third, because words with similar meanings usually have similar vectors in a word embedding model, we use this model to obtain the most similar vectors for a given term, which serve as its synonym candidates. In the following subsections we describe each of these tasks in detail.
Recognizing adverse drug effects
In our study, we focus on the simplification of adverse drug effects because evidence shows that patients often misinterpret or do not understand much of the information in the section describing these effects. Therefore, the first task to solve is the recognition of adverse drug effects in texts. To do this, we developed a NER (named entity recognition) module based on a dictionary-based approach that combines terminological resources such as the ATC system (a drug classification system developed by the World Health Organization), CIMA (a database that contains information on all drugs authorized in Spain, with a total of 16,418 brand drugs and 2,228 generic drugs) and several dictionaries gathered from websites about health and medicines such as MedlinePlus, vademecum.es or prospectos.net. Among the different resources used by the NER module, the MedDRA dictionary stands out for its broad coverage of events associated with drugs. The main advantage of MedDRA is that its structured format makes it easy to obtain a list of possible drug effects and their synonyms. MedDRA is organized as a five-level hierarchy. The most specific level, “Lowest Level Terms” (LLTs), contains a total of 72,072 terms that express how information is communicated in practice. Another important online resource for the NER module is MedlinePlus. It provides health information for patients and contains more than 1000 articles about diseases and 6000 articles about medicines. The Spanish version is one of the most comprehensive and trusted Spanish-language health websites at the moment. We developed a web crawler to browse and download pages related to drugs and diseases from its website. Each MedlinePlus article provides exhaustive information about a given medical concept, and also proposes a list of health-related topics, which can be considered as synonyms of this concept.
Moreover, an article related to a given medical concept could be used to obtain the definition of this concept by getting its first sentence. The reader can find a detailed description of the NER module in [34].
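To illustrate the general idea of dictionary-based recognition, the following is a minimal sketch of a greedy longest-match lookup over tokens. It is not the NER module described in [34]; the function name `find_effects`, the `max_len` parameter and the toy dictionary are assumptions introduced only for this example.

```python
def find_effects(tokens, dictionary, max_len=5):
    """Greedy longest-match lookup of dictionary terms in a token list.

    Returns (start, end, term) triples, where `end` is exclusive.
    `dictionary` is a set of lower-cased terms (hypothetical toy data here).
    """
    lowered = [t.lower() for t in tokens]
    matches = []
    i = 0
    while i < len(lowered):
        # Try the longest span first so that multi-word terms win over their parts.
        for n in range(min(max_len, len(lowered) - i), 0, -1):
            term = " ".join(lowered[i:i + n])
            if term in dictionary:
                matches.append((i, i + n, term))
                i += n
                break
        else:
            # No term starts at position i; move to the next token.
            i += 1
    return matches


# Toy dictionary standing in for the merged terminological resources.
toy_dictionary = {"dolor de cabeza", "dolor", "vómitos"}
toy_tokens = "Puede causar dolor de cabeza y vómitos".split()
effects = find_effects(toy_tokens, toy_dictionary)
```

Trying the longest span first means that dolor de cabeza is annotated as a single effect rather than as the shorter entry dolor.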
Once adverse drug effects have been detected in the text, we can continue with the lexical simplification of these terms. We start by describing our baseline approach based on dictionaries. Then, we describe our approach using word embeddings.
Generating synonym candidates
As mentioned above, our goal is to simplify DPLs, in particular by replacing the terms describing adverse drug effects with synonyms that are easier for patients to understand. Once adverse drug effects are automatically identified in texts, the next step is to propose a set of synonyms for each of them.
An important drawback of our previous work is that it required dictionaries that provide a set of synonym candidates for a given word. To remedy this, we employ Word2Vec [35], a predictive model for learning word embeddings from raw texts. In particular, this model represents each word in a corpus as a vector in a semantic space. Thus, it is possible to compute the similarity of two words by calculating the cosine of the angle between their corresponding word vectors.
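As a brief illustration of the cosine measure mentioned above, the following sketch computes it for plain Python lists. The three-dimensional vectors are invented toy values, not entries of any real embedding model, and returning 0.0 for a zero vector is a convention chosen for this example.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: treat zero vectors as dissimilar to everything
    return dot / (norm_u * norm_v)


# Hypothetical toy vectors (real Word2Vec vectors have hundreds of dimensions).
toy_vectors = {
    "cefalea": [0.9, 0.1, 0.0],
    "jaqueca": [0.8, 0.2, 0.1],
    "sarpullido": [0.0, 0.2, 0.9],
}

sim_close = cosine_similarity(toy_vectors["cefalea"], toy_vectors["jaqueca"])
sim_far = cosine_similarity(toy_vectors["cefalea"], toy_vectors["sarpullido"])
```

With these toy values, the two headache terms obtain a higher similarity than the unrelated pair, which is the behaviour the candidate selection relies on.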
To obtain our synonym candidates, we use Cardellino’s pre-trained model [36], which is available to the research community and was built from several Spanish text collections such as the Spanish Wikipedia (2015), the OPUS corpora [37] or the Ancora corpus [38], among others. The training data contain nearly 1.5 billion words and the dimension of the word vectors is 300.
The simplest approach would be to select the synonym candidate with the highest semantic similarity to the original word. However, this approach may not work for polysemous words. As is well known, the context in which a word occurs plays a central role in identifying the sense of this word [39, 40]. Because word embeddings are able to capture the semantic similarity between words based on their contexts, our hypothesis is that the best synonym candidate should also be semantically similar to the words that occur around the original word. Therefore, we consider not only the semantic similarity between the synonym candidate and the original word, but also the similarity between the synonym candidate and the context words of the original word. To calculate the semantic similarity between a synonym candidate and the context words of the original word to be simplified, we compute the average cosine similarity between the synonym candidate and each context word, as proposed in [31]:
$$ csim(s,w)= \frac{1}{|C(w)|}\sum\limits_{w^{\prime} \in C(w)}cos(v_{s},v_{w^{\prime}}) $$
(1)
where s is the synonym candidate, w is the original word to be simplified, C(w) is the set of context words of w (we use a window of size three around w) and $v_{x}$ refers to the word vector of a word x.
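The context similarity of Eq. 1 can be sketched as follows. This is an illustrative reading rather than the exact implementation: we assume that the window covers three tokens on each side of w, and that context words missing from the embedding vocabulary are skipped. The names `context_window`, `csim` and the two-dimensional toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def context_window(tokens, index, size=3):
    """Tokens within `size` positions on each side of tokens[index], excluding it."""
    left = tokens[max(0, index - size):index]
    right = tokens[index + 1:index + 1 + size]
    return left + right


def csim(candidate, tokens, index, vectors, size=3):
    """Average cosine between the candidate and the context words of tokens[index]."""
    # Assumption: out-of-vocabulary context words are ignored.
    context = [t for t in context_window(tokens, index, size) if t in vectors]
    if candidate not in vectors or not context:
        return 0.0
    total = sum(cosine(vectors[candidate], vectors[t]) for t in context)
    return total / len(context)


# Hypothetical toy vectors, chosen only to make the arithmetic easy to follow.
toy_vectors = {"causar": [1.0, 0.0], "intensa": [0.0, 1.0], "jaqueca": [1.0, 1.0]}
toy_sentence = ["puede", "causar", "cefalea", "intensa"]
score = csim("jaqueca", toy_sentence, 2, toy_vectors)
```

In this toy sentence, puede has no vector and is skipped, so the score is the mean of the cosines between jaqueca and the two remaining context words.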
In some cases, the word to be simplified is already very simple, and therefore it is not necessary to replace it with a synonym. For example, adverse drug effects such as dolor de cabeza (headache), depresión (depression) or vómitos (vomiting) are already very easy to understand. Indeed, although the word embedding model is capable of proposing a set of synonyms for at least 72% of the adverse drug effects present in the EasyDPL corpus, these candidates are not always simpler than the original effect. Therefore, our system should be able to determine whether a candidate is simpler than the original word. According to Devlin and Unthank [41], the complexity of a word seems to be directly related to its degree of informativeness. In other words, the more informative a word is, the more complex it tends to be. To measure the informativeness of a word, we use the function defined in [41] and shown below:
$$ ci(w)= -log\left(\frac{freq(w)+1}{{\sum\nolimits}_{w^{\prime} \in C}freq(w^{\prime})+1}\right) $$
(2)
where freq(w) is the frequency of the word w in a collection of texts C. Using this function, the system replaces an original word with one of its candidates only if the candidate is less informative than the original word. To obtain the word frequencies, we use the Spanish version of the Google Books Ngram corpus [30]. In the EasyDPL corpus, over 51% of the adverse drug effects have at least one candidate that is less informative than the effect itself.
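The replacement rule based on Eq. 2 can be sketched as follows. We read the denominator as the total frequency of the collection plus one, which is an assumption about the intended grouping. The frequency counts are invented toy values, not Google Books Ngram figures, and `choose_replacement` is a hypothetical helper that expects the candidates to be already ranked by similarity.

```python
import math


def informativeness(word, freq, total):
    """ci(w) = -log((freq(w) + 1) / (total + 1)); unseen words get frequency 0."""
    return -math.log((freq.get(word, 0) + 1) / (total + 1))


def choose_replacement(original, ranked_candidates, freq):
    """Return the first candidate that is less informative than the original.

    If no candidate qualifies, the original word is kept unchanged.
    """
    total = sum(freq.values())
    original_ci = informativeness(original, freq, total)
    for candidate in ranked_candidates:
        if informativeness(candidate, freq, total) < original_ci:
            return candidate
    return original


# Hypothetical toy counts standing in for corpus frequencies.
toy_freq = {"cefalea": 10, "jaqueca": 200, "dolor": 5000}
replaced = choose_replacement("cefalea", ["jaqueca"], toy_freq)
kept = choose_replacement("dolor", ["cefalea", "jaqueca"], toy_freq)
```

Because the function is monotonically decreasing in freq(w), a candidate is less informative than the original exactly when it is more frequent in the collection.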
In a preliminary evaluation of our system, we noted that many errors occurred because over 48% of the gold standard synonyms proposed in the EasyDPL corpus are multi-word terms.
For example, the gold synonym for the effect anorexia (anorexia) is the noun phrase trastornos de la alimentación (eating disorders). Another example is acatisia (akathisia), whose gold standard synonym in the EasyDPL corpus is incapacidad de quedarse quieto (inability to stay still). Our approach cannot propose multi-word candidates because it is based on a word embedding model that only calculates the semantic similarity between vectors of single tokens. To overcome this problem, we propose a simple approach to obtain phrase embeddings. This approach consists of applying a set of patterns based on POS tags to detect noun phrases describing adverse drug effects. Some of these patterns are shown below:
- NN ADJ. This pattern recognizes adverse drug effects such as sueño anormal (abnormal dream).
- NN PREP NN. This pattern identifies adverse drug effects such as enfermedad del estómago (stomach disease).
- NN PREP VB. This pattern detects adverse drug effects such as problemas para tragar (difficulty swallowing).
- NN ADJ PREP NN. This pattern identifies adverse drug effects such as azúcar alta en sangre (high blood glucose).
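A minimal sketch of how such POS patterns can be matched against a tagged sentence is given below. The tag names follow the simplified labels used in the list above; a real Spanish tagger uses a different tagset that would have to be mapped to these labels first. The function `match_patterns` and the hand-tagged toy sentence are assumptions made for illustration.

```python
# Patterns listed above, expressed as tuples of simplified POS tags.
PATTERNS = [
    ("NN", "ADJ"),
    ("NN", "PREP", "NN"),
    ("NN", "PREP", "VB"),
    ("NN", "ADJ", "PREP", "NN"),
]


def match_patterns(tagged, patterns=PATTERNS):
    """Return every token span whose tag sequence equals one of the patterns.

    `tagged` is a list of (token, tag) pairs. Overlapping matches are all kept,
    so a long phrase and a shorter phrase contained in it may both be returned.
    """
    tags = [tag for _, tag in tagged]
    found = []
    for start in range(len(tagged)):
        for pattern in patterns:
            end = start + len(pattern)
            if tuple(tags[start:end]) == pattern:
                found.append(" ".join(token for token, _ in tagged[start:end]))
    return found


# Hand-tagged toy sentence (hypothetical tagger output).
toy_tagged = [
    ("problemas", "NN"), ("para", "PREP"), ("tragar", "VB"), ("y", "CONJ"),
    ("azúcar", "NN"), ("alta", "ADJ"), ("en", "PREP"), ("sangre", "NN"),
]
phrases = match_patterns(toy_tagged)
```

Keeping overlapping matches is a design choice of this sketch; a later filtering step can decide which of the nested phrases to retain.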
These patterns could recognize a huge number of noun phrases that are not actually adverse drug effects. To reduce this noise, we only consider noun phrases that contain at least one word belonging to the MedDRA dictionary. To obtain our set of phrase candidates for adverse drug effects, we process our collection of downloaded MedlinePlus articles. POS tagging was performed using the Python NLTK POS tagger adapted to the Spanish language. We gathered a total of 3000 phrase candidates that could describe adverse drug effects. Then, for each of these phrases, we obtain a phrase embedding by averaging the word embedding vectors of its content words (nouns, lexical verbs, adjectives and adverbs). Therefore, when we obtain the synonym candidates for a given adverse drug effect, we consider not only the most similar word embeddings of single words, but also the semantic similarity between the original effect and all phrases collected from MedlinePlus.
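The filtering and averaging steps described above can be sketched as follows. The tag labels, the helper names `has_meddra_word` and `phrase_embedding`, the toy vocabulary and the two-dimensional vectors are all assumptions for illustration. Skipping content words that have no vector is also a choice made here, since the text does not specify how such words are handled.

```python
CONTENT_TAGS = {"NN", "VB", "ADJ", "ADV"}  # nouns, lexical verbs, adjectives, adverbs


def has_meddra_word(tagged_phrase, meddra_words):
    """True if at least one token of the phrase appears in the given word set."""
    return any(token.lower() in meddra_words for token, _ in tagged_phrase)


def phrase_embedding(tagged_phrase, vectors):
    """Average the vectors of the content words of a phrase.

    Returns None when no content word of the phrase has a vector.
    """
    selected = [
        vectors[token]
        for token, tag in tagged_phrase
        if tag in CONTENT_TAGS and token in vectors
    ]
    if not selected:
        return None
    dim = len(selected[0])
    return [sum(vec[i] for vec in selected) / len(selected) for i in range(dim)]


# Hypothetical toy data; the vector for "del" is present to show it is ignored.
toy_vectors = {"enfermedad": [1.0, 0.0], "estómago": [0.0, 1.0], "del": [5.0, 5.0]}
toy_meddra_words = {"estómago"}
toy_phrase = [("enfermedad", "NN"), ("del", "PREP"), ("estómago", "NN")]

keep = has_meddra_word(toy_phrase, toy_meddra_words)
embedding = phrase_embedding(toy_phrase, toy_vectors) if keep else None
```

The resulting phrase vector lives in the same space as the single-word vectors, so it can be compared with the vector of the original effect using the same cosine measure.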