 Research
 Open Access
 Published:
Largescale online semantic indexing of biomedical articles via an ensemble of multilabel classification models
Journal of Biomedical Semanticsvolume 8, Article number: 43 (2017)
Abstract
Background
In this paper we present the approach that we employed to deal with large scale multilabel semantic indexing of biomedical papers. This work was mainly implemented within the context of the BioASQ challenge (2013–2017), a challenge concerned with biomedical semantic indexing and question answering.
Methods
Our main contribution is a MUltiLabel Ensemble method (MULE) that incorporates a McNemar statistical significance test in order to validate the combination of the constituent machine learning algorithms. Some secondary contributions include a study on the temporal aspects of the BioASQ corpus (observations apply also to the BioASQ’s superset, the PubMed articles collection) and the proper parametrization of the algorithms used to deal with this challenging classification task.
Results
The ensemble method that we developed is compared to other approaches in experimental scenarios with subsets of the BioASQ corpus giving positive results. In our participation in the BioASQ challenge we obtained the first place in 2013 and the second place in the four following years, steadily outperforming MTI, the indexing system of the National Library of Medicine (NLM).
Conclusions
The results of our experimental comparisons, suggest that employing a statistical significance test to validate the ensemble method’s choices, is the optimal approach for ensembling multilabel classifiers, especially in contexts with many rare labels.
Background
Introduction
MEDLINE is the premier bibliographic database of the National Library of Medicine (NLM) of the United States. In June 2017 MEDLINE contained over 27 million references to articles in life sciences with a focus on biomedicine. Each of these articles is manually indexed by human experts with concepts of the MeSH (Medical Subject Headings) ontology (also curated by NLM), such as Neoplasms, Female and Newborn. This manual indexing process entails significant costs in time and money. Human annotators need on average 90 days to complete 75% of the citation assignment for new articles [1]. For a publication with novel and important scientific results, the first period of its lifetime is quite important, yet it is in this period that the publication remains semantically invisible. For instance, if a researcher is searching for a particular MeSH term (e.g. Myopathy), he/she will not be able to retrieve the latest nonindexed articles that are related to this term, if they do not contain it literally. Moreover, the average indexing cost for an article is $9.40^{1}.
MEDLINE’s demand in manual indexing is steadily increasing as evident from Fig. 1, which plots the number of articles being added to MEDLINE each year from 1950 to 2017. At the same time, the available indexing budget at NLM is flat or declining. This highlights the importance of tools for automatic semantic indexing of biomedical articles. Such tools can help increase the productivity of human indexers by recommending them a ranked list of MeSH descriptors relevant to the article they are currently examining. In addition, such tools could replace junior indexers (not senior revisers) for journals where these tools achieve a high level of accuracy. Both usages of such tools are currently adopted by NLM.
From a machine learning perspective, constructing an automatic semantic indexing tool for MEDLINE poses a number of important challenges. First of all, there is a large number of training documents and associated concepts. In 2017, MeSH contained 28,489 descriptors, while PubMed contained over 27 million annotated abstracts. Efficient yet accurate learning and inference with such large ontologies and training sets is nontrivial. Additionally, MEDLINE is growing at a nontrivial rate of more than one million articles per year, i.e. more than 100 articles per hour. This calls for learning algorithms that can work in an online fashion both in the sense of handling additional training data as well as in the sense of being efficient enough during prediction in order to cope with the fast rate that new articles arrive. Furthermore, MEDLINE contains abstracts from about 5000 journals covering very different topics. This increases the complexity of the target function to be learned, as concepts may be associated with different patterns of word distributions in different biomedical areas.
MeSH concepts are hierarchically structured as a directed acyclic graph indicating subsumption relations among parent and child concepts. This structure is quite complex, as it comprises 16 main hierarchies with depths up to 12 levels and many children nodes belong to more than one ancestors and to more than one of the main hierarchies. While some progress has been recently achieved on exploiting such relationships, it is not entirely clear when and how these relationships help accuracy. As MeSH evolves yearly on par with the medical knowledge it describes, automatic indexing models must deal with such changes, both explicit (i.e. addition, deletion, merging of concepts) and implicit (i.e. altered semantics of concepts) ones. Also, each scientific document is typically annotated with several MeSH concepts. Such data are known as multilabel [2] and present the additional challenge of exploiting label dependencies to improve accuracy. Figure 2 shows the distribution of the number of labels per document which is Gaussian with a mean of about 13 labels per document and a heavy tail on the right.
The distribution of positive and negative examples for most of the MeSH concepts is very imbalanced [3]. Figure 3 plots the frequencies of labels (xaxis) versus the number of labels having such frequency (yaxis) for a subset of 4.3 million references of MEDLINE. By employing the KolgomorovSmirnov test as proposed in [4] it can be seen that the data fits to the power law distribution with a significance level of 0.02. Less than half of the labels appearing in this subset (10,352 out of 26,509) have more than 500 positive examples and only 811 labels have more than 10,000 examples. This extreme imbalance and more precisely the fact that most MeSH labels have very few positive instances, greatly hinders learning an effective model for their automatic prediction.
The European project BioASQ [5] has organized five challenges on largescale online biomedical semantic indexing, from 2013 to 2017, focusing on MEDLINE’s indexing problem. The official results of the five BioASQ challenges (2013–2017) are publicly available^{2}. The challenge runs for fifteen weeks every year. Every week the participants are given a set of new, unannotated articles (∼ 5000) and should provide annotation for them within 21 hours. Evaluation is performed on a variety of flat and hierarchical performance measures. Our team achieved the first place in 2013 and the second place in years 2014–2017, surpassing in all cases the accuracy of the current production system of NLM, MTI.
In this paper, we present our efforts to deal with this task, from a purely machine learning perspective. The main contribution is a new multilabel ensemble method that incorporates a statistical test to combine the constituent models. As a secondary contribution, we describe the adaptation and application of already existing supervised learning algorithms to such a demanding task and a short study on the concept drift within the corpus. The rest of the paper is organized as follows: In “Multilabel ensemble methods” section we present some prior work in the field of multilabel ensemble methods and explain the differences to our method. “Methods” section describes our proposed multilabel ensemble method, MULE, and “Results” section contains the experiments, the results and a study on the concept drift that exists within the corpus. Finally, in “Discussion” section we discuss the implications of our findings and in “Conclusion” section we present the conclusions from our work with some possible future directions.
Multilabel ensemble methods
The area of multilabel learning [2] is closely related to that of ensemble methods [6], as the most basic multilabel learning method called binary relevance (BR) involves learning an ensemble of binary models, one for each label. Pairwise techniques for multilabel learning, such as [7], also involve learning an ensemble of binary models. Here, however we focus on multilabel ensemble methods, in the sense of methods that combine multiple predictions for all labels, i.e. multiple rankings of labels, multiple bipartitions of the set of labels into positive and negative ones for an instance, or even multiple joint distributions for all labels.
Ensemble methods can offer improvements compared to a single model in the following cases where a single model fails to deliver a good approximation of a true hypothesis [6]: a) insufficient training data (statistical reason), b) nonconvex search space with multiple locally optimal solutions (computational reason), c) the searched hypothesis space does not include the true hypothesis (representation reason). The problem that we are dealing with in this paper, fits in the first two of the above cases. First, the distribution of label frequencies follows the power law, i.e., the vast majority of the labels have very few positive instances. This means that for most labels there are no sufficient positive training instances, greatly impeding effective training of a learning model. Second, we deal with a largescale multilabel task (∼ 10^{4} labels), which, taking into account the interactions and dependencies among labels, leads to a complex search space with multiple locally optimal solutions.
An ensemble is called homogeneous if all its component models emerge from the same theory, e.g., an ensemble of Support Vector Machines (SVM) classifiers. When an ensemble consists of different types of models (e.g an ensemble with SVM and Naive Bayes models), then it is called heterogeneous. Multilabel ensembles can be considered homogeneous if they combine models derived from the same multilabel learning algorithm and the same underlying singlelearning algorithm in the case of problem transformation methods, e.g. an ensemble of BR models, all trained using SVMs.
There are three main approaches to combine the decisions of an ensemble’s models: (i) selection, where a single model is used, (ii) fusion, where all models are used, and (iii) ensemble pruning, where a subset of the models is used. In multilabel ensembles, decision combination could further be characterized as global if the same combination is used for all labels (e.g. the same subset of multilabel models is selected for all labels), or local if a different combination can be used for each label (e.g. models are fused with different weighting for each label).
In existing ensemble approaches in the literature, the authors of [8] proposed an Ensemble of Classifier Chains (ECC) in which multiple Classifier Chains (CC) are trained to model the label correlations and then they are combined through a simple global voting scheme. The Ensemble of Pruned Sets (EPS) [9] represents another approach with similar philosophy, the PS constituent models being combined again through a simple global voting scheme. In [10] three hierarchical ensemble methods are introduced in order to deal with the gene function prediction problem; Hierarchical TopDown (HTD), Hierarchical Bayesian (HBAYES) and Hierarchical True Path Rule (HTPR) along with their costsensitive versions. Here different predictions are due to heterogeneous data representation and voting is again used for combining the models.
An ensemble of Bayesian Networks is proposed in [11], combining multiple joint distributions for all labels by means of their geometric average. Tahir et al. [12] present a fusion method where the probabilistic outputs of heterogeneous classifiers are averaged and the labels above a threshold are chosen. Finally, Yepes et al. [13] propose a classifier selection scheme based on the Fmeasure. For each label and for each of the classifiers the Fmeasure is computed and the best performing one is chosen to predict that particular label. Table 1 presents the aforementioned methods and classifies them. The mUltilabel ensemble method presented in this paper, described later on in “Statistical significance MUltiLabel Ensemble (MULE)” section, is also included in order to make more evident the differences among the approaches.
The method proposed in this paper is closely related to [12, 13] in the sense that in all three cases, based on a validation data set, the ensemble combines directly the various models’ prediction outputs without entailing any training or interfering with the constituent models’ structure and background. This is particularly useful when wishing to combine models emerging from very different theories. Moreover, it ensures scalability of the method particularly when dealing with such large data. For instance, we tried to employ stacked BR [14] as well to our problem but without any success due to scalability issues. Regarding these three methods ([12, 13] and ours), it should be noted that [12] proposes an ensemble limited to combine only models with probabilistic outputs (i.e. outputs ranging between 0 and 1) and thus it is not appropriate for our case where models with diverse outputs need to be combined. Therefore, in the experiments presented in “Experiments” section our method is compared only to the one by [13] along with a more simplistic version of MULE.
Methods
Preprocessing of the data
The BioASQ corpus is a subset of a large collection of biomedical papers (∼ 12 million abstracts) curated by the NLM, through the PubMed framework. For each document, only the abstract is provided along with some other information (title, journal, year and the MeSH terms).
From the entire corpus we have considered only abstracts belonging to the journals covered by the test set, resulting to a subset of 4.3 million abstracts. Since every new article will belong to a journal and since this metainformation will always be available, it is valid to use this information for training. Moreover, this approach was motivated from the fact that different journals are expected to follow different data and label distributions and we desire the model to draw its respective training data from a data distribution as similar as possible to the data that is going to be predicted. Therefore, selecting only those abstracts that belong to the test data journal list for training, is expected to improve performance. One alternative way to think about that relates to the process of manual indexing: a human annotator will be assigned papers coming from a specific scientific field and therefore list of journals, rather than annotating papers from any field, randomly, since he will be specialized on a specific field.
The following steps in the preprocessing of the data included removal of duplicate instances, concatenation of abstract and title for every instance, removal of stopwords and selection of word tokens and pairs of word tokens (bigrams) as features. Word tokens and bigrams with less than five occurrences were omitted as well as those with a frequency higher than half the size of the corpus. In order to vectorize the data sets the tfidf representation was used for the features. We furthermore performed zoning of some features, i.e., increasing the tfidf value of features that are expected to have more influence than others in the classification task. More specifically, we multiplied the tfidf values from ngrams that belonged to the title with log2 and from those being equal to some MeSH label by log1.25.
Statistical significance MUltiLabel Ensemble (MULE)
MULE is a multilabel ensemble approach concerned with the problem of selecting the most appropriate model among its members for each different label. It assumes the existence of a heterogeneous ensemble and that different labels can be approximated better by different types of models. The standard way to approach this problem is the employment of a validation set, based on which the accuracy of each model of the ensemble is evaluated for each label.
One issue with this approach is how to compare the models based on a single label, when the goal is to optimize a global evaluation measure related to all labels whose estimation cannot be decomposed per label, such as the microaveraged fmeasure, or the example based fmeasure. Comparing the models based on a local evaluation measure such as the fmeasure for the particular label [13] is not guaranteed to optimize such measures. A more appropriate solution involves cyclically examining the labels, selecting the best model for each label according to the global evaluation measure until the selected models per each label do not change in two consecutive cycles. Such an approach has been followed in the past for tuning a different threshold perlabel with the goal of optimizing a global evaluation measure [15].
We argue that model selection approaches based on a validation set are brittle for multilabel data streams with a large number of rare labels, like the application we are focusing on in this paper, which involves a stream of scientific articles where the label frequency distribution follows the power law. For rare labels, selection of models is untrustworthy as it is eventually made based on very little data, despite a potentially large validation set. Realworld streaming data are also often characterized by concept drift, and therefore the larger the validation set, the higher the chance that model selection will not be valid for future incoming data.
The observations above motivated the development of our approach, whose main idea is to start by trusting the globally optimal model across all labels as the best model for each label and then select a different model for a label only if it is significantly better than the global one in this label based on an appropriate statistical test. Trusting the globally optimal model is justifiable, as its evaluation is based on much more data, compared to the data for a single label. When using a statistical test to compare each other model against the globally optimal one, the choices are expected to be less optimistic and more conservative, leading to an ensemble that will be more robust to differences between the validation and test set and to the lack of enough positive samples for rare labels in the validation set.
Formally, suppose that the multilabel task to be dealt with has L labels, l being a label, and D documents, d being a document. Also, D _{ TRAIN } will be the training set, while D _{ VAL } and D _{ TEST } the validation and test set respectively. We denote as M the learning models that are used and, without any loss of generality, assume that M _{1} is the best performing model globally, in terms of a multilabel evaluation measure. The goal is to be able to tell which of the models used is more suitable for each of those L labels, in terms of the same measure on some validation data set.
The general scheme for MULE then is to

1.
Predict labels with all M _{ i } on a validation data set

2.
Determine for each label which models predict it more accurately compared to M _{1}, i.e., which model brings an improvement with respect to a global evaluation measure.

3.
Compare the differences in performance of each one of these models against M _{1} using a McNemar test with significance level α and select the one for which the null hypothesis is rejected. If the null hypothesis is rejected for more than one models, choose the one for which the null hypothesis has the lowest probability.

4.
Predict accordingly on the test set for each label
In Algorithm 1 we present the pseudocode for MULE assuming the evaluation measure is the micro Fmeasure. Naturally, any other metric can be used instead. For instance, in the experiments we also use a variant of MULE, which optimizes the macro Fmeasure. In Appendix: McNemar’s statistical test we provide the implementation details for the McNemar significance test with respect to the ensemble method. The only parameter needed for MULE is the significance level α of the respective McNemar’s test. It should be noted that when performing multiple statistical comparisons (that is for more than two models) the familywise error rate (FWER) should be controlled in order for the statistical comparisons to be valid. In our case, as the tests performed were parametrical, the BonferroniHolmes step method was used. A detailed explanation of that method is given in [16].
Initially, we also tried a similar strategy comparing classifiers in terms of their precision and recall by applying a proportion significance test (an idea based on [17]). Given two models A and B and assuming that A is better than B globally, we predict each label l with A unless if:

1.
precision _{ Bl }>precision _{ Al } and recall _{ Bl }>=recall _{ Al } or

2.
recall _{ Bl }>recall _{ Al } and precision _{ Bl }>=precision _{ Al },
where > means significantly better and >= means not significantly worse. If any of the above holds we predict l with B. We experimented with various confidence intervals (0.99, 0.975, 0.95, 0,90) but this approach proved to be too conservative in all cases, by allowing very few labels to be chosen from the second system, leading in some cases to negative results.
Results
In this section we present the results obtained from our experiments. We first present the evaluation metrics used to assess performance and then we describe the data sets used in the experiments. Next, we provide the constituents models used for the ensemble methods along with their relative performance and then we present the results for the ensemble methods with the relevant discussion. In the last subsection we present a small study on the temporal aspects of the BioASQ data.
Evaluation measures
Through our experiments, we chose to use as a means of evaluation of performance two labelbased measures that are widely used in multilabel contexts; the microF and the macroF measure [2]. Our choice over other possible options (e.g. precision, recall, accuracy) is dictated by the fact that the Fmeasure, as well as its micro and macro variants, provide a satisfying balance between precision and recall. Moreover, the macroF measure tends to favor rare labels whereas the microF tends to smooth out their effect on total performance, hence being more influenced by frequent labels. For simplicity, we provide the F, microF and macroF definitions directly in terms of the true positives, false positives and false negative errors. First, let’s denote as tp the number of true positives of a model (i.e. the number of times an instance has a label and the model successfully assigned it), fp the number of false positive errors of the model (i.e. the number of times an instance does not have a label but the model assigned it erroneously) and fn the number of false negative errors (i.e. the number of times an instance has a label but the model did not succeed in assigning it). Equation 1 provides the F1 score used for a singlelabel classification problem:
In a multilabel context such the one we deal with and given that there are L labels, the microF measure is defined as
and the macroF respectively
Data sets
We conducted experiments on two different subsets of the BioASQ corpus; data set A consists of a training set of 850,000, a validation set of 100,000 and a testing set of 50,000 documents and data set B consists of a training set of 20,000, a validation set of 20,000 and a test set of 10,000 documents. Table 2 shows the periods covered by the two data sets. The motivation behind using two different data sets in size was mainly to study how the ensemble algorithms that are tested would behave under a small training/validation set and a large one.
Component models
In this section we present the algorithms that were used as components for the ensemble method, during the BioASQ challenge as well as in other experiments. Naturally, any other supervised learning model could have been used instead.
BR SVM We used the BR or onevsall approach, according which a multilabel task with L labels is split in L different binary classification problems, one for each label. A model is then trained for each one of the labels independently from the others. Although this strategy does not take into account the relations that exist among labels (e.g. hierarchies) it is particularly convenient for largescale setups as it allows full parallelization of the training and prediction procedure. The Liblinear package [18] was used with C and e parameters at default values (1, 0.01 respectively) and a bias value of 1. The selected solver type was L2regularized L2loss support vector classification (L2RL2LossSVCDual). Two variations were used, one with default parameters (Vanilla) and a tuned version (Tuned). For the latter, we adjusted C = 0.33 and changed the w1 parameter to handle class imbalance by penalizing more heavily false negative errors than false positive ones [19]. More specifically, for all labels with less than 100 positive instances in the data set the weight for the negative class is set as 1 (default value) and the weight for the positive class as
The above parameter values were chosen based on smaller scale experiments.
MetaLabeler The MetaLabeler [20] is a twolevel model. It comprises a firstlevel multilabel learning model capable of producing a ranking of the labels per instance and a secondlevel model capable of predicting the number of labels per instance. We here instantiate this model as follows. We first train the Vanilla SVM models as described in the previous subsection and then predict on the new data by assigning a score to each instancelabel pair, based on the distance of the instance from the hyperplane of the label’s SVM. This way, a ranking of the labels is obtained for each instance, from the most relevant one (with the highest score) to the least relevant one (lowest score). The secondorder model then serves to determine automatically the number of labels per instance by employing linear Support Vector Regression (SVR). Other thresholding techniques exist in the literature, but they either require a crossvalidation step which requires a long time for large data ([15]) or did not perform as well in preliminary experiments ([21]). For both levels of this algorithm the same parameters and the same feature space as for the binary models were used.
Labeled LDA
Labeled Latent Dirichlet Allocation (LLDA) [22, 23] is a supervised learning extension of the LDA algorithm, where each topic is equal to a label of the corpus in a onetoone correspondence.We have used Prior LLDA [23] that incorporates prior knowledge on the labels’ distributions (i.e., their frequencies) within the training corpus. Also, we employed the CGS _{ p } improved ϕ and θ estimators presented in [24]. We set β=0.01 both at training and inference and \(\alpha = \frac {50}{L}\) during training and \(\alpha = 50 \times \frac {f_{l}}{\sum {f_{l}}}+\frac {30}{L}\) during prediction, L being the number of labels and f _{ l } the frequency of label l in the training corpus. We used one Markov chain, a burnin of 50 iterations, a sampling lag of 5 iterations and a total of 20 samples to compute the ϕ parameters during training and the θ parameters during prediction.
Experiments
Performance of the component models
In Table 3 the performance of the constituent models on data sets A and B are shown in terms of the microF and macroF measures. The discriminative SVMbased models clearly outperform the probabilistic model (LLDA). More precisely, the MetaLabeler outperforms all other models in both data sets, exhibiting a notable difference in both metrics compared to them. The prevalence of this method over the other SVM variants, particularly if we take into account the challenging properties of the BioASQ indexing task, suggests that ranking the scores of the different labels for every instance followed by some thresholding strategy is clearly more successful than the traditional classification of instances for every label (i.e. assigning 0 or 1 to an instance for every label). It should be also noted that the MetaLabeler does not have any particular tuning to cope with the class imbalance (the base models are Vanilla SVMs), opposite to the Tuned SVMs, but still outperforms them.
The second place is steadily occupied by the Tuned SVMs, which outperform their Vanilla counterparts in all cases, a finding more or less expected given the imbalanced nature of the data (we remind that the only difference between the two algorithms is that the tuned SVMs are configured to handle class imbalance, that is rare labels, by penalizing more heavily false negative errors). The Labeled LDA model is worse in all cases except for the macroF measure in data set B, in which case it outperforms the Vanilla SVM algorithm. We should note though that we did no particular parameter tuning which seems crucial for this model. For instance, averaging over more Markov Chains for the model in data set B results in a clearly higher performance than in A which is contradictory to the fact that data set A has a much bigger training data set.
Comparison of the ensemble methods
As it has been stated before, the goal of an ensemble method is to achieve higher performance than its components, w.r.t. some evaluation metric. In this context, MULE in its original form seeks to optimize the microF measure so in the first round of experiments, MULE is compared to the method presented in [13] and to a simple version of microF optimization ensemble, that does not involve a statistical test (essentially this method is equivalent to MULE but omits the McNemar test). Secondly, a variant of MULE that optimizes the macroF measure is compared to the method presented in [13]. In this case, as optimizing the F measure and the macroF measure is equivalent, there is no third model in the comparison. In all cases, we used a significance level of 0.1 for the McNemar’s test.
During the comparisons of the ensembles, different combinations of the components were used. The motivation was to be able to capture different relations among the models and test how the ensembles would behave in this case. For instance, the MetaLabeler is significantly better than all other models, so in this case there is an asymmetry between the components. On the other hand, the two SVM variants show rather equivalent performance. Moreover, including the LLDA model offers the possibility to test if the model can contribute in improving performance even if it is not as successful as the other components, as it comes from a different theoretical background.
Table 4 shows the microF measure for the algorithms on both data sets and for five different combinations of the constituent models. A △ symbol near a value indicates that the highest value (in bold) is significantly better than it at a significance level of 0.95. MULE outperforms the two other ensemble methods in all component combinations and for both data sets except for one case. The difference is statistically significant compared to the Foptimization method in all cases, but in none concerning the microF optimization approach. Compared to the component models, MULE is able to improve the microF measure in all combinations with respect to the best performing model, the difference being statistically significant for the two last combinations (SVM _{ Tuned } + SVM _{ Vanilla } + LLDA and MetaLabeler+ SVM _{ Tuned }+ SVM _{ Vanilla }+ LLDA) in data set A. In data set B, the differences are significant in all combinations except for one for MULE and the improvement is between 0.9−7.8%. From the two other ensemble methods, “improveF” is better only in two cases (in data set B) compared to the component models while “improve microF” does so in three cases for data set A and in all cases for data set B. Odd though it may seem, “improve microF” and “improve F” do not seem able to benefit from the fact that the validation set is relatively large in A (100,000 instances) by demonstrating mostly negative results.
Similarly, Table 5 shows the results for the macroF measure on the same five combinations. In this case the MULE _{Macro} variant is used, which optimizes the macroF measure in an identical approach to the classic MULE method. Our method outperforms the other ensemble method in all cases except for two, with the differences being statistically significant in all cases. With respect to the best performing constituent model in each combination, MULE is able to improve the macroF measure in three combinations (SVM _{ Tuned } + SVM _{ Vanilla } + LLDA, SVM _{ Vanilla }+ LLDA and MetaLabeler + LLDA) for data set A and two combinations in data set B (SVM _{ Tuned } + SVM _{ Vanilla } + LLDA, SVM _{ Vanilla }+ LLDA), the differences being statistically significant in none of the cases. In the other cases, the ensemble is performing worse than the best performing model (MetaLabeler). This behavior could be due to the fact that, throughout the experiments we set the α value for the McNemar’s test to 0.1, which is rather liberal for a statistical test. It becomes clear that there is a tradeoff between the improvement that can be achieved by combining multiple models and the confidence level that we can have on this improvement. In other words, choosing a small α value for the statistical test is expected to lead to more reliable results but a smaller improvement over the baseline, while in the opposite case, we risk to obtain a large improvement on the validation set, that will nevertheless not be reliable (i.e. it will not be necessarily reproducible on a random test set). By completely omitting the McNemar’s test, we obtain the extreme case for the aforementioned tradeoff (this is equivalent to having an α value of 1.0).
The “improveF” method performs worse by improving over the component baseline only in two cases, none of which is statistically significant. This is a rather interesting observation as this measure is specifically designed in order to improve the Fmeasure locally in every label which, as pointed out before, is equivalent to optimize the macroF measure. These results give strong evidence for the necessity of a statistical validation of the choices an ensemble method does.
In order to study more deeply the behavior of the tested ensemble methods, Table 6 shows the number of labels that each ensemble assigns to every component model in the first series of experiments and Table 7 depicts the average frequency of labels selected by each algorithm. It is clear that the first two ensemble methods, “improve microF” and “improveF”, assign a lot more labels than MULE to those models that perform worse in overall. MULE on the other hand, secures its choices on the statistical test and therefore is a lot more conservative. For instance, in the last combination of models MULE assigns only five labels to the LLDA algorithm, around two orders of magnitude less than the other two models.
In the experiments above, it could be argued that the multilabel ensemble method we propose is not improving spectacularly the component models performance. This is generally true, especially for the MULE _{Macro} variant. Nevertheless, there is some evidence that this behavior may be connected to the component models themselves and the differences in performances they have or the theoretical background they come from. For instance, the greatest improvement in terms of the microF metric (∼ 8%) is obtained for data set B when combining the two worst performing models, LLDA and Vanilla SVM, which are rather equivalent in terms of their performance and emerging from different theories. Either way, our goal in this series of experiments is to show that an ensemble method can clearly benefit from the use of a statistical test that validates it, regardless of the size of the validation data set or the nature of its component models. The fact that the two other methods, that lack this statistical validation fail largely to improve over the components, exhibiting a rather unreliable behavior (e.g. assigning many labels to worse performing models) overall, suggests strongly that a significance test is actually needed in this case. Finally, the results for the “improveF” method indicate that optimizing locally the Fmeasure does not necessarily lead to an improvement over the total performance of the ensemble.
Temporal aspects of the data
When performing a supervised learning task, the goal is usually to train a model that will fit an underlying (i.e. hidden) distribution of the data. Crucially, during this process if the new, unseen data to be predicted follows a significantly different distribution than the data used for training, the model’s performance will of course be greatly compromised. In our case, a number of factors could lead to an important change in the data distribution along time; first, the data set expands over a great period (1946–2017) and thus variations are expected in what concepts actually “mean” or, in other words, what word tokens the concepts are related to (e.g. a disease in 1990 and 2017 can be linked to totally different factors). This affects the label  word distributions and consequently the model’s performance. Another aspect are trends in science publications. Scientific papers show nonnegligible trends for a particular scientific field or another. In [25] these interesting changes in trends are studied in the biological field; e.g. in 1991, 14% of the scientific papers indexed in Web of Science concerned Biochemistry while twenty years later (in 2010) this percentage has dropped to only 4%. Finally, it should be noted that NLM makes changes once every year to the MeSH ontology (i.e. the label set), a valid choice as science evolves, and the journals encapsulated by NLM change as well every year. In the 2008 MEDLINE data changes announcement for example, the NLM reports an addition of 456 new MeSH terms in the existing vocabulary^{3}.
Bearing in mind the aforementioned factors we performed a short study on the concept drift, designing two experiments. For all results presented below we used the MetaLabeler as a learning model.
First, we trained classifiers with increasing training set sizes and keeping the same test set. Table 8 shows the years covered by the aforementioned training sets and Fig. 4 shows the micro and macro Fmeasure evolution as training sets get larger going back in time. It is easily noticeable that there is no significant gain in performance for more than 1,000,000 documents and results are even getting worse for training sets containing documents before 2004. The macro Fmeasure seems to have a small gain going back in time (papers from 2001), probably because it favors rare labels more than the micro Fmeasure and with the increase of the data set size more positive examples will be observed for them. Nevertheless, for papers before 2001 a decrease is apparent in this measure as well.
The second experiment consisted of training a classifier on 500,000 documents and then splitting the following 1,000,000 documents in 20 equal consecutive data sets to study how performance is affected as time goes by. Figure 5 shows the results. We can notice a significant drop in performance for both measures as test sets move away from the training set.
The above results, validate the presence of a nonnegligible concept drift within the corpus, even for relatively small periods. Some direct conclusions are that a) it could be crucial for a learning model’s performance to choose a training data set as close (chronologically) as possible to the unseen data and b) in contrast to what usually is the case in a machine learning scenario, opting for a larger training data set can lead even to inferior results in case of the BioASQ corpus (or the PubMed corpus more generally).
Discussion
Semantic indexing of biomedical articles represents an important and challenging task, as explained in detail in “Background” section. We have taken a novel approach by combining a number of multilabel learning algorithms in an ensemble method that validates its choices through a statistical significance test. The results that we presented, as well as the official results from five BioASQ challenges (2013–2017) in which we achieved the first place in 2013 and the second place in the following years, show that our approach can leverage the virtues of the baseline algorithms and improve over them, as well as over similar methods. This work could be extended to include additional algorithms and other variants of the multilabel ensemble, optimizing different measures or employing an alternative statistical test. Additionally, the concept drift within BioASQ and PubMed should be investigated in more depth and methods should be developed to deal with it efficiently.
Conclusion
In this work, the different strategies that we used in order to tackle the largescale multilabel classification problem of the BioASQ challenge were presented. Several already existing supervised learning algorithms were used, both from a discriminative and probabilistic origin. The main contribution is a multilabel ensemble that validates its choices through a statistical test. The ensemble method, MULE has been compared to two other variants in two different experimental scenarios and for different component model combinations. The results show a strong advantage of MULE over two other similar methods. Concerning the other contribution of the paper, the short study on the temporal aspects of the data, the results show a significant change of the concepts over time that should be taken into account by researchers trying to perform semantic indexing of biomedical literature. Some possible future directions of this work could include the use of additional algorithms and other variants of the multilabel ensemble. Finally, we would like to further study the concept drift within BioASQ and PubMed and experiment on methods to deal with it efficiently.
Appendix: McNemar’s statistical test
The McNemar statistical test provides a way to test differences on paired data. It is essentially a paired version of a Chisquare test. Considering the comparison of two classifiers A and B, we denote:

n _{00} the number of examples correclty classified by both A and B

n _{01} the number of examples correclty classified by A but not by B

n _{10} the number of examples correclty classified by B but not by A

n _{11} the number of examples misclassified by both A and B
The McNemar’s test is then defined as
If n _{01}+n _{10}<20 the statistic is not approximated well by the chisquared distribution. In this case the binomial distribution is used to perform an exact test. Fagerland et al. [26] have demonstrated in a series of experiments that the midP McNemar test is performing better than its exactP counterpart, therefore we chose midP for the case of n _{01}+n _{10}<20:
midP \(= 2 \sum \limits _{i=0}^{n_{01}}\binom {n_{01} + n_{10}}{i}0.5^{i}\times 0.5^{ni}0.5\binom {n_{01} + n_{10}}{n_{01}}0.5^{n_{01}}\times 0.5^{n_{10}}\)
Abbreviations
 BR:

Binary relevance
 FWER:

Familywise error rate
 LLDA:

Labeled latent Dirichlet allocation
 LDA:

Latent Dirichlet allocation
 MULE:

Multilabel ensemble
 MeSH:

Medical subject heading
 NLM:

National library of medicine
 SVM:

Support vector machines
References
 1
Huang M, Névéol A, Lu Z. Recommending mesh terms for annotating biomedical articles. J Am Med Inform Assoc. 2011; 18(5):660–7.
 2
Tsoumakas G, Katakis I, Vlahavas I. Mining multilabel data In: Maimon O, Rokach L, editors. Data Mining and Knowledge Discovery Handbook, 2nd edn., Chap. 34. Springer: 2010. p. 667–85.
 3
He H, Garcia EA. Learning from imbalanced data. IEEE Trans Knowl Data Eng. 2009; 21:1263–84. doi:10.1109/TKDE.2008.239.
 4
Clauset A, Shalizi CR, Newman MEJ. Powerlaw distributions in empirical data. SIAM Rev. 2009; 51(4):661–703. doi:10.1137/070710111.
 5
Tsatsaronis G, Schroeder M, Dresden TU, Paliouras G, Almirantis Y, Gaussier E, Gallinari P, Artieres T, Alvers MR, Zschunke M, Gmbh T, Ngomo ACN. BioASQ: A Challenge on LargeScale Biomedical Semantic Indexing and Question Answering, AAAI fall symposium: Information retrieval and knowledge discovery in biomedical text. 2012.
 6
Dietterich TG. Ensemble Methods in Machine Learning. In: Proceedings of the 1st International Workshop in Multiple Classifier Systems. Springer: 2000. p. 1–15.
 7
Fürnkranz J, Hüllermeier E, Mencia EL, Brinker K. Multilabel classification via calibrated label ranking. Mach Learn. 2008; 73(2):133–53.
 8
Read J, Pfahringer B, Holmes G, Frank E. Classifier chains for multilabel classification. In: Proc. 20th European Conference on Machine Learning (ECML 2009).2009. p. 254–69.
 9
Read J, Pfahringer B, Holmes G. Multilabel classification using ensembles of pruned sets. In: Proc. 8th IEEE International Conference on Data Mining (ICDM’08). IEEE: 2008. p. 995–1000.
 10
CesaBianchi N, Re M, Valentini G. Synergy of multilabel hierarchical ensembles, data fusion, and costsensitive methods for gene functional inference. Mach Learn. 2012; 88(12):209–41. doi:10.1007/s1099401152716.
 11
Alessandro A, Corani G, Mauá D, Gabaglio S. An ensemble of bayesian networks for multilabel classification. In: Proceedings of the TwentyThird International Joint Conference on Artificial Intelligence, IJCAI’13. AAAI Press: 2013. p. 1220–5. http://dl.acm.org/citation.cfm?id=2540128.2540304.
 12
Tahir MA, Kittler J, Bouridane A. Multilabel classification using heterogeneous ensemble of multilabel classifiers. Pattern Recogn Lett. 2012; 33(5):513–23. doi:10.1016/j.patrec.2011.10.019.
 13
JimenoYepes A, Mork JG, DemnerFushman D, Aronson AR. A onesizefitsall indexing method does not exist: Automatic selection based on metalearning. JCSE. 2012; 6(2):151–60.
 14
Godbole S, Sarawagi S. Discriminative methods for multilabeled classification. In: Proceedings of the 8th PacificAsia Conference on Knowledge Discovery and Data Mining (PAKDD 2004).2004. p. 22–30.
 15
Fan RE, Lin CJ. A study on threshold selection for multilabel classification. Technical report, National Taiwan University. 2007.
 16
Demsar J. Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res. 2006; 7:1–30.
 17
Joshi MV. On evaluating performance of classifiers for rare classes. In: Proceedings of the 2002 IEEE International Conference on Data Mining, ICDM ’02. Washington, DC: IEEE Computer Society: 2002. p. 641. http://dl.acm.org/citation.cfm?id=844380.844791.
 18
Fan RE, Chang KW, Hsieh CJ, Wang XR, Lin CJ. Liblinear: A library for large linear classification. J Mach Learn Res. 2008; 9:1871–4.
 19
Lewis DD, Yang Y, Rose TG, Li F. Rcv1: A new benchmark collection for text categorization research. J Mach Learn Res. 2004; 5:361–97.
 20
Tang L, Rajan S, Narayanan VK. Large scale multilabel classification via metalabeler. In: WWW ’09: Proceedings of the 18th International Conference on World Wide Web. New York: ACM: 2009. p. 211–20.
 21
Nam J, Kim J, Gurevych I, Fürnkranz J. Largescale multilabel text classification  revisiting neural networks.CoRR. 2013. abs/1312.5419.
 22
Ramage D, Hall D, Nallapati R, Manning CD. Labeled lda: A supervised topic model for credit attribution in multilabeled corpora. In: Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1  Volume 1, EMNLP ’09. Stroudsburg: Association for Computational Linguistics: 2009. p. 248–56. http://dl.acm.org/citation.cfm?id=1699510.1699543.
 23
Rubin TN, Chambers A, Smyth P, Steyvers M. Statistical topic models for multilabel document classification. Mach Learn. 2012; 88(12):157–208. doi:10.1007/s1099401152725.
 24
Papanikolaou Y, Foulds JR, Rubin TN, Tsoumakas G. Dense Distributions from Sparse Samples: Improved Gibbs Sampling Parameter Estimators for LDA. J Mach Learn Res. 2017;18.
 25
Pautasso M. Publication growth in biological subfields: Patterns, predictability and sustainability. Sustainability. 2012; 4(12):3234–47. doi:10.3390/su4123234.
 26
Fagerland MW, Lydersen S, Laake P. The mcnemar test for binary matchedpairs data: midp and asymptotic are better than exact conditional. BMC Med Res Methodol. 2013;13.
Acknowledgments
Timothy Rubin has assisted us in implementing Labeled LDA. Grigorios Tsoumakas would like to thank Professor Charles Elkan for raising the problem of the statistical unreliability of perlabel accuracy estimates based on few true positives in an email discussion. Furthermore, We would like to thank the anonymous reviewers for their helpful and constructive comments.
Funding
This work has been partially funded by Atypon Systems Inc.
Availability of data and materials
The BioASQ data set used along the paper is publicly available at http://participantsarea.bioasq.org/general_information/Task5a/.
Author information
Affiliations
Contributions
GT came up with the original idea. YP under the supervision and with the assistance of GT implemented the necessary code, performed the experiments and wrote the paper. ML and NM assisted with implementation issues. IV acted as a coordinator and provided a higher level supervision. All authors participated in the analysis of the results and the relevant conclusions. All authors read and approved the final manuscript.
Corresponding author
Correspondence to Yannis Papanikolaou.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Received
Accepted
Published
DOI
Keywords
 Semantic indexing
 Multilabel ensemble
 Machine learning
 BioASQ
 Supervised learning
 Multilabel learning