Volume 2 Supplement 3
MedEval — A Swedish medical test collection with doctors and patients user groups
© Heppin; licensee BioMed Central Ltd. 2011
Published: 14 July 2011
Test collections for information retrieval are scarce. Domain-specific test collections are even more so, and medical test collections in the Swedish language were non-existent prior to the making of the MedEval test collection. Most research in information retrieval has been performed in the English language, thus most test collections contain English documents. However, English is morphologically poor compared to many other European languages, and a number of interesting and important aspects have not been investigated. Building a medical test collection in Swedish opens new research opportunities.
This article describes the making of and potential uses of MedEval, a Swedish medical test collection with assessments, not only for topical relevance, but also for target reader group: Doctors or Patients. A user of the test collection may choose if she wishes to search in the Doctors or the Patients scenario where the topical relevance assessments have been adjusted with consideration to user group, or to search in a scenario which regards only topical relevance.
In addition to having three user groups, MedEval, in its present form, has two indexes, one where the terms are lemmatized and one where the terms are lemmatized and the compounds split and the constituents indexed together with the whole compound.
Differences discovered between the documents written for medical professionals and documents written for laypersons are presented. These differences may be utilized in further studies of retrieval of documents aimed at certain groups of readers. Differences between the groups of documents are, for example, that professional documents have a higher ratio of compounds, have a greater average word length and contain more multi-word expressions.
An experiment is described where the user scenarios have been utilized, searching with expert terms and lay terms, separately and in combination in the different scenarios. The tendency discovered is that the medical expert gets best results using expert terms and the lay person best results using lay terms, but also quite good results using expert terms or lay and expert terms in combination.
The many features of MedEval give a variety of research possibilities, such as comparing the effectiveness of search terms when it comes to retrieving documents aimed at the different user groups, or studying the effect of compound decomposition in the retrieval of documents. As Swedish, the language of MedEval, is a morphologically more complex language than English, it is possible to study additional aspects of the effect of natural language processing in information retrieval, for example utilizing different inflectional word forms in the retrieval of expert vs lay documents. MedEval is the first Swedish test collection of the medical domain.
The Department of Swedish at the University of Gothenburg is in the process of making the MedEval test collection available to academic researchers.
Building a test collection is a major undertaking, which is why test collections are scarce. But the long process of building a test collection gives many insights into the field of information retrieval. This article describes the process from collecting the documents in the underlying corpus, through the creation of search topics, to the instructions to the relevance judges, including the choice of categories in the assessments of documents for relevance and for intended reader group. The article also presents the structure of the recall bases and the representation of the collection documents in the two indexes, with and without split compounds. To show how a test collection such as MedEval can be used, the article presents a selection of substantial differences between the documents written for professionals and documents written for laypersons, and finally presents experimental runs for the study of retrieval of documents aimed at the two target reader groups.
When the decision was made to build a new test collection, the Department of Swedish at the University of Gothenburg was involved in projects of research in medical language processing. There was also a growing interest of research in information retrieval. As no Swedish medical test collection existed, creating one seemed to be a good investment in knowledge and resources, even though this involved a team of people during many months.
One existing medical test collection, albeit in English, is OHSUMED. It is built of nearly 350,000 references from MEDLINE, and thus the documents it contains have medical professionals as intended readers. The OHSUMED documents are assessed on a three-graded scale: definitely, possibly and not relevant. OHSUMED contains 106 topics generated by physicians from authentic situations. Each topic consists of information about a specific patient and an information request concerning this patient.
The collection documents
Table 1. The genres of the documents in the MedEval document collection. Columns: type of source, number of documents, percent of documents, number of tokens, percent of tokens. Source types: journals and periodicals; government, faculties, institutes, and hospitals; health-care communication companies; media (TV, daily newspapers).
Table 2. Type and token frequencies of terms. Measures: number of documents, average word length, full form types, lemma type token ratio, full form compound types, lemma compound types, ratio of compounds.
For the creation of the MedEval information needs, also called topics, two medical students in their fourth year of studies were hired. Their instructions were to create information needs that would be plausible in real medical situations, by doctors or by patients. Guided by explorative searches, the topic creators were asked to adjust the complexity of the topics so that the plausible number of relevant documents for each topic would be not less than five but still not much more than 50. 100 topics were created in the first stage. 62 of these were used in the collection. The process of creating topics was inspired by  and is described in more detail in .
Selecting documents to assess
An ideal test collection would have a complete set of relevance judgments with every document assessed for relevance to every information need. With a collection of over 42,000 documents and with 62 information needs, as in MedEval, taking an estimated average of 8 minutes to assess each document, working 40 hours a week, it would take four persons over 42 years to finish the assessments.
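The estimate above can be retraced with a few lines of arithmetic. The sketch below uses the figures given in the text; the number of working weeks per year is not stated in the article and is an assumption here. With exactly 42,000 documents and 52 working weeks the result lands just under 42 years, so the "over 42 years" in the text presumably reflects the slightly larger actual collection size or fewer working weeks per year.

```python
# Back-of-the-envelope estimate of the effort needed for exhaustive assessment.
# All figures except WEEKS_PER_YEAR come from the text.
DOCUMENTS = 42_000          # "over 42,000 documents"
TOPICS = 62
MINUTES_PER_ASSESSMENT = 8
HOURS_PER_WEEK = 40
ASSESSORS = 4
WEEKS_PER_YEAR = 52         # assumption: no holidays or leave


def years_for_full_assessment(documents=DOCUMENTS, topics=TOPICS,
                              minutes_each=MINUTES_PER_ASSESSMENT,
                              hours_per_week=HOURS_PER_WEEK,
                              assessors=ASSESSORS,
                              weeks_per_year=WEEKS_PER_YEAR):
    """Calendar years needed if every document is judged for every topic."""
    total_hours = documents * topics * minutes_each / 60
    person_weeks = total_hours / hours_per_week
    return person_weeks / assessors / weeks_per_year


print(round(years_for_full_assessment(), 1))  # 41.7
```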
Instead of assessing all documents for all topics, subsets of documents with a high probability of being relevant to each topic were extracted. These subsets were selected in a series of different runs using basic queries. Since there was limited time and economic resources creating MedEval, the extraction of documents was done on a small scale with only one search engine, namely Indri/Lemur .
Four different search methods were used in the extraction, that is, four runs for every information need. For each run, the 100 documents ranked most likely to be relevant were extracted, if in fact so many were retrieved. Two searches were done in each index: with and without decomposed compounds. One search was intended to be broad and one more specific. The number of documents assessed for each information need was between 115 and 358.
For each topic, the result of the extraction was four lists of document IDs. These were merged in one file per topic. The IDs were sorted in alphanumerical order and duplicates were removed. This is important to avoid bias, as the assessors must not know how the documents were ranked in the initial runs or in how many searches each document was retrieved. The documents corresponding to the extracted IDs were printed on paper and fixed in separate bundles for each topic. The papers were printed on only one side to avoid negative bias for short documents ending up on the left page of a spread. The method for selecting documents to assess was based on methods described in  and .
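The merging step described above can be sketched as follows. This is a minimal illustration, not the scripts used for MedEval: the function name `build_pool` and the document IDs are hypothetical, and Python's default string ordering stands in for the alphanumerical sort.

```python
def build_pool(runs, depth=100):
    """Merge the top `depth` IDs of each run into one sorted, duplicate-free list.

    Sorting by ID discards both the original rank and the number of runs
    that retrieved a document, so neither can bias the assessor.
    """
    pooled = set()
    for ranked_ids in runs:
        pooled.update(ranked_ids[:depth])
    return sorted(pooled)


# Four hypothetical runs for one topic (two indexes, broad and specific query).
runs = [
    ["d07", "d02", "d11"],
    ["d02", "d05"],
    ["d11", "d07", "d01"],
    ["d05", "d09"],
]
print(build_pool(runs))  # ['d01', 'd02', 'd05', 'd07', 'd09', 'd11']
```

With four runs of at most 100 documents each, a pool built this way holds at most 400 documents, which is consistent with the reported range of 115 to 358 assessed documents per information need.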
The extracted documents were assessed for relevance according to the corresponding information needs. Four medical students were hired to do the assessments, not the same students as the creators of the MedEval topics. Domain knowledge is essential for understanding the topics and the contents of the documents and also for consistency in judging .
It may be expected that the greater the judges’ subject knowledge, the higher will be their agreement on relevance judgments. Subject knowledge seems to be the most important factor affecting the relevance judgment as far as human characteristics are concerned. , p. 341
For each of the 62 topics, an assessor read through the documents to be assessed and decided, for each document, the intended group of readers and the degree of relevance to the topic. The documents for each individual need were assessed by one and the same assessor for reasons of consistency. It is not unusual for assessors to disagree on the relevance of a certain document. However, considering two documents, assessors tend to agree which one is the more relevant. As research in information retrieval according to the Cranfield paradigm  is based on relative relevance scores, and not absolute relevance scores, this would make the judging sufficiently consistent . This has been concluded in several studies, and already in .
It is most significant to note that the relative relevance score of documents in a group [...] may be expected to be remarkably consistent even when judges with differing backgrounds make the relevance judgments. Thus, it may be more profitable to compare the relative position of documents in a set than to compare the relevance ratings assigned to individual documents. , p. 341
The findings of Saracevic are supported by later studies conducted by Voorhees . She claims that the important question is not how well assessors agree with one another, but how the results change with these differences in assessment. Her conclusion is that despite differences in assessments between assessors, the evaluation behavior remains the same. Supported by , , and  the creators of the MedEval test collection came to the conclusion that one assessor per topic would be sufficient. More important for obtaining a consistent test collection was not to split any set of documents assessed for a certain topic between different assessors.
The MedEval relevance assessments were made on a four graded scale, 0–3, where 0 is ‘Not at all relevant’ and 3 is ‘Highly relevant’ . This scale is easily turned into a binary scale by stating that the documents with the lower grades are to be considered non-relevant and the ones with higher grades relevant. An impatient user, who is satisfied with one or a few documents, could have only documents with relevance score 3 considered relevant, while a user who is willing to take her time, and who wants as many documents as possible, could let all documents with relevance 1–3 be considered relevant.
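The conversion from the graded scale to a binary one amounts to choosing a threshold. A minimal sketch, with hypothetical document IDs and a hypothetical helper name `binarize`:

```python
def binarize(grades, threshold):
    """Map graded relevance (0-3) to binary: relevant if grade >= threshold."""
    return {doc_id: grade >= threshold for doc_id, grade in grades.items()}


# Hypothetical graded assessments for one topic.
grades = {"d1": 0, "d2": 1, "d3": 2, "d4": 3}

impatient = binarize(grades, threshold=3)  # only highly relevant documents count
thorough = binarize(grades, threshold=1)   # every grade from 1 to 3 counts

print([d for d, rel in impatient.items() if rel])  # ['d4']
print([d for d, rel in thorough.items() if rel])   # ['d2', 'd3', 'd4']
```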
The relevance judged by the assessors was topical relevance, how well a document corresponds to a topic. The assessors were instructed not to involve user relevance in this score. Each document was judged on its own merits. The novelty of the contents of a document should not be taken into account.
In addition to topical relevance the assessors judged each document for target reader group, that is which group of readers was the intended: Patients, if a document was written for laypersons, or Doctors, if it was written for medical professionals. This assessment was not based on any statistical or formal factors, only on the assessors’ judgments. Some documents were difficult to classify as they were not clearly aimed at a certain group. A number of these documents were labeled with different target groups when assessed for different topics (see Table 2).
For a classification of documents according to intended reader group to be useful, there must be a measurable difference between the document classes. Table 2 shows statistics for different categories of terms in different subsets of the collection. In each set, duplicates were removed in the case that a document had been assessed for more than one topic. The subsets considered are described below. Full form types are the original terms of the documents before lemmatization (with inflections) and lemma types are the same terms after lemmatization (reduced to base form).
Entire collection: All documents of the MedEval collection.
Assessed documents: All documents that have been assessed for any topic.
Doctors assessed: All documents that for at least one topic have been assessed to have target group Doctors.
Patients assessed: All documents that for at least one topic have been assessed to have target group Patients.
Common files: All documents that for at least one topic have been assessed to have target group Doctors and for another to have target group Patients.
Doctors relevant: All documents that for at least one topic have been assessed to have at least relevance grade 1 and to have target group Doctors.
Patients relevant: All documents that for at least one topic have been assessed to have at least relevance grade 1 and to have target group Patients.
Before counting frequencies, the files were cleaned from tags, IDs, dates (in the date tag, not in the actual text), web information and punctuation marks. As the tokens were counted after the cleaning of the text, the number of tokens in this table is not consistent with the number of tokens in Table 1.
The number of tokens per document is significantly smaller for the entire collection, than for any subset. This means that there is a large number of short documents that were not retrieved by any query when the documents were extracted. This is not surprising, since short documents contain few terms which can match the queries. The finding that unjudged documents on average tend to be shorter than judged documents, both relevant and non-relevant, is consistent with the results of experiments described in . One reason, according to Karlgren, is that non-retrieved items often contain tables and numerical information. He also concludes that longer documents have a bigger chance of touching relevant subjects, but unfortunately also confusingly similar subjects which are non-relevant.
The documents in the set ‘Patients assessed’ had only 57% of the number of tokens per document, compared to the documents in ‘Doctors assessed’. Even though there were over 1,000 more documents in ‘Patients assessed’ than in ‘Doctors assessed’, there were over 50,000 more lemma types in the doctor documents and almost 30,000 more lemma compound types. Type token ratio is a measure of the average number of times each type, or word form, is used. This measure grows as the size of the set of documents considered grows. This fact makes it even more noteworthy that the type token ratio for the patient documents is significantly higher than for the doctor documents, even though the doctor documents contain more tokens. What this signifies is that there are not as many different types of word forms in the lay texts, but each type is used a larger number of times.
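As described here, the type token ratio is the number of tokens divided by the number of types, that is, the average number of occurrences per type. The sketch below follows that description; it is not the tooling used for Table 2. The function name is hypothetical, tokenization is plain whitespace splitting, no lemmatization is applied, and the two toy sentences are invented for illustration.

```python
from collections import Counter


def tokens_per_type(tokens):
    """Average number of occurrences per type (tokens divided by types)."""
    counts = Counter(tokens)
    return len(tokens) / len(counts) if counts else 0.0


# Invented toy texts: the lay text repeats a few word forms,
# the expert text uses every word form once.
lay = "hjärtat slår och hjärtat vilar och hjärtat slår".split()
expert = "förmaksflimmer behandlas med antikoagulantia eller elkonvertering".split()

print(tokens_per_type(lay))     # 2.0
print(tokens_per_type(expert))  # 1.0
```

Because the measure grows with the size of the text, comparisons between sets of very different sizes should be read with that in mind, as the paragraph above notes.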
The average word length in ‘Doctors assessed’ was 6.29 compared to 5.73 for ‘Patients assessed’. The ratio of compound tokens was also higher in the doctor documents, 0.128 compared to 0.098.
Additional file 1 illustrates the fact that the doctor documents contain more and longer terms and more compounds than patient documents. The file shows frequencies of all full form types of strings beginning with the random term förmak ‘atrium’ in ‘Patients assessed’ and ‘Doctors assessed’ respectively. The patient documents have 18 full form types beginning with förmak while doctor documents have 75, more than four times as many.
Looking at all instances of strings beginning with förmak in the two sets of documents, for professionals and laypeople, there is a significant difference. In the patient documents 66 tokens of 372, or 17.7%, are nouns in the definite form, while the corresponding numbers for the doctor documents are 89 of 932 tokens, or 9.6%. A hypothesis for why this is so is that medical professionals often discuss matters from a generic point of view, while laypeople discuss specific cases.
Frequencies of adjectives. Form categories: non-neuter singular indefinite; plural and/or definite.
The conclusion that professionals discuss generic cases while laypeople discuss specific cases is supported by a difference that can be seen in frequency tables of multi-word expressions in doctor and patient documents . In the doctor documents, high frequencies are found for phrases with meanings such as ‘in patients with’ or ‘of patients with’. Frequencies are also high for indefinite noun phrases such as ‘in treatment of’ or ‘for treatment of’. The patient documents, on the other hand, contain phrases describing specific patients or specific cases, for example phrases that contain the pronoun ‘you’ and noun phrases in definite form, such as ‘when the treatment is completed’.
Overall, the documents written for the doctor target group tend to be written in a more disassociated way compared with the patient documents which are more interactive in their approach, addressing the reader directly. While the professional documents tend to discuss research results or cases in general, the lay documents often discuss specific cases. This difference in approach manifests itself, for example in the features described above with the patient documents containing more nouns and adjectives in the definite form, and more pronouns in the first or second person, while doctor documents predominately have nouns and adjectives in the indefinite form, and pronouns in the third person. The professional documents also tend to be written in a more formal way with many multi-word phrases recurring with high frequencies. As there is an apparent difference between the documents written for the professional and layperson target groups, these differences could be used for a precategorization of documents according to genre. Such a categorization could be stored in a separate field in the document representations.
An interesting research question for future projects could be to study the benefit of lemmatizing inflected words, but keeping the inflectional information in tags, or recording the tendency of a text in terms of generic vs specific. This could be a way to keep the higher recall gained by lemmatization, but still use inflectional information for discrimination .
The MedEval test collection allows the user to state user group: None (no specified group), Doctors or Patients. This choice directs the user to one of three scenarios. The None scenario contains the topical relevance grades as made by the assessors. The Doctors scenario contains the same grades with the exception that the grades of the documents marked for the Patients target group are downgraded by one. In the same way, the Patients scenario has the documents marked for the Doctors target group downgraded by one. This means that, for a doctor user, patient documents given relevance 3 by the assessor are graded 2, documents given relevance 2 are graded 1, and documents given relevance 1 are graded 0. The same is done in the Patients scenario with the doctor documents. The idea is that a document that is written for a reader from one target group but retrieved for a user from the other group will not be non-relevant, but less useful than a document from the correct target group. Put differently, a document intended for patients would contain information that doctors (hopefully) already know. On the other hand, documents intended for doctors, even though they might be topically relevant for a patient’s need, run a great risk of being written in such a way that a patient will have problems grasping the whole content. This is a way of introducing utility without performing user studies.
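The downgrading rule can be expressed compactly. The following is a sketch of the rule as stated in the text, not of the actual MedEval recall base files; the function name and its string arguments are hypothetical.

```python
SCENARIOS = ("None", "Doctors", "Patients")


def scenario_grade(grade, target_group, scenario):
    """Relevance grade of a document as seen in a given user scenario.

    grade        -- topical relevance assigned by the assessor (0-3)
    target_group -- "Doctors" or "Patients", as marked by the assessor
    scenario     -- "None", "Doctors" or "Patients"
    """
    if scenario not in SCENARIOS:
        raise ValueError(f"unknown scenario: {scenario!r}")
    if scenario == "None" or target_group == scenario:
        return grade
    # Document written for the other group: one step less useful, never below 0.
    return max(grade - 1, 0)


print(scenario_grade(3, "Patients", "Doctors"))  # 2
print(scenario_grade(1, "Patients", "Doctors"))  # 0
print(scenario_grade(3, "Doctors", "Doctors"))   # 3
```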
The three topics of Figure 2 show different characteristics with reference to the number of relevant doctor and patient documents. Topic 36 has fairly similar cumulated gain curves for the Doctors and Patients scenarios. Topic 28 has a majority of doctor documents, while Topic 92 has no documents of any relevance grade marked for target group Doctors. Thus the None and the Patients ideal gain vectors coincide fully, while the cumulated gain for the Doctors scenario is very low, originating from downgraded patient documents.
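The behavior described for Topic 92 can be reproduced with a toy recall base in which every relevant document is marked for Patients. The assessments below are invented and the function names are hypothetical; the sketch only illustrates how the ideal cumulated gain differs between scenarios.

```python
from itertools import accumulate


def downgrade(grade, target_group, scenario):
    """Grade in a scenario: other-group documents lose one step, floor at 0."""
    if scenario == "None" or target_group == scenario:
        return grade
    return max(grade - 1, 0)


def ideal_cumulated_gain(assessments, scenario):
    """Cumulated gain of the best possible ranking in the given scenario."""
    gains = sorted((downgrade(g, t, scenario) for g, t in assessments),
                   reverse=True)
    return list(accumulate(gains))


# Invented recall base: (grade, target group); no Doctors documents at all.
assessments = [(3, "Patients"), (2, "Patients"), (1, "Patients")]

print(ideal_cumulated_gain(assessments, "None"))      # [3, 5, 6]
print(ideal_cumulated_gain(assessments, "Patients"))  # [3, 5, 6]
print(ideal_cumulated_gain(assessments, "Doctors"))   # [2, 3, 3]
```

The None and Patients vectors coincide, while the Doctors vector consists only of downgraded patient documents.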
To demonstrate the effectiveness of search terms from the different styles of language of the two target groups, a number of synonym pairs were used as search keys for corresponding topics. Each synonym pair consisted of one neoclassical term, belonging to the expert register, and one lay term. The terms of each pair were run separately as single search key queries, and also combined in one query. All queries, three for each topic, were run in the doctors scenario and in the patients scenario. Note that for each query the resulting ranked list of documents is the same for both scenarios. It is the recall bases, and thus the relevance grades of the retrieved documents, that differ.
As MedEval, to the authors’ knowledge, is the first medical test collection with user groups, there are no earlier equivalent tests. However,  address the fact that medical experts and non-experts express themselves in different ways, and that this affects search results. The authors are motivated by the empowerment of laypersons and discuss how to exchange information across user groups. The goal is that a search using non-expert terms should retrieve all types of documents written on the topic. They see the problem as a question of automatic alignment between specialized terminology and general terminology and enrich the information retrieval system with a set of links between corresponding concepts in lay and professional language.
The contrast between Swedish professional medical language and Swedish lay language is addressed in . The authors have selected documents concerning cardiovascular disorders from the MedLex corpus . Their findings may be used as a basis for future studies on how to differ searches with the purpose of retrieving documents for the different user groups. The findings have inspired the choice of entries in Table 2 showing differences between the sets of doctor and of patient documents.
Measures of effectiveness
The effectiveness of the queries described above was measured in recall after 10, 20, and 100 retrieved documents. This represents the impatient, the slightly less impatient, and the patient user. The effectiveness was also measured in normalized discounted cumulated gain, nDCG . The nDCG is based on the cumulated gain described earlier, but uses a discounting factor which reduces the amount of the relevance score added for each document in the ranked list. The relevance score is discounted by a logarithmic function of the position number. The assumption is that the later in the list a document is found, the less it is worth to the user. The normalization means that the discounted cumulated gain is compared to the ideal discounted cumulated gain in each position. Thus the nDCG value summarizes the effectiveness in all positions earlier in the ranked list, and compares this summarized effectiveness to the maximum value possible in each position. As the nDCG value is relative to the maximum value possible, it varies between 0 and 1 and gives no bias to topics with small or large recall bases.
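A minimal sketch of the measure follows. It assumes the discounting of Järvelin and Kekäläinen, where gains at ranks below the logarithm base are left undiscounted and later gains are divided by the logarithm of the rank; the base of 2 is an assumption, as the article does not state which base was used. Function names and the example grades are hypothetical.

```python
import math


def dcg(gains, base=2):
    """Discounted cumulated gain at every rank position."""
    total, vector = 0.0, []
    for rank, gain in enumerate(gains, start=1):
        # Ranks below the base are not discounted (avoids dividing by log(1) = 0).
        total += gain if rank < base else gain / math.log(rank, base)
        vector.append(total)
    return vector


def ndcg(run_gains, recall_base_grades, base=2):
    """nDCG at every rank: DCG of the run divided by DCG of the ideal ranking."""
    ideal = sorted(recall_base_grades, reverse=True)
    ideal += [0] * (len(run_gains) - len(ideal))  # pad if the run is longer
    actual = dcg(run_gains, base)
    best = dcg(ideal[:len(run_gains)], base)
    return [a / b if b > 0 else 0.0 for a, b in zip(actual, best)]


# Invented example: grades of the retrieved documents in ranked order,
# and the grades of all assessed documents for the topic (the recall base).
run_gains = [3, 0, 2]
recall_base = [3, 2, 1]
print([round(v, 3) for v in ndcg(run_gains, recall_base)])  # [1.0, 0.6, 0.757]
```

Evaluating the same ranked list in another scenario only requires swapping in that scenario's grades for both `run_gains` and `recall_base`.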
Even though recall and nDCG both measure effectiveness, there is not an absolute correlation between them. Recall is calculated on a binary scale. In this case documents with relevance score 1 are considered non-relevant. The nDCG, on the other hand is calculated on a four-graded scale, 0-3, and all scores from 1 to 3 are included in the measure. This entails that the nDCG value can seem high compared to the recall value if the ranked list includes documents with relevance score 1. On the other hand the recall value can seem high compared to the nDCG value if there are relevant documents late in the ranked list.
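Recall at a cutoff can be sketched in the same spirit. The threshold of 2 follows the statement above that documents with relevance score 1 count as non-relevant; the function name, document IDs and grades are hypothetical.

```python
def recall_at_k(ranked_ids, grades, k, threshold=2):
    """Share of the relevant documents found among the first k retrieved.

    A document counts as relevant if its grade is at least `threshold`.
    """
    relevant = {doc_id for doc_id, grade in grades.items() if grade >= threshold}
    if not relevant:
        return 0.0
    retrieved = set(ranked_ids[:k])
    return len(relevant & retrieved) / len(relevant)


# Invented recall base and ranked list for one topic.
grades = {"a": 3, "b": 1, "c": 2, "d": 0, "e": 2}
ranking = ["b", "a", "d", "c", "e"]

print(round(recall_at_k(ranking, grades, k=2), 3))  # 0.333
print(round(recall_at_k(ranking, grades, k=4), 3))  # 0.667
print(recall_at_k(ranking, grades, k=5))            # 1.0
```

In this example document `b`, with relevance score 1, is ranked first: it contributes to nDCG but not to recall, which is one way the two measures can diverge.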
Tables of experimental runs: Runs for Topic 51; Runs for Topic 66; Runs for Topic 63; Runs for Topic 48; Runs for Topic 7; Runs for Topic 83; Runs for Topic 68. Search keys include allergisk chock ‘allergic shock’ and blodpropp ‘blood clot’.
In most cases, not surprisingly, the expert terms are most effective in the doctors scenario and the lay terms in the patients scenario, but there are both expert and lay terms that achieve the best results in both scenarios. The expert terms tend to give better results in the patients scenario than the lay terms do in the doctors scenario. However, more extensive studies, including comparisons of search results with relative frequencies of lay and expert terms, are needed before definite conclusions can be drawn.
This article describes the process of building a test collection for information retrieval purposes. The process includes the collection of a corpus, creation of search topics, decisions about relevance assessments, such as selecting documents to assess and deciding on the assessment categories for the judges. Further the process includes how to represent the recall bases and how to represent the documents in the collection indexes.
The article goes on to show a number of aspects of medical information retrieval which can be studied utilizing the MedEval test collection. The main novelty of the collection is the marking of document target groups, Doctors and Patients, together with the possibility to choose user group. This opens up new areas of research in Swedish information retrieval, such as how one can retrieve documents suited for different groups of users. As was shown in the example runs, search keys from different registers behave differently in the doctors and in the patients scenario.
A number of differences between the documents written for experts and for non-experts are presented along with the suggestion that these differences may be utilized in future studies of document retrieval for the different user groups.
Not least important is that MedEval is a Swedish domain specific test collection. A test collection in a language other than English allows a new range of research possibilities studying the impact of natural language processing in information retrieval.
The Department of Swedish at the University of Gothenburg is in the process of making the MedEval test collection available to academic researchers.
List of abbreviations used
FAQ: Frequently Asked Questions
nDCG: normalized Discounted Cumulated Gain
The author would like to thank the FIRE (Finnish Information Retrieval Experts) research group at the University of Tampere, Finland, for their invaluable help in building the MedEval test collection.
This article has been published as part of Journal of Biomedical Semantics Volume 2 Supplement 3, 2011: Proceedings of the Second Louhi Workshop on Text and Data Mining of Health Documents. The full contents of the supplement are available online at http://www.jbiomedsem.com/supplements/2/S3.
- OHSUMED: The OHSUMED test collection. 2007, [http://ir.ohsu.edu/ohsumed/ohsumed.html]
- Kokkinakis D: MEDLEX: Technical report. Tech. rep. 2004, Department of Swedish, University of Gothenburg, [http://demo.spraakdata.gu.se/svedk/pbl/MEDLEX_work2004.pdf]
- Harman DK: The TREC Test Collections. TREC – Experiment and evaluation in information retrieval. Edited by: Voorhees EM, Harman DK. 2005, Cambridge, Massachusetts: MIT Press
- Hedlund T: Compounds in Dictionary-Based Cross-Language Information Retrieval. Information Research. 2002, 7 (2)
- Larsen B, Trotman A: INEX 2006 guidelines for topic development. INEX 2006 Pre-proceedings. 2006, [http://www.inex.otago.ac.nz/data/proceedings/INEX2006-preproceedings.pdf]
- Friberg Heppin K: Resolving Power of Search Keys in MedEval a Swedish Medical Test Collection with User Groups: Doctors and Patients. PhD thesis. 2010, University of Gothenburg, [http://hdl.handle.net/2077/22709]
- Lemur: The Lemur Toolkit for language modeling and information retrieval. Carnegie Mellon University and the University of Massachusetts nd, [http://www.lemurproject.org/]
- Ahlgren P: The effects of indexing strategy-query term combination on retrieval effectiveness in a Swedish full text database. PhD thesis. 2004, University College of Borås/Göteborg University, Publications from Valfrid, nr 28
- Saracevic T: Relevance: A review and a framework for the thinking on the notion in information science. Journal of the American Society for Information Science. 1975, 39 (3): 321-343.
- Cleverdon C: The Cranfield tests on index language devices. Aslib Proceedings. 1967, 19: 173-192. 10.1108/eb050097.
- Harter SP: Variations in relevance assessments and the measurement of retrieval effectiveness. Journal of the American Society for Information Science. 1996, 47: 37-49. 10.1002/(SICI)1097-4571(199601)47:1<37::AID-ASI4>3.0.CO;2-3.
- Voorhees EM: Variations in relevance judgments and the measurement of retrieval effectiveness. Information Processing and Management. 2000, 36: 697-716. 10.1016/S0306-4573(00)00010-8. [http://www.jasonmorrison.net/iakm/cited/Voorhees_E_variations_in_relevance_judgements.pdf]
- Sormunen E: Liberal relevance criteria of TREC – Counting on negligible documents?. Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. 2002
- Karlgren J: Stylistic Experiments for Information Retrieval. PhD thesis. 2000, Department of Linguistics, Stockholm University
- Järvelin K, Kekäläinen J: Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems. 2002, 20 (4): 422-446. 10.1145/582415.582418.
- Dioşan L, Rogozan A, Pècuchet JP: Automatic Alignment of Medical Terminologies with General Dictionaries for an Efficient Information Retrieval. Information Retrieval in Biomedicine: Natural Language Processing for Knowledge Integration. Edited by: Prince V, Roche M. 2009, Medical Information Science Reference, 78-105.
- Kokkinakis D, Toporowska Gronostaj M: Comparing Lay and Professional Language in Cardiovascular Disorders Corpora. WSEAS Transactions on biology and biomedicine. 2006, 3: 429-437.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.