Evaluation
In general it is extremely difficult to determine ground truth for the actual numbers and durations of disease outbreaks. As a silver standard we chose ProMED-mail [11], the best publicly available human network of reporters. ProMED-mail is a program of the International Society for Infectious Diseases with many expert volunteer reporters globally and a sophisticated staged editorial process. Outbreak reports are distributed to 40,000 subscribers by email, RSS feed and Web portal: precisely the audience we target with our automated system.
In this study we used a coarse granularity, choosing countries and days as the units of analysis. This reflects the current limits of reliable location detection in the system and the frequency of the news we observe. The recorded time for each event was normalized to the system download time; downloads take place every hour of each day.
Evaluation uses the standard classification test measures of sensitivity (recall), specificity, positive predictive value (PPV, or precision) and negative predictive value (NPV), together with timeliness. We also measured the average number of system alarms per 100 days and compared this to the silver standard. The F-measure (F1) is calculated in the usual way as the harmonic mean of sensitivity and PPV.
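For reference, the harmonic mean takes the familiar form:

$$F_1 = \frac{2 \times \mathrm{PPV} \times \mathrm{sensitivity}}{\mathrm{PPV} + \mathrm{sensitivity}}$$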
As in our previous study, the standard for a true positive was a system alert on a country-disease event on or before the silver standard alert. To allow for compatibility and comparison we kept the period for a qualifying system alert at up to 7 days prior to and including a qualifying ProMED report on the same topic. Other history period lengths might be more or less effective but were not the target of the investigation in this study. True positives were increased by 1 if any system alert fell within the 7-day period; multiple system alerts within the same window did not count twice. False positives were increased by 1 for each system alert that fell outside of the 7-day window. False negatives were counted as the number of qualifying alert periods containing no system alert. True negatives were counted as the number of days outside of any qualifying alert period on which no system alert was given. In testing we tried to maximize F1 together with timeliness. A minimal sketch of this scoring logic is given below.
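The following sketch makes the counting scheme concrete. The function name and the day-indexed set representation are illustrative, not taken from the system:

```python
def score_stream(system_alert_days, promed_alert_days, n_days, window=7):
    """Score one country-disease stream against the silver standard.

    system_alert_days: set of day indices on which the system alerted
    promed_alert_days: set of day indices with a qualifying ProMED report
    window: days prior to a ProMED report that still qualify (7 here)
    """
    tp = fn = 0
    qualifying = set()  # all days covered by some qualifying alert period
    for p in promed_alert_days:
        # a system alert up to 7 days before, and including, day p qualifies
        period = set(range(max(0, p - window), p + 1))
        qualifying |= period
        if system_alert_days & period:
            tp += 1  # multiple alerts in one window still count once
        else:
            fn += 1
    # every system alert outside all qualifying windows is a false positive
    fp = len(system_alert_days - qualifying)
    # true negatives: silent days that lie outside any qualifying period
    tn = sum(1 for d in range(n_days)
             if d not in qualifying and d not in system_alert_days)
    return tp, fp, fn, tn
```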
Data
Figure 1 shows the 16 event streams that we explored. The events chosen for this study were selected for diversity of geographical and media coverage rather than at random. The 16 event streams contain 2064 surveillance days with 153 events (7.4% of surveillance days). (Note that system data from the study will be made publicly available online for re-use via the GENI database interface on BioCaster.) Since we wanted to explore the hypothesis that linguistic coverage in multiple languages could strengthen detection rates and timeliness, we compared English news coverage against all languages including English for each of the 16 disease outbreaks. English was chosen as the baseline because of its overall geographic representativeness. An alternative and perhaps more realistic approach might have been to use the native language of each outbreak country as the baseline, which we will consider in future investigations. Because cross-lingual events in the 13 languages were only available in our system from December 2009, the trial period was from January to May 2010.
ProMED reports used in the silver standard excluded those that fell outside our case definition, based on the International Health Regulations [8] decision tree instrument. Excluded reports included, for example, requests for information, reports focused primarily on control measures, and aggregated summary reports not arising from specific events.
Text mining system
The text mining system we explored involves a semantic pipeline of modules running on a high-throughput cluster computer with 48 Xeon cores. Throughput is approximately 9000 articles per day. News was gathered from multiple sources through Google News and MeltWater News as well as specialized sources such as the European Media Monitor, IRIN and ReliefWeb. (Note that no ProMED-mail messages were included in the system data for this study; they were excluded using a block on the Internet domain and message title.) In total this gives us access to over 80,000 news sources globally. The languages used in the study (in ISO-639-1) are: ar, zh, nl, en, fr, de, it, ko, pt, ru, es, vi and th.
Underlying the system is a publicly available multilingual application ontology [12] which is used within the rule books to make basic inferences, such as countries from the names of provinces, or diseases from their causal pathogens. The BioCaster ontology (BCO) rules also allow us to unify variant forms of terms, such as the 11 forms of A(H1N1).
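As an illustration only (the entries below are invented, not drawn from the BCO), such inferences reduce to simple table lookups once the ontology is loaded; the rule books apply them during pattern matching:

```python
# Hypothetical fragments of ontology-backed lookup tables; the real BCO
# is far larger and covers all 13 languages of the study.
PROVINCE_TO_COUNTRY = {"Hubei": "China", "Ontario": "Canada"}
PATHOGEN_TO_DISEASE = {"Vibrio cholerae": "cholera"}
TERM_VARIANTS = {"swine flu": "A(H1N1)", "H1N1 2009": "A(H1N1)"}

def normalize(term):
    """Map a surface term to its BCO root term where a variant is known."""
    return TERM_VARIANTS.get(term, term)
```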
After data sourcing, translation takes place from the twelve non-English languages used in this study using Google's online translation system. As a quality reference point we refer to a recent large-scale evaluation of machine translation for European language pairs [13]. That study, conducted on news texts, found that across a wide variety of metrics Google's online system consistently performed among the highest-quality systems for the Spanish-English, French-English and German-English language pairs.
Following machine translation, text classification using Naive Bayes (F1 0.93) removes non-disease outbreak news before text mining is applied. Rules are based on a regular expression matching toolkit called the Simple Rule Language [14] and are divided between 18 entity types and template rules. The final structured event frames in XML include slot values normalized to BCO root terms for disease, pathogen (virus or bacterium), time period, country and province. We additionally identify 15 aspects of public health events critical to risk assessment. For the purpose of this study we made use of only the disease and country slots. Events in the 13 languages are treated in this study as part of a univariate model for comparison purposes against English events.
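Since only the disease and country slots feed the alerting stage, each day's event frames can be reduced to one count series per country-disease pair. A minimal sketch, where the frames are represented as dictionaries with illustrative field names rather than the system's actual XML schema:

```python
from collections import Counter

def daily_counts(frames):
    """Count events per (country, disease, day) from structured frames.

    Each frame is assumed to carry BCO-normalized 'country' and 'disease'
    slots plus a 'day' value; the other 15 aspect slots are ignored here.
    """
    counts = Counter()
    for f in frames:
        counts[(f["country"], f["disease"], f["day"])] += 1
    return counts
```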
Latitude and longitude of events down to the province level are found automatically using Google's API, up to a limit of 15,000 lookups per day, and then by lookup on 5,000 country and province names harvested from Wikipedia.
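A sketch of this two-tier lookup follows. The `google_geocode` wrapper is a hypothetical stand-in for the online API call, and the gazetteer is shown as an empty dictionary; neither name comes from the actual system:

```python
DAILY_LIMIT = 15000   # lookup quota noted in the text
lookups_today = 0
gazetteer = {}        # place name -> (lat, lon), harvested from Wikipedia

def google_geocode(place_name):
    """Stub standing in for the online API call; returns (lat, lon) or None."""
    return None  # a real wrapper would issue the web request here

def locate(place_name):
    """Resolve a place name to coordinates, preferring the online API."""
    global lookups_today
    if lookups_today < DAILY_LIMIT:
        lookups_today += 1
        coords = google_geocode(place_name)
        if coords is not None:
            return coords
    return gazetteer.get(place_name)  # offline fallback table
```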
Alerting models
We experimented with a range of popular models for early alerting used in the public health community: the Early Aberration Reporting System (EARS) quality control chart models C3, C2 and W2, as well as the F-statistic and the Exponentially Weighted Moving Average (EWMA). All were implemented in Excel for the purpose of this study. The models are what might be termed 'snapshot' models because they all use short 7-day baselines that assume a relatively stationary background, i.e. they ignore medium to long term periodic variations such as seasonal cycles. The baselines are used to predict future trends against which the current day's values are compared. All models also use a 2-day 'guard period' just before the target day t to prevent the current day's data from being included in the baseline. All models are minimally supervised, requiring only a threshold parameter, which we determined using the same 5 held-out data sets used by [4]. These thresholds were 0.2 (C2 and W2), 0.3 (C3), 0.6 (F-statistic) and 2.0 (EWMA). A minimum standard deviation was set at 0.2 and a frequency purge was applied to remove singleton events, i.e. those with counts of 1 per day. The windowing shared by all the models is sketched below.
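The following helpers express the shared windowing on a simple list of daily counts; the parameter names are ours, and day indexing assumes `t` is large enough for a full baseline:

```python
def baseline_window(counts, t, baseline=7, guard=2):
    """7-day baseline for target day t, kept behind a 2-day guard period.

    counts: list of daily event counts. The baseline covers days
    t-guard-baseline .. t-guard-1, so the guard days never leak in.
    Assumes t >= baseline + guard.
    """
    return counts[t - guard - baseline : t - guard]

def mean_sd(window, min_sd=0.2):
    """Baseline mean and standard deviation, with the study's 0.2 SD floor."""
    n = len(window)
    mu = sum(window) / n
    var = sum((x - mu) ** 2 for x in window) / n
    return mu, max(min_sd, var ** 0.5)
```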
C2
The EARS algorithms [15] are based on cumulative sum calculations commonly used in quality control. C2 triggers an alert when a test statistic $S_t$ exceeds the alerting threshold, i.e. when the count on the target day rises more than a number $k$ of standard deviations above the baseline mean:

$$S_t = \max\left(0, \frac{C_t - (\mu_t + k\sigma_t)}{\sigma_t}\right)$$

where $C_t$ is the event count on the target day, and $\mu_t$ and $\sigma_t$ are the mean and standard deviation of the counts during the baseline period. We set $k$ to 1 for all experiments.
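Under our reconstruction of the statistic, and reusing the windowing helpers sketched earlier, C2 can be expressed as:

```python
def c2_statistic(counts, t, k=1.0):
    """C2 test statistic for day t; an alert fires when it exceeds 0.2."""
    mu, sd = mean_sd(baseline_window(counts, t))
    return max(0.0, (counts[t] - (mu + k * sd)) / sd)
```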
C3
C3 is a modified version of C2 in which the statistics for the previous 2 observations (within the guard period) are added to the test statistic, provided the counts on those days do not exceed their means plus 3 standard deviations. The rationale is to extend the sensitivity of C2.
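On the same representation, C3 might be sketched as follows, building on the `c2_statistic` function above:

```python
def c3_statistic(counts, t, k=1.0):
    """C3: C2 plus the guard-day statistics, when those days were calm."""
    s = c2_statistic(counts, t, k)
    for d in (t - 1, t - 2):              # the 2-day guard period
        mu, sd = mean_sd(baseline_window(counts, d))
        if counts[d] <= mu + 3 * sd:      # skip guard days that were aberrant
            s += c2_statistic(counts, d, k)
    return s
```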
W2
W2 [16] is a stratified version of C2 which compensates for weekend data outages by removing Saturday and Sunday counts from the baseline. Alerting, however, can take place on any day.
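A sketch under the assumption that weekend days are simply dropped from the 7-day span (whether the baseline is extended backwards to recover a full 7 weekday counts is a detail we leave aside):

```python
def w2_statistic(counts, t, weekdays, k=1.0):
    """W2: C2 with Saturday and Sunday counts dropped from the baseline.

    weekdays: day of week (0=Monday .. 6=Sunday) for each entry of counts.
    Alerting itself can still occur on any day of the week.
    """
    span = range(t - 2 - 7, t - 2)                       # baseline behind the guard
    base = [counts[d] for d in span if weekdays[d] < 5]  # weekdays only
    mu, sd = mean_sd(base)
    return max(0.0, (counts[t] - (mu + k * sd)) / sd)
```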
F-statistic
The F-statistic [17] is calculated as the ratio of two variance estimates:

$$F_t = \frac{s_1^2}{s_0^2}$$

where $s_1^2$ approximates the variance during the testing window and $s_0^2$ approximates the variance during the baseline window. The calculation is as follows:

$$s_1^2 = \frac{1}{|T|}\sum_{i \in T}(C_i - \mu_t)^2, \qquad s_0^2 = \frac{1}{|B|}\sum_{i \in B}(C_i - \mu_t)^2$$

where $T$ and $B$ are the sets of days in the testing and baseline windows respectively, and $\mu_t$ is the baseline mean.
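A sketch under the reconstruction above; the length of the testing window (`test_len`) is illustrative, as the text does not fix it here:

```python
def f_statistic(counts, t, test_len=3):
    """Ratio of testing-window variance to baseline-window variance.

    Both variances are taken about the baseline mean, per the equations
    above; the SD floor of 0.2 is applied to the denominator.
    """
    base = baseline_window(counts, t)
    mu = sum(base) / len(base)
    test = counts[t - test_len + 1 : t + 1]            # window ending on day t
    s1 = sum((x - mu) ** 2 for x in test) / len(test)
    s0 = sum((x - mu) ** 2 for x in base) / len(base)
    return s1 / max(s0, 0.2 ** 2)
```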
EWMA
Unlike the other models in our test, the EWMA provides for a non-uniformly weighted baseline by down-weighting counts on days further from the target day:

$$E_t = \lambda C_t + (1 - \lambda)E_{t-1}$$

where $1 > \lambda > 0$ is a parameter that controls the degree of smoothing. The optimal level found from held-out data was 0.2. The test statistic is calculated as:

$$S_t = \frac{E_t - \mu_t}{\sigma_t\sqrt{\lambda / (2 - \lambda)}}$$

As above, $\mu_t$ and $\sigma_t$ are the mean and standard deviation on the baseline window.