AQ21 rule learning
AQ21 is a multi-task ML and data mining system for attributional rule learning and rule testing that can be applied to a wide range of classification problems [17]. It was developed in the Machine Learning and Inference Laboratory (MLI) at George Mason University. The system has recently been extended to include features specific to processing biomedical data [18]. AQ21 is a type of natural induction system that seeks to identify patterns represented as attributional rules [19] that are easily interpretable to end users. The basic form of an attributional rule is: CONSEQUENT <= PREMISE, where both CONSEQUENT and PREMISE are conjunctions of attributional conditions. Each attributional condition involves attributes present in the data or constructed by the program. Additionally, AQ21 can learn rules with exceptions, given by the formula CONSEQUENT <= PREMISE |_ EXCEPTION. The AQ21 system can also handle inconsistencies in data: it learns standard rules and generates exception phrases that represent covered negative examples. EXCEPTION can be either an attributional conjunctive description or a list of examples constituting exceptions to the rule. In the medical datasets, the exceptions are always negative examples such as cancer recurrence and disease progression.
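To make this form concrete, the sketch below encodes a single hypothetical rule with an exception as plain Python; the attribute names, values, and function names are illustrative assumptions, not AQ21's internal representation or output:

```python
# Sketch of an attributional rule CONSEQUENT <= PREMISE |_ EXCEPTION.
# Attribute names and values are hypothetical illustrations, not AQ21 output.

def matches(conditions, example):
    """True when the example satisfies every attributional condition."""
    return all(example.get(attr) in allowed for attr, allowed in conditions.items())

# PREMISE: a conjunction of attributional conditions.
premise = {"stage": {"II", "III"}, "diabetes": {"yes"}}
# EXCEPTION: an attributional conjunctive description covering negative examples.
exception = {"age_group": {"<50"}}

def rule_predicts(example):
    """The rule fires when the premise holds and the exception does not."""
    return matches(premise, example) and not matches(exception, example)
```

A rule of this shape predicts the consequent for any example matching the premise, except for examples also matching the exception description.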
Learning rules with AQ21 consists of several steps, which can be classified as input preprocessing, rule generation, and rule optimization. The steps are generally executed in this order, although AQ21’s learning process is iterative in several ways. Input preprocessing includes rearranging data into classes, removing ambiguous examples, and modifying the representation space through simple preprocessing methods (i.e., discretization, attribute selection) or more advanced ones that employ constructive induction algorithms [20]. At its core, rule learning implements a modified, simplified version of the algorithm quasi-optimal (Aq) for constructing rules, a well-known sequential covering algorithm [21]. The algorithm starts with a randomly selected positive example, called the seed, and generates all possible high-quality rules that cover the seed and do not cover (or approximately do not cover) any of the negative examples. The top-quality rules are then selected and stored. Among the positive examples not covered by these selected rules, another random seed is selected and the operation is repeated. This process results in a number of very general rules (typically more than needed) that need to be optimized and prepared for output. Rule optimization includes trimming, adjusting generality by following hierarchies, selection, and mapping of attributes. The overall goal of AQ21 is to produce rules that maximize user-defined quality criteria, which typically provide a tradeoff between accuracy (precision/recall) and simplicity and transparency. Finally, the program employs a number of methods designed to provide output in human-oriented forms, including translating the rules into a natural language representation (layman's terms) [22].
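The covering loop described above can be sketched as follows; `best_rule_for_seed` is a crude placeholder for Aq's actual star generation and beam search, and all names and data structures are illustrative assumptions rather than AQ21's implementation:

```python
import random

def covers(rule, example):
    """A rule is a conjunction of conditions: attribute -> set of allowed values."""
    return all(example.get(attr) in allowed for attr, allowed in rule.items())

def best_rule_for_seed(seed, negatives):
    """Crude stand-in for Aq's star generation: add the seed's own attribute
    values as conditions until no negative example is covered."""
    rule = {}
    for attr, value in seed.items():
        rule[attr] = {value}
        if not any(covers(rule, neg) for neg in negatives):
            break
    return rule

def sequential_covering(positives, negatives, rng=None):
    """Repeat: pick a random uncovered positive seed, learn a rule from it,
    then drop the positive examples the new rule covers."""
    rng = rng or random.Random(0)
    uncovered = list(positives)
    rules = []
    while uncovered:
        seed = rng.choice(uncovered)
        rule = best_rule_for_seed(seed, negatives)
        rules.append(rule)
        uncovered = [p for p in uncovered if not covers(rule, p)]
    return rules
```

Because each learned rule always covers its own seed, the set of uncovered positives shrinks on every iteration and the loop terminates.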
AQ21 is the latest in a series of AQ rule learners that dates back to the 1970s [23]. A number of well-known rule learners have been developed over the last decades [24,25,26], but many are not utilized in mainstream research at present. In the past few years the ML field has been dominated by statistical methods focused primarily on producing highly accurate models. However, the community has slowly begun transitioning back toward understandability and transparency of the produced models, which is particularly important in biomedical applications.
Ontology-guided AQ21 (AQ21-OG)
AQ21-OG is an extension of the AQ21 rule learning system. It applies hierarchical reasoning methods [27] to incorporate UMLS and other ontologies when analyzing data. Currently, the program allows for mapping IS-A relationships. The implementation of AQ21-OG includes:
-
Step 1: Mapping data to UMLS CUIs. This step is used to identify the base CUIs. The candidate CUIs are identified automatically (via SQL queries), and problematic mappings are then reviewed by experts.
-
Step 2: Extracting complete sub-hierarchies by following IS-A relationships using base CUIs. This is done by following IS-A relationships in the UMLS for each concept until the complete parent, child, and sibling sub-hierarchy is extracted. The complete sub-hierarchy is defined as the path from base CUI (furthest child(ren) in the hierarchy) to the root (“super parent”, i.e. a parent that is not also a child). This extraction is the basis for the input file (in Step 4) that AQ21 will use to find the farthest common ancestors for base CUIs (in Step 5).
-
Step 3: Resolving inconsistencies in the hierarchy. Because the UMLS is constructed from multiple source terminologies, a number of inconsistencies (e.g., cycles, duplicates) may occur [28,29,30,31]. Cycles are not permitted in AQ21, so they are resolved by breaking links that connect back to concepts higher in the hierarchy, as measured by distance from the root. Other types of inconsistencies are removed from the final hierarchy.
-
Step 4: Encoding extracted hierarchies into an ML-software-readable format. AQ21 requires a list of parent-child pairs for all relationships that form the hierarchy. The data are read from text files that include all semantic information required to correctly reason with the data. Specifically, in AQ21, hierarchical relationships are part of the definition of attributes’ domains (sets of possible values) that describe the data.
-
Step 5: Optimizing the rules by using the UMLS hierarchies extracted in Step 2. AQ21-OG finds the highest level of generalization in the hierarchy that is either consistent with the data or maximizes the rule quality measures. This is particularly valuable when analyzing coded medical data with potentially hundreds of thousands of binary attributes. For example, ICD-9-CM diagnosis codes alone can require close to 10,000 binary attributes. Generalizing those codes is therefore necessary to reduce the number of features.
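The steps above can be sketched end-to-end on a toy hierarchy; the concept identifiers below are invented placeholders (not real UMLS CUIs), and the ancestor search is a much-simplified stand-in for AQ21-OG's generalization:

```python
# Toy IS-A hierarchy stored as child -> parent pairs, the parent-child
# format AQ21 reads (Step 4). Concept names are invented placeholders,
# not real UMLS CUIs.
IS_A = {
    "C_hand_arthritis": "C_arthritis",
    "C_knee_arthritis": "C_arthritis",
    "C_arthritis": "C_musculoskeletal",
    "C_back_pain": "C_musculoskeletal",
    "C_musculoskeletal": "C_disease",  # "C_disease" is the root ("super parent")
}

def ancestors(cui):
    """Path from a base CUI to the root by following IS-A links (Step 2)."""
    path = [cui]
    while path[-1] in IS_A:
        path.append(IS_A[path[-1]])
    return path

def common_ancestor(cuis):
    """Deepest concept shared by all paths: the kind of generalization
    AQ21-OG substitutes for a set of base CUIs when it is consistent
    with the data or improves rule quality (Step 5)."""
    paths = [set(ancestors(c)) for c in cuis]
    shared = set.intersection(*paths)
    # Pick the shared concept farthest from the root (longest path to the root).
    return max(shared, key=lambda c: len(ancestors(c))) if shared else None
```

Here, generalizing the two arthritis concepts yields the shared parent concept, whereas generalizing arthritis with back pain must climb one level higher.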
Study population
SEER-MHOS (Surveillance, Epidemiology, and End Results - Medicare Health Outcomes Survey) data from 1998 to 2011 (1,849,311 records) were used to extract comorbidities and activities of daily living (ADLs), as well as cancer characteristics. This dataset links two large population-based data sources that provide detailed information about Medicare beneficiaries with cancer [32]. The SEER data, extracted from the cancer registry, contain clinical, demographic, and cause-of-death information for persons with cancer, while the MHOS data are extracted from survey responses and provide information about the health-related quality of life (HRQOL) of Medicare Advantage Organization (MAO) enrollees.
A number of steps were followed to create the study population dataset. First, the study population was limited to those who completed at least one survey before their cancer diagnosis and one survey roughly one year after the diagnosis. If a patient completed multiple surveys, the surveys closest to before the cancer diagnosis and the 1-year follow-up were used. These very strict criteria significantly reduced the sample size and resulted in a cohort of 723 cancer patients.
Dependent/Output Variables: the primary outcomes were six ADLs (walking, dressing, bathing, moving in/out of chair, toileting, and eating) reported in a patient survey taken one year after the cancer diagnosis.
Independent/Input Variables: the potential predictors were selected based on prior research [33,34,35,36,37] and are as follows:
-
(1)
Patient demographics: age, race and marital status
-
(2)
Six ADLs reported in a patient survey taken before the cancer diagnosis
-
(3)
Thirteen self-reported comorbidities extracted from a patient survey taken before the cancer diagnosis: Angina Pectoris/Coronary Artery Disease, Arthritis of Hand/Wrist, Arthritis of Hip/Knee, Back pain, Congestive heart failure, Emphysema/Asthma/Chronic obstructive pulmonary disease, Diabetes, Crohn’s Disease/Ulcerative Colitis/Inflammatory Bowel Disease, Hypertension, Myocardial Infarction, Other Heart Conditions, Sciatica and Stroke
-
(4)
Six cancer characteristics, namely grade, staging, tumor size, histology, tumor extension, and behavior, extracted from the SEER registry
-
(5)
Cancer radiation and surgery treatment indicators extracted from the SEER registry
Analysis of the SEER-MHOS data with AQ21 and AQ21-OG
The dataset was randomly divided into training (80%) and testing (20%) sets. The training set was used to create predictive models and the testing set was used to assess model discrimination. Models were created to find the predictor or set of predictors that could be used to predict the outcome (the six ADLs post cancer diagnosis). Two ML methods were used to create models: AQ21 and AQ21-OG, as described above. The quality of the two methods was assessed using the numbers of positive (p) and negative (n) cases covered by the generated rules and the quality of the rules, Q(w). The quality of a rule R with weight w, Q(R,w), or just Q(w) (denoted by q in the rule), is calculated using the following formula described by Michalski and Kaufman [38]. P and N indicate the total numbers of positive and negative examples in the data (here, disabled vs. functionally independent in terms of ADLs).
$$ {\displaystyle \begin{array}{l}\mathrm{Q}\left(\mathrm{R},\mathrm{w}\right)=\mathrm{compl}{\left(\mathrm{R}\right)}^{\mathrm{w}}\times \mathrm{consig}{\left(\mathrm{R}\right)}^{1-\mathrm{w}}\hfill \\ {}\mathrm{where}\hfill \\ {}\mathrm{compl}\left(\mathrm{R}\right)=\mathrm{p}/\mathrm{P}\hfill \\ {}\mathrm{consig}\left(\mathrm{R}\right)=\left(\left(\mathrm{p}/\left(\mathrm{p}+\mathrm{n}\right)\right)-\left(\mathrm{P}/\left(\mathrm{P}+\mathrm{N}\right)\right)\right)\times \left(\mathrm{P}+\mathrm{N}\right)/\mathrm{N}\hfill \end{array}} $$
Here, w is a weight (from 0 to 1) that represents the tradeoff between completeness and consistency gain. The lower w is, the more consistent the rules need to be (fewer negative examples covered); the higher w is, the more complete the rules need to be (more positive examples covered). Based on experimental evaluation of the rules, we selected w = 0.3, which gives slightly higher weight to more consistent rules. This value was used in both cases, with and without the ontology. Completeness is frequently referred to as recall in machine learning; consistency gain can be viewed as a normalized precision that measures how much precision is gained over a random guess.
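The quality measure above can be computed directly; a minimal sketch in Python (the function name and example counts are illustrative, and the sketch assumes the rule beats a random guess so that consig(R) is positive):

```python
def rule_quality(p, n, P, N, w=0.3):
    """Q(R, w) = compl(R)^w * consig(R)^(1 - w) (Michalski and Kaufman).
    p, n: positive/negative examples covered by the rule;
    P, N: total positive/negative examples in the data.
    Assumes consig(R) > 0, i.e. the rule is better than a random guess."""
    compl = p / P                                       # completeness (recall)
    consig = (p / (p + n) - P / (P + N)) * (P + N) / N  # consistency gain
    return compl ** w * consig ** (1 - w)
```

For a perfect rule (p = P, n = 0), both compl and consig equal 1, so Q = 1 regardless of w.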