Core Ontology of Phenotypes (COP)
We developed the Core Ontology of Phenotypes (COP, Fig. 3) to model, classify and calculate phenotypes based on instance data sets (e.g., of a patient). In this article, we consider a phenotype as a dependent individual (in the sense of General Formal Ontology, GFO [20]), for example, the weight of a specific person. Hereinafter, abstract instantiable entities that are instantiated by phenotypes are called phenotype classes. For instance, the abstract property ‘weight’ possesses individual weights as instances. We distinguish between single and composite properties, and correspondingly, between single and composite phenotypes. A composite property is defined as a property that has single properties as parts [22]. Based on the definitions of single and composite properties [22], we define single phenotypes as single properties (e.g., age, weight, height) and composite phenotypes as composite properties (e.g., height and weight, BMI, SOFA score [33]) of an organism or of one of its subsystems. Properties of an organism are considered as all documentable information about it, whereby the modeller is left to decide what is relevant to the current situation. These can be, for example, observable characteristics or traits of an organism [1,2,3] or possible manifestations of clinical phenotypes, such as signs, symptoms or dispositions [4]. The corresponding data can be modelled using the FHIR Observation or Condition resources.
Composite phenotypes are divided into combined and derived phenotypes. A combined phenotype is only a combination of corresponding phenotypes (e.g., a combination of height and weight), whereas a derived phenotype is an additional property (e.g., BMI) derived from the corresponding phenotypes (height and weight). In the framework of GFO we modelled properties using the class gfo:Property. In the present article, composite phenotype classes are modelled using a Boolean expression based on the has_part relation (e.g., weight and height: has_part some height and has_part some weight). Derived phenotype classes additionally define a calculation rule or mathematical formula (e.g., BMI = weight [kg] / height [m]²). Furthermore, combined phenotype classes can associate certain conditions with specific predefined values (scores), which can be used, e.g., in further formulas. For example, if the bilirubin value is greater than 12 mg/dL, then the value 4 is used for the calculation of the SOFA score [33].
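The two kinds of rules mentioned above (a formula for a derived phenotype and a condition-to-score mapping for a combined phenotype) can be illustrated with a minimal Python sketch. The function names are hypothetical and not part of the COP or PhenoMan; only the single bilirubin rule quoted in the text is modelled, all other SOFA ranges are deliberately left out.

```python
from typing import Optional


def bmi(weight_kg: float, height_m: float) -> float:
    """Derived phenotype: BMI = weight [kg] / height [m]^2."""
    if height_m <= 0:
        raise ValueError("height must be positive")
    return weight_kg / height_m ** 2


def bilirubin_score(bilirubin_mg_dl: float) -> Optional[int]:
    """Score rule quoted in the text: bilirubin > 12 mg/dL yields the value 4.

    Other ranges are not modelled in this sketch, so None is returned.
    """
    return 4 if bilirubin_mg_dl > 12 else None
```

For example, `bmi(80, 2.0)` evaluates the formula for a weight of 80 kg and a height of 2 m, and `bilirubin_score(13.0)` yields the score that would enter the SOFA calculation.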
Additionally, we distinguish between restricted and non-restricted phenotype classes, depending on whether their extensions (set of instances) are restricted to a certain range of individual phenotypes by defined conditions or all instances are allowed. For example, the phenotype class ‘age’ is instantiated by the ages of all living beings (non-restricted), whereas the phenotype class ‘young age’ is instantiated by the ages of the young ones, e.g., if the age is below 30 years (restricted).
Phenotype Specification Ontologies (PheSO)
We consider a phenotype algorithm as a sequence of instructions (1) to classify phenotypes (single or composite) in phenotype classes or (2) to derive additional properties (derived phenotypes) from the phenotypes of an organism. Phenotype algorithms can be implemented, for example, using a programming language or statistical software (e.g., SPSS or R). Our approach is to separate the specification of phenotypes (models) from the implementation of corresponding algorithms. The COP provides a basic model to specify phenotypes in a standardised way, while the PhenoMan implements the general approach common to all COP-based specifications. It is not our aim to completely model the EHR. Instead, our approach can support the modelling and calculation of selected phenotypes in a user-friendly, standardised manner.
Phenotypes are modelled in Phenotype Specification Ontologies (PheSO) using the COP. The phenotype classes and axioms (classification and calculation rules) contained in the PheSO are used by PhenoMan to execute the corresponding phenotype algorithm. PheSOs are embedded in the COP in such a way that the classes of the PheSO are subclasses of the COP classes. Every PheSO subclass of the COP classes cop:Single_Phenotype, cop:Combined_Phenotype or cop:Derived_Phenotype is a phenotype class and is instantiated by phenotypes. The direct subclasses are non-restricted (e.g., Fasting_Glucose, Fig. 4a), while the subclasses of the non-restricted phenotype classes are restricted (e.g., Fasting_Glucose_ABNORMAL, i.e., fasting glucose is greater than or equal to 125 mg/dL, Fig. 4a3).
Phenotype classes possess various common attributes (e.g., labels, descriptions and codes of external concepts). Other attributes vary depending on the type of the phenotype class. The following are examples of such attributes:
- Non-restricted single phenotype (NSiP) class: unit of measure and optional aggregate function.
- Restricted single (RSiP) and derived phenotype (RDeP) class: restriction.
- Restricted combined phenotype (RCoP) class: Boolean expression (based on RSiP, RCoP and RDeP classes) and optional score value.
- Non-restricted derived phenotype (NDeP) class: mathematical formula and Boolean expression consisting of AND-linked variables used in the formula (NSiP and non-restricted combined phenotype (NCoP) classes). If a NCoP class is used as a variable, the RCoP classes (subclasses) of the NCoP class must have score values that should be used in the formula.
Simple attributes of the phenotype classes are defined as annotations. The logical relations between phenotype classes as well as range restrictions are represented in OWL by anonymous equivalent classes or general class axioms based on property restrictions.
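To make the attribute sets concrete, the following Python sketch mirrors an NSiP class and an RSiP class as plain data classes. This is only an illustration under stated assumptions: in a PheSO these attributes are OWL annotations and property restrictions, and the type and field names used here are hypothetical. The example values (Fasting_Glucose, restriction at 125 mg/dL) are taken from the T2DM example in this article.

```python
import operator
from dataclasses import dataclass
from typing import Callable, Dict, Optional

# Comparison operators usable in a value range restriction (illustrative set).
_OPERATORS: Dict[str, Callable[[float, float], bool]] = {
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
    "=": operator.eq,
}


@dataclass(frozen=True)
class NonRestrictedSinglePhenotype:
    """NSiP class: unit of measure and optional aggregate function."""
    name: str
    unit: Optional[str] = None
    aggregate_function: Optional[str] = None


@dataclass(frozen=True)
class RestrictedSinglePhenotype:
    """RSiP class: a restriction on the values of its parent NSiP class."""
    name: str
    parent: NonRestrictedSinglePhenotype
    comparator: str
    bound: float

    def contains(self, value: float) -> bool:
        """Return True if the value satisfies the restriction."""
        return _OPERATORS[self.comparator](value, self.bound)


fasting_glucose = NonRestrictedSinglePhenotype("Fasting_Glucose", unit="mg/dL")
fasting_glucose_abnormal = RestrictedSinglePhenotype(
    "Fasting_Glucose_ABNORMAL", fasting_glucose, ">=", 125
)
```

A value of 130 mg/dL satisfies `fasting_glucose_abnormal.contains(130)`, which corresponds to the classification a reasoner would perform on the equivalent property restriction.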
Phenotype Manager (PhenoMan)
We developed the software Phenotype Manager (PhenoMan), which implements a multistage reasoning approach combining standard reasoners (e.g., Pellet or HermiT) and mathematical calculations. This section briefly outlines the main functionality of our solution.
Specification of phenotypes
The PhenoMan Editor is an interactive user interface for managing and developing PheSOs. The user is able to create a new PheSO or to load an existing ontology. The PhenoMan Editor provides appropriate forms to browse, create and edit categories and phenotype classes of the ontology. Value range restrictions, for example, are defined by selecting a comparison operator and entering the corresponding values (Fig. 5). Boolean expressions are built by dragging and dropping the phenotype classes from the left side into the expression form field and entering relevant operators (Fig. 6). After submission, the form data is transmitted to the PhenoMan API and is stored in the current PheSO.
Furthermore, an ART-DECOR specification (XML) of relevant data elements can be imported into the PheSO. For each data element, a NSiP class is generated. All relevant attributes (name, codes, FHIR resource type, data type, unit, etc.) specified in ART-DECOR are defined as annotations of corresponding classes (Fig. 2a, b).
Data procurement
After starting the PhenoMan Service, FHIR subscriptions (rest-hooks) [34] are generated and transmitted to the FHIR Server. The Subscription resource has a simple structure; its main parts are the criteria and the channel. The FHIR Server uses the criteria (a FHIR Search query) to determine the resources for which notifications have to be generated. When a created or updated resource meets the criteria, a notification is sent to the address (‘endpoint’) specified in the section ‘channel’.
In the configuration file of the PhenoMan, a directory containing all available phenotype specifications (PheSOs) as well as the address (URL) of the PhenoMan service (including the PheSO name) are defined. For each NSiP class of each available PheSO (in the defined directory) a subscription is created. To generate the subscription criteria (FHIR Search query), the PhenoMan uses the resource type and codes specified in the corresponding NSiP class as annotations (Fig. 2b, c). The ‘endpoint’ attribute is automatically filled with the URL of the PhenoMan service defined in the configuration file. The remaining parts of the subscription resource (‘status’, ‘type’ and ‘payload’) take default values (‘active’, ‘rest-hook’ and ‘application/json’) (Fig. 2c).
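The subscription described above can be sketched as a plain Python dictionary. This is a minimal sketch, not the PhenoMan implementation: the function name and the endpoint URL are hypothetical, the default values are those stated in the text, and any further elements that a particular FHIR version may require are left out.

```python
import json
from typing import Dict, Iterable


def build_subscription(resource_type: str, codes: Iterable[str], endpoint: str) -> Dict:
    """Build a rest-hook Subscription for one NSiP class.

    `codes` are 'system|code' tokens taken from the class annotations;
    they are joined with commas, which FHIR Search interprets as OR.
    """
    criteria = f"{resource_type}?code={','.join(codes)}"
    return {
        "resourceType": "Subscription",
        "status": "active",                 # default value named in the text
        "criteria": criteria,
        "channel": {
            "type": "rest-hook",            # default value named in the text
            "endpoint": endpoint,           # URL from the configuration file
            "payload": "application/json",  # default value named in the text
        },
    }


# Hypothetical endpoint; the LOINC code is the one used for Fasting_Glucose.
subscription = build_subscription(
    "Observation", ["http://loinc.org|1558-6"], "http://example.org/phenoman/t2dm"
)
subscription_json = json.dumps(subscription, indent=2)
```

The resulting JSON document would then be posted to the FHIR Server, which is outside the scope of this sketch.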
After receiving a notification (including the complete resource, Fig. 2d), the PhenoMan Service requests further resources (for all other NSiP classes of the corresponding PheSO) using FHIR Search. The generated FHIR Search queries are primarily based upon the codes specified for the NSiP classes (similarly to subscription criteria), contain a reference to the patient and can additionally support possible aggregate functions.
Classification and calculation of phenotypes
After receiving required resources, the PhenoMan starts inferring phenotypes.
First, the relevant information is extracted from received resources and inserted into the ontology. On the one hand, the individual properties (single phenotypes) are inserted as instances of the direct subclasses of cop:Single_Phenotype and the values are modelled as property assertions based on the has_value relation. On the other hand, a composite phenotype is defined as an instance of the class cop:Composite_Phenotype, which combines all the single phenotype instances using property assertions based on the has_part relation. Then, our multistage reasoning algorithm is executed. The algorithm consists of the following steps:
1. Classification step. A standard reasoner classifies the existing instances (assignment to classes).
   a. Single phenotype instances are classified in RSiP classes based on property restrictions.
   b. The composite phenotype instance is classified in RCoP classes based on the specified Boolean expression and inferred RSiP, RCoP and RDeP classes.
   c. The composite phenotype instance is classified in NDeP classes based on the specified Boolean expression and corresponding NSiP and NCoP classes. In this case, all variable values required for calculating formulas are present.
   d. Available instances of NDeP classes (representing calculated values) are classified in RDeP classes based on property restrictions.
2. If no new NDeP classes get an instance, the execution of the algorithm stops. All inferred phenotypes (inferred classes including metadata and calculated values) are returned.
3. Calculation step. The formula of each NDeP class having an instance is calculated based on variable values (values of the corresponding single phenotype instances or scores of the inferred RCoP classes). A new instance of the NDeP class representing the calculated value is created and associated with the composite phenotype instance (using has_part). The algorithm goes back to step 1.
In the case of complex phenotypes, the classification and calculation steps can be executed several times (in a loop). That is the case if a NDeP class has subclasses, i.e., RDeP classes, which are in turn used in combined phenotypes. Both steps are repeated until all formulas are calculated and all phenotypes are classified.
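The control flow of the multistage algorithm can be sketched in Python as follows. This is an illustration under clear assumptions: the OWL reasoner is replaced by ordinary Python predicates, phenotype instances are reduced to a dictionary of values, and the function name `run_multistage` is hypothetical. The class names and the example data follow the T2DM example of this article.

```python
from datetime import date
from typing import Any, Callable, Dict, Set, Tuple

Values = Dict[str, Any]
Condition = Callable[[Values, Set[str]], bool]
Derived = Dict[str, Tuple[Tuple[str, ...], Callable[..., Any]]]


def run_multistage(values: Values, restricted: Dict[str, Condition],
                   derived: Derived) -> Tuple[Set[str], Values]:
    """Alternate classification and calculation until nothing new is derived."""
    values = dict(values)
    inferred: Set[str] = set()
    while True:
        # Step 1: classification, repeated until no further class is inferred.
        changed = True
        while changed:
            changed = False
            for name, condition in restricted.items():
                if name not in inferred and condition(values, inferred):
                    inferred.add(name)
                    changed = True
        # Step 2: stop if no derived class has newly become computable.
        ready = [name for name, (variables, _) in derived.items()
                 if name not in values and all(v in values for v in variables)]
        if not ready:
            return inferred, values
        # Step 3: calculation, then back to step 1.
        for name in ready:
            variables, formula = derived[name]
            values[name] = formula(*(values[v] for v in variables))


restricted = {
    "T1DM_Diagnosis_NO": lambda v, i: v.get("T1DM_Diagnosis") == 0,
    "T2DM_Diagnosis_YES": lambda v, i: v.get("T2DM_Diagnosis", 0) > 0,
    "T1DM_Medication_YES": lambda v, i: "T1DM_Medication" in v,
    "T2DM_Medication_YES": lambda v, i: "T2DM_Medication" in v,
    "T2DM_precedes_T1DM_Medication_YES":
        lambda v, i: v.get("T2DM_precedes_T1DM_Medication") == 1,
    "T2DM_Case_1": lambda v, i: {
        "T1DM_Diagnosis_NO", "T2DM_Diagnosis_YES", "T1DM_Medication_YES",
        "T2DM_Medication_YES", "T2DM_precedes_T1DM_Medication_YES"} <= i,
}
derived = {
    # Three-way comparison of the medication dates (1 if the first is later).
    "T2DM_precedes_T1DM_Medication": (
        ("T1DM_Medication", "T2DM_Medication"),
        lambda t1, t2: (t1 > t2) - (t1 < t2),
    ),
}
# Illustrative input data; the dates are made up for this sketch.
inputs = {"T1DM_Diagnosis": 0, "T2DM_Diagnosis": 1,
          "T1DM_Medication": date(2015, 6, 1), "T2DM_Medication": date(2014, 1, 15)}
inferred_classes, final_values = run_multistage(inputs, restricted, derived)
```

In this run, the first classification pass infers the four single-phenotype classes, the calculation step evaluates the date comparison, and the second classification pass infers T2DM_precedes_T1DM_Medication_YES and T2DM_Case_1 before the loop terminates.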
The PhenoMan supports four primitive data types: xsd:decimal, xsd:string, xsd:boolean and xsd:date. All other complex data types (e.g., FHIR code or quantity) are mapped to the primitive data types (e.g., code to xsd:string and quantity to xsd:decimal with additional unit attribute, Fig. 2a, b). Furthermore, the PhenoMan provides, inter alia, aggregate functions, Boolean, date and measurement unit arithmetic, integration of external terminologies as well as reading and writing FHIR resources.
Transmission of inferred phenotype classes to the FHIR Server
PhenoMan returns all inferred combined and derived phenotype classes as Observation resources and transmits them to the FHIR Server. The generated resources can have a numeric or a code data type. Numeric observations are used for storing numeric values (i.e., calculated values of derived phenotypes or score values of combined phenotypes), whereas the code observations are intended for concepts of a terminology or a value set. The specified codes of the non-restricted phenotype classes are utilised in the resulting resources to complete the ‘code’ attribute. The calculated values, scores or codes of the inferred restricted phenotype classes are used in ‘valueQuantity’ or ‘valueCodeableConcept’ attributes (Fig. 7).
Export capabilities
The PhenoMan can export the complete PheSO or generate reasoner reports (structured descriptions of all inferred phenotypes) in Excel format. A tabular reasoner report contains three columns: ‘Type’, ‘Non-restricted’ and ‘Restricted’. In the first column the type (single, combined or derived) of the inferred phenotype is presented. In the next two columns, the specified or derived information about the resulting phenotypes (restricted and non-restricted) is displayed (Fig. 8).
Additionally, the PhenoMan can generate the graphical representation of combined phenotype classes in the form of decision tree (or flowchart) diagrams. The PhenoMan translates the Boolean expressions of every RCoP class into disjunctive normal form and considers each conjunction as a possible path of the tree. Then, the paths are grouped on shared nodes and form a tree structure (Fig. 9).
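The translation into disjunctive normal form and the grouping of paths can be sketched in a few lines of Python. This is a simplified illustration, not the PhenoMan code: expressions are nested tuples with only AND and OR (no negation), atoms are ordered alphabetically along each path, and the names `to_dnf`, `build_tree` and the leaf key `"=>"` are hypothetical.

```python
from typing import Dict, FrozenSet, List, Union

Expr = Union[str, tuple]  # an atom (class name) or ("and" | "or", operand, ...)


def to_dnf(expr: Expr) -> List[FrozenSet[str]]:
    """Return the DNF of expr as a list of conjunctions (sets of atoms)."""
    if isinstance(expr, str):
        return [frozenset([expr])]
    operator_name, *operands = expr
    if operator_name == "or":
        # A disjunction contributes the conjunctions of all its operands.
        return [conj for operand in operands for conj in to_dnf(operand)]
    if operator_name == "and":
        # A conjunction distributes over the alternatives of its operands.
        result: List[FrozenSet[str]] = [frozenset()]
        for operand in operands:
            result = [left | right for left in result for right in to_dnf(operand)]
        return result
    raise ValueError(f"unsupported operator: {operator_name}")


def build_tree(expressions: Dict[str, Expr]) -> Dict:
    """Group the DNF paths of all classes on shared leading nodes."""
    tree: Dict = {}
    for class_name, expr in expressions.items():
        for conjunction in to_dnf(expr):
            node = tree
            for atom in sorted(conjunction):
                node = node.setdefault(atom, {})
            node["=>"] = class_name  # leaf: the class reached via this path
    return tree


abnormal_lab = ("or", "Fasting_Glucose_ABNORMAL", "Random_Glucose_ABNORMAL",
                "HBA1c_ABNORMAL")
paths = to_dnf(abnormal_lab)
```

For the disjunction of the three abnormal laboratory classes, `to_dnf` yields three single-atom paths; nested expressions such as `("and", "A", ("or", "B", "C"))` yield the paths {A, B} and {A, C}, which `build_tree` merges on the shared node A.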
Example
We illustrate our approach by means of an example algorithm for determining Type 2 Diabetes Mellitus (T2DM) cases presented by PheKB.org [37]. The T2DM algorithm requires the following data elements to be extracted from the EHR:
- Counts of T1DM and T2DM diagnoses (identified by ICD-9 billing codes),
- The earliest dates of T1DM and T2DM medications (identified by RxNorm codes),
- Laboratory values (fasting blood glucose, random blood glucose and hemoglobin A1c, identified by LOINC codes) as well as
- Physician-entered diagnoses (derived from encounter or problem list sources only).
Modelling of single data elements using ART-DECOR
As a first step, required data elements (items, concepts, variables, single phenotype classes) representing single patient characteristics relevant for determining the T2DM (T1DM and T2DM diagnoses, T1DM and T2DM medications, fasting blood glucose, random blood glucose, hemoglobin A1c as well as physician-entered diagnoses) must be modelled using ART-DECOR. Labels, descriptions, codes from external terminologies, data types, units, etc. are specified. Additionally, every data element must be associated (as property) with a FHIR resource type in order to ensure the correct search and extraction of the instance values from the respective resource. Laboratory values are represented in FHIR as Observations, while diagnoses are usually specified using the Condition resource. Fig. 2a shows the filled form of fasting glucose in ART-DECOR.
The resulting specification is provided by ART-DECOR as an XML or JSON file.
Modelling of phenotypes using PhenoMan Editor
The user imports the ART-DECOR specification into the ontology (PheSO) utilizing the PhenoMan Editor. For each data element, a NSiP class (Fasting_Glucose, HBA1c, Random_Glucose, T1DM_Diagnosis, T2DM_Diagnosis, T1DM_Medication, T2DM_Medication and T2DM_Diagnosis_by_Physician) including relevant annotations is generated (Fig. 4a, a1, a2).
Furthermore, aggregate functions (e.g., COUNT, FIRST, LAST, MIN, MAX) can be defined for NSiP classes. For instance, the T2DM algorithm does not need to process the complete data of all diagnosis resources; it is sufficient to count the resources. Therefore, the function COUNT is defined for the class T2DM_Diagnosis (Fig. 4a2). Both medication classes (T1DM_Medication and T2DM_Medication) are associated with the function FIRST, because only the earliest date of medications is relevant for the algorithm.
Next, the RSiP classes (Fasting_Glucose_ABNORMAL, Random_Glucose_ABNORMAL and HBA1c_ABNORMAL) for value range restrictions are defined as subclasses of the NSiP classes (Fig. 4a, a3, Fig. 5). For every RSiP class, an anonymous equivalent class is created that represents the corresponding restriction. The restrictions of other NSiP classes (e.g., T2DM_Diagnosis or T2DM_Diagnosis_by_Physician) are based on the counts of the corresponding resources (e.g., T2DM_Diagnosis_NO: if the count of T2DM_Diagnosis = 0 or T2DM_Diagnosis_by_Physician_YES: if the count of T2DM_Diagnosis_by_Physician ≥ 2, Fig. 4a4).
Mathematical calculations are modelled using NDeP classes. The formula ‘GT($T1DM_Medication, $T2DM_Medication)’ (‘GT’ stands for Greater Than) defined for the class T2DM_precedes_T1DM_Medication expresses a comparison of the T1DM and T2DM medication dates (Fig. 4b, b1). The dollar sign in the variable name indicates that not the medication value itself but the entry date of the medication is used in the formula. The formula returns −1, 0 or 1 depending on whether the first operand is less than, equal to or greater than the second operand. The value −1 is also returned if one of the operands is missing. The RDeP class T2DM_precedes_T1DM_Medication_YES and the corresponding restriction are specified similarly to RSiP classes (Fig. 4b, b2).
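The behaviour of the GT formula described above can be expressed as a short Python function. This is a sketch of the stated semantics only; the function name `gt` and the use of `datetime.date` with `None` for a missing operand are assumptions made for illustration.

```python
from datetime import date
from typing import Optional


def gt(first: Optional[date], second: Optional[date]) -> int:
    """Three-way comparison as described for the GT formula.

    Returns -1, 0 or 1 if `first` is less than, equal to or greater than
    `second`; also returns -1 if one of the operands is missing.
    """
    if first is None or second is None:
        return -1
    return (first > second) - (first < second)
```

With the T1DM medication date as the first operand and the T2DM medication date as the second, a result of 1 means that the T2DM medication precedes the T1DM medication.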
The next step is to model the abnormal lab, i.e., if either random glucose or fasting glucose or HBA1c is abnormal. For this purpose, we define the non-restricted combined phenotype (NCoP) class Abnormal_Lab and the RCoP class Abnormal_Lab_YES with the corresponding Boolean restriction (disjunction) formalised as a general class axiom (Fig. 4c, c1).
Finally, the T2DM case selection rules are modelled using the NCoP class T2DM_Case as well as the five RCoP classes (one for each case type) including Boolean restrictions (Fig. 4c, c2, Fig. 6). For instance, case 1 occurs when the T1DM diagnosis is missing but a T2DM diagnosis as well as both medications (T1DM and T2DM) are present and the first T2DM medication precedes the first T1DM medication. The abnormal lab, the presence of a T2DM diagnosis and the absence of a T1DM diagnosis and of both medications indicate case 3.
The resulting ontology and additional material (graphical and tabular representation, reasoner reports) are publicly available [38].
Execution of the PhenoMan Service
The PheSO (OWL file) created by the PhenoMan Editor is saved in the directory specified in the configuration file. After starting the PhenoMan Service, subscriptions for each NSiP class of the T2DM PheSO are created. The subscription for the NSiP class Fasting_Glucose, for example, is intended to identify Observation resources with the LOINC code ‘1558-6’ (criteria) (Fig. 2c). After receiving a fasting glucose resource, other required resources are queried. The query for T2DM diagnosis, for instance, includes the additional FHIR Search parameter _summary=count to express the aggregate function COUNT:
Condition?code=http://hl7.org/fhir/sid/icd-9-cm|250.00,http://hl7.org/fhir/sid/icd-9-cm|250.02&subject=Patient/103&_summary=count
In this case, the server returns a bundle with only the number of resources matching the query. To realise the function FIRST, the combination of _sort (sorting by date) and _count (_count=1) is used. The following T2DM medication query returns only the first resource matching the criteria:
MedicationStatement?code=http://www.nlm.nih.gov/research/umls/rxnorm|25789,http://www.nlm.nih.gov/research/umls/rxnorm|10633&subject=Patient/103&_sort=effective&_count=1
(Some codes were omitted in both queries).
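The construction of such queries from the annotations of an NSiP class can be sketched as follows. The function name and parameters are hypothetical; only the aggregate functions COUNT and FIRST shown above are covered, and the query string is assembled without URL encoding so that it stays comparable with the examples in the text.

```python
from typing import Iterable, Optional


def build_search_query(resource_type: str, codes: Iterable[str], patient_id: str,
                       aggregate: Optional[str] = None,
                       sort_param: str = "date") -> str:
    """Build a FHIR Search query for one NSiP class and one patient.

    `codes` are 'system|code' tokens; `sort_param` is the search parameter
    used for chronological sorting when FIRST is requested.
    """
    query = f"{resource_type}?code={','.join(codes)}&subject=Patient/{patient_id}"
    if aggregate is None:
        return query
    if aggregate == "COUNT":
        return query + "&_summary=count"            # server returns only the total
    if aggregate == "FIRST":
        return query + f"&_sort={sort_param}&_count=1"  # earliest resource only
    raise NotImplementedError(f"aggregate function not covered: {aggregate}")


icd9 = "http://hl7.org/fhir/sid/icd-9-cm"
rxnorm = "http://www.nlm.nih.gov/research/umls/rxnorm"
diagnosis_query = build_search_query(
    "Condition", [f"{icd9}|250.00", f"{icd9}|250.02"], "103", aggregate="COUNT")
medication_query = build_search_query(
    "MedicationStatement", [f"{rxnorm}|25789", f"{rxnorm}|10633"], "103",
    aggregate="FIRST", sort_param="effective")
```

Both calls reproduce the two example queries given above; a production client would normally percent-encode the parameter values before sending the request.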
As soon as all required FHIR resources are present, the PhenoMan starts the phenotype computation. Suppose the input resource set consists of only a fasting glucose (Fig. 2d) and a T2DM diagnosis resource. In this case, the single phenotype instances of the classes Fasting_Glucose and T2DM_Diagnosis are created and the values are modelled as property assertions based on the has_value relation (e.g., ‘has_value 130’ for Fasting_Glucose and ‘has_value 1’ for T2DM_Diagnosis (due to the COUNT function)). Then, a composite phenotype instance is defined, which combines both single phenotype instances using property assertions based on the has_part relation. In the first step (classification step), a standard reasoner classifies the single phenotype instances in restricted classes. In our example, the instance of Fasting_Glucose is classified in the class Fasting_Glucose_ABNORMAL (i.e., the fasting glucose value is ≥ 125 mg/dL, Fig. 4a, a3) and the instance of T2DM_Diagnosis in the class T2DM_Diagnosis_YES (because the count of the T2DM diagnoses is > 0, Fig. 4a, a4). Next, the composite phenotype instance is classified in the RCoP classes Abnormal_Lab_YES (because the fasting glucose is abnormal, Fig. 4c, c1) and T2DM_Case_3 (because all conditions of the corresponding Boolean expression are fulfilled, Fig. 4c, c2). In this case, further phenotypes cannot be derived or calculated.
Let us consider another example. Suppose the input data set contains a T2DM diagnosis, a T1DM and a T2DM medication (T2DM precedes T1DM medication). The classification step is similar to the first example. The corresponding single phenotype instances are classified in the classes T1DM_Diagnosis_NO, T2DM_Diagnosis_YES, T1DM_Medication_YES and T2DM_Medication_YES. In the next step (calculation step), the formula of the NDeP class T2DM_precedes_T1DM_Medication (Fig. 4b, b1) can be calculated by PhenoMan. It inserts the variable values (the dates of both medications) into the formula and starts the calculation. After the calculation step, the classification step must be performed again. The calculated instance of T2DM_precedes_T1DM_Medication is classified in the class T2DM_precedes_T1DM_Medication_YES (because the formula returns 1, Fig. 4b, b2). Then, the composite phenotype instance is classified in the RCoP class T2DM_Case_1 (Fig. 4c, c2) and the PhenoMan finishes the calculation.
Finally, the PhenoMan generates the Observation resource for the resulting T2DM case and transmits it to the FHIR Server. In our example, we encode the possible T2DM cases using a code system (https://www.smith.care/phenoman/t2dm_case_selection_algorithm), one code identifying the observation (t2dm_case_calculated) and five codes for possible values (t2dm_case_1, t2dm_case_2, t2dm_case_3, t2dm_case_4 and t2dm_case_5). The resulting Observation resource for case 3 is illustrated in Fig. 7.
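A code-valued Observation of this kind can be sketched as a Python dictionary. This is not a reproduction of Fig. 7: the function name is hypothetical, the ‘status’ value and the patient reference are assumptions, and only the code system and codes named above are taken from the text.

```python
from typing import Dict

CODE_SYSTEM = "https://www.smith.care/phenoman/t2dm_case_selection_algorithm"
CASE_CODES = {f"t2dm_case_{n}" for n in range(1, 6)}


def build_case_observation(patient_id: str, case_code: str) -> Dict:
    """Build a code-valued Observation for an inferred T2DM case."""
    if case_code not in CASE_CODES:
        raise ValueError(f"unknown case code: {case_code}")
    return {
        "resourceType": "Observation",
        "status": "final",  # assumption; not stated in the text
        # Code of the non-restricted class identifies the observation.
        "code": {"coding": [{"system": CODE_SYSTEM, "code": "t2dm_case_calculated"}]},
        "subject": {"reference": f"Patient/{patient_id}"},
        # Code of the inferred restricted class is the observation value.
        "valueCodeableConcept": {"coding": [{"system": CODE_SYSTEM, "code": case_code}]},
    }


observation = build_case_observation("103", "t2dm_case_3")
```

A numeric result (a calculated value or a score) would use ‘valueQuantity’ instead of ‘valueCodeableConcept’.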
Additionally, we can generate a tabular reasoner report or a decision tree diagram.
An example reasoner report for case 1 is shown in Fig. 8.
The class T2DM_Case and its subclasses (T2DM_Case_1, T2DM_Case_2, etc.) are well suited to representation as a decision tree. The decision tree generated by PhenoMan (Fig. 9) looks similar to the flowchart specified by PheKB.org [37].
Evaluation results
The evaluation demonstrated that all components of our solution function correctly. All JUnit tests were successful and showed no difference between specified and calculated results. The comparison between the PhenoMan and the SPSS calculations was also successful. Although the performance of our approach is not a critical issue in our use case, we measured the execution time of the PhenoMan. The calculation of the socio-economic status (a complex algorithm requiring multiple reasoner runs), for example, takes approximately 0.5 s per dataset. This performance is sufficient for our use case.
In summary, PhenoMan correctly computes phenotypes based on valid phenotype specifications (PheSO) and input data. Additionally, validating phenotype specifications (PheSOs) before deploying them in a productive environment is an extremely useful feature of the PhenoMan.