Drug target ontology to classify and integrate drug discovery data

Background One of the most successful approaches to develop new small molecule therapeutics has been to start from a validated druggable protein target. However, only a small subset of potentially druggable targets has attracted significant research and development resources. The Illuminating the Druggable Genome (IDG) project develops resources to catalyze the development of likely targetable, yet currently understudied prospective drug targets. A central component of the IDG program is a comprehensive knowledge resource of the druggable genome. Results As part of that effort, we have developed a framework to integrate, navigate, and analyze drug discovery data based on formalized and standardized classifications and annotations of druggable protein targets, the Drug Target Ontology (DTO). DTO was constructed by extensive curation and consolidation of various resources. DTO classifies the four major drug target protein families, GPCRs, kinases, ion channels and nuclear receptors, based on phylogeny, function, target development level, disease association, tissue expression, chemical ligand and substrate characteristics, and target-family specific characteristics. The formal ontology was built using a new software tool to auto-generate most axioms from a database while supporting manual knowledge acquisition. A modular, hierarchical implementation facilitates ontology development and maintenance and makes use of various external ontologies, thus integrating the DTO into the ecosystem of biomedical ontologies. As a formal OWL-DL ontology, DTO contains asserted and inferred axioms. Modeling data from the Library of Integrated Network-based Cellular Signatures (LINCS) program illustrates the potential of DTO for contextual data integration and nuanced definition of important drug target characteristics. DTO has been implemented in the IDG user interface Portal, Pharos and the TIN-X explorer of protein target disease relationships.
Conclusions DTO was built based on the need for a formal semantic model for druggable targets including various related information such as protein, gene, protein domain, protein structure, binding site, small molecule drug, mechanism of action, protein tissue localization, disease association, and many other types of information. DTO will further facilitate the otherwise challenging integration and formal linking to biological assays, phenotypes, disease models, drug poly-pharmacology, binding kinetics and many other processes, functions and qualities that are at the core of drug discovery. The first version of DTO is publicly available via the website http://drugtargetontology.org/, Github (http://github.com/DrugTargetOntology/DTO), and the NCBO Bioportal (http://bioportal.bioontology.org/ontologies/DTO). The long-term goal of DTO is to provide such an integrative framework and to populate the ontology with this information as a community resource. Electronic supplementary material The online version of this article (10.1186/s13326-017-0161-x) contains supplementary material, which is available to authorized users.


Background
The development and approval of novel small molecule therapeutics (drugs) is highly complex and exceedingly resource intensive, with the cost estimated at over one billion dollars for a new FDA-approved drug. The primary reason for attrition in clinical trials is the lack of efficacy, which has been associated with poor or biased target selection [1]. Although the drug target mechanism of action is not required for FDA approval, a target-based mechanistic understanding of diseases and drug action is highly desirable and a preferred approach of drug development in the pharmaceutical industry. Following the sequencing of the human genome, several research groups in academia as well as industry have focused on "the druggable genome", i.e. the subset of genes in the human genome that express proteins able to bind drug-like small molecules [2]. Researchers have estimated the number of druggable targets to range from a few hundred to several thousand [3]. Furthermore, several analyses have suggested that only a small fraction of likely relevant druggable targets are extensively studied, leaving a potentially huge treasure trove of promising, yet understudied ("dark") drug targets to be explored by pharmaceutical companies and academic drug discovery researchers. Not only is there ambiguity about the number of druggable targets, but there is also a need for systematic characterization and annotation of the druggable genome. A few research groups have made efforts to address these issues and have developed several useful resources, e.g. IUPHAR/BPS Guide to PHARMACOLOGY (GtoPdb/IUPHAR) [4], PANTHER [5], Therapeutic Target Database (TTD) [6], and Potential Drug Target Database (PDTD) [7], covering important aspects of the drug targets.
However, to the best of our knowledge, a publicly available structured knowledge resource of drug target classifications and relevant annotations for the most important protein families, one that facilitates querying, data integration, re-use, and analysis, does not currently exist. Content in the above-mentioned databases is scattered and in some cases inconsistent and duplicated, complicating data integration and analysis.
The Illuminating the Druggable Genome (IDG) project (http://targetcentral.ws/) has the goal to identify and prioritize new prospective drug targets among likely targetable, yet currently poorly or not at all annotated proteins; and by doing so to catalyze the development of novel drugs with new mechanisms of action. Data compiled and analyzed by the IDG Knowledge Management Center (IDG-KMC) shows that the globally marketed drugs stem from only 3% of the human proteome. These results also suggest that the substantial knowledge deficit for understudied drug targets may be due to an uneven distribution of information and resources [8].
In the context of the IDG program we have been developing the Drug Target Ontology (DTO). Formal ontologies have been quite useful to facilitate harmonization, integration, and analysis of diverse data in the biomedical and other domains. DTO integrates and harmonizes knowledge of the most important druggable protein families: kinases, GPCRs, ion channels and nuclear hormone receptors. DTO content was curated from several resources and the literature, and includes detailed hierarchical classifications of proteins and genes, tissue localization, disease association, drug target development level, protein domain information, ligands, substrates, and other types of relevant information. DTO content sources were chosen by domain experts based on relevance, coverage and completeness of the information available through them. Most resources had been peer reviewed (references are included in the respective sections), published and were therefore considered reliable. DTO is aimed towards the drug discovery and clinical communities and was built to align with other ontologies including BioAssay Ontology (BAO) [9][10][11] and GPCR Ontology [12]. By providing a semantic framework of diverse information related to druggable proteins, DTO facilitates the otherwise challenging integration and formal linking of heterogeneous and diverse data important for drug discovery. DTO is particularly relevant for big data, systems-level models of diseases and drug action as well as precision medicine. The long-term goal of DTO is to provide such an integrative framework and to populate the ontology with this information as a community resource. Here we describe the development, content, architecture, modeling and use of the DTO. DTO has already been implemented in end-user software tools to facilitate the browsing [11] and navigation of drug target data [13].

Drug target data curation and classification
DTO places special emphasis on the four protein families that are central to the NIH IDG initiative: non-olfactory GPCRs (oGPCRs), Kinases, Ion Channels and Nuclear Receptors. The classifications and annotations of these four protein families were extracted, aggregated, harmonized, and manually curated from various resources as described below, and further enriched using the recent research literature. Proteins and their classification and annotations were aligned with the Target Central Resource Database (TCRD) [11] developed by the IDG project (http://targetcentral.ws/ProteinFam). In particular, the Target Development Level (TDL) classification was obtained from TCRD.

Kinase classification
Kinases have been classified primarily into protein and non-protein kinases. Protein kinases have been further classified into several groups, families, and subfamilies. Non-protein kinases have been classified into several groups based on the type of substrate (lipid, carbohydrate, nucleoside, other small molecule, etc.). Classification information has been extracted and curated from various resources, e.g. UniProt, ChEMBL, PhosphoSitePlus® (PSP) [14], the Sugen Kinase website (http://www.kinase.com/web/current/), and the literature, and was organized manually, consolidated and checked for consistency. Kinase substrates were manually curated from UniProt and the literature. Pseudokinases, which lack key functional residues and are (to current knowledge) not catalytically active, were annotated based on the Sugen kinase domain sequences and the literature.

Ion-channel classification
Ion channels have been classified primarily into family, subfamily, sub-subfamily. Most of the information has been taken from the Transporter Classification Database (http://www.tcdb.org/) [15], UniProt and several linked databases therein. The classification is based on both the phylogenetic and functional information. Additional information regarding the gating mechanism (voltage gated, ligand gated, etc.), transported ions, protein structural and topological information has also been captured and included as separate annotations. Moreover, the transported ions, such as chloride, sodium, etc. have been mapped to the "Chemical entity" of the ChEBI reference database [16].

GPCR classification
GPCRs have been classified based on phylogenetic, functional and endogenous ligand information. The primary classification includes class, group, family, and subfamily. Most of the information has been taken from the GPCR.org classification and was updated using various sources, e.g. IUPHAR [4], ChEMBL, UniProt, and our earlier GPCR ontology [12]. Furthermore, the information on the specific endogenous ligands for each protein has been extracted from IUPHAR and integrated with the classification. The information about the GPCR ligand and ligand type (lipid, peptide, etc.) has also been included and has been mapped manually to the "Chemical entity" of the ChEBI reference database.

Nuclear receptor classification
This information has been adopted directly from IUPHAR.

External DTO modules and mapping
Proteins were mapped to UniProt. Genes were classified identically to proteins (above) and mapped to Entrez Gene. The external modules incorporated into DTO were extracted from the Disease Ontology (DOID) [17], BRENDA Tissue Ontology (BTO) [18], UBERON [19], the ontology of Chemical Entities of Biological Interest (ChEBI) [20], and Protein Ontology (PRO) [21]. Data about over 1000 cell lines from the LINCS project [22] were integrated and mapped to diseases and tissues. Gene/protein-disease [23] and protein-tissue associations [24] were obtained from the JensenLab at Novo Nordisk Foundation Center for Protein Research. The mapping between UBERON and BRENDA, used to integrate the tissue associations of cell lines and proteins, was retrieved from the NCBO BioPortal [25,26] and manually cross-checked. Target Development Levels (TDL) were obtained from TCRD and included as a separate annotation for all protein families.

Drug target ontology (DTO) development

Ontology modeling
While curators stored all classification and annotation data in various spreadsheets, ontologists created the ontological model to link the metadata obtained from those spreadsheets, and to create the description logic axioms that define ontology classes using a semi-automated workflow. Finalizing and optimizing the ontology model or design pattern required iterative processes of intensive discussions, modeling refinement, voting, and approval among domain experts, data curators, IT developers, and ontologists. Once ontologists proposed a conceptual ontology model, the selection of the most robust ontology model was guided by simple criteria: correct representation of domain content, a minimal number of relations to link all metadata, and no contradiction with existing domain knowledge representation ontologies, such as the OBO ontologies. For example, in our conceptual model, the relations among organ, tissue, cell lines and anatomical entity were adopted and refined from the UBERON and CLO ontologies. Some relations, such as the shortcut relations between protein and associated disease or tissue, were created specifically for DTO as a compromise to accommodate the large amount of data in DTO. The approval process for accepting a model proposal was driven by our domain experts with contributing data curators, IT developers, and ontologists. The voting process was rather informal; however, the model had to be agreed upon by all the parties involved in the ontology development: domain experts, data curators, IT developers, and ontologists. Once the best-fitting ontology model was chosen, it was used as a template for a Java tool (described below) to generate all the OWL files using the above-mentioned data annotation spreadsheets as input.

Modularization approach
DTO was built with an extended modular architecture based on the modular architecture designed and implemented for BAO [9]. The modularization strategy developed previously was a layered architecture and used the modeling primitives, vocabularies, modules and axioms. Most significantly, DTO's modular architecture includes an additional layer to the modularization process by automating the creation of basic subsumption hierarchies and select axioms such as the axioms for disease and tissue associations. Three types of files are used in the modular architecture: vocabulary files, module files, and combined files, such as DTO_core and DTO_complete. Vocabularies only contain concepts (classes with subsumption only). Module layers enable combining vocabularies in flexible ways to create desired ontology structures or subsets. Finally, in the combined files axioms are added to the vocabularies to formally define the various concepts to allow logical inferences. Classes and relationships are imported (directly or indirectly) from module and/or vocabulary files [9]. The external third-party ontologies were extracted using the OWL API or OntoFox [27].

OntoJOG tool
To streamline the building process, a Java tool (OntoJOG) was developed to automatically create the OWL module files and vocabulary files as components of the whole ontology. OntoJOG takes a flat CSV or TSV data file and loads it as a table either into a temporary SQLite database or a permanent MySQL database. This table is then used as a reference for creating and generating the OWL files as well as several relationship tables. The relationship tables and the final OWL files are generated based on a CSV mapping file that specifies the commands for OntoJOG to perform and the various options for those commands. The commands from the mapping file are read in two passes to ensure everything is added correctly. In the first pass, all classes and their annotations are inserted into the relationship tables and are assigned IDs as necessary, and in the second pass all axioms and relationships between classes are created. After this process is completed, an optional reparenting phase is executed before each module of the ontology is generated into its own OWL vocabulary files with an accompanying module file containing the relationships for the given vocabulary files.
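The two-pass idea can be illustrated with a minimal Python sketch. This is not OntoJOG itself (which is a Java tool with its own mapping-file format that is not reproduced here); the column names, the `EX_` identifier prefix, and the `build` function are hypothetical and chosen only for illustration. The first pass registers every class and assigns it an ID; the second pass creates relationships, which is safe because every referenced class already exists.

```python
import csv
import io

# Hypothetical flat input table; the column names are illustrative only.
DATA_CSV = """protein,family,gene
MAPK1,MAPK family,MAPK1 gene
MAPK3,MAPK family,MAPK3 gene
"""


def build(data_csv, id_prefix="EX_"):
    """Return (ids, axioms) built from a flat table in two passes."""
    rows = list(csv.DictReader(io.StringIO(data_csv)))
    ids = {}      # class label -> assigned identifier
    axioms = []   # (subject_id, relation, object_id) triples

    # Pass 1: register every class label and assign an ID when first seen.
    for row in rows:
        for label in row.values():
            if label not in ids:
                ids[label] = f"{id_prefix}{len(ids) + 1:07d}"

    # Pass 2: all classes now exist, so relationships can reference them.
    for row in rows:
        protein = ids[row["protein"]]
        axioms.append((protein, "subClassOf", ids[row["family"]]))
        axioms.append((protein, "has gene template", ids[row["gene"]]))

    return ids, axioms


ids, axioms = build(DATA_CSV)
for subject, relation, obj in axioms:
    print(subject, relation, obj)
```

Separating the passes means the order of rows in the input does not matter: a relationship never points at a class that has not yet been assigned an identifier.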
Finally, the ontology was thoroughly reviewed, tested and validated by developers, domain experts, and users in the IDG-KMC.

Data quality control
Several Quality Control (QC) steps were implemented at different stages of the ontology development process. First, data extracted from external resources is checked for consistency against the original source by the lead data curator. Depending on how the data was extracted (APIs, download of files), this involves different scripts, but in all cases thorough manual expert review. Second, while developers load curated data into a local staging database, another QC step takes place to ensure data integrity during the loading process. Third, as soon as the automated ontology build using OntoJOG finishes, reasoning over the whole ontology checks for consistency of the logical definitions and the ontology itself. In a fourth QC step, the ontologist runs several SPARQL queries against the ontology to retrieve the data and arrange them in a format that can directly be compared to the original datasets; any discrepancies are flagged and resolved between the lead curator, developer and ontologist. Fifth, for each new ontology build, an automated script reads all DTO vocabulary and module files and compares them to the previous version. This script generates reports with all new (not present in the previous version), deleted (not present in the current version) and changed classes and properties based on their URIs and labels. These reports are reviewed by curators and ontologists, and any unexpected differences among versions are resolved. Sixth and finally, the ontology is loaded into Protégé and carefully reviewed manually by curators and ontologists. In order to audit the QC process, all the development versions are stored in a private GitHub repository owned by our lab. Only when the data is 100% consistent with the original datasets and all QC steps are completed and passed is the ontology released to the designated public GitHub repository.
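The fifth QC step (version comparison by URI and label) can be sketched as follows. This is a minimal illustration, not the project's actual script: it assumes each version has already been reduced to a mapping from URI to label, and the function name `diff_versions` and the example URIs are hypothetical.

```python
def diff_versions(previous, current):
    """Compare two {uri: label} mappings from consecutive ontology builds.

    Returns the URIs that are new, deleted, or whose label changed.
    """
    new = sorted(uri for uri in current if uri not in previous)
    deleted = sorted(uri for uri in previous if uri not in current)
    changed = sorted(
        uri for uri in current
        if uri in previous and current[uri] != previous[uri]
    )
    return {"new": new, "deleted": deleted, "changed": changed}


# Hypothetical URIs and labels, for illustration only.
previous = {
    "http://example.org/EX_1": "protein kinase",
    "http://example.org/EX_2": "ion channel",
    "http://example.org/EX_3": "nuclear receptor",
}
current = {
    "http://example.org/EX_1": "protein kinase",
    "http://example.org/EX_2": "ion channel protein",
    "http://example.org/EX_4": "GPCR",
}

report = diff_versions(previous, current)
for kind, uris in report.items():
    print(kind, uris)
```

A report of this shape gives reviewers three short lists to inspect instead of two full ontology files.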

DTO visualization
Data visualization is important, especially with the increasing complexity of the data. Ontology visualization, correspondingly, has appealing potential to help users browse and comprehend the structures of ontologies. A number of ontology visualization tools have been developed and applied as information retrieval aids, such as OntoGraf and OWLViz (both part of the ontology development tool Protégé) and OntoSphere3D [28], among others. Further, studies and reviews comparing the performance of different visualization tools have been published, e.g. [29,30] and [31]. The preferred visualization model depends on the type and query context of the visualized network and also on users' needs.
Data-Driven Documents (D3) is a relatively novel representation-transparent and dynamic approach to visualize data on the web. It is a modern interactive visualization tool available as a JavaScript library [29]. By selectively binding input data to arbitrary document elements, D3.js enables direct inspection and manipulation of a native representation. The D3.js JavaScript library gained popularity as a generic framework based on widely accepted web standards such as SVG, JavaScript, HTML5 and CSS.
Consequently, we use the D3.js library for the interactive visualization of our DTO as part of the Neo4j graph database solution.

DTO and BAO integration to model LINCS data
The Library of Integrated Network-Based Cellular Signatures (LINCS) program has been generating a reference "library" of molecular signatures, such as changes in gene expression and other cellular phenotypes that occur when cells are exposed to a variety of perturbing agents. One of the LINCS screening assays, the KINOMEscan assay, is a biochemical kinase profiling assay that measures drug binding using a panel of ~440 recombinant purified kinases. The HMS LINCS Center has collected 165 KINOMEscan datasets in order to analyze drug-target interactions. All these LINCS KINOMEscan data were originally retrieved from the Harvard Medical School (HMS) LINCS DB (http://lincs.hms.harvard.edu/db/). KINOMEscan data was curated by domain experts to map to both Pfam domains and the corresponding kinases. Unique KINOMEscan domains and annotations, including domain descriptions, IDs, names, gene symbols, phosphorylation status, and mutations, were curated from different sources, including the HMS LINCS DB, the DiscoverX KINOMEscan® assay list [32], Pfam (http://pfam.xfam.org/), and our previous modeling efforts of the entire human Kinome (publication in preparation). The kinase domain classification into group, family, etc. was the same as described above (kinase classification). Gatekeeper and hinge residues were assigned based on structural alignment of existing kinase domain crystal structures and structural models of the human kinome and sequence alignment with the full kinase protein referenced by UniProt accession in the DTO. Pfam accession numbers and names were obtained from Pfam [33]. The protocol and the KINOMEscan curated target metadata table were analyzed by ontologists to create the kinase domain drug target ontology model.

Ontology source access and license
The official DTO website is publicly available at http://drugtargetontology.org/, where it can be visualized and

Results
In what follows, the italic font represents terms, classes, relations, or axioms used in the ontology.

Drug targets definition and classification
Different communities have been using the term "drug target" ambiguously with no formal generally accepted definition. The DTO project develops a formal semantic model for drug targets including various related information such as protein, gene, protein domain, protein structure, binding site, small molecule drug, mechanism of action, protein tissue localization, disease associations, and many other types of information.
The IDG project defined 'drug target' as "a native (gene product) protein or protein complex that physically interacts with a therapeutic drug (with some binding affinity) and where this physical interaction is (at least partially) the cause of a (detectable) clinical effect". DTO defined a DTO specific term "drug target role". The text definition of "drug target role" is "a role played by a material entity, such as native (gene product) protein, protein complex, microorganism, DNA, etc., that physically interacts with a therapeutic or prophylactic drug (with some binding affinity) and where this physical interaction is (at least partially) the cause of a (detectable) clinical effect." At the current phase, DTO focuses on protein targets. DTO provides various asserted and inferred hierarchies to classify drug targets. Below we describe the most relevant ones.

Target development level (TDL)
The IDG classified proteins into four levels with respect to the depth of investigation from a clinical, biological and chemical standpoint (http://targetcentral.ws/) [8]:
1) Tclin are proteins targeted by approved drugs as they exert their mode of action [3]. The Tclin proteins are designated drug targets under the context of IDG.
2) Tchem are proteins that can specifically be manipulated with small molecules better than bioactivity cutoff values (30 nM for kinases, 100 nM for GPCRs and NRs, 10 μM for ICs, and 1 μM for other target classes), and which lack approved small molecule or biologic drugs. In some cases, targets have been manually migrated to Tchem through human curation, based on small molecule activities from sources other than ChEMBL or DrugCentral [34].
3) Tbio are proteins that do not satisfy the Tclin or Tchem criteria, and which are annotated with a Gene Ontology Molecular Function or Biological Process with an Experimental Evidence code, or are targets with confirmed OMIM phenotype(s), or do not satisfy the Tdark criteria detailed in 4).
4) Tdark refers to proteins that have been described at the sequence level and have very few associated studies. They do not have any known drug or small molecule activities that satisfy the activity thresholds detailed in 2), lack OMIM and GO terms that would match the Tbio criteria, and meet at least two of the following conditions: a PubMed text-mining score < 5 [...] Fig. 1.
It should be noted that, as indicated above, the classification information has been extracted from various database and literature resources. The classification is subject to continuous updating for greater accuracy, and DTO is enriched using the most recent information as it becomes available. The present classification of the four protein families is briefly discussed below: Most of the 578 kinases covered in the current version of DTO are protein kinases.
These 514 PKs are categorized into 10 groups that are further sub-categorized in 131 families and 82 subfamilies. A representative classification hierarchy for MAPK1 is: Kinase > Protein Kinase > CMGC group > MAPK family > ERK subfamily > Mitogen-activated Protein Kinase 1.
The 62 non-protein kinases are categorized into 5 groups depending on the substrate that is phosphorylated by these proteins. These 5 groups are further subcategorized into 25 families and 7 subfamilies. Two kinases have not yet been categorized into any of the above types or groups.
The 334 Ion channel proteins (out of 342 covered in the current version of DTO) are categorized into 46 families, 111 subfamilies, and 107 sub-subfamilies.
Similarly, the 827 GPCRs covered in the current version of DTO are categorized into 6 classes, 61 families and 14 subfamilies. The additional information whether any receptor has a known endogenous ligand or is currently "orphan" is mapped with the individual proteins. Finally, the 48 nuclear hormone receptors are categorized into 19 NR families.
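The family-specific bioactivity cutoffs in the Tchem definition above can be expressed as a small lookup. The Python sketch below covers only the activity-threshold part of the Tchem criteria (it does not check for approved drugs or manual curation), and it assumes that "better than" means a potency value at or below the cutoff; treating the boundary as inclusive is an assumption. The family keys and the function name are hypothetical.

```python
# Cutoffs in nM, taken from the Tchem definition in the text.
TCHEM_CUTOFF_NM = {
    "kinase": 30.0,
    "GPCR": 100.0,
    "NR": 100.0,
    "IC": 10_000.0,  # 10 uM
}
DEFAULT_CUTOFF_NM = 1_000.0  # 1 uM for all other target classes


def meets_tchem_cutoff(family, best_activity_nm):
    """Return True if the best small molecule activity passes the cutoff.

    Assumption: lower values are more potent, and the boundary is inclusive.
    """
    cutoff = TCHEM_CUTOFF_NM.get(family, DEFAULT_CUTOFF_NM)
    return best_activity_nm <= cutoff


print(meets_tchem_cutoff("kinase", 25.0))
print(meets_tchem_cutoff("kinase", 50.0))
print(meets_tchem_cutoff("IC", 5_000.0))
```

Because the cutoff differs by two to three orders of magnitude between families, the same measured activity can qualify an ion channel but not a kinase.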

Disease- and tissue-based classification
Target-disease associations and tissue expressions were obtained from the DISEASES [23] and TISSUES [24] databases (see Methods). Examples of such classifications are available as inferences in DTO (see below section 3.3.2).

Additional annotations and classifications
In addition to the phylogenetic classification of the proteins, there are several relevant properties associated with them as additional annotations. For example, 46 PKs have been annotated as pseudokinases [36]. For ion channels, important properties such as transporter protein type, transported ion(s), and gating mechanism have been associated with the individual proteins. The gating mechanism refers to the factors that control the opening and closing of the ion channels. The important mechanisms include voltage-gated, ligand-gated, temperature-gated, mechanically-gated, etc. Similarly, for the GPCRs, the additional information whether a receptor has a known endogenous ligand or is currently "orphan" is mapped to the individual proteins. The current version of DTO has approximately 255 receptors with information available regarding their endogenous ligands.
The analysis of drug target protein classification along with such relevant information associated through separate annotations may lead to interesting inferences.

Chemical classifications
Known GPCR ligands and IC transported ions were categorized by chemical properties and mapped to ChEBI (see Methods). For example, depending upon their chemical structure and properties, these known endogenous ligands for GPCRs have been categorized in seven types, namely, amine, amino acid, carboxylic acid, lipid, peptide, nucleoside and nucleotide. Similarly, the ions transported by the ion channel proteins and ion types (anion/cation) have been mapped to ChEBI. These annotations together with mappings of substrates and ligands to the proteins enable inferred classification of the proteins based on their chemical properties (see below).

DTO ontology implementation and modeling
Drug discovery target knowledge model of the DTO
The first version of the DTO includes detailed target classification and annotations for the four IDG protein families. Each protein is related to four types of entities: gene, related disease, related tissue or organ, and target development level. The conceptual model of DTO is illustrated as a linked diagram with nodes and edges. Nodes represent the classes in the DTO, and edges represent the ontological relations between classes. As shown in Fig. 2, GPCRs, kinases, ICs and NRs are types of proteins. GPCRs bind GPCR ligands, and ICs transport ions. Most GPCR ligands and ions are types of chemical entity from ChEBI. Each protein has a target development level (TDL), i.e., Tclin, Tchem, Tbio or Tdark. The protein is linked to its gene by the 'has gene template' relation. The gene is associated with disease based on evidence from the DISEASES database. The protein is also associated with some organ, tissue, or cell line using evidence from the TISSUES database. The full DTO contains many more annotations and classifications available at http://drugtargetontology.org/.
DTO is implemented in OWL2-DL to enable further classification by inference reasoning and SPARQL queries. The current version of DTO contains >13,000 classes and >220,000 axioms. The DTO contains 827 GPCRs, 572 kinases, 342 ion channels (ICs), and 48 NRs.

Modular implementation of the DTO combining autogenerated and expert axioms
In DTO, each of the four drug target families has two vocabulary files, one for genes and one for proteins; other DTO-native categories were created as separate vocabulary files. Additional vocabulary files include quality, role, properties, and cell line classes and subclasses. A vocabulary file contains the entities of a class and only "is-a" hierarchies. For example, the GPCR gene vocabulary contains only the GPCR gene list and its curated classification. DTO core imports all the DTO vocabulary files of the four families, including genes and proteins, and necessary axioms were added. Finally, DTO core was imported into the DTO complete file, which includes other vocabulary files and external files. External ontologies used in DTO include: BTO, CHEBI, DOID, UBERON, Cell Line Ontology (CLO), Protein Ontology (PRO), Relations Ontology (RO) and Basic Formal Ontology (BFO). The DTO core and DTO external are imported into the DTO module with auto-generated axioms, which link entities from different vocabulary files. Besides the programmatically generated vocabularies and modules, DTO also contains manually generated vocabularies and modules, as shown in Fig. 3. This modularization approach significantly simplifies the maintenance of the ontology contents, especially when the ontology is large. If the gene or protein list changes, only the vocabulary file and the specific module file need to be updated instead of the whole ontology. In addition, external and internal resources are maintained separately. By separating them into two layers, this design facilitates automated content updates from external resources, including axioms generated using the above-mentioned Java tool OntoJOG, without the need to re-generate manually axiomized domain knowledge, which can be very resource intensive.

DTO to infer biologically and chemically relevant target classes

Chemically relevant target classes inferred by DTO
In addition to detailed asserted target classifications, DTO incorporates various other annotations including endogenous ligands for GPCRs, transported ions for ICs, gating mechanism for ICs, or pseudokinases. Endogenous GPCR ligands were manually mapped to ChEBI and classified by chemical category such as amine, lipid, peptide, etc. As ligands relate to receptor properties, GPCRs are typically classified based on their ligands; however, the ligand-based classification is orthogonal to the classification based on class A, B, C, adhesion, etc., and it changes as receptors are deorphanized.
An example for 5-hydroxytryptamine receptor is shown in Fig. 4; the receptor is inferred as aminergic receptor based on its endogenous ligand.
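The ligand-based inference can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the OWL reasoning used in DTO: the dictionaries stand in for ChEBI-mapped ligand categories, and the names `CHEMICAL_PARENT`, `ENDOGENOUS_LIGANDS`, and `inferred_receptor_classes` are hypothetical.

```python
# Toy chemical hierarchy (child -> parent); a stand-in for ChEBI categories.
CHEMICAL_PARENT = {"5-hydroxytryptamine": "amine"}

# Asserted annotation: receptor -> endogenous ligands.
ENDOGENOUS_LIGANDS = {"5-hydroxytryptamine receptor": ["5-hydroxytryptamine"]}

# Defined classes: ligand category -> receptor class inferred from it.
LIGAND_CLASS_TO_RECEPTOR_CLASS = {"amine": "aminergic receptor"}


def chemical_ancestors(chemical):
    """Walk the child -> parent map and collect all ancestors."""
    ancestors = []
    while chemical in CHEMICAL_PARENT:
        chemical = CHEMICAL_PARENT[chemical]
        ancestors.append(chemical)
    return ancestors


def inferred_receptor_classes(receptor):
    """Infer receptor classes from the categories of its endogenous ligands."""
    classes = set()
    for ligand in ENDOGENOUS_LIGANDS.get(receptor, []):
        for category in [ligand] + chemical_ancestors(ligand):
            if category in LIGAND_CLASS_TO_RECEPTOR_CLASS:
                classes.add(LIGAND_CLASS_TO_RECEPTOR_CLASS[category])
    return sorted(classes)


print(inferred_receptor_classes("5-hydroxytryptamine receptor"))
```

Because the receptor class is derived from the ligand annotation rather than asserted, adding a ligand for a previously orphan receptor would change the inferred class without editing the asserted hierarchy.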

Disease relevant target classes inferred by DTO
In a similar way, we categorized important disease targets by inference based on the protein-disease associations, which were modeled as 'strong', 'at least some', or 'at least weak' evidence using subsumption. For example, DTO uses the following hierarchical relations to declare the relation between a protein and the associated disease extracted from the DISEASES database.
has associated disease with at least weak evidence from DISEASES
  has associated disease with at least some evidence from DISEASES
    has associated disease with strong evidence from DISEASES
In the DISEASES database, the association between a disease and a protein is measured by a Z-Score [23]. In DTO, the "at least weak evidence" is translated as a Z-Score between zero and 2.4; the "some evidence" is translated as a Z-Score between 2.5 and 3.5; and the "strong evidence" is translated as a Z-Score between 3.6 and 5.
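The mapping from Z-Score to evidence level, together with the subsumption between the three relations, can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, and it assumes that the lower bound of each reported range acts as a threshold (so that values falling between the reported ranges, or above 5, are assigned to the range whose lower bound they exceed).

```python
# Relations ordered from most general to most specific; each relation is
# subsumed by the ones before it.
LEVELS = ("at least weak", "at least some", "strong")


def asserted_level(z_score):
    """Most specific evidence level for a Z-Score (thresholds are assumed)."""
    if z_score >= 3.6:
        return "strong"
    if z_score >= 2.5:
        return "at least some"
    if z_score >= 0:
        return "at least weak"
    return None  # negative scores are not assigned a relation in this sketch


def implied_levels(z_score):
    """All evidence levels that hold by subsumption for a Z-Score."""
    level = asserted_level(z_score)
    if level is None:
        return []
    return list(LEVELS[: LEVELS.index(level) + 1])


print(implied_levels(4.0))
```

Under this reading, a query for "at least some evidence" also returns associations asserted with strong evidence, which is the practical benefit of modeling the levels by subsumption.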
This allows querying or inferring proteins for a disease of interest by evidence. Disease-related targets were defined using the following axioms (as illustrative examples):
Putative metabolic disease targets ≡ Protein and ('has associated disease with strong evidence from DISEASES' some 'disease of metabolism');
Putative infectious disease targets ≡ Protein and ('has associated disease with strong evidence from DISEASES' some 'disease by infectious agent');
Putative mental health disease targets ≡ Protein and ('has associated disease with strong evidence from DISEASES' some 'developmental disorder of mental health')
We created such inference examples in DTO, including 29 metabolic disease targets, 36 mental health disease targets, and 1 infectious disease target.
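The effect of such an equivalence axiom can be approximated in plain Python. The sketch below is an assumption-laden illustration rather than an OWL reasoner: the protein and disease names are placeholders, the disease hierarchy is a toy stand-in for DOID, and `putative_targets` is a hypothetical helper.

```python
# Toy disease hierarchy (child -> parent); placeholder names, not DOID terms.
DISEASE_PARENT = {
    "metabolic_disease_x": "disease of metabolism",
    "infection_y": "disease by infectious agent",
}

# Placeholder associations: (protein, evidence level, disease).
ASSOCIATIONS = [
    ("protein_A", "strong", "metabolic_disease_x"),
    ("protein_B", "at least some", "metabolic_disease_x"),
    ("protein_C", "strong", "infection_y"),
]


def ancestors_or_self(disease):
    """Return the disease together with all of its ancestors."""
    lineage = [disease]
    while disease in DISEASE_PARENT:
        disease = DISEASE_PARENT[disease]
        lineage.append(disease)
    return lineage


def putative_targets(disease_class):
    """Proteins with strong evidence for some disease under disease_class."""
    return sorted(
        {
            protein
            for protein, level, disease in ASSOCIATIONS
            if level == "strong" and disease_class in ancestors_or_self(disease)
        }
    )


print(putative_targets("disease of metabolism"))
```

Only `protein_A` qualifies as a putative metabolic disease target here, because `protein_B` is linked to the same disease with weaker evidence than the axiom requires.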

Modeling and integration of Kinase data from the LINCS project
The Library of Integrated Network-based Cellular Signatures (LINCS, http://lincsproject.org/) program has a systems biology focus. This project has been generating a reference "library" of molecular signatures, such as changes in gene expression and other cellular phenotypes that occur when cells are exposed to a variety of perturbing agents. The project also builds computational tools for data integration, access, and analysis. Dimensions of LINCS signatures include the biological model system (cell type), the perturbation (e.g. small molecules) and the assays that generate diverse phenotypic profiles. LINCS aims to create a full data matrix by coordinating cell types and perturbations as well as informatics and analytics tools. We have processed various LINCS datasets, which are available at the LINCS Data Portal (http://lincsportal.ccs.miami.edu/) [37]. LINCS data standards [22] are the foundation of LINCS data integration and analysis. We have previously illustrated how integrated LINCS data can be used to characterize drug action [38]; these data include kinome-wide drug profiling datasets.
We have annotated the KINOMEscan domain data generated from the HMS LINCS KINOMEscan dataset. The annotation includes domain descriptions, names, gene symbols, phosphorylation status, and mutations. To integrate this information into DTO, we built a kinase domain module following the modularization approach described in section 2.2.
We started with an example scenario given by a domain expert, shown below: In this scenario, there are four major ontological considerations or relations that need to be addressed when building an ontology module (Fig. 5).

Kinase domain and kinase protein
DTO uses the "has part" relation to link the kinase protein and kinase domain, which reflects the biological reality that the kinase domain is a part of the full protein.

Kinase domain variations: Mutated kinase domain and phosphorylated kinase domain
A mutated kinase domain relates to its wild-type kinase domain through the "is mutated form of" relation. Both phosphorylated and nonphosphorylated forms of a kinase domain are children of the kinase domain from which they were modified to their current phosphorylation forms. Since the KINOMEscan assay does not provide the specific phosphorylation position information, a phosphorylated form of a kinase domain, either mutated or wild-type, is generally defined using an ad-hoc axiom: has part some "phosphorylated residue". Note that "phosphorylated residue" (MOD_00696) is an external class imported from the Protein Modification Ontology (MOD).

Pfam domain mapping to kinase domain and its variations
DTO data curators / domain experts have mapped all kinase domains (including their variations) to Pfam families using sequence-level data. This information was captured using the "map to pfam domain" relation, which links a kinase domain to a Pfam domain. Figure 5 shows how the above scenario is modeled in DTO by connecting the ABL1 kinase domain with the ABL1 protein using the relation "is part of", as well as how a kinase domain relates to a Pfam domain using the "map to pfam domain" relation. In this scenario, all the variations of the ABL1 kinase domain are mapped to the same Pfam domain.

Kinase gatekeeper and mutated amino acid residues
The kinase gatekeeper position is an important recognition and selectivity element for small molecule binding. One of the mechanisms by which cancers evade kinase drug therapy is by mutation of key amino acids in the kinase domain. Often the gatekeeper is mutated. Located in the ATP binding pocket of protein kinases, the gatekeeper residue has been shown to influence selectivity and sensitivity to a wide range of small molecule inhibitors. Kinases that possess a small side chain at this position (Thr, Ala, or Gly) are readily targeted by structurally diverse classes of inhibitors, whereas kinases that possess a larger residue at this position are broadly resistant [39].
DTO includes a "gatekeeper role" to define residues annotated as gatekeeper. In the case of the ABL1 kinase domain, THR74 within the ABL1 kinase domain is identified as a gatekeeper by the data curator / domain expert. This gatekeeper residue is further mapped to the 315th residue located in the whole ABL1 kinase amino acid sequence. DTO defines a term, THR315 in ABL1 kinase domain, with an axiom of "has role some gatekeeper role". With an equivalence definition of the term "gatekeeper residue" as anything that satisfies the condition "has role some gatekeeper role", DTO can group all the gatekeeper residues in this KINOMEscan dataset (Fig. 6).
Fig. 5 Relations between protein, kinase domain, mutated kinase domain, phosphorylated kinase domain, and pfam domains in the DTO
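The position mapping and the role-based grouping can be sketched as follows. This is a minimal illustration under assumptions: the `Residue` class and both helper functions are hypothetical, and the domain start of 242 is not stated in the text but is derived here from the two positions given above (74 in the domain, 315 in the full sequence), assuming a simple linear offset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Residue:
    amino_acid: str
    domain_position: int  # 1-based position within the kinase domain
    roles: frozenset = frozenset()


def to_full_length(domain_position, domain_start):
    """Map a 1-based domain position to a 1-based full-sequence position."""
    return domain_start + domain_position - 1


def gatekeeper_residues(residues):
    """Group residues by the condition 'has role some gatekeeper role'."""
    return [r for r in residues if "gatekeeper role" in r.roles]


# Assumed domain start, derived from THR74 (domain) mapping to residue 315.
ABL1_DOMAIN_START = 242

residues = [
    Residue("THR", 74, frozenset({"gatekeeper role"})),
    Residue("ALA", 10),  # placeholder residue without any role
]

for residue in gatekeeper_residues(residues):
    print(residue.amino_acid, to_full_length(residue.domain_position, ABL1_DOMAIN_START))
```

The grouping function mirrors the equivalence definition in spirit: membership in "gatekeeper residue" follows from the role annotation rather than from an explicit list.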

DTO shines light on Tdark proteins
With integrated information about drug targets available in DTO, it is possible, for example to query information for Tdark kinases for which data in LINCS is available. Kinases in the LINCS KINOMEscan assay were annotated by their (kinase) domain, phosphorylation status, gatekeeper residue and mutations as explained above. To illustrate this integration, we conducted a simple SPARQL query to identify Tdark (kinase) proteins that have a gatekeeper annotation in DTO.
The SPARQL query we used to search DTO is as follows: ?TDL rdfs:label?tdl_label. } We found in total 378 (kinase) proteins containing gatekeeper residue annotations. Of those 378 proteins, one (Serine/threonine-protein kinase NEK10) is a Tdark protein, two (Mitogen-activated protein kinase 4 and Serine/threonine-protein kinase WNK1) are Tbio proteins, 320 are  Table S1). We could then look for the associated disease and tissue expression information in DTO. For example, the Serine/threonine-protein kinase NEK10 (Tdark), which contains the gatekeeper residue Thr301, is associated with breast cancer by "weak evidence", and expressed in liver, testis, and trachea with "strong evidence". In this way, DTO provides rich information to prioritize proteins for further study, linked directly to KINOMEscan results via the LINCS Data Portal.
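The selection performed by the query can be mimicked on a few in-memory records. The sketch below is not the SPARQL query used for DTO and does not reproduce its results; it only uses the three proteins named in the text, and the field names and the `with_gatekeeper_by_tdl` helper are hypothetical.

```python
# Records limited to the proteins named in the text; field names are assumed.
PROTEINS = [
    {
        "name": "Serine/threonine-protein kinase NEK10",
        "tdl": "Tdark",
        "has_gatekeeper_annotation": True,
    },
    {
        "name": "Mitogen-activated protein kinase 4",
        "tdl": "Tbio",
        "has_gatekeeper_annotation": True,
    },
    {
        "name": "Serine/threonine-protein kinase WNK1",
        "tdl": "Tbio",
        "has_gatekeeper_annotation": True,
    },
]


def with_gatekeeper_by_tdl(records, tdl):
    """Names of proteins at a development level that have a gatekeeper annotation."""
    return [
        record["name"]
        for record in records
        if record["tdl"] == tdl and record["has_gatekeeper_annotation"]
    ]


print(with_gatekeeper_by_tdl(PROTEINS, "Tdark"))
```

The same two-condition filter (target development level plus gatekeeper annotation) is what joins the LINCS-derived annotations to the IDG classification.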

Integration of DTO in software applications

DTO visualization
The drug target ontology consists of >13,000 classes and >122,000 links. Our visualization has two options: a) a static pure ontology viewer that starts with the top-level concepts in a collapsible tree layout (mainly for browsing concepts) and b) a dynamic search and view page where a search-by-class user interface is combined with a collapsible force layout for deeper exploration. Figure 7 shows an excerpt of an interactive visualization of the DTO. Users can search for classes, alter the visualization by showing siblings, zoom in/out, and rearrange the figure by moving classes within the graph for better visualization.

Pharos: The IDG web portal
Pharos is the front-end Web Portal of the IDG project (http://pharos.nih.gov). Pharos was designed and built to encourage "serendipitous browsing" of a wide range of protein drug target information curated and aggregated from a multitude of resources [11]. Via a variety of user interface elements to search, browse and visualize drug target information, Pharos can help researchers to identify and prioritize drug targets based on a variety of criteria. The DTO is an integral part of Pharos; its user interface has been designed to integrate DTO at multiple levels of detail. At the highest level, the user can get a bird's-eye view of the target landscape in terms of the development level through the interactive DTO circle packing visualization (http://pharos.nih.gov/dto); see Fig. 8. For any suitable set of targets (e.g., as a result of searching and/or filtering), Pharos also provides an interactive sunburst visualization of the DTO as a convenient way to help the user navigate the target hierarchy. At the most specific level, each appropriate target record is annotated with the full DTO path in the form of a breadcrumb. This not only gives the user context but also allows the user to easily navigate up and down the target hierarchy with minimal effort.

TIN-X: Target importance and novelty explorer
TIN-X is a specialized, user-friendly Web-based tool to explore the relationship between proteins and diseases (http://newdrugtargets.org/) extracted from the scientific literature [13]. TIN-X supports searching and browsing across proteins and diseases based on ontological classifications. DTO is used to organize proteins, and content can be explored using the DTO hierarchy.

Discussion
The IDG program is a systematic effort to prioritize understudied, yet likely druggable protein targets for the development of chemical probes and drug discovery entry points [3]. DTO covers proteins as prospective druggable targets. Druggability can be considered from a structural point of view, i.e. proteins to which small molecules can bind. This structural druggability is implicit in the selection of the IDG target families, GPCRs, kinases, ion channels and nuclear receptors, for which there exists a large number of small molecule binders. Another aspect of druggability is the ability to induce a therapeutic benefit by modulating the biological function of the protein that the drug binds to. Establishing and prioritizing this functional druggability is one of the main goals of the IDG project. DTO includes knowledge of protein-disease association and the target development level for all proteins as a foundation to formally describe drug mechanisms of action. DTO provides a framework and formal classification based on function and phylogenetics, rich annotations of (protein) drug targets along with other chemical, biological, and clinical classifications and relations to diseases and tissue expression. This may facilitate the rational and systematic development of novel small molecule drugs by integrating mechanism of action (drug targets) with disease models, mechanisms, and phenotypes. DTO is already used in the Target Central Resource Database (TCRD, http://juniper.health.unm.edu/tcrd), the IDG main portal Pharos (http://pharos.nih.gov/) and the Target Importance and Novelty eXplorer (TIN-X, http://newdrugtargets.org/) to prioritize drug targets by novelty and importance. The search and visualization use the inferred DTO model, including the inferred classes described in this report.
We have illustrated how DTO and other ontologies are used to annotate, categorize and integrate knowledge about kinases, including nuanced target information of profiling data generated in the LINCS project. By doing so, DTO facilitates contextual data integration, for example considering the kinase domain or the full protein, phosphorylation status or even information important for small molecule binding, such as gatekeeper residues and point mutations. As we develop DTO and other resources, we will facilitate the otherwise challenging integration and formal linking of biochemical and cell-based assays, phenotypes, disease models, omics data, drug targets and drug poly-pharmacology, binding sites, kinetics and many other processes, functions and qualities that are at the core of drug discovery. In the era of big data, systems-level models for diseases and drug action, and personalized medicine, it is a critical requirement to harmonize and integrate these various sources of information.
The development of DTO also provided an example of building a large dataset ontology that can easily be extended and integrated with other resources. This is facilitated by our modularization approach. The modular architecture allows the developers to create terms in a more systematic way by creating manageable and contained components. For example, DTO vocabularies are created as separate files by the OntoJOG Java tool. Vocabulary files contain only classes and subsumption relations; the files are subsequently combined (imported) into the DTO core module. A similar, separate module is created from classes of external ontologies, thus cleanly separating responsibilities of ontology maintenance while providing a seamless integrated product for the users. OntoJOG auto-generated axioms import these vocabulary modules. The manual (expert-created), more complex axioms are layered on top. This way, when an existing data resource is updated, one only needs to update the corresponding auto-created file, e.g. the kinase vocabulary, or target-disease associations from the DISEASES database. Updating the auto-generated modules (including axioms) does not overwrite expert-created, more complex axioms, which formalize knowledge that cannot easily be maintained in a relational database. Separating domain-specific vocabularies also improves maintenance by multiple specialized curators and may improve future crowd-based development and maintenance. The modular design also makes it simpler to use DTO content in related projects such as LINCS or BAO. Last but not least, the modular architecture facilitates different "flavors" of DTO by incorporating upper-level ontologies, such as BFO or SUMO, via specific mapping (axiom) files; different DTO flavors can be useful for different user groups, e.g. a native version for typical end users of software products (such as Pharos or TIN-X) or a BFO version for ontologists who develop more expansive, integrated and consistent knowledge models.
Fig. 8 Visualization of the drug target ontology: using the circle packing layout available in the D3 visualization framework
Several drug target-related resources have been developed, such as the ChEMBL Drug Target Slim [40], where GO annotations are available for drug targets in ChEMBL. Protein Ontology recently enhanced the protein annotation with pathway information and phosphorylation site information [41]. Comprehensive FDA-approved drug and target information is available in DrugCentral, http://drugcentral.org/ [34]. The Open Targets Partnership between pharmaceutical companies and the EBI (http://www.opentargets.org/) is a complementary project with similarities to IDG. It developed the Open Target Validation Platform (//www.targetvalidation.org/) [42]. Both IDG and Open Targets make use of ontologies for data standardization and integration. Although there is significant overlap in the content integrated by both projects, there is currently little coordination with respect to data standards, including ontologies and data representation. For example, Open Targets uses the Experimental Factor Ontology (EFO) [43] to annotate diseases, whereas IDG and the DTO use DOID, primarily because of its use in DISEASES. Ongoing ontology mapping efforts will remedy these challenges. As DTO evolves, we aim to include additional content sources and ontologies to support integrative drug discovery and target validation efforts via a semantic drug target framework.

Conclusions
DTO was built based on the need for a formal semantic model for druggable targets including various related information such as protein, gene, protein domain, protein structure, binding site, small molecule drug, mechanism of action, protein tissue localization, disease association, and many other types of information. DTO will further facilitate the challenging integration and formal linking to biological assays, phenotypes, disease models, drug poly-pharmacology, binding kinetics and many other processes, functions and qualities that are at the core of drug discovery. The first version of DTO is publicly available via the website http://drugtargetontology.org/, Github (http://github.com/DrugTargetOntology/DTO), and the NCBO Bioportal (http://bioportal.bioontology.org/ontologies/DTO). The long-term goal of DTO is to provide such an integrative framework and to populate the ontology with this information as a community resource.

Additional file
Additional file 1: Table S1.