Numerous existing ontologies and standard initiatives can contribute to the creation of a toxicology ontology supporting the needs of predictive toxicology and risk assessment. We briefly review a number of relevant related projects here.
Computational tools for predictive toxicology include a range of well-known machine learning and bioinformatics algorithms, as well as specific cheminformatics procedures, such as descriptor calculation and chemical structure processing. The Blue Obelisk descriptor ontology  was the first attempt to provide a formal description of some cheminformatics algorithms. It was adopted in OpenTox and further extended to incorporate algorithms not available in the original version. The Chemical Information Ontology is another ontology, which was published , and is considered the successor of the Blue Obelisk descriptor ontology. However, it is not yet used in OpenTox, as it only became available recently. Similarly, the lack of ontologies covering the machine learning and data mining domains at the beginning of the project led to the independent development of the OpenTox ontology , which represents the core components of the OpenTox framework, such as datasets, features, tasks, algorithms, models and validation. We were not aware at that time of DAMON , which was developed in the context of Grid services and is available in DAML+OIL instead of OWL, which makes it harder to reuse. Despite having been built in the context of predictive toxicology, the OpenTox ontology shares several similarities with published data mining ontologies: the Ontology of Data Mining (OntoDM) [9, 10], KDDOnto , the KDO ontology , the DMWF Ontology , and the e-LICO Data Mining Ontology (DMO), developed in the framework of another EU FP7 project . OntoDM is based on the unification of the field of data mining and the growing demand for formalized representation of outcomes of research. It includes definitions of basic data mining entities, such as datatype, dataset, data mining task, data mining algorithm and components thereof (e.g., distance function).
OntoDM also allows the definition of more complex entities, e.g., constraints in constraint-based data mining, sets of such constraints (inductive queries) and data mining scenarios. The e-LICO team launched the Data Mining Ontology Foundry , which is populated with the e-LICO suite of ontologies for data mining (DMO) and for model selection and meta-mining (Data Mining Optimization – DMOP) . DMO includes similar basic data mining entities and provides means to compose workflows automatically by identifying algorithms with compatible input and output. Finally, the practice of collecting details of machine learning experiments in “experiment databases” for subsequent analysis [17, 18] is comparable to the OpenTox framework design, which provides distributed storage for all details of predictive toxicology workflows.
Despite the surge of simultaneous activities in developing data mining ontologies, their adoption by the major data mining platforms and tools is still a future goal.
Unifying toxicology data description presents additional challenges.
As one of the central repositories of large-scale biomedical ontologies, the OBO Foundry  is an important source of ontologies for reuse. Several OBO ontologies could potentially be used as part of the development of a Toxicology Ontology.
The Gene Ontology (GO)  project is a collaborative effort to address the need for consistent descriptions of gene products in different databases. The GO project has developed three structured controlled vocabularies that describe gene products in terms of their associated biological processes, cellular components and molecular functions in a species-independent manner.
Chemical Entities of Biological Interest (ChEBI)  is a freely available dictionary of molecular entities focused on “biologically interesting” chemical entities and their activities in a biological context. The molecular entities in question are either natural or synthetic products used to intervene in the processes of living organisms.
The Ontology of Biomedical Investigations (OBI)  provides terminology relevant to experimental biological and clinical investigations. This includes a set of 'universal' terms, applicable across various biological and technological domains, and domain-specific terms. This ontology supports the consistent annotation of biomedical investigations, regardless of the particular field of study. The OBI addresses the need for a cross-disciplinary approach and represents all phases of experimental processes, and the entities involved in preparing for, executing, and interpreting those processes e.g., study designs, protocols, instrumentation, biological material, collected data and analyses performed on that data.
Other existing ontologies of relevance to the toxicology domain include anatomy ontologies such as the Foundational Model of Anatomy (FMA)  and the Mouse adult gross anatomy ontology . The FMA Ontology is a knowledge source for biomedical informatics; it is concerned with the classes or types and the relationships necessary for the symbolic representation of the phenotypic structure of the human anatomy. Its ontological framework can be applied and extended to all other species. The Mouse adult gross anatomy ontology represents the Anatomical Dictionary for the Adult Mouse. This ontology organizes anatomical structures for the postnatal mouse spatially and functionally, using 'is a' and 'part of' relationships. A browser can be used to view anatomical terms and their relationships in a hierarchical display.
One more toxicology-relevant ontology is the NCI Thesaurus , which contains terminology relevant for clinical care, translational and basic research, and public information and administrative activities, with respect to cancer and related diseases and targeted therapies. The NCI Thesaurus provides definitions, synonyms, and other information on nearly 10,000 cancers and related diseases, 8,000 single agents and combination therapies, and a wide range of other topics related to cancer and biomedical research.
The National Center for Biomedical Ontology (NCBO) hosts BioPortal, another important ontology repository. BioPortal provides access to ontologies of interest to the biological and biomedical community, including large-scale terminology standards such as MeSH  and SNOMED . MeSH (Medical Subject Headings) is the controlled vocabulary thesaurus used for indexing articles for PubMed. SNOMED CT (Systematized Nomenclature of Medicine -- Clinical Terms) is a systematically organised, computer-processable collection of medical terminology covering most areas of clinical information, such as diseases, findings, procedures, microorganisms, and substances. It provides a consistent way to index, store, retrieve, and aggregate clinical data across specialties and sites of care. It also helps to organise the content of medical records, reducing the variability in the way data is captured, encoded and used for clinical care of patients and research.
The eTOX project  aims to develop a drug safety database from pharmaceutical industry legacy toxicology reports and public toxicology data. Ontology development within the eTOX Project includes the activity to create ontologies for preclinical safety.
There are two important toxicological standards that are publicly available: the OECD harmonized templates (OECD HTs)  and the ToxML (Toxicology XML standard) schema (initiated by Leadscope Inc.) .
The OECD HTs correspond to the IUCLID5 XML  schemas, which are meant to be used by industry when submitting documentation on their chemicals to EU regulatory authorities. For each endpoint, the OECD HTs define a series of fields, e.g., those defining the information submission requirements of a carcinogenicity experiment. Since they are generic enough to include data on endpoints with different characteristics, the OECD HTs in principle provide a substantial basis for building an ontology. However, they are not very formalized: they leave much room for free-text entry and have a strong administrative emphasis rather than a scientific focus.
ToxML is an XML data exchange standard based on a controlled toxicity vocabulary. The most recent ToxML release has a comprehensive, well-structured schema for many toxicity studies (carcinogenicity, in vitro mutagenicity, in vivo micronucleus mutagenicity, repeated dose toxicity), which fits the purposes of OpenTox well. For this reason we decided to explore the possibility of semi-automatic conversion of the ToxML schema to OWL-DL within OT, with the purpose of benefitting from the reasoning mechanism of OWL. The resulting ontology will be applied to reference and annotate the contents of databases coming from various sources and toxicity studies.
Even though all the ontologies described above exist, there is no systematic ontology for toxicological effects and predictive toxicology needs. The aim of the OpenTox ontology is to standardize and organize chemical and toxicological databases and to improve the interoperability between toxicology resources processing this data.
Although several ontologies covering the anatomy domain exist, there is a serious gap for localized histopathology and, more generally, for ontologies of microanatomy. For this reason the Organs and Effects ontology has been developed within OpenTox. This ontology is closely linked to the INHAND initiative (International Harmonization of Nomenclature and Diagnostic Criteria for Lesions in Rats and Mice). INHAND aims to develop for the first time an internationally accepted standardized vocabulary for neoplastic and non-neoplastic lesions, as well as the definition of their diagnostic features. The description of the respiratory system  is already implemented in the OT ontology; additionally, the terms and diagnostic features of the hepatobiliary system were published .
OpenTox and ontology need
A predictive toxicology framework essentially needs to provide modelling and predictive capabilities, and data access to chemical structures and toxicity data. From an ontology development point of view, data mining ontologies are relevant for the former, and ontologies handling the representation of chemical entities and biological data are required for the latter. Data mining ontologies are usually developed from an abstract point of view, since the relevant algorithms and data structures are independent of the specific domain to which they are applied. While this approach certainly has its advantages, the ultimate result of using only data mining concepts to represent a predictive toxicology model is that the biological context is stripped off.
A predictive toxicology model reporting whether a chemical compound is carcinogenic or not would be represented as a classification model, trained by a given classification algorithm and predicting a binary outcome. The training dataset would most often be represented in matrix form, with the provenance information on how the carcinogenicity measurements had been taken either discarded or, in the best case, described in human-readable form in the accompanying documentation only. This is sufficient to build the model and assess its performance, but is less useful for the end users, who are experts in toxicology but not in modelling algorithms.
On the other hand, toxicology studies are represented in much more detail in specialized databases, but data exchange formats rarely make use of structured formats or ontological representation. Moreover, they are not directly useful for processing by data mining software tools.
We defined the OpenTox ontology to represent datasets and properties of chemicals by unified means, suitable for the modelling algorithms.
To summarize, ontology development in the OpenTox framework is not an end goal in itself, but an inherent part of retaining the biological context in machine learning datasets and keeping track of the provenance of the data as it is passed through various processing methods.
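The idea of carrying biological context and provenance alongside the numeric data can be illustrated with a minimal Python sketch. All names here (`Feature`, `Column`, `apply_step`, and the field names) are hypothetical and chosen for illustration only; they are not taken from the OpenTox ontology or API.

```python
from dataclasses import dataclass, replace
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Feature:
    """Hypothetical column description that keeps the biological context."""
    name: str                          # field name in the source file
    endpoint: str                      # toxicological endpoint measured
    source: str                        # dataset the values came from
    provenance: Tuple[str, ...] = ()   # processing steps applied so far


@dataclass
class Column:
    feature: Feature
    values: List[float]


def apply_step(column: Column, step_name: str,
               func: Callable[[float], float]) -> Column:
    """Transform the values and record the step, leaving the input untouched."""
    feature = replace(column.feature,
                      provenance=column.feature.provenance + (step_name,))
    return Column(feature, [func(v) for v in column.values])


raw = Column(Feature("Canc", "carcinogenicity", "ISSCAN"), [1.0, 3.0])
scaled = apply_step(raw, "divide-by-3", lambda v: v / 3)
print(scaled.feature.endpoint, scaled.feature.provenance)
```

In this sketch the endpoint and source travel with the values through each transformation, whereas a bare matrix would lose them after the first step.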
The OpenTox ontology aims to cover, from a semantic point of view, the toxicological endpoints and experimental databases included in the OT final database. The data sources have been selected from publicly available sources providing high-quality structural and/or toxicological data. There are currently no standard datasets in this area, and for this reason the purpose of the OT ontology was to integrate all these heterogeneous databases. One of the important datasets considered for the construction of the various ontologies was the DSSTox CPDBAS (Carcinogenic Potency Database) . Another example of such a data source is the ISSCAN database  developed by the OT partner Istituto Superiore di Sanità (ISS). This database originates from the experience of researchers in the field of structure-activity relationships (SAR), aimed at developing models which theoretically predict the carcinogenicity of chemicals.
These two public and widely known datasets illustrate the typical current state of toxicity data representation. Both datasets are available as SDF files, with fields described in human-readable documents only. The outcome of the carcinogenicity study is represented in the "ActivityOutcome" field in CPDBAS (with allowed values "active", "unspecified", "inactive"), while ISSCAN uses a numeric field named "Canc" with allowed values of 1, 2, or 3. The description of the numbers (3 = carcinogen; 2 = equivocal; 1 = non-carcinogen) is only available in a separate "Guidance for Use" PDF file. Ideally, toxicity prediction software should offer comparison between the data and models derived from both datasets. At present this is impossible without human effort to read the guides and establish the semantic correspondence between the relevant data entries, if and when such a correspondence exists.
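The manual correspondence described above can be made explicit with a small lookup, sketched here in Python. The field values are those quoted in the text; the shared labels and the function name `harmonize` are hypothetical. Treating CPDBAS "active" as equivalent to ISSCAN 3 (carcinogen) is an assumption made for illustration, not a correspondence stated by either dataset's documentation.

```python
from typing import Optional

# CPDBAS "ActivityOutcome" values mapped to hypothetical shared labels.
# The active -> carcinogen equivalence is an illustrative assumption.
CPDBAS_ACTIVITY_OUTCOME = {
    "active": "carcinogen",
    "inactive": "non-carcinogen",
    "unspecified": "unspecified",
}

# ISSCAN "Canc" codes, as described in its "Guidance for Use" document.
ISSCAN_CANC = {3: "carcinogen", 2: "equivocal", 1: "non-carcinogen"}


def harmonize(source: str, value) -> Optional[str]:
    """Return the shared label for a raw field value, or None if unknown."""
    if source == "CPDBAS":
        return CPDBAS_ACTIVITY_OUTCOME.get(str(value).strip().lower())
    if source == "ISSCAN":
        try:
            return ISSCAN_CANC.get(int(value))
        except (TypeError, ValueError):
            return None
    raise ValueError(f"unknown data source: {source}")


# Entries from both datasets become directly comparable.
print(harmonize("CPDBAS", "active") == harmonize("ISSCAN", 3))
```

An ontology plays the role of this lookup in a machine-readable and shareable form, so that the mapping does not have to be rebuilt by each software tool.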