Recent advances in imaging techniques have made the study of complex biological systems feasible, particularly at the cellular level, complementing existing “omics” approaches, most notably genomics and proteomics, by resolving and quantifying spatio-temporal processes at single-cell resolution [1]. High content screening (HCS) is an imaging-based, multi-parametric approach to the study of living cells. HCS is used in biological research and drug profiling to identify substances, such as small molecules or RNA interference (RNAi) reagents, that alter the phenotype of a cell. It can also be used to examine the effect of knocking out genes completely, or to determine protein localization by modifying genes to produce tagged proteins that can be visualized. Phenotypes may include morphological changes of a whole cell, or of any of its cellular components, as well as alterations of cellular processes.
Correlative analysis of cellular phenotypes specific to individual genes with morphological imaging data from diseased tissue specimens (both human and mouse) allows us to link phenotypic data to associated image annotations and metadata, providing a powerful predictor of disease biomarkers as well as drug targets. For example, when a cellular phenotype such as ‘mitotic delay’ or ‘multi-nucleated cells’, observed in cells after gene knockdown experiments, is also observed in cells of a cancer tissue, this may indicate which gene(s) are involved in the aetiology of the disease in that specific tissue. Knowledge of the functional implications of somatic tumor mutations can thus be used to design more targeted drug therapies.
Data derived from live cell imaging is typically associated with rich metadata, including genetic information, and can be more easily interpreted and linked to underlying molecular mechanisms. As we move to higher organisms, such as mouse and human, the degree of metadata available decreases (e.g. no genetic information is available for diseased human tissues), as does the range of assays that can feasibly be carried out in such organisms (e.g. genetic engineering is only possible in cell lines and mouse models). Taking this into consideration, it becomes evident that integrating imaging datasets from different biological domains could greatly advance our understanding of the molecular mechanisms underlying specific diseases.
Due to its late arrival on the “omics” scene, the imaging field has not yet achieved the same degree of standardization that other high-throughput approaches have already reached [1], thus hampering integration of image data with current biological knowledge. Standards are needed for describing, formatting, archiving and exchanging image data and associated metadata, including suitable nomenclatures and a minimal set of information for describing an imaging experiment. This is crucial to enable the establishment of databases and public repositories for image data and allow for the integration of independent datasets.
The use of ontologies to annotate data in the life sciences is now well established and provides a means for the semantic integration of independent datasets. Despite the availability of several species-specific ontologies for describing cellular phenotypes (e.g. the Fission Yeast Phenotype Ontology), no appropriate infrastructure is in place to support the large-scale annotation and integration of phenotypes across species and different biological domains.
As part of the BioMedBridges project (footnote 1), efforts are underway to integrate biological imaging datasets provided by emerging biomedical sciences research infrastructures, including Euro-BioImaging (footnote 2), for the provision of cellular image data; Infrafrontier (footnote 3), for mouse tissue image data; and BBMRI/EATRIS (footnote 4), for human tissue image data. Such infrastructures are generating a wealth of imaging data that can only be made interoperable through consistent annotation with appropriate ontologies.
There has been much work published on the development of cross-species phenotype ontologies and their benefits [2]. To date, phenotype ontologies exist for a host of taxa, including mammals (MP; [3]), Ascomycetes (APO; [4]), S. pombe (FYPO; [5]) and C. elegans (WPO; [6]). There are also well-established ontology design patterns for modeling phenotypes in a species- and domain-independent manner that utilise the Phenotype and Trait Ontology (PATO) [7]. These phenotypic descriptions are based on the Entity-Quality (EQ) model, which describes a phenotype in terms of an Entity (E), drawn from one of many reference ontologies, such as the Gene Ontology (GO; [8]), and an associated Quality (Q), drawn from PATO [9]. For example, a “large nucleus” phenotype could be expressed in EQ using the entity term “nucleus” [GO:0005634] from GO's cellular component branch and the quality term “increased size” [PATO:0000586] from PATO. This model has been adopted by a range of model organism databases for the annotation of phenotypes spanning the disease, anatomical and cellular domains [10].
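The EQ pattern above can be sketched as a simple data structure: an illustrative Python example (not part of any of the cited tools) pairing an entity term with a quality term to express the “large nucleus” phenotype from the text.

```python
# Minimal sketch of the Entity-Quality (EQ) model: a phenotype is
# expressed as an Entity term (here from GO) plus a Quality term (from PATO).
from dataclasses import dataclass

@dataclass(frozen=True)
class Term:
    curie: str   # compact identifier, e.g. "GO:0005634"
    label: str

@dataclass(frozen=True)
class EQPhenotype:
    entity: Term    # E: the bearer, e.g. a GO cellular component
    quality: Term   # Q: a PATO quality

    def describe(self) -> str:
        # Compose a human-readable phrase from the two terms
        return f"{self.quality.label} {self.entity.label}"

# The "large nucleus" example from the text:
nucleus = Term("GO:0005634", "nucleus")
increased_size = Term("PATO:0000586", "increased size")
large_nucleus = EQPhenotype(entity=nucleus, quality=increased_size)

print(large_nucleus.describe())  # increased size nucleus
```

In real annotation pipelines the E and Q components are drawn from the full reference ontologies rather than hand-built records; the sketch only shows how the two-part composition works.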
Ontology languages, such as the Web Ontology Language (OWL), allow us to express logical definitions for classes that describe class membership based on quantified relationships to other classes. The Basic Formal Ontology (BFO) defines the “inheres in” [BFO:0000023] relationship that can be used to capture the relationship between qualities, which in BFO are specifically dependent continuants, and the bearer of those qualities, which are typically independent continuants. For example, in order to logically define a “large nucleus phenotype” we say that the quality of “increased size” inheres in the bearer, which in this case would be the “nucleus”. We can express this relationship logically in OWL using existential quantification to assert that the class of all “large nucleus phenotypes” is equivalent to the class of things that have an “increased size” quality that “inheres in” a “nucleus”. We could further describe another, more general class of phenotypes, such as “nucleus size phenotype”, and, because “increased size” is a subclass of the more general “size” quality, use an OWL reasoner to automatically classify “large nucleus phenotype” as a subclass of “nucleus size phenotype”. Highly scalable reasoners, such as ELK [11], have made it practical for ontology engineers to fully exploit this expressivity when working with large ontologies. In the case of building phenotype ontologies, it means we can now build logical class definitions for a large number of phenotypes following the EQ pattern, and let the reasoner do the work to classify those phenotypes and infer equivalence across different phenotype ontologies.
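The classification step can be illustrated with a toy sketch. The following Python code is not a real OWL reasoner (in practice this inference is done with OWL axioms and a reasoner such as ELK); it merely demonstrates, over hypothetical miniature term hierarchies, why “large nucleus phenotype” is subsumed by “nucleus size phenotype” when its quality is a subclass of the more general quality.

```python
# Toy subsumption check for EQ-defined phenotype classes. Each phenotype
# class is a (quality, entity) pair; one class subsumes another when its
# quality and entity terms each subsume the other's.

# Hypothetical minimal subclass hierarchies (term -> parent term):
quality_parents = {"increased size": "size", "size": "quality"}
entity_parents = {"nucleus": "organelle"}

def is_subclass(term, ancestor, parents):
    """Walk the parent chain to test whether `term` is a (reflexive) subclass of `ancestor`."""
    while term is not None:
        if term == ancestor:
            return True
        term = parents.get(term)
    return False

# EQ class definitions: phenotype := quality that "inheres in" entity
large_nucleus = ("increased size", "nucleus")
nucleus_size = ("size", "nucleus")

def subsumes(general, specific):
    """True if every instance of `specific` must also be an instance of `general`."""
    gen_quality, gen_entity = general
    spec_quality, spec_entity = specific
    return (is_subclass(spec_quality, gen_quality, quality_parents)
            and is_subclass(spec_entity, gen_entity, entity_parents))

print(subsumes(nucleus_size, large_nucleus))  # True: "large nucleus phenotype"
                                              # classifies under "nucleus size phenotype"
print(subsumes(large_nucleus, nucleus_size))  # False: the reverse does not hold
```

A real reasoner handles far richer class expressions than this pairwise check, but the direction of the inference is the same.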
A previous effort to develop a species-neutral cellular phenotype ontology (CPO) was undertaken by Hoehndorf et al. [12]. The CPO was automatically generated and includes logical class definitions composed from GO and PATO terms. Whilst in principle this is a reasonable approach, in practice the resulting ontology was difficult to work with and did not provide a good vocabulary for data annotation. The size of the ontology, coupled with limitations in standard ontology authoring software, made it impractical to extend and maintain this ontology whilst keeping in sync with GO and PATO via the automatic generation process. The size and automatic label creation strategy also made it difficult for biocurators to find terms for annotating data. It would have taken a considerable amount of effort to manually clean the CPO to make it fit for purpose as a general annotation vocabulary for imaging datasets.
Our approach was therefore to build CMPO from the available data, using a post-composition approach in which phenotypes were manually annotated with ontology terms that were later used to compose new, stable phenotype terms in the ontology. These new terms were annotated with appropriate metadata, such as synonyms and definitions, that reflect how the terms are used in the data and literature.
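The workflow above can be sketched as follows; the identifier and helper function here are hypothetical, included only to show how an EQ annotation gathered during curation might be bundled with curated metadata into a stable term record.

```python
# Illustrative sketch (hypothetical identifier and function, not CMPO code):
# turning an EQ annotation plus curated metadata into a stable term record.

def compose_term(term_id, entity, quality, synonyms, definition):
    """Bundle an EQ composition with curated metadata into a stable term record."""
    return {
        "id": term_id,
        "label": f"{quality} {entity} phenotype",
        "eq": {"entity": entity, "quality": quality},
        "synonyms": synonyms,
        "definition": definition,
    }

term = compose_term(
    "CMPO:EXAMPLE",                # hypothetical identifier
    entity="nucleus",
    quality="increased size",
    synonyms=["large nucleus"],    # reflects usage in the data and literature
    definition="A phenotype in which the nucleus is increased in size.",
)
print(term["label"])  # increased size nucleus phenotype
```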