The mouse pathology ontology, MPATH; structure and applications

Background The capture and use of disease-related anatomic pathology data for both model organism phenotyping and human clinical practice requires a relatively simple nomenclature and coding system that can be integrated into data collection platforms (such as computerized medical record-keeping systems) to enable the pathologist to rapidly screen and accurately record observations. The MPATH ontology was originally constructed in 2,000 by a committee of pathologists for the annotation of rodent histopathology images, but is now widely used for coding and analysis of disease and phenotype data for rodents, humans and zebrafish. Construction and content MPATH is divided into two main branches describing pathological processes and structures based on traditional histopathological principles. It does not aim to include definitive diagnoses, which would generally be regarded as disease concepts. It contains 888 core pathology terms in an almost exclusively is_a hierarchy nine layers deep. Currently, 86% of the terms have textual definitions and contain relationships as well as logical axioms to other ontologies such the Gene Ontology. Application and utility MPATH was originally devised for the annotation of histopathological images from mice but is now being used much more widely in the recording of diagnostic and phenotypic data from both mice and humans, and in the construction of logical definitions for phenotype and disease ontologies. We discuss the use of MPATH to generate cross-products with qualifiers derived from a subset of the Phenotype and Trait Ontology (PATO) and its application to large-scale high-throughput phenotyping studies. MPATH provides a largely species-agnostic ontology for the descriptions of anatomic pathology, which can be applied to most amniotes and is now finding extensive use in species other than mice. It enables investigators to interrogate large datasets at a variety of depths, use semantic analysis to identify the relations between diseases in different species and integrate pathology data with other data types, such as pharmacogenomics.


Background
Since the late eighteenth century when achromatic lenses and reliable histological stains began to be available, investigators of anatomic pathology, and particularly in the mid -nineteenth century the innovators of cellular pathology such as Rudolf Virchow, developed and applied terminologies to describe their observations [1,2]. These depended on the "school" to which the pathologists belonged, but more importantly on the etiologic or mechanistic paradigm in which they were working [3]. One of the great achievements of the nineteenth century was the recognition of the universality of pathological processes and entities and their occurrence in multiple species as recognisable manifestations of the same underlying processes [4]. It was, nevertheless, a century before broadly accepted and rationally structured pathology terminologies were developed (e.g. [5]). The development of pathology terminologies has to an extent occurred independently of disease terminologies and nosologies, partly as a result of the much longer history of classifying diseases, and partly due to the inherited preconceptions of the nature of disease in clinical medicine.
The distinction between pathological and clinical descriptions of disease, disorders and predispositions is still not satisfactorily resolved. However, in recent years there have been attempts to rationalise the definitions of these concepts [6] and their relation to each other as part of a broadly applicable model of disease other than an unstructured collection of manifestations or phenotypes which are found in that class of individuals forming the basis of a diagnosis. Issues about severity, time course, organ involvement etc. are beginning to be addressed, but it is remarkable that even treating diseases as a "bag" of phenotypes has been shown to provide a powerful approach in establishing the relationships between diseases, and the presence of related diseases in different organisms [7][8][9][10]. What has recently been identified as important, nevertheless, is that the tissue-specific resolution of the recording of lesions, and the ability to record the pattern of disease within an individual, has proved vital for GWAS mapping of predisposing genetic variants in inbred strains of mice allowing each class of lesion to be analysed in isolation [11,12].
The discipline of pathology may be broken down into clinical and anatomic pathology, the former is concerned with clinical chemistry, hematology, clinical microbiology and emerging sub-specialities such as molecular diagnostics and proteomics. The latter, which forms the domain of MPATH, deals with the histological, histochemical or immunohistochemical observations of alterations in tissue composition or architecture. Both branches of the medical specialty, which are increasingly merging, may be viewed as aspects of phenotyping, and both provide subtypes of the clinical signs associated with ongoing disease processes, the results of developmental abnormalities, or the historical presence of disease.

Anatomic pathology nomenclature and its applications
The universality of the repertoire of responses to underlying genetic or extrinsic insults means that gross and histopathologically-defined phenotypes are some of the most useful phenotypes for relating diseases between different species, and constitute some of the most speciesagnostic phenotype descriptors. This makes a pathologic term-based ontology a crucial tool in experimental and clinical phenotype data capture [13].
The development of systematic human pathologic nomenclatures has been driven by the efforts of the American College of Pathologists, initially with the development of the pathology specific nomenclature (SNOP) over 40 years ago [14] to the current SNOMED -CT with cross references to UMLS, the NCI thesaurus and other terminologies. The ICD [15], now in its 11 th revision and the associated ICD-O v-3 for cancer, also contains descriptions of many pathological lesions.
The other driver for pathologic terminology standardisation has been coding of lesions from toxicopathology. The American Society of Toxicopathology (STP) working with Registry of Industrial Toxicology Animal-data (RITA) database group in Europe has produced several internationally accepted nomenclature systems, particularly focusing on proliferative lesions. Recently, the STP has undertaken a major harmonization exercise for rodent pathologythe INHAND (International Harmonization of Nomenclature and Diagnostic Criteria for Lesions in Rats and Mice) initiative [16]. So far this group has reported on the hepaticobiliary, respiratory, nervous and urinary systems [17][18][19][20]. For some time the National Cancer Institute's Mouse Models of Human Cancer consortium (MMHCC) has been examining the classification of tumours in genetically engineered mice. MMHCC has produced a consensus base terminology for neoplasias of the major organ systems that have been presented in a series of papers over the last decade [21].
Despite the huge value of these resources, none is currently constructed as an ontology with meaningful axioms to support inference and automated reasoning, and to that end we developed MPATH to describe lesions that arise in laboratory mice.

Construction and content
The MPATH ontology was constructed ab initio by a group of clinical and veterinary pathologists in 2,000 and has since been revised and augmented by an evolving group of US and European pathologists on a regular basis. It is clear from more than a decade of experience that expert input and manual curation are essential to generate an accurate and functional resource. One strategy for building the ontology has been to integrate it into large-scale phenotyping and diagnostic programs so that the pathologists use it on a daily basis and have fields to add missing terms or synonyms that they are more familiar with thereby constantly increasing its coverage and utilitarian value.
MPATH contains 888 classes in an almost exclusively is_a hierarchy eight layers deep. Part_of relations are used only in the case of, for example, adenomatous polyp (MPATH:490) and intraductal papilloma (MPATH:285) where pathologists distinguish between a macroscopic lesion ("-osis") and individual lesions which make them up; both need to be described. The top level distinction in MPATH is between pathophysiology (pathological processes) and anatomic pathology (pathological entities). One issue met frequently in development was the normal practice of referring to observation of a physical lesion by using the process term; for example "necrosis" or "sclerosis". Thus the noun describing the real-world entity observed is homonymous with the inferred process. This problem has been addressed through the textual and logical definitions of terms but is a recurrent source of confusion in formal treatments of pathology nomenclature. Pathologists using MPATH almost exclusively use the anatomic pathology segment of the ontology with the exception of describing inflammation or other general processes where the process is described using qualifiers such as "acute" which are logically process-specific, consistent with the use of Phenotype And Trait Ontology (PATO) [22] qualifiers (see below). The upper levels of MPATH's anatomic pathology branch include six broad domains familiar to all traditions of pathology training and comprehensively covering all known lesions; cell and tissue damage, circulatory disorder, developmental and structural abnormality, growth and differentiation defect, healing and repair structure, and neoplasm, and are as orthogonal as is feasible given the complexities of pathobiology (see Figure 1A). The upper levels of MPATH's pathophysiology branch denote pathological processes that underlie lesions and include six broad domains; cell and tissue damage process (see Figure 1C), defective growth and differentiation process, developmental process abnormalities, healing and repair process, immunopathological process, and neoplasia. All pathological processes and entities can be placed within these upper level domains, which will be familiar to all pathologists and are common to all amniotes.

Relationship to upper level and other ontologies
MPATH is largely congruent with the upper level Ontology for General Medical Science (OGMS) [6], founded on the Basic Formal Ontology (BFO). Pathological bodily process (OGMS:0000061) and pathological anatomical structure (OGMS:0000077) are broadly mappable to the upper levels of MPATH; MPATH:603 (pathological anatomical entity) and MPATH: 596 (pathological process) respectively. However, more detailed mapping is difficult. For example the MPATH experts view congenital malformations as pathological anatomical structures, whereas OGMS views them as distinct, and similarly MPATH views inflammation as a pathological process whereas OGMS does not include this as a pathological bodily process. Until such discrepancies are resolved, integration of MPATH into the OGMS framework will be problematical.
From the point of view of application, the most important mappings for MPATH are to the Human Phenotype Ontology (125), the Mammalian Phenotype ontology (111), the Disease Ontology (231), SNOMED-CT (867) and the NCIt (566), reflecting the emphasis on the domain of anatomic pathology rather than disease.

Definitions and axiomatic relationships
Currently, 86% of classes have textual definitions. Each class is in the mouse pathology namespace and is uniquely identified by a URI of the form: http://purl. obolibrary.org/OBO/MPATH_n. The main ontology is available in both the OBO Flatfile Format and the Web Ontology Language (OWL). MPATH is housed in a subversion repository and is made available via OBO registry, Bioportal (http://purl.bioontology.org/ontology/MPATH) and on the project's website http://mpath.googlecode.com/. MPATH contains relationships and other logical axioms to other ontologies such the Gene Ontology (GO) [23], Cell Type ontology (CL) [24] and the Phenotype And Trait Ontology (PATO) [22]. For example, the MPATH term transitional cell metaplasia (MPATH:172) represents a metaplastic response of the transitional epithelium, for example in the bladder to give squamous metaplasia and glandular metaplasia. To allow computational access to these relations, we use the derives-from relation and relate metaplasia (MPATH:549) (an MPATH term that denotes an abnormal transformation of a differentiated adult cell or tissue of one kind into a differentiated tissue of another kind) with the CL term transitional epithelial cell (CL:0000244).

Application and utility
A post-composition strategy for pathology coding Traditionally pathologists have relied on a narrative form of recording their definitive diagnoses, making use of morphologic, etiologic, and disease-based terms that collectively provide a diagnosis useful for clinical patient management. This is particularly important for nonneoplastic lesions where it can be complex to capture important subtleties of distribution, severity, microscopic sub-type and anatomical location for example. Whilst this is the gold standard, it is not possible to compute on data recorded in this way and it is very difficult to tabulate and quantitatively analyse the collected information. There are strong arguments, mainly from experience in toxicologic pathology, that a descriptive (anatomic) rather than diagnostic coding is the most objective and useful way to code pathology-based observations. This is particularly relevant to examination of mutant mice where traditional etiologic or summative diagnostic terms are simply not available because of the novelty of the lesion or its presentation. This is particularly the case where mice are manipulated to model human conditions that have not been previously seen, for example lung or mammary tumours [11,25,26] which have not previously been reported to occur spontaneously in mice. In many cases, a disease diagnosis implies a particular pathogenesis or etiology based on the spontaneous disease, which is not appropriate for the disease caused by genetic and sometimes both genetic and external challenge combined. This latter issue is of particular concern to practicing pathologists and in the development of MPATH we have been urged to include some diagnostic terms as well as descriptive anatomic ones.
Many tissue responses are common to multiple anatomical sites and as far as possible the verbosity (ontology "bloat") of specifying a particular response in multiple tissues has been avoided, with the additional topographical or anatomical information for description coming from an anatomy ontology, generally the MA [27] or EMAP ontologies [28] for the mouse, however, there is often an intrinsic anatomical element embedded in the term or traditional pathology includes information about the cell type or tissue of origin. This is most frequent with the neoplasias and we felt that such terms were best included in their familiar form. Figure 1B shows how anatomically predicated classes such as hepatocellular carcinoma have multiple parents, providing relations in this case to both carcinoma and hepatic tumor superclasses. Most observations made by pathologists using MPATH are, nevertheless, cross-products using a combination of an MPATH term and an anatomical (MA) or cell type (CL) [24,29] component. This strategy provides all of the necessary coverage.
In addition to the core terms in MPATH, it is important to describe organ-specific topography, distribution, microscopic character, duration/chronicity and severity. These are not included in MPATH but drawn from other ontologies. These qualifiers or modifiers are generally applicable across a wide range of organs and lesions and so need to be coded separately to the core terms to allow post-composition as required. The pattern we have adopted is very close to that recommended by the INHAND proposals and also includes "compound" terms which lie beneath a definitive diagnosis or disease level of description, but bundle defined sets of descriptive terms, for example "nephropathy", "alopecia", "glomerulonephritis" which are in common use and well understood. These qualifiers have been incorporated into PATO, and some examples are given in Table 1. The strategy for composing pathology descriptions using the combination of MPATH, MA and PATO is summarised in Figure 2.

Implementation of MPATH coding strategy
The strategy adopted was originally designed to describe histopathology images for the Pathbase mouse pathology database [30], but lends itself readily to a wide range of coding applications. The MPATH strategy has been adopted by two major high-throughput studies. A combination of MPATH and PATO is being used for the capture of pathology data from the genome-wide mutant mouse phenotyping project, KOMP 2 run as part of the International Mouse Phenotyping Consortium [31], where the MPATH approach is being used in the primary phenotyping pipeline by the Toronto Centre for Phenogenomics and other centres carrying out histopathology. MPATH has also been adopted for the MoDIS database [32] to capture and analyse pathology data from a massive aging study which has systematically phenotyped 31 of the most important inbred mouse strains. Complete necropsies of mice were carried out at 12 and 20 months of age (cross-sectional study) and  Nearly 2,000 mice were necropsied, generating more than 50,000 slides [33]. Lesion incidence and severity data for all organs is now being applied in highly successful GWAS studies of age-associated disease [33]. MPATH has proved to be additionally useful in dealing with the recoding of multi-species legacy data from non-standard nomenclatures, permitting integration of otherwise siloed data. Examples are the European Radiological archive (ERA) database where 6,700 human diagnoses were recoded from ICD-8 and the Klinischer Diagnosenschleussel [34] to MPATH/FMA [35], and with the Northwestern University Janus radiobiology database (http://janus.northwestern.edu/janus2/), who have coded 50,000 individual mouse records to MPATH to link the two datasets. Recently the ontology has been applied to zebrafish phenotype data in the Zfin database [36] indicating a useful application of MPATH to non-mammalian species which could be developed further.

MPATH as a core ontology for PATO-based logical definitions
The PATO framework was built with the intention of providing an integration platform for phenotype data between species and between data types [22]. According to the PATO framework, phenotype data can be described by utilising species-specific ontologies (such as the various anatomy ontologies) or species-agnostic ontologies such as GO with the various qualities provided by the PATO ontology in order to describe affected entities in a phenotype manifestation. PATO can be used for annotation either directly in a so-called post-composed (postcoordinated) manner or for providing logical definitions (equivalence axioms) to ontologies containing a set of precomposed (pre-coordinated) phenotype terms [22,[37][38][39]. For further discussion see [40].
Rather than using a pre-composed phenotype ontology such as MP [29] or HPO [41], phenotypes may be described using the Entity-Quality (EQ) formalism. In the EQ method, a phenotype is characterized by an affected Entity and a Quality (from PATO) that specifies how the entity is affected. The affected entity can either be a biological function or process such as specified in GO, or an anatomical entity. The phylogenetic conservation, at least within the amniotes, of most histopathologic lesions or processes makes MPATH an important core ontology in writing logical definitions and we have used it extensively in defining classes in the major pre-composed phenotype ontologies and MPATH is an important component ontology of our recently developed semantic approaches to comparative phenomics -PhenomeNET and Mousefinder [8,9].
Composition of logical definitions is a time-consuming task for which there are currently several approaches to automation using class label segmentation, entity recognition and lexical matching to core ontologies. This approach can be useful for suggesting definitions where the class label is a composite of for example, anatomy and process (MA + GO). Automated decomposition of unilexical terms such as are found in the neoplasias is much more difficult though approaches with text mining definitions from other ontologies such as NCIt for lexically matching labels may be useful to expert curators in establishing more simple definitions for these classes.

Conclusions
Whilst MPATH was originally designed to support rodent, and particularly mouse, pathology the extensive overlap with human pathology means that most of the terms may be used in a human context and linked to the foundational model of anatomy (FMA) [42] as the anatomy ontology. Extending MPATH to become a mammalian pathology ontology encompassing human pathology is a major undertaking, but we have established that the current structure and upper level classes would readily support the inclusion of human terminology. Initially we will import terms for neoplasias from the CINEAS codes (Central Information System for Hereditary Diseases and Synonyms; http://www.cineas.org/; Prof Rolf Sijmons, pers, comm). SNOMED-CT, UMLS and ICD-O v3 will be mined for terms not currently in MPATH which relate to anatomic pathology. Terms already covered by existing ontologies such as Disease Ontology (DO) [43] may be referenced using MIREOT [44]. DO classifies diseases largely by anatomical site and not by disease process or class, and overlaps only slightly with MPATH as it is concerned with summative diagnostic entities for the main part. For example there is no "inflammation" superclass in DO for the tissue specific inflammatory conditions described.
Use of MPATH to construct logical definitions for DO classes would potentially add a further dimension to the richness and applicability of DO.
The power of the description of pathological lesions to discriminate between diseases and therefore between models of human disease is substantial. We recently estimated the information content (IC) of pre-composed MP ontology terms used to code phenotypes in the EUMODIC mouse phenotyping pipeline [45], which included or excluded anatomic pathology descriptions, using their logical definitions. Pathology-related phenotypes were shown to have a significantly greater discriminatory power than other in vivo assays, strongly supporting the use of these assays in the development of mouse models of human diseases [13].
Further development and application of MPATH will inevitably depend on community engagement and we encourage anyone with an interest to provide feedback.