Description of biological processes, corpus building and entity tagging
In the absence of an exhaustive controlled vocabulary in systems biology, we use hereafter the notion of a “biological process”, which comprises the notions of (a) “biological reaction” and “biochemical reaction” as in the KEGG (Kyoto Encyclopedia of Genes and Genomes [22]) Reaction database, (b) “biological phenomenon”, “biological pathway” and “biochemical pathway” as in PW or the KEGG Pathway database, and finally (c) “biological process” as in GO-BP. Moreover, we use the notion of a “chemical entity” to denote any type of biological compound, including metabolites, proteins, protein complexes and polymers, to name a few.
To develop a dedicated systemic representation for each biological process involved in bacterial gene expression, we applied the standard state-of-the-art approach of systems engineering. The approach involves two main tasks. (A) We first gathered up-to-date biological information about the biological process. (B) We then converted this information into a systemic representation using boxes, arrows, inputs and outputs, and a mathematical model. Below, we describe tasks (A) and (B) and apply them to a specific example (the formation of the 30S initiation complex) for illustrative purposes. Note that the approach is generic and can be applied to any biological process.
(A) We collected up-to-date knowledge about the biological processes from the scientific literature (books, peer-reviewed original articles, and reviews; see Additional file 1 for a list of references). We primarily focused on figures since they facilitate the conversion of biological knowledge into the systemic representation in task (B) (as illustrated in Fig. 1). The elementary steps composing a biological process are usually found in research articles, while books and review articles provide global descriptions of processes. In a few cases, we used figures from didactic web sites and systematically checked the biological information against original research papers.
(B) We then converted the selected figure into a systemic representation. Despite the heterogeneity of the sources, these schemas share several common features (as illustrated in Fig. 1): a title (t), arrows (a) and shapes (s) with a legend or label (l).
Tagging entities of interest
Given a graphical representation of any biological process with sub-processes (Fig. 1):
- The title (t) defines the name of the main biological process that embeds the succession of all identified individual processes.
- The arrows are identified as sub-reactions that correspond to the individual processes. Three types of arrow are distinguished in Fig. 1: linear (dotted), bifid at origin or head (a), and divided into more than two parts (a*).
- The shapes (s) are identified as the chemical entities (BioE) that participate in a biological process and are related to legends or labels (l). Depending on their position relative to the arrows (origins or heads), three types of BioE are identified (i, f, c): an unframed BioE at an arrow origin (i) represents an initial reactant of a process (input), called iBioE; an unframed BioE at an arrow head (f) represents a final product of a process (output), called fBioE; a framed BioE (c) represents a product of a process (output) that is the reactant of the next process (input), called cBioE (for consumed BioE).
Note that:
- Arrows that correspond to BioE recycling within a process are not considered (as illustrated by the dotted arrow in Fig. 1).
- Any BioE may be an initial reactant and/or a product of several distinct processes.
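The tagging rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `tag_bioes` is hypothetical, the first reaction is the Free 30S fixation step used later in the text, and the second reaction uses placeholder entities `X` and `Y` rather than the actual content of Fig. 1. Recycling arrows are assumed to have been removed beforehand.

```python
def tag_bioes(reactions):
    """Tag each BioE as iBioE, fBioE or cBioE.

    `reactions` is a list of (inputs, outputs) pairs, one per arrow.
    Recycling arrows are assumed to be already discarded.
    """
    consumed, produced = set(), set()
    for inputs, outputs in reactions:
        consumed.update(inputs)
        produced.update(outputs)

    tags = {}
    for bioe in consumed | produced:
        if bioe in consumed and bioe in produced:
            tags[bioe] = "cBioE"   # output of one step, input of a later one
        elif bioe in consumed:
            tags[bioe] = "iBioE"   # only ever an initial reactant
        else:
            tags[bioe] = "fBioE"   # only ever a final product
    return tags


# First step taken from the text; "X" and "Y" are placeholders (assumption).
reactions = [
    (("IF3", "30S"), ("30S-IF3 complex",)),
    (("30S-IF3 complex", "X"), ("Y",)),
]
tags = tag_bioes(reactions)
```

In this toy example, `IF3`, `30S` and `X` are tagged as iBioE, the `30S-IF3 complex` as cBioE, and `Y` as fBioE.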
Biological processes as interlocked systems
After identifying the entities necessary in the biological process, we organized them as a main system composed of different interlocked sub-systems of lower granularity, as follows.
An elementary process is formally defined by its participants, i.e. its input(s) and output(s). The standard systemic representation of an elementary process corresponds to a box framed by its input(s)/output(s) (see Fig. 2a). In this graphical representation, inputs are placed on the left of the box at the tail of the incoming arrows, while outputs are placed on the right of the box at the head of the outgoing arrows (Fig. 2a). In our biological context, an elementary process corresponds to a biological reaction, and the inputs are the BioEs required for the production of the BioEs that serve as outputs.
Multi-scale representation of processes
In a multi-scale representation, the same process is represented at different levels of granularity (Fig. 2b). At the top level of granularity, there is a unique aggregated process that leads to the output(s) (B1 in a dark gray box). An aggregated process can be formally defined either by its input(s)/output(s), like an elementary process, or by the composition of successive sub-processes. At the bottom level of granularity, there is a succession of elementary processes that leads to the same output(s) as those produced at the other levels (B3 in white boxes). Via decomposition and aggregation of processes, we can navigate between the different levels of granularity (represented by a gray scale in Fig. 2b).
Systemic model of the main process (level B1 in Fig. 2b)
The fully aggregated process (at the lowest granularity level) is the main process, with iBioEs as inputs and fBioEs as outputs. In the graphical representation, it is depicted as a box labeled with the name of the global reaction. The box is framed by BioEs: one iBioE per input of the main process on the left, and one fBioE per output of the main process on the right.
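As a minimal sketch of this aggregation, the inputs and outputs of the main process can be derived from a chain of sub-processes by hiding every BioE that is both produced and consumed inside the chain. The `Process` dataclass, the `aggregate` function and the entity names `A` to `E` are assumptions made for illustration; they are not part of the ontology.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Process:
    """A box framed by its inputs (left) and outputs (right)."""
    name: str
    inputs: frozenset
    outputs: frozenset


def aggregate(name, steps):
    """Merge successive processes into a single box.

    BioEs that are produced by one step and consumed by another are
    internal to the aggregated box and therefore hidden.
    """
    consumed = frozenset().union(*(s.inputs for s in steps))
    produced = frozenset().union(*(s.outputs for s in steps))
    return Process(name, consumed - produced, produced - consumed)


# Hypothetical two-step chain: A + B -> C, then C + D -> E.
steps = [
    Process("step 1", frozenset({"A", "B"}), frozenset({"C"})),
    Process("step 2", frozenset({"C", "D"}), frozenset({"E"})),
]
main = aggregate("main process", steps)
```

Here `main.inputs` contains `A`, `B` and `D`, `main.outputs` contains only `E`, and the intermediate `C` no longer appears at the aggregated level.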
Systemic model of elementary processes (level B3 in Fig. 2b)
An elementary process is a sub-reaction of an aggregated process (arrows in Fig. 1), typically having one or two inputs and one or two outputs. In Fig. 1, such a reaction usually corresponds to a bifid arrow (case a). In the case of arrows divided into more than two parts (case a* in Fig. 1), which thus involve at least three inputs or outputs, the process is further split into a sequence of elementary processes through the addition of new consumed BioEs (ncBioE), using additional literature information when available. Two successive elementary processes share a common participant: an output of the first elementary process is an input of the second one (cBioE). Elementary processes follow each other until all outputs of the main process are produced (B1 level). Note that cBioEs and ncBioEs never appear as participants in the main process (the fully aggregated one).
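The splitting of a multi-part arrow can be illustrated with a short Python sketch. It covers only the case of three or more inputs, and it assumes that the inputs are already listed in their order of assembly; in practice that order comes from the literature, not from the code. The function name `split_arrow` and the generated labels `ncBioE_k` are hypothetical.

```python
def split_arrow(inputs, outputs):
    """Split one arrow with several inputs into a chain of two-input steps.

    Each intermediate step produces a new consumed BioE (ncBioE) that is
    the first input of the next step; only the last step yields `outputs`.
    """
    if len(inputs) < 2:
        raise ValueError("expected at least two inputs")
    steps = []
    current = inputs[0]
    for k, nxt in enumerate(inputs[1:], start=1):
        is_last = k == len(inputs) - 1
        out = tuple(outputs) if is_last else (f"ncBioE_{k}",)
        steps.append(((current, nxt), out))
        current = out[0]
    return steps


# Hypothetical arrow A + B + C -> D (placeholder names).
chain = split_arrow(["A", "B", "C"], ["D"])
```

For this input the sketch yields two elementary steps, `A + B -> ncBioE_1` followed by `ncBioE_1 + C -> D`, so that each step has at most two inputs.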
Systemic model of intermediate processes (level B2 in Fig. 2b)
Intermediate processes provide intermediary levels of granularity between the main process and the elementary processes. In the graphical representation, an intermediate process consists of a box that merges the boxes of elementary processes. Intermediate processes define sub-processes of specific biological interest. They are built by aggregating successive elementary processes according to biological considerations, e.g. the presence of irreversible reactions, the relevance of an intermediate process, the special nature of a BioE, or the ability to experimentally detect or quantify a specific BioE.
Mathematical models of biological systems
In systems biology, the community has investigated and developed numerous mathematical models [23, 24] enabling the description, analysis, and simulation of biological processes. Mathematical models can be very different in nature (static, dynamical, stochastic, etc.) and depend on various parameters and variables. One biological process can be described with several mathematical models. For instance, protein translation can be modeled by deterministic [25] or by stochastic models [26]. Conversely, several biological processes can have the same type of mathematical model, such as the Michaelis-Menten equation for the kinetics of different enzymes. In the bio-ontology BiPON, we formalize the relation between biological processes and their mathematical description(s).
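To make the last point concrete, the sketch below reuses a single Michaelis-Menten rate law, v = Vmax [S] / (Km + [S]), for two different enzymes that differ only in their parameters. The enzyme names and all numerical values are invented for illustration and do not come from the paper.

```python
def michaelis_menten(substrate, vmax, km):
    """Reaction rate v = vmax * [S] / (km + [S])."""
    return vmax * substrate / (km + substrate)


# Hypothetical enzymes: same type of model, different parameter values.
enzymes = {
    "enzyme_A": {"vmax": 10.0, "km": 2.0},
    "enzyme_B": {"vmax": 4.0, "km": 0.5},
}

# Rate of each enzyme at the same substrate concentration.
rates = {name: michaelis_menten(1.0, **params)
         for name, params in enzymes.items()}
```

The relation that BiPON formalizes is the one visible here: one mathematical expression, several processes, and a distinct parameter set for each of them.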
BiPON design
BiPON is a bio-ontology that is composed of two sub-ontologies: bioBiPON and modelBiPON. bioBiPON organizes the systemic description of biological information, while modelBiPON describes the mathematical models associated with biological processes. In the following, a class that has no sub-class for the property is_a is called a leaf-class. BiPON has been designed using the software editor Protégé 5 and the Description Logic Manchester syntax [27].
bioBiPON ontological model
Main classes
bioBiPON contains four main classes, which correspond to the main structure of major bio-ontologies: Biological process (GO:0008150), Chemical entity (CHEBI:24431), Sequence feature (SO:0000110) and Cellular component (GO:0005575).
The classes Biological process and Cellular component include a selection of GO classes, while the Chemical entity class includes a selection of ChEBI classes for small molecules, of SO classes for gene products (e.g. primary transcript), and terms of the KEGG database orthology (KO) for proteins [22]. The Sequence feature class includes a selection of SO classes for sequence patterns. Finally, classes which were not present in existing bio-ontologies were created manually.
The Biological process class contains as subclasses the biological processes and sub-processes (irrespective of their granularity level). The Chemical entity class contains as subclasses the participants (BioE) of a biological process, e.g. molecules, proteins, molecular complexes, polymers, etc. The Sequence feature class contains as subclasses any sequence patterns carried by molecules. Polymers such as DNA and RNA (which belong to the class Chemical entity) act as templates (have a matrix role): they carry different sequence patterns (e.g. promoter sequences, transcription factor binding sites, ribosome binding sites, pausing sites for ribosomes, etc.). Some of these polymers can participate in several processes. For instance, the same mRNA can be an input of the translation process and of the mRNA degradation process. However, the molecular complexes or proteins involved in these distinct biological processes recognize the sequence patterns. For instance, in Figs. 1 and 2, the specific mRNA sequence patterns named “GGAGG” and “AUG” are involved in two successive elementary processes. In effect, the inputs of these processes are the sequence patterns, not the whole mRNA itself. When a process is decomposed as in the previous example, we choose to use the sequence patterns of polymers as process participants instead of the molecules themselves. In addition, sequence patterns and molecular complexes have to interact and thus have to share a common localization on chromosomes or mRNAs. We therefore defined the Cellular component class, which contains as subclasses the parts of cells in which molecules can be localized, and the polymers that carry sequence patterns or bound chemical entities. In the case of bacteria (cells without organelles), the Cellular component class contains the cytosol and polymers such as the chromosome or mRNA.
Class hierarchy and subclass property
Inside the four main classes, subclasses are organized according to the is_a relation to get a Directed Acyclic Graph (DAG) structured model. Unlike in a tree, a class can not only have several subclasses but also be a subclass of several classes (multiple inheritance). The hierarchy of the classes that were imported from GO, ChEBI and SO is kept within the DAG model. Processes, chemical entities and patterns are placed as leaves of the bioBiPON DAG model.
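The difference between a tree and a DAG, and the notion of a leaf-class, can be sketched as follows. The class names and the `IS_A` mapping are placeholders chosen for illustration, not actual bioBiPON classes.

```python
# child -> list of direct parents (is_a); placeholder class names.
IS_A = {
    "root": [],
    "class_A": ["root"],
    "class_B": ["root"],
    "class_C": ["class_A", "class_B"],  # multiple inheritance: two parents
}


def ancestors(cls, is_a):
    """All classes reachable from `cls` by following is_a upwards."""
    seen = set()
    stack = list(is_a.get(cls, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(is_a.get(parent, []))
    return seen


def leaf_classes(is_a):
    """Classes that are the parent of no other class."""
    parents = {p for ps in is_a.values() for p in ps}
    return {c for c in is_a if c not in parents}
```

With this toy hierarchy, `class_C` inherits from both `class_A` and `class_B` and is the only leaf-class.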
Importation and interoperability
For all classes and properties that were imported from other bio-ontologies (e.g. translation initiation; see Fig. 3), we kept the original references, such as the Internationalized Resource Identifier (IRI) and Identifier (id), in bioBiPON to ensure interoperability. In BiPON, the SO classes of gene products are now considered as subclasses of the Chemical entity class instead of the Sequence feature class. Due to this semantic change, we considered these SO classes as new classes: we gave them a new IRI and kept the original one with the hasDbXref annotation (e.g. hasDbXref SO_0000185). When a class refers to a term in an existing database (such as KO), the original id is also kept with the hasDbXref annotation (e.g. prokaryote translation initiation factor IF-3: hasDbXref K02520; see Fig. 3).
Labeling
For any imported class, the original label is still used in bioBiPON. For any newly created class, we manually defined the label that is the most representative of the biological process, molecule or sequence represented by the class. The final label is, in order of preference, (a) a term commonly used in the biological schemes of the peer-reviewed articles that we considered, (b) a Wikipedia term, or (c) a term that we chose by taking into account length, completeness and non-ambiguity criteria.
Main properties
Properties were partly imported from the Relation Ontology (RO) [28] and partly created manually. Two main properties, has_participant (RO_0000057; inverse of participates_in, RO_0000056) and has_part (BFO_0000051), were used to formalize elementary and aggregated processes, respectively. The has_participant property includes the sub-properties has_input (RO_0002233), has_output (RO_0002234), and has_catalyst. In the ontological model, they are represented by arrows between the biological processes and the BioEs (see Fig. 3). These properties are used to formalize relations between elementary processes. The has_part property is transitive and is further specialized into two intransitive sub-properties called cyclication_of and has_subprocess. The has_subprocess property is further specialized into the disjoint sub-properties starts_with, ends_with, has_intermediate_process and has_fork_process, which can be used to formalize aggregated processes. The has_part property enables the decomposition of an aggregated process along the granularity levels down to elementary processes, while the has_subprocess property manages the relation between two successive granularity levels. The sub-properties starts_with, ends_with, has_intermediate_process and has_fork_process manage the succession of processes that are part of a process at the same granularity level. The properties starts_with and ends_with define which sub-process starts and ends the aggregated process, respectively. We further define the property has_fork_process for the case where several sub-processes start an aggregated process. The property has_intermediate_process defines the sub-processes that occur between the starting and the ending sub-processes.
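The interplay between the sub-property hierarchy and the transitivity of has_part can be sketched in Python. The property names are those listed above; the triples are illustrative, and the top-level class "hypothetical main process" is an assumption introduced only to show a second granularity level.

```python
# Sub-property hierarchy as described in the text (child -> parent).
SUBPROPERTY_OF = {
    "starts_with": "has_subprocess",
    "ends_with": "has_subprocess",
    "has_intermediate_process": "has_subprocess",
    "has_fork_process": "has_subprocess",
    "has_subprocess": "has_part",
    "cyclication_of": "has_part",
}


def entails(prop, target):
    """True if asserting `prop` also entails `target` via sub-properties."""
    while prop is not None:
        if prop == target:
            return True
        prop = SUBPROPERTY_OF.get(prop)
    return False


def all_parts(process, triples):
    """Transitive closure of has_part over (subject, property, object)."""
    found, stack = set(), [process]
    while stack:
        current = stack.pop()
        for subj, prop, obj in triples:
            if subj == current and entails(prop, "has_part") and obj not in found:
                found.add(obj)
                stack.append(obj)
    return found


triples = [
    # Hypothetical upper level (assumption, not taken from the paper).
    ("hypothetical main process", "starts_with", "Formation of 30S-mRNA complex"),
    ("Formation of 30S-mRNA complex", "starts_with", "free 30S fixation"),
    ("Formation of 30S-mRNA complex", "has_intermediate_process", "A site hiding"),
    ("Formation of 30S-mRNA complex", "ends_with",
     "mRNA scanning for start codon recognition"),
    ("free 30S fixation", "has_input", "IF3"),  # participant, not a part
]
parts = all_parts("hypothetical main process", triples)
```

Because has_part is transitive, `parts` reaches the elementary processes two levels down, whereas the `has_input` triple is ignored since has_input is not a sub-property of has_part.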
The located_in property is used to define the localization of Chemical entity class inside the cell.
As mentioned above, the Chemical entity and Sequence feature classes are in relation through the is_motif_of, binds_to and has_template properties. The transitive property is_motif_of localizes the sequence patterns in a larger one and finally in a polymer. The binds_to property (a located_in sub-property) defines the sequence where a Chemical entity binds a polymer. The has_template property points out a sequence that affects the recruitment of a specific Chemical entity.
Formal definition of biological processes
We used the Protégé editor, which is based on Description Logics, to formalize the classes. We distinguished two kinds of classes, namely primitive classes, which are described by necessary conditions (e.g. subclass of other classes), and complex classes, which are defined by equivalence using both necessary and sufficient conditions. Thus, the formal definition of classes follows templates that may combine universal (ONLY) and existential (SOME) restrictions [27]. The structure of bioBiPON is displayed in Fig. 4.
The elementary process class is related to the Chemical entity or Sequence feature classes via has_participant sub-properties by the following general class axiom:
elementary_process ≡ has_input SOME chemical entity AND has_output SOME chemical entity AND has_input ONLY (chemical entity OR sequence feature) AND has_output ONLY (chemical entity OR sequence feature).
In the previous definition, Chemical entity is a primitive class (defined as a subclass of bioBiPON), while elementary_process is a class defined by equivalence using two kinds of restrictions. Any subclass of elementary_process must have at least one Chemical entity subclass as an input and as an output. Moreover, the inputs and the outputs of a subclass of elementary_process must be either subclasses of Chemical entity or subclasses of Sequence feature.
For instance, the elementary process subclass Free 30S fixation in Fig. 3 is defined as follows:
Free 30S fixation ≡ has_input SOME IF3 AND has_input SOME 30S AND has_output SOME 30S–IF3 complex AND has_input ONLY (IF3 OR 30S) AND has_output ONLY 30S–IF3 complex.
In this definition of the class Free 30S fixation, we specialized the type of chemical entity that at least one input (output) must satisfy. For instance, one input has to belong to the class IF3, a sub-class of chemical entity.
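The combined effect of the SOME and ONLY restrictions can be illustrated with a closed-world check in Python. This is a deliberately simplified sketch: OWL reasoners work under the open-world assumption and take subclass relations into account, neither of which is modeled here, and the `satisfies` function and the dictionary layout are assumptions made for illustration.

```python
def satisfies(inputs, outputs, definition):
    """Closed-world check of SOME / ONLY restrictions on participants.

    `inputs` and `outputs` list the class labels of the participants of
    one process; `definition` holds the four restriction sets.
    """
    return (
        all(c in inputs for c in definition["some_input"])         # SOME
        and all(c in outputs for c in definition["some_output"])   # SOME
        and all(c in definition["only_input"] for c in inputs)     # ONLY
        and all(c in definition["only_output"] for c in outputs)   # ONLY
    )


# Restrictions transcribed from the definition of Free 30S fixation.
FREE_30S_FIXATION = {
    "some_input": {"IF3", "30S"},
    "some_output": {"30S-IF3 complex"},
    "only_input": {"IF3", "30S"},
    "only_output": {"30S-IF3 complex"},
}

ok = satisfies(["IF3", "30S"], ["30S-IF3 complex"], FREE_30S_FIXATION)
extra = satisfies(["IF3", "30S", "mRNA"], ["30S-IF3 complex"], FREE_30S_FIXATION)
missing = satisfies(["IF3"], ["30S-IF3 complex"], FREE_30S_FIXATION)
```

The first call succeeds; the second fails because the extra `mRNA` input violates the ONLY restriction, and the third fails because the SOME restriction on `30S` is not met.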
The aggregated process class is related to the cellular process class via has_part sub-properties according to the following general class axiom:
aggregated_process ≡ has_subprocess SOME cellular process AND has_subprocess ONLY cellular process.
For instance, the aggregated process subclass Formation of 30S–mRNA complex in Fig. 3 is defined as follows:
Formation of 30S–mRNA complex ≡ starts_with SOME free 30S fixation AND has_intermediate_process SOME A site hiding AND has_intermediate_process SOME mRNA binding, translation preinitiation AND ends_with SOME mRNA scanning for start codon recognition AND has_subprocess ONLY (free 30S fixation OR A site hiding OR mRNA binding, translation preinitiation OR mRNA scanning for start codon recognition).
Fig. 3 illustrates the ontological representation of the formation of 30S–mRNA complex into aggregated and elementary processes using the classes and properties of bioBiPON.
modelBiPON ontological model
The ontological model called modelBiPON aims at relating generic biological processes to their mathematical models, including parameters. Knowledge about mathematical models was gathered from two sources. A first flat source of knowledge was provided by systems biology specialists who established a list of generic, useful and well-established models of biological processes. The second source of knowledge was a selection of ontology classes that were directly imported from SBO, more specifically from the mathematical expression (SBO:0000064) and the system description parameters (SBO:0000545) classes. These classes and their subclasses provide sufficient knowledge about laws and parameters, respectively.
Main classes
We defined four main classes: Modeled process, Reactant, Mathematical expression and System description parameters (Fig. 4).
The Modeled process class corresponds to the process class of SBO (SBO:0000375) and contains, as subclasses, the specific biological processes of bioBiPON for which a mathematical model exists. The Reactant class is an abstract representation of the inputs and outputs of a Modeled process. Reactant is specialized into two disjoint subclasses, Motif Entity and Chemical, corresponding to the subclasses of Sequence feature and Chemical entity of bioBiPON that are inputs/outputs of a Biological process (see Fig. 4). The Chemical class is further specialized into the disjoint subclasses Free Chemical and Bound Chemical. Subclasses of Free Chemical represent the chemical entities that are freely available to interact with any other chemical entities in the cytoplasm. Subclasses of Bound Chemical represent molecular complexes composed of one or several chemical entities that are bound specifically to a sequence pattern. The Modeled process and Reactant classes are abstract representations and are therefore at the top level of the modelBiPON ontology.
Mathematical expression and System description parameters include subclasses from SBO and subclasses that were defined according to the mathematical models. For the classes that were imported from SBO, we carefully kept the IRIs and the Ids to ensure interoperability between ontologies.
Main properties
In modelBiPON, the sub-properties of has_participant in bioBiPON were used to relate the Modeled process and Reactant classes, while the sub-properties of has_part managed the decomposition of Modeled process. Two new types of properties were defined (Fig. 4): has_model and has_parameter. The has_model property links the Modeled process and Mathematical expression classes, while the has_parameter property links the Mathematical expression and System description parameters classes (Fig. 4).
Formal definition of modeled processes
The Modeled process subclasses are defined by the specificity of their participants (belonging to the class Reactant) or by the nature of their discriminating sub-processes. For instance, the most common process belonging to Modeled process is elementary chemical process. By definition, an elementary chemical process has exclusively participants in the Chemical class:
elementary chemical process ≡ elementary process AND has_input ONLY Chemical AND has_output ONLY Chemical.
In the same way, a Sequence binding process corresponds to the elementary process of binding a FreeChemical to a Motif Entity and leads to the formation of a BoundChemical. This process is formalized as follows:
sequence binding process ≡ elementary process AND has_input SOME Motif entity AND has_input SOME FreeChemical AND has_output SOME BoundChemical AND has_input ONLY (Motif entity OR FreeChemical) AND has_output ONLY BoundChemical.
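The two definitions above can be read as classification rules over the kinds of reactants involved. The sketch below applies them in a closed-world fashion; the function name `classify_elementary` and the string labels are hypothetical, and the sketch assumes that every Chemical is either a FreeChemical or a BoundChemical.

```python
CHEMICAL = {"FreeChemical", "BoundChemical"}  # assumed to cover Chemical


def classify_elementary(inputs, outputs):
    """Return the modeled-process classes matched by one elementary process.

    `inputs` and `outputs` list the Reactant kind of each participant:
    'MotifEntity', 'FreeChemical' or 'BoundChemical'.
    """
    labels = []
    # elementary chemical process: all participants are Chemical.
    if inputs and outputs and set(inputs) <= CHEMICAL and set(outputs) <= CHEMICAL:
        labels.append("elementary chemical process")
    # sequence binding process: motif + free chemical -> bound chemical.
    if ("MotifEntity" in inputs
            and "FreeChemical" in inputs
            and "BoundChemical" in outputs
            and set(inputs) <= {"MotifEntity", "FreeChemical"}
            and set(outputs) <= {"BoundChemical"}):
        labels.append("sequence binding process")
    return labels


binding = classify_elementary(["MotifEntity", "FreeChemical"], ["BoundChemical"])
chemical = classify_elementary(["FreeChemical", "FreeChemical"], ["FreeChemical"])
```

A process with a motif among its inputs is never classified as an elementary chemical process, because a Motif Entity is not a Chemical.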
Aggregated processes that are included in Modeled process might be defined by the nature of their discriminating sub-processes such as Matrix dependent process and Polymer production process:
matrix dependent process ≡ aggregated process AND has_part SOME Sequence binding process.
polymer production process ≡ matrix dependent process AND has_part SOME Release process.
In the previous formal definition, Release process is also a subclass of Modeled process.
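These nested definitions amount to a simple classification over the types of the parts of an aggregated process. A minimal closed-world sketch follows, in which the function name `classify_aggregated` and the lowercase string labels are assumptions made for illustration.

```python
def classify_aggregated(part_types):
    """Classify an aggregated process from the types of its parts.

    `part_types` is the set of modeled-process types found among all
    parts of the process (reached through has_part).
    """
    labels = []
    if "sequence binding process" in part_types:
        labels.append("matrix dependent process")
        # A polymer production process is a matrix dependent process
        # that also has a release process among its parts.
        if "release process" in part_types:
            labels.append("polymer production process")
    return labels


with_release = classify_aggregated({"sequence binding process", "release process"})
without_release = classify_aggregated({"sequence binding process"})
```

Since polymer production process is defined as a specialization of matrix dependent process, any process that receives the second label necessarily receives the first one as well.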
Finally, a Modeled process can be refined using the biological property of its participants. For example, the transcription process and translation process are defined in modelBiPON as follows:
Transcription process ≡ native polymer production process AND has_output SOME primary_transcript.
Translation process ≡ native polymer production process AND has_output SOME pre-process polypeptide.
BiPON consistency with GO, ChEBI, SO and SBO
To evaluate the logical consistency of BiPON with respect to GO, ChEBI, SO and SBO, we imported the whole set of classes of each ontology into BiPON. Then, we ensured logical consistency using the HermiT 1.3.8 reasoner within the Protégé editor [29].