Improving the Sequence Ontology terminology for genomic variant annotation
© Cunningham et al. 2015
Received: 19 February 2013
Accepted: 22 July 2015
Published: 31 July 2015
The Genome Variation Format (GVF) uses the Sequence Ontology (SO) to enable detailed annotation of sequence variation. The annotation includes SO terms for the type of sequence alteration, the genomic features that are changed and the effect of the alteration. The SO maintains and updates the specification and provides the underlying ontological structure.
A requirements analysis was undertaken to gather terms missing in the SO release at the time, but needed to adequately describe the effects of sequence alteration on a set of variant genomic annotations. We have extended and remodeled the SO to include and define all terms that describe the effect of variation upon reference genomic features in the Ensembl variation databases.
The new terminology was used to annotate the human reference genome with a set of variants from both COSMIC and dbSNP. A GVF file containing 170,853 sequence alterations was generated using the SO terminology to annotate the kinds of alteration, the effect of the alteration and the reference feature changed. There are four kinds of alteration and 24 kinds of effect seen in this dataset. (Ensembl Variation annotates 34 different SO consequence terms: http://www.ensembl.org/info/docs/variation/predicted_data.html).
We explain the updates to the Sequence Ontology to describe the effect of variation on existing reference features. We have provided a set of annotations using this terminology and the well-defined GVF specification. We have also provided a provisional exploration of this large annotation dataset.
The Sequence Ontology (SO) provides terminology to define sequence features. These features are the building blocks of sequence annotation, and allow biologically meaningful regions to be assigned between coordinates of sequences such as genome assemblies and transcripts. The relationships between the terms in SO provide for the annotation of multi-part features such as gene models, composed of multiple transcripts, exons, introns and UTR features. Reference genome annotations are often shared using the flat file format GFF3, developed by the GMOD community, which stipulates that SO terms describe each annotated feature; thus many genome annotation tools use SO to describe reference genome features. While terms to describe variants have long been part of the Sequence Ontology, an increased need to describe the predicted effect of sequence alterations on existing genomic features led to the development of new terms. This has been driven by the proliferation of software tools that predict the effect of sequence alterations, such as Ensembl’s Variant Effect Predictor (VEP) and the VAAST suite tool, Variant Annotation Tool (VAT). In this manuscript, SO terms are italicized and written without underscores.
Next generation sequencing (NGS) technologies have provided an enormous expansion in our understanding of the landscape of genetic variation [5, 6] as well as the impact of that variation on human health [7–9]. These datasets create a significant burden in computational analysis and data storage, but established workflows for analysis are emerging and well-established data formats exist for each stage of the process. The original base calls from the sequencer are converted to FASTQ files that contain the sequence data; the SAM format captures the alignment of the sequence to a reference genome; and the Variant Call Format has become widely adopted by variant calling tools to report variants and the information needed to call them. However, knowing the type and genomic location of a sequence change is just the first step in understanding its clinical or biological consequences. Variant annotation then begins the process of adding knowledge about the structural and functional consequences of those variants through the impact on reference sequence features and ultimately on phenotype.
The Genome Variation Format (GVF) is a variant file format for the detailed annotation of genetic variation. GVF is a community supported format that uses established ontologies such as the Sequence Ontology to describe the variant data. GVF does not replace existing variant nomenclature systems such as HGVS and ISCN that provide effective ways to unambiguously describe individual variants in the literature. GVF provides the infrastructure to support inclusion of these nomenclatures along with other detailed variant annotations in a format capable of supporting genome scale variant data. GVF is used in the community for exchange of variant annotations between Ensembl, DGVa and dbVar and is compatible with existing GFF3 software [2, 18] as well as emerging domain specific tools [4, 19].
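As an illustration of how a GVF record carries SO terms, the following is a minimal Python sketch, not a reference parser. It assumes the nine tab-delimited columns that GVF inherits from GFF3 and a Variant_effect attribute laid out as an SO effect term, an allele index, a feature type and one or more feature identifiers. The example line, its source name and the identifiers var_01 and tx_0001 are invented for illustration, and the function names are hypothetical.

```python
from urllib.parse import unquote

# Column names follow the nine-column GFF3 layout that GVF builds on.
GVF_COLUMNS = ("seqid", "source", "type", "start", "end",
               "score", "strand", "phase", "attributes")


def parse_gvf_line(line):
    """Split one tab-delimited GVF feature line into a dict."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(GVF_COLUMNS):
        raise ValueError(f"expected 9 columns, got {len(fields)}")
    record = dict(zip(GVF_COLUMNS, fields))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    # Attributes are key=value pairs separated by ';'; values may be
    # comma-separated lists.
    attributes = {}
    for pair in record["attributes"].strip(";").split(";"):
        if not pair:
            continue
        key, _, value = pair.partition("=")
        attributes[key] = [unquote(v) for v in value.split(",")]
    record["attributes"] = attributes
    return record


def variant_effects(record):
    """Unpack each Variant_effect entry (assumed layout, see lead-in)."""
    effects = []
    for entry in record["attributes"].get("Variant_effect", []):
        term, index, feature_type, *feature_ids = entry.split()
        effects.append({
            "effect": term,               # SO sequence variant term
            "allele_index": int(index),   # which Variant_seq allele
            "feature_type": feature_type, # SO term for the feature changed
            "feature_ids": feature_ids,
        })
    return effects


# Invented example line; not taken from the dataset described here.
EXAMPLE = "\t".join([
    "chr16", "example_source", "SNV", "49291141", "49291141", ".", "+", ".",
    "ID=var_01;Variant_seq=A,G;Reference_seq=G;"
    "Variant_effect=missense_variant 0 mRNA tx_0001",
])

if __name__ == "__main__":
    rec = parse_gvf_line(EXAMPLE)
    print(rec["type"], variant_effects(rec))
```

In this sketch the kind of alteration comes from the type column, while the effect and the feature changed come from the Variant_effect attribute, mirroring the three kinds of SO annotation that GVF records.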
User requirements and ontology development
In the SO, the sequence alteration and the effects of the alteration are separated. A sequence alteration defines the nucleotide change observed in an individual sequence, in relation to a reference sequence. Examples of alterations are insertion, deletion, substitution and SNV. The effect of a sequence alteration is the observed or predicted change to annotated reference sequence features. These effects of sequence alterations are defined as sequence variants in SO and are outlined in Fig. 1. Examples of these terms are missense variant, where codon bases are modified in such a way that the resulting amino acid changes, and splice donor variant, where the alteration changes the two-base pair region at the 5′ end of an intron.
One of the advantages of using an ontology for the annotation of data is that, given the related nature of the terms, there are options to annotate data to the level of detail afforded by the evidence. Under the sequence variant node, SO provides two high level nodes in the ontology: structural variant and functional variant. Structural variants pertain to changes with regard to annotated sequence features, and are the output of automated variant effect prediction tools such as VEP. Functional variants, however, describe the cellular effect of a sequence alteration and are generally manually curated. These functional terms have largely been absorbed into the Variation Ontology and are not automatically assigned by variant effect prediction tools. With regard to structural variants, the alteration can either internally modify a sequence feature, when the alteration falls within the extent of a reference sequence feature such as an exon (feature variant), or the alteration can be greater than the extent of the sequence feature, causing the ablation or amplification of an entire genomic feature such as a transcript.
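The distinction between an alteration that falls within a feature and one that exceeds it can be sketched as a comparison of coordinates. The Python below is an illustrative assumption, not the logic used by VEP or any other tool: the function name, the 1-based inclusive coordinates, the use of "deletion" and "duplication" as the alteration types that trigger ablation and amplification, and the "unclassified" fallback label (which is not an SO term) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    start: int  # 1-based, inclusive (assumed convention)
    end: int


def classify_by_extent(alteration, feature, alteration_type):
    """Return a coarse structural label for one alteration/feature pair."""
    # No shared bases: the alteration does not touch this feature.
    if alteration.end < feature.start or alteration.start > feature.end:
        return None
    spans = alteration.start <= feature.start and feature.end <= alteration.end
    within = feature.start <= alteration.start and alteration.end <= feature.end
    if spans:
        # The alteration covers the entire feature.
        if alteration_type == "deletion":
            return "feature_ablation"
        if alteration_type == "duplication":
            return "feature_amplification"
    if within:
        # The alteration lies inside the feature's extent.
        return "feature_variant"
    # Partial overlaps and other cases are outside this sketch.
    return "unclassified"


if __name__ == "__main__":
    exon = Interval(100, 200)
    print(classify_by_extent(Interval(150, 150), exon, "SNV"))
    print(classify_by_extent(Interval(50, 300), exon, "deletion"))
```

A real predictor would go on to assign a more specific subtype of feature variant, using the feature type and the sequence context of the change.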
The feature variant node in the ontology subsumes the terms that describe changes internal to genomic features such as those affecting genes, transcripts and introns. The majority of the sequence alterations currently annotated by Ensembl cause feature variants. These feature variant terms are shown in Fig. 1, where the terms used in Ensembl annotations are highlighted in dark grey. There are five subtypes: intergenic variant, gene variant, feature truncation, feature elongation and regulatory region variant. Of these terms, gene variant has 77 direct and indirect subtypes and includes most of the terms that describe structural sequence variants caused by substitutions and small insertions and deletions. This portion of the SO contains terms with multiple parents, to allow for effective querying of the annotations. For example, the term stop retained variant is both a synonymous variant and a terminator codon variant. Users are thus able to query the Ensembl data for all terminator codon variants or all synonymous variants.
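The benefit of multiple parents for querying can be shown with a small Python sketch. The is_a table below is a hand-written fragment for illustration only: the two parents of stop retained variant come from the text above, while placing synonymous variant, terminator codon variant and missense variant directly under gene variant is a simplification that omits intermediate SO terms. The variant identifiers and function names are hypothetical.

```python
# Child term -> list of parent terms (illustrative fragment, not the full SO).
IS_A = {
    "gene_variant": ["feature_variant"],
    "intergenic_variant": ["feature_variant"],
    "synonymous_variant": ["gene_variant"],        # simplified placement
    "terminator_codon_variant": ["gene_variant"],  # simplified placement
    "missense_variant": ["gene_variant"],          # simplified placement
    "stop_retained_variant": ["synonymous_variant",
                              "terminator_codon_variant"],
}


def descendants(term, is_a):
    """Collect every direct and indirect subtype of a term."""
    children = {}
    for child, parents in is_a.items():
        for parent in parents:
            children.setdefault(parent, set()).add(child)
    found, stack = set(), [term]
    while stack:
        for child in children.get(stack.pop(), ()):
            if child not in found:
                found.add(child)
                stack.append(child)
    return found


def query(annotations, term, is_a):
    """Return IDs of variants annotated with a term or any of its subtypes."""
    wanted = descendants(term, is_a) | {term}
    return [variant_id for variant_id, effect in annotations
            if effect in wanted]


# Hypothetical annotations: (variant ID, SO effect term).
ANNOTATIONS = [
    ("v1", "stop_retained_variant"),
    ("v2", "missense_variant"),
    ("v3", "intergenic_variant"),
]

if __name__ == "__main__":
    print(query(ANNOTATIONS, "terminator_codon_variant", IS_A))
    print(query(ANNOTATIONS, "synonymous_variant", IS_A))
```

Because stop retained variant has two parents in the table, the same annotation is returned by both queries without being recorded twice.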
Discussion and conclusions
Detailed annotation of sequence variation is complicated because reference genome annotations are complex. Genes may produce multiple transcripts, may overlap each other on opposite strands, or may even be nested within introns of other genes; a variant may therefore influence multiple genomic features. Capturing the effect of a sequence alteration on the genomic features with which it intersects is an important step towards understanding the implication of the variant sequence. The terminology described here provides a basis with which to categorize and define sequence variation and the flexibility to annotate the effect with respect to the feature intersected. This ontology provides very specific leaf terms with which to automatically annotate genomic sequence, but also useful mid-level terms for querying.
Future developments to the ontology will include developing relationships between the sequence variant terms and the sequence features that are affected. There has been significant uptake of these variant effect terms by the genomic variant annotation community. The UCSC genome browser uses this terminology in variant annotation, as do the NCBI’s ClinVar data dictionary and dbVar database. New terms will be added as required. New terms and updates to the ontology may be requested using the term tracker (https://sourceforge.net/p/song/term-tracker/). Development of the SO is collaborative, incorporating community discussion via our mailing list and the term tracker as well as the results of focused working groups.
This work was supported by the National Human Genome Research Institute [R01HG004341 to KE] and National Library of Medicine training grant [T15 LM007124-18, NRS]. Ensembl receives majority funding from the Wellcome Trust (grant numbers WT095908 and WT098051) with additional funding for specific project components from the National Human Genome Research Institute (U41HG007234, 1R01HD074078, and U41HG007823), the Biotechnology and Biological Sciences Research Council (BB/K009524/1, BB/L024225/1, BB/M018458/1 and BB/M020398/1), the Centre for Therapeutic Target Validation (CTTV) and the European Molecular Biology Laboratory. The research leading to these results has received funding from the European Union's Seventh Framework Programme (FP7/2007-2013) under grant agreement n° 282510 (BLUEPRINT). The research leading to these results has received funding from the European Union's Seventh Framework Capacities Specific Programme under grant agreement n° 284209 (BioMedBridges). This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement n° 634143 (MedBioinformatics).
- Eilbeck K, Lewis SE, Mungall CJ, Yandell M, Stein L, Durbin R, et al. The Sequence Ontology: a tool for the unification of genome annotations. Genome Biol. 2005;6:R44.
- Generic Model Organism Database (GMOD). [http://gmod.org].
- McLaren W, Pritchard B, Rios D, Chen Y, Flicek P, Cunningham F. Deriving the consequences of genomic variants with the Ensembl API and SNP Effect Predictor. Bioinformatics. 2010;26:2069–70.
- Yandell M, Huff C, Hu H, Singleton M, Moore B, Xing J, et al. A probabilistic disease-gene finder for personal genomes. Genome Res. 2011;21:1529–42.
- 1000 Genomes Project Structural Variant group specification. [http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/VCF%20(Variant%20Call%20Format)%20version%204.0/encoding-structural-variants].
- MacArthur DG, Balasubramanian S, Frankish A, Huang N, Morris J, Walter K, et al. A systematic survey of loss-of-function variants in human protein-coding genes. Science. 2012;335:823–8.
- Rope AF, Wang K, Evjenth R, Xing J, Johnston JJ, Swensen JJ, et al. Using VAAST to identify an X-linked disorder resulting in lethality in male infants due to N-terminal acetyltransferase deficiency. Am J Hum Genet. 2011;89:28–43.
- Bamshad MJ, Ng SB, Bigham AW, Tabor HK, Emond MJ, Nickerson DA, et al. Exome sequencing as a tool for Mendelian disease gene discovery. Nat Rev Genet. 2011;12:745–55.
- Ng SB, Buckingham KJ, Lee C, Bigham AW, Tabor HK, Dent KM, et al. Exome sequencing identifies the cause of a mendelian disorder. Nat Genet. 2010;42:30–5.
- Cock PJ, Fields CJ, Goto N, Heuer ML, Rice PM. The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. Nucleic Acids Res. 2010;38:1767–71.
- Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, et al. The sequence alignment/map format and SAMtools. Bioinformatics. 2009;25:2078–9.
- Danecek P, Auton A, Abecasis G, Albers CA, Banks E, DePristo MA, et al. The variant call format and VCFtools. Bioinformatics. 2011;27:2156–8.
- Reese MG, Moore B, Batchelor C, Salas F, Cunningham F, Marth GT, et al. A standard variation file format for human genome sequences. Genome Biol. 2010;11:R88.
- Horaitis O, Cotton RG. The challenge of documenting mutation across the genome: the human genome variation society approach. Hum Mutat. 2004;23:447–52.
- An International System for Human Cytogenetic Nomenclature. Basel: S. Karger AG; 2009.
- Flicek P, Ahmed I, Amode MR, Barrell D, Beal K, Brent S, et al. Ensembl 2013. Nucleic Acids Res. 2013;41:D48–55.
- Lappalainen I, Lopez J, Skipper L, Hefferon T, Spalding JD, Garner J, et al. DbVar and DGVa: public archives for genomic structural variation. Nucleic Acids Res. 2013;41:D936–41.
- Stajich JE, Block D, Boulez K, Brenner SE, Chervitz SA, Dagdigian C, et al. The Bioperl toolkit: Perl modules for the life sciences. Genome Res. 2002;12:1611–8.
- Song T, Hwang KB, Hsing M, Lee K, Bohn J, Kong SW. gSearch: a fast and flexible general search tool for whole-genome sequencing. Bioinformatics. 2012;28:2176–7.
- Chen Y, Cunningham F, Rios D, McLaren WM, Smith J, Pritchard B, et al. Ensembl variation resources. BMC Genomics. 2010;11:293.
- Ensembl predicted data. [http://www.ensembl.org/info/genome/variation/predicted_data.html].
- Vihinen M. Variation ontology for annotation of variation effects and mechanisms. Genome Res. 2014;24:356–64.
- Rios D, McLaren WM, Chen Y, Birney E, Stabenau A, Flicek P, et al. A database and API for variation, dense genotyping and resequencing data. BMC Bioinformatics. 2010;11:238.
- Sherry ST, Ward MH, Kholodov M, Baker J, Phan L, Smigielski EM, et al. dbSNP: the NCBI database of genetic variation. Nucleic Acids Res. 2001;29:308–11.
- Forbes SA, Bhamra G, Bamford S, Dawson E, Kok C, Clements J, et al. The Catalogue of Somatic Mutations in Cancer (COSMIC). Curr Protoc Hum Genet. 2008;Chapter 10:Unit 10.11.
- Meyer LR, Zweig AS, Hinrichs AS, Karolchik D, Kuhn RM, Wong M, et al. The UCSC genome browser database: extensions and updates 2013. Nucleic Acids Res. 2013;41:D64–9.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.