SIENA: Semi-automatic semantic enhancement of datasets using concept recognition

Grigoriu, Andreea; Zaveri, Amrapali; Weiss, Gerhard; Dumontier, Michel

doi:10.1186/s13326-021-00239-z

Research
Open access
Published: 24 March 2021

Journal of Biomedical Semantics volume 12, Article number: 5 (2021)

Abstract

Background

The amount of available data that can help answer scientific research questions is growing. However, the number of formats in which data are published is growing as well, creating a serious challenge when multiple datasets need to be integrated to answer a question.

Results

This paper presents a semi-automated framework that provides semantic enhancement of biomedical data, specifically gene datasets. The framework involves a concept recognition task using machine learning, in combination with the BioPortal annotator. Compared with methods that rely solely on the BioPortal annotator for semantic enhancement, the proposed framework achieves the best results.

Conclusions

By combining concept recognition based on machine learning techniques with annotation using a biomedical ontology, the proposed framework can help datasets reach their full potential of providing meaningful information, which can answer scientific research questions.

Background

The amount of data becoming available is rapidly increasing. Various research fields can benefit from the growing volume of information, including the biomedical domain. Unfortunately, answering a research question with already available data usually requires information found in more than one dataset. Moreover, the information needed is not only spread across sources but also stored in different formats, such as comma-separated values (CSV) and extensible markup language (XML). Therefore, data processing is usually needed before the task at hand can be solved. However, data processing has been identified by 80% of data scientists as the most time-consuming part of a project and, at the same time, the least enjoyable one [1].

In response to this demand, many tools for various types of data integration and conversion are being developed. Data2Services [2] is one such tool; it automatically converts various data types to the Resource Description Framework (RDF)^{Footnote 1} format, which can help with data integration. The RDF format provides a structured, standardized and machine-readable data representation.

However, a structured format does not necessarily give meaning to the data. For data to be meaningful and understandable, additional information is required, such as what the columns of the dataset represent (their types) and how they are related (interoperability). To semantically enhance the data, one could annotate the data with existing concepts from public ontologies.
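The gap between structure and meaning can be illustrated with a minimal sketch of a generic table-to-triples conversion. This is not the Data2Services implementation: the `http://example.org/` namespace, the function name `rows_to_triples`, and the sample row are all hypothetical, and the output only loosely follows N-Triples syntax. The point is that the column label ends up as an opaque predicate, so nothing in the output says that "Symbol" denotes a gene symbol.

```python
import csv
import io

# Hypothetical namespace, used only for illustration.
BASE = "http://example.org/"


def rows_to_triples(text, delimiter=","):
    """Turn each cell of a delimited table into one subject-predicate-object line."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    triples = []
    for index, row in enumerate(reader):
        subject = f"<{BASE}row/{index}>"
        for column, value in row.items():
            # The predicate is derived mechanically from the column label.
            predicate = f"<{BASE}column/{column.replace(' ', '_')}>"
            escaped = value.replace("\\", "\\\\").replace('"', '\\"')
            triples.append(f'{subject} {predicate} "{escaped}" .')
    return triples


# Toy row with placeholder values, not taken from any real dataset.
SAMPLE = "Symbol,Name\nGENE1,example gene one\n"
triples = rows_to_triples(SAMPLE)
for triple in triples:
    print(triple)
```

A semantically enhanced version would replace the generated predicate with a term from a public ontology, which is the step the rest of this paper addresses.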

As a use case, consider the following query of interest to a biomedical researcher: Which genes interact with ethanol? The researcher wants to know how ethanol, which could be used as a component of a drug, interacts with human genes. Answering this question already requires two separate datasets, namely Hugo Gene Nomenclature^{Footnote 2} for gene information and the Comparative Toxicogenomics Database^{Footnote 3} for information about ethanol. These datasets are available in two different formats, CSV and tab-separated values (TSV), respectively. The two sources share common data attributes, such as the gene symbol, and common data values, such as the indexed genes. However, the gene symbol attribute is represented by two different labels: “Symbol” and “Gene Symbol”. This is illustrated in Fig. 1.

Without data integration, the question would have to be answered by manually analysing the data and extracting the correct answer, which can be a time-consuming process. Combining the two datasets can provide the answer. Data2Services can make both datasets publicly available in a common format. However, the tool provides only a generic transformation of the data. A manual investigation is therefore still needed to determine that the two columns containing symbols represent the same attribute (see Fig. 1) and thus have the same meaning. This can be solved through semantic enhancement: if the data were also semantically annotated, the two columns would share the same concept.
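The use case can be sketched in a few lines, under stated assumptions. The two tables below are toy stand-ins with placeholder values, not extracts from the real datasets; only the labels "Symbol" and "Gene Symbol" come from the text, while "Name", "Chemical Name", the concept identifier `gene_symbol`, and all function names are hypothetical. The sketch shows that once both columns are annotated with the same concept, the join column no longer has to be identified by hand.

```python
import csv
import io

# Toy stand-ins for the two sources (placeholder values only).
HGNC_LIKE_CSV = "Symbol,Name\nGENE1,example gene one\nGENE2,example gene two\n"
CTD_LIKE_TSV = "Chemical Name\tGene Symbol\nethanol\tGENE2\nother chemical\tGENE1\n"

# Hypothetical result of semantic annotation: (dataset, column label) -> concept.
COLUMN_CONCEPT = {
    ("hgnc", "Symbol"): "gene_symbol",
    ("ctd", "Gene Symbol"): "gene_symbol",
}


def read_table(text, delimiter):
    """Parse a delimited table into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))


def column_for(dataset, concept):
    """Look up which column of a dataset carries the given concept."""
    for (name, label), annotated in COLUMN_CONCEPT.items():
        if name == dataset and annotated == concept:
            return label
    raise KeyError(f"no column annotated with {concept!r} in {dataset!r}")


def genes_interacting_with(chemical):
    """Join the two tables on the shared concept rather than on a hard-coded label."""
    hgnc = read_table(HGNC_LIKE_CSV, ",")
    ctd = read_table(CTD_LIKE_TSV, "\t")
    hgnc_col = column_for("hgnc", "gene_symbol")
    ctd_col = column_for("ctd", "gene_symbol")
    names = {row[hgnc_col]: row["Name"] for row in hgnc}
    return [
        (row[ctd_col], names[row[ctd_col]])
        for row in ctd
        if row["Chemical Name"] == chemical and row[ctd_col] in names
    ]


result = genes_interacting_with("ethanol")
print(result)
```

Here the annotation table is written by hand; producing it (semi-)automatically is the problem this paper sets out to solve.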

Therefore, this project addresses the following research question: Can we (semi-)automate the transformation of biomedical datasets into a semantically meaningful representation? Specifically, we address whether the concept for a column label in a tabular data file can be assigned automatically. In this project, we focus only on gene datasets.

This project makes the following contributions:

a methodology for using a public biomedical ontology repository to identify relevant gene concepts
two separate methods for gene concept recognition through machine learning classification
an implementation of a framework performing semi-automatic semantic enhancement using the explored methods
a report on the quality assessment of the resulting data

Different tools can provide RDF conversion from multiple data types [3–7]. However, they require a considerable amount of human input. Data2Services [2] can automatically convert different data formats (e.g. CSV, XML) to RDF. However, it produces a generic outcome that lacks semantic types for entities and their relations.

Ontology mapping tools help users map ontology terms to their data. However, in most tools, the user needs to provide the ontology that will be used for the mapping [8–10] or choose from the recommended options [11].

In [9], the task of concept recognition in biomedical data is defined as mapping a piece of text to a previously selected terminology (or in some cases an ontology). Two concept recognition tools are compared in [9], using different dictionaries and input data. The data mostly contains free text. The results show that performance varies with the input data and the dictionaries used. Therefore, good performance of those concept recognition tools is linked to the previously selected dictionary and dataset. Other approaches combine machine learning techniques such as classifiers into the mapping process [12]. However, using pre-selected dictionaries and free-text input data restricts the data and concepts that can be explored. In order to preserve the semantic characteristics of words (linguistic meaning), low-dimensional vectors such as word embeddings can be used as word representations, which have proven to be effective in various tasks [13, 14].

This paper introduces a concept recognition task using machine learning, specifically binary classification, for semi-automated semantic enhancement of data. In our experiments, we have focused on gene datasets and thus on the gene concept. However, our method does not depend on pre-selected data or pre-selected dictionaries, as explored in previous papers. In addition, our approach uses word embeddings on a dataset with heterogeneous values, so the input data is no longer limited to free text.

Methodology

We investigated two approaches: (i) annotation with BioPortal and (ii) concept recognition. We developed a framework combining both to tackle the problem of semi-automatic semantic enhancement. Figure 2 presents an overview of the applied methods.

The project focuses on providing semi-automated semantic enhancement to three datasets (Hugo Gene Nomenclature, Comparative Toxicogenomics Database and Pharmacogenomics Knowledgebase), which are automatically converted to RDF using the Data2Services tool.

The first method, described in the “Annotation with BioPortal” section, aims to solve the task of semantic enhancement by using a biomedical ontology repository. The repository can provide both types (classes) and attributes (properties) for the searched term, through separate search options. However, the results might differ for each type of search. For example, the term “Chemical Id” has no matches in a class search, in contrast to 30 matches in a property search. Two separate experiments are conducted in order to establish whether the method should be used to provide properties or classes for the task.
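As a rough sketch of how the two search options might be queried, the snippet below only builds request URLs and performs no network call. It assumes the BioPortal REST service at `https://data.bioontology.org` with a `/search` endpoint for classes and a `/property_search` endpoint for properties, both taking `q` and `apikey` parameters; these details are not stated in this paper and should be checked against the current BioPortal API documentation. The function name and the `YOUR_API_KEY` placeholder are hypothetical.

```python
from urllib.parse import urlencode

# Assumed base URL of the BioPortal REST API (verify against the official docs).
BIOPORTAL_API = "https://data.bioontology.org"

# Assumed endpoints for the two search options described in the text.
ENDPOINTS = {"class": "/search", "property": "/property_search"}


def build_search_url(term, kind, api_key="YOUR_API_KEY"):
    """Build the request URL for a class or property search on a column label."""
    if kind not in ENDPOINTS:
        raise ValueError(f"kind must be one of {sorted(ENDPOINTS)}")
    query = urlencode({"q": term, "apikey": api_key})
    return f"{BIOPORTAL_API}{ENDPOINTS[kind]}?{query}"


class_url = build_search_url("Chemical Id", "class")
property_url = build_search_url("Chemical Id", "property")
print(class_url)
print(property_url)
```

Sending both requests for the same column label and comparing the number of matches would reproduce the kind of class-versus-property comparison described above.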

The second method, described in the “Concept recognition model” section, focuses on automatically recognizing the presence of a class in a dataset. We define concept recognition as a task in which we determine whether the gene concept is present in a dataset using binary classification. We developed two separate approaches. The first approach uses the combination of the column names (titles) present in a dataset and the corresponding values (data) in the columns as input for the concept recognition task. In the second approach, only the column names (titles) are used as input for the same task.
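The difference between the two approaches lies in what is fed to the classifier. The sketch below only prepares that input text and does not reproduce the embedding or classification steps. It assumes one input string per column and a simple space-joined concatenation of name and values; both choices, along with the `max_values` cut-off, the function name, and the toy table, are illustrative assumptions rather than the authors' encoding.

```python
def build_inputs(columns, use_values, max_values=3):
    """Create one text input per column.

    columns: dict mapping a column name to its list of cell values.
    use_values: True for the name-and-value approach, False for names only.
    max_values: assumed cap on how many cell values are appended per column.
    """
    inputs = []
    for name, values in columns.items():
        if use_values:
            parts = [name] + [str(value) for value in values[:max_values]]
            inputs.append(" ".join(parts))
        else:
            inputs.append(name)
    return inputs


# Toy table with placeholder values.
toy_table = {
    "Gene Symbol": ["GENE1", "GENE2"],
    "Chemical Name": ["ethanol", "other chemical"],
}

names_and_values = build_inputs(toy_table, use_values=True)
names_only = build_inputs(toy_table, use_values=False)
print(names_and_values)
print(names_only)
```

Each resulting string would then be turned into a vector and passed to a binary classifier that predicts whether the gene concept is present.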

The following sections describe the method components in detail. The “Datasets” section describes the data used, the “Annotation with BioPortal” section focuses on the use of a biomedical ontology repository, and the “Concept recognition model” section defines the developed concept recognition method.

Datasets

In order to determine the performance of the chosen methods on a smaller scale first, a small corpus sample was chosen. We chose three datasets: (i) Hugo Gene Nomenclature (HGNC), (ii) Comparative Toxicogenomics Database (CTD) and (iii) Pharmacogenomics Knowledgebase (PGKB).

HGNC^{Footnote 4} is a publicly available database which contains all the curated HGNC approved nomenclature, gene groups and associated resources. This project uses the complete HGNC dataset file.
CTD^{Footnote 5} is a publicly available database which contains manually curated information about chemical–gene/protein interactions, chemical–disease and gene–disease relationships. The subset containing chemical–gene/protein interactions was chosen for this project.
PGKB^{Footnote 6} is a pharmacogenomics knowledge dataset that incorporates various curated clinical information such as dosing guidelines and drug labels, potentially clinically actionable gene-drug associations and genotype-phenotype relationships. We used the subset containing gene information used by PGKB.

Further details about the data are presented in Table 1.

Table 1 Detailed information about the datasets used in the methodology
