Representing vaccine misinformation using ontologies

Background In this paper, we discuss the design and development of a formal ontology to describe misinformation about vaccines. Vaccine misinformation is one of the drivers leading to vaccine hesitancy in patients. While there are various levels of vaccine hesitancy to combat and specific interventions to address those levels, it is important to have tools that help researchers understand this problem. With an ontology, not only can we collect and analyze varied misunderstandings about vaccines, but we can also develop tools that can provide informatics solutions. Results We developed the Vaccine Misinformation Ontology (VAXMO) that extends the Misinformation Ontology and links to the nanopublication Resource Description Framework (RDF) model for false assertions of vaccines. Preliminary assessment using semiotic evaluation metrics indicated adequate quality for our ontology. We outlined and demonstrated proposed uses of the ontology to detect and understand anti-vaccine information. Conclusion We surmised that VAXMO and its proposed use cases can support tools and technology that can pave the way for vaccine misinformation detection and analysis. Using an ontology, we can formally structure knowledge for machines and software to better understand the vaccine misinformation domain.


Background
Since their introduction, vaccines have been an important breakthrough that has led to the near-eradication of many infectious diseases. Some of these diseases include polio, typhoid, and smallpox -all which are now uncommon. But in the modern era, certain sectors of society have embraced a post-modernist approach that endorses "that science and 'experts' are open to questioning ... put [ting] greater emphasis on intuition and social relationships and tends to distrust the scientific method as the best paths to healing our ills" [1]. This, compounded with various other factors including misinformation about vaccines, has presented a problem in vaccine uptake into the population. The effects of this are troublesome, considering in one poll 20% of those surveyed believed that there is a link between autism and vaccine [2], in a Gallup poll, 58% are either unsure or actually believe that vaccines cause autism [3], and 11% presume that vaccines are not necessary and 25% In the information age, the unregulated nature of the Web has provided free discourse and information sharing to anyone with a computer and Internet access. To some researchers, the Web is a "Pandora's Box" that has both benefits and costs [17,18], particularly its impact on health-seeking knowledge. In a Pew Research poll from 2013 [19], a majority of those surveyed (73%) sought health-related information with a third of those (35%) diagnosing themselves as opposed to seeing a doctor. In the same study, of the individuals who sought vaccine information (17%), 70% made a decision about vaccination based on the information they found. This may be troubling, as previous studies have highlighted that anti-vaccination websites appear highly ranked in search engine hits [17,20]. Additionally, social media platforms have a significant impact on vaccination attitudes [17,[21][22][23][24]. Overall, the proliferation of vaccine misinformation is accessible to anyone with a mobile device and limited time to perform extensive research.
There are previous studies that have looked at the content of vaccine misinformation and motivation, but none that have investigated informatics tools that can assist and automate the analysis of vaccine misinformation to understand the drivers behind these false notions. The theoretical benefit of such tools can help process massive amount of content (i.e. social media posts), and also discover new knowledge that may not be apparent through manual human analysis. Numerous previous studies can help inform the development of tools and technology to accomplish this objective.
We aimed to use semantic web and ontological technology to represent the domain scope of vaccine misinformation. Also, with ontological representation, we intended to use this artifact to store various misconceptions about vaccines. This would eventually assist in a catalogue misinformation that can be queried and analyzed for future research. While some vaccines are associated with specific misinformation, we focused in this study on the general domain. The Vaccine Misinformation Ontology (VAXMO) is composed of existing ontologies -Misinformation Ontology and nanopublications -and is extended with features pertinent to the anti-vaccine domain. Lastly, we introduced possible use cases that will involve the vaccine misinformation ontology to identify misinformation for text-mining tasks and other applications.

Semantic web and ontologies
The word ontology has its roots in metaphysical philosophy, extending back to Aristotle's Categories, as a "nature of being". In the early 90s, the definition of ontology was applied in the computer science field as a "specification of a conceptualization. " [25]. At the turn of the century, Sir Tim Berners-Lee described his vision for the next generation web called the "semantic web" in Scientific America, where ontologies would be the foundation for this vision [26]. Simply, an ontology is a machine-readable artifact that encodes a logical representation of a domain space using vocabularies, and their semantic meanings. It is the output of a knowledge engineering process where tools and methods are used to build the ontology [27]. Overall, ontologies are used for representing information and knowledge [28][29][30].
In general, knowledge in an ontology is represented as triple which is information presented in subject > predicate > object. Essentially, the subject > predicate > object are concepts that are "smallest, unambiguous unit of thought ... [that are] uniquely identifiable" [31]. Each triple can seamlessly link to another triple to form an ontological knowledge-base. For this knowledge to be readable by a machine, we use a computer-based syntax to encode this knowledge. Once encoded, this artifact can be shared and distributed for various purposes. Moreover, using Web Ontology Language (OWL) or Resource Description Framework (RDF), a specific type of web ontology language syntax for ontologies, we can define more complex axioms and assertions to fully describe concepts which provide machine reasoning capabilities.

Nanopublication primer
Semantic web technologies, specifically ontologies, have had continued impact on research and knowledge sharing, and standardization in the biomedical domain. Some of what has been described were the benefits of formalizing information, information integration, information reuse, and querying and search, etc. We introduce the use of nanopublication, which is an ontology-based micropublishing format for encoding and distributing singular units of assertions. Nanopublications have been used primarily in the life sciences, pharma sciences, as well as genomics and proteomic research data [32]. The benefit of nanopublications include [32]: • Improve finding of scientific information • Connect scientific information from multiple sources • Organize provenance information of the research finding • Verifiable • Small The model or structure of a nanopublication involves a scientific assertion, provenance of the assertion, and provenance information of the nanopublication itself [33]. The scientific assertion component is the singular atomic finding that is represented as subject > predicate > object. An example would be "trastuzumab [subject] is indicated for (treats)[predicate] breast cancer[object]". The other component is the provenance of the assertion, or "the origin or source of something" [34], which will express metadata information, like DOI, authors, research institution, time and date, experimental method, etc. The third part is the provenance information about the nanopublication, which generally indicates who created the nanopublication and when it was created (analogous to citation metadata).
Provided (Listing 1) is a basic example of a nanopublication encoding for the research assertion, "trastuzumab is indicated for (treats) breast cancer. " Specific discussion of the encoding is outside the scope of this proposal, and many references exist to provide further information [33,35]. But briefly, the research assertion is coded in lines 14-16. Lines 18-22 provides provenance of the assertion -the time it was generated, the experiment it was derived from, and who conducted the experiment. Lines 24-27 provide information on the author of the nanopublication and when it was generated. Like all ontology-related artifacts, a unique identifier is associated with the nanopublication in lines 1-2.

VAXMO: Vaccine Misinformation Ontology
We designed and developed the Vaccine Misinformation Ontology (VAXMO) that models concepts pertaining to vaccine misinformation, and a schema that permits archiving of vaccine myths and misinformation. The foundation of VAXMO is built upon the work of Zhou and Zhang, who developed an ontology for general misinformation [36,37]. The goal of their work was to "provide guidance to researchers on misinformation understanding, identification, and detection", and it also considers the Information Theory model to derive concepts, and existing literature of misinformation. In addition to Zhou and Zhang's Misinformation Ontology (MO), we also harnessed the use of the nanopublication format to store vaccine "theories" and their origin information. In the subsequent sections, we will summarize the main concepts for VAXMO model. Figure 1 illustrates the class level description of the VAXMO ontology with extensions for anti-vaccination concepts. As noted earlier, the foundational concepts of the model are derived from Misinformation Ontology. At the time of this research, the OWL-based ontology of MO is not available on the web, so based on their early publications, we reconstructed the ontology in OWL2 with Protégé [38], and incorporated modifications to elaborate on the model. Zhou and Zhang [36,37] provides theoretical detail on the misinformation concepts.
The central concept for VAXMO is Anti-vaccination Information which is a subclass of the Misinformation concept from MO. In addition to the subclasses for Misinformation (Ambivalence, Concealment, Distortion, and Falsification), Anti-vaccination Information concept introduces subclasses of itself -Vaccine inefficacy, Alternative medicine, Civil liberties, Conspiracy theories, Falsehoods, and Ideological. These subclasses for Antivaccination Information are based on classification of misinformation and myths from [17]. For the time being, some of the subclasses have not been extensively defined and may be equivalent or subcategories of the four subclasses for the Misinformation concept. While Falsehood may be the same as Falsification, but Alternative medicine might be equivalent to Distortion or Conspiracy theories to Concealment. From Information Theory, the transmission of information encapsulates a sender and receiver. We represented the concept Participants, which permits defining a number of participants who are part of the misinformation process, and is a parent class of Sender and Receiver class. The Anti-vaccination Information inherits relationships to a Sender and Receiver from the original Misinformation concept. Software Agent and Human Being are subclasses of the Participant. Human Being is defined with an Individual Profile concept class that describes demographic information (Culture and Gender concept). Human Being has definitions that describes how familiar via the Familiarity class that the human participant(s) is with the misinformation.
Additionally, VAXMO associates Anti-vaccination Information concept with the Communication Channel. The Communication Channel represents how, when, and where misinformation is transmitted. This is depicted by concepts like Availability, Synchronicity, Distribution Method, and Modality classes -classes originating from MO. Also, Anti-vaccination Information has a property associated with Controversial Vaccine (a subclass of Subject) that defines what the Anti-vaccination Information class is referring to. In this specific domain, Anti-vaccination Information is about the vaccine topic (Controversial Vaccine concept). The Controversial Vaccine concept is further broken into subclasses pertaining to specific type of vaccines (e.g. HPV Vaccine, MMR Vaccine, etc.).
Both Motivation and Evidence are concepts described in VAXMO and are properties associated with Antivaccination Information. Motivation concerns the reason for transmitting misinformation (Benefiting Someone, Malicious Intent, Protective Action). Evidence is a class for conceptualizing supporting information.
For the purpose of collecting vaccine misinformation in the form of triples (e.g. vaccines > causes > seizures), we look to the nanopublication format. In order to model these triples belonging to a single concept, we extended it using the nanopublication graph model which was originally designed to encode scientific assertions in the form of triples. False Asserted Nanopublication class serves as a listing denoting exactly what the misinformation content is. We subclassed Unsubstantiated Vaccine Theory from False Asserted Nanopublication which is a subclass of nanopublication to inherit its graph model to represent the claims about vaccines. We view these claims as singular decomposed statements in the form of subject > predicate > object. Shown in Fig. 2, the nanopublication instance is associated with Unsubstantiated Vaccine Theory. This provides VAXMO with a means of cataloging samples of vaccine misinformation.
Lastly, to model cues associated with anti-vaccination misinformation, VAXMO modeled a relationship between Anti-vaccination Information with class Anti-Vaccination Evidence (Indication) that represents evidence associated with vaccine misinformation.

Preliminary evaluation metrics
The Vaccine Misinformation Ontology (VAXMO) was encoded with Protégé [38] in OWL format, and it is available for download at http://goo.gl/pT1Enz. Based on metrics from Protégé, there are a total of 116 classes, 26 properties (20 object and 6 data). In its current state, the ontology does not utilize any instances, however, we will utilize the ontology to annotate data from various offline and online misinformation sources into the VAXMO model.
We produced some initial scoring to determine an early evaluation (Table 1) of VAXMO's quality using our inhouse web application, OntoKeeper [39,40]. OntoKeeper is a web-based tool we have developed that calculates metrics rooted in semiotic theorysemantic, pragmatic, and syntactic. These metrics were introduced by Burton-Jones, et al. and have been used in some previous studies  [40]. b Overall score does account for social quality scores reported in [40] to evaluate ontology artifacts [41,42]. The benefit of this metric according to the authors, is that it is domain independent and applicable to measuring the quality of ontologies of any domain, and concise and easy to interpert and to use for evaluators [43]. OntoKeeper automates the calculations of each of the metrics except for the metrics that involve external participants (i.e. subject matter expert review). The user uploads their ontology and the tools parses and extracts the meta-data needed to calculate the scores and presents them in an easy to use interface. Each of these metrics qualitatively measures the lexical quality of the concept labels (semantics), the domain coverage and domain applicability of the ontology (pragmatic), the quality of syntax for machine-readability (syntactic), and the community usage (social). For review of the semiotic evaluation scoring for ontologies see [40,43] for a primer. As a benchmark, we used the National Center for Biomedical Ontology (NCBO) Bioportal sample evaluation scores from our previous work [40]. The syntactic score, which measures syntax-level assessment of the ontology (i.e. machine readability) based on any breach of syntax (lawfulness metric) and utilization of ontology features (richness metric) was 0.69, with lawfulness and richness at 0.95 and 0.44, respectively. The semantic score, a score that measures the term label quality of the ontology was rated at 0.94. The semantic score is comprised of a consistency score that quantifies inconsistent labeling of concepts and instances was 1, clarity that quantifies ambiguity of the term labels was 0.95, and interpretability that measures the ontology's term labels' meaning was 0.91.
For the comprehensiveness score (a component of pragmatic score to assess the utility of the ontology), we utilized the seed number of 1,277,993, which is the average number of classes, instances, and properties from a sample of NCBO Ontologies in a previous study [40].
Ideally, we would like to have identified appropriate ontologies that are comparable to VAXMO, but for initial scoring we settled on the aforementioned seed number from the previous study. Comprehensiveness score from the NCBO seed number provided a very low number value of less than 0.00. The overall quality score based on equal weighting of syntactic (0.69), semantic (0.94), and pragmatic (comprehensiveness at less than 0.00) was 0.54. A summary of the scores are presented in Table 1.
We calculated the z-score using the data from the NCBO Bioportal scores to attain an initial evaluation. When comparing the syntactic score, z-score yielded 0.36 indicating above-average syntactic score for VAXMO. The z-score for semantic was 0.40 also indicating above-average semantic score for VAXMO, and the z-score for pragmatic was −0.29 revealing below-average rating for VAXMO. Also, we calculated the z-score for the final overall quality using the average NCBO overall score (0.51) that does not account for the social metric. The z-score for the overall score of VAXMO was 0.43, which is above average in its overall quality compared to the NCBO sample.
We examined the z-score to assess the quality of VAXMO. The syntactic score of VAXMO appear to be of higher quality with the NCBO BioPortal sample (z=0.36). We interpreted this to mean that the encoding of the ontology with respect to utilization of formal logic (richness) and minimal syntactic violations (lawfulness) is better than other ontologies. The semantic score for VAXMO was also better than the sample NCBO BioPortal ontologies (z=0.40) with respect to minimal inconsistencies with the term labels (consistency), and with respect to meaningful term labels, i.e. at least one word sense (interpretability). However, clarity was slightly weaker than average (z=-0.08), where there may have been term labels that had ambiguous meaning, i.e. above average word senses. The sample from NCBO had the benefit of larger ontologies and therefore were more comprehensive in its domain coverage than VAXMO (z=-0.29) in regards to comprehensiveness.
Overall, with the exception of pragmatic (comprehensiveness), the Vaccine Misinformation Ontology (VAXMO) is, in its current state, a relatively respectable quality ontology based on its comparison of syntactic, semantic, and overall quality scores with a sample of NCBO Bioportal ontologies. The low pragmatic score indicates the need for greater expansion of the ontology, and we acknowledge that VAXMO still needs some refinement and expansion. In addition, we also plan on attaining a pragmatic score's accuracy score [43] that would involve public health experts to provide a review of VAXMO's veracity which would also produce a more complete pragmatic score. Zhou and Zhang have stated that their Misinformation Ontology [37], which is the foundation for VAXMO, could be used for machine-learning tasks to enable machines to detect vaccine misinformation. The features for training would be the classes from the ontology that annotates text, and based on these features potential models can be generated to automatically assess if certain documents or text harbor anti-vaccination opinions. Another future direction is to utilize this ontology to annotate a collection of false statements from the public, specifically in an application-based system where a web-based portal would allow community participants to log statements about vaccines into the system. These false statements would be annotated as nanopublication-types assertionsa benefit of integrating nanopublication -and later be annotated by other concepts of VAXMO to extrapolate features of the false statement. Aside from machinelearning opportunities and application-based usage we may also explore more semantic-based approaches involving natural language processing techniques with ontologies. In the next section we further discuss two use-cases involving machine learning and a method to identify vaccine misinformation in textual content.

Theoretical use-cases
In this section, we envision two possible use cases where VAXMO would assist in the detection of vaccine misinformation. One of those use-cases is similar to what has been described in [37], using the ontology to annotate unstructured data. By annotating the data, such as textual information, we can produce a dataset that can be trained by a machine learner. That machine learner would be enabled to reveal statements that contain misinformation. While discussion of machine learning is out the scope of the paper, we introduced a sample of how data can be annotated for machine learning purposes. Figures 3 and 4 illustrates an example for the aforementioned use-case. Using the classes from VAXMO, one could potentially link the various concepts to unstructured data such as a free text. Figure 3 shows a quote by then-candidate Donald Trump in 2015 stating his position on vaccines. In that example, we demonstrated how some of the various classes (Subject, Modality, Anti-vaccine Evidence, etc.) could be used to annotate the quote. By annotating the data, we can produce a dataset with rows representing whether each class was linked to a piece of data. Figure 4 shows a slice of what the row of data may represent. In the figure, there is a column indicating whether the annotated data is misinformation, followed by each class and subclasses of VAXMO with data designating the features of the annotated data. Determining what to populate into each feature may depend on the type of learner to be used. While VAXMO might have some possibilities for machine learners, there may be some additional refinement for the ontology needed. One aspect is the ambiguity or fuzziness for a few of the classes. For example, classes like Availability with subclass categories of High Availability or Low Availability may require either some individual estimation, methods to explicitly quantify classes, or adding more categories for further refinement of the concepts. Aside from the ontology itself, the unstructured data may have missing or implied contextual information. While the type of vaccine is not clearly specified in Fig. 3, we may assume the speaker is referring to the MMR vaccine -which in the past has been mistakenly associated with autism. Also, the quote itself does not hint who was spoken to, unless one refers to external references to help provide a link with the Receiver class and the number of individuals listening (i.e. for the hasSize data property). This is also true of finding out the motive for communicating misinformation to link VAXMO's Motivation concepts. Overall, either finding external references to confirm some of the annotation, or with caution, making an assumption to associate the VAXMO classes with the data may be undesirably necessary for this use-case, but it lends some future work to consider.

Semantic-driven approach for misinformation detection
Another use-case involves leveraging the triples linked to the ontology through the nanopublication segment of VAXMO. Described earlier, the nanopubulication model for VAXMO was designed to link triples and their metadata to the overall VAXMO model. VAXMO utilizes nanopublication to link to triples that assert vaccine misinformation which reflect misconceptions permeating some sectors of the general public (e.g. vaccine causes autism, vaccines are utilized to sterilize minority communities, etc.). For this use-case we applied the use of semi-supervised natural language processing tools to augment the vaccine misinformation triples. For demonstration purposes, we used the description data for a Youtube video discussing some false information about vaccines [44] and the following triples to automatically analyze the video description info: These above-mentioned triples would be encoded in the assertion line (i.e. line 15 of Listing 1) where each triple would be in their own nanopublication representation.
The sample description text from the Youtube video is: Using this definition (Definition 3), we used the misinformation triples, from VAXMO, to preform matches to identify misinformation of the target statement. Figure 5 outlines the method to analyze textual information for misinformation. The entire test of our proofof-concept method was developed in Java using off-theshelf natural language processing and semantic web programming libraries. To summarize our process, we initially started with the sample text, and imported the text using an open-sourced open information extraction tool (ClausIE [45]). The exported results were a set of triples from each sentence of the text. The list of triples are provided below.
Next, with each triple extracted from the text, we compared the tuples of the triple with the tuples of the misinformation triples from VAXMO using basic exact string matching. If there was an exact match we recorded the match, and if not, we proceeded with the next phase of using graph-based and word-embedding similarity matchings.
Before the next phase, to ensure better accuracy in similarity matching, we lemmatized each term using Mor-phaStemmer from KnowItAll [46]. After all of the triples were lemmatized, we utilized Semantic Measures Library [47] and ConceptNet Numberbatch term vectors [48] with Semantic Vectors [49] to interface with the vectors -to compare the similarity of tuples. Noted in our definition, the subject, predicate, and object tuples between the two triples were compared. Any resulting similarity score of the tuples equaling 1 was deemed a match, and After all triples from the text were analyzed by the code, we assessed the results from the method (See Tables 2, 3 , 4, 5, 6, 7, 8, 9, 10 and 11). The first column of scores in each of tables were produced from the Semantic Measures Library (SML) Java library and the second column of scores were produced from Semantic Vectors-Numberbatch (SV-NB). The triples from Tables 2, 3, and 4 appeared to be misinformation, however, none of the three VAXMO triples were similar to the misinformation triples from the text. All of similarity scores were below 0.18 and, therefore, had very low similarity between the tuples.
Tables 5, 6 and 7 showed some identification of misinformation through our test method. vaccine > cause > convulsion revealed to be similar to the VAXMO triple  of vaccines > causes > seizures (Table 5). Both the subject and predicate tuples were highly similar with a score of 1.00, and object similarity comparing convulsion and seizures were above 0.68 (SML) and 0.56 (SV-NB). With results in Table 6, we assumed that vaccine > cause > brain damage would be approximatively similar to vaccines > causes > autism, but unfortunately this did not succeed. Both their subject and predicate tuples were highly matched, but the similarity analysis revealed that brain damage and autism were not similar, with scores of 0.20 (SML) and 0.16 (SV-NB). Same as Table 5, Table 7's data revealed some success in identifying misinformation vaccine > cause > death in child were similar to vaccines > results > in death. The subject tuples were a match, and the predicate and object comparison had high similarity scores. The SV-NB score for the predicate comparison was 0.44 but the SML score was at 0.50. Object similarity was 0.56 (SML) and 0.51 (SV-NB).  For the remaining data, none of the triples from the text appear to have vaccine misinformation, or were relevant by our observation. Tables 8 through 11 are provided for examination purposes.
The approach described in this subsection is a proofof-concept method, yet there are some limitations to this method. One such limitation is that we need to be aware and encode vaccine misinformation beforehand into VAXMO. In the sample test, there was a possible false statement mentioning that doctors admit vaccine causes harmful effects. If we wanted to denote that it is misinformation we would need a triple in VAXMO that expressed that notion. Another limitation was determining a threshold. In one example we noted that similar tuples had at least 0.50 similarity score. However, we assumed that future examples, when we further test this method, may yield similarity scores below 0.50. Generally, we would need to identify a minimal threshold that would maximize

Discussion and conclusion
The Vaccine Misinformation Ontology (VAXMO)'s purpose is to catalogue and analyze vaccine misinformation that has been one of the drivers for low rates of vaccination rates worldwide. Ontologies benefit from reusing other ontologies. We have utilized an existing model of misinformation (Misinformation Ontology) to address antivaccination information. In addition, we have utilized an  innovative approach using nanopublication (which is generally used for scientific assertions) for linking common false assertions or theories about vaccines (i.e. "vaccines causes autism", "government created weaponized Ebola vaccines", etc. ). Yet, this poses some difficulty -lack of Protégé support and manually editing the ontology artifact. This may inspire us to investigate the possibility of developing a Protégé plugin that provides an interface to view and edit the nanopublication segment of VAXMO.
With some modifications, we constructed the ontology based off of the Misinformation Ontology and extended some of its concepts from an existing survey literature. While MO is specifically designed to model false intention and not misfacts, as stated by the original authors,  we further extended the ontology to utilize nanopublication graph structure to store and represent false assertions about vaccines. The current representation of VAXMO is encoded in OWL with only the class-level fleshed out and with some conceptual gaps. Noted earlier, there have been various studies that focused on content analysis of misinformation and myths of vaccines in the public health domain. Some of the literature can help furnish additional concepts to further expand VAXMO, which could help model and understand the features within anti-vaccination information domain.
While VAXMO is of better quality than NCBO Bioportal ontologies, there is still some more work needed to expand its conceptual domain space for anti-vaccine information. Also, we have described a future use-case that aims to detect misinformation about vaccines, and we plan on reporting on our findings in a future study.
We assume that the impact of this work could lead to applicable uses of semantic web ontologies for public health informatics and future informatics tools that can assist researchers to understand and address health misinformation in the post-modern era.