Adverse events induced by drug-drug interactions are a major concern in the United States. The U.S. Food and Drug Administration (FDA) reported around 297,010 serious outcomes and around 44,693 deaths due to adverse drug events (ADEs) in the first quarter of 2017. Clinical research, including ADE discovery, is increasingly moving towards the use of electronic health records (EHRs) [3,4,5,6]. EHR-based research, in general, relies on the process of electronic phenotyping to advance knowledge of a disease or an adverse event [7, 8]. An accurate phenotype definition is critical to identifying patients with a certain phenotype from the EHRs [7,8,9,10]. A “phenotype” can refer to observable patient characteristics inferred from clinical data [7, 11,12,13] or to drug-related adverse events or reactions. Several methods can be used for EHR electronic phenotyping with either structured or unstructured data [11, 15, 16], including natural language processing (NLP), rule-based systems, statistical analysis, data mining, machine learning, and hybrid systems [13, 15]. However, it can be challenging to develop a new phenotype definition for each phenotype of interest. Such phenotype definitions are present in the literature; however, to our knowledge, no previous work has annotated phenotype definitions from full-text publications at the sentence level for text mining applications.
Different institutions view a phenotype definition, or a phenotyping case definition, differently. For example, Strategic Health IT Advanced Research Projects (SHARP), which is a collaborative effort between academic and industry partners to advance the secondary use of clinical data, views a phenotype definition as the “inclusion and exclusion criteria for clinical trials, the numerator and denominator criteria for clinical quality metrics, epidemiologic criteria for outcomes research or observational studies, and trigger criteria for clinical decision support rules, among others”. On the other hand, the Electronic Medical Records & Genomics (eMERGE) phenotype definitions extend to include practices such as the “algorithmic recognition of any cohort within EHR for a defined purpose. These purposes were inspired by the algorithmic identification of research phenotypes”. Further practices that eMERGE used in developing phenotype definitions include other data modalities, such as diagnosis fields, laboratory values, medication use, and NLP. Here, we summarize examples of how a phenotype definition has been described:
▪ Inclusion and exclusion criteria are applied to the EHR’s structured data and unstructured clinical text to identify a cohort of patients from the EHR.
▪ EHR-based research is concerned with cohort selection, which is the identification of cases and controls for a phenotype of interest. A phenotype definition is developed by combining EHR data, such as billing codes, medications, narrative notes, and laboratory data [19,20,21,22].
▪ The process of deriving a cohort for a phenotype of interest using either low-throughput or high-throughput approaches.
▪ The identification of the cohort utilizing risk factors and clinical or medical characteristics and complications [24, 25].
Developing a new phenotype definition can be done either by creating new case definitions or by reusing case-definition information that is already available in existing data sources. Traditional expert-driven phenotyping relies on expert knowledge; however, these definitions might change over time. In addition, this task is challenging due to the complexity of EHRs and the heterogeneity of patient records. Depending on the phenotype of interest as well as the study purpose, standard queries for defining a phenotype can consist of any of the following: logical operators, standardized codes, data fields, and value sets (concepts derived from vocabularies or data standards). Furthermore, developing a definition is a labor-intensive process that requires a multidisciplinary team of experts, including biostatisticians, clinical researchers, informaticians, and NLP experts. One example of an expert-driven definition is a study that identified patients with chronic rhinosinusitis (CRS) for a better understanding of the “prevalence, pathophysiology, morbidity, and management” using EHR data. The authors developed a phenotype algorithm to define CRS cases using the International Classification of Diseases, Ninth Revision (ICD-9) diagnosis codes and the Current Procedural Terminology (CPT) codes. The process took several iterations until they achieved a positive predictive value of 91%. Further, they stated that the manual review of sinus computed tomography (CT) results and notes, which was completed by two reviewers in 40 hours, was not scalable to larger numbers of patients or notes. Moreover, their CRS definition has only been tested at one site, and its performance at other centers is not known. Such limitations add to the difficulty of creating new definitions.
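To make the idea of a query built from logical operators, standardized codes, and value sets concrete, the following is a minimal sketch and not the algorithm of the CRS study. The value sets contain placeholder identifiers rather than real ICD-9 or CPT codes, and the names `Patient`, `is_case`, and `positive_predictive_value` are hypothetical. The sketch also shows how a positive predictive value can be computed against a manually reviewed subset.

```python
from dataclasses import dataclass, field


@dataclass
class Patient:
    patient_id: str
    diagnosis_codes: set = field(default_factory=set)
    procedure_codes: set = field(default_factory=set)


# Placeholder value sets (hypothetical identifiers, not real codes).
DIAGNOSIS_VALUE_SET = {"DX-EXAMPLE-1", "DX-EXAMPLE-2"}
PROCEDURE_VALUE_SET = {"PROC-EXAMPLE-1"}


def is_case(patient):
    """Assumed rule: at least one diagnosis code AND one procedure code."""
    has_dx = bool(patient.diagnosis_codes & DIAGNOSIS_VALUE_SET)
    has_proc = bool(patient.procedure_codes & PROCEDURE_VALUE_SET)
    return has_dx and has_proc


def positive_predictive_value(flagged_ids, confirmed_ids):
    """Share of algorithm-flagged patients confirmed by manual review."""
    flagged = set(flagged_ids)
    if not flagged:
        raise ValueError("no patients were flagged")
    return len(flagged & set(confirmed_ids)) / len(flagged)


patients = [
    Patient("p1", {"DX-EXAMPLE-1"}, {"PROC-EXAMPLE-1"}),
    Patient("p2", {"DX-EXAMPLE-2"}, set()),
    Patient("p3", {"DX-EXAMPLE-1", "DX-OTHER"}, {"PROC-EXAMPLE-1"}),
]
flagged = [p.patient_id for p in patients if is_case(p)]
# Suppose manual review confirmed only p1 as a true case.
ppv = positive_predictive_value(flagged, confirmed_ids={"p1"})
```

In practice, each iteration of such a definition would adjust the value sets or the logic and then re-estimate the positive predictive value on a reviewed sample, which is where the manual effort described above accumulates.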
Lessons learned from the eMERGE Network showed that the process of developing, creating, and validating a phenotype definition for a single disease is time-consuming and can take around 6 to 8 months. Consequently, the eMERGE Network developed the Phenotype KnowledgeBase (PheKB), a collaborative environment that allows groups of researchers who are invited by a primary author to collaborate and comment on phenotype definitions. PheKB uses an expert-driven approach in which new phenotype definitions are generated from multi-institutional input and are publicly available for use. PheKB provides a library of definitions for several phenotypes, including drug response phenotypes (such as adverse effects or efficacy), diseases or syndromes, and other traits. Following the PheKB modalities or methods, a phenotype definition includes the following attributes: biomedical and procedure information, standard codes, medications, laboratories, and NLP. NLP has been used in many phenotypes in PheKB, such as angiotensin-converting enzyme inhibitor (ACE-I) induced cough, for which a list of terms is provided to identify cases. Data and study design can still be important to capture, but these are not the primary modalities or attributes of a phenotype definition.
Another method relies on deriving phenotype definitions from existing data sources, such as EHRs and the biomedical literature. This has been done either manually using systematic reviews [30,31,32,33,34,35,36] or automatically using computational approaches. Systematic reviews play a major role in medical knowledge; however, given the massive amount of information, automated approaches are still needed to extract medical knowledge. For example, over 20,000 clinical trial articles are published per year, while around 3,000 systematic reviews are indexed in MEDLINE yearly. Overall, conducting systematic reviews can be time-consuming and labor-intensive. On the other hand, automated approaches for mining phenotypes in the literature are mostly focused on extracting phenotype terminologies [38,39,40]. This approach can miss important phenotype definition information that is contained within text sources. Additionally, some of these studies [40, 41] have addressed only one phenotype at a time, which might not be generalizable, especially when working on a large-scale set of phenotypes. Furthermore, these studies utilized abstracts rather than full-text articles [40, 41]. Unlike full-text articles, which are richer in information, abstracts are not sufficient for the granularity of phenotype definition information. Botsis and Ball developed a corpus and a classifier to automate the extraction of “anaphylaxis” definitions from the literature. However, they relied only on abstracts rather than full-text articles and addressed only one condition, “anaphylaxis”. Even though they focused on some features of phenotype definitions, e.g. signs and symptoms, they did not consider other features, such as standardized codes and laboratory measures. Therefore, this effort did not address our information needs, which reflect the modalities of phenotype definitions such as those used in PheKB.
Applications of electronic phenotyping and phenotype definitions
Electronic phenotyping is the process of identifying patients with an outcome of interest, such as patients with ADEs. There are two major types of research in the biomedical domain: primary research, which directly collects data, and secondary research, which relies on published information or sources of data. EHR phenotyping is mostly, though not exclusively, needed in primary research, which includes observational studies, also called epidemiological studies. For example, the design of observational studies can include cross-sectional, retrospective, and prospective cohorts, where phenotype definitions can be used. Other examples of studies that use phenotype definitions are pharmacovigilance, predictive modeling, clinical effectiveness research, and risk factor studies. More examples are given by Banda et al. For a phenotype of interest, different study designs require different cohort designs as well as different phenotype definitions, so one phenotype can be defined in different ways depending on the study's needs. For instance, a phenotype definition can be as “simple as patients with type 2 diabetes or far more nuanced, such as patients with stage II prostate cancer and urinary urgency without evidence of urinary tract infection”.
New research areas, such as pharmacovigilance, are moving towards the use of electronic health information, machine learning, and NLP. Methods used for electronic phenotyping include NLP, machine learning, rule-based approaches, and collaborative frameworks. EHRs provide complementary data with some flexibility in extended-period tracking, large sample size, and data heterogeneity. The availability of a cohort can create several opportunities for data mining and modeling, such as building risk models, detecting ADEs, measuring the effectiveness of an intervention, and building evidence-based guidelines. Cohort identification can be accomplished by using phenotype definitions, which classify patients with a specific disease based on EHR data and can be developed manually by experts or by machine learning. Phenotype definitions share some major features, such as logic, temporality, and the use of standard codes. Furthermore, examples of data categories that are commonly used in phenotype definitions across institutions are “age, sex, race/ethnicity, height, weight, blood pressure, inpatient/outpatient diagnosis codes, laboratory tests, medications”. On the other hand, the cohort identification process poses challenges that vary depending on the study type. The phenotyping process is more sophisticated than a simple code search. Several factors can contribute to its complexity, including the research methods used and the presence of confounding factors. For example, when defining acute or less-defined phenotypes, one critical step is addressing confounding factors by matching on gender and age. These confounders are relatively easy to address, but others, such as co-diseases, might be more difficult. Castro et al. were not able to identify methods for matching controls in EHR data.
Case–control studies may have inherent limitations in detecting comorbidities, such as insufficient controls, identification of the correct confounders, and case–control matching processes. Castro et al. stated that their goal was to compare matching algorithms to identify clinically meaningful comorbidity associations. Literature-based comorbidity associations, derived by clinical experts from the literature, are considered a reference standard against which to compare the performance of the matched controls. However, there were disagreements among gastroenterologist experts who compared the associations between inflammatory bowel disease and other diseases found in phenome-wide association study (PheWAS) disease groupings with the associations found in the literature.
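The matching on gender and age mentioned above can be illustrated with a minimal sketch. This is one simple greedy 1:1 scheme under assumed inputs, not one of the matching algorithms compared by Castro et al.; the function name `match_controls`, the record fields, and the two-year age tolerance are all hypothetical.

```python
def match_controls(cases, controls, max_age_gap=2):
    """Greedy 1:1 matching: same sex, age within max_age_gap years.

    Returns a dict mapping case id -> matched control id. Cases without
    an eligible control are left out, which mirrors the "insufficient
    controls" limitation of case-control designs.
    """
    available = list(controls)
    pairs = {}
    for case in cases:
        candidates = [
            c for c in available
            if c["sex"] == case["sex"]
            and abs(c["age"] - case["age"]) <= max_age_gap
        ]
        if not candidates:
            continue  # no eligible control remains for this case
        best = min(candidates, key=lambda c: abs(c["age"] - case["age"]))
        pairs[case["id"]] = best["id"]
        available.remove(best)  # each control is used at most once
    return pairs


cases = [
    {"id": "case1", "sex": "F", "age": 54},
    {"id": "case2", "sex": "M", "age": 61},
]
controls = [
    {"id": "ctrl1", "sex": "F", "age": 70},
    {"id": "ctrl2", "sex": "F", "age": 55},
    {"id": "ctrl3", "sex": "M", "age": 40},
]
matched = match_controls(cases, controls)
```

Here `case2` stays unmatched because no male control falls within the age tolerance. Confounders such as co-diseases have no comparably simple matching key, which is part of why they are harder to address.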
Medical corpora for text mining
Many text mining applications require a corpus, a collection of text annotated by experts, because these applications rely mostly on supervised machine learning methods. This is due to the challenges of recognizing terms, as in the example provided by Rodriguez-Esteban: “the text ‘early progressive multifocal leukoencephalopathy’ could refer to any, or all, of these disease terms: ‘early progressive multifocal leukoencephalopathy’, ‘progressive multifocal leukoencephalopathy’, ‘multifocal leukoencephalopathy’, and ‘leukoencephalopathy’”. Such annotations based on expert knowledge can be used to train machines on, for example, recognizing biomedical terms in a text. Building an annotated corpus requires experienced annotators, comprehensive guidelines, and large-scale, high-quality text collections. The manually annotated corpus can serve as a gold standard for building automated systems, e.g. statistical, machine learning, or rule-based. Examples of annotated biological corpora are GENIA for annotating biological terms, BioCreative for annotating biological entities in literature, e.g. genes and proteins, and BioNLP, which is a collection of corpora, such as the Colorado Richly Annotated Full-Text Corpus (CRAFT) and Protein Residue Corpora, for annotating biological entities. Another usage of an annotated corpus is to create a literature-based knowledgebase, such as MetaCore and BRENDA for enzyme functional data. However, these are mostly restricted to specific domains such as the biological domain, which annotates information such as gene names, protein names, and cellular location or events (e.g. protein–protein interaction). The availability of corpora in the medical domain is even more limited than in the biological domain. One of the major reasons is that the medical domain is confronted with data availability and ethical issues of using electronic medical records, including privacy and confidentiality and Health Insurance Portability and Accountability Act (HIPAA) regulations. Examples of biomedical corpora are the Text Corpus for Disease Names and Adverse Effects for annotating disease and adverse effect entities, the CLinical E-Science Framework (CLEF) for annotating medical entities and relations (e.g. drugs, indications, findings) in free texts of 20,000 cancer patient records, and the Adverse Drug Effects (ADE) corpus for annotating ADE entities. None of the available corpora serves our need for this task, which is to annotate contextual cues of defining a phenotype in observational studies, such as the presence of codes, laboratory tests, and type of data used, at the sentence level from full texts.
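The term-recognition ambiguity quoted earlier in this section can be reproduced with a minimal sketch of dictionary lookup. It assumes a small hand-written vocabulary and plain substring search; the function name `find_nested_terms` is hypothetical, and no word-boundary handling or real terminology resource is involved.

```python
def find_nested_terms(text, vocabulary):
    """Return (start, end, term) for every vocabulary term found in text.

    Uses case-insensitive substring search, so nested and overlapping
    terms are all reported; deciding which one is intended is exactly
    the judgment that expert annotation supplies.
    """
    lowered = text.lower()
    hits = []
    for term in vocabulary:
        needle = term.lower()
        start = lowered.find(needle)
        while start != -1:
            hits.append((start, start + len(needle), term))
            start = lowered.find(needle, start + 1)
    # Sort by start offset, longest match first for ties.
    return sorted(hits, key=lambda h: (h[0], -(h[1] - h[0])))


text = "early progressive multifocal leukoencephalopathy"
vocabulary = [
    "early progressive multifocal leukoencephalopathy",
    "progressive multifocal leukoencephalopathy",
    "multifocal leukoencephalopathy",
    "leukoencephalopathy",
]
hits = find_nested_terms(text, vocabulary)
matched_terms = [term for _, _, term in hits]
```

All four candidate terms match the same short span, so a purely dictionary-based system cannot choose among them without additional context.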
An example of developing a corpus for phenotypes is PhenoCHF [55, 56], a corpus annotated by domain experts for phenotypic information relevant to congestive heart failure (CHF) from the literature and EHRs. The PhenoCHF corpus data was derived from the i2b2 (Informatics for Integrating Biology and the Bedside) discharge summaries dataset and five full-text articles retrieved from PubMed that covered the characteristics of CHF and renal failure. However, PhenoCHF focused on only one condition, CHF, and was built on a small corpus of only five full-text articles. Furthermore, it did not annotate contextual cues for phenotyping case definitions. With the aim of minimizing human involvement, we found a lack of phenotyping tools that address or automate the extraction of existing definitions from the scientific literature.
No existing corpus addresses the automatic identification of phenotype definitions at the sentence level. In this study, our aim is to annotate a corpus that captures sentences with phenotypes as well as the contextual cues and patterns of a phenotype definition that are presented in the literature. We believe that EHR-based studies will provide relevant information for defining phenotypes. An annotation guideline was developed and serves as a foundation for annotating phenotype definition information in the literature. Both the corpus and the guidelines were designed based on an extensive textual analysis of sentences to reflect phenotype definition information and cues. Ten dimensions are proposed to annotate the corpus at the sentence level. After the presence or absence of each of the 10 dimensions was identified, the level of evidence for each sentence was generated automatically using a rule-based approach to ensure the consistency and accuracy of annotations. All sentences were extracted from the methodology sections of full-text research articles. To the best of our knowledge, no annotated corpus is publicly available for annotating sentences with contextual cues of phenotype definitions from biomedical full-text articles for text mining purposes.
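As a rough illustration of how a level of evidence might be derived from per-sentence presence or absence flags, consider the sketch below. It is purely hypothetical: the dimension names, the thresholds, and the labels are assumptions made for the example and are not the ten dimensions or the rules used in this study.

```python
# Hypothetical subset of dimension names; the study's actual ten
# dimensions are not reproduced here.
DIMENSIONS = (
    "standard_codes",
    "laboratory_tests",
    "medications",
    "nlp",
    "data_source",
)


def evidence_level(flags):
    """Map presence/absence flags to a label using assumed thresholds."""
    unknown = set(flags) - set(DIMENSIONS)
    if unknown:
        raise ValueError(f"unknown dimensions: {sorted(unknown)}")
    present = sum(1 for name in DIMENSIONS if flags.get(name, False))
    if present >= 2:
        return "strong"
    if present == 1:
        return "weak"
    return "none"


sentence_flags = {"standard_codes": True, "laboratory_tests": True}
level = evidence_level(sentence_flags)
```

Because the label is computed from the flags rather than assigned by hand, two annotators who agree on the dimensions will always obtain the same level of evidence, which is the consistency benefit a rule-based step is meant to provide.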