Table 1 List of features used in the CRF method.

From: Discovering opinion leaders for medical topics using news articles

| Feature name | Type | Description |
| --- | --- | --- |
| Dictionary | Semantic | Person names; organization names; location names |
| Distributional | Semantic | Distributional thesaurus |
| Section | Pragmatic | Name of the section in which the sentence appears |
| Part of speech | Syntactic | Part of speech of the token in the sentence |
| Others | Lexical | Lower-case token, lemma, prefixes, suffixes, n-grams, matching patterns such as beginning with a capital, etc. |
  1. Dictionary features: all three dictionaries contain only single-token words, obtained after removing stop words. Each dictionary corresponds to one feature, which indicates whether a token is present in that dictionary.
     Distributional features: using the Semantic Vectors package [27], trained on the text retrieved from the links obtained for the case study, each word is represented in a 2000-dimensional vector space. The vector representation is used to find, for each word, the 20 most similar words in the text. Each token thus has 20 distributional semantic features, which represent its entries in the thesaurus.
     Section features: section names are detected automatically using simple rules (e.g. a sentence ending with a semi-colon).
     Other features: there are about a hundred more features, covering the different part-of-speech tags in Penn Treebank format, the different matching patterns used, prefixes, n-grams, etc.
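
To make the feature groups in Table 1 concrete, the following is a minimal Python sketch of how per-token features of this kind might be assembled before being passed to a CRF. It is an illustration under stated assumptions, not the authors' implementation: the function names (`top_k_similar`, `token_features`), the feature keys, the 3-character prefix and suffix length, the use of cosine similarity, and the toy two-dimensional vectors are all hypothetical. The paper itself uses the Semantic Vectors package with 2000-dimensional vectors and 20 neighbours per word.

```python
import math


def top_k_similar(word, vectors, k=20):
    """Return the k words most similar to `word` by cosine similarity.

    `vectors` maps each word to a list of floats. Cosine similarity is an
    assumption made for this sketch; the source only says "most similar".
    """
    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / norm if norm else 0.0

    target = vectors[word]
    scored = [(cosine(target, vec), other)
              for other, vec in vectors.items() if other != word]
    # Highest similarity first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [other for _, other in scored[:k]]


def token_features(tokens, i, pos_tags, section, dictionaries, thesaurus):
    """Build a feature dict for tokens[i], one entry per feature.

    dictionaries: name -> set of lower-cased single-token entries.
    thesaurus:    lower-cased word -> list of its most similar words.
    """
    tok = tokens[i]
    low = tok.lower()
    feats = {
        "lower": low,                        # lexical: lower-case token
        "prefix3": tok[:3],                  # lexical: prefix (length is an assumption)
        "suffix3": tok[-3:],                 # lexical: suffix (length is an assumption)
        "is_capitalized": tok[:1].isupper(), # lexical: begins with a capital
        "pos": pos_tags[i],                  # syntactic: part-of-speech tag
        "section": section,                  # pragmatic: section name
    }
    # Semantic: one membership feature per dictionary.
    for name, entries in dictionaries.items():
        feats["in_dict_" + name] = low in entries
    # Semantic: one feature per thesaurus neighbour (up to 20 in the paper).
    for rank, neighbour in enumerate(thesaurus.get(low, [])):
        feats["dist_%d" % rank] = neighbour
    # Lexical: a simple bigram with the previous token.
    if i > 0:
        feats["bigram_prev"] = tokens[i - 1].lower() + "_" + low
    return feats


# Toy data, invented for illustration only.
vectors = {
    "doctor": [1.0, 0.0],
    "physician": [0.9, 0.1],
    "hospital": [0.5, 0.5],
    "river": [0.0, 1.0],
}
thesaurus = {w: top_k_similar(w, vectors, k=2) for w in vectors}
dictionaries = {"person": {"smith"}, "organization": set(), "location": set()}
tokens = ["Doctor", "Smith", "said"]
pos_tags = ["NNP", "NNP", "VBD"]

features = [token_features(tokens, i, pos_tags, "Introduction", dictionaries, thesaurus)
            for i in range(len(tokens))]
```

Each token ends up with a flat dictionary of named features, which is the input shape that common CRF toolkits accept. A token missing from the thesaurus (here, "Smith") simply receives no distributional features.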