
Table 1 List of features used in the CRF method.

From: Discovering opinion leaders for medical topics using news articles

| Feature name | Type | Description |
| --- | --- | --- |
| Dictionary | Semantic | Person names; Organization names; Location names |
| Distributional | Semantic | Distributional thesaurus |
| Section | Pragmatic | Name of the section in which the sentence appears |
| Part of speech | Syntactic | Part of speech of the token in the sentence |
| Others | Lexical | Lower case token, Lemma, Prefixes, Suffixes, n-grams, Matching patterns such as beginning with a capital, etc. |

  1. Dictionary features: all three dictionaries contain single-token entries, obtained after removing stop words. Each dictionary yields one feature that indicates whether the token is present in that dictionary. Distributional features: each word is represented in a 2000-dimensional vector space using the Semantic Vectors package [27], trained on the text retrieved from the links obtained for the case study. The vector representation is used to find, for each word, the 20 most similar words in the text. Each token thus has 20 distributional semantic features, one per thesaurus entry. Section features: section names are detected automatically using simple rules (e.g. a sentence ending with a semi-colon). Other features: about a hundred further features cover the part-of-speech tags in Penn Treebank format, the matching patterns, prefixes, n-grams, etc.
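
The feature groups described in the note above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy dictionaries, the function names (`cosine`, `build_thesaurus`, `token_features`), the feature keys and the three-character prefix/suffix length are all assumptions made for illustration, and plain cosine similarity over small hand-written vectors stands in for the Semantic Vectors package. Lemma and n-gram features are omitted to keep the example dependency-free.

```python
import math

# Hypothetical toy dictionaries of single-token entries (assumed, not from the paper).
PERSON_NAMES = {"smith", "alice"}
ORG_NAMES = {"who", "pfizer"}
LOCATION_NAMES = {"london", "geneva"}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_thesaurus(vectors, k=20):
    """Map each word to its k most similar other words (a small distributional thesaurus)."""
    thesaurus = {}
    for word, vec in vectors.items():
        scored = [
            (cosine(vec, other_vec), other)
            for other, other_vec in vectors.items()
            if other != word
        ]
        # Highest similarity first; ties are broken alphabetically.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        thesaurus[word] = [other for _, other in scored[:k]]
    return thesaurus


def token_features(tokens, pos_tags, i, section, thesaurus):
    """Build a feature dictionary for the token at position i."""
    token = tokens[i]
    lower = token.lower()
    feats = {
        # Dictionary features: one membership flag per dictionary.
        "in_person_dict": lower in PERSON_NAMES,
        "in_org_dict": lower in ORG_NAMES,
        "in_location_dict": lower in LOCATION_NAMES,
        # Section feature: name of the section containing the sentence.
        "section": section,
        # Part-of-speech feature (Penn Treebank tag supplied by the caller).
        "pos": pos_tags[i],
        # A few of the lexical features.
        "lower": lower,
        "prefix3": lower[:3],
        "suffix3": lower[-3:],
        "is_capitalized": token[:1].isupper(),
    }
    # Distributional features: one feature per thesaurus entry (up to k).
    for rank, neighbour in enumerate(thesaurus.get(lower, [])):
        feats[f"dist_{rank}"] = neighbour
    return feats


# Toy 2-dimensional vectors; the paper uses a 2000-dimensional space.
vectors = {"doctor": [1.0, 0.0], "physician": [0.9, 0.1], "london": [0.0, 1.0]}
thesaurus = build_thesaurus(vectors, k=2)
tokens = ["Dr", "Smith", "visited", "London"]
pos_tags = ["NNP", "NNP", "VBD", "NNP"]
feats = token_features(tokens, pos_tags, 3, "Introduction", thesaurus)
```

Feature dictionaries of this shape, one per token, are a common input format for CRF toolkits; setting `k=20` and using full-size vectors would match the numbers given in the note.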