Discovering opinion leaders for medical topics using news articles

Table 1 List of features used in the CRF method.

Feature name	Type	Description
Dictionary	Semantic	Person names; Organization names; Location names
Distributional	Semantic	Distributional thesaurus
Section	Pragmatic	Name of the section in which the sentence appears
Part of speech	Syntactic	Part of speech of the token in the sentence
Others	Lexical	Lower case token, Lemma, Prefixes, Suffixes, n-grams, Matching patterns such as beginning with a capital, etc.

Dictionary features: all three dictionaries contain single-token entries, obtained after removing stop words. Each dictionary yields one feature indicating whether a token is present in it. Distributional features: using the Semantic Vectors package [27], trained on the text retrieved from the links obtained for the case study, each word is represented in a 2000-dimensional vector space. This representation is used to find, for each word, the 20 most similar words in the text. Each token thus has 20 distributional semantic features, one per thesaurus entry. Section features: section names are detected automatically using simple rules (e.g. a sentence ending with a semi-colon). Other features: about a hundred additional features cover the part-of-speech tags (Penn Treebank format), the matching patterns used, prefixes, n-grams, etc.
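As an illustration of how the feature groups in Table 1 could be assembled for one token, the following is a minimal sketch rather than the implementation used in the study. The function names (`token_features`, `is_section_heading`), the feature keys, the three-character prefix and suffix length, and the plain-dictionary form of the thesaurus are all assumptions made for the example; part-of-speech tags and n-grams are omitted because they would require an external tagger.

```python
from typing import Dict, List, Set


def is_section_heading(sentence: str) -> bool:
    """Hypothetical rule: treat a sentence ending with a semi-colon as a section name."""
    return sentence.rstrip().endswith(";")


def token_features(
    tokens: List[str],
    i: int,
    dictionaries: Dict[str, Set[str]],
    thesaurus: Dict[str, List[str]],
    section: str,
    top_k: int = 20,
) -> Dict[str, object]:
    """Build a feature dictionary for tokens[i] (assumed layout, for illustration)."""
    tok = tokens[i]
    low = tok.lower()

    # Lexical features: lowercase form, prefix, suffix, capitalisation pattern.
    feats: Dict[str, object] = {
        "lower": low,
        "prefix3": low[:3],
        "suffix3": low[-3:],
        "is_capitalized": tok[:1].isupper(),
        # Pragmatic feature: name of the section containing the sentence.
        "section": section,
    }

    # Dictionary features: one membership flag per dictionary.
    for name, entries in dictionaries.items():
        feats[f"in_{name}_dict"] = low in entries

    # Distributional features: up to top_k most similar words from the thesaurus.
    for rank, neighbour in enumerate(thesaurus.get(low, [])[:top_k]):
        feats[f"dt_{rank}"] = neighbour

    return feats
```

Feature dictionaries of this kind, one per token, are the usual input to CRF toolkits; the thesaurus itself would be precomputed from the word vectors before feature extraction.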

ISSN: 2041-1480