
Table 1 Feature sets: features used by the NN and CRF (see the “Features” section for details)

From: Entity recognition in the biomedical domain using a hybrid approach

 

| | Neural network | Conditional random fields |
|---|---|---|
| Implementation | | |
| Software | R [67], nnet library | CRFSuite [68] |
| Model parameters | 1 hidden layer of size 2 × (number of input features), softmax output layer | Training algorithm: averaged perceptron, default epsilon, 2-word window |
| Input | n-grams selected by OGER | Single tokens |
| Features | | |
| Candidate character count | Count | |
| Candidate is all uppercase | Label yes/no | Label yes/no |
| Candidate is all lowercase | Label yes/no | Label yes/no |
| Candidate contains Greek (e.g. “alpha”, α) | Label yes/no | Label yes/no |
| Candidate contains dashes (‘-’) | Count | Label yes/no |
| Candidate contains numbers | Count | Label yes/no |
| Candidate ends with a number | Label yes/no | Label yes/no |
| Candidate contains capital letter not in first position | Label yes/no | Label yes/no |
| Candidate contains lowercase characters | Count | Label yes/no |
| Candidate contains uppercase characters | Count | Label yes/no |
| Candidate contains spaces | Count | Label yes/no |
| Candidate contains symbols | Count | Label yes/no |
| 2-3 character affixes appearing in an ontology in [36] | Normalized frequency | Label yes/no |
| Candidate is symbol | | Label yes/no |
| Candidate’s part-of-speech | | Yes, using [69] |
| Candidate’s stem | | Yes, using [70] |
| Candidate pre-selected by OGER | | Yes (see the “Features” section) |
| Total features | 36 | About 2.8 million |
| Tagging speed (on an Intel 4720HQ CPU) | 1286 tokens/sec | 632 tokens/sec |
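
The surface features listed in the table can be illustrated with a minimal Python sketch. This is not the authors' R implementation: the function names, the short `GREEK_NAMES` list, and the definition of a "symbol" (any character that is neither alphanumeric nor whitespace) are assumptions made for illustration only. The affix, part-of-speech, stem, and OGER features are left out because they depend on external resources. The sketch follows the neural network column, using counts where the table says "Count" and booleans where it says "Label yes/no".

```python
import unicodedata

# Hypothetical, deliberately short list of spelled-out Greek letter names.
GREEK_NAMES = ("alpha", "beta", "gamma", "delta", "kappa")


def contains_greek(candidate: str) -> bool:
    """True if the candidate holds a spelled-out Greek name or a Greek character."""
    lowered = candidate.lower()
    if any(name in lowered for name in GREEK_NAMES):
        return True
    # unicodedata.name returns e.g. "GREEK SMALL LETTER ALPHA" for "α".
    return any("GREEK" in unicodedata.name(ch, "") for ch in candidate)


def nn_style_features(candidate: str) -> dict:
    """Counts and yes/no labels in the style of the neural network column."""
    return {
        "char_count": len(candidate),
        # str.isupper/islower ignore uncased characters such as digits and dashes.
        "is_all_upper": candidate.isupper(),
        "is_all_lower": candidate.islower(),
        "contains_greek": contains_greek(candidate),
        "dash_count": candidate.count("-"),
        "digit_count": sum(ch.isdigit() for ch in candidate),
        "ends_with_digit": candidate[-1:].isdigit(),
        "inner_capital": any(ch.isupper() for ch in candidate[1:]),
        "lower_count": sum(ch.islower() for ch in candidate),
        "upper_count": sum(ch.isupper() for ch in candidate),
        "space_count": candidate.count(" "),
        # Assumption: a symbol is anything that is not alphanumeric or whitespace,
        # so dashes are counted here as well.
        "symbol_count": sum(
            not ch.isalnum() and not ch.isspace() for ch in candidate
        ),
    }


def hidden_layer_size(n_input_features: int) -> int:
    """Hidden layer size as given in the table: 2 x (number of input features)."""
    return 2 * n_input_features


features = nn_style_features("IL-2 alpha")
```

For the candidate `"IL-2 alpha"`, the sketch reports 10 characters, one dash, one digit, and a capital letter outside the first position. If the 36 total features in the table are taken as the number of network inputs, `hidden_layer_size(36)` gives 72 hidden units. For the CRF column, the count-valued entries would be reduced to yes/no labels, for example `features["dash_count"] > 0`.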