C-value (a) [18] | \( \left\{\begin{array}{ll}\log_2\left|a\right|\cdot f(a), & a\ \text{is not nested}\\ \log_2\left|a\right|\left(f(a)-\frac{1}{P\left(T_a\right)}\displaystyle\sum_{b\in T_a}f(b)\right), & \text{otherwise}\end{array}\right. \) |
where: | |
\( a \) is the candidate string | |
\( f(\cdot) \) is the frequency of occurrence of a string in the corpus | |
\( T_a \) is the set of extracted candidate terms that contain a | |
\( P(T_a) \) is the number of these candidate terms | |
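The C-value definition above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in [18]: the function names `contains` and `c_value` are hypothetical, candidate terms are assumed to be tuples of tokens, |a| is assumed to be the number of tokens in a, and "nested" is assumed to mean that a occurs as a contiguous token sequence inside a longer candidate.

```python
import math


def contains(longer, shorter):
    """True if `shorter` occurs as a contiguous token run inside `longer`."""
    n, m = len(longer), len(shorter)
    return any(longer[i:i + m] == shorter for i in range(n - m + 1))


def c_value(freq):
    """Score every candidate term.

    freq maps a candidate term (tuple of tokens) to its corpus frequency f(a).
    Assumption: |a| is the token count, so single-token terms score 0.
    """
    scores = {}
    for a, f_a in freq.items():
        # T_a: the other candidate terms that contain a
        t_a = [b for b in freq if b != a and contains(b, a)]
        length_weight = math.log2(len(a))
        if not t_a:
            # a is not nested
            scores[a] = length_weight * f_a
        else:
            # subtract the mean frequency of the containing terms
            mean_nested = sum(freq[b] for b in t_a) / len(t_a)
            scores[a] = length_weight * (f_a - mean_nested)
    return scores
```

For instance, with the invented frequencies `{("soft", "contact", "lens"): 5, ("contact", "lens"): 8}`, the nested term `("contact", "lens")` scores log2(2) · (8 − 5) = 3.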
Termhood (a) \( \log\left(\frac{P\left(\text{vote}=\text{yes}\right)}{P\left(\text{vote}=\text{no}\right)}\right) \) [53] | = −0.7836 +
0.7541 * FirstPOS_ADJECTIVE − | |
1.3722 * FirstPOS_ADVERB + | |
0.3541 * FirstPOS_NOUN + | |
1.4182 * FirstPOS_VERB − | |
0.7722 * LastPOS_ADJECTIVE + | |
2.2576 * LastPOS_ADVERB + | |
0.0285 * LastPOS_NOUN + | |
0.6038 * LastPOS_VERB + | |
1.2899 * NP_VALUE + | |
1.0475 * REPEAT_SUP_GREATER_MEDIAN + | |
0.8417 * REPEAT_SUB_GREATER_MEDIAN + | |
0.8422 * DISTINCT_PERHOST_GREATER_THAN_MEDIAN | |
where: | |
POS is the part-of-speech tag | |
REPEAT_SUP is the number of supra-terms (candidate terms that contain a), i.e. \( P(T_a) \) | |
REPEAT_SUB is the number of sub-terms (candidate terms that are contained within a), i.e. \( P(A_t) \) | |
NP_VALUE indicates whether a is a noun phrase | |
DISTINCT_PER_HOST is equivalent to document frequency | |
MEDIAN is calculated for the whole document set | |
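Because the termhood score is a log-odds value, it can be converted into a probability with the logistic function. The sketch below only illustrates that arithmetic: it assumes every predictor is a 0/1 indicator (1 when the property holds for a), which the table does not state explicitly, and the names `termhood_log_odds` and `termhood_probability` are hypothetical. Each sign is read as belonging to the term that follows it in the equation.

```python
import math

INTERCEPT = -0.7836

# Coefficients copied from the termhood equation above.
COEFFICIENTS = {
    "FirstPOS_ADJECTIVE": 0.7541,
    "FirstPOS_ADVERB": -1.3722,
    "FirstPOS_NOUN": 0.3541,
    "FirstPOS_VERB": 1.4182,
    "LastPOS_ADJECTIVE": -0.7722,
    "LastPOS_ADVERB": 2.2576,
    "LastPOS_NOUN": 0.0285,
    "LastPOS_VERB": 0.6038,
    "NP_VALUE": 1.2899,
    "REPEAT_SUP_GREATER_MEDIAN": 1.0475,
    "REPEAT_SUB_GREATER_MEDIAN": 0.8417,
    "DISTINCT_PERHOST_GREATER_THAN_MEDIAN": 0.8422,
}


def termhood_log_odds(features):
    """Return log(P(vote=yes) / P(vote=no)).

    features maps predictor names to 0 or 1 (assumed indicator coding);
    predictors that are missing from the mapping count as 0.
    """
    return INTERCEPT + sum(
        weight * features.get(name, 0) for name, weight in COEFFICIENTS.items()
    )


def termhood_probability(features):
    """Convert the log-odds into P(vote=yes) with the logistic function."""
    return 1.0 / (1.0 + math.exp(-termhood_log_odds(features)))
```

Under these assumptions, a candidate that starts and ends with a noun and is a noun phrase, with no other indicator set, has log-odds −0.7836 + 0.3541 + 0.0285 + 1.2899 = 0.8889.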
TF-IDF: \( w_{i,j}=TF_{i,j}\times IDF_i \) [43] | \( TF_{i,j}=\frac{f_{i,j}}{\max_z f_{z,j}} \) |
where: | |
\( TF_{i,j} \) is the term frequency of keyword \( k_i \) in document \( d_j \) | |
\( f_{i,j} \) is the number of times \( k_i \) appears in \( d_j \) | |
\( \max_z f_{z,j} \) is the maximum frequency across all keywords \( k_z \) in \( d_j \) | |
\( IDF_i=\log\frac{N}{n_i} \) | |
where: | |
\( IDF_i \) is the inverse document frequency of keyword \( k_i \) | |
N is the total number of documents in the corpus | |
\( n_i \) is the number of documents in which \( k_i \) appears | |
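The TF-IDF weighting above, with term frequency normalised by the most frequent keyword in the document, can be sketched as follows. The function name `tf_idf` is hypothetical, documents are assumed to be pre-tokenised lists of keywords, and the natural logarithm is assumed because the table does not give a base.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {keyword: w_ij} dict per document.

    documents is a list of token lists. TF is f_ij divided by the highest
    keyword count in the same document; IDF is log(N / n_i), natural log assumed.
    """
    n_docs = len(documents)

    # n_i: number of documents in which keyword k_i appears
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))

    weights = []
    for tokens in documents:
        counts = Counter(tokens)
        max_count = max(counts.values(), default=1)
        weights.append({
            keyword: (count / max_count) * math.log(n_docs / doc_freq[keyword])
            for keyword, count in counts.items()
        })
    return weights
```

A keyword that occurs in every document receives weight 0, because log(N / N) = 0.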
Cosine similarity [43] \( cosine\left(\overrightarrow{w_c},\overrightarrow{w_s}\right)=\frac{\overrightarrow{w_c}\cdot \overrightarrow{w_s}}{\left\Vert \overrightarrow{w_c}\right\Vert \left\Vert \overrightarrow{w_s}\right\Vert } \) | \( =\frac{\sum_{i=1}^{K}w_{i,c}\,w_{i,s}}{\sqrt{\sum_{i=1}^{K}w_{i,c}^2}\sqrt{\sum_{i=1}^{K}w_{i,s}^2}} \) |
where: | |
\( w_{i,j} \) is the TF-IDF weight defined above
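The cosine formula can be illustrated on sparse weight vectors stored as dictionaries. This is a sketch under assumptions: the name `cosine_similarity` is hypothetical, keywords missing from a vector are treated as weight 0, and a similarity of 0 is returned when either vector has zero length, a convention the table does not specify.

```python
import math


def cosine_similarity(w_c, w_s):
    """Cosine of the angle between two sparse {keyword: weight} vectors."""
    # Dot product over the keywords of w_c; keywords absent from w_s add 0.
    dot = sum(weight * w_s.get(keyword, 0.0) for keyword, weight in w_c.items())
    norm_c = math.sqrt(sum(v * v for v in w_c.values()))
    norm_s = math.sqrt(sum(v * v for v in w_s.values()))
    if norm_c == 0.0 or norm_s == 0.0:
        return 0.0  # assumed convention for an all-zero vector
    return dot / (norm_c * norm_s)
```

Two vectors with no keyword in common have similarity 0, and a vector compared with a positive multiple of itself has similarity 1.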