| C-value (a) [18] | \( \left\{\begin{array}{ll} \log_2 \lvert a \rvert \cdot f(a), & \text{if } a \text{ is not nested} \\ \log_2 \lvert a \rvert \left( f(a) - \frac{1}{P(T_a)} \sum_{b \in T_a} f(b) \right), & \text{otherwise} \end{array}\right. \) |
| where: | |
| \( a \) is the candidate string | |
| \( \lvert a \rvert \) is the length of \( a \) in words | |
| \( f(\cdot) \) is the frequency of occurrence in the corpus | |
| \( T_a \) is the set of extracted candidate terms that contain \( a \) | |
| \( P(T_a) \) is the number of these candidate terms | |
| Termhood (a) \( \log\left(\frac{P(\text{vote} = \text{yes})}{P(\text{vote} = \text{no})}\right) \) [53] | = −0.7836 + |
| | 0.7541 * FirstPOS_ADJECTIVE − |
| | 1.3722 * FirstPOS_ADVERB + |
| | 0.3541 * FirstPOS_NOUN + |
| | 1.4182 * FirstPOS_VERB − |
| | 0.7722 * LastPOS_ADJECTIVE + |
| | 2.2576 * LastPOS_ADVERB + |
| | 0.0285 * LastPOS_NOUN + |
| | 0.6038 * LastPOS_VERB + |
| | 1.2899 * NP_VALUE + |
| | 1.0475 * REPEAT_SUP_GREATER_MEDIAN + |
| | 0.8417 * REPEAT_SUB_GREATER_MEDIAN + |
| | 0.8422 * DISTINCT_PERHOST_GREATER_THAN_MEDIAN |
| where: | |
| POS is the part-of-speech tag | |
| REPEAT_SUP is the number of supra terms (candidate terms that contain a) = P(Τa) | |
| REPEAT_SUB is the number of sub terms (candidate terms that are contained within a) = P(Αt) | |
| NP_VALUE indicates whether a is a noun phrase | |
| DISTINCT_PER_HOST is equivalent to document frequency | |
| MEDIAN is calculated over the whole document set | |
| TF-IDF: \( w_{i,j} = TF_{i,j} \times IDF_i \) [43] | \( TF_{i,j} = \frac{f_{i,j}}{\max_z f_{z,j}} \) |
| where: | |
| \( TF_{i,j} \) is the term frequency of keyword \( k_i \) in document \( d_j \) | |
| \( f_{i,j} \) is the number of times \( k_i \) appears in \( d_j \) | |
| \( \max_z f_{z,j} \) is the maximum frequency over all keywords \( k_z \) in \( d_j \) | |
| \( IDF_i = \log \frac{N}{n_i} \) | |
| where: | |
| \( IDF_i \) is the inverse document frequency of keyword \( k_i \) | |
| \( N \) is the total number of documents in the corpus | |
| \( n_i \) is the number of documents in which \( k_i \) appears | |
| Cosine similarity [43] \( \operatorname{cosine}\left(\overrightarrow{w_c}, \overrightarrow{w_s}\right) = \frac{\overrightarrow{w_c} \cdot \overrightarrow{w_s}}{\lVert \overrightarrow{w_c} \rVert_2 \times \lVert \overrightarrow{w_s} \rVert_2} \) | \( = \frac{\sum_{i=1}^{K} w_{i,c} w_{i,s}}{\sqrt{\sum_{i=1}^{K} w_{i,c}^2} \sqrt{\sum_{i=1}^{K} w_{i,s}^2}} \) |
| where: | |
| \( w_{i,j} \) is defined above | |
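
As an illustration only (not code from the cited works), the measures in the table above can be sketched in Python. The function names and data layout are hypothetical. The sketch assumes that candidate terms are tuples of words, that |a| is the number of words in a, that "nested" means contained in a longer candidate term, that the termhood features are binary indicators, and that the IDF logarithm is the natural logarithm (the table does not state a base).

```python
import math
from collections import Counter


def c_value(term, freq, candidates):
    """C-value of `term`; `freq` maps each candidate term to f(.)."""
    def contains(longer, shorter):
        n = len(shorter)
        return len(longer) > n and any(
            longer[i:i + n] == shorter for i in range(len(longer) - n + 1)
        )

    nesting = [b for b in candidates if contains(b, term)]  # T_a
    weight = math.log2(len(term))  # log2 |a|; zero for one-word terms
    if not nesting:  # a is not nested
        return weight * freq[term]
    return weight * (freq[term] - sum(freq[b] for b in nesting) / len(nesting))


def termhood_log_odds(features, coefficients, intercept):
    """log(P(vote=yes) / P(vote=no)) as intercept plus weighted features."""
    return intercept + sum(coefficients[k] * v for k, v in features.items())


def tf_idf(docs):
    """One {keyword: w_ij} dict per document; each doc is a list of keywords."""
    n_docs = len(docs)                                       # N
    doc_freq = Counter(k for doc in docs for k in set(doc))  # n_i
    weights = []
    for doc in docs:
        counts = Counter(doc)                                # f_ij
        max_count = max(counts.values(), default=1)          # max_z f_zj
        weights.append({
            k: (c / max_count) * math.log(n_docs / doc_freq[k])
            for k, c in counts.items()
        })
    return weights


def cosine(w_c, w_s):
    """Cosine similarity of two sparse weight vectors stored as dicts."""
    dot = sum(v * w_s.get(k, 0.0) for k, v in w_c.items())
    norm_c = math.sqrt(sum(v * v for v in w_c.values()))
    norm_s = math.sqrt(sum(v * v for v in w_s.values()))
    if norm_c == 0.0 or norm_s == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_c * norm_s)


# Toy data, invented for illustration.
freq = {("soft", "contact", "lens"): 5, ("contact", "lens"): 8}
print(c_value(("contact", "lens"), freq, set(freq)))  # 1 * (8 - 5/1) = 3.0

# Three of the coefficients listed in the table, plus the intercept.
coefficients = {"FirstPOS_ADJECTIVE": 0.7541, "LastPOS_NOUN": 0.0285,
                "NP_VALUE": 1.2899}
features = {"FirstPOS_ADJECTIVE": 1, "LastPOS_NOUN": 1, "NP_VALUE": 1}
log_odds = termhood_log_odds(features, coefficients, intercept=-0.7836)
print(round(log_odds, 4), 1 / (1 + math.exp(-log_odds)))  # log-odds, P(yes)

weights = tf_idf([["a", "b", "a"], ["b", "c"]])
print(cosine(weights[0], weights[1]))
```

A keyword that occurs in every document receives an IDF of zero, so it contributes nothing to the cosine similarity. This is why the two toy documents, which share only the keyword `b`, come out as orthogonal.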