Thematic clustering of text documents using an EM-based approach

Table 1 The thematic clustering algorithm

Given K initial clusters, the number n_U, and the set of prior probabilities {pr_d}_d∈D,
1. Create a random partition $\{V_{i}\}_{i = 1}^{K}$ of D with corresponding relations $\{R_{i}\}_{i = 1}^{K}$.
2. Compute p_t, q_t, and r_t for each V_i.
3. Compute α_t for each V_i.
4. For each cluster, select the n_U points for which α_t is the greatest to define the set U and the indicator values {u_t}_t∈T.
5. Compute the probabilities {pz_d}_d∈D for each cluster V_i.
6. Assign each document d to the cluster in which it has the highest probability.
7. Test for convergence. Terminate if converged.
8. For the subset $D_{s} \subset D_{V_{i}}$ of documents with the lowest 1% of {pz_d} values in V_i, re-assign each document to the cluster in which it has the second-highest probability.
9. Return to Step 2.
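
The assignment and re-assignment steps (Steps 6 and 8) can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: it assumes the probabilities from Step 5 are already available as a nested list `pz`, where `pz[d][i]` is the probability of document d in cluster i. The function names, the `frac` parameter, and the choice to round the 1% count up (so that every non-empty cluster moves at least one document) are all assumptions made for the example. The computation of p_t, q_t, r_t, and α_t is not shown.

```python
import math


def assign_to_best(pz):
    """Step 6: label each document with its most probable cluster."""
    return [max(range(len(row)), key=row.__getitem__) for row in pz]


def reassign_weakest(pz, labels, frac=0.01):
    """Step 8: in each cluster, move the lowest-probability documents
    to their best alternative cluster.

    frac is the fraction of each cluster to move (1% in the table).
    Rounding the count up is an assumption of this sketch.
    """
    k = len(pz[0])
    new_labels = list(labels)
    if k < 2:
        return new_labels  # no alternative cluster exists
    for i in range(k):
        members = [d for d, lab in enumerate(labels) if lab == i]
        if not members:
            continue
        n_move = math.ceil(frac * len(members))
        # Documents with the smallest probability in their own cluster.
        weakest = sorted(members, key=lambda d: pz[d][i])[:n_move]
        for d in weakest:
            # Best cluster other than i; this is the second-highest
            # probability whenever i is the document's top cluster.
            others = [j for j in range(k) if j != i]
            new_labels[d] = max(others, key=lambda j: pz[d][j])
    return new_labels


# Toy data: three documents, two clusters (values are made up).
pz = [[0.9, 0.1], [0.6, 0.4], [0.2, 0.8]]
labels = assign_to_best(pz)
perturbed = reassign_weakest(pz, labels, frac=0.5)
```

In a full loop, the labels returned by `reassign_weakest` would define the partition used when the procedure returns to Step 2.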

ISSN: 2041-1480