Abstract
Text document clustering has been studied intensively because of its important role in text mining and
information retrieval. Word-based clustering techniques that use the vector space model routinely suffer from
high dimensionality caused by the large number of distinct words. Although extracting words in the preprocessing
phase is simple, a collection can be viewed not only as a set of words but also as a set of phrases, some of
which consist of more than one word. Splitting a phrase into its component words can destroy the meaning of the
phrase; therefore, to preserve the context of its words, a phrase must be kept as a phrase. The assumption is
that adding phrases to words as clustering features will improve performance. This paper compares word-based and
phrase-based clustering. Three clustering models were chosen: hierarchical, partitional, and hybrid. Four
similarity techniques (GroupAverage, CompleteLink, SingleLink, and ClusterCenter) were tried for the
hierarchical model, K-Means and Bisecting K-Means for the partitional model, and Buckshot for the hybrid model.
Collections of 200 to 800 manually categorized news texts were used to test these algorithms, with the F-measure
as the criterion of clustering performance. This value is derived from recall and precision and measures how well
an algorithm assigns the documents in a collection to their correct categories. The results show that adding
phrases, or simply word pairs, slightly improves clustering performance, although the improvement is not
statistically significant.
Keywords: word-based document clustering, phrase-based document clustering, clustering performance
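
The abstract says the F-measure is derived from recall and precision but does not give the formula. The sketch below shows one common way to compute a clustering F-measure against manual categories. It is an assumption, not necessarily the variant used in the paper: each manual class is matched to the cluster that gives it the highest F value, and the per-class values are averaged with weights proportional to class size. The function name `clustering_f_measure` and its arguments are hypothetical.

```python
from collections import Counter


def clustering_f_measure(labels, clusters):
    """Class-weighted best-match F-measure (assumed variant, see lead-in).

    labels   -- manual category of each document
    clusters -- cluster id assigned to each document (same order)
    """
    if len(labels) != len(clusters):
        raise ValueError("labels and clusters must have the same length")
    n = len(labels)
    if n == 0:
        raise ValueError("at least one document is required")

    class_sizes = Counter(labels)              # n_i: documents in class i
    cluster_sizes = Counter(clusters)          # n_j: documents in cluster j
    overlap = Counter(zip(labels, clusters))   # n_ij: class i docs in cluster j

    best = {cls: 0.0 for cls in class_sizes}
    for (cls, clu), n_ij in overlap.items():
        recall = n_ij / class_sizes[cls]
        precision = n_ij / cluster_sizes[clu]
        # Harmonic mean of precision and recall; n_ij > 0 here, so no zero division.
        f = 2 * precision * recall / (precision + recall)
        best[cls] = max(best[cls], f)

    # Weight each class's best F value by its share of the collection.
    return sum(class_sizes[cls] / n * best[cls] for cls in class_sizes)


# Toy data, not taken from the paper.
perfect = clustering_f_measure(["a", "a", "b", "b"], [0, 0, 1, 1])
merged = clustering_f_measure(["a", "a", "b", "b"], [0, 0, 0, 0])
```

With this definition, a clustering that reproduces the manual categories exactly scores 1.0, and putting every document into a single cluster lowers the score because precision drops for every class.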