Authors: S. Vijayalakshmi and D. Manimegalai
Abstract: Text document clustering is a fundamental technique for mining massive amounts of textual data. The problem is high-dimensional, and most machine learning algorithms do not perform well when all the terms in the corpus are used. This study proposes an application of the flocking algorithm to text document clustering using two document representation methods, Unigram and Noun. The problem of high dimensionality is addressed by representing documents as a Bag of Nouns (BoN) and a Bag of Unigrams (BoU). Because documents contain thousands of words, unigrams are identified by looking up each candidate term in WordNet and verifying that the selected features are unigrams; the same process is repeated for nouns. In the clustering algorithm, boids follow four simple local rules (alignment, separation, cohesion and similarity) to calculate the velocity for flocking. Experiments were conducted on documents from the 20 Newsgroups and Reuters real datasets and a Specific Crime Judgment corpus to study the advantages of the system. The flocking algorithm for text document clustering is compared under unigram-based and noun-based document representations. The flocking algorithm with Bag of Nouns is observed to work more efficiently than with Bag of Unigrams or Bag of Words.
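The four local rules named in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the formulas, so everything here is an assumption made for illustration: the function names `cosine` and `flock_velocity`, the `radius`, `threshold` and `weights` parameters, the use of cosine similarity over noun counts, and the choice to treat the similarity rule as attraction toward similar documents and repulsion from dissimilar ones. It is not the authors' implementation.

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse term-count dicts (e.g. bags of nouns)."""
    dot = sum(count * b.get(term, 0) for term, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def flock_velocity(i, pos, vel, docs, radius=5.0, threshold=0.5,
                   weights=(1.0, 1.0, 1.0, 1.0)):
    """Return a new 2-D velocity for boid i as a weighted sum of four rule vectors.

    All parameters and formulas are illustrative assumptions.
    """
    px, py = pos[i]
    align = [0.0, 0.0]   # match the average heading of neighbours
    sep = [0.0, 0.0]     # steer away from close neighbours
    coh = [0.0, 0.0]     # steer toward the neighbours' centre
    sim = [0.0, 0.0]     # attract to similar documents, repel from dissimilar ones
    n = 0
    for j in range(len(pos)):
        if j == i:
            continue
        dx, dy = pos[j][0] - px, pos[j][1] - py
        dist = math.hypot(dx, dy)
        if dist == 0 or dist > radius:
            continue  # only boids within the sensing radius count as neighbours
        n += 1
        align[0] += vel[j][0]
        align[1] += vel[j][1]
        sep[0] -= dx / (dist * dist)
        sep[1] -= dy / (dist * dist)
        coh[0] += dx
        coh[1] += dy
        # Positive when documents are more similar than the threshold.
        s = cosine(docs[i], docs[j]) - threshold
        sim[0] += s * dx / dist
        sim[1] += s * dy / dist
    if n == 0:
        return tuple(vel[i])  # no neighbours: keep the current velocity
    wa, ws, wc, wsim = weights
    vx = wa * align[0] / n + ws * sep[0] + wc * coh[0] / n + wsim * sim[0]
    vy = wa * align[1] / n + ws * sep[1] + wc * coh[1] / n + wsim * sim[1]
    return (vx, vy)
```

In this sketch each boid carries one document, so repeated velocity updates would let boids with similar noun profiles drift into the same flock, and each flock can then be read off as a cluster.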
S. Vijayalakshmi and D. Manimegalai, 2014. Text Document Clustering with Flocking Algorithm using Specific Crimes Judgment Corpus. Asian Journal of Information Technology, 13: 21-28.