Study of Ontology or Thesaurus Based Document Clustering and Information Retrieval

Abstract: Document clustering automatically generates clusters from a whole document collection and is used in many fields, including data mining and information retrieval. Clustering text data poses several challenges, the most important being the volume of text data, high dimensionality, sparsity and complex semantics. Clustering techniques for text must therefore scale to large, high-dimensional data and handle both sparsity and semantics. In the traditional vector space model, the unique words occurring in the document set are used as features. However, because of synonymy and polysemy, such a bag of original words cannot represent the content of a document precisely. Most existing text clustering methods rely only on term strength and document frequency: single terms serve as features for representing the documents and are treated independently of one another, an approach that applies readily to non-ontological clustering. To address these issues, this study surveys recent research on ontology- or thesaurus-based document clustering.
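
The synonym problem described in the abstract can be illustrated with a minimal sketch. It is not a method taken from the surveyed work: the toy thesaurus, the two sample documents and the helper names `bag_of_words` and `cosine` are all assumptions made for illustration. Two documents about the same topic share no surface words, so their bag-of-words vectors have zero cosine similarity; mapping synonyms to a shared concept label makes part of the overlap visible.

```python
import math
from collections import Counter

# Hypothetical toy thesaurus (an assumption for this sketch):
# maps surface words to a shared concept label.
THESAURUS = {"car": "vehicle", "automobile": "vehicle"}


def bag_of_words(text, thesaurus=None):
    """Count whitespace-separated terms, optionally replacing each
    term by its concept label when the thesaurus contains it."""
    tokens = text.lower().split()
    if thesaurus is not None:
        tokens = [thesaurus.get(t, t) for t in tokens]
    return Counter(tokens)


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


doc1 = "car engine repair"
doc2 = "automobile motor service"

# Plain vector space model: no shared terms, so the similarity is zero.
plain = cosine(bag_of_words(doc1), bag_of_words(doc2))

# Thesaurus-mapped terms: "car" and "automobile" both become "vehicle".
mapped = cosine(bag_of_words(doc1, THESAURUS), bag_of_words(doc2, THESAURUS))

print(plain, mapped)
```

In this sketch only one synonym pair is covered, so the mapped similarity rises above zero but stays well below one. Polysemy is not handled here; resolving it would require choosing among several concepts for the same word based on context.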

Journal of Engineering and Applied Sciences

G. Bharathi and D. Venkatesan

How to cite this article

G. Bharathi and D. Venkatesan, 2012. Study of Ontology or Thesaurus Based Document Clustering and Information Retrieval. Journal of Engineering and Applied Sciences, 7: 342-347.