Journal of Engineering and Applied Sciences

Year: 2018
Volume: 13
Issue: 6
Page No. 1499 - 1505

Web Documents Similarity Using K-Shingle Tokens and MinHash Technique

Authors: Mehdi Ebady Manaa and Ghufran Abdulameer

Abstract: Web search engines rely on effective data mining techniques to discard similar documents from their search results. In massive data mining, document similarity techniques are important for detecting mirror pages and similar articles in a large web repository. This avoids showing two nearly identical web pages at the top of the search results. One document similarity approach is based on the K-shingle, a unique sequence of K consecutive words (K is a positive integer) that can be used to find the similarity between two documents. Large web documents can be represented as sets of long bit vectors, where 1 means that a shingle is found in the document and 0 means that it is not. Two documents that are nearly identical should have many shingles in common. The similarity ratio between two documents is calculated using a measure such as Jaccard similarity. Jaccard similarity works well for comparing pairs of sets and finding their similarity score in a small dataset. For a large dataset, MinHash and Locality-Sensitive Hashing (LSH) techniques address this problem by providing a small signature matrix that gives a fast approximation of the true Jaccard similarity in less time. In this study, we apply Jaccard similarity, MinHash and LSH based on K-shingles to different numbers of documents. The results show that the MinHash and LSH techniques produce more accurate results in less time for large documents. In the experiments, the chosen K-shingle is applied to document collections of 100, 200 and 300 up to 1000 documents, and the number of hash functions is set to 10, 20 and 30. The average similarity computation time is less than 5 sec. The numbers of false positives and false negatives were minimal with respect to the true clustering of the documents.
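
The pipeline described in the abstract (K-shingles, Jaccard similarity, MinHash signatures and LSH) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the CRC32 shingle encoding, the random linear hash family modulo a Mersenne prime and the banding scheme are all assumptions chosen for illustration.

```python
import random
import zlib
from collections import defaultdict
from itertools import combinations

P = (1 << 61) - 1  # Mersenne prime used as the modulus (an assumption of this sketch)


def shingles(text, k):
    """Return the set of K-shingles: sequences of k consecutive words."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity |A & B| / |A | B| of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def make_hash_params(num_hashes, seed=0):
    """Draw (a, b) pairs for hash functions h(x) = (a * x + b) mod P."""
    rng = random.Random(seed)
    return [(rng.randrange(1, P), rng.randrange(0, P)) for _ in range(num_hashes)]


def minhash_signature(shingle_set, params):
    """Signature = minimum hash value over a non-empty set, per hash function."""
    # CRC32 maps each shingle to an integer id that is stable across runs.
    ids = [zlib.crc32(s.encode("utf-8")) for s in shingle_set]
    return [min((a * x + b) % P for x in ids) for a, b in params]


def estimate_similarity(sig_a, sig_b):
    """Fraction of signature positions that agree; approximates Jaccard."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


def lsh_candidate_pairs(signatures, bands):
    """Split each signature into bands; documents sharing a band are candidates."""
    rows = len(next(iter(signatures.values()))) // bands
    candidates = set()
    for band in range(bands):
        buckets = defaultdict(list)
        for name, sig in signatures.items():
            buckets[tuple(sig[band * rows:(band + 1) * rows])].append(name)
        for names in buckets.values():
            candidates.update(combinations(sorted(names), 2))
    return candidates


if __name__ == "__main__":
    docs = {
        "d1": "the quick brown fox jumps over the lazy dog",
        "d2": "the quick brown fox jumps over the sleepy dog",
        "d3": "web search engines discard mirror pages from results",
    }
    params = make_hash_params(30, seed=42)  # 30 hash functions, as in the abstract
    sets = {name: shingles(text, 3) for name, text in docs.items()}
    sigs = {name: minhash_signature(s, params) for name, s in sets.items()}
    print("exact Jaccard d1,d2:", jaccard(sets["d1"], sets["d2"]))
    print("MinHash estimate d1,d2:", estimate_similarity(sigs["d1"], sigs["d2"]))
    print("LSH candidates:", lsh_candidate_pairs(sigs, bands=10))
```

The signature length equals the number of hash functions, so comparing two documents costs a fixed number of integer comparisons regardless of document size. With more hash functions the estimate tends to track the exact Jaccard value more closely, at the cost of longer signatures.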

How to cite this article:

Mehdi Ebady Manaa and Ghufran Abdulameer, 2018. Web Documents Similarity Using K-Shingle Tokens and MinHash Technique. Journal of Engineering and Applied Sciences, 13: 1499-1505.
