Journal of Engineering and Applied Sciences

Year: 2017
Volume: 12
Issue: 18
Page No. 4651 - 4656

Machine Learning-Based Topical Web Crawler: An Ensemble Approach Incorporating Meta-Features

Authors: Tae Jun Kim and Han-Joon Kim

Abstract: A topical web crawler collects web pages that describe pre-specified topics. The pages it collects share the same or similar words, yet quite a few of them can be irrelevant to the given topics. In particular, the crawler's performance degrades as the topic becomes more specific. Successful topical crawling therefore requires an additional step that actively filters out pages irrelevant to the given topics. To this end, we propose an ensemble-style machine learning architecture that handles not only literal term features but also numeric meta-features to improve topical web crawling; in this work, we aim to crawl web pages about the specific topic of ‘fire accidents’ more precisely. For the fire accident topic, we found that significant meta-features for topical crawling include tag information, the number of words in the title, and the numbers of person names and location names in a web page, among others. For the numeric meta-features we use the logistic regression and random forest learning algorithms; for the literal word features, we use the Naive Bayes and support vector machine learning algorithms. Through extensive experiments on fire accident-related news articles, we show that the proposed method outperforms conventional ones.
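As a rough illustration of the numeric meta-features named in the abstract, the following Python sketch counts HTML tags and title words for a single page using only the standard library. The names `MetaFeatureParser`, `extract_meta_features`, and `soft_vote`, the choice of tags, and the probability-averaging step are assumptions made for illustration; the abstract does not say which tags are counted or how the four classifiers' outputs are combined. Counting person and location names would need a named-entity recognizer and is left out.

```python
from collections import Counter
from html.parser import HTMLParser


class MetaFeatureParser(HTMLParser):
    """Collects tag counts and the title text of one HTML page."""

    def __init__(self):
        super().__init__()
        self.tag_counts = Counter()
        self.title_parts = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        self.tag_counts[tag] += 1
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)


def extract_meta_features(html, tags=("p", "a", "img")):
    """Return numeric meta-features for one page (hypothetical feature set)."""
    parser = MetaFeatureParser()
    parser.feed(html)
    parser.close()
    title = "".join(parser.title_parts)
    features = {"title_words": len(title.split())}
    for tag in tags:
        features[f"n_{tag}"] = parser.tag_counts[tag]
    return features


def soft_vote(probabilities):
    """Average relevance probabilities from several classifiers.

    Assumption: simple averaging stands in for whatever combination
    rule the paper's ensemble actually uses.
    """
    return sum(probabilities) / len(probabilities)
```

In such a setup, the feature dictionary would feed the meta-feature classifiers, the page text would feed the word-feature classifiers, and a page would be kept when the combined relevance score exceeds a chosen threshold.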

How to cite this article:

Tae Jun Kim and Han-Joon Kim, 2017. Machine Learning-Based Topical Web Crawler: An Ensemble Approach Incorporating Meta-Features. Journal of Engineering and Applied Sciences, 12: 4651-4656.
