International Journal of Signal System Control and Engineering Application

Year: 2019
Volume: 12
Issue: 4
Page No. 67 - 73

Web Page Block Identification using Machine Learning Techniques

Authors : Neetu Narwal, Sanjay Kumar Sharma and Amit Prakash Singh

Abstract: The Internet has gained wide acceptance as a reservoir of information. Web pages, however, contain noise (advertisements, external links) alongside their main content. This noise makes it difficult for search engines to classify a web page accurately and distracts users who are gathering data on a topic. Various segmentation techniques partition a web page, but very few categorize the resulting blocks. In this study, we categorize the page blocks extracted by segmentation. We used a web page segmentation algorithm to parse the web page and extracted important features to build a dataset. Linear and nonlinear machine learning techniques were used to train on the dataset. We also analyzed the importance of the features for the learning process and found that embedded objects from external sources have the highest significance for block identification. In our experiment, the nonlinear radial basis neural network gave the best performance, with an accuracy of 99.89%.
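The feature extraction step described in the abstract might be sketched as follows. This is only an illustration under stated assumptions: the abstract does not list the actual features or the segmentation algorithm, so the class `BlockFeatureParser`, the function `block_features`, the chosen tag set, and the feature names are all hypothetical. The sketch counts links, external links, and externally sourced embedded objects in the HTML of a single block that has already been segmented.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class BlockFeatureParser(HTMLParser):
    """Collects simple counts from the HTML of one segmented page block.

    The feature set here is an assumption for illustration, not the
    feature set used in the paper.
    """

    # Tags treated as "embedded objects" (an assumed, illustrative choice).
    EMBED_TAGS = {"iframe", "embed", "object", "script", "img"}

    def __init__(self, page_host):
        super().__init__()
        self.page_host = page_host
        self.text_chars = 0
        self.links = 0
        self.external_links = 0
        self.external_embeds = 0

    def _is_external(self, url):
        # A URL is external if it names a host different from the page's host.
        host = urlparse(url).netloc
        return bool(host) and host != self.page_host

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links += 1
            if self._is_external(attrs["href"]):
                self.external_links += 1
        elif tag in self.EMBED_TAGS:
            # <object> uses "data" rather than "src" for its source URL.
            src = attrs.get("src") or attrs.get("data")
            if src and self._is_external(src):
                self.external_embeds += 1

    def handle_data(self, data):
        # Count non-whitespace-trimmed characters of text in the block.
        self.text_chars += len(data.strip())


def block_features(block_html, page_host):
    """Return a feature dictionary for one block of a page served from page_host."""
    parser = BlockFeatureParser(page_host)
    parser.feed(block_html)
    parser.close()
    return {
        "text_chars": parser.text_chars,
        "links": parser.links,
        "external_links": parser.external_links,
        "external_embeds": parser.external_embeds,
    }
```

One feature dictionary per block, paired with a block label, would form a row of the training dataset; any linear or nonlinear classifier could then be trained on those rows.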

How to cite this article:

Neetu Narwal, Sanjay Kumar Sharma and Amit Prakash Singh, 2019. Web Page Block Identification using Machine Learning Techniques. International Journal of Signal System Control and Engineering Application, 12: 67-73.
