Journal of Engineering and Applied Sciences

Year: 2018
Volume: 13
Issue: 3 SI
Page No. 3215 - 3220

Implementation of Real-Time Data Collection System Based on Improved Web Crawling Engine

Authors: Ki-Bok Nam, Koo-Rack Park, Jin-Young Jung and Han-Jin Cho

Abstract: Artificial intelligence has become a topic of interest across society. Training artificial intelligence requires collecting big data. The conventional web crawling technique for data collection sometimes fails to read web pages when web sites change or servers shut down, and it has no separate algorithm for handling pages that respond late. As a result, the method struggles with data accuracy and effective data collection. To collect appropriate data, this study implemented a Java-based web crawling engine that analyzes the HTML DOM structure of a particular website. The implemented engine also has an alarm function that notifies an administrator when a problem occurs with a target website, and it uses multithreading to collect multiple data items quickly and simultaneously. Because a target site may interpret such concurrent access as a DDoS attack, the implemented engine avoids accessing the same URL repeatedly. When the proposed system was applied, it shortened the time to obtain about 440,000 images from the Google website from 58 h to <4 h. With the implemented system, it was also possible to obtain data for multiple keywords simultaneously. Therefore, the system is expected to make big data collection for artificial intelligence learning easier and more accurate. In future work, the proposed web crawling engine will be developed further, and a system to process and classify the collected data will be researched.
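
The abstract names three mechanisms: multithreaded collection, not accessing the same URL twice, and an alarm for the administrator when a site fails. The following is a minimal Java sketch of how those three ideas could fit together. It is not the authors' implementation, which the abstract does not show. The class name `CrawlSketch`, the pluggable `fetcher` function, the `alert` callback, and the one-minute wait are all assumptions made for illustration.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;
import java.util.function.Function;

// Hypothetical sketch; names and structure are assumptions, not the paper's engine.
class CrawlSketch {
    // Thread-safe set of URLs that have already been requested.
    private final Set<String> seen = ConcurrentHashMap.newKeySet();
    // Thread-safe store of fetched page bodies, keyed by URL.
    private final Map<String, String> pages = new ConcurrentHashMap<>();
    private final Function<String, String> fetcher; // assumed: URL -> page body
    private final Consumer<String> alert;           // assumed: administrator notification hook

    CrawlSketch(Function<String, String> fetcher, Consumer<String> alert) {
        this.fetcher = fetcher;
        this.alert = alert;
    }

    // Fetches each distinct URL at most once using a fixed-size thread pool.
    Map<String, String> crawl(List<String> urls, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (String url : urls) {
            if (!seen.add(url)) {
                continue; // already requested: skip to avoid hitting the same URL again
            }
            pool.execute(() -> {
                try {
                    pages.put(url, fetcher.apply(url));
                } catch (RuntimeException e) {
                    // A failed fetch is reported instead of being silently dropped.
                    alert.accept(url + ": " + e.getMessage());
                }
            });
        }
        pool.shutdown();
        try {
            // Assumed upper bound on how long to wait for outstanding fetches.
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                pool.shutdownNow();
            }
        } catch (InterruptedException e) {
            pool.shutdownNow();
            Thread.currentThread().interrupt();
        }
        return pages;
    }

    public static void main(String[] args) {
        // Stub fetcher so the sketch runs without network access.
        CrawlSketch sketch = new CrawlSketch(
                url -> "<html>" + url + "</html>",
                message -> System.err.println("ALERT " + message));
        Map<String, String> result = sketch.crawl(List.of("page-1", "page-2", "page-1"), 2);
        System.out.println(result.size() + " distinct pages fetched");
    }
}
```

In a real crawler the `fetcher` would wrap an HTTP client with a request timeout, which is one way to handle the late-responding pages the abstract mentions, and `alert` would send a message to the administrator rather than print to standard error.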

How to cite this article:

Ki-Bok Nam, Koo-Rack Park, Jin-Young Jung and Han-Jin Cho, 2018. Implementation of Real-Time Data Collection System Based on Improved Web Crawling Engine. Journal of Engineering and Applied Sciences, 13: 3215-3220.
