Journal of Engineering and Applied Sciences

Year: 2017
Volume: 12
Issue: 3
Page No. 468 - 474

Construction of Malay Abbreviation Corpus Based on Social Media Data

Authors: Nasiroh Omar, Ahmad Farhan Hamsani, Nur Atiqah Sia Abdullah and Siti Zaleha Zainal Abidin

Abstract: This study describes the construction of a Malay abbreviation corpus by extracting and normalizing selected social media data using a multilayer filtration pattern-matching technique together with a statistical machine translation approach. One million Malay Lingo user-generated posts were extracted from Twitter and Facebook for sampling. Each word undergoes a pre-processing stage, which involves filtration and association, and is stored in a MySQL database table. Each word in the corpus is then linked to its corresponding word in the existing vocabulary; a word with no match is treated as an abbreviation, processed further using an N-gram approach and added to the existing corpus. The results show that the translation probability decreases as the text gets longer. Writing style also matters: when spaces are not used to separate words, two or more words merge into a single out-of-vocabulary word. In the worst case, the merged word cannot be linked to any recognizable root word in the dictionary. In the first attempt, which processed 1000 selected social media posts, many uncommon abbreviations were found and, as a result, a lower translation percentage was achieved. However, when a post uses common abbreviations that exist in the Malay Social Media Corpus, the translation achieves 100% accuracy. The supply of user-generated words is effectively unlimited, and there are still many ways to improve the combination of NLP techniques to construct a better and more reliable corpus, given the dynamic nature of users' behaviour and their informal ways of writing. Such a corpus is much needed for analysing public sentiment along various dimensions, such as product-related evaluations and service-oriented feedback, which propagate across social media platforms.
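
The filtration and vocabulary-linking steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the word lists, the regular expressions and the function names (filter_post, normalize, coverage) are hypothetical, an in-memory dictionary stands in for the MySQL table, and the N-gram and statistical machine translation stages are not covered.

```python
import re

# Hypothetical toy data; the paper's vocabulary and abbreviation corpus are not given.
VOCABULARY = {"saya", "tidak", "pergi", "yang", "makan"}
ABBREVIATIONS = {"sy": "saya", "x": "tidak", "yg": "yang", "pg": "pergi"}

# Assumed filtration patterns: drop URLs, mentions and hashtags, then non-letters.
URL_OR_TAG = re.compile(r"https?://\S+|[@#]\w+")
NON_LETTER = re.compile(r"[^a-z\s]")


def filter_post(post):
    """Lowercase a post, strip URLs/mentions/punctuation, split on whitespace."""
    text = URL_OR_TAG.sub(" ", post.lower())
    text = NON_LETTER.sub(" ", text)
    return text.split()


def normalize(tokens):
    """Link each token to the vocabulary or expand a known abbreviation.

    Returns the normalized tokens and the tokens that could not be linked
    (candidates for further processing).
    """
    normalized, unknown = [], []
    for token in tokens:
        if token in VOCABULARY:
            normalized.append(token)
        elif token in ABBREVIATIONS:
            normalized.append(ABBREVIATIONS[token])
        else:
            normalized.append(token)
            unknown.append(token)
    return normalized, unknown


def coverage(tokens, unknown):
    """Fraction of tokens that were linked to a known word."""
    if not tokens:
        return 1.0
    return 1 - len(unknown) / len(tokens)


tokens = filter_post("Sy x pg @ali http://t.co/abc makan!")
words, unknown = normalize(tokens)
print(words, unknown, coverage(tokens, unknown))

# A post written without spaces yields one merged, out-of-vocabulary token.
merged = filter_post("sayapergi")
print(normalize(merged), coverage(merged, normalize(merged)[1]))
```

In this toy setting, a post made up only of known words and listed abbreviations is fully linked, while the merged token "sayapergi" matches nothing, which mirrors the abstract's observation about missing spaces.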

How to cite this article:

Nasiroh Omar, Ahmad Farhan Hamsani, Nur Atiqah Sia Abdullah and Siti Zaleha Zainal Abidin, 2017. Construction of Malay Abbreviation Corpus Based on Social Media Data. Journal of Engineering and Applied Sciences, 12: 468-474.
