International Journal of Soft Computing

Year: 2009
Volume: 4
Issue: 1
Page No. 10 - 15

A New CMAC Neural Network Model for Content-based Web Page Classification

Authors : Somaiyeh Dehghan and Amir Masoud Rahmani

Abstract: The rapid growth of the World Wide Web in recent years makes it necessary for search engines to classify its data into categories. Automatic classification of web pages can exploit the text information, structure information and hyperlink information of web pages; research has focused on automatic classification of web text information, that is, content-based classification. A major difficulty of content-based web page classification is dealing with the high-dimensional feature space. Based on the analysis done, the CMAC neural network shows fast learning in high-dimensional problems. In the present study, a new CMAC neural network model is proposed for content-based web page classification. The results show that the proposed model outperforms the compared algorithms.

How to cite this article:

Somaiyeh Dehghan and Amir Masoud Rahmani , 2009. A New CMAC Neural Network Model for Content-based Web Page Classification. International Journal of Soft Computing, 4: 10-15.

INTRODUCTION

With the rapid growth of information on the World Wide Web, automatic classification of web pages has become important for effective indexing and retrieval of web documents. Automatic classification of web pages can exploit the text information, structure information and hyperlink information of web pages; research has focused on automatic classification of web text information, that is, content-based classification (Liang, 2003). A major difficulty of this application is the high dimensionality of the feature space. A common approach to representing a text document is to use a bag of the words that appear in the document. Since a web page can contain thousands of words, the feature space for representing web pages is potentially huge. Few machine learning systems can handle such a large number of features.

Due to the diversity of web pages' content, the complexity of their structure and other characteristics, it is very difficult to improve the accuracy of automatic classification. Researchers have designed different kinds of classifiers to address this problem, such as the Naive Bayes classifier (Fernández et al., 2006; Tomar et al., 2006), K-neighbor Clustering (Shi, 2002), the SOM neural network (Zhang, 2002) and the Support Vector Machine (SVM) (Dumais, 2000; Xue, 2006), each of which has its own characteristics and applications. For example, the Naive Bayes classifier is based on the assumption that the features to be classified are orthogonal and subject to polynomial independent uniform distribution, K-neighbor Clustering assumes that no sample of another class appears in the known neighboring area, the SOM neural network is an order-holding mapping but requires more training time and the SVM, in spite of its strong classifying ability, demands solving the problem through quadratic programming (Liang, 2003).

Based on past analyses (Albus, 1975; Palacios et al., 2006), the Cerebellar Model Arithmetic Computer (CMAC) has high potential for producing effective data mining techniques because of its rapid learning, good local generalization, guaranteed convergence and easy hardware implementation.

In the present study, we introduce a new CMAC neural network model for content-based web page classification.

WEB PAGE REPRESENTATION AND FEATURE EXTRACTION

The first step in the process of content-based web page classification is the choice of features. The purpose of feature selection is to find a subset of attributes that best describes the documents with regard to the classification process, or whose choice gives the learning algorithm the highest accuracy. Since web data is unclear and disorganized, some preprocessing steps on the web documents are needed before the choice of features. First the web pages are converted to HTML tag-free text; the text is then purified by eliminating punctuation and stopwords (Frakes and Baeza-Yates, 1992) and all words in the document are converted to lowercase. After this, the number of terms remaining in the text corpus is still large, so a feature extraction algorithm is used to choose a number of key words for describing the documents.
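The preprocessing steps above can be sketched as follows; the tag-stripping regex and the deliberately tiny stopword list are illustrative assumptions, not the authors' exact pipeline:

```python
import re

# A tiny illustrative stopword list; real systems use much larger ones
STOPWORDS = {"the", "a", "an", "and", "or", "of", "in", "to", "is"}

def preprocess(html: str) -> list[str]:
    """Strip HTML tags, remove punctuation, lowercase and drop stopwords."""
    text = re.sub(r"<[^>]+>", " ", html)       # remove HTML tags
    text = re.sub(r"[^A-Za-z\s]", " ", text)   # remove punctuation and digits
    words = text.lower().split()               # normalize case, tokenize
    return [w for w in words if w not in STOPWORDS]

print(preprocess("<p>The Goats and the Sheep!</p>"))  # ['goats', 'sheep']
```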

In order to select the features, first of all the web documents are converted to feature vectors. For representing the documents as vectors there are different models, such as the Vector Space Model (VSM) or Term Frequency-Inverse Document Frequency (TF-IDF) (Salton et al., 1975), Boolean representation (Markov and Larose, 2007) and bag of words or Term Frequency (TF) (Markov and Larose, 2007). After converting the documents to feature vectors, a number of terms are selected as features by one of the feature extraction algorithms, such as Correlation-based Feature Selection (CFS) (Hall, 1998), Information Gain attribute evaluation (Quinlan, 1986) or Similarity-based Feature Selection (Kononenko, 1994).

In this study, we have applied the Boolean representation method (Markov and Larose, 2007) for representing web documents. In this method, first the term-document matrix is built, in which columns represent terms and rows represent documents. In Boolean representation, which is the simplest model for representing documents, each term is considered a Boolean attribute. As a result, each cell of the matrix contains 1 (if the term exists in the document) or 0 (if the term does not exist in the document).
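A minimal sketch of building such a Boolean term-document matrix; the sample documents and terms are hypothetical:

```python
def boolean_matrix(docs: list[list[str]], terms: list[str]) -> list[list[int]]:
    """Rows are documents, columns are terms; a cell is 1 iff the term occurs."""
    return [[1 if t in set(doc) else 0 for t in terms] for doc in docs]

docs = [["goat", "farm"], ["sheep", "farm", "wool"]]
terms = ["goat", "sheep", "farm"]
print(boolean_matrix(docs, terms))  # [[1, 0, 1], [0, 1, 1]]
```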

After representing the documents as feature vectors, we used Information Gain attribute evaluation (Quinlan, 1986) to determine which terms to select as features. Information Gain attribute evaluation is based on entropy (Markov and Larose, 2007).

Let S be a set of document vectors from k classes C1, C2, ..., Ck. Then the number of vectors in S is |S| = |S1| + |S2| + ... + |Sk|, where Si is the set of vectors belonging to class Ci. The entropy is the average information needed to predict the class of an arbitrary vector in S. It is defined according to Eq. 1:

H(S) = -Σ_{i=1}^{k} P(C_i) log2 P(C_i)    (1)

Where, the probability of class Ci is calculated as the proportion of instances in it (i.e., P(Ci) = |Si|/|S|).

Assume now that attribute A has m values v1, v2, ..., vm. Then A splits the set S into m subsets A1, A2, ..., Am, each containing the documents that have value vi for A. The entropy of the split based on attribute A is defined according to Eq. 2:

H(A) = Σ_{i=1}^{m} (|A_i|/|S|) H(A_i)    (2)

where, H(Ai) is the entropy of the class distribution in set Ai.

After splitting a set of documents, the entropy of the split decreases. The best split (produced by the best attribute) will put documents from a single class in each Ai and thus its entropy will be 0. The information gain (Quinlan, 1986) measures the quality of a split (respectively, of an attribute) by the decrease in entropy, according to Eq. 3:

Gain(A) = H(S) - H(A)    (3)
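Equations 1-3 can be illustrated with a small sketch; the class labels and attribute values below are hypothetical:

```python
from collections import Counter
from math import log2

def entropy(labels: list[str]) -> float:
    """H(S) = -sum of P(Ci) * log2 P(Ci) over the class proportions (Eq. 1)."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(labels: list[str], attr_values: list[int]) -> float:
    """Gain(A) = H(S) - sum of |Ai|/|S| * H(Ai) over the split by A (Eq. 2, 3)."""
    n = len(labels)
    split: dict[int, list[str]] = {}
    for lab, v in zip(labels, attr_values):
        split.setdefault(v, []).append(lab)
    h_split = sum(len(part) / n * entropy(part) for part in split.values())
    return entropy(labels) - h_split

labels = ["hot", "hot", "cold", "cold"]
attr   = [1, 1, 0, 0]   # a Boolean feature that perfectly separates the classes
print(info_gain(labels, attr))  # 1.0
```

A feature that splits the documents into pure subsets has zero split entropy, so its gain equals the entropy of the whole set, which is why such features rank highest.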

CONVENTIONAL CMAC

The conventional CMAC model was introduced by Albus (1975). The conventional CMAC is a neural network that models the structure and operation of the cerebellum, the part of the brain whose main function is to coordinate and control the eyes, hands, fingers, arms and legs. The model is based on associative memory, which uses a lookup-table technique. The block diagram of the CMAC model is shown in Fig. 1.

Figure 2 illustrates an example of a CMAC structure with two input variables, s1 and s2. Each input variable is divided into a number of discrete sections called blocks. In Fig. 2, these sections are shown as A, B and C for s1 and a, b and c for s2. By shifting each block a small distance (called an element), different blocks are created in different layers. For example, for blocks A, B and C of input s1, we have D, E and F in the next layer. The blocks of the two inputs s1 and s2 jointly form spaces called hypercubes, such as Bb, Ee and Ih. Each hypercube is a memory cell, which stores and retrieves data (Rahmani, 2003).

Output mapping: The output corresponding to the inputs is saved as information, or weights, in Ne association memory cells, that is, in Ne hypercubes. The output is obtained through Eq. 4:

y(s) = Σ_{j=1}^{M} a_j(s) w_j    (4)

where s is a distinguished input state and M is the volume of the memory, equal to the number of hypercubes. If memory location j is covered by one of the hypercubes of input state s, a_j(s) = 1; otherwise a_j(s) = 0.

Learning algorithm: In CMAC model, supervised learning is used to adjust the stored weights in memory cells. Learning is iterative and based on global error generalization.

Fig. 1: The block diagram of CMAC

Fig. 2: The CMAC structure with 2 inputs

In this respect, it resembles the standard back-propagation algorithm. The learning algorithm has two steps:

Applying the inputs and obtaining the outputs, then adjusting the weights by comparing the obtained outputs with the desired ones.

The first step is performed by Eq. 4 (output mapping) and the second through Eq. 5:

w_j ← w_j + (α/N_e) a_j(s) (ŷ(s) - y(s))    (5)

where α is the learning rate such that 0 < α < 1, Ne is the number of block elements (the number of layers), ŷ(s) is the desired output and (ŷ(s) - y(s)) is the error for the training sample.

The error is multiplied by a_j(s) so that only the weights of the memory locations covered by the hypercubes are adjusted. The CMAC neural network has generalization capability: when an input the network has not been trained on is applied, a suitable output is still generated.
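A minimal two-input CMAC sketch along the lines of Eq. 4 and 5; the block size, number of layers and hypercube addressing scheme are assumptions for illustration, not the authors' exact implementation:

```python
class CMAC:
    def __init__(self, n_layers: int = 4, block: int = 4, lr: float = 0.5):
        self.ne, self.block, self.lr = n_layers, block, lr
        self.w: dict = {}   # sparse weight table: hypercube address -> weight

    def _cells(self, s1: int, s2: int) -> list:
        # Each layer shifts the block grid by one element; the layer number
        # plus the pair of block indices addresses one hypercube (memory cell).
        return [(L, (s1 + L) // self.block, (s2 + L) // self.block)
                for L in range(self.ne)]

    def output(self, s1: int, s2: int) -> float:            # Eq. 4
        return sum(self.w.get(c, 0.0) for c in self._cells(s1, s2))

    def train(self, s1: int, s2: int, target: float) -> None:  # Eq. 5
        err = target - self.output(s1, s2)
        for c in self._cells(s1, s2):
            self.w[c] = self.w.get(c, 0.0) + self.lr * err / self.ne

net = CMAC()
for _ in range(50):
    net.train(3, 5, 1.0)
print(round(net.output(3, 5), 3))  # converges toward 1.0
```

Because the error is spread over only the Ne cells covering the input, nearby inputs that share hypercubes inherit part of the learned response, which is the local generalization the text describes.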

THE PROPOSED NEW CMAC NEURAL NETWORK MODEL

The problem in web page classification is the high-dimensional feature space, which increases the memory required by CMAC. To solve this problem and adapt the CMAC neural network for content-based web page classification, we propose the new CMAC model shown in Fig. 3.

In content based web page classification, the presence or absence of features or the number of feature occurrences together with keyword of each class, determine the class of the web pages.

Thus, each class has at least one keyword; if the set of classes is C = {c1, ..., cm}, the minimal set of keywords is K = {k1, ..., km}. Also, by using Information Gain attribute evaluation (Quinlan, 1986), N features are extracted from the web pages, such that Features(ci) = {x1, ..., xN}.

Fig. 3: The new CMAC model


In the new CMAC model, the number of modules equals the number of classes, that is, M = {M1, ..., Mm}. Within each module there is one two-input CMAC per selected feature, that is, N two-input CMACs. The inputs of each CMAC within module Mi are the keyword of class Ci and one of the features selected from that class. In the new CMAC model, the output of each module (ONETi) is the sum of the two-input CMAC outputs.

Output mapping: In this model, the output of each CMAC is the sum of the weights stored in the hypercubes that cover the input state (ki, xj), calculated through Eq. 6:

O_ij = Σ_{l=1}^{M} a_l(s) w_l    (6)

where s is a distinguished input state and M is the volume of the memory, equal to the number of hypercubes. If memory location l is covered by one of the hypercubes of input state s, a_l(s) = 1; otherwise a_l(s) = 0.

Also, the output of each module (ONETi) is the sum of the outputs of those CMACs whose feature xj occurs in the training sample, that is, frequency(xj) ≥ 1.

For this purpose, we define the vector P = [p1 p2 ... pN]. If feature xj appears in the sample web document, pj = 1; otherwise pj = 0. Thus, the output of module i is obtained through Eq. 7:

ONET_i = Σ_{j=1}^{N} p_j O_ij    (7)

where Oij is the output of the jth CMAC in module i, N is the number of features extracted from the web pages and pj indicates the presence or absence of feature xj in the training sample.
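Equation 7 amounts to a masked sum, which can be sketched as follows (the numbers are hypothetical):

```python
def module_output(cmac_outputs: list[float], presence: list[int]) -> float:
    """ONETi = sum over j of p_j * O_ij (Eq. 7): only features present in the
    document (p_j = 1) contribute to the module output."""
    return sum(p * o for p, o in zip(presence, cmac_outputs))

O_i = [0.9, 0.2, 0.7]   # outputs of the N two-input CMACs in module i
P   = [1, 0, 1]         # feature x2 is absent from the sample document
print(round(module_output(O_i, P), 2))  # 1.6
```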

Learning algorithm: The learning algorithm in the new CMAC model is similar to that of the conventional CMAC. This model uses supervised learning based on global error generalization to adjust the weights stored in the hypercubes. The learning rule is given by Eq. 8:

w_l ← w_l + (α/(N_p N_e)) a_l(s) (ŷ(s) - y(s))    (8)

where α is the learning rate such that 0 < α < 1, Np is the number of non-zero elements of vector P, Ne is the number of layers, ŷ(s) is the desired output and (ŷ(s) - y(s)) is the error for the training sample. The error is multiplied by a_l(s) so that only the weights of the memory locations covered by the hypercubes are adjusted.

EXPERIMENTAL RESULTS

Dataset used: In order to show the learning efficiency of the proposed model, the Syskill and Webert web page ratings dataset (Pazzani et al., 1996) was used to test the new CMAC model. Pazzani et al. (1996) proposed an intelligent agent named Syskill and Webert, designed to help users identify their favorite web pages on a specific topic. This system provides a user interface through which users rate web pages into 3 interest classes: hot, medium or cold. The ratings are recorded as a user profile with positive and negative examples; in this system each user has a set of profiles, one per topic. The dataset consists of HTML web pages together with a single user's ratings on four distinct topics: Bands-recording artists, Goats, Sheep and Biomedical. Each rating record consists of the following 5 fields: the name of the source HTML web page, the rating level (cold, medium, hot), the URL of the web site, the visit date and the title of the web page.

In most research, the medium-level pages of this dataset are merged with the cold pages because of their small number (Pazzani et al., 1996). Table 1 shows the rating topics of the single user, the different interest levels and the total number of web pages.

Data preparation: First of all, we converted the web pages of the Syskill and Webert dataset into HTML tag-free text; then, using the Weka software (Markov and Larose, 2007), the documents in the dataset were purified by omitting punctuation and stopwords (Frakes and Baeza-Yates, 1992) and all words were converted to lowercase.

Table 1: The used topic in the experiments

Table 2: Four labels assigned to each document in classification tasks

Table 3: The accuracy of various classification algorithms and the new CMAC model

Then, using Weka, the Boolean representation of the documents was prepared. With the help of Weka, we applied the Information Gain attribute evaluation algorithm (Quinlan, 1986) to select a number of informative terms from the web pages as features. All documents were then converted into feature vectors learnable by the new CMAC model.

Performance measure: The major measurement criterion for the classification process is Accuracy. As the most common clustering and classification problems involve 2 classes, these are usually called positive and negative. Also, the original class labels are referred to as actual and those determined by the classification algorithm as predicted. According to this terminology, the class-to-class evaluation assigns to each document one of the 4 labels in Table 2 (Markov and Larose, 2007).

With regard to Table 2, Accuracy is the ratio of the number of correct predictions (TP + TN) to the total number of predictions, as represented in Eq. 9:

Accuracy = (TP + TN)/(TP + TN + FP + FN)    (9)
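Equation 9 in code, with hypothetical confusion-matrix counts:

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Accuracy = (TP + TN) / (TP + TN + FP + FN), per Eq. 9."""
    return (tp + tn) / (tp + tn + fp + fn)

print(accuracy(12, 5, 2, 1))  # 0.85
```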

Experiment: In this experiment, the results reported by Pazzani et al. (1996), who used seven classification algorithms for learning user profiles on the Syskill and Webert rated web pages, were compared with the results of the proposed new CMAC neural network. Identical conditions were used for the comparison. A training set of 20 web page samples was used.

Also, Information Gain attribute evaluation (Quinlan, 1986) was used to select 128 features and the Boolean representation model (Markov and Larose, 2007) to construct the feature vectors. Table 3 shows the Accuracy of the seven algorithms of Pazzani et al. (1996) and of the new CMAC model under identical conditions.

The results of Pazzani et al. (1996) show that the Bayesian Classifier has the best efficiency among the seven algorithms. The results obtained from the experiment show that the new CMAC model on average outperforms these algorithms.

CONCLUSION

In this study, a new CMAC model was introduced for content-based web page classification, capable of learning user profiles. The experimental results show that the new CMAC model learns fast, generalizes well and achieves higher Accuracy than the compared classification algorithms.
