Abstract— Text classification is one approach to recognizing intent in a chat bot, and preprocessing is one of its success factors. Combining all preprocessing techniques does not always improve the performance of machine learning; it depends on the data and the algorithm. This paper compares the performance of combinations of preprocessing techniques in a case study of a Bahasa Indonesia e-learning chat bot. Using a convolutional neural network and cross-validation, several preprocessing techniques were compared: noise removal, case folding, tokenization, stop word removal, and stemming. The benchmark reveals that the combination of all preprocessing tasks has a precision of 93.84%, a recall of 90.85%, and an accuracy of 91.98%, with a preprocessing time of 35.71 ms for 1320 question rows. Compared to this best-performing model, the combination of noise removal, case folding, and tokenization has a precision of 92.56%, a recall of 89.42%, and an accuracy of 90.65%, with a preprocessing time of 20.21 ms. Therefore, no single model can be concluded to be the best choice; it depends on the intended use. For a production environment with small-scale resources and a need for fast processing, the preprocessing combination that fits those constraints should be chosen.

Keywords: preprocessing, performance, processing time

I. Introduction

Text classification is a common method in machine learning, but it is challenging because no single method works for all datasets. This technique has been successfully applied to many tasks, for instance email spam filtering [1], news categorization [2], and sentiment analysis [3]. Moreover, it can be used as an approach for detecting question intents in a chat bot.

There are several stages in recognizing a question intent in a chat bot, and one of them is preprocessing. It is a critical subtask because exponential data growth affects computational time [4]. In addition, preprocessing accounts for up to 80% of the effort compared with the other stages [5]. Another benefit of preprocessing is improving the performance (e.g., accuracy, precision, and recall) of machine learning [1], [6], [7]. However, as those studies show, applying more preprocessing techniques, such as the combination of case folding, stop word removal, and stemming, does not always improve performance. Therefore, the choice of preprocessing techniques depends on the data and the machine learning algorithm.

The objective of this study is to benchmark several combinations of preprocessing techniques and their effect on accuracy, precision, and recall. The dataset, containing a collection of student questions to an e-learning chat bot, is used to evaluate the performance of a convolutional neural network. Furthermore, this study also shows the time impact of using the various combinations.

This paper discusses an investigation of combinations of text preprocessing techniques. The literature study is presented in Section II. The methodology of this study is described in Section III. The results and analysis are provided in Section IV, and concluding remarks are given in Section V.

                                                                                                                                                        
II. Literature Study

A. Preprocessing

In intent recognition, noise removal is used to remove unnecessary characters (e.g., symbols, numbers, and extra white space) [1]. Symbols and numbers are removed because they are meaningless to any specific class or intent, and stray whitespace can disrupt the next step.
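
As an illustration, below is a minimal noise-removal sketch in Python; the paper does not specify its implementation, so the regular expressions here are an assumption:

```python
import re

def remove_noise(text: str) -> str:
    """Drop symbols and digits, then collapse repeated whitespace."""
    text = re.sub(r"[^a-zA-Z\s]", " ", text)  # keep letters and spaces only
    return re.sub(r"\s+", " ", text).strip()  # normalize whitespace

print(remove_noise("Bot, apa informasi 2 hari yang lalu?"))
# -> Bot apa informasi hari yang lalu
```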

Case folding is the procedure of converting all characters to lowercase, on the assumption that the uppercase and lowercase forms of a word have the same meaning [6]. Using this technique, the corpus size can be decreased.

Another technique used in intent recognition is tokenization. Tokenization splits a text into meaningful features such as words or phrases [8].
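
A minimal sketch combining case folding with whitespace tokenization follows; this is an assumed implementation rather than the paper's own code:

```python
def case_fold_and_tokenize(text: str) -> list[str]:
    """Case folding (lowercase) followed by whitespace tokenization."""
    return text.lower().split()

print(case_fold_and_tokenize("Sebutkan jadwal praktikum yang diajar oleh Kang Budi"))
# -> ['sebutkan', 'jadwal', 'praktikum', 'yang', 'diajar', 'oleh', 'kang', 'budi']
```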

Another basic preprocessing step is stop word removal. Its purpose is to prevent meaningless words (e.g., conjunctions and prepositions) from being associated with the intents that contain them [8]. In other words, stop word removal drops words that are irrelevant to every category.
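
For Bahasa Indonesia, one option is the PySastrawi library; its use here is an assumption, since the paper does not name a stop word list or tool:

```python
from Sastrawi.StopWordRemover.StopWordRemoverFactory import StopWordRemoverFactory

remover = StopWordRemoverFactory().create_stop_word_remover()

print(remover.remove("sebutkan jadwal praktikum yang diajar oleh kang budi"))
# e.g. -> sebutkan jadwal praktikum diajar kang budi
```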

The goal of stemming is to reduce derived words to their root form. It is commonly used because derived words are semantically similar to their roots. Several approaches to this preprocessing step are affix removal stemming, n-gram stemming, and table lookup stemming [1].
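
A stemming sketch for Bahasa Indonesia, again assuming PySastrawi (which implements affix removal stemming); the paper does not name its stemmer:

```python
from Sastrawi.Stemmer.StemmerFactory import StemmerFactory

stemmer = StemmerFactory().create_stemmer()

print(stemmer.stem("sebutkan"))  # -> sebut
print(stemmer.stem("diajar"))    # -> ajar
```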

B. Evaluation Metrics

1. Accuracy

Accuracy = (TP + TN) / (P + N)

where

TP : True Positive (positive samples correctly predicted as positive)
TN : True Negative (negative samples correctly predicted as negative)
P  : Positive (the number of actual positive samples)
N  : Negative (the number of actual negative samples)

2. Precision

Precision = TP / (TP + FP)

where

TP : True Positive (positive samples correctly predicted as positive)
FP : False Positive (negative samples incorrectly predicted as positive)

3. Recall

Recall = TP / (TP + FN)

where

TP : True Positive (positive samples correctly predicted as positive)
FN : False Negative (positive samples incorrectly predicted as negative)
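
As a brief sketch, these metrics can be computed with scikit-learn (an assumed tool; the paper does not state its tooling), using macro averaging since the task has five intent classes:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

y_true = ["lecturer", "schedule", "grade", "grade"]     # actual intents
y_pred = ["lecturer", "schedule", "grade", "schedule"]  # predicted intents

print(accuracy_score(y_true, y_pred))                    # 0.75
print(precision_score(y_true, y_pred, average="macro"))  # mean per-class precision
print(recall_score(y_true, y_pred, average="macro"))     # mean per-class recall
```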

                                                                                                                                                           
III. Methodology

A. Dataset

The dataset was obtained from students of Universitas Padjadjaran and consists of 1320 sample questions.

Table 1 Sample Dataset

Intent      | Sample Question
------------+------------------------------------------------------
Lecturer    | astprak pemrograman web
            | Siapa aja asprak praktikum pw?
            | Siapa astprak hari ini ?
Schedule    | Waktu kelar praktikum
            | Setelah praktikum ini praktikum apa
            | Jadwal kuis uts uas
Information | Liat info praktikum sisber
            | perlihatkan tugas yg harus dikumpulkan
            | Buat praktikum ini ada berita apa aja?
Assignment  | Kapan kah deadline tugas dikumpulkan?
            | Tugas saya yang belum lengkap
            | apa tugas minggu ini
Grade       | perlihatkan nilai uts/uas/kuis praktikum saya
            | Mau liat nilai praktikum yang kemarin dong, bisa?
            | Bisakah saya lihat nilai praktikum (matakuliah ini)?

There are 263 questions for the lecturer intent, 265 for the schedule intent, 262 for the information intent, 267 for the assignment intent, and 263 for the grade intent.

B. Preprocessing Method

In this study, four combinations of five preprocessing techniques are evaluated: noise removal (NR), case folding (CF), tokenization (TKN), stop word removal (SWR), and stemming (STM). All combinations are shown in Table 2.

Table 2 Combinations of preprocessing techniques

Model | Preprocessing
------+---------------------------
1     | NR + CF + TKN
2     | NR + CF + TKN + SWR
3     | NR + CF + TKN + STM
4     | NR + CF + TKN + SWR + STM
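
The four models can be expressed as one parameterized pipeline. The sketch below reuses the hypothetical helpers sketched in Section II (remove_noise, case_fold_and_tokenize, remover, stemmer) and is an illustration, not the paper's code:

```python
def preprocess(text, use_swr=False, use_stm=False):
    """NR + CF + TKN always; SWR and STM switched on per model (Table 2)."""
    tokens = case_fold_and_tokenize(remove_noise(text))   # NR + CF + TKN
    if use_swr:                                           # Models 2 and 4
        tokens = remover.remove(" ".join(tokens)).split()
    if use_stm:                                           # Models 3 and 4
        tokens = [stemmer.stem(t) for t in tokens]
    return tokens

MODELS = {1: {},
          2: {"use_swr": True},
          3: {"use_stm": True},
          4: {"use_swr": True, "use_stm": True}}

# e.g. preprocess("Siapa astprak hari ini ?", **MODELS[4])
```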

C. Classification Algorithm

The Convolutional Neural Network is a modern algorithm that achieves better results in text classification than traditional ones (e.g., Naïve Bayes and SVM) [3]. Several architectures can be used to classify text; the Convolutional Neural Network architecture used in this study is shown in Figure 1.

Figure 1 Convolutional Neural Network Architecture

The architecture starts with preprocessing, which prepares the data for processing. Word2Vec converts the preprocessed data into a vector-based word representation. The vectors are then passed through a convolution function and activated with a rectified linear unit (ReLU). Max pooling is used to summarize the data. These steps are repeated twice, and the output is activated with softmax, which is well suited to multi-class classification.
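
A minimal Keras sketch of this architecture follows; Keras itself is an assumed framework, and the vocabulary size, sequence length, embedding dimension, filter counts, and kernel sizes are assumed values, since the paper does not report its hyperparameters:

```python
from tensorflow.keras import layers, models

VOCAB_SIZE, EMBED_DIM, MAX_LEN, NUM_INTENTS = 537, 100, 20, 5  # assumed values

model = models.Sequential([
    layers.Input(shape=(MAX_LEN,)),
    # In the paper, Word2Vec provides the word vectors; this Embedding layer
    # stands in for it and could be seeded with Word2Vec weights (assumption).
    layers.Embedding(VOCAB_SIZE, EMBED_DIM),
    layers.Conv1D(64, 3, activation="relu"),          # convolution + ReLU, pass 1
    layers.MaxPooling1D(2),                           # max pooling, pass 1
    layers.Conv1D(64, 3, activation="relu"),          # convolution + ReLU, pass 2
    layers.MaxPooling1D(2),                           # max pooling, pass 2
    layers.GlobalMaxPooling1D(),
    layers.Dense(NUM_INTENTS, activation="softmax"),  # one unit per intent class
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```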

D. Evaluation

Accuracy, precision, and recall are common metrics for evaluating classification performance. All of them are measured using cross-validation, since this method is closer to the real-world case than the traditional approach of splitting train and test sets by a fixed percentage [9], [10]. Ten folds are used for the cross-validation.
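
A sketch of such a 10-fold evaluation with scikit-learn (an assumed tool), averaging the three metrics across folds and assuming a scikit-learn-style classifier factory:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score, precision_score, recall_score

def cross_validate(build_model, X, y, folds=10):
    """Mean accuracy, precision, and recall over stratified k folds."""
    scores = []
    for train_idx, test_idx in StratifiedKFold(n_splits=folds).split(X, y):
        clf = build_model()                     # fresh model for every fold
        clf.fit(X[train_idx], y[train_idx])
        pred = clf.predict(X[test_idx])
        scores.append([accuracy_score(y[test_idx], pred),
                       precision_score(y[test_idx], pred, average="macro"),
                       recall_score(y[test_idx], pred, average="macro")])
    return np.mean(scores, axis=0)              # (accuracy, precision, recall)
```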

E. Resource

This study uses the resources listed in Table 3 for evaluating model performance and processing time.

Table 3 Resource for Evaluation

Resource      | Detail
--------------+-----------------------------
Processor     | Intel Core i5 5250U 1.6 GHz
Memory        | 4 GB DDR3
Graphics Card | Intel HD Graphics 6000

                                                                                                                                                  
IV. Result and Discussion

In the first preprocessing stage, numbers and symbols are filtered out of the raw dataset, as shown in Figure 2.

Plain Text                                            | Preprocessed
------------------------------------------------------+------------------------------------------------------
Sebutkan jadwal praktikum yang diajar oleh Kang Budi! | Sebutkan jadwal praktikum yang diajar oleh Kang Budi
Bot, apa informasi 2 hari yang lalu?                  | Bot apa informasi hari yang lalu

Figure 2 Noise Removal

In the second stage, all uppercase characters are converted to lowercase, on the assumption that both forms carry the same meaning.

Plain Text                                           | Preprocessed
-----------------------------------------------------+-----------------------------------------------------
Sebutkan jadwal praktikum yang diajar oleh Kang Budi | sebutkan jadwal praktikum yang diajar oleh kang budi
Bot apa informasi hari yang lalu                     | bot apa informasi hari yang lalu

Figure 3 Case Folding

After case folding, tokenization is applied to split each sentence into a sequence of words, so the data become a collection of word sequences.

Plain Text                                           | Preprocessed
-----------------------------------------------------+---------------------------------------------------------------
sebutkan jadwal praktikum yang diajar oleh kang budi | [sebutkan, jadwal, praktikum, yang, diajar, oleh, kang, budi]
bot apa informasi hari yang lalu                     | [bot, apa, informasi, hari, yang, lalu]

Figure 4 Tokenization

The next preprocessing step removes the prepositions and conjunctions that appear across all intents.

Plain Text                                                    | Preprocessed
--------------------------------------------------------------+---------------------------------------------------
[sebutkan, jadwal, praktikum, yang, diajar, oleh, kang, budi] | [sebutkan, jadwal, praktikum, diajar, kang, budi]
[bot, apa, informasi, hari, yang, lalu]                       | [bot, apa, informasi, hari, lalu]

Figure 5 Stop Word Removal

The last preprocessing step is stemming, which changes words from their derived form into the root, as shown in Figure 6, for instance sebutkan into sebut and diajar into ajar.

Plain Text                                        | Preprocessed
--------------------------------------------------+----------------------------------------------
[sebutkan, jadwal, praktikum, diajar, kang, budi] | [sebut, jadwal, praktikum, ajar, kang, budi]
[bot, apa, informasi, hari, lalu]                 | [bot, apa, informasi, hari, lalu]

Figure 6 Stemming

Table 4 shows the total corpus size after preprocessing under each scenario of Table 2.

Table 4 Total Corpus after Preprocessing

Model | Total Corpus
------+-------------
1     | 537
2     | 510
3     | 443
4     | 420
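
For reference, a short sketch of how the corpus size in Table 4 (the distinct-token vocabulary) can be counted, reusing the hypothetical preprocess helper and MODELS table sketched above:

```python
def corpus_size(questions, **options):
    """Number of distinct tokens across the whole preprocessed dataset."""
    vocab = set()
    for question in questions:
        vocab.update(preprocess(question, **options))
    return len(vocab)

# e.g. corpus_size(questions, **MODELS[4])  # expected to approach 420 here
```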

The results show that stemming reduces the corpus more significantly than stop word removal: stemming removes 94 entries from the corpus (Model 1 vs. Model 3), while stop word removal removes only 27 (Model 1 vs. Model 2). According to [1], [6], a smaller corpus does not always yield better performance. To compare, each preprocessing combination is trained with the CNN architecture in Figure 1 and evaluated using cross-validation. The results are shown in Table 5.

Table 5 Performance Evaluation

Model | Precision  | Recall     | Accuracy
------+------------+------------+-----------
1     | 0.92556128 | 0.89416158 | 0.90648012
2     | 0.9257851  | 0.8972754  | 0.90961498
3     | 0.93029693 | 0.9054597  | 0.91471701
4     | 0.93845744 | 0.90849989 | 0.91982958

In this case of recognizing Bahasa Indonesia intents, the best result comes from the combination of all preprocessing techniques (noise removal, case folding, tokenization, stop word removal, and stemming), with 93.84% precision, 90.85% recall, and 91.98% accuracy. Considered together with Table 4, stemming contributes a larger performance improvement than stop word removal, as Model 3 outperforms Model 2.

For production use, that is, intent recognition in a chat bot, processing time is a parameter that cannot be ignored. The more techniques are applied, the more time is spent preprocessing the dataset.

Figure 7 Preprocessing Time

Based on Figure 7, the combination of all preprocessing techniques takes the longest time. Model 4 takes 35.71 ms to preprocess the 1320 rows of the dataset, a significant increase caused by stop word removal. This is equivalent to 0.027 ms per sentence, compared to 0.015 ms per sentence for Model 1.
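
A sketch of how this preprocessing time can be measured, using Python's standard perf_counter and the hypothetical preprocess helper above:

```python
import time

def preprocessing_time_ms(questions, **options):
    """Wall-clock milliseconds to preprocess the whole dataset."""
    start = time.perf_counter()
    for question in questions:
        preprocess(question, **options)
    return (time.perf_counter() - start) * 1000.0

# e.g. preprocessing_time_ms(questions, **MODELS[4])  # Model 4
```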

                                                                                                                                                             
V. Conclusions

When deciding which preprocessing to use, several configurations can be considered for implementing intent recognition. In this study, the combination of noise removal, case folding, tokenization, stop word removal, and stemming improves the classification performance of the convolutional neural network and shrinks the corpus dictionary. However, it also increases the time needed to preprocess the raw text. Limited resources can therefore be the deciding consideration when choosing the model for a chat bot production environment.

References

[1] W. Etaiwi and G. Naymat, "The Impact of applying Different Preprocessing Steps on Review Spam Detection," Procedia Comput. Sci., vol. 113, pp. 273–279, 2017.

[2] R. Wongso, F. A. Luwinda, B. C. Trisnajaya, O. Rusli, and Rudy, "News Article Text Classification in Indonesian Language," Procedia Comput. Sci., vol. 116, pp. 137–143, 2017.

[3] S. Liao, J. Wang, R. Yu, K. Sato, and Z. Cheng, "CNN for situations understanding based on sentiment analysis of twitter data," Procedia Comput. Sci., vol. 111, pp. 376–381, 2017.

[4] V. Srividhya and R. Anitha, "Evaluating preprocessing techniques in text categorization," Int. J. Comput. Sci. Appl., pp. 49–51, 2010.

[5] K. Morik and M. Scholz, "The MiningMart Approach to Knowledge Discovery in Databases," pp. 47–65, 2003.

[6] A. K. Uysal and S. Gunal, "The impact of preprocessing on text classification," Inf. Process. Manag., vol. 50, no. 1, pp. 104–112, 2014.

[7] P. Chandrasekar and K. Qian, "The Impact of Data Preprocessing on the Performance of a Naive Bayes Classifier," 2016 IEEE 40th Annu. Comput. Softw. Appl. Conf., pp. 618–619, 2016.

[8] R. Feldman and J. Sanger, The Text Mining Handbook. 2006.

[9] R. Kohavi, "A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection," Proc. IJCAI'95, pp. 1137–1145, 1995.

[10] A. Krogh and J. Vedelsby, "Neural Network Ensembles, Cross Validation, and Active Learning," NIPS, pp. 231–238, 1995.

 
