Introduction
Text mining is the conversion of large volumes of text data or documents into useful numbers. It extracts meaningful information from raw data by applying algorithms that find patterns in the data. Text mining is used for unstructured or semi-structured data such as emails and text messages. For example, it is used to filter out spam emails by identifying text that is common in such emails.
After information has been retrieved from the data or documents, it is used in data mining projects (clustering and factoring, graphics, predictive data mining). Text mining is the same as data mining except that text mining works on raw or unstructured text such as emails, HTML, or full-text documents, while data mining works on structured data. Common aspects of text mining include removing certain keywords such as “the”, punctuation marks, etc. from the important data to improve search quality.
Removing such words is covered under PREPROCESSING TEXT. Text mining is used for various educational, research, and industrial purposes, such as social media, research papers, and sentiment analysis.

PREPROCESSING TEXT

NEED FOR PREPROCESSING TEXT
1) To reduce the size of the text document:
   i) To eliminate words according to their frequency.
   ii) To eliminate common words or stop words like “the”, “and”, etc.
2) To improve the efficiency and performance of the information retrieval system in text mining.
3) To save the administrator a significant amount of time and storage space.

WAYS OF PREPROCESSING TEXT
1) Tokenization
Tokenization is the process of breaking textual content into meaningful words, terms, or symbols, which are known as tokens. Tokens are separated using full stops, commas, and whitespace. Tokenization depends on the language: for English it is a simple task, while for languages such as Chinese and Korean it is difficult to perform.
Eg: “TEXT MINING IS THE PROCESS OF RETRIEVAL OF IMPORTANT INFORMATION FROM UNSTRUCTURED DATA”.
Output: “TEXT, MINING, IS, THE, PROCESS, OF, RETRIEVAL, OF, IMPORTANT, INFORMATION, FROM, UNSTRUCTURED, DATA”.

2) Stop Word Removal
The major aim of stop word removal is to reduce the dimensionality of the text by removing prepositions, articles, and pronouns that are not necessary for text mining. This reduces the text data significantly and helps in optimizing the data. Lists of stop words are available online. Another way of building a stop word list is based on the frequency of words across a number of documents.
Some methods of stop word removal are:
i) Term Based Random Sampling (TBRS)
ii) Zipf’s Law
iii) Mutual Information Method
iv) Based on a Precompiled List

Application of Text Mining
The main objective of text mining is to reduce the time spent and to filter out unnecessary data from the main keywords or important data. It is used to provide better services to users by giving proper feedback.
It is also used by businesses to analyze their consumer base and provide services accordingly by targeting potential customers.

1) SPAM IDENTIFICATION
As filtering based on IP address is not sufficient, text mining techniques are used to detect salting. Salting is the addition of certain information to a message to make it look like original or official content.
Email service providers use text mining to separate spam and promotional messages from important messages, thus saving users time and resources. It can also be used to filter messages further according to the suitable age group. It provides protection against phishing and spamming.

2) SENTIMENT ANALYSIS
Sentiment analysis is used to identify positive, negative, or neutral reviews about a subject. Consider deciding whether to watch a TV series based on the reviews of viewers. The text of each review is analyzed, and the keywords used identify the emotion of the reviewer, so that the review can be marked as positive or negative. Sentiment analysis also looks at individual words and phrases to identify how negative or positive they are. Consider this statement: “I LOVED THE NEW MOBILE.
BUT IT IS VERY EXPENSIVE AND DOES NOT HAVE GREAT BATTERY LIFE”. According to the first sentence the customer seems impressed, but overall the customer has a negative impression of the product. Sentiment analysis gives an indication about products: for example, coming across the word ROTTEN while reading reviews of a hotel creates a negative impression of that hotel.

3) IN BIOMEDICAL DOMAINS
Year by year the amount of research in medical fields is increasing significantly, so the need for text mining is evident. Text mining is used for quickly sorting out the necessary data from the medical records that are available. In fields like cancer treatment, text mining means improving diagnostics, treatment, and prevention of cancer by mining databases. Another important use of text mining is mining EHRs (Electronic Health Records) to search a patient's previous records of certain diseases and medical history.
Text mining is also used for comparing gene markers with previous records and for identifying different patterns in genes to check for diseases.

4) SOCIAL MEDIA PLATFORMS
Social media are a rich source of unstructured data. Social media is used for connecting people, i.e.
interactions and conversations. Some well-known platforms are Twitter, Facebook, and Orkut. Data can be gathered using APIs.
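
Once text has been gathered, the tokenization and stop word removal steps described under WAYS OF PREPROCESSING TEXT can be applied to it. The following is a minimal Python sketch under stated assumptions, not a definitive implementation: the stop word list is a tiny illustrative one (real precompiled lists are much longer), the function names tokenize, remove_stop_words, and frequent_words are hypothetical, and the regular-expression tokenizer only suits whitespace-delimited languages such as English.

```python
import re
from collections import Counter

# Tiny illustrative stop word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "is", "of", "from", "and", "a", "an", "to", "in"}


def tokenize(text):
    """Lowercase the text and split it into word tokens, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not in the stop word list."""
    return [token for token in tokens if token not in stop_words]


def frequent_words(documents, min_fraction=0.8):
    """Return words that occur in at least min_fraction of the documents.

    This is a simple frequency-based way to propose stop word candidates.
    """
    document_counts = Counter()
    for document in documents:
        # Count each word once per document, however often it repeats.
        document_counts.update(set(tokenize(document)))
    threshold = min_fraction * len(documents)
    return {word for word, count in document_counts.items() if count >= threshold}


sentence = ("TEXT MINING IS THE PROCESS OF RETRIEVAL OF IMPORTANT "
            "INFORMATION FROM UNSTRUCTURED DATA")
tokens = tokenize(sentence)
print(remove_stop_words(tokens))
# ['text', 'mining', 'process', 'retrieval', 'important',
#  'information', 'unstructured', 'data']
```

The precompiled list and the frequency-based list can be combined by passing the union of the two sets as the stop_words argument.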