It is estimated that a major part of usable business information is
unstructured, often in the form of text data. Text mining provides a collection
of methods that help us derive actionable insights from these data.
The main package for text mining tasks in R is tm. The structure for managing
documents in tm is the Corpus, which represents a collection of text documents.
More generally, a corpus is a large body of natural language text used for
accumulating statistics on natural language; the plural is corpora. A lexicon
is a collection of information about the words of a language, including the
lexical categories to which they belong. A lexicon is usually structured as a
collection of lexical entries; for example, the same word may have separate
entries for its use as a verb, a noun, and an adjective.
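As a minimal sketch of what a corpus looks like in practice, the snippet below builds one from a small character vector. The three example sentences are made up for illustration, and VCorpus() is used here as one of the corpus constructors in tm; Corpus() can be used in the same way.

```r
library(tm)

# Three short example documents (made up for illustration).
texts <- c("Text mining turns raw text into insight.",
           "A corpus is a collection of documents.",
           "The tm package manages corpora in R.")

# VectorSource() treats each element of the vector as one document.
docs_corpus <- VCorpus(VectorSource(texts))

n_docs <- length(docs_corpus)            # number of documents in the corpus
first_doc <- content(docs_corpus[[1]])   # text of the first document

print(n_docs)
print(first_doc)
```

Each document is accessed with double brackets, and content() returns its text.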
Transformations:
Once we have a corpus, we typically want to modify the documents in it, e.g.,
by stemming or stopword removal. In tm, all this functionality is subsumed
into the concept of a transformation. Transformations are done via the
tm_map() function, which applies (maps) a function to all elements of the
corpus. Basically, each transformation works on a single text document, and
tm_map() just applies it to every document in the corpus.
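To illustrate how tm_map() maps a function over documents, here is a small sketch of a custom transformation. The helper name to_space and the two example strings are assumptions made for this illustration; content_transformer() and tm_map() are the tm functions doing the work.

```r
library(tm)

# Two made-up documents containing a "/" that we want to turn into a space.
docs <- VCorpus(VectorSource(c("price/earnings ratio", "input/output table")))

# content_transformer() wraps a plain string function so that tm_map()
# can apply it to the text of each document.
to_space <- content_transformer(function(x, pattern) gsub(pattern, " ", x))

# Extra arguments to tm_map() (here the pattern "/") are passed on to the function.
docs <- tm_map(docs, to_space, "/")

# Collect the transformed text of every document into a character vector.
cleaned <- vapply(seq_along(docs), function(i) content(docs[[i]]), character(1))
print(cleaned)
```

The same pattern works for any function that takes a character vector and returns one.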
Eliminating Extra Whitespace
> sample <- tm_map(sample, stripWhitespace)
Convert to Lower Case
> sample <- tm_map(sample, content_transformer(tolower))
Remove Stopwords
> sample <- tm_map(sample, removeWords, stopwords("english"))
Stemming is done by:
> sample <- tm_map(sample, stemDocument)
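The transformations above can be chained on a tiny corpus to see their combined effect. This is only a sketch: the example sentence is made up, and the order used here (lower case, then stopwords, then whitespace) is one reasonable choice, since the English stopword list is lower case and removing words leaves gaps behind. Stemming is applied only if the SnowballC package is installed.

```r
library(tm)

raw <- "The   quick brown   Fox is running"   # made-up example text
docs <- VCorpus(VectorSource(raw))

docs <- tm_map(docs, content_transformer(tolower))        # lower-case first
docs <- tm_map(docs, removeWords, stopwords("english"))   # then drop stopwords
docs <- tm_map(docs, stripWhitespace)                     # collapse leftover gaps

# stemDocument() relies on SnowballC, so it is applied only when available.
if (requireNamespace("SnowballC", quietly = TRUE)) {
  docs <- tm_map(docs, stemDocument)
}

cleaned <- content(docs[[1]])
tokens <- strsplit(trimws(cleaned), " +")[[1]]   # split the result into words
print(tokens)
```

The stopwords "the" and "is" no longer appear among the tokens, while the content words remain.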
------------------------------------------------------------------------------
Wordcloud_example_1:
Step 1: Install package "tm"
Step 2: Install package "RColorBrewer"
Step 3: Install package "wordcloud"
Step 4: Load libraries
Step 5: Execute the R script:
-------------------------------------------------------------------------------------------------------
my_data_file = readLines("/home/spb/data/
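library(tm)
# Treat each line of the file as one document
myCorpus = Corpus(VectorSource(my_data_file))
# Wrap base functions such as tolower in content_transformer() so the corpus structure is preserved
myCorpus = tm_map(myCorpus, content_transformer(tolower))
myCorpus = tm_map(myCorpus, removePunctuation)
myCorpus = tm_map(myCorpus, removeNumbers)
myCorpus = tm_map(myCorpus, removeWords, stopwords("english"))
# wordLengths = c(1, Inf) keeps terms of any length (current tm versions use this instead of minWordLength)
myTDM = TermDocumentMatrix(myCorpus, control = list(wordLengths = c(1, Inf)))
m = as.matrix(myTDM)
# Total frequency of each term across all documents, most frequent first
v = sort(rowSums(m), decreasing = TRUE)
library(wordcloud)
set.seed(4363)  # fixed seed so the word layout is reproducible
wordcloud(names(v), v, min.freq = 50)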
-----------------------------------------------------------------------------------------------------
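To make the term-document matrix step of the script more concrete, here is a small sketch on a made-up two-document corpus. It shows how the row sums of the matrix become the word frequencies that wordcloud() receives. The documents and variable names are assumptions for illustration only.

```r
library(tm)

# Two tiny made-up documents.
docs <- VCorpus(VectorSource(c("apple banana apple", "banana cherry")))

# Rows are terms, columns are documents, cells hold term counts.
tdm <- TermDocumentMatrix(docs)
m <- as.matrix(tdm)

# Summing each row gives the total count of a term across all documents.
freq <- sort(rowSums(m), decreasing = TRUE)
print(m)
print(freq)
```

The names of freq are the words and its values are their counts, which is exactly the pair of arguments passed as names(v) and v in the script.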
Step 6: Wordcloud visualization:
---------------------------------------------------------------------------------------
Wordcloud_example_2:
wordcloud(names(v), v, min.freq = 50, colors = brewer.pal(7, "Dark2"), random.order = TRUE)
-------------------------------------
Wordcloud_example_3:
wordcloud(names(v), v, min.freq = 50, colors = brewer.pal(7, "Dark2"), random.order = FALSE)
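The two examples differ only in random.order: with TRUE the words are plotted in random order, while with FALSE they are plotted in decreasing frequency, so the most frequent words tend to end up near the centre. The sketch below is a self-contained variant that can be run without the data file. The word frequencies are made up, the output file name is hypothetical, and writing to a PDF in the temporary directory is just one convenient way to save the plot.

```r
library(wordcloud)   # also attaches RColorBrewer, which provides brewer.pal()

# Made-up word frequencies standing in for the vector v from the script.
word_freq <- c(text = 12, mining = 9, corpus = 7, document = 6,
               term = 5, matrix = 4, cloud = 3, stem = 2)

# Seven colours from the "Dark2" palette, applied from least to most frequent.
palette_dark <- brewer.pal(7, "Dark2")

# Hypothetical output file in the session's temporary directory.
out_file <- file.path(tempdir(), "wordcloud_example.pdf")

pdf(out_file, width = 7, height = 7)
set.seed(4363)   # fixed seed so the layout is reproducible
wordcloud(names(word_freq), word_freq, min.freq = 1,
          colors = palette_dark, random.order = FALSE)
dev.off()
```

Switching random.order to TRUE in this sketch gives the scattered layout of example 2.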