kafka_s_metamorphosis_visualized

Text mining is an interdisciplinary research field drawing on techniques from computer science, linguistics, and statistics. I will use the tm package to perform a simple text analysis of Kafka's Metamorphosis.
The material is always under development! This document will change (and hopefully improve) regularly.

Large-scale processing of text corpora requires integrated frameworks for parallel/distributed computing (e.g., Hadoop). Parallel/distributed computing is now easier than ever in R, via hive, nws, Rmpi, snow, and others.

Create a directory with the text files, i.e. the corpus we are going to work with. Remember that the term-document matrix built from a corpus is typically sparse: a sparse matrix is one in which most of the elements are zero. The more terms the corpus contains, the sparser the matrix becomes. I split Metamorphosis into four text files and stored them in “./corpus/txt”.
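A minimal sketch of how such a split could be done in R. The file name "metamorphosis.txt" is hypothetical; here a stand-in vector of 100 lines is used in its place, so the snippet runs on its own.

```r
## Minimal sketch of the split; `txt` stands in for the real novella
## (e.g. txt <- readLines("metamorphosis.txt"), a hypothetical file name).
txt <- sprintf("line %d", 1:100)
## cut() assigns each line index to one of four equal-width bins
chunks <- split(txt, cut(seq_along(txt), 4, labels = FALSE))
out <- "./corpus/txt"
dir.create(out, recursive = TRUE, showWarnings = FALSE)
for (i in seq_along(chunks))
  writeLines(chunks[[i]], file.path(out, sprintf("meta%d.txt", i)))
```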

Load the libraries and the corpus.

library(tm)
 
kafka <- Corpus(DirSource("./corpus/txt"))


I inspect the elements of the corpus.

> inspect(kafka)
 
<<VCorpus>>
Metadata:  corpus specific: 0, document level (indexed): 0
Content:  documents: 4
 
[[1]]
<<PlainTextDocument>>
Metadata:  7
Content:  chars: 20680
 
[[2]]
<<PlainTextDocument>>
Metadata:  7
Content:  chars: 34218
 
[[3]]
<<PlainTextDocument>>
Metadata:  7
Content:  chars: 48000
 
[[4]]
<<PlainTextDocument>>
Metadata:  7
Content:  chars: 14319


I apply simple transformations to the corpus via map functions. Typically, Map tasks are applied to partitioned data: the computational operation (Map) is applied to all elements in parallel.

## simple transformation on special characters
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
kafka <- tm_map(kafka, toSpace, "/|@|\\|")
 
## lower case, numbers, punctuation
kafka <- tm_map(kafka, content_transformer(tolower))
kafka <- tm_map(kafka, removeNumbers)
kafka <- tm_map(kafka, removePunctuation)

We can invoke the built-in function stopwords(). Let's have a look at the character vector first.

stopwords("english")
 
> stopwords("english")
  [1] "i"          "me"         "my"         "myself"     "we"        
  [6] "our"        "ours"       "ourselves"  "you"        "your"      
 [11] "yours"      "yourself"   "yourselves" "he"         "him"       
 [16] "his"        "himself"    "she"        "her"        "hers"      
 [21] "herself"    "it"         "its"        "itself"     "they"      
 [26] "them"       "their"      "theirs"     "themselves" "what"      
 [31] "which"      "who"        "whom"       "this"       "that"      
 [36] "these"      "those"      "am"         "is"         "are"       
 [41] "was"        "were"       "be"         "been"       "being"     
 
 
kafka <- tm_map(kafka, removeWords, stopwords("english")) 
 
## or remove a custom vector of words, e.g. c("even")
kafka <- tm_map(kafka, removeWords, c("even"))
 
## whitespace
kafka <- tm_map(kafka, stripWhitespace)
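The effect of the whole cleaning chain can be checked on a tiny in-memory corpus; the one-sentence document below is a made-up example, not taken from the novella.

```r
library(tm)

## a one-document toy corpus to verify the cleaning chain
toy <- VCorpus(VectorSource("Gregor Samsa woke, one MORNING, in 1912!"))
toy <- tm_map(toy, content_transformer(tolower))
toy <- tm_map(toy, removeNumbers)
toy <- tm_map(toy, removePunctuation)
toy <- tm_map(toy, removeWords, stopwords("english"))
toy <- tm_map(toy, stripWhitespace)
content(toy[[1]])  # lower-cased, no digits, no punctuation, stop words gone
```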

Now let's work on term frequencies. The document-term matrix has to be built first.

dtm <- DocumentTermMatrix(kafka)
freq <- colSums(as.matrix(dtm))
length(freq)
ord <- order(freq)
 
## the tail?
freq[tail(ord)]
 
> mother   door sister father   room gregor 
      90     97    101    102    133    298 
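The same frequency computation can be traced on a two-document toy corpus (the documents are made-up): the column sums of the document-term matrix are the corpus-wide term counts.

```r
library(tm)

## two toy documents (hypothetical content)
toy <- VCorpus(VectorSource(c("gregor door door", "gregor gregor room")))
dtm_toy <- DocumentTermMatrix(toy)
## column sums = how often each term occurs across the whole corpus
freq_toy <- sort(colSums(as.matrix(dtm_toy)), decreasing = TRUE)
freq_toy  # gregor 3, door 2, room 1
```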

Sparse terms should be excluded from our analysis. The intuition is clear: removeSparseTerms() returns a term-document matrix from which those terms are removed that have at least the given percentage of empty elements (i.e., documents in which the term occurs 0 times). In other words, the resulting matrix contains only terms whose sparse factor is less than the sparse argument.

dim(dtm)
dtms <- removeSparseTerms(dtm, 0.6)
inspect(dtm)
 
>  Docs        charwoman chase cheaper check cheek cheer cheerio chees chest chew
  meta.txt          0     0       0     0     0     1       0     0     1    0
  meta2.txt         0     0       0     0     0     0       0     3     3    0
  meta3.txt         6     4       0     0     2     0       0     0    10    1
  meta4.txt         3     0       1     2     1     0       1     0     0    0
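The sparsity cut-off can be illustrated on three made-up toy documents: with sparse = 0.6, any term that is empty in more than 60% of the documents is dropped.

```r
library(tm)

## three toy documents; only "gregor" appears in all of them
toy <- VCorpus(VectorSource(c("gregor door", "gregor window", "gregor room")))
dtm_toy <- DocumentTermMatrix(toy)
## "door", "window", "room" are empty in 2/3 of documents (sparsity 0.67 > 0.6)
dtms_toy <- removeSparseTerms(dtm_toy, 0.6)
Terms(dtms_toy)  # only "gregor" survives
```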

Finding words which 'associate' together. Here, we specify the document-term matrix to use, the term we want to find associates for, and the lowest acceptable correlation with that term. This returns a vector of terms which are associated with “gregor” at the given correlation limit (here 0.11) or more, reported in descending order of correlation.

findFreqTerms(dtm, lowfreq=60)
findAssocs(dtm, "gregor", corlimit=0.11)
 
$gregor
>           eaten           father             fled             flew 
            0.99             0.99             0.99             0.99 
            fond             full             girl          heavily 
            0.99             0.99             0.99             0.99 
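Under the hood, findAssocs() computes Pearson correlations between the per-document count columns of the matrix. A toy corpus (made-up documents) in which "gregor" and "father" co-vary across documents illustrates this:

```r
library(tm)

## four toy documents; "gregor" and "father" counts rise and fall together
toy <- VCorpus(VectorSource(c("gregor father father",
                              "gregor gregor father room",
                              "room window",
                              "gregor father")))
dtm_toy <- DocumentTermMatrix(toy)
## terms whose per-document counts correlate with "gregor" at >= 0.4
findAssocs(dtm_toy, "gregor", corlimit = 0.4)
```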

Now let's plot the result.

## install Rgraphviz (needed for the graph below)
# source("http://bioconductor.org/biocLite.R")
# biocLite("Rgraphviz")
 
## plot correlations between frequent terms
plot(dtm, terms=findFreqTerms(dtm, lowfreq=80)[1:4], corThreshold=0.40)

The resulting graph is self-explanatory: nodes are the frequent terms, and edges connect terms whose correlation exceeds the threshold.

kafka_s_metamorphosis_visualized.txt · Last modified: 2015/07/24 12:49 by vincenzo