kafka_s_metamorphosis_visualized

Differences

This shows you the differences between two versions of the page.

kafka_s_metamorphosis_visualized [2015/07/24 12:45]
vincenzo
kafka_s_metamorphosis_visualized [2015/07/24 12:49] (current)
vincenzo
Line 9: Line 9:
 \\
 \\
-Create a directory with text files, i.e. the corpus we are going to work with. Remember that a sparse matrix is defined as //a matrix in which most of the elements are **zero**//. The more documents and distinct terms the corpus holds, the sparser its term matrix tends to be. Sparsity.
-
-I split the Metamorphosis into 4 text files and stored them in   .
-
-Load the library and the corpus.
-
-<code python
+Create a directory with text files, i.e. the corpus we are going to work with. Remember that a sparse matrix is defined as //a matrix in which most of the elements are **zero**//. The more documents and distinct terms the corpus holds, the sparser its term matrix tends to be.
+I split the Metamorphosis into 4 text files and stored them in ''"./corpus/txt"''.
+
+Load the libraries and the corpus.
+
+<code java
 library(tm)
 library(nplr)
Line 24: Line 23:
 \\
 I inspect the elements of the corpus.
-<code python>
+<code java>
 >> inspect(kafka)
  
Line 56: Line 55:
 I apply simple transformations to the corpus using map functions. Typically, Map tasks are applied to partitioned data: the same operation (Map) is applied to every element, potentially in parallel.
 
-<code python>
+<code java>
 ## simple transformation on special characters
 toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
-camus <- tm_map(camus, toSpace, "/|@|\\|")
+kafka <- tm_map(kafka, toSpace, "/|@|\\|")
 
 ## lower cases, numbers, punctuation
-camus <- tm_map(camus, content_transformer(tolower))
-camus <- tm_map(camus, removeNumbers)
-camus <- tm_map(camus, removePunctuation)
+kafka <- tm_map(kafka, content_transformer(tolower))
+kafka <- tm_map(kafka, removeNumbers)
+kafka <- tm_map(kafka, removePunctuation)
 
 </code>
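To see what these steps do to a single document, here is a minimal base-R sketch that does not depend on ''tm''. The function name ''clean_text'' and the sample line are hypothetical, and the regular expressions are an assumption about what the ''tm'' transformations roughly amount to, not their actual implementation.

```r
# Sketch only: base-R approximations of the transformations applied above.
# clean_text() and sample_line are illustrative names, not part of the page.
clean_text <- function(x) {
  x <- gsub("/|@|\\|", " ", x)       # same pattern as toSpace: "/", "@", "|" become spaces
  x <- tolower(x)                    # lower case
  x <- gsub("[[:digit:]]+", "", x)   # drop digits (assumed analogue of removeNumbers)
  x <- gsub("[[:punct:]]+", "", x)   # drop punctuation (assumed analogue of removePunctuation)
  gsub("[[:space:]]+", " ", x)       # collapse runs of whitespace
}

sample_line <- "Gregor/Samsa woke at 7 o'clock: g@samsa|home!"
cleaned <- clean_text(sample_line)
print(cleaned)
```

Note that the order matters: the special characters are turned into spaces first, so that words separated only by "/" or "@" are not glued together when punctuation is removed.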
  
  
-We can take advantage of the built-in function ''stopwords()''. Let's have a look at the ''df'' first.
+We can invoke the built-in function ''stopwords()''. Let's have a look at the ''df'' first.
 
-<code>
+<code java>
 stopwords("english")
  
Line 86: Line 85:
  
  
-camus <- tm_map(camus, removeWords, stopwords("english"))
+kafka <- tm_map(kafka, removeWords, stopwords("english"))
 
 ## or replace with a vector c("a","b")
-camus <- tm_map(camus, removeWords, c("even"))
+kafka <- tm_map(kafka, removeWords, c("even"))
 
 ## whitespaces
-camus <- tm_map(camus, stripWhitespace)
+kafka <- tm_map(kafka, stripWhitespace)
 </code>
  
 Now let's work on frequencies
-
-<code>
+<code java>
 freq <- colSums(as.matrix(dtm))
 length(freq)
Line 111: Line 109:
 Sparse terms should be excluded from our analysis. The intuition is clear: ''removeSparseTerms(x, sparse)'' returns a term-document matrix from which those terms of ''x'' are removed that have at least a ''sparse'' percentage of empty elements (i.e., terms occurring 0 times in a document). In other words, the resulting matrix contains only terms with a sparse factor of less than ''sparse''.
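To make the sparse factor concrete, the following base-R sketch computes it by hand on a toy matrix. The matrix ''dtm_toy'', its counts and its term names are invented for illustration and are not taken from the Kafka corpus; the filter simply applies the rule described above (keep terms whose share of empty cells is below the threshold) and is not the ''tm'' implementation itself.

```r
# Sketch only: a hypothetical 4-document toy matrix (rows = documents, columns = terms).
dtm_toy <- matrix(
  c(5, 3, 0, 1,
    4, 0, 0, 2,
    6, 2, 1, 0,
    3, 1, 0, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("part", 1:4), c("gregor", "father", "apple", "door"))
)

# Overall sparsity: share of cells that are zero.
overall_sparsity <- mean(dtm_toy == 0)

# Sparse factor per term: share of documents in which the term does not occur.
term_sparsity <- colMeans(dtm_toy == 0)

# Keep only terms whose sparse factor is below the threshold (0.6, as in the page).
sparse_limit <- 0.6
dtm_kept <- dtm_toy[, term_sparsity < sparse_limit, drop = FALSE]

print(overall_sparsity)
print(term_sparsity)
print(colnames(dtm_kept))
```

With only four documents the sparse factor of a term can only take the values 0, 0.25, 0.5, 0.75 or 1, so a threshold of 0.6 keeps exactly the terms that occur in at least two of the four files.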
  
-<code>
+<code java>
 dim(dtm)
 dtms <- removeSparseTerms(dtm, 0.6)
Line 125: Line 123:
 Finding words which 'associate' together. Here, we specify the Term Document Matrix to use, the term we want to find associates for, and the lowest acceptable correlation limit with that term. This returns a vector of terms whose correlation with "gregor" is at least the given ''corlimit'', reported in descending order.
  
-<code>
+<code java>
 findFreqTerms(dtm, lowfreq=60)
 findAssocs(dtm, "gregor", corlimit=0.11)
Line 132: Line 130:
 >           eaten           father             fled             flew
             0.99             0.99             0.99             0.99
-            fond             full             girl          heavili
+            fond             full             girl          heavily
             0.99             0.99             0.99             0.99
 </code>
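The association score can be read as a plain correlation between term counts across documents. The base-R sketch below illustrates that idea on invented numbers; the matrix ''counts'', the term names and the limit of 0.5 are all hypothetical, and the code is an illustration of the concept rather than what ''findAssocs()'' does internally.

```r
# Sketch only: hypothetical term counts for four documents.
counts <- matrix(
  c(10,  4, 0,
    20,  8, 1,
    30, 12, 0,
    40, 16, 1),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("part", 1:4), c("gregor", "father", "apple"))
)

# Pearson correlation of the "gregor" column with every other term column.
others <- counts[, c("father", "apple"), drop = FALSE]
assoc <- cor(counts[, "gregor"], others)[1, ]

# Keep terms at or above the (hypothetical) limit, in descending order.
cor_limit <- 0.5
assoc <- sort(assoc[assoc >= cor_limit], decreasing = TRUE)
print(round(assoc, 2))
```

Because the corpus here consists of only four files, each correlation rests on four data points, which is one reason why many terms can reach values close to 1.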
Line 138: Line 136:
 Now let's plot the result.
 
-<code>
+<code java>
 #install Rgraphviz
 #source("http://bioconductor.org/biocLite.R")
kafka_s_metamorphosis_visualized.txt · Last modified: 2015/07/24 12:49 by vincenzo