I created a text dataset containing the complete lyrics of Morrissey and the Smiths. Below: data cleaning, term frequencies, and topic modeling.

library(tm)

## load corpus
smiths <- readLines('./The Smiths Lyrics, by Album, by Year/global_smiths.tsv')
smiths.nospace <- gsub('\t', ' ', smiths)        # replace tabs with spaces
smiths.nopunct <- removePunctuation(smiths.nospace)
smiths.stop    <- removeWords(smiths.nopunct, stopwords("english"))
smiths.stem    <- stemDocument(smiths.stop)
# drop frequent fillers that survive stopword removal and stemming
smiths.clean   <- removeWords(smiths.stem, c("and","dont","just","the","never","youre","know"))

Corpus creation.

smiths.corpus = Corpus(VectorSource(smiths.clean))
smiths.dtm = DocumentTermMatrix(smiths.corpus)
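The DocumentTermMatrix is the documents-by-terms count matrix everything below operates on. As a quick sanity check of its structure, here is a minimal self-contained sketch on a two-line toy corpus (the placeholder text and object names are illustrative, not drawn from the dataset above):

```r
library(tm)

# toy two-document corpus (placeholder text, not the real lyrics file)
docs <- Corpus(VectorSource(c("there is a light that never goes out",
                              "this charming man")))
dtm <- DocumentTermMatrix(docs)

dim(dtm)                 # rows = documents, cols = distinct terms
findFreqTerms(dtm, 1)    # every term occurring at least once
```

Note that by default tm lowercases tokens and drops words shorter than three characters, so "is" and "a" do not appear as terms.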

Filter out low-frequency terms: keep only those that occur at least 40 times across the corpus.

Dictionary <- function(x) {
  if (!is.character(x)) {
    stop('x is not a character vector')
  }
  x
}
smiths.dict = Dictionary(findFreqTerms(smiths.dtm, 40))
# rebuilding the DTM restricted to this dictionary reduces sparsity by 12%
smiths.dtm.filtered = DocumentTermMatrix(smiths.corpus, list(dictionary = smiths.dict))

Frequency analysis

freq <- colSums(as.matrix(smiths.dtm))
ord  <- order(freq)
freq[head(ord)]                                    # least frequent terms
rowTotals <- apply(smiths.dtm.filtered, 1, sum)    # sum of words in each document
smiths.dtm.filtered <- smiths.dtm.filtered[rowTotals > 0, ]   # remove all docs without words

In natural language processing, Latent Dirichlet allocation (LDA) is a generative model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar. For example, if observations are words collected into documents, it posits that each document is a mixture of a small number of topics and that each word's creation is attributable to one of the document's topics.

library(topicmodels)
smiths.lda = LDA(smiths.dtm.filtered, k = 2)

The result, one top term per topic:

Topic 1  Topic 2
"life"   "love"
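The per-topic terms above come from summarising the fitted model: in topicmodels, terms() lists the top terms for each topic and topics() gives the most likely topic per document. A minimal self-contained sketch on a hypothetical four-document corpus (the toy text and seed are assumptions, not the real run):

```r
library(tm)
library(topicmodels)

# hypothetical toy corpus standing in for smiths.dtm.filtered
docs <- Corpus(VectorSource(c("love heart love heart",
                              "life death life death",
                              "heart love heart",
                              "death life life")))
dtm <- DocumentTermMatrix(docs)

lda <- LDA(dtm, k = 2, control = list(seed = 1234))
terms(lda, 2)    # top 2 terms for each topic
topics(lda)      # most likely topic for each document
```

On the real data, the same calls against smiths.lda produce the two-column table shown above.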
latent_dirichlet_allocation_on_the_smiths_morrissey_s_lyrics.txt · Last modified: 2015/08/01 00:16 by vincenzo