I am trying to knit this Rmd file to PDF and I keep getting this error:
! LaTeX Error: Unicode character � (U+FFFD)
not set up for use with LaTeX.
output:
  pdf_document: default
  html_document: default
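(Note: U+FFFD is the Unicode replacement character, which usually means the text was decoded with the wrong encoding at some point. Two workarounds I have seen suggested, though I have not confirmed either against this exact file: switch the PDF engine to xelatex in the YAML header, or strip non-ASCII characters from the text column after reading it. A sketch, assuming df is the dataframe read in further below:)

# Option 1 (YAML): use a Unicode-aware engine instead of pdflatex
# output:
#   pdf_document:
#     latex_engine: xelatex

# Option 2 (R, run after read_csv below): drop non-ASCII bytes from the text
# column; this may also remove the stray "ï"/"1/2" tokens that show up later
df$meaning <- iconv(df$meaning, from = "UTF-8", to = "ASCII", sub = "")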
Intro to Data Science - HW 9
Copyright Jeffrey Stanton, Jeffrey Saltz, Christopher Dunham, and Jasmina Tacheva
# Enter your name here:
Attribution statement: (choose only one and delete the rest)
# 1. I did this homework by myself, with help from the book and the professor.
Text mining plays an important role in many industries because of the prevalence of text in the interactions between customers and company representatives. Even when the customer interaction is by speech, rather than by chat or email, speech-to-text algorithms have gotten so good that transcriptions of these spoken-word interactions are often available. To an increasing extent, a data scientist needs to be able to wield tools that turn a body of text into actionable insights. In this homework, we explore a real City of Syracuse dataset using the quanteda and quanteda.textplots packages. Make sure to install the quanteda and quanteda.textplots packages before following the steps below:
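The code below also loads quanteda.textstats and tidyverse, so a one-time install of all four packages avoids missing-package errors (a setup sketch, assuming none are installed yet):

# One-time install of everything this homework loads with library()
install.packages(c("tidyverse", "quanteda",
                   "quanteda.textplots", "quanteda.textstats"))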
Part 1: Load and visualize the data file
A. Take a look at this article:
and write a comment in your R script, briefly describing what it is about.
library(tidyverse)
library(quanteda)
library(quanteda.textplots)
library(quanteda.textstats)
# The City of Syracuse held a snowplow naming contest to name its 10 new snowplows.
# There was a total of 1,948 unique submissions.
# Plowie McPlowFace surprisingly received the second-most submissions.
B. Read the data from the following URL into a dataframe called df:
df <- read_csv("https://intro-datascience.s3.us-east-2.amazonaws.com/snowplownames.csv")
C. Inspect the df dataframe – which column contains an explanation of the meaning of each submitted snowplow name? Transform that column into a document-feature matrix, using the corpus(), tokens(), tokens_select(), and dfm() functions. Do not forget to remove stop words.
Hint: Make sure you have libraried quanteda
#Let's inspect the dataframe
glimpse(df)
# the "meaning" attribute is the column that contains an
# explanation about each submitted snowplow name
# Let's transform the "meaning" attribute into a document-feature matrix using
# corpus(), tokens(), tokens_select() and dfm() functions.
# Let's also remove stop words
df_corpus <- corpus(df$meaning, docnames = df$submission_number)
toks <- tokens(df_corpus, remove_punct = TRUE)
toks_nostop_words <- tokens_select(toks, pattern = stopwords("en"), selection = "remove")
df_dfm <- dfm(toks_nostop_words)
D. Plot a word cloud, where a word is only represented if it appears at least 2 times. Hint: use textplot_wordcloud():
Hint: Make sure you have libraried (and installed if needed) quanteda.textplots
textplot_wordcloud(df_dfm, min_count = 2)
E. Next, increase the minimum count to 10. What happens to the word cloud? Explain in a comment.
textplot_wordcloud(df_dfm, min_count = 10)
# The word cloud includes fewer words when min_count increases from 2 to 10.
# This makes sense: words that appear between 2 and 9 times no longer meet
# the threshold, so they are removed from the word cloud.
F. What are the top words in the word cloud? Explain in a brief comment.
# Judging by the largest words in the word cloud above, the top words include
# "snow", "syracuse", "ï", "city", "plow", and "1/2".
# We can tell these are the top words because each word's font size is
# proportional to how often it appears in the snowplow naming submissions.
# ("ï" and "1/2" look like encoding artifacts rather than real words.)
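To check that visual impression against exact counts, quanteda's topfeatures() returns the most frequent features in a dfm as a named vector:

# Numeric check of the largest words in the cloud
topfeatures(df_dfm, 10)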
Part 2: Analyze the sentiment of the descriptions
A. Create a named list of word counts by frequency, and output the 10 most frequent words (each word and its count).
Hint: use textstat_frequency() from the quanteda.textstats package.
textstat_frequency(df_dfm, n=10)
B. Explain in a comment what you observed in the sorted list of word counts.
# In this sorted list of word counts, we can see more clearly that some
# irrelevant entries remain, such as "1/2" and "ï".
# Another issue to account for: the 6th-ranked "plow" (140 appearances) and
# the 8th-ranked "plows" (100 appearances) are essentially the same word.
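Both issues can be addressed in quanteda: dfm_remove() drops specific tokens, and dfm_wordstem() collapses inflected forms like "plow"/"plows" onto a common stem. An optional cleanup sketch (not required by the assignment; df_dfm_clean is a new name introduced here):

# Drop the encoding artifacts, then stem so "plow" and "plows" merge
df_dfm_clean <- dfm_remove(df_dfm, pattern = c("ï", "1/2"))
df_dfm_clean <- dfm_wordstem(df_dfm_clean)
textstat_frequency(df_dfm_clean, n = 10)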
Part 3: Match the words with positive and negative words
A. Read in the list of positive words, using the scan() function, and output the first 5 words in the list. Do the same for the negative words list:
https://intro-datascience.s3.us-east-2.amazonaws.com/positive-words.txt
https://intro-datascience.s3.us-east-2.amazonaws.com/negative-words.txt
There should be 2006 positive words and 4783 negative words, so you may need to clean up these lists a bit.
pos_url <- "https://intro-datascience.s3.us-east-2.amazonaws.com/positive-words.txt"
pos_words <- scan(pos_url, character(0), sep = "\n")
# The first 34 lines of the file are header text, not words, so drop them
pos_words <- pos_words[-1:-34]
head(pos_words, 5)  # the first 5 positive words
length(pos_words)   # should be 2006
B. Use dfm_match() to match the words in the dfm with the words in pos_words. Note that dfm_match() creates a new dfm.
Then pass this new dfm to the textstat_frequency() function to see the positive words in our corpus, and how many times each word was mentioned.
pos_dfm <- dfm_match(df_dfm, features = pos_words)
pos_freq <- textstat_frequency(pos_dfm)
head(pos_freq, 10)  # the top positive words and their counts
nrow(pos_freq)      # number of distinct positive words matched
C. Sum all the positive words
sum(pos_freq$frequency)  # total occurrences of positive words
D. Do a similar analysis for the negative words - show the 10 most frequent negative words and then sum the negative words in the document.
neg_url <- "https://intro-datascience.s3.us-east-2.amazonaws.com/negative-words.txt"
neg_words <- scan(neg_url, character(0), sep = "\n")
# Drop the same 34 header lines from this file as well
neg_words <- neg_words[-1:-34]
head(neg_words, 5)  # the first 5 negative words
length(neg_words)   # should be 4783
neg_dfm <- dfm_match(df_dfm, features = neg_words)
neg_freq <- textstat_frequency(neg_dfm)
head(neg_freq, 10)  # the 10 most frequent negative words
nrow(neg_freq)      # number of distinct negative words matched
sum(neg_freq$frequency)  # total occurrences of negative words
E. Write a comment describing what you found after matching positive and negative words. Which group is more common in this dataset? Might some of the negative words not actually be used in a negative way? What about the positive words?
# The positive words had more matches (866) than the negative words (255),
# so the positive group is by far the more common one in this dataset.
# Context matters, though: some negative matches may simply describe winter
# weather (e.g., storms or cold) rather than express anything negative, and
# likewise some positive matches may be part of a playful name rather than
# genuine praise, so the raw counts are only a rough measure of sentiment.
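One way to check how a matched word is actually used is quanteda's kwic() (keyword-in-context), which prints each occurrence with its surrounding words. The pattern below is just a placeholder; substitute any word from pos_freq or neg_freq:

# Inspect a matched word in context to judge whether its use is really
# negative (or positive); "storm" here is a hypothetical example word
kwic(toks, pattern = "storm", window = 5)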