It is about an R code (text analysis)

Mayank1 · December 20, 2023, 10:20am

data_review2 <- data %>%
unnest_tokens(word, reviewText) %>%
anti_join(stop_word)
I am running this code, but I keep getting the following error:
Error in auto_copy():
! x and y must share the same src.
x is a <tbl_df/tbl/data.frame> object.
y is the string "i|me|my|myself|we|our|ours|...".

nirgrahamuk · December 20, 2023, 10:29am

The cause seems fairly clear: anti_join() is a function for combining information from two data frames, whereas you are using it with one data frame (the left input) and a character string (the right input).

It's hard to advise you without understanding what you intended when you wrote this code.
Perhaps you could say more about that.
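
As a minimal sketch of what that means in practice: assuming `stop_word` holds a single pipe-delimited string (as the error message suggests), it could be split into a one-column data frame so that `anti_join()` has two tables to compare. The names `stop_tbl` and `tokens` and the toy data are hypothetical, not taken from the original post.

```r
library(dplyr)

# Hypothetical stand-in for the poster's `stop_word`: one pipe-delimited string
stop_word <- "i|me|my|myself|we|our|ours"

# Split on the literal pipe and wrap the result in a one-column data frame
stop_tbl <- tibble(word = strsplit(stop_word, "|", fixed = TRUE)[[1]])

# Toy tokens standing in for the output of unnest_tokens()
tokens <- tibble(word = c("we", "love", "our", "new", "phone"))

# anti_join() keeps the rows of `tokens` that have no matching `word` in `stop_tbl`
tokens_clean <- anti_join(tokens, stop_tbl, by = "word")
tokens_clean$word
```

The join column must have the same name on both sides (here `word`), which is why the split string is stored under that name.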

Mayank1 · December 20, 2023, 10:36am

Ok I am trying LDA.
The data frame data has the review text in the column reviewText. Everything works fine if I do not remove the stop words. The code for that is:
data_review <- data %>%
unnest_tokens(word, reviewText)
head(data_review)
But when I try to remove the stop words, this error keeps popping up. I have tried a few things, but nothing works.

nirgrahamuk · December 20, 2023, 10:52am

My assumption is that, given the way your data is structured, you would use a filter with str_detect.
I assume your source for the methodology may be Chapter 3, "Stop words", of Supervised Machine Learning for Text Analysis in R.

I adapted that to more closely fit your stop word data, which, rather than being a character vector with multiple entries, is a single character string whose content is pipe-delimited.

library(hcandersenr)
library(tidyverse)
library(tidytext)
library(stopwords)
fir_tree <- hca_fairytales() %>%
  filter(book == "The fir tree",
         language == "English")

tidy_fir_tree <- fir_tree %>%
  unnest_tokens(word, text)

nrow(tidy_fir_tree)

sw <- stopwords(source = "snowball")
(one_long_sw <- paste0(sw,collapse = "|"))


# Keep only the tokens that are NOT an exact match for a stop word
(tidy_fir_tree_stops_removed <- filter(
  tidy_fir_tree,
  !str_detect(word, pattern = paste0("^(", one_long_sw, ")$"))
))

nrow(tidy_fir_tree_stops_removed)
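
One thing worth checking with a pipe-delimited pattern is that `str_detect()` matches substrings, so a short stop word such as "i" is also found inside longer words like "fir". A minimal sketch with toy data (the names `words`, `sw_small`, `substring_hits`, and `kept` are hypothetical) compares that behaviour with an exact-match test using `%in%`:

```r
library(dplyr)
library(stringr)

# Toy tokens and a tiny, made-up stop word vector
words <- tibble(word = c("the", "fir", "tree", "was", "little"))
sw_small <- c("i", "the", "was")

# Unanchored pattern: "i" also matches inside "fir" and "little"
substring_hits <- str_detect(words$word, paste0(sw_small, collapse = "|"))
substring_hits

# Exact matching: drop a token only when the whole token is a stop word
kept <- filter(words, !word %in% sw_small)
kept$word
```

When the stop words are available as a character vector, the `%in%` form avoids building a regular expression at all.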

Mayank1 · December 20, 2023, 11:20am

OK, I have tried this code, but the stop words are still not removed. I want to plot a bar graph, and all the stop words show up in the top 30 in my chart.

nirgrahamuk · December 20, 2023, 11:30am

Thanks for providing code. Could you kindly take further steps to make it easier for other forum users to help you? Share some representative data that will enable your code to run and show the problematic behaviour.

How do I share data for a reprex?

You might use tools such as the datapasta package, or the base function dput(), to share a portion of your data in code form, i.e. in a form that can be copied from the forum and pasted into an R session.

Reprex Guide
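
For illustration, a minimal sketch of the dput() route, using a made-up data frame in place of the real `data` (the two review strings are invented for the example):

```r
# Hypothetical data frame standing in for the poster's `data`
data <- data.frame(
  reviewText = c("I love this phone", "The battery is not good"),
  stringsAsFactors = FALSE
)

# Prints R code that recreates the first rows; paste that output into the post
dput(head(data, 2))
```

Anyone reading the thread can then assign the pasted output to a variable and run the problem code against it.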
