Cleaning Data & Association Rules - R

anna4296 · February 14, 2020, 7:21pm

Help please!

I am trying to tidy the following dataset (in link) in R and then run the association rules analysis below.

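install.packages(c("dplyr", "stringr", "arules"))
library(dplyr)
library(stringr)
library(arules)  # provides the "transactions" class and apriori()

df <- read.csv("Groceries (2).csv", header = FALSE, stringsAsFactors = FALSE, na.strings = c("", " ", "NA"))
temp1 <- str_extract(df$V1, "[a-z]+")   # letters part of V1
temp2 <- str_extract(df$V1, "[^a-z]+")  # non-letter part of V1 (not used below)
df <- cbind(temp1, df)
df[2] <- NULL   # drop the original V1 column
df[35] <- NULL
View(df)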

summary(df)
str(df)

trans <- as(df,"transactions")

I get the following warning when I run the above trans <- as(df, "transactions") code:

Warning message: Column(s) 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34 not logical or factor. Applying default discretization (see '? discretizeDF').

summary(trans)

When I run the above code, I get the following:

transactions as itemMatrix in sparse format with
 1499 rows (elements/itemsets/transactions) and
 1268 columns (items) and a density of 0.01529042 

most frequent items:
  V5= vegetables   V6= vegetables temp1=vegetables   V2= vegetables 
             140              113              109              108 
  V9= vegetables          (Other) 
             103            28490

The results show each vegetables value as a separate item per column (V5= vegetables, V6= vegetables, and so on) instead of a single combined vegetables item, which is obviously increasing my number of columns. I am not sure why this is happening.

fit<-apriori(trans,parameter=list(support=0.006,confidence=0.25,minlen=2))
fit<-sort(fit,by="support")
inspect(head(fit))

nirgrahamuk · February 14, 2020, 8:28pm

Hello, I have two recommendations.

First, it's good that you provided code and a link to a CSV of the input data. But to help others help you, it's best to make things as convenient as possible for the community here.
In this case, I recommend that you read the CSV into a data frame yourself, sample it, and then use dput to represent it in a format that people here can copy and paste as if it were just another part of the code.
Example:

dput(dplyr::sample_n(d,50))

where d would be the data frame loaded from the CSV. This would give a random 50 records. (It relies on the dplyr library, which is part of the tidyverse and is popular here.)
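
To make that concrete, here is a minimal sketch of the dput round trip. The data frame d below is made up for illustration (it is not the real Groceries file), and the sample size is 3 only because the toy data is tiny:

```r
library(dplyr)

# Made-up stand-in for the data frame read from the CSV
d <- data.frame(
  V1 = c("vegetables", "eggs", "milk", "bread"),
  V2 = c("milk", "vegetables", NA, "eggs"),
  stringsAsFactors = FALSE
)

set.seed(42)             # make the random sample reproducible
small <- sample_n(d, 3)  # 3 random rows (use 50 on the real data)
dput(small)              # prints R code that rebuilds `small`

# Pasting the dput() output into R recreates the same data frame;
# here that copy-paste step is simulated with capture.output()
pasted <- paste(capture.output(dput(small)), collapse = "\n")
rebuilt <- eval(parse(text = pasted))
```

The text that dput prints is what you would paste into your post, so nobody needs to download the file to reproduce the problem.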

Second, it's usually best not to jump beyond an early error in one's code to also discuss later errors, as code errors often occur in a chain reaction. So I would recommend focusing your request for help on your code up to the first point of error: the as() function?
You can hint that there may be more issues to follow, but I think it's better to string people along than to intimidate them with a long list (that may not even need to be looked at, if fixing early errors solves issues with later errors).
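
As a starting point for looking at the as() step, here is a rough sketch of what I would check. It assumes (I have not seen the real file) that each row is one basket with its items spread across the columns. The toy data frame is made up. Coercing a data frame directly makes arules treat every column as a separate variable, which gives "column=value" items, while coercing a list of per-row item vectors gives one item per distinct value:

```r
library(arules)

# Made-up baskets: one row per transaction, items spread over columns.
# Note the stray leading space in " vegetables", as in the summary output.
df <- data.frame(
  V1 = c("vegetables", "milk", "eggs"),
  V2 = c("milk", " vegetables", NA),
  V3 = c(NA, "eggs", NA),
  stringsAsFactors = FALSE
)

# Coercing the data frame directly: every column is a separate variable,
# so the same value in two columns becomes two different items
df_factor <- as.data.frame(lapply(df, factor))
by_column <- as(df_factor, "transactions")
itemLabels(by_column)

# Treating each row as a basket: collect the trimmed, non-missing,
# de-duplicated values of the row, then coerce the list
baskets <- lapply(seq_len(nrow(df)), function(i) {
  items <- trimws(as.character(unlist(df[i, ])))
  unique(items[!is.na(items)])
})
by_basket <- as(baskets, "transactions")
itemLabels(by_basket)
```

If the real data matches that assumption, the second form should give far fewer columns (items) than the 1268 reported. read.transactions() with format = "basket" may be another route, but I have not tried it on this file.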

Regards,
Nir
