Factor in data.table, data.frame in R6/S3

ktanizar · July 14, 2018, 5:04am

Background:

I am helping my team understand why our code takes so long to run.
Dataset used has 363,358 rows.
We are using R6 to model our functions.
After investigating the code, I found that the data.frame "odata" looks suspicious: the Brand variable is stored as a factor with 363,358 levels. This means that every row is treated as its own level.

Has anyone seen this behaviour before? I would really appreciate any help, especially since the person who wrote the code has left our team.

Thank you!

jcblum · July 16, 2018, 2:32am

It looks like all the visible variables in that screenshot are factors with the same number of levels as rows — how about the other 39 variables (str() output could be useful here)? I can't imagine a factor with one level per row making much sense for any of these variables.

Usually the case where I see this happen is when someone imports data using read.csv() (or another read.table() variant) without changing the default stringsAsFactors parameter from TRUE to FALSE. This causes most anything that isn't easily parseable as a number to be imported as text, and then immediately converted to a factor. But in those cases, you usually only get the same number of levels as rows if there were no repeated values, since the default factor levels will be the unique values of as.character(x) (where x is the vector of values).
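As a minimal sketch of that import behaviour, here is a toy CSV (hypothetical data, not the original poster's) read with stringsAsFactors set explicitly to TRUE. Setting it explicitly matters because R 4.0.0 and later default to FALSE.

```r
# Toy CSV text; the column names and values are made up for illustration
csv_text <- "Brand,Price
Acme,10
Acme,12
Zenith,9"

# Explicit stringsAsFactors = TRUE reproduces the old (pre-4.0.0) default
toy <- read.csv(text = csv_text, stringsAsFactors = TRUE)

class(toy$Brand)    # "factor"
nrow(toy)           # 3
nlevels(toy$Brand)  # 2: levels are the unique values, not one per row
levels(toy$Brand)   # "Acme" "Zenith"
```

With repeated values in the column, the number of levels is smaller than the number of rows, which is why a level count equal to the row count is surprising here.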

What's really weird here is that according to the screenshot, Brand has identical values in the first three rows — but levels are supposed to be unique. You can force R to make duplicate levels (you will get a warning), but it shouldn't happen under normal circumstances. That makes me suspect that something beyond just stringsAsFactors happened here, so that either there are duplicate levels (bad!) or there are a bunch of unused levels (also probably bad — or at least, unintended).
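To illustrate the "unused levels" case, here is a small sketch on a made-up factor (the values are hypothetical). Subsetting a factor keeps its full set of levels, so the level count can exceed the number of distinct values actually present.

```r
# Hypothetical factor with three distinct values
brand <- factor(c("Acme", "Acme", "Zenith", "Orbit"))

# Subsetting keeps every original level, even those no longer present
subset_brand <- brand[1:2]

nlevels(subset_brand)                # 3: "Orbit" and "Zenith" are now unused
nlevels(droplevels(subset_brand))    # 1: only "Acme" is in use
anyDuplicated(levels(subset_brand))  # 0: no duplicated levels
```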

What do you get when you run the following?

## These should be the same! How different are they?

# All the levels
length(levels(odata$Brand))

# Just the unique levels
length(unique(levels(odata$Brand)))

# Just the levels that are in use
length(levels(droplevels(odata$Brand)))

# Just the unique levels that are in use
length(unique(levels(droplevels(odata$Brand))))

## Preview the result of re-creating a factor out of Brand with default levels
str(factor(odata$Brand))

All that said, it's possible that these weirdly formatted data have nothing to do with your slowdown (though they do seem problematic for analysis). Before you burn too much time in this direction, have you tried profiling your slow code? For some sensible and accessible advice on the subject, see:

ktanizar · July 30, 2018, 6:43am

Thanks so much for your reply! I just got back from a trip and finally have the chance to respond.

I ran your code, and all of the commands give the same answer.

After further checking, I discovered that:

str(odata) reports a data.frame
typeof(odata) returns "list"

Could this be the reason we are seeing # of levels = # of rows?
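For reference, a quick check on a toy data.frame (made-up values) suggests that this combination is normal: every data.frame is a list of columns underneath, so typeof() returning "list" does not by itself say anything about the number of factor levels.

```r
# Toy data.frame with one factor column; values are hypothetical
df <- data.frame(x = c("a", "b", "a"), stringsAsFactors = TRUE)

class(df)      # "data.frame"
typeof(df)     # "list"
is.list(df)    # TRUE
nlevels(df$x)  # 2, fewer than the 3 rows
```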

odata is initialized below:

myobject <- R6::R6Class(
  "myobject",
  public = list(
    odata = NULL,
    initialize = function(data) {
      self$odata <- data
    }
  )
)
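One way to check whether the class itself alters the data is to pass a small data.frame through a class of the same shape and compare. This is only a sketch: "Holder" and the toy data are hypothetical names and values, and it assumes the R6 package is installed.

```r
library(R6)

# Same shape as the class above; "Holder" is a hypothetical name
Holder <- R6Class(
  "Holder",
  public = list(
    odata = NULL,
    initialize = function(data) {
      # Plain assignment: the data is stored without any conversion
      self$odata <- data
    }
  )
)

# Toy input with a repeated Brand value
toy <- data.frame(Brand = factor(c("Acme", "Acme", "Zenith")))
obj <- Holder$new(toy)

identical(obj$odata, toy)  # TRUE: the stored copy matches the input
nlevels(obj$odata$Brand)   # 2: the level count is unchanged
```

If the same comparison holds for the real data, the extra levels were probably already present in whatever was passed to the constructor.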

Thank you!