fast way to compute frequency table for large data

Dear R experts,

I have a data frame like this,0,1,1),x2=c('no','yes','yes','yes'),
The data frame is of 9000000 rows and 380 columns(x1~x380).

Is there any efficient way to see which rows have only 0 or 'no'? Also, I hope to compute the proportion of 0 and 'no' for each row (in the above case, I want to create a new column: 0.75 0.50 0.00 0.25). I there any way to achieve these goals in one minute?


Well, I can't claim R expertise, compared to many of the other members of this community, but I'll give it a shot.

3.8 GB has to fit in memory, so you need to have the free ram or run on something like an AWS instance, to start.

The two dplyr functions to do this are filter and mutate

next_df <- example %>% filter(x3 == 0  | x3 == "no")

For mutate, I can only point you to the function case_when since I don't know your decision criteria.

My experience on a 4MB Macbook Air is that doing that many rows will take more than a minute or choke, less on an 8MB Dell under Ubuntu. I haven't tried that on my 32MB desktop.

Hi Veda:

following your example:,0,1,1,0),x2=c('no','yes','yes','yes','no'),
                      stringsAsFactors = FALSE)

Note that i have added stringsAsFactors = FALSE... I also added one row to your example, so you can have a column with only 0 and 'no' and see the difference in the output

Now you can get a vector indicating a logical value TRUE if the row only have 0 or 'no'

onlidesired <- apply(example, 1, function(x) mean(x == 0 | x == 'no')==1)

> onlidesired

So the last row is the one that matches your criteria.
To calculate the proportion of 0 and 'no' of each row, we follow the same way, but now, we assign it to a new variable of example:

example$proportions <- apply(example, 1, function(x) mean(x == 0 | x == 'no'))

> example
  x  x2  x3 x4 proportions
1 0  no  no  1        0.75
2 0 yes yes  0        0.50
3 1 yes yes  1        0.00
4 1 yes  no  1        0.25
5 0  no  no  0        1.00

Hope it helps:

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.