Messy dataframe to a tidier format: Summarize column values across rows

dpolychr · November 16, 2017, 5:40pm

Dear all,

I have a dataframe:

        col1                  `col2`           `col3`               `col4`
       <chr>                  <chr>             <chr>                <chr>
 1 111002017              111002017         111002017            111002017
 2 115008292              115008292         115008292            115008292
 3 210012153              210012153         210012153                 <NA>
 4 119000532              119000532         119000532                 <NA>
 5 115009831                   <NA>         115009831                 <NA>
 6    149688                   <NA>            149688                 <NA>
 7 SW0000006                   <NA>         SW0000006                 <NA>
 8    G95615                   <NA>            G95615                 <NA>
 9 115008871              115008871         115008871            115008871
10 115008883              115008883         115008883            115008883

The first column is the col I would like to report the counts against the other three. The other three columns are the data sets that the col 1 can appear in - they are all in at least one and can be all three.

What I would like is a dataframe like that:

col1         n
111002017    3
115008292    3

or (even better)

col1            col2      col3     col4
111002017         1         1       1

Any suggestions would be much appreciated

Dimitris

martin.R · November 16, 2017, 6:30pm

Something like this:

library(tidyverse)

your_data %>% 
  gather(colnum, identifier, -col1) %>% 
  filter(col1 == identifier) %>% 
  group_by(col1, colnum) %>% 
  count() %>% 
  spread(colnum, n)

dpolychr · November 16, 2017, 8:00pm

That was exactly what i needed, Martin! Many thanks

martin.R · November 16, 2017, 8:58pm

Actually, here is a more concise and straightforward version:

your_data %>% 
  group_by(col1) %>% 
  summarise_all(funs(sum(. == col1)))

nick · November 16, 2017, 9:16pm

Depending on whether the col1 data is unique (and if you want to group rows if it isn't), you could even leave out the group_by bit and make that:

mutate_at(your_data, vars(-col1), funs(as.numeric(. == col1)))