Tidying a one column dataframe by groups in the data

R-by-Ryo · July 5, 2018, 5:28pm

I have some data that shows numerical values for a few groups, all in 1 long column. Below is a small sample of what it looks like:

library(tibble)

df <- tribble(
  ~value,
  "group a",
  1,
  12,
  56,
  17,
  24,
  "group b",
  23,
  1,
  5,
  "group c",
  76,
  55,
  89,
  2,
  20,
  "group d",
  50,
  23,
  44,
  39
)

What i want to end up with is as follows:

tribble(
  ~value, ~name,
  1,     "group a",
  12,    "group a",
  56,    "group a",
  17,    "group a",
  24,    "group a",
  23,    "group b",
  1,     "group b",
  5,     "group b",
  76,   "group c",
  55,   "group c"
  # and so on....
)

Below are my attempts to tidy this data:

library(dplyr)
library(tidyr)
library(stringr)
library(tibble)
library(purrr)

df_rows <- df %>% rownames_to_column()

groups <- df_rows %>% select(value) %>% str_extract_all("group [a-z]") %>% as_vector()

# find the rows where data for certain group begins and ends:
df_length <- df_rows %>% 
  filter(value %in% groups) %>% 
  mutate(
    rowpos = as.numeric(rowname)) %>% 
  select(-rowname) %>% 
  rownames_to_column() %>% 
  mutate(
    group_start = rowpos + 1, 
    group_end = lead(rowpos) - 1,
    group_end = case_when(
      is.na(group_end) ~ 21,  # manually fill in the 'end row position' for the last group...
      TRUE ~ group_end
    )) %>% 
  select(-rowpos, -rowname)

# Now, I have the row number for where the data for a specific group starts and ends. Along with the original # df with row numbers as well.
# Next, slice() by the group_start and group_end that I just created, then mutate() the group name

df_rows %>% 
  slice(df_length$group_start[1]:df_length$group_end[1]) %>% 
  mutate(groupname = "Group A")

df_rows %>% 
  slice(df_length$group_start[2]:df_length$group_end[2]) %>% 
  mutate(groupname = "Group B")

# do the same for all the other groups

and then I can just rbind() them all to get the solution that I showed above but it's not very efficient...

I was thinking of using a for loop or map2 but I haven't been successful. Any help with simplifying the above or getting the result in a different way would be appreciated!

jcblum · July 5, 2018, 6:36pm

Here's what I might do. It's not really that different from your attempt, but a little briefer and takes advantage of the power of purrr::map2(), list columns and tidyr::unnest().

library(tidyverse)

df <- tibble::tribble(
  ~value,
  "group a",
  1,
  12,
  56,
  17,
  24,
  "group b",
  23,
  1,
  5,
  "group c",
  76,
  55,
  89,
  2,
  20,
  "group d",
  50,
  23,
  44,
  39
)

# If your real groups don't follow this naming pattern, use any other 
# vectorized function that returns TRUE for the label values
# instead of `startsWith()`
group_rows <- which(startsWith(df$value, "group"))
group_labels <- df$value[group_rows]
group_start <- group_rows + 1
group_end <- c(group_rows[-1] - 1, length(df$value))

map2(group_start, group_end, ~ df[.x:.y, ]) %>% 
  tibble(group_labels) %>% 
  unnest()
#> # A tibble: 17 x 2
#>    group_labels value
#>    <chr>        <chr>
#>  1 group a      1    
#>  2 group a      12   
#>  3 group a      56   
#>  4 group a      17   
#>  5 group a      24   
#>  6 group b      23   
#>  7 group b      1    
#>  8 group b      5    
#>  9 group c      76   
#> 10 group c      55   
#> 11 group c      89   
#> 12 group c      2    
#> 13 group c      20   
#> 14 group d      50   
#> 15 group d      23   
#> 16 group d      44   
#> 17 group d      39

Created on 2018-07-05 by the reprex package (v0.2.0).

The final output doesn't have exactly the same column order as what you specified, but if that's important you can rearrange it with dplyr::select() (and you'll probably want to add a follow-on mutate() to convert those values into whatever the actual appropriate data type is).

markdly · July 5, 2018, 9:23pm

Perhaps this is a good opportunity to use tidyr::fill() as an alternative approach (combined with the startsWith() function I didn't know about from jcblum's example):

library(tidyverse)
df %>% 
  mutate(name = if_else(startsWith(value, "group"), value, NA_character_)) %>% 
  fill(name) %>% 
  filter(value != name)
#> # A tibble: 17 x 2
#>    value name   
#>    <chr> <chr>  
#>  1 1     group a
#>  2 12    group a
#>  3 56    group a
#>  4 17    group a
#>  5 24    group a
#>  6 23    group b
#>  7 1     group b
#>  8 5     group b
#>  9 76    group c
#> 10 55    group c
#> 11 89    group c
#> 12 2     group c
#> 13 20    group c
#> 14 50    group d
#> 15 23    group d
#> 16 44    group d
#> 17 39    group d

jcblum · July 6, 2018, 1:44am

Ooh, clever and sleek! There’s so much good stuff in tidyr, I’ve almost started looking forward to encountering one of these absurd data formats

R-by-Ryo · July 6, 2018, 3:03am

wow this is great,concise too! I didn't think of using fill() at all, I was so dead-set on using slice(). Thanks!

R-by-Ryo · July 6, 2018, 3:05am

ah ok, I think I understand what I was doing wrong with my map/for loops. This will come in handy for other situations so thanks!

alexpghayes · July 9, 2018, 11:11pm

If you're like me and can't remember what fill does, you can get pretty much the same effect with cumsum:

df %>% 
  mutate(
    is_group = str_detect(value, "group"),
    group_num = cumsum(as.numeric(is_group))
  ) %>% 
  filter(!is_group)

but now the group identifiers are just doubles.