Help with tidying messy data from MS Excel

Please help me understand how to tidy this data efficiently. I'm given data in MS Excel as a very wide table. I'm comfortable reading each of the Excel sheets into R (I would use readxl::read_excel).

My issue is that turning this abomination of a dataset into something usable seems to require a great deal of manual tidying. I'm sure there is a better way than what I have attempted.

Launch the reprex!


# set up
library(here)
#> here() starts at C:/Users/renedherrera/AppData/Local/Temp/Rtmps3EihZ/reprex31ecd507146
library(tidyverse)

# how to efficiently take data that looks like this (much wider in real life)
df <- tribble(
  ~na,  ~area_1,         ~area_1_1,         ~area_1_2,         ~area_1_3,         ~area_1_4,         ~area_1_5,         ~area_1_6,         ~area_1_7,         ~area_1_8,         ~area_1_9,         ~area_1_10,         ~area_1_11,         ~area_2_1,         ~area_2_2,         ~area_2_3,         ~area_2_4,         ~area_2_5,         ~area_2_6,         ~area_2_7,         ~area_2_8,         ~area_2_9,         ~area_2_10,         ~area_2_11,         ~area_2,
          "na", "demographic_1", "demographic_1", "demographic_1", "demographic_2", "demographic_2", "demographic_2", "demographic_3", "demographic_3", "demographic_3", "demographic_4", "demographic_4", "demographic_4", "demographic_1", "demographic_1", "demographic_1", "demographic_2", "demographic_2", "demographic_2", "demographic_3", "demographic_3", "demographic_3", "demographic_4", "demographic_4", "demographic_4",
          "na",      "estimate",        "ci_min",        "ci_max",      "estimate",        "ci_min",        "ci_max",      "estimate",        "ci_min",        "ci_max",      "estimate",        "ci_min",        "ci_max",      "estimate",        "ci_min",        "ci_max",      "estimate",        "ci_min",        "ci_max",      "estimate",        "ci_min",        "ci_max",      "estimate",        "ci_min",        "ci_max",
  "variable_1",          "72.1",          "69.1",          "75.2",          "67.5",          "64.1",          "70.8",          "79.1",          "71.9",          "86.2",          "79.5",          "64.5",          "94.5",          "72.6",          "67.4",          "77.7",          "70.9",          "65.9",          "75.9",          "73.3",          "60.3",          "86.4",          "81.3",          "50.6",         "100.0",
  "variable_2",          "66.9",          "64.1",          "69.7",          "70.4",          "67.6",          "73.3",          "57.0",          "48.5",          "65.4",          "37.7",          "20.0",          "55.3",          "71.0",          "66.1",          "76.0",          "72.9",          "68.3",          "77.4",          "64.9",          "48.9",          "81.0",          "17.0",           "0.0",          "41.3",
  "variable_3",          "61.0",          "57.4",          "64.5",          "59.9",          "56.2",          "63.6",          "65.0",          "56.3",          "73.7",          "52.0",          "25.6",          "78.3",          "64.6",          "58.8",          "70.5",          "64.8",          "59.0",          "70.6",          "65.4",          "50.4",          "80.5",          "42.5",           "0.0",         "100.0",
  "variable_4",          "35.7",          "33.0",          "38.3",          "38.6",          "35.8",          "41.4",          "24.3",          "17.3",          "31.4",          "35.9",          "15.5",          "56.3",          "39.2",          "34.5",          "43.9",          "40.5",          "35.9",          "45.2",          "29.3",          "16.1",          "42.4",          "66.0",          "23.1",         "100.0",
  "variable_5",          "68.4",          "64.5",          "72.3",          "73.6",          "69.4",          "77.8",          "61.7",          "53.3",          "70.1",          "50.3",          "30.0",          "70.6",          "71.5",          "65.0",          "78.0",          "75.9",          "69.1",          "82.6",          "66.6",          "52.2",          "81.0",          "42.7",           "2.9",          "82.5"
  ) %>%
  mutate(year = "year_1")


# into tidy data that looks like this
# make ideal df
tidy_df <- tribble(
  ~year, ~variables, ~areas, ~demographics, ~estimates, ~estimate_ci_min, ~estimate_ci_max,
  "year_1", "variable_1", "area_1", "demographic_1", 72.1, 69.1, 75.2,
  "year_1", "variable_1", "area_1", "demographic_2", 67.5, 64.1,    70.8,
  "year_1", "variable_1", "area_2", "demographic_1", 72.1, 69.1, 75.2,
  "year_1", "variable_1", "area_2", "demographic_2", 67.5, 64.1,    70.8,
  "year_1", "variable_2", "area_1", "demographic_1", 72.1, 69.1, 75.2,
  "year_1", "variable_2", "area_1", "demographic_2", 67.5, 64.1,    70.8,
  "year_1", "variable_2", "area_2", "demographic_1", 72.1, 69.1, 75.2,
  "year_1", "variable_2", "area_2", "demographic_2", 67.5, 64.1,    70.8,
) # and so on, much longer in real life 


# I have tried to do it this way
# select the columns by grouping
df_area_1 <- df %>%
  select("na", 
         starts_with("area_1")) %>%
  mutate(year = "year_1", # add columns to identify variables 
         area = "area_1") 

# rename, slice, and pivot for each grouping 
df_area_1 <- df_area_1 %>%
  rename(
    "demographic_1" = "area_1",
    "demographic_2" = "area_1_3",
    "demographic_3" = "area_1_6",
    "demographic_4" = "area_1_9"
  ) %>%
  slice(3:7) %>%
  pivot_longer(
    cols = c(2,5,8,11),
    names_to = "demographic",
    values_to = "estimate"
  )

# and repeat until I am ready for a full_join 

Silly. I meant to post an image of the original data as well. To see what I'm working with.


[Screenshot 2021-03-05 184054: image of the original data]

Yeah, it can be messy. You can paste the column headers together, pivot, and then separate them back out. Transposing with t() can also be useful, and sometimes you need a loop to slice and splice the Excel sheet. Reading everything in as character can make things simpler. I put the result in long format since it's more versatile.

library(tidyverse)

# how to efficiently take data that looks like this (much wider in real life)
df <- tribble(
  ~na,  ~area_1,         ~area_1_1,         ~area_1_2,         ~area_1_3,         ~area_1_4,         ~area_1_5,         ~area_1_6,         ~area_1_7,         ~area_1_8,         ~area_1_9,         ~area_1_10,         ~area_1_11,         ~area_2_1,         ~area_2_2,         ~area_2_3,         ~area_2_4,         ~area_2_5,         ~area_2_6,         ~area_2_7,         ~area_2_8,         ~area_2_9,         ~area_2_10,         ~area_2_11,         ~area_2,
  "na", "demographic_1", "demographic_1", "demographic_1", "demographic_2", "demographic_2", "demographic_2", "demographic_3", "demographic_3", "demographic_3", "demographic_4", "demographic_4", "demographic_4", "demographic_1", "demographic_1", "demographic_1", "demographic_2", "demographic_2", "demographic_2", "demographic_3", "demographic_3", "demographic_3", "demographic_4", "demographic_4", "demographic_4",
  "na",      "estimate",        "ci_min",        "ci_max",      "estimate",        "ci_min",        "ci_max",      "estimate",        "ci_min",        "ci_max",      "estimate",        "ci_min",        "ci_max",      "estimate",        "ci_min",        "ci_max",      "estimate",        "ci_min",        "ci_max",      "estimate",        "ci_min",        "ci_max",      "estimate",        "ci_min",        "ci_max",
  "variable_1",          "72.1",          "69.1",          "75.2",          "67.5",          "64.1",          "70.8",          "79.1",          "71.9",          "86.2",          "79.5",          "64.5",          "94.5",          "72.6",          "67.4",          "77.7",          "70.9",          "65.9",          "75.9",          "73.3",          "60.3",          "86.4",          "81.3",          "50.6",         "100.0",
  "variable_2",          "66.9",          "64.1",          "69.7",          "70.4",          "67.6",          "73.3",          "57.0",          "48.5",          "65.4",          "37.7",          "20.0",          "55.3",          "71.0",          "66.1",          "76.0",          "72.9",          "68.3",          "77.4",          "64.9",          "48.9",          "81.0",          "17.0",           "0.0",          "41.3",
  "variable_3",          "61.0",          "57.4",          "64.5",          "59.9",          "56.2",          "63.6",          "65.0",          "56.3",          "73.7",          "52.0",          "25.6",          "78.3",          "64.6",          "58.8",          "70.5",          "64.8",          "59.0",          "70.6",          "65.4",          "50.4",          "80.5",          "42.5",           "0.0",         "100.0",
  "variable_4",          "35.7",          "33.0",          "38.3",          "38.6",          "35.8",          "41.4",          "24.3",          "17.3",          "31.4",          "35.9",          "15.5",          "56.3",          "39.2",          "34.5",          "43.9",          "40.5",          "35.9",          "45.2",          "29.3",          "16.1",          "42.4",          "66.0",          "23.1",         "100.0",
  "variable_5",          "68.4",          "64.5",          "72.3",          "73.6",          "69.4",          "77.8",          "61.7",          "53.3",          "70.1",          "50.3",          "30.0",          "70.6",          "71.5",          "65.0",          "78.0",          "75.9",          "69.1",          "82.6",          "66.6",          "52.2",          "81.0",          "42.7",           "2.9",          "82.5"
) %>%
  mutate(year = "year_1")

# paste col names
names(df) <- paste(names(df), as.character(df[1,]), as.character(df[2,]), sep = "+")
df %>% 
  slice(-1,-2) %>%  # drop rows 1 and 2
  rename(
    variable = `na+na+na`,
    year = `year+year_1+year_1`
  ) %>% 
  pivot_longer(starts_with("area")) %>% # pivot
  separate(name, into = c("area", "demographic", "statistic"), sep = "\\+") %>% # split names
  mutate(
    variable = str_replace(variable, "variable_", ""),
    year = str_replace(year, "year_", ""),
    area = str_extract(area, "(?<=area_)[0-9]+"),
    demographic = str_replace(demographic, "demographic_", "")
  )
#> # A tibble: 120 x 6
#>    variable year  area  demographic statistic value
#>    <chr>    <chr> <chr> <chr>       <chr>     <chr>
#>  1 1        1     1     1           estimate  72.1 
#>  2 1        1     1     1           ci_min    69.1 
#>  3 1        1     1     1           ci_max    75.2 
#>  4 1        1     1     2           estimate  67.5 
#>  5 1        1     1     2           ci_min    64.1 
#>  6 1        1     1     2           ci_max    70.8 
#>  7 1        1     1     3           estimate  79.1 
#>  8 1        1     1     3           ci_min    71.9 
#>  9 1        1     1     3           ci_max    86.2 
#> 10 1        1     1     4           estimate  79.5 
#> # ... with 110 more rows

Created on 2021-03-06 by the reprex package (v0.3.0)
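To get from that long result to something shaped like your target tidy_df, one more pivot_wider() would spread the statistic column back out into estimate/ci_min/ci_max columns. A sketch, assuming the long result above is saved as df_long (only a made-up one-row subset is reproduced here):

```r
library(tidyverse)

# hypothetical one-row subset standing in for the long result above
df_long <- tribble(
  ~variable, ~year, ~area, ~demographic, ~statistic, ~value,
  "1", "1", "1", "1", "estimate", "72.1",
  "1", "1", "1", "1", "ci_min",   "69.1",
  "1", "1", "1", "1", "ci_max",   "75.2"
)

tidy_df <- df_long %>%
  mutate(value = as.numeric(value)) %>%                     # values were read as character
  pivot_wider(names_from = statistic, values_from = value)  # one column per statistic
```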


This package may help you:
GitHub - nacnudus/tidyxl: Read untidy Excel files in R (https://nacnudus.github.io/tidyxl/)


This definitely works for the demo data set. In my real-life dataset, though, I notice that the area names are messier than I had replicated; this is probably where a loop might come in handy. For example, these are the column headings:

names(df)
[1] "...1+NA+NA"                                                       
 [2] "UACC Catchment...2+All Respondents+Mean"                          
 [29] "Pima...29+All Respondents+Mean"                                   
 [56] "Pinal, Gila...56+All Respondents+Mean"         

If, before applying the pivot_longer() from your solution, I first select(starts_with("UACC")), I can tidy the data in groups by area and then join them afterwards. But is there a way to pivot everything at once, without a separate select and pivot for each area?

Also, thanks to you, I have learned more about how to use paste() and separate().

I often use read_excel() %>% janitor::clean_names() to clean the names from Excel sheets.
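For instance, on headers like the ones above (a minimal sketch with made-up column names):

```r
library(janitor)
library(tibble)

# made-up headers in the style of the real sheet
messy <- tibble(`UACC Catchment...2` = 1, `Pinal, Gila...56` = 2)

# clean_names() lower-cases headers and replaces spaces and punctuation
# with underscores, so the names are safe to type without backticks
clean_names(messy)
```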

Yes, you can pivot on as many columns as you want. I'm not sure whether you can combine multiple tidyselect helpers (like starts_with()), but you could construct a vector of column names and then use all_of().

For example:

# str_subset() keeps only the matching names; str_extract() would
# pad the vector with NA for every non-matching name
cols_to_pivot <- c(str_subset(names(df), "^UACC"),
                   str_subset(names(df), "^Pinal"),
                   ...)

df %>%
  pivot_longer(all_of(cols_to_pivot), ...)
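A runnable version of that idea, with made-up data (note str_subset() rather than str_extract(): str_extract() returns NA for non-matching names, and all_of() errors on NA):

```r
library(tidyverse)

# stand-in for the real sheet, with messy area headers
df <- tibble(
  variable = c("v1", "v2"),
  `UACC Catchment...2` = c(72.1, 66.9),
  `Pinal, Gila...56` = c(70.9, 72.9)
)

# keep only the column names matching each area pattern
cols_to_pivot <- c(str_subset(names(df), "^UACC"),
                   str_subset(names(df), "^Pinal"))

long <- df %>%
  pivot_longer(all_of(cols_to_pivot), names_to = "area")
```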

Looks like there's a bit of a learning curve on this one. Have you got any positive experience using this package?

Wow! read_excel() %>% janitor::clean_names() is a gift from the R gods.

I had another question which you answered in a different response to the post Elegant Code with Case_when

My question was how to clean up the area names after the pivot. This was my solution to that question:

brfss_2014 <- brfss_2014 %>%
  mutate(area = case_when(
    str_detect(area, "^uacc") ~ "uacc",
    str_detect(area, "^pima") ~ "pima",
    str_detect(area, "^pinal") ~ "pinal",
    str_detect(area, "^santa") ~ "santa_cruz",
    str_detect(area, "^yuma") ~ "yuma"
  ))

Now I need to repeat everything for each of the three sheets on that Excel workbook. Thanks for your help. I could have done it the long way in the time it took to solve but now I have a much better idea how to do this more efficiently in the future.


Obviously you can make this a bit simpler by just grabbing the first word of the name (i.e. everything up to the first character outside a-z). You can do a lot of clever things with regex and stringr, but there's a learning curve.

I would tend to make a vector of pretty names like this. It's useful to have both simple names for coding and pretty names for output.

pretty_names <- c("uacc" = "UACC", "pima" = "Pima", "santa" = "Santa Cruz", ...)

brfss_2014 <- brfss_2014 %>%
  mutate(
    area = str_extract(area, "^[a-z]+"),  # first run of lower-case letters
    pretty = pretty_names[area]           # look up the display name
  )
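As a self-contained sketch (the area values here are made up):

```r
library(tidyverse)

brfss_2014 <- tibble(area = c("uacc_catchment", "pima_county"))

# simple names for coding, pretty names for output
pretty_names <- c("uacc" = "UACC", "pima" = "Pima")

brfss_2014 <- brfss_2014 %>%
  mutate(
    area = str_extract(area, "^[a-z]+"),  # first run of lower-case letters
    pretty = unname(pretty_names[area])   # look up the display name
  )
```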

That package is designed for properly messy Excel data. If yours is easily resolved by standard means, then there is no need to learn its features.

Luckily I just about never need to interact with Excel files.


Yes, that did the trick for me. I've got quite a bit more to learn. Thank you for your help and taking the time to work through these questions with me.

In my world, I'm given data in Excel and the expectation is that the final deliverable is presented in Excel.


There are a couple of packages that let you write Excel workbooks, or you could just save the output as a CSV file, which Excel can also open easily.
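For example, the writexl package writes a data frame straight to an .xlsx file (a sketch; the data and file name here are made up):

```r
library(writexl)

# made-up deliverable
out <- data.frame(area = c("uacc", "pima"), estimate = c(72.1, 66.9))

path <- file.path(tempdir(), "brfss_output.xlsx")  # hypothetical output path
write_xlsx(out, path)                              # writes one sheet per data frame
```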
