Help with tidying messy data from MS Excel

Please help me understand how to tidy this data efficiently. I'm given data in MS Excel as a very wide table. I'm comfortable reading each of the Excel sheets into R (I would use readxl::read_excel).

My issue is that turning this abomination of a dataset into something usable seems to require a great deal of manual tidying. I'm sure there is a better way than what I have attempted.

Launch the reprex!


# set up
library(here)
#> here() starts at C:/Users/renedherrera/AppData/Local/Temp/Rtmps3EihZ/reprex31ecd507146
library(tidyverse)

# how to efficiently take data that looks like this (much wider in real life)
df <- tribble(
  ~na,  ~area_1,         ~area_1_1,         ~area_1_2,         ~area_1_3,         ~area_1_4,         ~area_1_5,         ~area_1_6,         ~area_1_7,         ~area_1_8,         ~area_1_9,         ~area_1_10,         ~area_1_11,         ~area_2_1,         ~area_2_2,         ~area_2_3,         ~area_2_4,         ~area_2_5,         ~area_2_6,         ~area_2_7,         ~area_2_8,         ~area_2_9,         ~area_2_10,         ~area_2_11,         ~area_2,
          "na", "demographic_1", "demographic_1", "demographic_1", "demographic_2", "demographic_2", "demographic_2", "demographic_3", "demographic_3", "demographic_3", "demographic_4", "demographic_4", "demographic_4", "demographic_1", "demographic_1", "demographic_1", "demographic_2", "demographic_2", "demographic_2", "demographic_3", "demographic_3", "demographic_3", "demographic_4", "demographic_4", "demographic_4",
          "na",      "estimate",        "ci_min",        "ci_max",      "estimate",        "ci_min",        "ci_max",      "estimate",        "ci_min",        "ci_max",      "estimate",        "ci_min",        "ci_max",      "estimate",        "ci_min",        "ci_max",      "estimate",        "ci_min",        "ci_max",      "estimate",        "ci_min",        "ci_max",      "estimate",        "ci_min",        "ci_max",
  "variable_1",          "72.1",          "69.1",          "75.2",          "67.5",          "64.1",          "70.8",          "79.1",          "71.9",          "86.2",          "79.5",          "64.5",          "94.5",          "72.6",          "67.4",          "77.7",          "70.9",          "65.9",          "75.9",          "73.3",          "60.3",          "86.4",          "81.3",          "50.6",         "100.0",
  "variable_2",          "66.9",          "64.1",          "69.7",          "70.4",          "67.6",          "73.3",          "57.0",          "48.5",          "65.4",          "37.7",          "20.0",          "55.3",          "71.0",          "66.1",          "76.0",          "72.9",          "68.3",          "77.4",          "64.9",          "48.9",          "81.0",          "17.0",           "0.0",          "41.3",
  "variable_3",          "61.0",          "57.4",          "64.5",          "59.9",          "56.2",          "63.6",          "65.0",          "56.3",          "73.7",          "52.0",          "25.6",          "78.3",          "64.6",          "58.8",          "70.5",          "64.8",          "59.0",          "70.6",          "65.4",          "50.4",          "80.5",          "42.5",           "0.0",         "100.0",
  "variable_4",          "35.7",          "33.0",          "38.3",          "38.6",          "35.8",          "41.4",          "24.3",          "17.3",          "31.4",          "35.9",          "15.5",          "56.3",          "39.2",          "34.5",          "43.9",          "40.5",          "35.9",          "45.2",          "29.3",          "16.1",          "42.4",          "66.0",          "23.1",         "100.0",
  "variable_5",          "68.4",          "64.5",          "72.3",          "73.6",          "69.4",          "77.8",          "61.7",          "53.3",          "70.1",          "50.3",          "30.0",          "70.6",          "71.5",          "65.0",          "78.0",          "75.9",          "69.1",          "82.6",          "66.6",          "52.2",          "81.0",          "42.7",           "2.9",          "82.5"
  ) %>%
  mutate(year = "year_1")


# into tidy data that looks like this
# make ideal df
tidy_df <- tribble(
  ~year, ~variables, ~areas, ~demographics, ~estimates, ~estimate_ci_min, ~estimate_ci_max,
  "year_1", "variable_1", "area_1", "demographic_1", 72.1, 69.1, 75.2,
  "year_1", "variable_1", "area_1", "demographic_2", 67.5, 64.1,    70.8,
  "year_1", "variable_1", "area_2", "demographic_1", 72.1, 69.1, 75.2,
  "year_1", "variable_1", "area_2", "demographic_2", 67.5, 64.1,    70.8,
  "year_1", "variable_2", "area_1", "demographic_1", 72.1, 69.1, 75.2,
  "year_1", "variable_2", "area_1", "demographic_2", 67.5, 64.1,    70.8,
  "year_1", "variable_2", "area_2", "demographic_1", 72.1, 69.1, 75.2,
  "year_1", "variable_2", "area_2", "demographic_2", 67.5, 64.1,    70.8,
) # and so on, much longer in real life 


# I have tried to do it this way
# select the columns by grouping
df_area_1 <- df %>%
  select("na", 
         starts_with("area_1")) %>%
  mutate(year = "year_1", # add columns to identify variables 
         area = "area_1") 

# rename, slice, and pivot for each grouping 
df_area_1 <- df_area_1 %>%
  rename(
    "demographic_1" = "area_1",
    "demographic_2" = "area_1_3",
    "demographic_3" = "area_1_6",
    "demographic_4" = "area_1_9"
  ) %>%
  slice(3:7) %>%
  pivot_longer(
    cols = c(2,5,8,11),
    names_to = "demographic",
    values_to = "estimate"
  )

# and repeat until I am ready for a full_join 

Silly. I meant to post an image of the original data as well. To see what I'm working with.


[Screenshot 2021-03-05 184054: image of the original data]

Yeah, it can be messy. You can paste the column headers together, pivot, and then separate them back out. Transposing with t() can also be useful, and sometimes you need a loop to slice and splice the Excel sheet. Reading everything in as character can make things simpler. I put the result in long format since it's more versatile.

library(tidyverse)

# how to efficiently take data that looks like this (much wider in real life)
df <- tribble(
  ~na,  ~area_1,         ~area_1_1,         ~area_1_2,         ~area_1_3,         ~area_1_4,         ~area_1_5,         ~area_1_6,         ~area_1_7,         ~area_1_8,         ~area_1_9,         ~area_1_10,         ~area_1_11,         ~area_2_1,         ~area_2_2,         ~area_2_3,         ~area_2_4,         ~area_2_5,         ~area_2_6,         ~area_2_7,         ~area_2_8,         ~area_2_9,         ~area_2_10,         ~area_2_11,         ~area_2,
  "na", "demographic_1", "demographic_1", "demographic_1", "demographic_2", "demographic_2", "demographic_2", "demographic_3", "demographic_3", "demographic_3", "demographic_4", "demographic_4", "demographic_4", "demographic_1", "demographic_1", "demographic_1", "demographic_2", "demographic_2", "demographic_2", "demographic_3", "demographic_3", "demographic_3", "demographic_4", "demographic_4", "demographic_4",
  "na",      "estimate",        "ci_min",        "ci_max",      "estimate",        "ci_min",        "ci_max",      "estimate",        "ci_min",        "ci_max",      "estimate",        "ci_min",        "ci_max",      "estimate",        "ci_min",        "ci_max",      "estimate",        "ci_min",        "ci_max",      "estimate",        "ci_min",        "ci_max",      "estimate",        "ci_min",        "ci_max",
  "variable_1",          "72.1",          "69.1",          "75.2",          "67.5",          "64.1",          "70.8",          "79.1",          "71.9",          "86.2",          "79.5",          "64.5",          "94.5",          "72.6",          "67.4",          "77.7",          "70.9",          "65.9",          "75.9",          "73.3",          "60.3",          "86.4",          "81.3",          "50.6",         "100.0",
  "variable_2",          "66.9",          "64.1",          "69.7",          "70.4",          "67.6",          "73.3",          "57.0",          "48.5",          "65.4",          "37.7",          "20.0",          "55.3",          "71.0",          "66.1",          "76.0",          "72.9",          "68.3",          "77.4",          "64.9",          "48.9",          "81.0",          "17.0",           "0.0",          "41.3",
  "variable_3",          "61.0",          "57.4",          "64.5",          "59.9",          "56.2",          "63.6",          "65.0",          "56.3",          "73.7",          "52.0",          "25.6",          "78.3",          "64.6",          "58.8",          "70.5",          "64.8",          "59.0",          "70.6",          "65.4",          "50.4",          "80.5",          "42.5",           "0.0",         "100.0",
  "variable_4",          "35.7",          "33.0",          "38.3",          "38.6",          "35.8",          "41.4",          "24.3",          "17.3",          "31.4",          "35.9",          "15.5",          "56.3",          "39.2",          "34.5",          "43.9",          "40.5",          "35.9",          "45.2",          "29.3",          "16.1",          "42.4",          "66.0",          "23.1",         "100.0",
  "variable_5",          "68.4",          "64.5",          "72.3",          "73.6",          "69.4",          "77.8",          "61.7",          "53.3",          "70.1",          "50.3",          "30.0",          "70.6",          "71.5",          "65.0",          "78.0",          "75.9",          "69.1",          "82.6",          "66.6",          "52.2",          "81.0",          "42.7",           "2.9",          "82.5"
) %>%
  mutate(year = "year_1")

# paste col names
names(df) <- paste(names(df), as.character(df[1,]), as.character(df[2,]), sep = "+")
df %>% 
  slice(-1,-2) %>%  # drop rows 1 and 2
  rename(
    variable = `na+na+na`,
    year = `year+year_1+year_1`
  ) %>% 
  pivot_longer(starts_with("area")) %>% # pivot
  separate(name, into = c("area", "demographic", "statistic"), sep = "\\+") %>% # split names
  mutate(
    variable = str_replace(variable, "variable_", ""),
    year = str_replace(year, "year_", ""),
    area = str_extract(area, "(?<=area_)[0-9]+"),
    demographic = str_replace(demographic, "demographic_", "")
  )
#> # A tibble: 120 x 6
#>    variable year  area  demographic statistic value
#>    <chr>    <chr> <chr> <chr>       <chr>     <chr>
#>  1 1        1     1     1           estimate  72.1 
#>  2 1        1     1     1           ci_min    69.1 
#>  3 1        1     1     1           ci_max    75.2 
#>  4 1        1     1     2           estimate  67.5 
#>  5 1        1     1     2           ci_min    64.1 
#>  6 1        1     1     2           ci_max    70.8 
#>  7 1        1     1     3           estimate  79.1 
#>  8 1        1     1     3           ci_min    71.9 
#>  9 1        1     1     3           ci_max    86.2 
#> 10 1        1     1     4           estimate  79.5 
#> # ... with 110 more rows

Created on 2021-03-06 by the reprex package (v0.3.0)
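To get from that long result to something shaped like your target tidy_df, one more pivot_wider() would spread the statistic column back out into estimate/ci_min/ci_max columns. A sketch, assuming the long result above is saved as df_long (only a made-up one-row subset is reproduced here):

```r
library(tidyverse)

# hypothetical one-row subset standing in for the long result above
df_long <- tribble(
  ~variable, ~year, ~area, ~demographic, ~statistic, ~value,
  "1", "1", "1", "1", "estimate", "72.1",
  "1", "1", "1", "1", "ci_min",   "69.1",
  "1", "1", "1", "1", "ci_max",   "75.2"
)

tidy_df <- df_long %>%
  mutate(value = as.numeric(value)) %>%                     # values were read as character
  pivot_wider(names_from = statistic, values_from = value)  # one column per statistic
```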


This package may help you:
GitHub - nacnudus/tidyxl: Read untidy Excel files in R (https://nacnudus.github.io/tidyxl/)


This definitely works for the demo data set. In my real-life dataset, though, I notice that the area names are messier than I had replicated; this is probably where a loop might come in handy. For example, these are the column headings:

names(df)
[1] "...1+NA+NA"                                                       
 [2] "UACC Catchment...2+All Respondents+Mean"                          
 [29] "Pima...29+All Respondents+Mean"                                   
 [56] "Pinal, Gila...56+All Respondents+Mean"         

If, before applying the pivot_longer() from your solution, I first select(starts_with("UACC")), I can tidy the data in groups by area and then join them afterwards. But is there a way to pivot everything at once, without a separate select and pivot for each area?

Also, thanks to you, I have learned more about how to use paste() and separate().

I often use read_excel() %>% janitor::clean_names() to clean the names from Excel sheets.
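For instance, on headers like the ones above (a minimal sketch with made-up column names):

```r
library(janitor)
library(tibble)

# made-up headers in the style of the real sheet
messy <- tibble(`UACC Catchment...2` = 1, `Pinal, Gila...56` = 2)

# clean_names() lower-cases headers and replaces spaces and punctuation
# with underscores, so the names are safe to type without backticks
clean_names(messy)
```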

Yes, you can pivot on as many columns as you want. I'm not sure whether you can combine multiple tidyselect helpers (like starts_with()), but you could construct a vector of column names and then use all_of().

For example:

# str_subset() keeps only the matching names; str_extract() would
# pad the vector with NA for every non-matching name
cols_to_pivot <- c(str_subset(names(df), "^UACC"),
                   str_subset(names(df), "^Pinal"),
                   ...)

df %>%
  pivot_longer(all_of(cols_to_pivot), ...)
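A runnable version of that idea, with made-up data (note str_subset() rather than str_extract(): str_extract() returns NA for non-matching names, and all_of() errors on NA):

```r
library(tidyverse)

# stand-in for the real sheet, with messy area headers
df <- tibble(
  variable = c("v1", "v2"),
  `UACC Catchment...2` = c(72.1, 66.9),
  `Pinal, Gila...56` = c(70.9, 72.9)
)

# keep only the column names matching each area pattern
cols_to_pivot <- c(str_subset(names(df), "^UACC"),
                   str_subset(names(df), "^Pinal"))

long <- df %>%
  pivot_longer(all_of(cols_to_pivot), names_to = "area")
```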

Looks like there's a bit of a learning curve on this one. Have you got any positive experience using this package?

Wow! read_excel() %>% janitor::clean_names() is a gift from the R gods.

I had another question which you answered in a different response to the post Elegant Code with Case_when

My question was how to clean up the area names after the pivot. This was my solution to that question:

brfss_2014 <- brfss_2014 %>%
  mutate(area = case_when(
    str_detect(area, "^uacc") ~ "uacc",
    str_detect(area, "^pima") ~ "pima",
    str_detect(area, "^pinal") ~ "pinal",
    str_detect(area, "^santa") ~ "santa_cruz",
    str_detect(area, "^yuma") ~ "yuma"
  ))

Now I need to repeat everything for each of the three sheets on that Excel workbook. Thanks for your help. I could have done it the long way in the time it took to solve but now I have a much better idea how to do this more efficiently in the future.


Obviously you can make this a bit simpler by just grabbing the first word of the name (i.e. everything up to the first character outside a-z). You can do a lot of clever things with regex and stringr, but there's a learning curve.

I would tend to make a vector of pretty names like this. It's useful to have both simple names for coding and pretty names for output.

pretty_names <- c("uacc" = "UACC", "pima" = "Pima", "santa" = "Santa Cruz", ...)

brfss_2014 <- brfss_2014 %>%
  mutate(
    area = str_extract(area, "^[a-z]+"),  # first run of lower-case letters
    pretty = pretty_names[area]           # look up the display name
  )
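As a self-contained sketch (the area values here are made up):

```r
library(tidyverse)

brfss_2014 <- tibble(area = c("uacc_catchment", "pima_county"))

# simple names for coding, pretty names for output
pretty_names <- c("uacc" = "UACC", "pima" = "Pima")

brfss_2014 <- brfss_2014 %>%
  mutate(
    area = str_extract(area, "^[a-z]+"),  # first run of lower-case letters
    pretty = unname(pretty_names[area])   # look up the display name
  )
```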

That package is designed for properly messy Excel data. If yours is easily resolved by standard means, then there is no need to learn its features.

Luckily I just about never need to interact with Excel files.


Yes, that did the trick for me. I've got quite a bit more to learn. Thank you for your help and taking the time to work through these questions with me.

In my world, I'm given data in Excel and the expectation is that the final deliverable is presented in Excel.


There are a couple of packages that let you write Excel workbooks, or you could just save the output as a CSV file, which Excel can also open easily.
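For example, the writexl package writes a data frame straight to an .xlsx file (a sketch; the data and file name here are made up):

```r
library(writexl)

# made-up deliverable
out <- data.frame(area = c("uacc", "pima"), estimate = c(72.1, 66.9))

path <- file.path(tempdir(), "brfss_output.xlsx")  # hypothetical output path
write_xlsx(out, path)                              # writes one sheet per data frame
```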
