Using variable values to create variable names

Ajackson · April 24, 2019, 5:33pm

This is pollen count data that I have read in off a health dept website.

Note that the pollen variety is in the third line of the tibble, but can
occasionally change. The total number of varieties remains constant, but sometimes
a substitution is made. In this case, in the second tibble "Black Gum" has been
replaced by "Alnus(Alder)".

The actual data consists of 75 files, each with 28-31 observations over 47 variables.


library(tidyverse)

df1 <- tribble(
~..1, ~..2, ~..3, ~..4, ~..5,
"Month: JANUARY", NA, NA, NA, NA,
"YEAR: 2013", NA, NA, NA, NA,
"DATE", "Ash", "Ashe Juniper / Bald Cypress", "Black Gum", "Black Walnut", 
"1", NA, NA,  NA, NA, 
"2", NA, "6", NA, NA,
"3", NA, "2", NA, NA,
"4", NA, NA,  NA, NA, 
"5", NA, NA,  NA, NA,
"Total", NA, "8", NA, NA
)

df2 <- tribble(
~..1, ~..2, ~..3, ~..4, ~..5,
"Month: DECEMBER", NA, NA, NA, NA,
"YEAR: 2013", NA, NA, NA, NA,
"DATE", "Ash", "Ashe Juniper / Bald Cypress", "Alnus(Alder)", "Black Walnut", 
"1", NA, NA,  NA, NA, 
"2", NA, NA,  NA, NA,
"3", NA, NA,  NA, NA,
"4", NA, "6", NA, NA, 
"5", NA, "8", NA, NA,
"Total", NA, "14", NA, NA
)

Desired output

The final tibble should have pollen variety as a variable name (with suitable
cleanup), as shown below (column order not important)


dfout <- tribble(
  ~Date, ~Ash, ~Ashe_JuniperOrBald_Cypress, ~Black_Gum, ~Black_Walnut, ~Alnus,
"01/01/2013", NA, NA,  NA, NA, NA,
"01/02/2013", NA, "6", NA, NA, NA,
"01/03/2013", NA, "2", NA, NA, NA,
"01/04/2013", NA, NA,  NA, NA, NA,
"01/05/2013", NA, NA,  NA, NA, NA,
"12/01/2013", NA, NA,  NA, NA, NA, 
"12/02/2013", NA, NA,  NA, NA, NA,
"12/03/2013", NA, NA,  NA, NA, NA,
"12/04/2013", NA, "6", NA, NA, NA,
"12/05/2013", NA, "8", NA, NA, NA
)

I can handle creating the Date values, but I am drawing a blank on how to use the values stored
in the tibble to create variable names.


translate <- tribble(
  ~from, ~to,
  "Ash",                          "Ash",
  "Ashe Juniper / Bald Cypress",  "Ashe_JuniperOrBald_Cypress",
  "Black Gum",                    "Black_Gum",
  "Black Walnut",                 "Black_Walnut",
  "Alnus(Alder)",                 "Alnus"
  
)

#   Do the date (that's easy)

mon <- str_remove(df1[1,]$..1, "Month:\\s*")
yr  <- str_remove(df1[3,]$..1, "YEAR:\\s*")

dates <- df1[,1] %>%
  tail(-5) %>% 
  head(-1) %>% 
  unlist() %>% 
  paste(mon, ., yr) %>% 
  mdy()

joels · April 24, 2019, 6:06pm

Here's a function that cleans one data frame. We then use map to run the function on all of the input data frames and return a single cleaned data frame.

In your question, you said "Note that the pollen variety is in the third line of the tibble, but can occasionally change." I took this to mean that we don't necessarily know which row number has the names of the pollen varieties. In the code below, to find the correct row, I've assumed that the row we want always has the word "DATE" in the first column. The code has some other hardwired assumptions, based on the structure of the data you provided. However, it can be adjusted as needed if there are other types of variations that can occur from one data frame to another.

library(tidyverse)

clean = function(data) {

  # Recode column names
  names.row = grep("DATE", df1[, 1, drop=TRUE])
  recode_vals = translate$to %>% set_names(translate$from)
  old_names = unlist(df1[names.row, ][-1]) 
  names(data) = c("Date", recode(old_names, !!!recode_vals))
  
  # Get Month and Year for dates
  mon <- str_remove(data[1,]$Date, "Month:\\s*")
  mon <- match(mon, toupper(month.name))
  yr  <- str_remove(data[2,]$Date, "YEAR:\\s*")

  # Remove Month, Year, Date and Total rows
  data = data[!grepl("Month|YEAR|DATE|Total", data$Date), ]
  
  # Change Date column to correct dates
  data$Date = paste(yr, mon, data$Date, sep="-")
  data$Date = lubridate::ymd(data$Date)
  
  data
}

# Run the function on all of the data frame and return a single cleaned 
#  data frame
map_df(list(df1, df2), clean)

   Date       Ash   Ashe_JuniperOrBald_Cypress Black_Gum Black_Walnut Alnus
   <date>     <chr> <chr>                      <chr>     <chr>        <chr>
 1 2013-01-01 NA    NA                         NA        NA           NA   
 2 2013-01-02 NA    6                          NA        NA           NA   
 3 2013-01-03 NA    2                          NA        NA           NA   
 4 2013-01-04 NA    NA                         NA        NA           NA   
 5 2013-01-05 NA    NA                         NA        NA           NA   
 6 2013-12-01 NA    NA                         NA        NA           NA   
 7 2013-12-02 NA    NA                         NA        NA           NA   
 8 2013-12-03 NA    NA                         NA        NA           NA   
 9 2013-12-04 NA    6                          NA        NA           NA   
10 2013-12-05 NA    8                          NA        NA           NA

Ajackson · April 24, 2019, 6:41pm

Thanks! It all looks so easy now I wonder why I didn't see it.

joels · April 24, 2019, 6:43pm

I made one change to the code. In the previous version, the recoding of column names will fail if more than one old name is mapped to the same new name. The new code uses the recode function, which will work regardless of whether the old names have a many-to-one relationship to the new names.

hinkelman · April 26, 2019, 10:41pm

If your question's been answered (even by you!), would you mind choosing a solution? It helps other people see which questions still need help, or find solutions if they have similar problems. Here’s how to do it:

Ajackson · April 27, 2019, 1:56pm

After some tweaking to handle some more corner cases (column name = NA for example) it worked great.

system · May 4, 2019, 1:56pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.