Modifying Column Names in a List of Data Frames

nviau · August 8, 2018, 5:48pm

Hello. I have a list of (consistently) poorly formatted data frames. I've spent much of the morning trying to figure out a solution using purrr and it's proving very frustrating. Here's some example data:

my_list <- list(
  data.frame(Jack = c("Anne", "Sophie"), X. = "$", X1 = 2:3, X..1 = "$", X4 = 5:6),
  data.frame(Meg = c("Tom", "Lauren", "Kyle", "Alex", "Seth"), X. = "$", X7 = 8:12, X..1 = "$", X13 = 14:18),
  data.frame(Dylan = "Liz", X. = "$", X19 = 20, X..1 = "$", X21 = 22)
)

my_list 
#> [[1]]
#>     Jack X. X1 X..1 X4
#> 1   Anne  $  2    $  5
#> 2 Sophie  $  3    $  6
#> 
#> [[2]]
#>      Meg X. X7 X..1 X13
#> 1    Tom  $  8    $  14
#> 2 Lauren  $  9    $  15
#> 3   Kyle  $ 10    $  16
#> 4   Alex  $ 11    $  17
#> 5   Seth  $ 12    $  18
#> 
#> [[3]]
#>   Dylan X. X19 X..1 X21
#> 1   Liz  $  20    $  22

Note that each data frame has the same number of columns.

Problems:

The first row of data is stored as the column names.
There are separate columns for dollar signs ($).

What I hope to accomplish:

Move the column names down into a row.
Rename each column (a simple x1:x(n) scheme is fine).
Dropping the columns containing "$" is not a problem, as I can just do it later on. Buuuuut if anyone thinks that it would be better before I combine the data frames, please say so

PS - If you provide any rlang or tidy evaluation context in your response, I would really appreciate it. I'm trying to get a better grasp of it.

MikeBadescu · August 12, 2018, 2:23pm

Here is a proposed solution, I tried to make each step clear to follow, I hope it helps.

clean_one_df <- function(df) {
    
    # remove '$' columns and convert factors to characters
    df1 <- 
        df %>%
        select_if(.p = ~ !all(.x == "$")) %>%
        modify_if(.p = ~ is.factor(.x), .f = as.character)
    
    # the data types for each column
    data_types <- map(df1, .f = ~ mode(.x))
    
    # take the orig names, remove 'X' and change the data type for each column
    first_row <- 
        names(df1) %>%
        gsub("^X", "", .) %>%
        as.list() %>%
        map2_dfc(.y = data_types, .f = ~ as.vector(.x, mode = .y))
    
    # first_row names are V1, V2, ...
    names(df1) <- names(first_row)
    df2 <- bind_rows(first_row, df1)
    
    df2
}

my_list %>% 
    map(clean_one_df) %>% 
    print()

MikeBadescu · August 12, 2018, 2:26pm

Also, please consider how your data got to be in that format - in certain cases it may be easy to import (relatively) cleaner data, see read_csv.

cderv · August 13, 2018, 8:28am

I agree. In your case @nviau, I think with header = FALSE, you should be able to import correctly the first row not as column name.

nviau · August 13, 2018, 3:50pm

Thanks @MikeBadescu. It's not working on my actual data at the moment, but I think I'm close to a solution using your suggestion. I'll follow up later on.

@cderv - I'm a big fan of read_csv ! Unfortunately the data are stored in a series of poorly arranged tables in several PDF documents. I imported data via the tabulizer package.

nviau · August 13, 2018, 4:06pm

Success! Turns out I overlooked a logical column and it was throwing an error. Thanks again @MikeBadescu !

cderv · August 13, 2018, 4:22pm

Did you use the output = "data.frame" argument in extract_tables ?
If it is what I think, this function is expecting by default the table to have a header. Internally, the method to returns data.frame call read.delim, and extract_tables as a ... argument in which you can pass argument to method used. So here, if you add header = FALSE to your call of extract_tables, it should import the table with no header.
Could you try to confirm this ?

nviau · August 13, 2018, 4:41pm

My gosh, you're right! Can't believe I missed that.

cderv · August 13, 2018, 4:45pm

Thanks for the question ! It helps see that the documentation is not fully correct because it says that ... are used with method and not output. (I opened an issue about that)

One great thing about open source software is that you can always go read the source code to understand better what is feasible and how it works. Unfortunately, everything is not always documented and reading a function source code can help a lot to learn a few things.
It is also why documentation is very very important in software development to help user fully understand the power of the software.

Glad it works!