Adding new columns in a nested list - in a tidy way [R]

GS45191 · June 30, 2020, 4:10pm

I'm currently trying to analyse the missigness of data in about 480 xlsx-files built in a similar pattern.
For this purpose I initially imported the xlsx-files and have put them in a list.

Steps I've taken so far:
I imported all the data of interest

# Importing all relevant files
file.list <- list.files(path, pattern = '*.xlsx')
nested.list <- map(file.list, read_xlsx)

I prepared some name lists

# Name lists
colnames <- c("colname1", ..."colname13"
name_list_1 <- c("somename1", ... "somename10")
name_list_2 <- c("somename1_2", ... "somename10_2")

I prepared a function to do the adjusting work for all the files.
Here I want (among others [...]) to

add 2 columns with names from two character lists and
add a column with information about the NA's

adjust <- function(list) {

colnames <- c("colname1", ..."colname13"
name_list_1 <- c("somename1", ... "somename10")
name_list_2 <- c("somename1_2", ... "somename10_2")

     for (i in seq_along(list))  {
      
      result <- list[[i]] %>%
[...]
        setNames(colnames) %>%
        slice(1:444) %>%
        mutate_at(c(4:13), as.numeric) %>%
        mutate(name1 = map(list[[i], name_list_1),
               name2 = map(list[[i], name_list_2),
               missing = pmap(list[[i]], ~rowSums(is.na(.)) 
      
    return(result)
    }
  }

I applied the function on the nested list

nested.list <- map(nested.list, adjust)

Ideally, I would have an output which, for every line and list, contains information about the number of missing values and two columns with thame names in the lists name_list_1 and name_list_2 - which I don't get up to here.

When I remove the following snippet from the function code

 mutate(name1 = map(list[[i], name_list_1),
           name2 = map(list[[i], name_list_2),
           missing = pmap(region_m01.list[[i]], ~rowSums(is.na(.))

I get the result in the attached picture, with the desired adjustments defined in the remaining code.
[Edit 2020-06-30T19:14:00Z ]: I'm in shock right now : every line of code inside my function seems to be copying the values of the first list into all the other 9 lists! Can someone tell me why this is happening? ]

For some reason I'm not able to add the before mentioned columns name1, name2, and missing.

I tried to find some workaround by indexing, using append(), cbind() , map() with ~something. etc. but none of it worked and after all I'd really like to find a neat and "tidy" solution to this.

Can anybody help?

francisbarton · June 30, 2020, 10:34pm

Hi @GS45191 and welcome.

It's not clear to me what your desired output is from the function.
The data you're describing is relatively large and complex - it would be much easier to help if you could provide a smaller and more concise example, as well as evidence of what you've tried already.

The reprex process is really useful for this, as it shows exactly what you've done and what the output was, in a format that other people can access. And - bonus! - you can solve your own problem in the making of it, as often as not.

Here's a few things that might help:

you can set the column type (numeric etc) in your read_xlsx call, using the col_types argument, if all the Excel files have the same structure.
try breaking down your "adjust" function into smaller functions. Easier to debug.
you shouldn't need to explicitly state the return(). R will automatically return the last thing the function produced.
purrr::map works great for iterating over lists. You might want to use this in your function instead of seq_along and all that stuff
in your function where you attempt the mutate, you have got broken brackets (list[[i]) - this needs to be list[[i]]
you've created two vectors, name_list_1 and name_list_2 but then it looks like you've tried to pass these to map as functions. I don't think this will work. It's not clear to me what you're trying to do with these first two mutates.
I might have misunderstood what you're trying to do, but I don't think you need the pmap call in your function. rowSums is already going to cover all the columns. Use pmap when you need to map over multiple lists of variables with a function.

Is this something like what you want?

library(dplyr, quietly = TRUE, warn.conflicts = FALSE)
library(purrr)


data_list <- list(
  tibble(
  a = sample(c(0:9, NA), 6, replace = TRUE),
  b = sample(c(0:9, NA), 6, replace = TRUE),
  c = sample(c(0:9, NA), 6, replace = TRUE)
  ),
  tibble(
  a = sample(c(0:9, NA), 6, replace = TRUE),
  b = sample(c(0:9, NA), 6, replace = TRUE),
  c = sample(c(0:9, NA), 6, replace = TRUE)
  ),
  tibble(
  a = sample(c(0:9, NA), 6, replace = TRUE),
  b = sample(c(0:9, NA), 6, replace = TRUE),
  c = sample(c(0:9, NA), 6, replace = TRUE)
  )
)

rename_cols <- function(df) {
  # use vectorised functions like paste0 to create vectors more easily
  rename_with(df, ~ paste0("colname", 1:length(df)))
}

add_missing <- function(df) {
  df %>% 
    mutate(missing = rowSums(is.na(.)))
}

data_list2 <- data_list %>% 
  map(rename_cols) %>% 
  map(add_missing)

data_list2
#> [[1]]
#> # A tibble: 6 x 4
#>   colname1 colname2 colname3 missing
#>      <int>    <int>    <int>   <dbl>
#> 1       NA        6        3       1
#> 2        2        1        5       0
#> 3        5        9        5       0
#> 4        1        8        3       0
#> 5        0        9        7       0
#> 6        3        1       NA       1
#> 
#> [[2]]
#> # A tibble: 6 x 4
#>   colname1 colname2 colname3 missing
#>      <int>    <int>    <int>   <dbl>
#> 1        5        5        6       0
#> 2        2        4        1       0
#> 3        4        8        1       0
#> 4        6        4        4       0
#> 5        9        3        6       0
#> 6        2        8        6       0
#> 
#> [[3]]
#> # A tibble: 6 x 4
#>   colname1 colname2 colname3 missing
#>      <int>    <int>    <int>   <dbl>
#> 1        5        8        5       0
#> 2        6        0        1       0
#> 3       NA       NA        6       2
#> 4        9        8       NA       1
#> 5        4        2        1       0
#> 6        3        6        6       0

^{Created on 2020-06-30 by the reprex package (v0.3.0)}

GS45191 · July 2, 2020, 1:28pm

Hi @francisbarton, thank you for your insightful reply!
It's obvious I didn't do my homework, sorry! I will take your suggestion to heart and use reprex in the future (as soon as I learn more about it).

My main goal is to get information about the number of missing values in the data I've imported - row wise, colum wise, list wise and even over all!

Since some older lists I'm suppose to orient myself contain columns with names of my two name lists, I want to add these on my lists as well.

Settings in read_xlsx
I was aware about the settings inside read_xlsx, but because of technical issues (could nut run code as admin...) I opted for a workaround.

Meanwhile, I changed my code for reading the data like this:

[...]
col_types <- c(rep("text", 3), rep("numeric", 10))

nested.list <-
  list.files(pattern = "*.xlsx") %>% 
  map(~read_xlsx(., skip = 5, na = ".", col_names = colnames, col_types = col_types))

Name columns
Unfortunately, there is no such order logic behind name_list_1 and name_list_2.

In general, I seem to be unable to create tidy functions such as yours (again: need to do the homework, I know! But it's urgent right now and I don't have the necessary time for it )

How do I turn for example the following code in to a pretty function I can map on the nested.list?

# Adding columns with names of lists
nested.list[[1]] <- nested.list[[1]] %>% slice(1:444) %>% mutate(name_list_1 = paste(name_list_1[[1]])) %>% mutate(name_list_2 = paste(name_list_2[[1]]))
nested.list[[2]] <- nested.list[[2]] %>% slice(2:445) %>% mutate(name_list_1 = paste(name_list_1[[2]])) %>% mutate(name_list_2 = paste(name_list_2[[2]]))
nested.list[[3]] <- nested.list[[3]] %>% slice(2:445) %>% mutate(name_list_1 = paste(name_list_1[[3]])) %>% mutate(name_list_2 = paste(name_list_2[[3]]))
nested.list[[4]] <- nested.list[[4]] %>% slice(2:445) %>% mutate(name_list_1 = paste(name_list_1[[4]])) %>% mutate(name_list_2 = paste(name_list_2[[4]]))
nested.list[[5]] <- nested.list[[5]] %>% slice(2:445) %>% mutate(name_list_1 = paste(name_list_1[[5]])) %>% mutate(name_list_2 = paste(name_list_2[[5]]))
nested.list[[6]] <- nested.list[[6]] %>% slice(2:445) %>% mutate(name_list_1 = paste(name_list_1[[6]])) %>% mutate(name_list_2 = paste(name_list_2[[6]]))
nested.list[[7]] <- nested.list[[7]] %>% slice(2:445) %>% mutate(name_list_1 = paste(name_list_1[[7]])) %>% mutate(name_list_2 = paste(name_list_2[[7]]))
nested.list[[8]] <- nested.list[[8]] %>% slice(2:445) %>% mutate(name_list_1 = paste(name_list_1[[8]])) %>% mutate(name_list_2 = paste(name_list_2[[8]]))
nested.list[[9]] <- nested.list[[9]] %>% slice(2:445) %>% mutate(name_list_1 = paste(name_list_1[[9]])) %>% mutate(name_list_2 = paste(name_list_2[[9]]))
nested.list[[10]] <- nested.list[[10]] %>% slice(2:445) %>% mutate(name_list_1 = paste(name_list_1[[10]])) %>% mutate(name_list_2 = paste(name_list_2[[10]]))

Anyway, considering my goal, do you have any other suggestion?

francisbarton · July 2, 2020, 3:27pm

Just quickly: thank you for your positive reply. We're all learning here!
Your way of building the col_types vector using rep is pretty neat.

So when it comes to map, what I often do is get the function working using only the first element of the list, using list_name[[1]]. Then once I am happy with the function, I try mapping it along the whole list.

Whenever I see a load of instructions using [[1]], [[2]], that's a clue that map is going to be helpful. Just like you have here with your example.

I haven't tried it out but I'd guess that those paste() commands in there aren't useful: it doesn't look like they're doing anything. So: take your first line and try to write a function that will give the output you want for that line only.

Because you're using elements from multiple lists, pmap is going to be the tool you want.

Build a tibble of the lists you want to map along

my_lists <- tibble(
  nested.list,
  slices, # useful if they are going to change, not necessary if it's always the same slice
  name_list_1,
  name_list_2
)

All those lists need to be same length (or length 1) in order for a tibble to succeed and for pmap to work.

Write a function

my_mutate <- function(df, slice, list1name, list2name) {
  df %>%
    slice(slice) %>%
    mutate(name1 = list1name) %>%
    mutate(name2 = list2name)
}

and test it out like so:

test_out <- my_mutate(nested.list[[1]], slices[[1]], name_list_1[[1]], name_list_2[[1]])

If that looks good to you then you can just convert it to pmap:

final_out <- pmap(my_lists, ~ my_mutate(..1, ..2, ..3, ..4))

I haven't tested any of this code, just wrote it 'freehand' so there may be mistakes.

francisbarton · July 2, 2020, 3:34pm

If you need to get the total numbers of NAs in each column of a data frame, dplyr::summarise is really useful. Since dplyr 1.0.0 you can also use it with across() to get row-wise summaries, as an alternative to the mutate approach you've used here so far.

You can also use summary() I think, and tools like skimr() can be useful.

I haven't used it but I believe there's a specific package for examining missingness called {naniar}.

nirgrahamuk · July 2, 2020, 4:22pm

the first slice is 1:444 the subsequent 9 are 2:445, is that correct ?

GS45191 · July 3, 2020, 8:33pm

Thank you so much!

For my different slices I've built this tibble:

s1 <- 1:444
s2 <- 2:445

slices <- append(data.frame(s1), data.frame(rerun(9, s2))) %>% as_tibble()

And for my_lists I used lst() instead of tibble()

Then tried out your code and it worked out great!

I also looked up the package naniar. I descrovered some really useful features in there!

GS45191 · July 3, 2020, 8:34pm

Yes, that is correct. The format of the first list differs a bit from the rest.

system · July 10, 2020, 8:34pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.