Merge multiple files and add new column "subject"

jubejube · May 5, 2020, 6:48am

Hi, I've just started to learn RStudio and coding and I'm having some trouble with a few things. I'm trying to merge 20+ files into one data frame and add a column for subject number/ID that identifies which data file each row came from.
e.g. all the data rows from file #1 would be labelled "1" in the subject column, etc.

All of the original data files already have a column labelled "subject". However, this column is blank (our experiment didn't output a subject number, but created a column called subject anyways), so there aren't any subject names in any of the original data files.

I tried implementing solutions from this thread, but I received an error that says "Error: file must be a string, raw vector or a connection."

I already read and merged all the data files using purrr:

data = list.files(path = "data", full.names = T) %>%
  map(read_csv) %>%
  reduce(rbind)

Any help is appreciated!

siddharthprabhu · May 5, 2020, 8:42am

You could do it like this.

library(tidyverse)

df <- list.files(path = "data", full.names = TRUE) %>%
  map_dfr(read_csv, .id = "file_path") %>% 
  group_by(file_path) %>% 
  mutate(subject = group_indices())
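
Because list.files() returns an unnamed vector, .id stores each file's position in that vector (as text) rather than its path, so the subject number can also be taken from that column directly, with no grouping step. Here is a minimal sketch of that variation. The temporary folder, the file contents and the score column are made up for illustration and stand in for the real "data" folder.

```r
library(tidyverse)

# Made-up stand-in for the real "data" folder: three tiny files,
# each with an empty subject column like the originals.
demo_dir <- file.path(tempdir(), "subject_demo_index")
dir.create(demo_dir, showWarnings = FALSE)
for (i in 1:3) {
  write_csv(
    tibble(subject = NA, score = c(i, i * 10)),
    file.path(demo_dir, sprintf("%02d.txt", i))
  )
}

df <- list.files(path = demo_dir, full.names = TRUE) %>%
  # .id holds the position of each file in the vector: "1", "2", "3"
  map_dfr(read_csv, col_types = cols(), .id = "file_path") %>%
  # overwrite the empty subject column with that position as a number
  mutate(subject = as.integer(file_path))
```

Every row read from the first file ends up with subject 1, every row from the second with subject 2, and so on, in the order list.files() returned the files.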

jubejube · May 5, 2020, 10:42am

Hi, thanks for the quick response.
I tried running the code, but I received this error:
Error: group_indices.default() should only be called in a data context

siddharthprabhu · May 5, 2020, 10:45am

Can you please post the exact code you ran? I assume you've replaced "data" with the path to the folder where your files are located.

jubejube · May 5, 2020, 11:22am

Actually, the path to the folder is called "data" too!

I ran this:
subjdata <- list.files(path = "data", full.names = T) %>%
  map_dfr(read_csv, .id = "file_path") %>%
  group_by(file_path) %>%
  mutate(subject = group_indices())

siddharthprabhu · May 5, 2020, 11:29am

I'm unable to reproduce your error. Can you please run only list.files(path = "data", full.names = T) and tell me what output you see in the console?

jubejube · May 5, 2020, 11:57am

list.files(path = "data", full.names = T)
[1] "data/01.txt" "data/02.txt" "data/03.txt" "data/04.txt" "data/05.txt"
[6] "data/06.txt" "data/07.txt" "data/08.txt" "data/09.txt" "data/10.txt"
[11] "data/11.txt" "data/12.txt" "data/13.txt" "data/14.txt" "data/15.txt"
[16] "data/16.txt" "data/17.txt" "data/18.txt" "data/19.txt" "data/20.txt"
[21] "data/21.txt" "data/22.txt" "data/23.txt" "data/24.txt"

siddharthprabhu · May 5, 2020, 1:25pm

The .txt extension suggests your files may not be comma-separated. If they aren't, you should use read_delim() with the matching delimiter, not read_csv(). Do you get a single data frame as output after running the map_dfr(...) statement?
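
As a rough sketch of that change, the reading step might look like the following. It assumes the files are tab-separated, which was never stated in this thread, so delim should be adjusted to whatever the files really use. The temporary folder and the score column are invented for the example.

```r
library(tidyverse)

# Made-up tab-separated files standing in for the real data.
demo_dir <- file.path(tempdir(), "subject_demo_delim")
dir.create(demo_dir, showWarnings = FALSE)
for (i in 1:2) {
  write_tsv(
    tibble(subject = NA, score = c(i, i + 0.5)),
    file.path(demo_dir, sprintf("%02d.txt", i))
  )
}

df <- list.files(path = demo_dir, full.names = TRUE) %>%
  # delim is an assumption here; "\t" means tab-separated
  map_dfr(read_delim, delim = "\t", col_types = cols(), .id = "file_path")
```

If the delimiter is wrong, each file tends to come back as a single wide column, so checking names(df) after this step is a quick sanity test.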

jubejube · May 5, 2020, 6:40pm

Ah okay!
I ran up to the map_dfr() and I received a single data frame output which includes the new column! It's called "file_name" and provides the number of the file ("1", "2", etc.), and my files are numbered anyways.
Thank you!

siddharthprabhu · May 5, 2020, 6:53pm

Okay cool. I didn't know what your files were named, so the group_indices() would help if your files didn't contain a sequence number. Strange that the rest of the code doesn't work for you though.
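
For files whose names do carry the number, as 01.txt, 02.txt and so on turned out to here, another option is to name the vector of paths so that .id records the path itself, and then parse the number out of the file name. This is only a sketch: the temporary files and the score column are invented for illustration, and it assumes every file name is a plain number followed by an extension.

```r
library(tidyverse)

# Made-up files named 02.txt and 10.txt standing in for the real data.
demo_dir <- file.path(tempdir(), "subject_demo_names")
dir.create(demo_dir, showWarnings = FALSE)
for (i in c(2, 10)) {
  write_csv(
    tibble(subject = NA, score = i),
    file.path(demo_dir, sprintf("%02d.txt", i))
  )
}

df <- list.files(path = demo_dir, full.names = TRUE) %>%
  # name each path after itself so that .id keeps the full path
  set_names() %>%
  map_dfr(read_csv, col_types = cols(), .id = "file_path") %>%
  # "…/02.txt" -> "02.txt" -> "02" -> 2
  mutate(subject = as.integer(tools::file_path_sans_ext(basename(file_path))))
```

Taking the number from the file name means the subject ID does not depend on how grouped text values are ordered, where "10" may sort before "2".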

jubejube · May 5, 2020, 7:35pm

Thankfully, it worked out conveniently for my case!

Would it have to do with how my files are named/formatted or the type of variable? (I'm not well-versed in R so might be a newbie question!)

siddharthprabhu · May 6, 2020, 7:00am

Are you sure that the new column is called "file_name"? It should be "file_path" if you used the code I gave. I'd advise you to create a reprex so that we can see exactly what's happening by following this guide.

FAQ: How to do a minimal reproducible example (reprex) for beginners

A minimal reproducible example consists of the following items: a minimal dataset, necessary to reproduce the issue, and the minimal runnable code necessary to reproduce the issue, which can be run on the given dataset and includes the necessary information on the packages used.
