I have 100 .hdf5 files in a folder. I want to read them, extract some data and then combine those data in 1 data frame (from all 100). .hdf5 files can be read using rhdf5 library in R.
My current code
Using the for-loop I can achieve my objective as follows:
library(rhdf5)
temp = list.files(pattern = "\\.hdf5$") # pattern is a regular expression, not a glob
df_list = list() # initialize a list
# Read all files into a list of data frames
for (i in temp){
## read the "data" group from the given file
data <- h5read(file = i, name = "data")
### extract the SCC_Follow_Info dataset
df <- data$SCC_Follow_Info
df <- as.data.frame(df)
## assign to the list
df_list[[i]] <- df
}
# Combining all data to 1 data frame ---------------------------
library(data.table)
df_sim <- data.table::rbindlist(df_list, idcol = "file.ID")
Question
Is there a purrr way to achieve my objective here? I read somewhere about this topic but can't seem to find it. It would be great if you could share a blog post doing something similar.
Thanks for your answer. This is very useful. However, I am wondering if this code can be extended for extracting multiple data objects from the .hdf5 file. For example, if I want to extract 5 more objects like SCC_Follow_Info and then finally combine them. I don't want to use h5read multiple times.
I do exactly this with NetCDF files (which, as of version 4, are actually interoperable with HDF5 files):
library(tidyverse)
library(rhdf5)
# first, here's our extractor function. you can use it anonymously
# inside map_dfr; i'm separating it out here for clarity (and so you
# can reuse it). the extractor needs to accept a filename and
# return a data frame.
hdf5_extractor = function(fname) {
data = h5read(file = fname, name = "data")
# what you do here depends on how objects inside
# the file are structured. if they're just vectors, you can
# create and return a data frame like this:
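# (tibble() replaces the deprecated data_frame(); naming the
# arguments gives the columns clean names)
return(tibble(
SCC_Follow_Info = data$SCC_Follow_Info,
something_else = data$something_else,
another_thing = data$another_thing))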
# if they aren't vectors, you'll have to think about another way
# to combine them into a data frame...
}
# get the file list and pipe it into our extractor function
df_sim =
list.files(pattern = "\\.hdf5$") %>%
set_names(.) %>%
map_dfr(hdf5_extractor, .id = "file.ID")
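For the case where the objects are not plain vectors, here is a minimal sketch of one possible approach, assuming each object can be coerced with as.data.frame() and that a given object has the same columns in every file. The helpers extract_objects() and combine_by_object() and the reader argument are hypothetical names introduced for illustration; they are not part of rhdf5. Each file is still read only once.

```r
# Hypothetical helper: read a file once and return a named list with
# one data frame per requested object.
extract_objects = function(fname, objects,
                           reader = function(f) rhdf5::h5read(file = f, name = "data")) {
  data = reader(fname)
  lapply(data[objects], as.data.frame)
}

# Hypothetical helper: given a list of per-file results (named by file),
# stack each object across files and record which file each row came from.
combine_by_object = function(per_file, objects) {
  out = lapply(objects, function(nm) {
    pieces = lapply(names(per_file), function(f) {
      cbind(file.ID = f, per_file[[f]][[nm]])
    })
    do.call(rbind, pieces)
  })
  names(out) = objects
  out
}

# Usage sketch (object names other than SCC_Follow_Info are placeholders):
# files    = list.files(pattern = "\\.hdf5$")
# objects  = c("SCC_Follow_Info", "something_else")
# per_file = lapply(files, extract_objects, objects = objects)
# names(per_file) = files
# combined = combine_by_object(per_file, objects)
```

The result is a named list with one combined data frame per object, which avoids forcing objects of different shapes into a single table.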
If you have large HDF5 files and don't need everything from a particular column, you can also modify this function to filter the contents before you return them.
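A rough sketch of what that filtering could look like, assuming SCC_Follow_Info converts cleanly to a data frame. The keep_rows and reader arguments are hypothetical additions for illustration, and the speed column in the usage line is made up; substitute whatever your files actually contain.

```r
# Hypothetical variant of the extractor that drops unwanted rows
# before returning, so less data is carried into the combined result.
hdf5_extractor_filtered = function(fname, keep_rows,
                                   reader = function(f) rhdf5::h5read(file = f, name = "data")) {
  data = reader(fname)
  df = as.data.frame(data$SCC_Follow_Info)
  # keep_rows takes the data frame and returns a logical vector
  df[keep_rows(df), , drop = FALSE]
}

# Usage sketch ("speed" is an assumed column name):
# list.files(pattern = "\\.hdf5$") %>%
#   set_names() %>%
#   map_dfr(hdf5_extractor_filtered,
#           keep_rows = function(d) d$speed > 0,
#           .id = "file.ID")
```

Note that this still reads the whole "data" group from disk; it only reduces what is kept in memory after each file is processed.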
Thanks a lot! This is very easy to understand. However, I am running into another problem now. I know it is different from the original question, but am posting here as the code is the same.
Error with here package
I want to use the here package to locate my files:
> df_sim <- list.files(path = here("data", "raw_data"),
+ pattern="*.hdf5") %>%
+ set_names(.) %>%
+ map_dfr(hdf5_extractor, .id = "file.ID")
Error in h5checktypeOrOpenLoc(file, readonly = TRUE) :
Error in h5checktypeOrOpenLoc(). Cannot open file. File 'C:\Users\durraniu\Google Drive\Dissertation\Cars_20160601_01.hdf5' does not exist.
This is not what I expected. If I run just the first 2 lines, I get the correct output:
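/facepalm Yep, I forgot that list.files() returns bare file names by default, so h5read() looks for them in the working directory instead of in the folder you listed. Add full.names = TRUE to the list.files() call so that full paths are passed to the extractor. If you want to isolate the file name later (in order to extract metadata from it), you can pipe the full names through basename() to remove the path and then tidyr::separate() to turn the delimited filename column into several columns.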
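Here is a small sketch of both points, using a throwaway folder of empty files rather than real HDF5 data. It compares list.files() with and without full.names = TRUE, then splits the file names into metadata columns. The second file name and the column names prefix, date, and run are assumptions made for illustration.

```r
library(tidyr)

# Set up a throwaway folder with two empty files (names are illustrative)
raw_dir = file.path(tempdir(), "raw_data_demo")
dir.create(raw_dir, showWarnings = FALSE)
invisible(file.create(file.path(
  raw_dir, c("Cars_20160601_01.hdf5", "Cars_20160601_02.hdf5"))))

# Default: bare file names, which only resolve from inside raw_dir
short_names = list.files(path = raw_dir, pattern = "\\.hdf5$")
# full.names = TRUE: paths that can be opened from any working directory
full_paths = list.files(path = raw_dir, pattern = "\\.hdf5$", full.names = TRUE)

# Stand-in for the combined data frame, with full paths in file.ID
df_sim = data.frame(file.ID = full_paths, stringsAsFactors = FALSE)

# Drop the directory and the extension, then split the name on "_"
df_sim$file.ID = tools::file_path_sans_ext(basename(df_sim$file.ID))
df_meta = separate(df_sim, file.ID,
                   into = c("prefix", "date", "run"),  # hypothetical names
                   sep = "_")
```

In the real pipeline the same basename() and separate() steps would be applied to the file.ID column produced by map_dfr().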