map for importing many stata files - advice

jfca283 · December 23, 2021, 3:31am

Hello,
I've tested this code and It works

library(haven)
library(tidyverse)

statas<-list.files(path = "C:/Users/JB/Downloads",pattern ="^ff-.*\\.dta$",full.names = TRUE)
archives<-map_dfr(statas,read_dta,.id = "source")

The issue is that's useful if I want to loop over each one of the stata files (over 100) in order to retrieve a tabulate as:

archives %>%
group_by(source, var1, var2) %>%
summarise(numbers=n()) %>%
pivot_wider(names_from=c(source,var2),values_from=numbers)

This question is because every stata Filesize is near 200 Mb...Imagine your RAM joining 200 MB files.
Using Stata I just ran a loop to read every dta, and I was never storing the previous one.

What would you do?
I hope I made myself clear.

As always, thanks for your time, interest, and patience

gsapijaszko · December 23, 2021, 10:44pm

Then you should ask yourself, if -- potentially -- you can work on merged dataset itself.

Why do not apply the same approach here? Work on the files subsequently, processing them one after one? Or perhaps you can get a subset of data (particular columns) from the source files lowering the demand of resources? Or compute required values in loop for i in statas and rbind them in another dataframe? Or extract the stata files, wrote them to any text format you like and merge them outside using cat or awk?

There is a few options I would consider.

Regards,
Grzegorz

jfca283 · December 24, 2021, 1:29am

Hello,
Using the same focus of declaring a for i=1 to ... It doesn't seem a purrr-base approach.
I don't know how just importing only the columns I need with purrr.
Maybe I can merge some selected variables. Maybe my RAM will resist...maybe.

jfca283 · December 24, 2021, 10:12pm

This is what I wrote:

for (i in 1:100) {
  
  
 data<-read_dta(list_files[i])  
  
            tabx<-data %>% 
            group_by(var1,var2) %>% 
            summarise(var_y=sum(var3)/1000)  
                                  
            if (i==1) {
              outputs=tabx
            } else { 
             output=bind_rows(output, tabx)
              }

  }

I think It works ok. I don't know if It is inefficient.
The other method was using assign and then declaring "mget" of every tabulate.
Any other suggestion will be appreciated.
Thanks, community

system · January 14, 2022, 10:13pm

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.