map for importing many stata files - advice

Hello,
I've tested this code and It works

library(haven)
library(tidyverse)

statas<-list.files(path = "C:/Users/JB/Downloads",pattern ="^ff-.*\\.dta$",full.names = TRUE)
archives<-map_dfr(statas,read_dta,.id = "source")

The issue is that's useful if I want to loop over each one of the stata files (over 100) in order to retrieve a tabulate as:

archives %>%
group_by(source, var1, var2) %>%
summarise(numbers=n()) %>%
pivot_wider(names_from=c(source,var2),values_from=numbers)

This question is because every stata Filesize is near 200 Mb...Imagine your RAM joining 200 MB files.
Using Stata I just ran a loop to read every dta, and I was never storing the previous one.

What would you do?
I hope I made myself clear.

As always, thanks for your time, interest, and patience

Then you should ask yourself, if -- potentially -- you can work on merged dataset itself.

Why do not apply the same approach here? Work on the files subsequently, processing them one after one? Or perhaps you can get a subset of data (particular columns) from the source files lowering the demand of resources? Or compute required values in loop for i in statas and rbind them in another dataframe? Or extract the stata files, wrote them to any text format you like and merge them outside using cat or awk?

There is a few options I would consider.

Regards,
Grzegorz

1 Like

Hello,
Using the same focus of declaring a for i=1 to ... It doesn't seem a purrr-base approach.
I don't know how just importing only the columns I need with purrr.
Maybe I can merge some selected variables. Maybe my RAM will resist...maybe.

This is what I wrote:

for (i in 1:100) {
  
  
 data<-read_dta(list_files[i])  
  
            tabx<-data %>% 
            group_by(var1,var2) %>% 
            summarise(var_y=sum(var3)/1000)  
                                  
            if (i==1) {
              outputs=tabx
            } else { 
             output=bind_rows(output, tabx)
              }

  }

I think It works ok. I don't know if It is inefficient.
The other method was using assign and then declaring "mget" of every tabulate.
Any other suggestion will be appreciated.
Thanks, community

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.