using map_dfr and mutate to read a group of csv files, but filename info is not being retained

chf · January 22, 2022, 5:19pm

Hello, I'm trying to run the code below, which I think is supposed to take a group of 20 csv files and string them together into a single csv. The filenames of the csvs combine one of five scenarios (USG1, USG2, USG3, USG4, and USG5) and separated by a space, one of four discount rates (0.0, 0.01, 0.02, and 0.03) for a total of 20 csvs. The data in each csv records the presence or absence of discontinuities in each of 10,000 monte carlo simulations for each of 9 different years. I put the 20 csv files on github (GitHub - chframe/HFC-social-costs: discontinuities in PAGE integrate assessment model runs). The problem is that the final csv that I get at the end does not have the scenarios (USG1, USG2, etc) recorded in the scenario column of the data frame. Instead the scenario column just has 'NA' all the way down for 1,800,000 rows. I believe everything else came out fine in the final csv

It looks like the code is trying to convert the USG scenarios into more specific names (e.g. "IMAGE", "MERGE Optimistic", etc). I don't need it to do this. But I do need to record the original scenario names in the combined csv. Does anyone have any ideas for why it's doing this? Thanks for your help. Please let me know if there's something I can do to clarify what's happening. Thanks

## Written by: US EPA National Center for Environmental Economics

##########################
#################  LIBRARY
##########################

## Clear worksace
rm(list = ls())
gc()

## This function will check if a package is installed, and if not, install it
list.of.packages <- c('tidyverse','magrittr','stringr')
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])]
if(length(new.packages)) install.packages(new.packages, repos = "http://cran.rstudio.com/")
lapply(list.of.packages, library, character.only = TRUE)
library(data.table)

##########################
################  PREAMBLE
##########################

## List of gases
hfc_list <- c('125','134a','143a')

##########################
##############  START LOOP
##########################

for (hfc in hfc_list) {
  
page_dir    <- paste0("/home/julia-1.6.3/share/julia/MimiIWG-HFC/data/hfc",hfc,"/page/discontinuity_mismatch") # location of file group
page_files  <- fs::dir_ls(page_dir, regexp = "\\.csv$") # create list of .csv files
page        <- page_files %>% 
                  map_dfr(read_csv, .id = "source") %>% # read in files (map), turn into data frame (df), and row bind (r)
                  as.data.table()

##########################
###################  CLEAN
##########################

page %<>% mutate(source = str_remove(source,paste0("data/hfc",hfc,"/page/discontinuity_mismatch/"))) %>%
          mutate(source = str_remove(source,".csv")) %>%
          separate(source, c("scenario","discount_rate"), " ") %>%
          mutate(scenario = case_when(scenario=="USG1" ~ "IMAGE",
                                      scenario=="USG2" ~ "MERGE Optimistic",
                                      scenario=="USG3" ~ "MESSAGE",
                                      scenario=="USG4" ~ "MiniCAM Base",
                                      scenario=="USG5" ~ "5th Scenario"),
                 discount_rate = paste0(as.numeric(discount_rate)*100,'%')) %>%
          group_by(scenario,discount_rate) %>%
          mutate(trial = seq(n()))

## WIDE TO LONG
years <-  paste(seq(2020,2060,5), sep=", ") # vector of years
page %<>% gather(year,discontinuity,all_of(years)) %>%
          mutate(model = 'PAGE 2009')

##########################
####################  SAVE
##########################

write_csv(page, paste0("/home/julia-1.6.3/share/julia/MimiIWG-HFC/data/hfc",hfc,"_page_discontinuity.csv"))

}

## END OF SCRIPT. Have a great day!

HanOostdijk · January 22, 2022, 8:29pm

Please tell us the exact layout of final csv.
And please note that gather is a superseded function in the tidyr package.

chf · January 23, 2022, 12:08am

Thanks for asking. What does it mean that gather is a superseded function?

The 'page' dataframe and the final csv both have 6 columns (scenario, discount_rate, trial, year, discontinuity, and model). There are 1,800,000 rows.

The first 200,000 rows are year 2020, the next 200,000 are year 2025, the next 200,000 are 2030, etc up to the year 2060 (the years are the same 9 column headers for all of the 20 csvs that are being read). Within each year's set of 200,000 rows, the first 10,000 correspond to the 0% discount rate, the next 10,000 correspond to 1% discount rate, the next 10,000 correspond to 2% discount rate, and the next 10,000 correspond to 3% discount rate. This sequence of discount rates changing every 10,000 rows repeats 5 times, I assume to correspond to the 5 different USG scenarios, but they aren't recorded in the 'scenario' column, it just says 'NA' in that column for every row. The numbers in the trial column fall between 1 and 50,000, but I don't completely understand what's happening there . I think that it makes sense for there to be 50,000 trials because these results are from 10,000 monte carlo simulations that were run for each of the 5 USG scenarios. So there were actually 50,000 unique trials.

This is what one row of page looks like when I print it:

> page[40001, ]
# A tibble: 1 × 6
# Groups:   scenario, discount_rate [1]
  scenario discount_rate trial year  discontinuity model    
  <chr>    <chr>         <int> <chr> <lgl>         <chr>    
1 NA       0%            10001 2020  FALSE         PAGE 2009

This is a screenshot of the final csv opened in spreadsheet:

Unfortunately it's 59 MB so I can't load it in one piece on github. If it would help, I can split it up and put a piece on github.

HanOostdijk · January 23, 2022, 9:33pm

See ?gather for some words about 'superseded'.

I think the following will also work for you (I use only two of your github files!) :

suppressPackageStartupMessages(
  {
library(dplyr)
library(purrr)
library(readr)
library(stringr)
library(tidyr)
  }
)

f1 <- "https://raw.githubusercontent.com/chframe/HFC-social-costs/main/USG1%200.0.csv"
f2 <- "https://raw.githubusercontent.com/chframe/HFC-social-costs/main/USG2%200.02.csv"
pagefiles <- c(f1,f2)

do_one_csv <- function (filename) {
   shortf <- stringr::str_extract(filename,"USG.*csv$") 
   shortf <- stringr::str_remove(shortf,"\\.csv$")
   shortf <- stringr::str_replace_all(shortf,"%20"," ")
   df <- read_delim(filename, delim="," ,
               col_types=cols("l","l","l","l","l","l","l","l","l")
               ) %>%
     mutate(source=shortf,
            trial=row_number()) %>%
     separate(source, c("scenario","discount_rate"), " ") %>%
     mutate(scenario = case_when(scenario=="USG1" ~ "IMAGE",
                                      scenario=="USG2" ~ "MERGE Optimistic",
                                      scenario=="USG3" ~ "MESSAGE",
                                      scenario=="USG4" ~ "MiniCAM Base",
                                      scenario=="USG5" ~ "5th Scenario"),
                 discount_rate = paste0(as.numeric(discount_rate)*100,'%')) 
}

page        <- pagefiles %>% 
                  map_dfr(do_one_csv) 

page %>%
  pivot_longer(cols=`2020`:`2060`,names_to="year",values_to = "discontinuity" ) %>%
  mutate(model="PAGE 2009") %>%
  select(scenario,discount_rate,trial,year,discontinuity,model) %>% 
  head(15)
#> # A tibble: 15 x 6
#>    scenario discount_rate trial year  discontinuity model    
#>    <chr>    <chr>         <int> <chr> <lgl>         <chr>    
#>  1 IMAGE    0%                1 2020  FALSE         PAGE 2009
#>  2 IMAGE    0%                1 2025  FALSE         PAGE 2009
#>  3 IMAGE    0%                1 2030  FALSE         PAGE 2009
#>  4 IMAGE    0%                1 2035  FALSE         PAGE 2009
#>  5 IMAGE    0%                1 2040  FALSE         PAGE 2009
#>  6 IMAGE    0%                1 2045  FALSE         PAGE 2009
#>  7 IMAGE    0%                1 2050  FALSE         PAGE 2009
#>  8 IMAGE    0%                1 2055  FALSE         PAGE 2009
#>  9 IMAGE    0%                1 2060  FALSE         PAGE 2009
#> 10 IMAGE    0%                2 2020  FALSE         PAGE 2009
#> 11 IMAGE    0%                2 2025  FALSE         PAGE 2009
#> 12 IMAGE    0%                2 2030  FALSE         PAGE 2009
#> 13 IMAGE    0%                2 2035  FALSE         PAGE 2009
#> 14 IMAGE    0%                2 2040  FALSE         PAGE 2009
#> 15 IMAGE    0%                2 2045  FALSE         PAGE 2009
Created on 2022-01-23 by the reprex package (v2.0.1)

chf · January 26, 2022, 12:11am

Thanks, yes, this worked. And I can understand what this code is doing.

system · February 16, 2022, 12:11am

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.