how to sum a specific column from multiple csv files

rks004 · January 7, 2022, 8:43am

Hello Everyone,
As a coding newbie, I am finding difficult to add a specific column from almost 3500 different CSV files. Each csv files contains 4 columns (ID, x (Long), y (Lat), and Value (Population)). I want to add the "value" column of 3500 csv files and make separate CSV indicating their sum and also its corresponding csv file name.
CSV file Structure:
|ID|X(Lon)|Y(lat)|value|
|4|76.54881139|9.02487862|2588.940186|
|5|76.55790669|9.024840158|2604.013184|
|6|76.56700194|9.024801471|2826.956543|
|7|76.57609716|9.02476256|2546.4375|
|8|76.58519233|9.024723423|2185.00708|
|9|76.59428747|9.024684061|2459.988525|
|10|76.60338256|9.024644474|2826.956787|
|11|76.61247761|9.024604662|2826.957031|
|12|76.62157262|9.024564625|2826.956543|

JackDavison · January 7, 2022, 9:28am

I'd use purrr for this:

Here's an example (note that I've put my three files in a folder in My Documents called "csv-files").

library(tidyverse)

# see the files
read_csv("~/csv-files/file1.csv", show_col_types = F)
#> # A tibble: 5 x 2
#>       x     y
#>   <dbl> <dbl>
#> 1     1     5
#> 2     2    10
#> 3     3    15
#> 4     4    20
#> 5     5    25
read_csv("~/csv-files/file2.csv", show_col_types = F)
#> # A tibble: 5 x 2
#>       x     y
#>   <dbl> <dbl>
#> 1     1     2
#> 2     2     4
#> 3     3     6
#> 4     4     8
#> 5     5    10
read_csv("~/csv-files/file3.csv", show_col_types = F)
#> # A tibble: 5 x 2
#>       x     y
#>   <dbl> <dbl>
#> 1     1     3
#> 2     2     6
#> 3     3     9
#> 4     4    12
#> 5     5    15

# function to summarise file
sum_file = function(path){
  
  dat = read_csv(path, show_col_types = F)
  
  tibble(file = path,
         sum = sum(dat$y, na.rm = T))
  
}

# list files
files = list.files("csv-files", pattern = ".csv", full.names = T)

# summarise all
summary = map_dfr(files, sum_file)

summary 
#> # A tibble: 3 x 2
#>   file                  sum
#>   <chr>               <dbl>
#> 1 csv-files/file1.csv    75
#> 2 csv-files/file2.csv    30
#> 3 csv-files/file3.csv    45

write_csv(summary, "~/summary_file.csv")

^{Created on 2022-01-07 by the reprex package (v2.0.1)}

rks004 · January 7, 2022, 10:59am

Thank you so much @JackDavison. God Bless you. How to save the extracted map_dfr values into a single csv file.

JackDavison · January 7, 2022, 11:11am

Ah yes, I forgot about that part!

I'd do something like:

summary = map_dfr(files, sum_file)

readr::write_csv(summary, "~/summary_file.csv")

Where "~/summary_file.csv" can be replaced with some other path to wherever you want to save your csv file.

rks004 · January 7, 2022, 12:01pm

Thank you again @JackDavison . I have a mix of NA and numerical values in some of the CSV files . Do I need to make any changes in the above code?

JackDavison · January 7, 2022, 12:14pm

Yes you would, instead of:

sum(dat$y)

You'd write:

sum(dat$y, na.rm = TRUE)

This tells sum() to ignore the NAs in your data.

rks004 · January 7, 2022, 3:24pm

You are a lifesaver.

system · January 28, 2022, 3:25pm

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.