Need some help with a function to read multiple csv files and calculate the mean() of columns ignoring NA

NobelRobin · December 11, 2019, 10:42am

pollutantmean <- function(directory, pollutant, ind)
  {
  directory <- "specdata"
  z <- list.files("directory")
  print(z)
  ind
  i <- ind[1]
  totaal <- read.table(z[1])
  hulpspec <- data.frame()
  for (i in ind){
    i <- i+1
    hulpspec <- read.table(z[i])
    totaal <- rbind(totaal,hulpspec)
  }
  mean(totaal[ind], na.rm == FALSE)
}
**This gave me the following error message on Windows 10 Home edition when executing "pollutantmean("specdata","sulfate", 1:10)":**
" Error in file(file, "rt") : invalid 'description' argument 
4.
file(file, "rt") 
3.
read.table(file = file, header = header, sep = sep, quote = quote, 
    dec = dec, fill = fill, comment.char = comment.char, ...) 
2.
read.csv(z[1]) at pollutantmean.R#4
1.
pollutantmean("specdata", "sulfate", 1:10) "

What did I do wrong? Thanks a lot for any help.
Nobel

FJCC · December 11, 2019, 2:01pm

When you call

pollutantmean("specdata","sulfate", 1:10)

the variable directory within pollutantmean gets the value "specdata". So,

There is no need to set the value of directory to be "specdata" within the function.
When you call

z <- list.files("directory")

the term directory should not be in quotes.
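To illustrate the quoting point, here is a small sketch that uses a temporary folder with made-up file names (not the poster's data). It also shows that list.files() returns bare file names unless full.names = TRUE is set, which matters when the files are read from a different working directory. If the quoted call returns nothing, z[1] is NA, which is a plausible cause of the "invalid 'description' argument" error.

```r
# Hypothetical demo folder; the file names are made up for illustration.
directory <- file.path(tempdir(), "specdata_demo")
dir.create(directory, showWarnings = FALSE)
invisible(file.create(file.path(directory, c("001.csv", "002.csv"))))

# Quoted: looks for a folder literally named "directory"
# (empty result if no such folder exists in the working directory)
quoted <- list.files("directory")

# Unquoted: uses the value stored in the variable
bare_names <- list.files(directory)                     # file names only
full_paths <- list.files(directory, full.names = TRUE)  # directory + file name

print(quoted)
print(bare_names)
print(full_paths)
```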

Also, it is a bad idea to manually increment the i variable within the for loop. Let the for loop do the incrementing.
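A minimal sketch of what the manual increment does: R assigns i from the sequence again at the start of every pass, so changing i inside the body does not skip iterations, it only shifts the index used in that pass. The variable visited is a hypothetical helper for recording the values.

```r
ind <- 1:3
visited <- integer(0)

for (i in ind) {
  i <- i + 1                # affects this pass only; the loop resets i next time
  visited <- c(visited, i)  # record the index that would be used for z[i]
}

print(visited)  # 2 3 4, so the first element is skipped and one past the end is used
```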

Finally, I do not think you want to write

mean(totaal[ind], na.rm == FALSE)

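because ind is a vector of file indices, so totaal[ind] selects columns by position rather than the pollutant column. Don't you want to use the parameter pollutant there? Also, na.rm == FALSE is a comparison, not an argument setting, and your title says you want to ignore NA, so use na.rm = TRUE.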
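The difference can be seen on a tiny made-up data frame (the values are invented for illustration, only the column names follow the thread):

```r
x <- c(1.5, NA, 2.5)
mean_with_na <- mean(x)                    # NA, because one value is missing
mean_without_na <- mean(x, na.rm = TRUE)   # 2, the mean of 1.5 and 2.5

df <- data.frame(sulfate = x, nitrate = c(NA, 4, 6))
by_position <- df[1:2]       # a data frame holding columns 1 and 2
by_name <- df[, "sulfate"]   # a numeric vector holding only the sulfate column

print(mean(by_name, na.rm = TRUE))
```

mean() expects a numeric vector, so selecting the column by name is what makes the final call work.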

I would write the function like this.

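pollutantmean <- function(directory, pollutant, ind)
  {
  # full.names = TRUE returns paths that include the directory,
  # so the files can be read from any working directory
  z <- list.files(directory, full.names = TRUE)
  totaal <- data.frame()
  for (i in ind){
    # read.csv assumes comma-separated files with a header row
    hulpspec <- read.csv(z[i])
    totaal <- rbind(totaal,hulpspec)
  }
  # select the column by name and ignore missing values
  mean(totaal[, pollutant], na.rm = TRUE)
}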

I have not tested that, since I do not have your data, so it may have mistakes.
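One way to check a function like this without the original data is to write a couple of small files to a temporary folder and compare the result with a value worked out by hand. The sketch below makes assumptions that are not confirmed in the thread: the files are comma-separated with a header row (hence read.csv), full paths are requested with full.names = TRUE, and the folder name and values are invented.

```r
# Sketch of the function; read.csv and full.names = TRUE are assumptions
pollutantmean <- function(directory, pollutant, ind) {
  z <- list.files(directory, full.names = TRUE)
  totaal <- data.frame()
  for (i in ind) {
    totaal <- rbind(totaal, read.csv(z[i]))
  }
  mean(totaal[, pollutant], na.rm = TRUE)
}

# Hypothetical test data written to a temporary folder
demo_dir <- file.path(tempdir(), "specdata_check")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(sulfate = c(1, NA, 3), nitrate = c(NA, 2, 4)),
          file.path(demo_dir, "001.csv"), row.names = FALSE)
write.csv(data.frame(sulfate = c(5, 7, NA), nitrate = c(6, NA, 8)),
          file.path(demo_dir, "002.csv"), row.names = FALSE)

result <- pollutantmean(demo_dir, "sulfate", 1:2)
print(result)  # 4: the mean of 1, 3, 5 and 7, with the two NA values dropped
```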

martin.R · December 11, 2019, 2:44pm

Here is an alternative solution that reads every file in the directory, so it does not need the ind variable.

This assumes that "sulfate" is the name of a column:

library(tidyverse)

pollutantmean_1 <- function(directory, col) {
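  list.files(directory, full.names = TRUE) %>% # full paths, so the files are found
    map_dfr(read.table, header = TRUE, sep = ",") %>% # assumes csv with a header row; change sep as required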
    summarise(mean({{col}}, na.rm = TRUE)) %>% 
    pull()
}
pollutantmean_1("specdata", sulfate) # no quotes for the column name


pollutantmean_2 <- function(directory, col) {
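  list.files(directory, full.names = TRUE) %>% # full paths, so the files are found
    map_dfr(read.table, header = TRUE, sep = ",") %>% # assumes csv with a header row; change sep as required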
    summarise(mean(!!sym(col), na.rm = TRUE)) %>% 
    pull()
}
pollutantmean_2("specdata", "sulfate") # quotes for the column name
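For readers without the tidyverse installed, roughly the same idea can be sketched in base R. This swaps in a different technique (lapply plus do.call(rbind, ...)) and is not the code from this post; pollutantmean_base, the demo folder and its values are hypothetical, and comma-separated files with a header row are assumed.

```r
# Base R sketch: read every csv file in a folder and average one column
pollutantmean_base <- function(directory, col) {
  files <- list.files(directory, pattern = "\\.csv$", full.names = TRUE)
  all_rows <- do.call(rbind, lapply(files, read.csv))  # stack all files row-wise
  mean(all_rows[[col]], na.rm = TRUE)                  # col is a quoted column name
}

# Hypothetical test data written to a temporary folder
base_dir <- file.path(tempdir(), "specdata_base")
dir.create(base_dir, showWarnings = FALSE)
write.csv(data.frame(sulfate = c(2, NA), nitrate = c(1, 3)),
          file.path(base_dir, "001.csv"), row.names = FALSE)
write.csv(data.frame(sulfate = c(4, 6), nitrate = c(NA, 5)),
          file.path(base_dir, "002.csv"), row.names = FALSE)

sulfate_mean <- pollutantmean_base(base_dir, "sulfate")
print(sulfate_mean)  # 4: the mean of 2, 4 and 6
```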

NobelRobin · December 13, 2019, 9:15am

Thanks for replying, I learned a lot from your solutions.
