Using as.data.table to calculate the mean

Micaela · October 2, 2020, 10:11pm

I am trying to modify existing code that calculates the mean of results within a dataframe and then assesses them over different averaging periods. First off, I would really love some help breaking down exactly what the existing chunk of code is doing since I am not familiar with the as.data.table function. Here is the existing code:

W_Objectives<-as.data.table(W_Objectives)[,mean(Result),list(AnalyteName,BeneficialUse
,UnitName,StationCode,ProjectName,SampleDate,MatrixName,FractionName
,TargetLatitude,TargetLongitude,Waterbody,WBID,Wbtype,Objective,AveragingPeroid
,Objective_Language,Evaluation_Guideline,Objective_Ref_Number,Eval_Ref_Number, Comment)]
names(W_Objectives)[names(W_Objectives)=='V1']<-"Result"
W_Objectives<-tbl_df(W_Objectives)

I am trying to pull specific data out of my dataframe to average on a monthly basis instead. I did this like so:

monthlymean<-W_Objectives[W_Objectives$Comment == "monthly mean",]
monthlymean <- monthlymean %>%
  mutate(
    year = year(SampleDate),   # extract parts
    month = month(SampleDate)
  )

My first question is: should I extract the data to be evaluated by a monthly mean prior to running the first chunk?

My second question is how I would calculate a monthly mean using the as.data.table function. I was originally planning to use the aggregate function with month and year, but if there is an easy way to do this using as.data.table it would keep the code more streamlined for future use.

I appreciate any guidance as I am fairly new to R!

nirgrahamuk · October 3, 2020, 12:20pm

Here is the documentation for data.table.
https://cran.r-project.org/web/packages/data.table/vignettes/datatable-intro.html

data.table is useful when you want to emphasise speed of calculation, but many people prefer the easier-to-follow syntax of dplyr when performance isn't the highest concern.
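As a rough illustration of the `DT[i, j, by]` form described in that vignette, here is a minimal sketch on a made-up table. The column names echo the question, but the values are invented. An empty `i` keeps all rows, `j` computes `mean(Result)`, and the third argument lists the grouping columns. When the result of `j` is not named, data.table calls the new column `V1`, which is why the original code renames it afterwards.

```r
library(data.table)

# Toy data, invented for illustration only
toy <- data.frame(
  StationCode = c("A", "A", "B", "B"),
  AnalyteName = c("pH", "pH", "pH", "pH"),
  Result      = c(7.0, 8.0, 6.0, 6.5)
)

# DT[i, j, by]: no row filter, j = mean(Result), grouped by the listed columns
toy_means <- as.data.table(toy)[, mean(Result), list(StationCode, AnalyteName)]

# The unnamed summary column is called V1 by default; rename it to Result
names(toy_means)[names(toy_means) == "V1"] <- "Result"
print(toy_means)
```

Each distinct combination of the grouping columns becomes one row, so rows that share every listed value are averaged together.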

jrkrideau · October 3, 2020, 3:32pm

This seems to be sub-setting a data.frame and converting it to a data.table

W_Objectives <- as.data.table(W_Objectives)[, mean(Result),
  list(AnalyteName, BeneficialUse, UnitName, StationCode, ProjectName,
       SampleDate, MatrixName, FractionName, TargetLatitude, TargetLongitude,
       Waterbody, WBID, Wbtype, Objective, AveragingPeroid,
       Objective_Language, Evaluation_Guideline, Objective_Ref_Number,
       Eval_Ref_Number, Comment)]

I am not sure how it works. I very seldom use data.table, so I may be missing something, but I do not see how one can calculate mean(Result) while converting the data.frame.

names(W_Objectives)[names(W_Objectives)=='V1'] <- "Result"

Here you appear to be renaming the first variable in the data table, but I am not sure if the syntax works.

W_Objectives <- tbl_df(W_Objectives)

Now you are converting the data.table back into a data.frame; a tibble is essentially a data.frame. As far as I can see, the conversion to and from a data.table is redundant.
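For comparison, here is a hedged sketch of the same three steps (group, name the result, convert to a tibble) written more compactly. It assumes the summary column can be named inside `j`, which avoids the `V1` rename, and it uses `tibble::as_tibble()` in place of `tbl_df()`, which recent dplyr versions mark as deprecated. The `toy` data frame is invented for illustration.

```r
library(data.table)
library(tibble)

# Toy data, invented for illustration only
toy <- data.frame(
  StationCode = c("A", "A", "B"),
  SampleDate  = as.Date(c("2020-09-01", "2020-09-01", "2020-09-02")),
  Result      = c(1, 3, 10)
)

# Name the summary column directly in j instead of renaming V1 afterwards
toy_means <- as.data.table(toy)[, .(Result = mean(Result)),
                                by = .(StationCode, SampleDate)]

# Convert the grouped result to a tibble
toy_tbl <- as_tibble(toy_means)
print(toy_tbl)
```

The two rows for station A on the same date collapse into a single averaged row, which is the effect of the conversion to data.table in the original code.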

I am trying to pull specific data out of my dataframe to average on a monthly basis instead. I did this like so:

monthlymean <- W_Objectives[W_Objectives$Comment == "monthly mean" , ]
monthlymean <- monthlymean %>%
  mutate(
    year = year(SampleDate),   # extract parts; year() and month() need library(lubridate)
    month = month(SampleDate)
  )

I am lost. You appear to be pulling something out of the comment column but without seeing some raw data I don't understand what is happening. I am pretty sure that you cannot use mutate here.

Here, very roughly, is what I think your code is doing, in very simple form.

library(tidyverse)
library(data.table)
library(lubridate)

dat1  <-  data.frame(aa = letters[1:5],
                     bb = 1:5,
                     cc = 5:1,
                     dd = LETTERS[5:1])

dtb1  <-  as.data.table(dat1[, c("aa", "bb", "cc", "dd")])

tibb1  <-  tibble(dtb1)

What you seem to be trying to do with the dates is

tt1  <-  ymd("2020-09-12", "2021-08-05")

tt2  <-  year(tt1)
class(tt2)
mean(tt2)

I do not think that is what you intended.

Some data would be nice. Please have a look at

FAQ: How to do a minimal reproducible example (reprex) for beginners


Micaela · October 4, 2020, 11:50pm

Ok, I was unsure whether the first chunk of code would be calculating the mean of results or not, and based on what. The second chunk I included is subsetting the data that has a comment saying to calculate based on a monthly mean, so I am just creating a new dataframe to be evaluated in that manner. I then used mutate to add numeric month, day, and year columns, as I was originally going to use aggregate to calculate the mean based on the month and year columns. I did not want to do this if the mean was already calculated in the first chunk using as.data.table, though.
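One possible way to get a monthly mean in a single data.table call is sketched below, under stated assumptions: the toy values are invented, `SampleDate` is assumed to be a `Date` column, and `year()` and `month()` here are the helpers exported by data.table itself. Because the original call groups by `SampleDate`, it averages only results that share the same date. Grouping by year and month instead averages across the month.

```r
library(data.table)

# Toy data, invented for illustration only
toy <- data.table(
  StationCode = c("A", "A", "A", "B"),
  SampleDate  = as.Date(c("2020-09-01", "2020-09-15", "2020-10-03", "2020-09-20")),
  Result      = c(2, 4, 10, 5),
  Comment     = c("monthly mean", "monthly mean", "monthly mean", "annual mean")
)

# i keeps only rows flagged for monthly averaging;
# by groups on station plus the year and month extracted from SampleDate
monthly <- toy[Comment == "monthly mean",
               .(Result = mean(Result)),
               by = .(StationCode,
                      year  = year(SampleDate),
                      month = month(SampleDate))]
print(monthly)
```

In this sketch the subsetting happens on the raw rows, before any averaging, so the per-date means from the first chunk are not averaged a second time.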

jrkrideau · October 5, 2020, 1:05am

Ok, I was unsure of whether the first chunk of code would be calculating the mean of results or not...and based on what

As far as I can see it does nothing.

The second chunk I included is subsetting data that has a comment to calculate based on monthly mean.

Without data, meaningless.

The second chunk I included is subsetting data that has a comment to calculate based on monthly mean....so I am just creating a new dataframe to be evaluated in that manner.

I do not understand this. You seem to be just creating a new version of the original data.table == data.frame.

I think you have reasoned yourself into a corner. What we need is a very basic statement of the problem.

I then used mutate to add numeric month, day, and year columns as I was originally going to use aggregate to calculate mean based on month and year columns.

I don't think dates work this way but we need an expert to comment.
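As a small check on the date handling in question, here is a base R sketch of the aggregate approach mentioned in the thread. It assumes `SampleDate` is stored as a `Date`, and the values are invented. Year and month are pulled out with `format()` and converted to integers, then `aggregate()` averages `Result` within each year and month combination.

```r
# Toy data, invented for illustration only
toy <- data.frame(
  SampleDate = as.Date(c("2020-09-01", "2020-09-15", "2020-10-03")),
  Result     = c(2, 4, 10)
)

# Extract numeric year and month from the Date values (base R only)
toy$year  <- as.integer(format(toy$SampleDate, "%Y"))
toy$month <- as.integer(format(toy$SampleDate, "%m"))

# Mean of Result for each year/month combination
monthly <- aggregate(Result ~ year + month, data = toy, FUN = mean)
print(monthly)
```

Averaging the extracted year or month numbers themselves would not be meaningful. Here they serve only as grouping keys.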

I also think you need to read up on R.

More seriously, we need a clear statement of the actual research problem. Without a clear idea of why you need these statistics, we are lost.
