How can I write a loop in R for the following problem?

london42 · August 18, 2022, 12:14pm

I want to write a for loop for my problem. I want to do column-based normalization within each year group, so the loop should first filter by year, then normalize each column (with my call lapply(tmp[2:3], function(tmp) bestNormalize(tmp, standardize=TRUE, quiet = TRUE))), then move on to the next year, and so on. I want to save the results to a list. My data look like this:

Year	Score 1	Score 2
2012	34	45
2012	41	46
2013	31	44
2013	44	33
2014	35	56
2014	42	21

I wrote this, but it only gives me the final year. I am a newbie and could not find an example similar to my case. Can someone help me?

i=2012
for (i in 1:3){
  tmp = newdf[newdf$Year==i+2011,]
  abc = lapply(tmp[2:3], function(tmp) bestNormalize(tmp , standardize=TRUE, quiet = TRUE))
  print(abc)

}

FactOREO · August 18, 2022, 1:21pm

Hello,

Could you give some more details or the desired outcome? Do you want to standardize (e.g. subtract the mean and divide by the standard deviation) all values from the data.frame (Score 1 and Score 2 pooled together) within a given year? Or standardize Score 1 and Score 2 separately within each year group?

I assume the result would be a list with a data.frame for every year, containing 2 columns (Score 1 and Score 2 normalized?). But maybe you can clarify it a bit, so I can think of an optimal solution.

Thanks and kind regards

FactOREO · August 18, 2022, 1:49pm

As a first try, maybe this is what you want:

library(collapse)
#> collapse 1.8.6, see ?`collapse-package` or ?`collapse-documentation`
#> 
#> Attaching package: 'collapse'
#> The following object is masked from 'package:stats':
#> 
#>     D

data <- data.frame(
  year = c(2012,2012,2013,2013,2014,2014),
  score_1 = c(34,41,31,44,35,42),
  score_2 = c(45,46,44,33,56,21)
)

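# Placeholder for the real bestNormalize(): a plain z-score.
# The standardize and quiet arguments are accepted but not used here.
bestNormalize <- function(x, standardize = TRUE, quiet = TRUE) {
  result <- (x - mean(x)) / sd(x)
  return(result)
}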

data |>
  fgroup_by(year) |>
  fmutate(
    score_1_norm = bestNormalize(score_1),
    score_2_norm = bestNormalize(score_2)
  ) |>
  fungroup() |>
  rsplit(~ year)
#> $`2012`
#>   score_1 score_2 score_1_norm score_2_norm
#> 1      34      45   -0.7071068   -0.7071068
#> 2      41      46    0.7071068    0.7071068
#> 
#> $`2013`
#>   score_1 score_2 score_1_norm score_2_norm
#> 1      31      44   -0.7071068    0.7071068
#> 2      44      33    0.7071068   -0.7071068
#> 
#> $`2014`
#>   score_1 score_2 score_1_norm score_2_norm
#> 1      35      56   -0.7071068    0.7071068
#> 2      42      21    0.7071068   -0.7071068

Created on 2022-08-18 by the reprex package (v2.0.1)

The result is a list, named with the corresponding years. The given values are normalized using the defined function.
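For readers without collapse installed, the same shape of result (a list named by year, one data.frame per year) can be sketched in base R with split() and lapply(). This is only a minimal sketch: z_score is a hypothetical stand-in for the real normalization, and it returns just the normalized columns rather than the original columns plus the new ones.

```r
# Same toy data as in the post above.
data <- data.frame(
  year = c(2012, 2012, 2013, 2013, 2014, 2014),
  score_1 = c(34, 41, 31, 44, 35, 42),
  score_2 = c(45, 46, 44, 33, 56, 21)
)

# Hypothetical stand-in for the real normalization: a plain z-score.
z_score <- function(x) (x - mean(x)) / sd(x)

# split() cuts the score columns into one data.frame per year;
# the inner lapply() normalizes every column inside each piece.
by_year <- split(data[c("score_1", "score_2")], data$year)
result <- lapply(by_year, function(df) as.data.frame(lapply(df, z_score)))

str(result)
```

Because split() names the pieces after the grouping values, a single year can be pulled out with result[["2013"]].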

Kind regards

nirgrahamuk · August 18, 2022, 1:53pm

I found I had to set the out_of_sample parameter to FALSE because, with only 2 entries per variable per year, there was insufficient data for the k-fold cross-validation. I also used $x.t to extract just the transformed data.

library(tidyverse)
library(bestNormalize)
in_df <- tribble(
  ~Year, ~Score1, ~Score2,
  2012, 34, 45,
  2012, 41, 46,
  2013, 31, 44,
  2013, 44, 33,
  2014, 35, 56,
  2014, 42, 21
)

in_df |> group_by(Year) |>
  summarise(across(starts_with("Score"),
                   ~bestNormalize(.x, quiet = TRUE,
                                  out_of_sample = FALSE)$x.t))

# A tibble: 6 x 3
# Groups:   Year [3]
   Year Score1 Score2
  <dbl>  <dbl>  <dbl>
1  2012 -0.707 -0.707
2  2012  0.707  0.707
3  2013 -0.707  0.707
4  2013  0.707 -0.707
5  2014 -0.707  0.707
6  2014  0.707 -0.707
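A side note on why every value above is ±0.707: with only two distinct observations in a group, a z-score always comes out as ±1/sqrt(2), because sd(c(a, b)) equals abs(a - b)/sqrt(2). The short check below assumes the final step of the normalization is a plain z-score; z_score is a hypothetical helper, not part of bestNormalize.

```r
# Hypothetical helper: plain z-score standardization.
z_score <- function(x) (x - mean(x)) / sd(x)

# A few of the per-year pairs from the example data.
pairs <- list(c(34, 41), c(45, 46), c(31, 44), c(56, 21))

# Absolute z-scores for each pair (one column per pair).
z_abs <- sapply(pairs, function(p) abs(z_score(p)))

print(z_abs)
print(1 / sqrt(2))
```

So with two rows per year the normalized values carry no information beyond which of the two scores is larger; the approach becomes more meaningful with more observations per group.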

london42 · August 18, 2022, 2:13pm

Hi, thank you. A "list with a data.frame for every year, containing 2 columns" is exactly what I want, but each score needs to be normalized within each year group.

FactOREO · August 18, 2022, 3:16pm

Then you can use the code above, or the same approach with bestNormalize (I didn't know this was an actual package):

library(collapse)
#> collapse 1.8.6, see ?`collapse-package` or ?`collapse-documentation`
#> 
#> Attaching package: 'collapse'
#> The following object is masked from 'package:stats':
#> 
#>     D
library(bestNormalize)

data <- data.frame(
  year = c(2012,2012,2013,2013,2014,2014),
  score_1 = c(34,41,31,44,35,42),
  score_2 = c(45,46,44,33,56,21)
)

data |>
  fgroup_by(year) |>
  fsummarise(
    score_1 = bestNormalize(score_1, out_of_sample = FALSE, quiet = TRUE)$x.t,
    score_2 = bestNormalize(score_2, out_of_sample = FALSE, quiet = TRUE)$x.t) |>
  rsplit(~ year)
#> $`2012`
#>      score_1    score_2
#> 1 -0.7071068 -0.7071068
#> 2  0.7071068  0.7071068
#> 
#> $`2013`
#>      score_1    score_2
#> 1 -0.7071068  0.7071068
#> 2  0.7071068 -0.7071068
#> 
#> $`2014`
#>      score_1    score_2
#> 1 -0.7071068  0.7071068
#> 2  0.7071068 -0.7071068

Created on 2022-08-18 by the reprex package (v2.0.1)

The result is a list (as required), with the standardized outputs for every year group (the same values as above). As @nirgrahamuk mentioned, out_of_sample = FALSE has to be set as well.
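Since the original question asked for an explicit for loop, here is a minimal sketch of that version too. The loop in the question assigns to the same variable abc on every iteration, so only the last year survives; storing each result in a list slot keyed by year avoids that. normalize_col is a hypothetical helper that uses a plain z-score so the example runs without extra packages.

```r
data <- data.frame(
  year = c(2012, 2012, 2013, 2013, 2014, 2014),
  score_1 = c(34, 41, 31, 44, 35, 42),
  score_2 = c(45, 46, 44, 33, 56, 21)
)

# Hypothetical helper (plain z-score). With the bestNormalize package
# installed, the body could instead be:
# bestNormalize(x, standardize = TRUE, out_of_sample = FALSE, quiet = TRUE)$x.t
normalize_col <- function(x) (x - mean(x)) / sd(x)

years <- unique(data$year)
results <- vector("list", length(years))  # one empty slot per year
names(results) <- years

for (yr in years) {
  # Keep only the rows of the current year and the two score columns.
  tmp <- data[data$year == yr, c("score_1", "score_2")]
  # Store under the year's name instead of overwriting one variable.
  results[[as.character(yr)]] <- as.data.frame(lapply(tmp, normalize_col))
}

results[["2013"]]
```

Looping over unique(data$year) rather than 1:3 also means the code keeps working if years are added or missing.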

Kind regards
