The function that finds previous values exists: it's called lag() (there is one in base R and another one in dplyr), but it's not a good fit here: it shifts the whole vector rather than returning the n values just before a given position.
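To make the mismatch concrete, here is a minimal illustration of dplyr's lag(): it returns the whole vector shifted by n positions and padded with NA, not a window of previous values:

```r
library(dplyr)

# lag() shifts the full vector by n positions, padding with NA
lag(1:5, n = 2)
# [1] NA NA  1  2  3
```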
We can create our own:
lag_n <- function(x, n = 1){
  nb_elements <- length(x)
  x[(nb_elements - n):(nb_elements - 1)]
}
lag_n(1:10, 1)
# [1] 9
lag_n(1:10, 4)
# [1] 6 7 8 9
Or a slightly more advanced version that lets us choose an end point:
lag_n <- function(x, end = length(x) - 1, n = 1){
  nb_elements <- length(x)
  stopifnot(end <= nb_elements)
  x[(end + 1 - n):end]
}
lag_n(1:10)
# [1] 9
lag_n(1:10, n = 4)
# [1] 6 7 8 9
lag_n(1:10, end=7, n=3)
# [1] 5 6 7
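One caveat (my addition, not part of the original function): if end < n there aren't enough earlier values, and the resulting mix of positive and negative indices produces an error. A slightly more defensive sketch could guard against that:

```r
# hypothetical defensive variant of lag_n
lag_n_safe <- function(x, end = length(x) - 1, n = 1){
  nb_elements <- length(x)
  stopifnot(end <= nb_elements,
            n <= end)  # need at least n values up to 'end'
  x[(end + 1 - n):end]
}

lag_n_safe(1:10, end = 7, n = 3)
# [1] 5 6 7
```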
Now, for your project, I'm worried about one aspect: you seem to assume all years have the same number of days and hours, but 8 of the 34 years are leap years with 366 days. So your split() should give incorrect results (with an error that accumulates with each leap year). The package lubridate provides tools to work with dates and date-times.
Let's suppose you only know that the measures were taken every day between 1986-01-01 and 2019-12-31. We can create a sequence of dates with the appropriate leap days:
library(lubridate)
seq_dates <- seq.Date(from = as_date("1986-01-01"),
                      to   = as_date("2019-12-31"),
                      by   = "day")
View(seq_dates)
If we assume there is no daylight saving time (so every day has 24 hours), we can create a simple vector of hours with 1:24. If you need to take time-zone changes into account, you could also generate date-times directly (that can get tricky though; see the lubridate help and that book chapter for details).
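As a rough sketch of the date-time route (assuming UTC, so no daylight-saving jumps), an hourly sequence over the whole period can be generated directly:

```r
library(lubridate)

# one date-time per hour over the whole period, in UTC
seq_hours <- seq(from = as_datetime("1986-01-01 00:00:00", tz = "UTC"),
                 to   = as_datetime("2019-12-31 23:00:00", tz = "UTC"),
                 by   = "hour")
length(seq_hours)
# [1] 298032
```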
In R, the most powerful data structure is probably the data.frame (or its tidyverse equivalent, the tibble); storing our data in that format gives us access to many useful tools. We use expand_grid() to automatically generate all the combinations of date and hour.
library(tidyverse)
rain_data <- expand_grid(date = seq_dates,
                         hour = 1:24)
Note that we have here 298,032 measurements, vs 34*365*24 = 297,840
if we forget leap days.
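The difference is exactly the leap days; we can double-check with lubridate's leap_year():

```r
library(lubridate)

sum(leap_year(1986:2019))  # leap years in the period
# [1] 8
8 * 24                     # extra hourly rows from leap days
# [1] 192
298032 - 297840
# [1] 192
```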
So now that we have the basic data.frame structure, we can insert the actual measurements (if the dates come with your data, it's better to use them directly) and extract the year from the date.
# dummy measurements, use the ones you have
P <- rpois(n = 298032, lambda = 5)
rain_data <- rain_data %>%
  mutate(rain = P,
         year = year(date))
Finally, with all the data in one place, we can move on to the processing. Rather than explicitly splitting the values, we can use group_by() to perform each computation per year. For example, if we want the max, the day of the max, and the mean for each year:
rain_data %>%
  group_by(year) %>%
  summarize(max_rain = max(rain),
            mean_rain = mean(rain),
            day_of_max = date[which.max(rain)],
            nb_hours_in_year = n()) %>%
  View()
And of course you can plot the whole thing:
rain_data %>%
  group_by(date) %>%
  summarize(mean_per_day = mean(rain)) %>%
  ggplot() +
  geom_line(aes(x = date, y = mean_per_day))
So now we get to the crux of the problem: for each year, we want to find the max and the n values immediately before it. Let's first look at the corresponding hours and days, to check that it works as expected:
# note: end = which.max(rain) - 1, so the window excludes the max itself
rain_data %>%
  group_by(year) %>%
  summarize(day_of_max = date[which.max(rain)],
            hour_of_max = hour[which.max(rain)],
            previous_hours = lag_n(hour,
                                   end = which.max(rain) - 1,
                                   n = 5),
            previous_days = lag_n(date,
                                  end = which.max(rain) - 1,
                                  n = 5)) %>%
  View()
It does look like the result we expect. So we can make the actual computation:
previous <- rain_data %>%
  group_by(year) %>%
  summarize(previous = lag_n(rain,
                             end = which.max(rain) - 1,
                             n = 5))
Alright, we did it. We get a data.frame in "long" format, with the previous observations stacked in one column. We can put it in wide format if we prefer:
previous %>%
  add_column(hour_before_max = paste0("t-", rep(5:1, 34), "h")) %>%
  pivot_wider(names_from = "hour_before_max",
              values_from = "previous")
That format is handy if you want to convert to a matrix, for example:
mat_previous <- previous %>%
  add_column(hour_before_max = paste0("t-", rep(5:1, 34), "h")) %>%
  pivot_wider(names_from = "hour_before_max",
              values_from = "previous") %>%
  ungroup() %>%
  select(-year) %>%
  as.matrix()
dim(mat_previous)
# [1] 34 5
mat_previous[1:3, 1:3]
# t-5h t-4h t-3h
# [1,] 9 8 4
# [2,] 3 4 5
# [3,] 7 3 5
But if you want to model, there is one more advanced trick that is very useful:
previous <- rain_data %>%
  group_by(year) %>%
  summarize(previous = list(lag_n(rain,
                                  end = which.max(rain) - 1,
                                  n = 5)),
            hours_before_max = list(5:1))
Here we store the results as list-columns, so we can pick up the vectors directly from those lists and hand them to a modeling function:
previous %>%
  mutate(mod = map2(hours_before_max, previous, ~ lm(.y ~ .x)),
         coef = map_dbl(mod, ~ .x$coefficients[2]))