Hello! I'm new to R, sorry if I can't explain well!
I am trying the following:
In table_target, I have start and end columns. I am trying to check which table_base rows (toFind column) fall between these start and end values. At the end I want a tibble where each row is a match, pairing the table_target and table_base rows.

My solution, using a loop, is below. However, the loop takes too long. Is there another way?

library(tidyverse)

results <- tibble()

# This is just to show a progress status
progress <- seq(1, nrow(table_target), 1000)

# Loop start!
for (i in 1:nrow(table_target)) {

  # Get the respective start and end values
  start_number <- as.numeric(table_target$start[i])
  end_number <- as.numeric(table_target$end[i])

  # Filter table_base$toFind with the start and end values
  searching <- table_base %>% filter(toFind >= start_number & toFind <= end_number)

  # Test the results and, if it works, append to the results tibble
  if (nrow(searching) != 0) {
    results <- results %>% bind_rows(searching)
  }

  # Just show the progress
  if (i %in% progress) {
    cat("\014")
    cat("\n", round(i / nrow(table_target) * 100), "%")
  }
}

The key is map_dfr, which loops over a list of targets and applies a function to each element of the list.
The function tests whether the list element falls within start and end. map_dfr returns a single data frame with all results combined by rows.
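As a minimal toy illustration of how map_dfr combines per-element results (not the question's data):

```r
library(purrr)

# map_dfr() applies the function to each element and row-binds the
# one-row data frames it returns into a single data frame.
squares <- map_dfr(1:3, function(x) data.frame(n = x, sq = x^2))
squares
#   n sq
# 1 1  1
# 2 2  4
# 3 3  9
```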

library(tidyverse)

DF1 <- data.frame(start = runif(1000, min = 1000, max = 9999)) %>%
  mutate(end = start + runif(1000, min = 100, max = 999))
head(DF1)
     start       end
1 9364.815 10184.800
2 6560.811  7521.991
3 8422.298  9179.108
4 9088.313  9274.061
5 4865.341  5725.937
6 7239.143  7922.612

DF2 <- data.frame(toFind = runif(100, min = 1000, max = 9999))
head(DF2)
    toFind
1 1370.765
2 6881.400
3 7265.235
4 5341.615
5 5405.194
6 6282.752

-- Mapping function --
Note that I flipped the test: toFind >= start_number is the same as start_number <= toFind. (With continuous random data, the strict < used below behaves the same in practice.)

find_targets <- function(find) {
  result <- DF1 %>% filter(start < find & end > find)
  result$toFind <- find
  return(result)
}

When I plugged in your numbers, 732,854 rows for target and 47,773 for base, my Mac ran out of memory.

I changed the data.frame to data.table and it processed one tenth of that (73,285 and 4,777), generating 30 million rows. So it may or may not work with your numbers. Maybe your actual data yields fewer base values between start and end. You could also split your data and run it in parallel on a larger machine.

Anyway, here's the data.table code:

library(tidyverse)
library(data.table)

DF1 <- data.table(start = runif(73285L, min = 1000, max = 9999))
DF1[, end := start + runif(.N, min = 100, max = 999)]  # one random offset per row
DF2 <- data.table(toFind = runif(4777L, min = 1000, max = 9999))

find_targets <- function(find) {
  result <- DF1[start < find & end > find]
  result$toFind <- find
  return(result)
}
DF3 <- map_dfr(DF2$toFind, find_targets)

It could be that there is no result; adding a column to an empty result could then give an error. You could test for an empty result before adding the $toFind column.
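As a side note (my suggestion, not something tested in this thread): data.table also supports non-equi joins, which express the same interval match in one vectorized call instead of a per-element map. A minimal sketch with toy stand-ins for DF1 and DF2:

```r
library(data.table)

# Toy stand-ins: DF1 holds intervals, DF2 holds values to locate
DF1 <- data.table(start = c(1000, 5000))
DF1[, end := start + 500]
DF2 <- data.table(toFind = c(1200, 5100, 9000))

# Non-equi join: for each toFind, keep interval rows with start <= toFind <= end.
# x.start/x.end recover the original interval bounds (the join would otherwise
# overwrite them with the toFind values); nomatch = NULL drops unmatched values.
DF3 <- DF1[DF2,
           .(start = x.start, end = x.end, toFind = i.toFind),
           on = .(start <= toFind, end >= toFind),
           nomatch = NULL]
DF3
```

With the toy data, 1200 and 5100 each fall in one interval and 9000 in none, so DF3 has two rows. Whether this fits in memory for the real row counts depends on how many matches there are, same as with map_dfr.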

So, I was working with tibbles from the beginning, so I had no problem with data.frame speed or memory consumption. I was able to solve:

Error in $<-.data.frame ( *tmp* , toFind, value = 91194674) :
replacement has 1 row, data has 0

by adding an "if" to prevent it from adding anything to the results when no match was found. I ran my data without memory issues, and it was at least 3 minutes faster than my original loop:

find_targets <- function(find) {
  result <- table_target %>% filter(start <= find & end >= find)

  if (nrow(result) > 0) {
    result$find <- find
  }
  return(result)
}

a <- Sys.time()
test <- map_dfr(table_base$toFind, find_targets)
print(Sys.time() - a)
Time difference of 7.743048 mins

nrow(test) # matches
[1] 7524

I will now try to parallelize this to speed it up further, although I have no idea yet how to do it!
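One hedged sketch of how that parallelization could look (not tested in this thread): the furrr package provides future_map_dfr(), a drop-in parallel replacement for purrr::map_dfr(). This assumes furrr is installed and uses toy stand-ins for the thread's tables:

```r
library(dplyr)
library(furrr)
plan(multisession, workers = 2)  # spin up background R sessions

# Toy stand-ins for the thread's table_target / table_base
table_target <- data.frame(start = c(1000, 5000), end = c(1500, 5500))
table_base   <- data.frame(toFind = c(1200, 5100, 9000))

# Same function as above, with the empty-result guard
find_targets <- function(find) {
  result <- filter(table_target, start <= find & end >= find)
  if (nrow(result) > 0) result$find <- find
  result
}

# Drop-in parallel replacement for map_dfr()
test <- future_map_dfr(table_base$toFind, find_targets)
nrow(test)  # 2 of the 3 toy values fall inside an interval
```

furrr exports the needed globals to the workers automatically; whether this actually beats the sequential map depends on how expensive each filter call is relative to the per-chunk transfer overhead.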