Cut an interval in several when it intersects with point values

AlexisW · February 17, 2022, 4:30am

I have a data frame that contains ranges (defined by their start and end), and a second data frame that contains cut points. There can be 0, 1, or several cut points within each range. Whenever a cut point falls within a range, I want to separate that range in two rows, using the cut point as end and start of the new rows.

Illustration with fake data:

suppressPackageStartupMessages(library(tidyverse))

source_df <- tibble(id = LETTERS[1:4],
                    start = as.integer(10*(1:4)),
                    end = as.integer(10*(1:4)+5))

cutting_df <- tibble(cutpoint = c(12L, 23L, 32L,33L))



expected_result <- tibble(id    = c("A1","A2","B1","B2","C1","C2","C3","D"),
                          start = c( 10L, 12L, 20L, 23L, 30L, 32L, 33L, 40L),
                          end   = c( 12L, 15L, 23L, 25L, 32L, 33L, 35L, 45L))

source_df
#> # A tibble: 4 x 3
#>   id    start   end
#>   <chr> <int> <int>
#> 1 A        10    15
#> 2 B        20    25
#> 3 C        30    35
#> 4 D        40    45
cutting_df
#> # A tibble: 4 x 1
#>   cutpoint
#>      <int>
#> 1       12
#> 2       23
#> 3       32
#> 4       33
expected_result
#> # A tibble: 8 x 3
#>   id    start   end
#>   <chr> <int> <int>
#> 1 A1       10    12
#> 2 A2       12    15
#> 3 B1       20    23
#> 4 B2       23    25
#> 5 C1       30    32
#> 6 C2       32    33
#> 7 C3       33    35
#> 8 D        40    45

plot(c(1,50), c(0,3), col="white")
for(i in seq_len(nrow(source_df))){
  lines(c(source_df[i, "start"], source_df[i, "end"]), c(2,2))
  text((source_df[i, "end"]-source_df[i, "start"])/2 + source_df[i, "start"], 2.2, source_df[i, "id"])
}
for(i in seq_len(nrow(cutting_df))){
  points(cutting_df$cutpoint[i], 1.5, col = "darkred")
}
for(i in seq_len(nrow(expected_result))){
  jitter <- runif(1,max = .2)
  lines(x = c(expected_result[i, "start"], expected_result[i, "end"]),
        y = c(.75,.75) + jitter,
        col = rainbow(9)[i])
  text(x = (expected_result[i, "end"]-expected_result[i, "start"])/2 + expected_result[i, "start"],
       y = .5+jitter,
       labels = expected_result[i, "id"],
       col = rainbow(9)[i])
}

The numbers are guaranteed to be all integers, my data frames have at least 10s of thousands, potentially millions of rows, so performance is a concern.

I feel like there should be a dplyr/tidyverse solution, but I can't figure it out.

Side note: actually, the real data is genomic coordinates, so I'm also exploring IRanges/GRanges solutions. But in that case too, I can't come up with an approach that doesn't require looping over the data frames several times. And since I'm doing the other analyses with the tidyverse, a dplyr solution would be more elegant.

nirgrahamuk · February 17, 2022, 10:27am

suppressPackageStartupMessages(library(tidyverse))

source_df <- tibble(
  id = LETTERS[1:4],
  start = as.integer(10 * (1:4)),
  end = as.integer(10 * (1:4) + 5)
)

cutting_df <- tibble(cutpoint = c(12L, 23L, 32L, 33L))


expected_result <- tibble(
  id = c("A1", "A2", "B1", "B2", "C1", "C2", "C3", "D"),
  start = c(10L, 12L, 20L, 23L, 30L, 32L, 33L, 40L),
  end = c(12L, 15L, 23L, 25L, 32L, 33L, 35L, 45L)
)

(df0 <- expand_grid(source_df, cutting_df) %>% rowwise() %>% filter(between(cutpoint, start, end)))
(df1 <- group_by(
  df0,
  id, start, end
) %>% summarise(points = list(cutpoint)))
# (df1 <- group_by(df0,id) %>%


mycutf <- function(start, end, points) {
  labs <- levels(cut(seq(from = start, to = end, by = 1),
                     breaks = c(start, points, end),
                     include.lowest = TRUE))
  data.frame(str_split_fixed(labs, pattern = ",", 2)) %>%
    map_dfc(parse_number) %>%
    mutate_all(as.integer) %>%
    set_names("lower", "upper")
}
# test
mycutf(30, 35, c(32, 33))


(df2 <- transmute(df1,
  id = id,
  s1 = list(mycutf(start, end, unlist(points)))
))

(df3 <- unnest(df2, cols = s1) %>% group_by(id) %>% transmute(
  id = paste0(id, row_number()),
  start = lower,
  end = upper
))
(missingdf <- source_df %>% filter(!id %in% unique(pull(df2, id))))
(df4 <- union_all(df3, missingdf) %>% ungroup() %>% arrange(1, 2, 3))

identical(expected_result, df4)

AlexisW · February 17, 2022, 4:48pm

Wow, that is a clever use of extracting the levels of cut()!

It works perfectly, thank you!

nirgrahamuk · February 17, 2022, 4:56pm

Glad it helps. I found it a challenging request, and therefore interesting. I made at least two naive attempts at solving that didnt work, before I settled on the cut approach.

system · February 24, 2022, 4:57pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.