Progress Bars in a dplyr Pipeline?

robynmeadows · May 23, 2025, 3:29pm

Hello!

I've been down a rabbit hole on this for hours and have made no progress (pun intended) so I thought I would see if anyone has ideas for this.

I work with fairly large data frames (>100k rows) and am often running dplyr pipelines for data cleaning that can take a while to run. When I'm running for-loops, I have progress bars set up and it has helped a ton. I am now trying to figure out if there is a way I can get a progress bar integrated in with my dplyr pipelines so that I can see how much longer it needs to run.

I will note -- I do NOT understand functions super well and I am honestly still relatively new to tidyverse in general, so more explanation is better!

As an example, I'll be using the flights data from the nycflights13 package.
The code I have written here is complete nonsense.
I just tried to add as much stuff as I could so it takes a second to run instead of being virtually instantaneous so that a progress bar would actually like... work.

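library(nycflights13)  # flights data
library(dplyr)         # mutate(), arrange(), filter(), case_when()
library(lubridate)     # make_datetime(), interval(), ymd(), %within%

flights %>%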
  mutate(departure = make_datetime(year, month, day, hour, minute))  %>%
  arrange(desc(departure)) %>%
  mutate(Q1_flag = ifelse(departure %within% interval(start=ymd("2013-01-01"),end=ymd("2013-03-31")),1,0)) %>%
  mutate(Q2_flag = ifelse(departure %within% interval(start=ymd("2013-04-01"),end=ymd("2013-06-30")),1,0)) %>%
  mutate(Q3_flag = ifelse(departure %within% interval(start=ymd("2013-07-01"),end=ymd("2013-09-30")),1,0)) %>%
  mutate(Q4_flag = ifelse(departure %within% interval(start=ymd("2013-10-01"),end=ymd("2013-12-31")),1,0)) %>%
  mutate(month_name = case_when(
    month==1 ~ "January",
    month==2 ~ "February",
    month==3 ~ "March",
    month==4 ~ "April",
    month==5 ~ "May",
    month==6 ~ "June",
    month==7 ~ "July",
    month==8 ~ "August",
    month==9 ~ "September",
    month==10 ~ "October",
    month==11 ~ "November",
    month==12 ~ "December"
  )) %>%
  mutate(sun_sign = case_when(
    departure %within% interval(start=ymd("2013-01-20"),end=ymd("2013-02-18")) ~ "Aquarius",
    departure %within% interval(start=ymd("2013-02-19"),end=ymd("2013-03-20")) ~ "Pisces",
    departure %within% interval(start=ymd("2013-03-21"),end=ymd("2013-04-19")) ~ "Aries",
    departure %within% interval(start=ymd("2013-04-20"),end=ymd("2013-05-20")) ~ "Taurus",
    departure %within% interval(start=ymd("2013-05-21"),end=ymd("2013-06-20")) ~ "Gemini",
    departure %within% interval(start=ymd("2013-06-21"),end=ymd("2013-07-22")) ~ "Cancer",
    departure %within% interval(start=ymd("2013-07-23"),end=ymd("2013-08-22")) ~ "Leo",
    departure %within% interval(start=ymd("2013-08-23"),end=ymd("2013-09-22")) ~ "Virgo",
    departure %within% interval(start=ymd("2013-09-23"),end=ymd("2013-10-22")) ~ "Libra",
    departure %within% interval(start=ymd("2013-10-23"),end=ymd("2013-11-21")) ~ "Scorpio",
    departure %within% interval(start=ymd("2013-11-22"),end=ymd("2013-12-21")) ~ "Sagittarius",
    TRUE ~ "Capricorn",
  )) %>%
  filter(carrier %in% c("AA","DL","UA","WN")) %>%
  arrange(carrier) %>%
  filter(dest %in% c("BUR","LAX","SNA","LGB")) %>%
  filter(origin == "JFK") %>%
  filter(dep_delay<0) %>%
  filter(dep_time<1200 & dep_time>600) -> df

Now obviously this actual pipeline only takes like a second to run -- but humor me and pretend it's taking several minutes.

What I'm trying to figure out is how to incorporate a progress bar into this pipeline so that as it's running, I can see something like this:

|========= > ---------------| 48%

Is this possible? Does this exist?

Thank you in advance!

AlexisW · May 26, 2025, 3:33pm

That seems quite hard to do with a pipeline like this one: for a progress bar, you (usually) need to know the total number of operations, so that you can update the progress bar every time you've run 1% of the total (for example).

So this works quite well if you have a single loop: you know before starting how many iterations you'll have, and it's quite easy to know what iteration number you're currently running. That can be done in a for loop, for example with {progressr} or {progress}, or it can be done easily in tidyverse with map() functions.
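As a minimal sketch of the single-loop case (the loop body, the 50 iterations, and the format string are placeholders I made up for illustration), a {progress} bar is created once with the total number of iterations and ticked once per iteration; purrr's map functions can do the same through their .progress argument:

```r
library(progress)
library(purrr)

n <- 50  # number of iterations, known before the loop starts

# total must match the number of tick() calls
pb <- progress_bar$new(total = n, format = "[:bar] :percent eta: :eta")

results <- numeric(n)
for (i in seq_len(n)) {
  Sys.sleep(0.02)     # stand-in for slow work on item i
  results[i] <- i^2
  pb$tick()           # advance the bar by one step
}

# Same idea with purrr (needs purrr >= 1.0.0 for .progress)
squares <- map_dbl(seq_len(n), \(i) {
  Sys.sleep(0.02)
  i^2
}, .progress = TRUE)
```

Note that the bar only shows up in an interactive session, and only if the whole loop takes long enough to be worth drawing.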

With a pipeline, it's harder: each step is not aware of the other steps, so they don't really know where in the total pipeline they are. There is also a big question of whether one of the steps is slow and all the others fast, or if all the steps are about the same duration.

If all the steps were the same duration, you could split it and update between each step:

x <- flights
cat("#")
x <- mutate(x, departure = make_datetime(year, month, day, hour, minute))
cat("#")
x <- arrange(x, desc(departure))
cat("#")
...

(you would create a progress bar with pb <- progress::progress_bar$new(total = ...) and call pb$tick() instead of the cat() commands)
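Spelled out with {progress}, that could look roughly like the sketch below. I only kept three steps from the original pipeline, so total = 3 is specific to this example and has to be changed to match however many tick() calls you end up with:

```r
library(dplyr)
library(lubridate)
library(nycflights13)
library(progress)

# One tick per pipeline step; total must equal the number of steps
pb <- progress_bar$new(total = 3, format = "[:bar] :current/:total steps")

x <- flights
x <- mutate(x, departure = make_datetime(year, month, day, hour, minute))
pb$tick()
x <- arrange(x, desc(departure))
pb$tick()
x <- filter(x, carrier %in% c("AA", "DL", "UA", "WN"))
pb$tick()
```

The bar has to be created before the first step, and it advances by one step per tick() regardless of how long that step actually took.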

But this approach would be meaningless if some steps are dominating the total time. Take this pipeline:

wait <- function(x, t){
  Sys.sleep(t)
  return(x)
}

flights %>%
  wait(.01) %>%
  wait(.01) %>%
  wait(.01) %>%
  wait(10) %>%
  wait(.01) %>%
  wait(.01) %>%
  wait(.01) %>%
  wait(.01)

You have 7 operations that take 10 ms each, and 1 operation taking 10 seconds. For a progress bar to be meaningful, it would have to only apply to the wait(10) step, not the others.
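To make the imbalance visible, here is a rough sketch that times each step separately with system.time(). The delays vector is a scaled-down stand-in for the pipeline above (1 second instead of 10 for the slow step, so it runs quickly), and x is just a placeholder for the data frame:

```r
wait <- function(x, t) {
  Sys.sleep(t)
  x
}

# Seven fast steps and one slow one (slow step shortened to 1 s for the demo)
delays <- c(0.01, 0.01, 0.01, 1, 0.01, 0.01, 0.01, 0.01)

x <- 1:10  # placeholder for the data frame
elapsed <- numeric(length(delays))
for (i in seq_along(delays)) {
  # elapsed wall-clock time of step i, in seconds
  elapsed[i] <- system.time(x <- wait(x, delays[i]))[["elapsed"]]
}

# Share of the total run time spent in each step, in percent
share <- round(100 * elapsed / sum(elapsed))
share
```

A bar that ticks once per step would sit near 38% for almost the whole run and then jump to 100%.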

The code you shared appears relatively slow because the dataset is big, but it's not easy to add a progress bar to it: the pipeline mixes slow and fast steps, and you don't know a priori which steps are the slow ones.

If however the slow operation is a loop (and that's usually the case for slow code), then you can much more easily put a progress bar on/in the loop itself.

robynmeadows · May 27, 2025, 12:42pm

I see, this totally makes sense! I can sort of figure out which steps are taking longer than others just as I'm building my code -- in my actual code, I am filtering one variable to be between two dates, and then another variable to be between two dates, and I know that step is taking up most of the time because the pipeline was running quickly before I added those two filter statements.

If I did know which steps were taking all the time in advance and I split them up as you described, would that code look something like this?

x <- flights
progress::pb$tick()
x <- mutate(x, departure = make_datetime(year, month, day, hour, minute))
progress::pb$tick()
x <- arrange(x, desc(departure))
progress::pb$tick()
...
x <- filter(x, carrier %in% c("AA", "DL", "UA", "WN"))
progress::pb$tick()
x <- arrange(x, carrier)
progress::pb$tick()
x <- filter(x, dest %in% c("BUR", "LAX", "SNA", "LGB"))
progress::pb$tick()
...

Or is there more setup work with other functions in the progress package? Again, please excuse my ignorance!

And thank you so much!

AlexisW · May 28, 2025, 5:02pm

To figure out which step is slow, you can use the profiler (in RStudio, go to the menu "Profile > Profile Selected Line(s)"; without RStudio, use profvis::profvis()), see here for details.

From what you describe, the slow step is inside a filter(). So, my point from the previous post is that any tick() outside the filter() will be uninformative: you will jump from ~0% to ~50% with the first filter, and from ~50% to ~100% with the second one. In principle, you would need to have tick() inside the filter(), which is not easily possible.
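One workaround, if the filter really can't be made faster, is to turn it into a loop yourself: cut the data into chunks, filter each chunk, and tick between chunks. This is only a sketch on toy data; the 20 chunks, the column name date, and the date range are assumptions for illustration, not something taken from your real code:

```r
library(dplyr)
library(progress)

# Toy data standing in for the real data frame
set.seed(1)
my_df <- tibble(date = as.Date("2000-01-01") + sample(0:7000, 1e5, replace = TRUE))
start_date <- as.Date("2005-02-13")
end_date   <- as.Date("2015-06-20")

# Cut the rows into contiguous chunks (20 is an arbitrary choice)
n_chunks   <- 20
chunk_size <- ceiling(nrow(my_df) / n_chunks)
chunk_id   <- ceiling(seq_len(nrow(my_df)) / chunk_size)
chunks     <- split(my_df, chunk_id)

# One tick per chunk, so the bar moves while the filter is running
pb <- progress_bar$new(total = length(chunks), format = "[:bar] :percent")
filtered <- vector("list", length(chunks))
for (i in seq_along(chunks)) {
  filtered[[i]] <- filter(chunks[[i]], date > start_date, date < end_date)
  pb$tick()
}

# Stitch the filtered chunks back together in their original order
filtered_df <- bind_rows(filtered)
```

The chunking adds overhead of its own, so the total run time will usually get a bit worse. It also only works for row-wise operations like filter(), where each chunk can be processed independently.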

I think in your situation the first question is: why is it so slow? Can you rewrite it to make it a bit faster, so the progress bar becomes irrelevant?

For example, here filtering 10,000,000 dates takes about 0.26 s:

library(tidyverse)
my_df <- tibble(date = as.Date(runif(1e7, min = 0, max = 20000)))

start_date <- as.Date("2005-02-13")
end_date <- as.Date("2015-06-20")

bench::mark(
  filtered_df <- my_df |>
    filter(date > start_date,
           date < end_date)
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 1 × 6
#>   expression                             min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                           <bch> <bch:>     <dbl> <bch:byt>    <dbl>
#> 1 filtered_df <- filter(my_df, date >… 262ms  265ms      3.78     214MB     5.67

Created on 2025-05-28 with reprex v2.1.1

That doesn't seem too slow. How come your filtering takes that much longer? If you use %in%, that's a lot slower. Can you avoid it somehow, maybe with a join? See for example:

library(tidyverse)
my_df <- tibble(date = as.Date(as.character(as.Date(runif(1e3, min = 0, max = 20000)))))


start_date <- as.Date("2005-02-13")
end_date <- as.Date("2015-06-20")


dates_range <- seq(from = start_date, to = end_date, by = 1)

dates_range_df <- data.frame(date = dates_range)

bench::mark(
  lt = {
    my_df |>
      filter(date > start_date,
             date < end_date)
  },
  `in` = {
    my_df |>
      filter(date %in% dates_range)
  },
  join = {
    my_df |>
      semi_join(dates_range_df,
                by = "date")
  }
)
#> # A tibble: 3 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 lt          882.4µs  940.4µs     1037.    1.25MB    17.7 
#> 2 in           5.06ms   5.54ms      179.  661.98KB     4.16
#> 3 join         1.19ms   1.26ms      769.  711.86KB    12.7

Created on 2025-05-28 with reprex v2.1.1