How to set a seed for dplyr SQL?

rbcavanaugh · May 30, 2023, 6:02pm

I am using dbplyr/dplyr to query a redshift database.

I have a query that looks like this:

tbl(con, "table") %>%
     mutate(random_number = sql("RANDOM()"))

or this

tbl(con, "table") %>%
     mutate(random_number = random())

How can I set a seed so that the results are returned predictably? I see in the redshift documentation I would write:

set seed to .25;
select cast (random() * 100 as int);
int4

Is there a straightforward way to incorporate this into my dplyr chain?

thanks!

technocrat · May 31, 2023, 6:50am

I must not understand well. This snippet shows one way to make it predictable, but there's no advantage over just using a constant.

rando <- function() 
  {set.seed(42)
  return(sample(100:500,1))
}
rando()
#> [1] 148
rando()
#> [1] 148
(head(mtcars[1:5,1:5]) |> dplyr::mutate(random_number = rando()))
#>                    mpg cyl disp  hp drat random_number
#> Mazda RX4         21.0   6  160 110 3.90           148
#> Mazda RX4 Wag     21.0   6  160 110 3.90           148
#> Datsun 710        22.8   4  108  93 3.85           148
#> Hornet 4 Drive    21.4   6  258 110 3.08           148
#> Hornet Sportabout 18.7   8  360 175 3.15           148

^{Created on 2023-05-30 with reprex v2.0.2}

rbcavanaugh · May 31, 2023, 3:18pm

This only works if you are working with data locally - not in a db backend.

technocrat · May 31, 2023, 11:02pm

redshift must work differently from other backend setups like MySQL, then, where a con streams to the local process.

system · June 21, 2023, 11:02pm

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.