Dplyr::arrange .by_group=TRUE fails with SQL backend

jdlong · April 15, 2018, 2:37am

I've either got a misunderstanding or a bug... I think it's a bug.

It seems that dplyr::arrange() with the .by_group=TRUE parameter set produces SQL with an error. Here's how to reprex it:

On the DB (Redshift in my case) set up a dummy table:

drop TABLE sandbox.testorder; 

CREATE TABLE sandbox.testorder (
    grp varchar(255),
    n  DOUBLE PRECISION
);

INSERT INTO sandbox.testorder (grp , n) VALUES ('a',3.3);
INSERT INTO sandbox.testorder (grp , n) VALUES ('a',1.1);
INSERT INTO sandbox.testorder (grp , n) VALUES ('b',2.2);
INSERT INTO sandbox.testorder (grp , n) VALUES ('b',4.4);

Then from R (presuming a connection to the DB called con and already loaded dbplyr

testorder  <- tbl(con, "testorder")

testorder %>%
  group_by( grp ) %>%
  arrange( n, .by_group=TRUE) ->
out_test

show_query(out_test)

which generates the following SQL:

SELECT *
FROM "testorder"
ORDER BY "n", TRUE

which fails if I try to collect(out_test) with the following error:

Error in postgresqlExecStatement(conn, statement, ...) : 
  RS-DBI driver: (could not Retrieve the result : ERROR:  non-integer constant in ORDER BY

the rub seems to be the ,TRUE there at the end. If I remove it, I get the followable runnable SQL:

SELECT *
FROM "testorder"
ORDER BY "n"

My guess is that the routine that generates the SQL has a glitch. Looks like it's just passing , TRUE instead of adding in the group by variables.

I have not dug into the dbplyr or dplyr code to try and find the magic incantation. Not being familiar with the code base I didn't really feel up to digging into that beast tonight.

If this should go somewhere else, like the dbplyr github issues log, let me know. I wasn't 100% sure where the right place to post is.

mara · April 15, 2018, 1:16pm

Because dbplyr and dplyr are so intricately intertwined, issues for either and/or both go in the dplyr repo issues (dbplyr doesn't have an option to file issues, so that makes the decision pretty easy).

In general, if you're pretty sure it's a bug (and, as a general note to all, it's always totally fine to ask here first), then filing in the GitHub repos is the way to go, since we can track things there, and there's all sorts of helpful GitHub link-i-ness for closing issues when the code's been fixed etc.

Looks like you've got the reprex and description ready and raring to go, but I'm linking to the Filing issues guide from the tidyverse page for posterity.

edgararuiz · April 15, 2018, 1:22pm

Hi, I'm just trying to understand why this is not sufficient:

testorder %>%
  arrange(grp, n)

jdlong · April 15, 2018, 1:52pm

Edgar, in practice that is sufficient. However in reading the documents, I interpreted them as telling me that in order to sort by both the group bys and also the values, I needed the .by_group=TRUE.

From the docs:

.by_group If TRUE, will sort first by grouping variable. Applies to grouped data frames only.

source: Order rows using column values — arrange • dplyr

I have not tested the sort behavior as it relates to the group_by columns.

jdlong · April 15, 2018, 1:58pm

thanks. I popped this into the issue tracker. Apologies for the cross posting.

mara · April 15, 2018, 3:19pm

No problem, can you just link to the issue here as well? Thanks.

jdlong · April 15, 2018, 3:37pm

raised as an issue on Github:

github.com/tidyverse/dplyr

`dplyr::arrange()` with the `.by_group=TRUE` parameter set produces SQL with an error

opened 01:56PM - 15 Apr 18 UTC

closed 11:47AM - 25 Jun 18 UTC

CerebralMastication

bug

This is a cross post from the RStats Community, where I initially posted it: htt…ps://community.rstudio.com/t/dplyr-arrange-by-group-true-fails-with-sql-backend/7232 I've either got a misunderstanding or a bug... I *think* it's a bug. It seems that `dplyr::arrange()` with the `.by_group=TRUE` parameter set produces SQL with an error. Here's how to reprex it: On the DB (Redshift in my case) set up a dummy table: ``` drop TABLE sandbox.testorder; CREATE TABLE sandbox.testorder ( grp varchar(255), n DOUBLE PRECISION ); INSERT INTO sandbox.testorder (grp , n) VALUES ('a',3.3); INSERT INTO sandbox.testorder (grp , n) VALUES ('a',1.1); INSERT INTO sandbox.testorder (grp , n) VALUES ('b',2.2); INSERT INTO sandbox.testorder (grp , n) VALUES ('b',4.4); ``` Then from `R` (presuming a connection to the DB called `con` and already loaded `dbplyr` ``` testorder <- tbl(con, "testorder") testorder %>% group_by( grp ) %>% arrange( n, .by_group=TRUE) -> out_test show_query(out_test) ``` which generates the following SQL: ``` SELECT * FROM "testorder" ORDER BY "n", TRUE ``` which fails if I try to `collect(out_test)` with the following error: ``` Error in postgresqlExecStatement(conn, statement, ...) : RS-DBI driver: (could not Retrieve the result : ERROR: non-integer constant in ORDER BY ``` the rub seems to be the `,TRUE` there at the end. If I remove it, I get the followable runnable SQL: ``` SELECT * FROM "testorder" ORDER BY "n" ``` My guess is that the routine that generates the SQL has a glitch. Looks like it's just passing `, TRUE` instead of adding in the `group by` variables.

edgararuiz · April 15, 2018, 3:46pm

Cool, that's good to know. I was wondering about how we could interpret .by_group for a possible translation. Id really be nice to find a reprex that could not be created w/o that argument. Thanks JD!