Error: `by` can't contain join column `shopper_id` which is missing from LHS - But it is!

dougfir · December 18, 2018, 1:57am

I have two tbls based on a odbc connection:

> glimpse(order_discrepancy)
Observations: ??
Variables: 6
$ cqfescioqy.shopper_id  <chr> "1234", "5678", "9101112", "12141516", "65758946", "8756745346", "76543456", "8765", "10536771", "98765", "264957...
$ cqfescioqy.tos_signed_date <date> NA, 2015-03-19, NA, 2016-01-21, NA, NA, 2015-10-07, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 2016-10-03, NA, NA, NA, NA, 2017-02-2...
$ cqfescioqy.order_date  <date> 2018-12-07, NA, 2018-12-07, NA, 2018-12-07, 2018-12-07, NA, 2018-09-17, 2018-12-09, 2018-12-09, 2018-12-08, 2018-12-10, 2018-12-10, 2...
$ cqfescioqy.date_diff_days  <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
$ cqfescioqy.inorder         <lgl> TRUE, NA, TRUE, NA, TRUE, TRUE, NA, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, NA, TRUE, TRUE, TRUE, TRUE, NA, ...
$ cqfescioqy.intos           <lgl> NA, TRUE, NA, TRUE, NA, NA, TRUE, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, TRUE, NA, NA, NA, NA, TRUE, NA, NA, NA, NA, NA, NA, TRUE...

and

> glimpse(merges)
Observations: ??
Variables: 13
$ okmalpjbbd.merge_progressid    <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35,...
$ okmalpjbbd.gainingid           <chr> "5843743", "5843743", "3779857", "3779857", "4899246", "4899246", "4377979", "4377979", "2972481", "2972481", "3720865", "3720865"...
$ okmalpjbbd.losingid            <chr> "5626534", "5626534", "5625603", "5625603", "5755277", "5755277", "1683489", "1683489", "2976089", "2976089", "5380632", "5380632"...

These are tbls pre collect(). I do not know why the variable names are prepended with cqfescioqy and okmalpjbbd. If I apply regular dplyr functions such as filter, I can just type the regular feature name and ignore these prepended field strings e.g. merges %>% filter(losingid = 5626534) # works

However, I'm trying to join the two tbls like so:

order_discrepancy <- order_discrepancy %>% left_join(merges, by = c("shopper_id" = "losingid"))

This gives an error:
"Error: by can't contain join column shopper_id which is missing from LHS. "

However, using glimpse() I can see that it does exist on the left hand side.

I tried adding explain() after my left join but received the same error.

I know that the behavior of dplyr will differ with TBLs based on a live connection to our database, however this looks like a fairly standard task so presumably I'm not doing it right. I cannot tell if the fact that glimpse shows the feature names prepended with these odd strings is relevant to this problem or not.

How can I left join these twi TBLs on shopper_id = losingid?

technocrat · December 18, 2018, 3:00am

A FAQ: What's a reproducible example (`reprex`) and how do I do one? would allow me to offer more than a guess, because I can't see all the variable names.

I don't like piping into joins for just this reason, it's not exactly clear whether the . from the pipe is the implicit first or second argument to inner_join

When I'm debugging I also don't like overwriting the source, at least until I'm sure it's working, so

test_me <- inner_join(order_discrepancy, merge, by = [something])

So, now we're certain that order_discrepancy is the LHE and merges is the RHE.

The next question is whether it's possible to fool by into using losingid as the key field by sending it an alias with the name shopper_id, and without the reprex it's too much effort to gin up a data set to try.

You can always

test_me_2 <- order_discrepancy %>% mutate(shopper_id = losingid)

and then

result <- inner_joint(test_me, test_me2, by = "shopper_id")

if it turns out that your alias construct doesn't work.

dougfir · December 18, 2018, 3:25am

Hi, thanks for the suggestions. After trying to first avoid a pipe and create a new data frame new <- left_join(t1, t2, by = c("shopper_id" = "losingid")) I received the same error.

Then I found something interesting. When I try to mutate a new name like you suggested:
merges <- tbl(con, sql("select * from somecompany.merges")) %>% mutate(shopper_id = losingid)

When I try to glimpse or call the above I get
"> glimpse(merges)
Observations: ??
Error in new_result(connection@ptr, statement) :
nanodbc/nanodbc.cpp:1344: 42000: [Hortonworks][Hardy] (80) Syntax or semantic analysis error thrown in server while executing query. Error message from server: Error while compiling statement: FAILED: SemanticException [Error 10004]: Line 1:7 Invalid table alias or column reference 'zzz11.tx_source_database': (possible column names are: tx_source_database, tx_source_table, tx_action, tx_write_time, tx_source_id, ".

Not sure what this is, or what zzz11 is but I wonder if this is the issue?

technocrat · December 18, 2018, 3:48am

First, :blush; silly me. I had my laminated cheat sheet next to me and your alias construct is completely canonical. Thanks, knowing that will make things interesting.

Error 10004 in SQL-land is a fail to connect. Along with the words query and table gives me the hunch that the messages are coming from the odbc.

Can you extract the two objects cleanly so that the persist even after the obdc connection is closed? The rerunning either version will be able to disprove my hunch. I'm assuming, of course, that you have the option of re-opening the obdc and writing back.

dougfir · December 18, 2018, 3:57am

Hi, I do not understand what your asking me to do, could you clarify?
When you say "Can you extract the two objects cleanly" do you mean to run collect() after calling them and pulling them into completely? Yes, I am able to do that and when I do I am able to left join with them.

It's only when they are in their odbc state, if that's the phrase, that I'm unable to make this minor transformation, e.g. mutate().

Though I can overcome my here and now issue by pulling the two tables completely into r, I was just lucky because most of our tables are very large and this won't be so easily done, so it would be good to know how to overcome this.

I'm not 100% following what you are asking, so I tried closing my session and restarting r. I then reconnected in the usual way and had the same problem. In case it's useful information, here is how I make my connections:

library(odbc)
library(tidyverse)
library(dbplyr)
library(lubridate)
library(zoo)

# hive connection
## initialize kerberos ticket in new terminal
## kinit myname@AB1.CORP.BLAH
con <- dbConnect(odbc::odbc(), "HiveProd")


# tables
order_discrepancy <- tbl(con, sql("select * 
                                  from myname.pro_programe_compare_uds_order
                                  where (inorder is null) or (intos is null)"))

and then

merges <- tbl(con, sql("select * from company_name.merge_progress_snap"))

Does this help?

technocrat · December 18, 2018, 5:07am

Yes, thanks. So, I didn't know about dbplyr ("A 'dplyr' back end for databases that allows you to work with remote database tables as if they are in-memory data frames.") At Hadley's rate of production, we may soon have a fork of R called T (for tidy).

So, I guess I was kinda right about it being on the server end, because dbplyr is lazy.

You correctly interpreted my experiment with collect(), so the problem is definitely one with the difference on how joins work in dplyr vs. dbplyr. Something that I don't understand (since this is the first time I discovered there was such a thing) is how you would pre-process one of the databases with an ALTER statement, say, to harmonize the key column names. Or, I guess, you could duplicate one of the two fields so that they would be in both tables. I'm going to flag this one for attention by someone who knows more than I about this strange new beast.

dougfir · December 18, 2018, 5:46am

Hi Technocrat. Alright, thanks for your help thus far.

martin.R · December 18, 2018, 10:05am

If you want to merge then you need to include the prepended dummy names, i.e.
order_discrepancy <- order_discrepancy %>% left_join(merges, by = c("cqfescioqy.shopper_id" = "okmalpjbbd.losingid"))

Why there are prepended dummy names is a separate question, which I think you previously posed, but have not received an answer for yet.

mishabalyasin · December 18, 2018, 12:22pm

You most certainly don't need to include those random names in since they will actually be different every time you run your code. dbplyr creates intermediate tables all the time and assigns them random names:

library(magrittr)
df <- dbplyr::memdb_frame(x = runif(100), y = runif(100), z = rep(c("a", "b"), times = 50))

df %>% dplyr::arrange(x) %>% dplyr::show_query()
#> <SQL>
#> SELECT *
#> FROM `qosqntiahl`
#> ORDER BY `x`

^{Created on 2018-12-18 by the reprex package (v0.2.1)}
This qosqntiahl is going to be different every time, so you can't rely on it.

But back to your question, @dougfir, when I try replicating your error I don't get anything suspicious, it works exactly as expected:

library(magrittr)
df <- dbplyr::memdb_frame(x = runif(100), y = runif(100), z = rep(c("a", "b"), times = 50))

df %>% dplyr::arrange(x) %>% dplyr::show_query()
#> <SQL>
#> SELECT *
#> FROM `ljybibfyqy`
#> ORDER BY `x`

df2 <- dbplyr::memdb_frame(x2 = runif(100), y2 = runif(100), z2 = rep(c("a", "b"), times = 50))
df %>%
  dplyr::left_join(df2, by = c("z" = "z2"))
#> # Source:   lazy query [?? x 5]
#> # Database: sqlite 3.22.0 [:memory:]
#>        x     y z         x2     y2
#>    <dbl> <dbl> <chr>  <dbl>  <dbl>
#>  1 0.119 0.636 a     0.0715 0.871 
#>  2 0.119 0.636 a     0.0801 0.580 
#>  3 0.119 0.636 a     0.0950 0.577 
#>  4 0.119 0.636 a     0.120  0.754 
#>  5 0.119 0.636 a     0.136  0.0504
#>  6 0.119 0.636 a     0.166  0.765 
#>  7 0.119 0.636 a     0.175  0.338 
#>  8 0.119 0.636 a     0.185  0.0368
#>  9 0.119 0.636 a     0.188  0.106 
#> 10 0.119 0.636 a     0.197  0.0802
#> # ... with more rows

^{Created on 2018-12-18 by the reprex package (v0.2.1)}

So, one more thing you can try is that you can issue dplyr::compute on each of the tables you are joining to see if that helps. So, it'll be something like this:

order_discrepancy <- order_discrepancy %>% dplyr::compute()

Same goes for the second table. You can then try joining them again to see if that helps.

Keep in mind that dplyr::compute will actually compute the table in DB, so it might take a while depending on your data.

tbradley · December 18, 2018, 1:51pm

Unless you specifically place a . as the second argument, %>% will always put the output of the left hand side as the first argument of the right hand side. The behavior does not change for the *_join functions.

dougfir · December 18, 2018, 7:22pm

Hi and thanks for the info. When I recreate you df I do not experience any issues so presumably this is to do with my odbc and/or use of dbplyr.

I tried calling compute like you suggested and received the following error message:

"Error: 'CREATE TEMPORARY TABLE lixhbfdwtr AS SELECT zzz8.pro_shopper_id, zzz8.tos_signed_date, zzz8.uds_order_date, zzz8.date_diff_days, zzz8.inorder, zzz8.intos
FROM (select *
from myname.pro_programe_compare_uds_order
where (inorder is null) or (intos is null)) rvinwyrzca'
nanodbc/nanodbc.cpp:1587: 42000: [Hortonworks][Hardy] (80) Syntax or semantic analysis error thrown in server while executing query. Error message from server: Error while compiling statement: FAILED: SemanticException 0:0 Error creating temporary folder on: hdfs://production/apps/hive/warehouse. Error encountered near token 'TOK_TMP_FILE' "

Does this tell us anything? I wonder if my odbc.ini file has a configuration issue.

Note I am able to pull the table in using collect().

system · January 8, 2019, 7:33pm

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.