Bigrquery - large datasets

JacobB · November 8, 2017, 6:18pm

Hi, we're in the process of switching our data warehouse to Bigquery and overall it's going great and I'm enjoying using the bigrquery package.

The only issue I have is that if I write a query that has a large result set (i.e 40m records) it takes about an hour to download the data which is dramatically slower than if I save the output as a csv in google cloud storage and curl the result manually.

I've tried with and without writing to an intermediate table and also setting a much larger page size but nothing seems to make a difference. Would be really interested to know if anybody else has had experience of this or has worked out a way round it.

Thanks
Jacob

iain · November 8, 2017, 8:28pm

Interesting, I don't have a solution but I found something similar when importing data. It was much faster to upload multiple json files to google cloud storage and then import them into a table.

JacobB · November 8, 2017, 9:03pm

Interesting - which function were you using to write? Think there are a couple of options but dbWritetable is very fast - 10m records in about 2mins during my test!

Hoyt · November 8, 2017, 9:45pm

How many columns are in that 40m set?

iain · November 8, 2017, 9:49pm

Can't quite remember, but currently I am using bigrquery to create the api call to load in the json files (https://github.com/iainmwallace/DataDepository/blob/master/R/loadJsonFiles2BQ.r).

Now that I think about it, the advantage for me of loading in json files was that the automatic schema detection worked better than loading in a csv file but I could have been doing something weird.

edgararuiz · November 8, 2017, 9:56pm

Hi, I'm just curious about the reason for downloading that volume of data? Is it modeling, visualization?

JacobB · November 8, 2017, 10:21pm

It's just 4 columns, total data size is about 1GB.

JacobB · November 8, 2017, 10:24pm

@edgar - this particular exercise is about visualisation, it's basically the forward looking availability by day for a massive amount of products.

This is probably the largest dataset I deal with in R on a regular basis but 10m ish isn't unusual when analysing transactional data or site traffic data.

Hoyt · November 8, 2017, 10:31pm

And are you just using query_exec( ) from the bigrquery package? Are you processing this in RStudio on your local computer or on RServer?

edgararuiz · November 8, 2017, 10:36pm

Cool, maybe this article can help with your needs: http://db.rstudio.com/best-practices/visualization/. I'm also working on a package that will help you push some plot types calculations to the DB and collect the results (like the histogram's bins) for visualization, it is very WIP but maybe it can further assist: https://github.com/edgararuiz/dbplot

JacobB · November 9, 2017, 8:23am

@Hoyt - yes I'm using query_exec() and it's on Rstudio server on a remote machine.

@edgararuiz - thanks that's interesting stuff. In this case I'm actually combining the data with that from another source then doing a little bit of modelling before writing a table back to the data warehouse - the actual visualisation happens in tableau (unfortunately).

MarkeD · November 11, 2017, 5:09pm

Perhaps it may help but I have a big query package that extracts the data to GCS, for most cases bigrquery is by far superior but for edge cases like this it may help. http://code.markedmondson.me/bigQueryR/query.html It does as you specify with curl but via R, so not much improvement I guess.

JacobB · November 13, 2017, 1:43pm

thanks @MarkeD - looks great, going to give it a go!

Jess · March 28, 2018, 12:33pm

Hi JacobB,
I encountered exactly the same problem as you described it. Retrieving data in query_exec() takes about 2h for 1.2 GB. Have you found why this is the case and how to speed up things?

Thanks for your comments.

JacobB · March 28, 2018, 9:42pm

Hi Jess, it's basically a 'feature' of BQ - changing the page size to 50k or 100k helps but the only way to really get good speed is to export the table that your query creates to Google Cloud Storage and then download the CSV(s) into R data frames.

I responded on this issue how I automated this process https://github.com/r-dbi/bigrquery/issues/199