I wish to read from MySQL database chunkwise, create TWO summary variables and write chunkwise to another table. Here is how I think we may do this seperately as follows(I have not tried the code):
library(RMySQL)
library(chunked)
library(dplyr)
connection <- dbConnect(dbDriver("MySQL"),host = "myhost",
db="mydb",user="myuser",password="mypass")
# Computing the number of members in a household.
tbl(connection,"table1") %>%
read_chunkwise() %>%
select(hid,year) %>%
group_by(hid,year) %>%
summarise(Total_members= n()) %>%
write_chunkwise("table2")
# Extracting the age of the head of household, I know that for
# each hid, year there will be only one member with
# relation_with_hoh == "hoh"
tbl(connection,"table1") %>%
read_chunkwise() %>%
select(hid,year,relation_with_hoh,age) %>%
filter(relation_with_hoh=="hoh") %>%
mutate(hoh_age = age) %>%
select(hid,year,hoh_age) %>%
write_chunkwise("table2")
I don't have experience using 'chunked', though initial reading makes it seem straightforward, unfortunately I can't reconcile what you are asking to be able to do with my knowledge of databases in general and what they do or don't allow.
It seems your ultimate goal is to simultaneously extend the same table (table2) with differently structured data. I would assume this is not possible whether or not chunking was a desired feature. Databases generally establish a structure for insertion that then must be respected in subsequent insertions. Unless perhaps you are allowing for 'dummy columns' that some insertions will pad with missing values, and other insertions wont ?
In fact, there are cautions given on group_by / summarise that you don't seem to have taken on board in the way you constructed your example :
summarize and group_by are implemented but generate a warning: they operate on each chunk and not on the whole data set. However this makes is more easy to process a large file, by repeatedly aggregating the resulting data.
tmp <- tempfile()
write.csv(iris, tmp, row.names=FALSE, quote=FALSE)
iris_cw <- read_chunkwise(tmp, chunk_size = 30) # read in chunks of 30 rows for this example
iris_cw %>%
group_by(Species) %>% # group in each chunk
summarise( m = mean(Sepal.Width) # and summarize in each chunk
, w = n()
) %>%
as.data.frame %>% # since each Species has 50 records, results will be in multiple chunks
group_by(Species) %>% # group the results from the chunk
summarise(m = weighted.mean(m, w)) # and summarize it again
i.e. note the doubled use of group_by/.summarise with the second variation reformulated to use a weighted mean
yes, but you are drawing the wrong conclusion. I'm saying that even sticking with separate queries for both, you will need to rewrite your first query to compute a correct result, by adding a second level of grouping and summarising.
I will borrow your original idea of a dummy column. That is the answer I think.
tbl(connection,"table1") %>%
read_chunkwise() %>%
select(hid,year) %>%
group_by(hid,year) %>%
summarise(Total_members= n()) %>%
# A column with a constant value, this could be buggy for the moment
mutate(hoh_age = age[which(relation_with_hoh=="hoh")]) %>%
collect() %>%
group_by(hid,year) %>%
summarise(Total_members = sum(Total_members), hoh_age = mean(hoh_age) %>%
write_chunkwise("table2")
# Here is the correct way
tbl(connection,"table1") %>%
read_chunkwise() %>%
select(hid,year) %>%
group_by(hid,year) %>%
summarise(Total_members= n(),hoh_age = ifelse(length(which(relation_with_hoh=="hoh"))==1,age[which(relation_with_hoh=="hoh")],NA)) %>%
collect() %>%
group_by(hid,year) %>%
# hoh_age will be different to NA exactly once in a group.
# A check to ensure that EXACTLY one row in each group of hid,year has relation_with_hoh="hoh"
summarise(Total_members = sum(Total_members), hoh_age = mean(hoh_age,na.rm=TRUE),check = length( is.na(hoh_age)==1)) %>%
write_chunkwise("table2")
# The naive way which was confusing me
tbl(connection,"table1") %>%
read_chunkwise() %>%
select(hid,year) %>%
group_by(hid,year) %>%
summarise(Total_members= n()) %>%
# here is the point of confusion, the filter in the next line will
# not work for the entire group of hid and year
# since we have already done a summarise.
filter(relation_with_hoh =="hoh") %>%
mutate(hoh_age = age) %>%
collect() %>%
group_by(hid,year) %>%
summarise(Total_members = sum(Total_members), hoh_age = mean(hoh_age,na.rm=TRUE),check = length( is.na(hoh_age)==1)) %>%
write_chunkwise("table2")