creation of new db variables from R

I have a large Postgres table: about 7 billion rows × 32 columns, or ~1.2 TB in total.

problem in words:
I need to split one of the column variables ("key") in the db table into, say, three new variables according to a function.

I have such a function in R, but I need to ALTER the database table - permanently - so I can index, search, and do other db stuff on the to-be-created variables.

problem as toy example:

toy_db_as_df <- data.frame(key = paste0(rownames(mtcars), "_", mtcars$hp),
                           cyl = mtcars$cyl)

# With dplyr and/or tidyr, I could simply mutate what I need, eg

toy_db_as_df  %>% tidyr::separate(key, into = c("car", "hp"), sep = "_")

In reality, my separation function is more complex, but even the toy is too complicated for {dbplyr} to translate into valid SQL.

Still, I'm wondering if there's some way to do this from R (maybe change the schema and then run some rolling UPDATE?), given that I already have working code for the task at hand.

I'm grateful in advance for any advice.

The lege artis way of doing this operation in a database environment is to create a new (initially empty) table and then run an INSERT ... SELECT from the original table into the new one. Then compare the results and, if everything checks out, drop the original table.
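For the toy example, the split can even be done entirely in SQL with Postgres's built-in `split_part()`, so no round trip through R is needed. A minimal sketch - the table names `toy_old` and `toy_new` are placeholders, and it assumes an underscore separator as in the toy data:

```sql
-- New table with the split-out columns (empty at first)
CREATE TABLE toy_new (
  car text,
  hp  text,
  cyl integer
);

-- Populate it from the original in one INSERT ... SELECT;
-- split_part(string, delimiter, n) returns the n-th field
INSERT INTO toy_new (car, hp, cyl)
SELECT split_part(key, '_', 1),
       split_part(key, '_', 2),
       cyl
FROM   toy_old;

-- Sanity check before dropping toy_old: row counts should match
SELECT count(*) FROM toy_old;
SELECT count(*) FROM toy_new;
```

Your real separation function may be too complex for this, but it is worth checking whether it can be expressed in SQL first - a single server-side INSERT ... SELECT will be far faster than shipping 7 billion rows through R.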

I would be cautious about running an update on table of this size.

For the insert you could, in theory, use R. You will be RAM constrained, and it is highly unlikely you could process the table in one go. But if you have a way to partition the table for reading into R in bite-sized chunks (some index, a valid-date field, or whatnot), you should be able to run the operation in a cycle: read n rows as a data frame, digest them in R, and insert them into the new table. Then rinse and repeat until the end is reached.
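The cycle above can be sketched with {DBI} and {RPostgres}. This is only an outline, assuming an integer primary key `id` on the source table for keyset-style chunking; the connection details and the table/column names (`src`, `dst`, `key`, `cyl`) are placeholders you would replace with your own:

```r
library(DBI)

# Placeholder connection - adjust dbname/host/credentials for your setup
con <- dbConnect(RPostgres::Postgres(), dbname = "mydb")

chunk_size <- 1e6
max_id <- dbGetQuery(con, "SELECT max(id) AS max_id FROM src")$max_id

for (lo in seq(0, max_id, by = chunk_size)) {
  # Read one bite-sized chunk, bounded by the id index
  chunk <- dbGetQuery(
    con,
    "SELECT id, key, cyl FROM src WHERE id > $1 AND id <= $2",
    params = list(lo, lo + chunk_size)
  )
  if (nrow(chunk) == 0) next

  # Digest in R - here the toy separation; swap in your real function
  digested <- tidyr::separate(chunk, key, into = c("car", "hp"), sep = "_")

  # Append the processed chunk to the new table
  dbWriteTable(con, "dst", digested, append = TRUE)
}

dbDisconnect(con)
```

Filtering on an indexed `id` range keeps each read cheap, unlike LIMIT/OFFSET pagination, which gets slower the deeper into the table you go.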

It will take a while and you will want to be cautious with checking the result.

