Efficient equivalent to str_replace_all with a named vector for duckdb?

Hi all! I'm new to working with databases and I'm trying to figure out how to efficiently replace partial strings with duckdb. When working with normal data tables, I usually use a named vector with str_replace_all(), like this:

input <- c("1", "2", "3")
output <- c("one", "two", "three")
names(output) <- input

df |>
  mutate(col_to_change = str_replace_all(col_to_change, output)

When I try this with a duckdb table, I get an error that "argument 'replacement' is missing, with no default", so it's not picking up the names that are supposed to be used to replace the pattern.

What would be the most efficient way to do something like when working with duckdb/databases/SQL? Some of the inputs contain regex and I'm usually matching only to partial strings, so I haven't been able to do a simple join. I'm trying to find an efficient method since I'm working with large datasets. Also if there's any good resources out there for stuff like this, please let me know!

1 Like

Here is a an approach using a custom R function to write out a duckDB compatible chained regexp_replace functions

library(duckdb)
library(tidyverse)
library(dbplyr)
library(glue)

con <- dbConnect(duckdb())

data <- data.frame(a = c("a 1 1","b 2","c 3c 3"))
path <- tempfile(fileext = ".csv")

write.csv(data, path, row.names = FALSE)

duckdb_read_csv(con, "data", path, colClasses = "character")

duckdbframe <- tbl(con, "data")

input <- c("1", "2", "3")
output <- c("one", "two", "three")

 
# what do we start with ? 
select(duckdbframe,everything())
# # Source:   table<"data"> [3 x 1]
# # Database: DuckDB v1.1.0 [NirGraham@Windows 10 x64:R 4.4.0/:memory:]
# a     
# <chr> 
# 1 a 1 1 
# 2 b 2   
# 3 c 3c 3

  
# Function to generate the SQL query
generate_sql_query <- function(table,column, patterns, replacements) {

  replace_statements <- map2(patterns, replacements, ~ glue("regexp_replace ( '{.x}', '{.y}', 'g')"))
  # print(replace_statements)
  replace_chain <- paste0(replace_statements, collapse = ".")
  # print(replace_chain)
  # Generate the final SQL query
  sql_query <- glue("
    UPDATE {table} 
    SET {column} = {column}.{replace_chain}
  ")
  print("Your SQL is ")
  print(sql_query)
  return(sql_query)
}
# generate_sql_query("data","a",input,output)
# run the thing
dbExecute(con,generate_sql_query("data","a",input,output))
  
# the result 
select(duckdbframe,everything())
# # Source:   table<"data"> [3 x 1]
# # Database: DuckDB v1.1.0 [NirGraham@Windows 10 x64:R 4.4.0/:memory:]
# a             
# <chr>         
# 1 a one one     
# 2 b two         
# 3 c threec three
2 Likes

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.