I have a number of business processes where I regularly cache calculations. Right now I typically run my process daily, and because I don't cache that much data I just save a complete, daily-tagged snapshot of all the results. Now I have a situation where the data changes very slowly and there are quite a few rows of data. Since it moves so slowly, this feels like a slowly changing dimension (SCD) in an ETL process. When I have dealt with these in the past, someone on the database team typically handled it with logic built into the database. Are there any R packages that would help with extracting and snapshotting slowly changing data?
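For context, my current approach is essentially the sketch below; `compute_results()` and the cache directory are stand-ins for my real process:

```r
# Sketch of my current approach: recompute everything daily and save a
# complete snapshot tagged with the date. compute_results() and the
# "cache" directory are stand-ins for the real process.
results <- compute_results()
saveRDS(results, file.path("cache", sprintf("results_%s.rds", Sys.Date())))
```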
I took a peek at the CRAN Task View on Databases and didn't see anything applicable. Python has a few things like pygrametl, but it seems really complicated and possibly overkill.
I don't know of any R-based solutions, but there's a lot to be said for db-side transaction logging whenever you run a calculation based on a data snapshot and save it. See https://mariadb.com/kb/en/library/mariadb-audit-plugin-log-settings/ as one example. In my idealized metaverse, everything an ETL wrote to a db would also include, as a blob if necessary, the code that made extract_{t_i} become load_{t_i}.
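If db-side audit logging isn't an option, you can approximate the "store the code with the load" idea from R itself. A minimal sketch, assuming DBI/RMariaDB; the `etl_audit` table and the script path are hypothetical:

```r
library(DBI)

# Record the ETL script's source next to each load, so that what turned
# extract_{t_i} into load_{t_i} is recoverable later. The etl_audit table
# and the script path are hypothetical.
con <- dbConnect(RMariaDB::MariaDB(), dbname = "warehouse")
dbExecute(con, "
  INSERT INTO etl_audit (loaded_at, script_source)
  VALUES (now(), ?)
", params = list(paste(readLines("etl/load_dimensions.R"), collapse = "\n")))
dbDisconnect(con)
```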
I doubt that a purely R-based solution is feasible.
IMHO the cleanest solution to the SCD2 problem is an UPSERT (INSERT ... ON CONFLICT DO UPDATE), which is a part of SQL that has so far resisted standardization. Any solution is bound to be very platform-specific.
It is best left to the backend and, by extension, to the database people. See the sketch below for what the platform-specific part looks like when driven from R.
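For concreteness, a minimal sketch of issuing an UPSERT from R via DBI, assuming PostgreSQL/RPostgres; the `dim_customer` table and its columns are made up, and MySQL/MariaDB would need `INSERT ... ON DUPLICATE KEY UPDATE` instead:

```r
library(DBI)

con <- dbConnect(RPostgres::Postgres(), dbname = "warehouse")

# PostgreSQL-specific upsert: insert the row, or update the stored
# attribute when the key already exists and the value actually changed.
dbExecute(con, "
  INSERT INTO dim_customer (customer_id, tier, valid_from)
  VALUES ($1, $2, now())
  ON CONFLICT (customer_id)
  DO UPDATE SET tier = EXCLUDED.tier, valid_from = now()
  WHERE dim_customer.tier IS DISTINCT FROM EXCLUDED.tier
", params = list(42L, "gold"))

dbDisconnect(con)
```

Note that a true SCD2 load would keep history, closing out the old row (setting its valid_to) and inserting a new version rather than overwriting in place, but the conflict-handling mechanics are the same platform-specific story.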