Identify best practices for working with changing data

What will work depends on the skills of the analysts and obtaining the cooperation of gatekeepers like the IT department.

The original sin is if the csv files derive from spreadsheets, which are vulnerable to entry errors and untrapped logic. Using a relational data base management system for data entry provides the opportunity for validation to at least prevent errors like entering a literal string $1,023.10 in a numeric field. On the other hand, in addition to user training on data entry, hosting is needed that may not fit comfortably on existing servers and sooner or later a database administrator will be needed to keep the system running.

Coming back to Earth, the obvious candidate is a combination of git and scripting.

  • Does file system data show any change to the csv as to date and file size? If not, skip.
  • What are the differences? Display the diff to decide whether to run the whole suite of data wrangling or if the differences are few and trivial enough to run by hand (a correction in a handful of data entry points, for example).
  • Version control the Rds for recovery, assuming there is someway to detect inadvertent corruption.

The most difficult, because it requires the least mechanical work but the most skull sweat is to break up the target Rds object into pieces to isolate stable and volatile parts of the source data. This would be a candidate for persistent storage in an RDMS that would not be as big a deal to implement as the more fulsome data entry requirement. Drawing the same big object as the current Rds object is attractive but it is likely that the use to be made of the big object may require that everything be included but not necessarily all at once. That comes back to the design of the data store and arranging sub-objects to things that change often and those that don't and those that are usually handled together and those that are usually handled in isolation. The principle would be to avoid unnecessary work disagregating a big object while at the same time to ease work assembling the desired output by creating standardized parts.

1 Like