Context: Raw data (numerous csv files) exist on a disk. Each week, one or more of those raw data are updated. Each month the analyst reads the entire set of data (numerous csv files) in to the R environment, then does the following:
makes the data tidy
performs a join; the numerous csv are now a single object in the environment
transforms the single object with additional dummy variables
produces a single validated RDS file of the single object and saves to disk
The validated RDS (saved to disk) is then used for further analysis and reporting.
Question: My concern is that the analyst spends too much time reading data that has not changed. This is redundant because most of the data has not changed. And if a mistake is made during the read-tidy-transform process, the analyst risks overwriting the validated RDS with erroneous data. Given the above context, what potential problems may arise from doing the above process? How might you suggest a different approach for the analyst?
What will work depends on the skills of the analysts and obtaining the cooperation of gatekeepers like the IT department.
The original sin is if the csv files derive from spreadsheets, which are vulnerable to entry errors and untrapped logic. Using a relational data base management system for data entry provides the opportunity for validation to at least prevent errors like entering a literal string $1,023.10 in a numeric field. On the other hand, in addition to user training on data entry, hosting is needed that may not fit comfortably on existing servers and sooner or later a database administrator will be needed to keep the system running.
Coming back to Earth, the obvious candidate is a combination of git and scripting.
Does file system data show any change to the csv as to date and file size? If not, skip.
What are the differences? Display the diff to decide whether to run the whole suite of data wrangling or if the differences are few and trivial enough to run by hand (a correction in a handful of data entry points, for example).
Version control the Rds for recovery, assuming there is someway to detect inadvertent corruption.
The most difficult, because it requires the least mechanical work but the most skull sweat is to break up the target Rds object into pieces to isolate stable and volatile parts of the source data. This would be a candidate for persistent storage in an RDMS that would not be as big a deal to implement as the more fulsome data entry requirement. Drawing the same big object as the current Rds object is attractive but it is likely that the use to be made of the big object may require that everything be included but not necessarily all at once. That comes back to the design of the data store and arranging sub-objects to things that change often and those that don't and those that are usually handled together and those that are usually handled in isolation. The principle would be to avoid unnecessary work disagregating a big object while at the same time to ease work assembling the desired output by creating standardized parts.