TensorFlow Data Validation (TFDV) is an amazing data validation library recently released by Google. It has two main strengths:
- It's built on Apache Beam, meaning that statistical computations can easily be run in distributed mode in the cloud, or on a single machine with a large number of cores. My machine has 24 CPU cores, and if I manage to use the CUDA cores (which I don't think is possible, but I have to try harder!) I could scale up to 81,920 cores (!). Or I can run the job on GCP and have as many cores as I want.
- It's very robust and automatable: if you have multiple files (I had about a dozen GB of data in files of ~100 MB each), you can create a schema from one file and quickly compare all the other files against that schema, to find similarities and differences without having to look at more than one file. It's pretty amazing!
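
To make the schema workflow in the second point concrete, here is a minimal plain-Python sketch of the idea. It is not TFDV's API: the function names `infer_type`, `infer_schema`, and `find_anomalies` are hypothetical, it only looks at column names and a crude INT/FLOAT/STRING type, and it uses nothing but the standard library. In TFDV itself, as far as I understand, the corresponding steps are `tfdv.generate_statistics_from_csv`, `tfdv.infer_schema`, and `tfdv.validate_statistics`.

```python
import csv
import io

# Hypothetical helpers for illustration only; this is NOT the TFDV API.
TYPE_RANK = {"INT": 0, "FLOAT": 1, "STRING": 2}


def infer_type(value):
    """Classify a raw CSV cell as INT, FLOAT, or STRING."""
    for caster, name in ((int, "INT"), (float, "FLOAT")):
        try:
            caster(value)
            return name
        except (ValueError, TypeError):
            pass
    return "STRING"


def infer_schema(csv_file):
    """Build {column: type} from a file; the widest type seen in a column wins."""
    schema = {}
    for row in csv.DictReader(csv_file):
        for column, value in row.items():
            seen = infer_type(value)
            if column not in schema or TYPE_RANK[seen] > TYPE_RANK[schema[column]]:
                schema[column] = seen
    return schema


def find_anomalies(csv_file, schema):
    """List the differences between a file and a reference schema."""
    other = infer_schema(csv_file)
    anomalies = []
    for column, expected in schema.items():
        if column not in other:
            anomalies.append(f"missing column: {column}")
        elif other[column] != expected:
            anomalies.append(f"{column}: expected {expected}, found {other[column]}")
    for column in other:
        if column not in schema:
            anomalies.append(f"unexpected column: {column}")
    return anomalies


# Infer the schema once from a reference file, then check another file against it.
reference = io.StringIO("id,price,city\n1,9.5,Rome\n2,12,Oslo\n")
schema = infer_schema(reference)

other_file = io.StringIO("id,price\nA7,3.0\n")
for anomaly in find_anomalies(other_file, schema):
    print(anomaly)
```

The point of the design is that the schema is computed once and every other file is reduced to a short list of anomalies, so only the reference file ever needs to be inspected by hand. A real validator like TFDV checks much more (value ranges, missing-value rates, categorical domains), and this sketch assumes every file has at least one data row.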
Since we have native access to TensorFlow through the `tensorflow` package, do we also have access to TensorFlow Data Validation? I guess we would have access through `reticulate` anyway, but I was looking for a way to run it natively in R, rather than running Python code from R. I can spin up machines with any version of any language I need, so if I need Python code, it makes more sense for me to use the Python interpreter directly. However, if I could use TFDV natively...