If an engineer wants to test a table or dataset in dev and then run the exact same test on a table or dataset with the same name in prod, how should they keep track of the dataset's name in Great Expectations so it's clear which dataset is being referred to?
Another variation of this question would be for file systems. On file systems, sometimes data engineers use a pattern for versioning data as follows:
data/
  v1/my_dataset.csv
  v2/my_dataset.csv
  v3/my_dataset.csv
If an engineer wants to run the same expectations on multiple versions of the same dataset at the same time, how should they keep track of the name of the dataset in Great Expectations?
@nok - I posted this question to address one of your issues. Hopefully we'll get some back and forth here from others, but I wanted to prime the pump. I think you're going to want to use environment variables for the dev/prod case and something similar for the version case. You should be able to set variables for the different prefixes in your Python code and use them to set the dataset name in batch kwargs.
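A minimal sketch of that idea, assuming a hypothetical `DATA_ENV` environment variable and illustrative datasource/path names (these are not from the thread, just an example of keeping the logical dataset name stable while the prefix changes):

```python
import os

# Hypothetical env var selecting the target environment.
# In dev you might export DATA_ENV=dev; in prod, DATA_ENV=prod.
env = os.environ.get("DATA_ENV", "dev")

# Batch kwargs whose logical data_asset_name stays the same across
# environments, while the physical location varies with the prefix.
batch_kwargs = {
    "datasource": f"my_{env}_datasource",  # illustrative datasource name
    "path": f"data/{env}/my_dataset.csv",  # environment-specific path
    "data_asset_name": "my_dataset",       # stable logical name
}
```

The same expectation suite can then be run against whichever batch these kwargs resolve to, without renaming anything between environments.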
In the notebook I only see examples using either a pre-loaded pandas DataFrame or a path key; could you highlight which block you are referring to?
I ended up trying to construct the absolute paths with Python code for these:
  v1/2020-xx-xx/my_dataset.csv
  v2/2020-xx-xx/my_dataset.csv
  v3/2020-xx-xx/my_dataset.csv
Unfortunately, it won’t work… A datasource is still needed, but it only handles a single level of subdirectories, not a subdirectory of a subdirectory. So in the end I had to create a new datasource for every date “2020-xx-xx”.