Testing datasets with the same name in dev and in prod or in different version folders

If an engineer wants to test a table or dataset in dev and then run the exact same test on a table or dataset with the same name in prod, how should they keep track of the name of the dataset in Great Expectations so it’s clear which dataset is being referred to?

Another variation of this question would be for file systems. On file systems, sometimes data engineers use a pattern for versioning data as follows:
data/
    v1/my_dataset.csv
    v2/my_dataset.csv
    v3/my_dataset.csv
If an engineer wants to run the same expectations on multiple versions of the same dataset at the same time, how should they keep track of the name of the dataset in Great Expectations?

@nok - I posted this question to address one of your issues. Hopefully we’ll get some back and forth here from others, but I wanted to prime the pump. I think you’re going to want to use environment variables for the dev and prod case and something similar for the version case. You should be able to set variables for the different prefixes in your Python code and use them to set the dataset name in batch kwargs.
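A minimal sketch of the environment-variable idea, in plain Python. The variable name GE_ENV, the ROOTS mapping, the datasource name, and the helper function are all made up for illustration; only the batch_kwargs keys come from this thread. It builds the dictionary and nothing more, so the environment ends up in both the path and the asset name.

```python
import os

# Hypothetical mapping from environment name to the directory holding its data.
ROOTS = {"dev": "/data/dev", "prod": "/data/prod"}


def batch_kwargs_for(dataset, datasource="my_datasource", env=None):
    """Build a batch_kwargs dict whose asset name records the environment."""
    # GE_ENV is an assumed variable name, not something Great Expectations defines.
    env = env or os.environ.get("GE_ENV", "dev")
    return {
        "datasource": datasource,                  # assumed datasource name
        "path": f"{ROOTS[env]}/{dataset}.csv",     # prefix chosen by environment
        "data_asset_name": f"{env}.{dataset}",     # e.g. "prod.my_dataset"
    }
```

With GE_ENV=prod exported in the shell, batch_kwargs_for("my_dataset") points at the prod copy and labels it as such, while the expectation suite itself stays the same.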

When you say dataset name, do you mean the data asset name? It defaults to the filename; is that configurable?

I actually have a layout like this, and the subdir reader doesn’t seem to handle it. I don’t want to create a new datasource for every timestamp.

v1/2020-xx-xx/my_dataset.csv
v2/2020-xx-xx/my_dataset.csv
v3/2020-xx-xx/my_dataset.csv

@eugene.mandel - do you have any examples we can share of how to override the default of using the filename as the asset name?

The key data_asset_name can be added to batch_kwargs when specifying the batch to be validated, as shown in this notebook: https://github.com/great-expectations/great_expectations/blob/develop/great_expectations/init_notebooks/pandas/validation_playground.ipynb
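For the versioned-folder case from the original question, the same key might be used along these lines. This is only a sketch: the helper name, the datasource name, and the "my_dataset_v1" naming scheme are assumptions, and the code only builds the dictionaries rather than calling Great Expectations.

```python
def versioned_batch_kwargs(versions, filename="my_dataset.csv",
                           root="data", datasource="my_datasource"):
    """Return one batch_kwargs dict per version folder, each with a distinct asset name."""
    stem = filename.rsplit(".", 1)[0]  # "my_dataset.csv" -> "my_dataset"
    return [
        {
            "datasource": datasource,               # assumed datasource name
            "path": f"{root}/{v}/{filename}",       # e.g. data/v1/my_dataset.csv
            "data_asset_name": f"{stem}_{v}",       # overrides the filename default
        }
        for v in versions
    ]


batches = versioned_batch_kwargs(["v1", "v2", "v3"])
```

Each dictionary in batches could then be validated against the same expectation suite, with the version visible in the asset name.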

I can only find the notebook using either a pre-loaded pandas DataFrame or a path key. Could you point out which block you are referring to?

I ended up trying to construct the absolute paths with Python code for these:
v1/2020-xx-xx/my_dataset.csv
v2/2020-xx-xx/my_dataset.csv
v3/2020-xx-xx/my_dataset.csv

Unfortunately, it doesn’t work… A datasource is still needed, and it only handles a subdirectory, not a subdirectory of a subdirectory. So in the end I had to create a new datasource for every date “2020-xx-xx”.
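For anyone following along, the path construction described above could look roughly like the sketch below. It walks a version/date/file layout with pathlib and derives an asset name from the two folder levels. The function name, the datasource name, and the naming scheme are assumptions, and it does not settle whether a single datasource will accept paths two levels deep, which is the problem reported here.

```python
from pathlib import Path


def discover_batches(root, filename="my_dataset.csv", datasource="my_datasource"):
    """Find <root>/<version>/<date>/<filename> files and build batch_kwargs for each."""
    root = Path(root)
    batches = []
    # Exactly two directory levels (version, then date) above the file.
    for path in sorted(root.glob(f"*/*/{filename}")):
        version, date = path.relative_to(root).parts[:2]
        batches.append({
            "datasource": datasource,                             # assumed name
            "path": str(path.resolve()),                          # absolute path
            "data_asset_name": f"{path.stem}_{version}_{date}",   # assumed scheme
        })
    return batches
```

Because the version and date are folded into data_asset_name, files that share the name my_dataset.csv stay distinguishable in validation results.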