Testing datasets with the same name in dev and in prod or in different version folders

bhcastleton · September 16, 2020, 5:05am

If an engineer wants to test a table or dataset in dev and then run the exact same test on a table or dataset with the same name in prod, how should they keep track of the dataset name in Great Expectations so it’s clear which dataset is being referred to?

Another variation of this question would be for file systems. On file systems, sometimes data engineers use a pattern for versioning data as follows:
data/
    v1/my_dataset.csv
    v2/my_dataset.csv
    v3/my_dataset.csv
If an engineer wants to run the same expectations on multiple versions of the same dataset at the same time, how should they keep track of the name of the dataset in Great Expectations?

bhcastleton · September 16, 2020, 5:28am

@nok - I posted this question to address one of your issues. Hopefully we’ll get some back and forth here from others, but I wanted to prime the pump. I think you’ll want to use environment variables for the dev and prod case and something similar for the version case. You should be able to set variables for the different prefixes in your Python code and use them to set the dataset name in batch kwargs.
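A minimal sketch of the environment-variable approach suggested above. The variable name `GE_ENV`, the `data/<env>/` layout, the datasource name `my_datasource`, and the `<env>__<dataset>` naming scheme are all assumptions made for illustration; `data_asset_name` is the batch kwargs key mentioned later in this thread.

```python
import os


def build_batch_kwargs(dataset, datasource="my_datasource", base_dir="data"):
    """Build batch kwargs whose asset name records the environment.

    GE_ENV is a hypothetical variable name; it falls back to "dev" if unset.
    """
    env = os.environ.get("GE_ENV", "dev")
    return {
        # Assumed layout: data/dev/my_dataset.csv, data/prod/my_dataset.csv
        "path": f"{base_dir}/{env}/{dataset}.csv",
        "datasource": datasource,
        # Prefix the name so dev and prod results can be told apart.
        "data_asset_name": f"{env}__{dataset}",
    }


# Possible use with the legacy batch kwargs API (not executed here):
# batch = context.get_batch(build_batch_kwargs("my_dataset"), expectation_suite_name)
```

Running the same script with `GE_ENV=prod` would then validate the prod copy under a distinct asset name, without changing the expectation suite.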

nok · September 16, 2020, 6:06am

When you say dataset name, do you mean the data asset name? It defaults to the filename. Is that configurable?

nok · September 18, 2020, 3:11pm

I actually have something like this, and the subdir reader doesn’t seem to handle it. I don’t want to create a new datasource for every timestamp.

v1/2020-xx-xx/my_dataset.csv
v2/2020-xx-xx/my_dataset.csv
v3/2020-xx-xx/my_dataset.csv

bhcastleton · September 21, 2020, 1:45pm

@eugene.mandel - do you have any examples we can share of how to overwrite the default to filename for asset name?

eugene.mandel · September 21, 2020, 11:07pm

The key data_asset_name can be added to batch_kwargs when specifying the batch to be validated, as shown in this notebook: https://github.com/great-expectations/great_expectations/blob/develop/great_expectations/init_notebooks/pandas/validation_playground.ipynb
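As a rough sketch of that suggestion applied to the versioned layouts in this thread: the helper below walks a root folder and builds one batch kwargs dict per file, deriving `data_asset_name` from the relative path. The helper name, the glob patterns, the datasource name `my_datasource`, and the `__` separator are assumptions for illustration, and whether a given datasource accepts these paths depends on how it is configured.

```python
from pathlib import Path


def versioned_batch_kwargs(root, pattern="v*/my_dataset.csv",
                           datasource="my_datasource"):
    """Return one batch kwargs dict per file matching `pattern` under `root`.

    The asset name is the relative path without the extension, with folder
    levels joined by "__" (e.g. "v1__my_dataset").
    """
    root = Path(root)
    all_kwargs = []
    for path in sorted(root.glob(pattern)):
        relative = path.relative_to(root).with_suffix("")
        all_kwargs.append({
            "path": str(path),
            "datasource": datasource,
            "data_asset_name": "__".join(relative.parts),
        })
    return all_kwargs


# Possible use with the legacy batch kwargs API (not executed here):
# for batch_kwargs in versioned_batch_kwargs("data"):
#     batch = context.get_batch(batch_kwargs, expectation_suite_name)
```

For the nested date layout, a pattern such as `"v*/*/my_dataset.csv"` would produce names with three parts: version, date folder, and file name.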

nok · September 22, 2020, 1:17am

I can only find the notebook using either a pre-loaded pandas dataframe or a path key. Could you point out which block you are referring to?

I ended up trying to construct the absolute paths in Python code for these:
v1/2020-xx-xx/my_dataset.csv
v2/2020-xx-xx/my_dataset.csv
v3/2020-xx-xx/my_dataset.csv

Unfortunately, it doesn’t work… A datasource is still needed, and it only works for a subdirectory, not a subdirectory of a subdirectory. So in the end I had to create a new datasource for every date “2020-xx-xx”.
