How to organize validation results coming from multiple pipelines?

I just want to ask about a known pattern for storing validations/data-docs generated by multiple pipelines.
My first guess is to use different prefixes in the same bucket, something like this:

pipeline/simple-pipeline/validations
pipeline/simple-pipeline/data-docs/site
pipeline/other-pipeline/validations
pipeline/other-pipeline/data-docs/site

The only trade-off is that there is no HTML file indexing the docs for all pipelines.

I see at least two approaches:

If each pipeline has its own Great Expectations Data Context (with its own great_expectations.yml config file), then the approach described in the question is the way to go: configure each Data Docs site to write to a different prefix in the same S3 bucket.
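
For reference, here is a minimal sketch of what that might look like in each pipeline's great_expectations.yml, using the prefixes from the question (the bucket name my-ge-artifacts is a placeholder):

    stores:
      validations_store:
        class_name: ValidationsStore
        store_backend:
          class_name: TupleS3StoreBackend
          bucket: my-ge-artifacts                       # placeholder bucket name
          prefix: pipeline/simple-pipeline/validations  # per-pipeline prefix

    data_docs_sites:
      s3_site:
        class_name: SiteBuilder
        store_backend:
          class_name: TupleS3StoreBackend
          bucket: my-ge-artifacts
          prefix: pipeline/simple-pipeline/data-docs/site  # per-pipeline prefix

The other pipeline would use the same bucket with the pipeline/other-pipeline/validations and pipeline/other-pipeline/data-docs/site prefixes.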

The second approach is for the two pipelines to share a Data Context (one great_expectations.yml config file). When you validate data by running a Validation Operator (or a Checkpoint), you supply a data_asset_name for each batch. As long as these names do not collide across the two pipelines, the validation results from both can coexist in the same Data Docs site.

Interesting… I like the second approach; maybe I can use the pipeline name as a prefix to avoid collisions.

I’m getting TypeError: run() got an unexpected keyword argument 'data_asset_name'
when trying to run the following:

results = context.run_validation_operator(
    assets_to_validate=[batch],
    run_id=airflow_run_id,
    validation_operator_name="run_warning_and_failure_expectation_suites",
    data_asset_name=f'{ras_code}_{model}'
)

Where should I pass the data_asset_name value?

run_validation_operator does not accept the data_asset_name argument.

I assume that you obtained the batch that you are passing to the Validation Operator by calling

batch = context.get_batch(batch_kwargs, expectation_suite_name)

Then you should set “data_asset_name” in your batch_kwargs to the string you want Data Docs to use as the data asset name.
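
Putting that together, a minimal sketch of the corrected code (reusing the names from your snippet; the path and datasource values below are placeholders):

    batch_kwargs = {
        "path": "/data/input.csv",      # placeholder; keep whatever batch_kwargs you already use
        "datasource": "my_datasource",  # placeholder datasource name
        "data_asset_name": f"{ras_code}_{model}",  # Data Docs will display this name
    }
    batch = context.get_batch(batch_kwargs, expectation_suite_name)

    results = context.run_validation_operator(
        assets_to_validate=[batch],
        run_id=airflow_run_id,
        validation_operator_name="run_warning_and_failure_expectation_suites",
    )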

I am having trouble with many expectation suites sharing the same Data Docs: I have two pipelines with GE, and on every run my Data Docs site is overwritten.

Could you help me?