How to organize validation results coming from multiple pipelines?

I just want to ask about a known pattern for storing validations/data-docs generated by multiple pipelines.
My first guess is to use different prefixes in the same bucket, something like this:

pipeline/simple-pipeline/validations
pipeline/simple-pipeline/data-docs/site
pipeline/other-pipeline/validations
pipeline/other-pipeline/data-docs/site

The only trade-off is that there is no HTML file indexing the docs for all pipelines.

I see at least two approaches:

If each pipeline has its own Great Expectations Data Context (with its own great_expectations.yml config file), then the approach described in the question is the way to go: configure each Data Docs site to write to a different prefix in the same S3 bucket.
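
For reference, here is a minimal sketch of what that might look like in each pipeline's great_expectations.yml, using the prefixes from the question (the bucket name my-ge-artifacts is a placeholder):

    stores:
      validations_store:
        class_name: ValidationsStore
        store_backend:
          class_name: TupleS3StoreBackend
          bucket: my-ge-artifacts                       # placeholder bucket name
          prefix: pipeline/simple-pipeline/validations  # per-pipeline prefix

    data_docs_sites:
      s3_site:
        class_name: SiteBuilder
        store_backend:
          class_name: TupleS3StoreBackend
          bucket: my-ge-artifacts
          prefix: pipeline/simple-pipeline/data-docs/site  # per-pipeline prefix

The other pipeline would use the same bucket with the pipeline/other-pipeline/validations and pipeline/other-pipeline/data-docs/site prefixes.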

The second approach is for the two pipelines to share a Data Context (one great_expectations.yml config file). When you validate data by running a Validation Operator (or a Checkpoint), you supply a data_asset_name for each batch. As long as these names do not collide across the two pipelines, the validation results from both can coexist in the same Data Docs site.

Interesting… I like the second approach; maybe I can use the pipeline name as a prefix to avoid collisions.

I’m getting TypeError: run() got an unexpected keyword argument 'data_asset_name'
when trying to run the following:

results = context.run_validation_operator(
    assets_to_validate=[batch],
    run_id=airflow_run_id,
    validation_operator_name="run_warning_and_failure_expectation_suites",
    data_asset_name=f'{ras_code}_{model}'
)

Where should I pass the data_asset_name value?

run_validation_operator does not accept the data_asset_name argument.

I assume that you obtained the batch that you are passing to the Validation Operator by calling

batch = context.get_batch(batch_kwargs, expectation_suite_name)

Then you should set “data_asset_name” in your batch_kwargs to the string you want Data Docs to use as the data asset name.
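
Putting that together, a minimal sketch of the corrected code (reusing the names from your snippet; the path and datasource values below are placeholders):

    batch_kwargs = {
        "path": "/data/input.csv",      # placeholder; keep whatever batch_kwargs you already use
        "datasource": "my_datasource",  # placeholder datasource name
        "data_asset_name": f"{ras_code}_{model}",  # Data Docs will display this name
    }
    batch = context.get_batch(batch_kwargs, expectation_suite_name)

    results = context.run_validation_operator(
        assets_to_validate=[batch],
        run_id=airflow_run_id,
        validation_operator_name="run_warning_and_failure_expectation_suites",
    )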

I am having trouble with many expectation suites sharing the same Data Docs: I have two pipelines with GE, and on every run my Data Docs site is overwritten.

Could you help me?