Write/Update Only S3-hosted Data Docs on S3

cityfish · December 1, 2020, 11:54pm

Can someone explain in more detail how skip_and_clean_missing works for DefaultSiteIndexBuilder? Do I understand correctly that, if true, it will remove any HTML that has no related suite/expectation in the current config?

I am trying to run GE on Airflow, with one expectation (and local config) per Operator, but hopefully one Data Docs store (on S3) for all DAGs and all expectations, so I definitely don’t want to delete other validation reports. On top of that, my organization simply does not allow Airflow to delete anything from S3, which makes sense.

It seems that the only way would be to create a custom SiteIndexBuilder with this option set to false (or driven by config), is that right?

eugene.mandel · December 2, 2020, 10:33pm

If the skip_and_clean_missing flag in DefaultSiteIndexBuilder.build is set to True (this is the default), then, when an index page is being built and an existing HTML page does not have corresponding source data (i.e. an expectation suite or validation result was removed from the source store), the HTML page is automatically deleted and will not appear in the index. This ensures that the expectations store and validations store are the source of truth for Data Docs.

Option 1: All the Airflow DAGs share the same GE Data Context (the config file).
Whether the Data Context’s expectations store uses the filesystem or S3, each DAG will have access to the Expectation Suites for all the DAGs. This means that the flag being True does not delete any Expectation Suite HTML files.

Option 1.a.
If the Data Context’s validations_store is a shared one (e.g., S3), there is no issue with deleting validation results’ HTML files.

Option 1.b.
However, if it is configured to use the filesystem, each Operator will delete the HTML files of others’ validation results. To avoid this, you would have to extend DefaultSiteIndexBuilder and set the flag to false in your child class. Then you can set the site_index_builder config property to your class name.
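A minimal sketch of that override, assuming build() accepts skip_and_clean_missing as a keyword argument as described above. NoCleanupMixin, StubIndexBuilder and StubNoCleanupBuilder are hypothetical names used for illustration, and the import path in the comment is an assumption that may differ between GE versions. The stub stands in for DefaultSiteIndexBuilder so the snippet runs without GE installed.

```python
class NoCleanupMixin:
    """Forces skip_and_clean_missing=False on whatever build() it wraps."""

    def build(self, skip_and_clean_missing=True, **kwargs):
        # Ignore the caller's value so existing HTML pages are never deleted.
        return super().build(skip_and_clean_missing=False, **kwargs)


# Stand-in for DefaultSiteIndexBuilder (hypothetical, for demonstration only).
class StubIndexBuilder:
    def build(self, skip_and_clean_missing=True, **kwargs):
        # Echo back the arguments it was called with.
        return {"skip_and_clean_missing": skip_and_clean_missing, **kwargs}


class StubNoCleanupBuilder(NoCleanupMixin, StubIndexBuilder):
    pass


# With GE installed, the real subclass might look like this (path assumed):
# from great_expectations.render.renderer.site_builder import DefaultSiteIndexBuilder
# class NoCleanupSiteIndexBuilder(NoCleanupMixin, DefaultSiteIndexBuilder):
#     pass

result = StubNoCleanupBuilder().build(skip_and_clean_missing=True, build_index=True)
print(result)
```

The site_index_builder entry of the Data Docs site config would then point its class_name (and probably module_name) at the real subclass.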

Option 2: Each Airflow DAG has its own GE Data Context (the config file).
In this case, each Data Context can configure its Data Docs site to use the same S3 bucket, but with a different prefix. This way there will be a site per DAG, which might be pretty convenient. You may choose to manually add another HTML file that links to all these sites.
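As a rough illustration of Option 2, the per-DAG site config could be generated like this. The helper data_docs_site_for_dag, the bucket name, the DAG ids and the data_docs/<dag_id>/ prefix layout are all made up for the example; only the config keys mirror the usual data_docs_sites structure.

```python
def data_docs_site_for_dag(bucket: str, dag_id: str) -> dict:
    """Build a Data Docs site config whose S3 prefix is unique to one DAG."""
    return {
        "class_name": "SiteBuilder",
        "store_backend": {
            "class_name": "TupleS3StoreBackend",
            "bucket": bucket,
            # Hypothetical layout: one folder per DAG under a shared root.
            "prefix": f"data_docs/{dag_id}/",
        },
        "site_index_builder": {"class_name": "DefaultSiteIndexBuilder"},
    }


# Hypothetical bucket and DAG ids; each Data Context would use only its own entry.
sites = {
    f"{dag_id}_site": data_docs_site_for_dag("my-bucket", dag_id)
    for dag_id in ("orders_dag", "users_dag")
}

for name, site in sites.items():
    print(name, "->", site["store_backend"]["prefix"])
```

Because each site only ever sees its own prefix, any cleanup it performs stays within that DAG's pages.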

cityfish · December 3, 2020, 2:31am

Hi Eugene, and thanks for such a detailed explanation.
One question: can we have a shared validation store on S3, as you described, while keeping separate GE Data Contexts (just with the same validation store parameters)?

PS: Actually, I checked my YAML config, and it has the S3 config for validations:

validations_S3_store:
    class_name: ValidationsStore
    store_backend:
        class_name: TupleS3StoreBackend
        bucket: ${AWS_BUCKET}
        prefix: research/great_expectations/validations/
Should that be sufficient? It still triggers GE to try to remove keys from S3, so…

eugene.mandel · December 3, 2020, 9:04pm

These separate Data Contexts can use the same S3 bucket for their validations store, but each should have its own prefix so that they do not overwrite each other’s data.

cityfish · December 4, 2020, 3:44am

Makes sense. It won’t solve my issue with documents that way, though, right?
Let me try writing a custom SiteIndexBuilder, then.
