I would like to create a function, `validate(df: pyspark.sql.DataFrame, expectations: List[great_expectations.expectations.Expectation]) -> None`, that validates the `expectations` against the `df`. How would I go about implementing this function? After browsing the documentation and the codebase a little, I think I need to convert the `df` to a `Batch` using a `DataContext`, bind the `expectations` to an `ExpectationConfiguration`, and then validate the `ExpectationConfiguration` over the `Batch`. Does this seem reasonable? Is there a function somewhere in the library that already does this?
Also, completely tangential question, but how did you guys get the notebook auto-completion in your examples (e.g., How to quickly explore Expectations in a notebook — great_expectations documentation) to work?
I have made a little bit of progress. I think I need to do something like this:

```python
import tempfile

import great_expectations as ge


def validate(df, expectations) -> None:
    with tempfile.TemporaryDirectory() as temp:
        data_context_config = ge.data_context.types.base.DataContextConfig(
            datasources={
                "spark": {
                    "data_asset_type": {
                        "class_name": "SparkDFDataset",
                        "module_name": "great_expectations.dataset",
                    },
                    "class_name": "SparkDFDatasource",
                    "module_name": "great_expectations.datasource",
                    "batch_kwargs_generators": {},
                },
            },
            # store_backend_defaults (not stores=) is the kwarg that takes
            # a FilesystemStoreBackendDefaults
            store_backend_defaults=ge.data_context.types.base.FilesystemStoreBackendDefaults(
                root_directory=temp,
            ),
        )
        data_context = ge.data_context.BaseDataContext(
            project_config=data_context_config)
        expectation_suite = data_context.create_expectation_suite(
            expectation_suite_name="expectation_suite")
        batch = data_context.get_batch(
            batch_kwargs={
                "dataset": df,  # validate the function's argument
                "datasource": "spark",
            },
            expectation_suite_name="expectation_suite")
```
The part I'm struggling with is how to run the expectations on the batch. Do you guys have any suggestions? The documentation makes it seem like you have to call the expectation methods directly on the batch, but I want to create the expectations first and then run them against the batch later.
This how-to guide shows how to validate a Spark dataframe: How to load a Spark DataFrame as a Batch — great_expectations documentation
I think you can assemble your function from its parts.