I would like to create a function, `validate(df: pyspark.sql.DataFrame, expectations: List[great_expectations.expectations.Expectation]) -> None`, that validates the `expectations` against the `df`. How would I go about implementing this function? After browsing the documentation and the codebase a little, I think I need to convert the `df` to a `Batch` using a `DataContext`, bind the `expectations` to an `ExpectationConfiguration`, and then validate the `ExpectationConfiguration` over the `Batch`. Does this seem reasonable? Is there a function somewhere in the library that already does this?
Also, completely tangential question, but how did you guys get the notebook auto-completion in your examples (e.g., How to quickly explore Expectations in a notebook — great_expectations documentation) to work?
I have made a little bit of progress. I think I need to do something like this:

```python
import tempfile

import great_expectations as ge


def validate(df, expectations) -> None:
    with tempfile.TemporaryDirectory() as temp:
        data_context_config = ge.data_context.types.base.DataContextConfig(
            datasources={
                "spark": {
                    "data_asset_type": {
                        "class_name": "SparkDFDataset",
                        "module_name": "great_expectations.dataset",
                    },
                    "class_name": "SparkDFDatasource",
                    "module_name": "great_expectations.datasource",
                    "batch_kwargs_generators": {},
                },
            },
            # store_backend_defaults (not stores=) is the kwarg that takes
            # a FilesystemStoreBackendDefaults
            store_backend_defaults=ge.data_context.types.base.FilesystemStoreBackendDefaults(
                root_directory=temp,
            ),
        )
        data_context = ge.data_context.BaseDataContext(
            project_config=data_context_config)
        expectation_suite = data_context.create_expectation_suite(
            expectation_suite_name="expectation_suite")
        batch = data_context.get_batch(
            batch_kwargs={
                "dataset": df,  # validate the function's argument
                "datasource": "spark",
            },
            expectation_suite_name="expectation_suite")
```
The part I'm struggling with is how to run the expectations on the batch. Do you guys have any suggestions? The documentation makes it seem like you have to call the expectation methods directly on the batch, but I want to create the expectations first and then run them against the batch later.
This how-to guide shows how to validate a Spark dataframe: How to load a Spark DataFrame as a Batch — great_expectations documentation
I think you can assemble your function from its parts.