GE in Databricks

Hi all, I am trying to integrate GE with Databricks following https://www.unsupervised-learnings.co.uk/post/setting-your-data-expectations-data-profiling-and-testing-with-the-great-expectations-library/ and I would like to save our own expectations, tailored to our needs, rather than the full set of profiling expectations.
But I am running into trouble when saving the expectation suite:
context.save_expectation_suite(discard_failed_expectations = False)
save_expectation_suite_usage_statistics() got an unexpected keyword argument 'discard_failed_expectations'

Am I missing something? Thank you!

This blog post is indeed awesome!

Looks like a mistake sneaked into this code snippet in a recent edit:

spark_data.expect_column_values_to_not_be_null("PROJECT_ID")
spark_data.expect_column_values_to_be_in_set("DELIVERY_GROUP_LATEST", set(['IP Legacy', 'IP AM SP&C']))

spark_data.validate()

context.save_expectation_suite(discard_failed_expectations = False)

It should be “spark_data.save_expectation_suite(discard_failed_expectations = False)”, not context.
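
In other words, keeping the rest of the snippet the same, the last line becomes:

spark_data.save_expectation_suite(discard_failed_expectations = False)

The Data Context has a save_expectation_suite method of its own, but as far as I remember it expects an ExpectationSuite object rather than keyword arguments like discard_failed_expectations, which is why that call errors out.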

Thanks @eugene.mandel!
I think things are getting mixed up in my config. Here is the error I get:

Unable to save config: filepath or data_context must be available.

I had to change some of the config options in the yml to make it work initially, but I am not sure I have the correct syntax:

datasources:
  spark:
    class_name: SparkDFDatasource
    batch_kwargs_generators:
      passthrough:
        class_name: DatabricksTableBatchKwargsGenerator
    data_asset_type:
      module_name:
      class_name: SparkDFDataset

Thanks for the help!

I see - “spark_data.save_expectation_suite” can be called with these arguments if spark_data is associated with a Data Context object. Usually we obtain a data asset (in this case “spark_data”) from the Data Context, but to make it work in this code snippet, we can provide a reference to a Data Context when we create the spark_data object:

Instead of:
spark_data = SparkDFDataset(data)

use this:
spark_data = SparkDFDataset(data, data_context=context)
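
With the Data Context attached, save_expectation_suite knows where to write the suite, which should also take care of the “Unable to save config: filepath or data_context must be available” error.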

Fantastic!
I think I’m almost there…
I now have an expectation JSON file containing only the expectations I specified, but I can’t seem to set the output filename; it’s always saved as default.json.

data = spark.read.parquet("/xxxx/business_classification_dim_parquet")
data_asset_name = "business_classification_dim_parquet"

context.create_expectation_suite(data_asset_name, True)

spark_data = SparkDFDataset(data, data_context=context)

# create expectations:
spark_data.expect_table_columns_to_match_ordered_list(["xxxx", "xxx"])

spark_data.validate()

# save expectations
spark_data.save_expectation_suite("business_classification_dim_parquet")

Thank you so much!

OK, I think I figured it out; this seems to work now:

spark_data.save_expectation_suite(filepath="/dbfs/ge_er_test/expectations/business_classification11-05-2020_expectations.json")

Now one last question: is it possible to read Delta tables instead of Parquet? I added

reader_method: delta

in the config file, but it doesn’t seem that spark.read.parquet can be replaced by spark.read.delta?
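
I’m guessing I would need something like this instead (the path is just a placeholder):

data = spark.read.format("delta").load("/path/to/delta_table")

but I was hoping reader_method: delta would take care of that through the datasource config.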

Thanks again @eugene.mandel!
