Hi all, I am trying to integrate GE with Databricks following https://www.unsupervised-learnings.co.uk/post/setting-your-data-expectations-data-profiling-and-testing-with-the-great-expectations-library/, and I would like to save our own expectations, rather than the full profiling suite, so that they fit our needs.
But I am running into trouble when saving the expectation suite:
context.save_expectation_suite(discard_failed_expectations = False)
save_expectation_suite_usage_statistics() got an unexpected keyword argument 'discard_failed_expectations'
Am I missing something? Thank you!
This blog post is indeed awesome!
Looks like a mistake sneaked into this code snippet in a recent edit:
spark_data.expect_column_values_to_not_be_null("PROJECT_ID")
spark_data.expect_column_values_to_be_in_set("DELIVERY_GROUP_LATEST", set(['IP Legacy', 'IP AM SP&C']))
spark_data.validate()
context.save_expectation_suite(discard_failed_expectations = False)
It should be “spark_data.save_expectation_suite(discard_failed_expectations = False)”, not context.
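So the corrected snippet would look like this (same expectations as above, only the last line changes):
spark_data.expect_column_values_to_not_be_null("PROJECT_ID")
spark_data.expect_column_values_to_be_in_set("DELIVERY_GROUP_LATEST", set(['IP Legacy', 'IP AM SP&C']))
spark_data.validate()
# save the suite from the data asset itself, keeping failed expectations
spark_data.save_expectation_suite(discard_failed_expectations=False)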
Thanks @eugene.mandel!
I think things are getting mixed up in my config; here is the error I get:
Unable to save config: filepath or data_context must be available.
I had to change some of the config options in the yml to make it work initially, but I am not sure I have the correct syntax:
datasources:
  spark:
    class_name: SparkDFDatasource
    batch_kwargs_generators:
      passthrough:
        class_name: DatabricksTableBatchKwargsGenerator
    data_asset_type:
      module_name:
      class_name: SparkDFDataset
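The part I'm least sure about is data_asset_type - I'm assuming it should point at the module that provides SparkDFDataset, i.e. under the spark datasource it would read:
data_asset_type:
  module_name: great_expectations.dataset   # assumption: the module that provides SparkDFDataset
  class_name: SparkDFDataset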
Thanks for the help!
I see - “spark_data.save_expectation_suite” can be called with these arguments if spark_data is associated with a Data Context object. Usually we obtain a data asset (in this case “spark_data”) from a Data Context, but to make it work in this code snippet, we can provide a reference to a DataContext when we create the spark_data object.
Instead of:
spark_data = SparkDFDataset(data)
this:
spark_data = SparkDFDataset(data, data_context=context)
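Putting the pieces together, a minimal sketch of how spark_data ends up attached to a context (the great_expectations project path is just an example - adjust it to your workspace):
from great_expectations.data_context import DataContext
from great_expectations.dataset import SparkDFDataset

# load the project's Data Context from its root directory (example path)
context = DataContext("/dbfs/great_expectations")

# `data` is the Spark DataFrame loaded earlier; attaching the context is what
# allows save_expectation_suite() to work without an explicit filepath
spark_data = SparkDFDataset(data, data_context=context)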
Fantastic!
I think I’m almost there…
I now have an expectation JSON file containing only the expectations I specified, but I can't seem to set the output filename; it is always saved as default.json.
data = spark.read.parquet("/xxxx/business_classification_dim_parquet")
data_asset_name = "business_classification_dim_parquet"
context.create_expectation_suite(data_asset_name, overwrite_existing=True)
spark_data = SparkDFDataset(data, data_context=context)
# create expectations:
spark_data.expect_table_columns_to_match_ordered_list(["xxxx", "xxx"])
spark_data.validate()
# save expectations
spark_data.save_expectation_suite("business_classification_dim_parquet")
Thank you so much!
OK, I think I figured it out; this seems to work now:
spark_data.save_expectation_suite(filepath="/dbfs/ge_er_test/expectations/business_classification11-05-2020_expectations.json")
Now one last question: is it possible to read Delta tables instead of Parquet? I added
reader_method: delta
in the config file, but it doesn't seem that spark.read.parquet can simply be replaced with spark.read.delta?
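For context, reading the table directly in Spark does work with the generic reader, e.g. (the path here is just an example):
# load a Delta table with the generic Spark reader
data = spark.read.format("delta").load("/xxxx/business_classification_dim_delta")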
Thanks again @eugene.mandel !