When you use the CLI to scaffold a new Expectation Suite, it creates a Jupyter notebook that runs the scaffolding code for you. In environments with notebook access but no shell, such as Spark and Databricks, you can run the same code directly in your own notebook. Replace “xyz1” with your preferred Expectation Suite name.
Scaffold a new Expectation Suite (Experimental)
This process helps you avoid writing lots of boilerplate when authoring suites by allowing you to select columns you care about and letting a profiler write some candidate expectations for you to adjust.
Expectation Suite Name: xyz1
Cell 1: Imports
import datetime
import great_expectations as ge
from great_expectations.profile import BasicSuiteBuilderProfiler
from great_expectations.data_context.types.resource_identifiers import (
    ValidationResultIdentifier,
)
Cell 2: Data Context
context = ...
# Follow this guide to instantiate a Data Context: https://docs.greatexpectations.io/en/latest/guides/how_to_guides/configuring_data_contexts/how_to_instantiate_a_data_context_on_an_emr_spark_cluster.html
Cell 3: Create Expectation Suite
expectation_suite_name = "xyz1"
suite = context.create_expectation_suite(
    expectation_suite_name, overwrite_existing=True
)
Cell 4: Create a Batch
Follow this guide to define the batch_kwargs for your data (they are used in Cell 6) and create a batch: https://docs.greatexpectations.io/en/latest/guides/how_to_guides/creating_batches.html#
Select the columns on which you would like to scaffold expectations
Great Expectations will choose which expectations might make sense for a column based on the data type and cardinality of the data in each selected column.
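To give a feel for what choosing expectations "based on the data type and cardinality" can look like, here is a minimal sketch in plain Python. It is not the profiler's actual logic: the function names `classify_cardinality` and `candidate_expectations`, the threshold of 10 distinct values, and the mapping from cardinality to expectations are all assumptions made up for illustration. Only the expectation names themselves are real Great Expectations expectation types.

```python
def classify_cardinality(values, few_threshold=10):
    """Bucket a column by its count of distinct non-null values.

    The bucket names and the threshold are illustrative assumptions,
    not the categories the real profiler uses.
    """
    non_null = [v for v in values if v is not None]
    if not non_null:
        return "none"
    distinct = len(set(non_null))
    if distinct == len(non_null):
        return "unique"
    if distinct <= few_threshold:
        return "few"
    return "many"


def candidate_expectations(values):
    """Suggest expectation types for one column (hypothetical mapping)."""
    non_null = [v for v in values if v is not None]
    cardinality = classify_cardinality(values)

    candidates = ["expect_column_values_to_not_be_null"]
    if cardinality == "unique":
        candidates.append("expect_column_values_to_be_unique")
    elif cardinality == "few":
        # A small set of repeated values looks categorical.
        candidates.append("expect_column_values_to_be_in_set")

    # bool is a subclass of int in Python, so exclude it explicitly.
    is_numeric = bool(non_null) and all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in non_null
    )
    if is_numeric and cardinality in ("unique", "many"):
        candidates += [
            "expect_column_min_to_be_between",
            "expect_column_max_to_be_between",
        ]
    return candidates
```

For example, `candidate_expectations(["a", "b", "a"])` suggests a null check plus a value-set check, while a column of distinct numbers gets uniqueness and min/max candidates. The real profiler makes its own choices, so treat this only as intuition for why the columns you select matter.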
Cell 5: Specify Included Columns
included_columns = [
    "col1",
    "col2",
    ...
]
Run the scaffolder
The suites generated here are not meant to be production suites - they are scaffolds to build upon.
To reach a production-grade suite, edit the scaffolded suite once scaffolding has gotten you close to what you want.
This is highly configurable depending on your goals. You can include or exclude columns, and include or exclude expectation types (when applicable). The Expectation Glossary contains a list of possible expectations.
Cell 6: Run the Scaffolder
# Wipe the suite clean to prevent unwanted expectations in the batch
suite = context.create_expectation_suite(expectation_suite_name, overwrite_existing=True)
batch = context.get_batch(batch_kwargs, suite)
# In the scaffold_config, included or excluded expectation names should be strings.
scaffold_config = {
    "included_columns": included_columns,
    # "excluded_columns": [],
    # "included_expectations": [],
    # "excluded_expectations": [],
}
suite, evr = BasicSuiteBuilderProfiler().profile(batch, profiler_configuration=scaffold_config)
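Because the scaffold configuration is a plain dictionary of name lists, a small helper can catch mistakes (such as passing a single string, or non-string entries) before the profiler runs. The following is a minimal sketch; `build_scaffold_config` is a hypothetical helper and not part of Great Expectations. It only uses the four keys shown in the cell above.

```python
def build_scaffold_config(
    included_columns=None,
    excluded_columns=None,
    included_expectations=None,
    excluded_expectations=None,
):
    """Build a scaffold_config dict, keeping only the options that were given.

    Hypothetical helper: it checks that every name is a string, which is the
    one requirement stated for the scaffold configuration.
    """
    options = {
        "included_columns": included_columns,
        "excluded_columns": excluded_columns,
        "included_expectations": included_expectations,
        "excluded_expectations": excluded_expectations,
    }
    config = {}
    for key, names in options.items():
        if names is None:
            continue
        if isinstance(names, str):
            # A bare string would otherwise be split into single characters.
            raise TypeError(f"{key} must be a list of strings, not a single string")
        names = list(names)
        bad = [name for name in names if not isinstance(name, str)]
        if bad:
            raise TypeError(f"{key} must contain only strings, got: {bad!r}")
        config[key] = names
    return config
```

A call such as `build_scaffold_config(included_columns=["col1", "col2"], excluded_expectations=["expect_column_values_to_be_unique"])` returns a dictionary that can be passed as `profiler_configuration`. Whether a given combination of options is accepted is still up to the profiler itself.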
Save the scaffolded Expectation Suite
Let’s save the scaffolded expectation suite as a JSON file in the great_expectations/expectations directory of your project and rebuild the Data Docs site to make it easy to review the scaffolded suite.
Cell 7: Save the scaffolded Expectation Suite
context.save_expectation_suite(suite, expectation_suite_name)
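If you want to skim what the scaffolder wrote without opening Data Docs, you can read the saved JSON directly. The sketch below assumes the suite file has a top-level "expectations" list whose items carry "expectation_type" and "kwargs" keys, and that your expectations store is a local filesystem; the helper names and the file path in the usage note are hypothetical.

```python
import json
from pathlib import Path


def summarize_suite_dict(suite_dict):
    """Group expectation types by the column they apply to.

    Assumes each expectation is a dict with "expectation_type" and "kwargs";
    expectations without a "column" kwarg are grouped as table-level.
    """
    summary = {}
    for expectation in suite_dict.get("expectations", []):
        column = expectation.get("kwargs", {}).get("column", "<table-level>")
        summary.setdefault(column, []).append(expectation["expectation_type"])
    return summary


def summarize_suite_file(path):
    """Load a saved suite JSON file and summarize it (path is supplied by you)."""
    return summarize_suite_dict(json.loads(Path(path).read_text()))
```

For instance, `summarize_suite_file("great_expectations/expectations/xyz1.json")` would return a dictionary mapping each column name to the expectation types scaffolded for it, provided the suite was saved at that location under that file name.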
Cell 8: Run the Expectation Suite on your Batch
results = context.run_validation_operator("action_list_operator", assets_to_validate=[batch])
validation_result_identifier = results.list_validation_result_identifiers()[0]
context.build_data_docs()
context.open_data_docs(validation_result_identifier)