Profiling a dataset / scaffolding an Expectation Suite without using the CLI (e.g. Spark, Databricks)

When you use the CLI to scaffold a new Expectation Suite, it generates a Jupyter notebook containing the code that creates the scaffold. In environments that offer notebook access but no shell, such as Spark and Databricks, you can run the same code in your own notebook. Replace "xyz1" with your preferred Expectation Suite name.


Scaffold a new Expectation Suite (Experimental)
This process helps you avoid writing lots of boilerplate when authoring suites by allowing you to select columns you care about and letting a profiler write some candidate expectations for you to adjust.

Expectation Suite Name: xyz1


Cell 1: Imports

import datetime
import great_expectations as ge
from great_expectations.profile import BasicSuiteBuilderProfiler
from great_expectations.data_context.types.resource_identifiers import (
    ValidationResultIdentifier,
)

Cell 2: Data Context

context = ...

# Follow this guide to instantiate a Data Context: https://docs.greatexpectations.io/en/latest/guides/how_to_guides/configuring_data_contexts/how_to_instantiate_a_data_context_on_an_emr_spark_cluster.html


Cell 3: Create Expectation Suite

expectation_suite_name = "xyz1"
suite = context.create_expectation_suite(
    expectation_suite_name, overwrite_existing=True
)

Cell 4: Create a Batch

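Follow this guide to create a batch: https://docs.greatexpectations.io/en/latest/guides/how_to_guides/creating_batches.html#

Whichever approach you choose, assign the result to a variable named batch_kwargs, because Cell 6 passes it to context.get_batch.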
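For a DataFrame that is already loaded in the notebook, the batch_kwargs can be a plain dictionary. The sketch below is a hypothetical helper, not part of Great Expectations; it assumes the "dataset", "datasource" and optional "data_asset_name" keys described in the linked guide for in-memory DataFrames, and the datasource name is a placeholder for one configured in your Data Context.

```python
from typing import Any, Dict, Optional


def make_in_memory_batch_kwargs(
    dataframe: Any,
    datasource_name: str,
    data_asset_name: Optional[str] = None,
) -> Dict[str, Any]:
    """Build batch_kwargs for a DataFrame that already exists in the notebook.

    Hypothetical helper: the key names are an assumption based on the
    Great Expectations guide for loading an in-memory DataFrame as a batch.
    """
    batch_kwargs: Dict[str, Any] = {
        "dataset": dataframe,           # the Spark (or pandas) DataFrame itself
        "datasource": datasource_name,  # must match a datasource in your Data Context
    }
    if data_asset_name is not None:
        batch_kwargs["data_asset_name"] = data_asset_name
    return batch_kwargs


# Usage in the notebook (df and the datasource name are placeholders):
# batch_kwargs = make_in_memory_batch_kwargs(df, "my_spark_datasource")
```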


Select the columns on which you would like to scaffold expectations
Great Expectations will choose which expectations might make sense for a column based on the data type and cardinality of the data in each selected column.
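Before picking columns, it can help to look at the type and cardinality of each one yourself. The following is only an illustrative sketch in plain Python; it is not the logic the profiler uses, and summarize_columns and sample_rows are made-up names for this example.

```python
from typing import Any, Dict, Iterable


def summarize_columns(rows: Iterable[Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """Return the observed Python type names and distinct-value count per column.

    Values must be hashable, because they are collected into sets.
    """
    distinct: Dict[str, set] = {}
    type_names: Dict[str, set] = {}
    for row in rows:
        for column, value in row.items():
            distinct.setdefault(column, set()).add(value)
            type_names.setdefault(column, set()).add(type(value).__name__)
    return {
        column: {
            "types": sorted(type_names[column]),
            "cardinality": len(distinct[column]),
        }
        for column in distinct
    }


# A tiny hand-written sample standing in for real data.
sample_rows = [
    {"col1": "a", "col2": 1},
    {"col1": "b", "col2": 1},
    {"col1": "a", "col2": 2},
    {"col1": "c", "col2": 2},
]
summary = summarize_columns(sample_rows)
for column, info in summary.items():
    print(column, info)
```

On Spark, rows in this shape could come from a small sample, for example [r.asDict() for r in df.limit(1000).collect()].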


Cell 5: Specify Included Columns

included_columns = [
    "col1",
    "col2",
    ...
]

Run the scaffolder
The suites generated here are not meant to be production suites - they are scaffolds to build upon.

To reach a production-grade suite, plan to edit the scaffolded suite once it gets you close to what you want.

The scaffolder is highly configurable depending on your goals. You can include or exclude columns, and include or exclude expectation types (when applicable). The Expectation Glossary contains a list of possible expectations.
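The include/exclude options can be assembled with a small helper like the one below. This is a hypothetical sketch, not part of Great Expectations: it uses the four configuration keys that appear in Cell 6 and, as a conservative assumption, refuses to set both the include and the exclude list for the same dimension. The expectation name in the usage line is an example taken from the Expectation Glossary.

```python
from typing import Dict, List, Optional


def build_scaffold_config(
    included_columns: Optional[List[str]] = None,
    excluded_columns: Optional[List[str]] = None,
    included_expectations: Optional[List[str]] = None,
    excluded_expectations: Optional[List[str]] = None,
) -> Dict[str, List[str]]:
    """Return a scaffold_config dict containing only the options that were set.

    Hypothetical helper. Rejecting include+exclude together is an assumption
    made here to keep the configuration unambiguous.
    """
    if included_columns and excluded_columns:
        raise ValueError("Specify either included_columns or excluded_columns, not both.")
    if included_expectations and excluded_expectations:
        raise ValueError(
            "Specify either included_expectations or excluded_expectations, not both."
        )
    candidates = {
        "included_columns": included_columns,
        "excluded_columns": excluded_columns,
        "included_expectations": included_expectations,
        "excluded_expectations": excluded_expectations,
    }
    # Drop unset or empty options so the dict only carries real choices.
    return {key: list(value) for key, value in candidates.items() if value}


scaffold_config_example = build_scaffold_config(
    included_columns=["col1", "col2"],
    excluded_expectations=["expect_column_values_to_be_unique"],
)
print(scaffold_config_example)
```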


Cell 6: Run the Scaffolder

# Wipe the suite clean to prevent unwanted expectations in the batch
suite = context.create_expectation_suite(expectation_suite_name, overwrite_existing=True)
batch = context.get_batch(batch_kwargs, suite)

# In the scaffold_config, included or excluded expectation names should be strings.
scaffold_config = {
    "included_columns": included_columns,
    # "excluded_columns": [],
    # "included_expectations": [],
    # "excluded_expectations": [],
}
suite, evr = BasicSuiteBuilderProfiler().profile(batch, profiler_configuration=scaffold_config)

Save the scaffolded Expectation Suite
Let’s save the scaffolded expectation suite as a JSON file in the great_expectations/expectations directory of your project and rebuild the Data Docs site to make it easy to review the scaffolded suite.


Cell 7: Save the scaffolded Expectation Suite

context.save_expectation_suite(suite, expectation_suite_name)

Cell 8: Run the Expectation Suite on your Batch

results = context.run_validation_operator("action_list_operator", assets_to_validate=[batch])
validation_result_identifier = results.list_validation_result_identifiers()[0]
context.build_data_docs()
context.open_data_docs(validation_result_identifier)

In step 7, where in Databricks does this JSON file get saved? How can I access it from within Databricks?

Hi Sandra! If you are using the TupleFilesystemStoreBackend, these JSON files will be written to DBFS. There are several ways to interact with DBFS, so you may want to review this document to choose the one that suits your environment and workflow: https://docs.databricks.com/data/databricks-file-system.html. Personally I like the CLI and dbutils.
If you wish to edit the Expectation Suite, you can do that in a Databricks notebook as well. It may help to open a Jupyter notebook from an example Great Expectations setup to see which code loads the Expectation Suite, along with instructions for editing and saving it.
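To peek at a saved suite from a notebook cell, something like the sketch below may be enough. It assumes the store's base directory lives on DBFS and that the cluster exposes DBFS through the local /dbfs mount; the path in the usage comment is hypothetical and depends on your Data Context configuration. It also assumes the suite JSON has a top-level "expectations" list whose entries carry an "expectation_type" key. list_expectation_types is a made-up helper name.

```python
import json
from pathlib import Path
from typing import List


def list_expectation_types(suite_path: str) -> List[str]:
    """Read an Expectation Suite JSON file and return its expectation types in order.

    Assumes a top-level "expectations" list of objects with an
    "expectation_type" key; returns an empty list if the key is missing.
    """
    suite = json.loads(Path(suite_path).read_text())
    return [exp["expectation_type"] for exp in suite.get("expectations", [])]


# Usage on Databricks (hypothetical path; adjust to your store's base directory):
# list_expectation_types("/dbfs/great_expectations/expectations/xyz1.json")
```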
