Profiling a dataset / scaffolding an expectation suite without using the CLI (e.g. spark, databricks)

anthony · August 19, 2020, 8:37pm

When you use the CLI to scaffold a new Expectation Suite, it creates a Jupyter notebook that walks you through the code for building the scaffold. In environments with no shell and only notebook access, such as Spark and Databricks, you can run the same code directly in your own notebook. Replace "xyz1" with your preferred Expectation Suite name.

Scaffold a new Expectation Suite (Experimental)
This process helps you avoid writing lots of boilerplate when authoring suites by allowing you to select columns you care about and letting a profiler write some candidate expectations for you to adjust.

Expectation Suite Name: xyz1

Cell 1: Imports

import datetime
import great_expectations as ge
from great_expectations.profile import BasicSuiteBuilderProfiler
from great_expectations.data_context.types.resource_identifiers import (
    ValidationResultIdentifier,
)

Cell 2: Data Context

context = ...

# Follow this guide to instantiate a Data Context: https://docs.greatexpectations.io/en/latest/guides/how_to_guides/configuring_data_contexts/how_to_instantiate_a_data_context_on_an_emr_spark_cluster.html

Cell 3: Create Expectation Suite

expectation_suite_name = "xyz1"
suite = context.create_expectation_suite(
    expectation_suite_name, overwrite_existing=True
)

Cell 4: Create a Batch

Follow this guide to create a batch and to define the batch_kwargs that Cell 6 passes to context.get_batch: https://docs.greatexpectations.io/en/latest/guides/how_to_guides/creating_batches.html#

Select the columns on which you would like to scaffold expectations
Great Expectations will choose which expectations might make sense for a column based on the data type and cardinality of the data in each selected column.
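
To make the idea of choosing expectations from data type and cardinality concrete, here is a minimal plain-Python sketch. It is not the logic BasicSuiteBuilderProfiler actually uses: the function name `candidate_expectations` and the `MAX_CATEGORICAL_DISTINCT` cutoff are hypothetical and chosen only for illustration. The expectation type names are real Great Expectations expectations.

```python
from typing import List, Sequence

# Hypothetical cutoff: a column with at most this many distinct values is
# treated as categorical. The real profiler applies its own rules.
MAX_CATEGORICAL_DISTINCT = 10


def candidate_expectations(values: Sequence[object]) -> List[str]:
    """Suggest expectation types for one column from a sample of its values."""
    non_null = [v for v in values if v is not None]
    candidates: List[str] = []

    # No missing values in a non-empty sample: a not-null check is plausible.
    if values and len(non_null) == len(values):
        candidates.append("expect_column_values_to_not_be_null")
    if not non_null:
        return candidates

    distinct = set(non_null)
    if len(distinct) <= MAX_CATEGORICAL_DISTINCT:
        # Low cardinality: the observed values could form a value set.
        candidates.append("expect_column_values_to_be_in_set")
    elif all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in non_null
    ):
        # High-cardinality numeric column: range checks are more useful.
        candidates.append("expect_column_min_to_be_between")
        candidates.append("expect_column_max_to_be_between")
    return candidates
```

A column such as `["a", "b", "a"]` would get a not-null and a value-set candidate under these assumed rules, while a column of 100 distinct integers would get a not-null candidate plus min and max range candidates.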

Cell 5: Specify Included Columns

included_columns = [
    "col1",
    "col2",
    ...
]

Run the scaffolder
The suites generated here are not meant to be production suites - they are scaffolds to build upon.

To get to a production grade suite, you will definitely want to edit this suite after scaffolding gets you close to what you want.

This is highly configurable depending on your goals. You can include or exclude columns, and include or exclude expectation types (when applicable). The Expectation Glossary contains a list of possible expectations.

Cell 6: Run the Scaffolder

# Wipe the suite clean to prevent unwanted expectations in the batch
suite = context.create_expectation_suite(expectation_suite_name, overwrite_existing=True)
batch = context.get_batch(batch_kwargs, suite)

# In the scaffold_config, included or excluded expectation names should be strings.
scaffold_config = {
    "included_columns": included_columns,
    # "excluded_columns": [],
    # "included_expectations": [],
    # "excluded_expectations": [],
}
suite, evr = BasicSuiteBuilderProfiler().profile(batch, profiler_configuration=scaffold_config)
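
If you build the scaffold_config programmatically, a small helper can catch the mistake the comment above warns about, such as passing something other than a string as an expectation name. This is only a sketch: `build_scaffold_config` is a hypothetical helper, not part of Great Expectations. It uses only the four keys shown in Cell 6 and leaves out any option you do not set.

```python
from typing import Dict, Iterable, List, Optional


def build_scaffold_config(
    included_columns: Optional[Iterable[str]] = None,
    excluded_columns: Optional[Iterable[str]] = None,
    included_expectations: Optional[Iterable[str]] = None,
    excluded_expectations: Optional[Iterable[str]] = None,
) -> Dict[str, List[str]]:
    """Build a scaffold_config dict, keeping only the options that were set."""
    options = {
        "included_columns": included_columns,
        "excluded_columns": excluded_columns,
        "included_expectations": included_expectations,
        "excluded_expectations": excluded_expectations,
    }
    config: Dict[str, List[str]] = {}
    for key, names in options.items():
        if names is None:
            continue
        # A bare string would otherwise be split into single characters.
        if isinstance(names, str):
            raise TypeError(f"{key} must be a list of strings, not a single string")
        names = list(names)
        bad = [n for n in names if not isinstance(n, str)]
        if bad:
            raise TypeError(f"{key} must contain only strings, got: {bad!r}")
        config[key] = names
    return config
```

The result can be passed as profiler_configuration in place of the hand-written dict, for example `build_scaffold_config(included_columns=included_columns)`.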

Save the scaffolded Expectation Suite
Let’s save the scaffolded expectation suite as a JSON file in the great_expectations/expectations directory of your project and rebuild the Data Docs site to make it easy to review the scaffolded suite.

Cell 7: Save the scaffolded Expectation Suite

context.save_expectation_suite(suite, expectation_suite_name)

Cell 8: Run the Expectation Suite on your Batch

results = context.run_validation_operator("action_list_operator", assets_to_validate=[batch])
validation_result_identifier = results.list_validation_result_identifiers()[0]
context.build_data_docs()
context.open_data_docs(validation_result_identifier)

Sandra · October 22, 2020, 11:26pm

In step 7, where in Databricks does this JSON file get saved? How can I access it from within Databricks?

anthony · October 23, 2020, 2:22pm

Hi Sandra! If you are using the TupleFilesystemStoreBackend, then these JSON files will be written to DBFS. There are a few ways to interact with DBFS, so you may wish to review this document to choose the one that suits your environment and workflow: https://docs.databricks.com/data/databricks-file-system.html. Personally I like the CLI and dbutils.
If you wish to edit the expectation suite, you can do so in a Databricks notebook as well. It may help to open a Jupyter notebook from an example Great Expectations setup to see which code loads the expectation suite, along with instructions on editing and saving.
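
As a rough sketch of inspecting a saved suite from a notebook, the snippet below reads the JSON file with the standard library and counts expectations by type. It assumes the filesystem store writes a suite named "xyz1" to xyz1.json under the expectations directory, with dots in a suite name becoming subdirectories, and that the file has an "expectations" list whose entries carry an "expectation_type". The helpers `suite_path` and `summarize_suite` are hypothetical names, not Great Expectations APIs.

```python
import json
from collections import Counter
from pathlib import Path
from typing import Dict


def suite_path(expectations_dir: str, suite_name: str) -> Path:
    """Return the assumed JSON path for a suite in a filesystem store."""
    # Assumption: "a.b" is stored as <expectations_dir>/a/b.json.
    parts = suite_name.split(".")
    return Path(expectations_dir, *parts[:-1], parts[-1] + ".json")


def summarize_suite(path: Path) -> Dict[str, int]:
    """Count the expectations in a saved suite file by expectation type."""
    with open(path) as f:
        suite = json.load(f)
    counts = Counter(e["expectation_type"] for e in suite.get("expectations", []))
    return dict(counts)
```

On Databricks, DBFS is usually reachable from Python through the local /dbfs mount, so the expectations_dir argument would be the /dbfs path that corresponds to the base_directory configured for your expectations store.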
