Is there a way to run the scaffold CLI command in a Python script like a function?

I haven’t found a doc or example on how to call CLI commands as functions in Python; I’m especially interested in the ‘scaffold’ command.


Thanks for the great question! Can you tell me a little bit more about the intended use case? When you say you’re especially interested in the ‘scaffold’ command, do you mean just the scaffolding of a GE project directory structure? Or by ‘scaffold’, do you mean the great_expectations init command?

When you run great_expectations init in the command line and complete the guided process, a number of things happen:

  • scaffolding of a Great Expectations project directory structure and creation of a great_expectations.yml config file
  • addition of a Datasource
  • creation of an example Expectation Suite and validation from above Datasource
  • creation of a Data Docs HTML site

It’s generally preferable to go through the CLI init process, but it is possible to achieve the above steps in a script by calling the same methods directly. If you give me a bit more info, I can put together an example script you can work off of.
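For reference, the first bullet above (project scaffolding) is what produces the great_expectations.yml at the project root. A heavily trimmed sketch of its shape follows — treat every key and name here as illustrative, since the exact contents vary by GE version:

```yaml
# great_expectations.yml (trimmed, illustrative sketch only)
config_version: 1
datasources: {}                 # populated when you add a Datasource
expectations_store_name: expectations_store
validations_store_name: validations_store
stores:
  expectations_store:
    class_name: ExpectationsStore
  validations_store:
    class_name: ValidationsStore
data_docs_sites:
  local_site:
    class_name: SiteBuilder
```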

Sure! By ‘scaffold’ I meant the great_expectations suite scaffold SUITE_NAME command.

I have a big in-memory Pandas dataframe (it’s an intermediate step of a big machine learning pipeline), and I would like to use the scaffold command to automatically generate expectations.

I’ve tried using Pandas’ ‘to_csv()’ function and running the CLI scaffold command on that CSV. The problem is that over 100 out-of-the-box “expect_column_values_to_be_unique” expectations are failing because of the following issue:

In the Pandas dataframe I have many float features with the value “0.0”, but Pandas saves these features as “0” in the CSV, and GE picks up that difference as a failure. I really don’t know whether GE should consider 0.0 equal to 0; I’ll leave that to you.
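One way to illustrate (and sidestep) the dtype drift on a CSV round trip: when the CSV text contains bare “0”/“1”, read_csv infers an integer column, losing the float-ness of the original. Forcing the dtype on read preserves it. This is just a minimal sketch using io.StringIO in place of a real file:

```python
import io

import pandas as pd

# A CSV whose float column was written without decimal points comes
# back as int64 - the original float64 dtype is lost on the round trip:
csv_text = "x\n0\n1\n"
inferred = pd.read_csv(io.StringIO(csv_text))
print(inferred["x"].dtype)  # int64

# Forcing the dtype on read restores the intended float type:
forced = pd.read_csv(io.StringIO(csv_text), dtype={"x": "float64"})
print(forced["x"].dtype)  # float64
```

Passing the dataframe directly in batch_kwargs (as in the snippet later in the thread) avoids the file round trip entirely, which is the cleaner fix.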

To avoid this whole “parse to file” problem, I was hoping to call a Python function that could do the same as the CLI scaffold command.

Another use case I have: this is only one intermediate step in the data transformation pipeline, and I would like to programmatically “scaffold” every intermediate dataframe so that GE can help us identify not only what changed but also at which step the data is being altered incorrectly.
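That per-step idea can be sketched as a small driver loop. Everything here is hypothetical naming: scaffold_each_step and the scaffold_step callback are not GE APIs — in practice the callback is where the profiler code from the reply below would go (BasicSuiteBuilderProfiler().profile(...) plus context.save_expectation_suite(...)). The callback is stubbed so the sketch runs without a GE project on disk:

```python
import pandas as pd


def scaffold_each_step(steps, scaffold_step):
    """Run one scaffold per named pipeline step (hypothetical helper).

    `steps` is a list of (step_name, dataframe) pairs; `scaffold_step`
    is a callback that profiles one dataframe into one suite.
    """
    suite_names = []
    for step_name, df in steps:
        suite_name = f"pipeline.{step_name}"  # one suite per intermediate step
        scaffold_step(df, suite_name)
        suite_names.append(suite_name)
    return suite_names


# Stub callback: just records which columns each step's suite would cover.
scaffolded = {}
steps = [
    ("raw", pd.DataFrame({"x": [1.0, 2.0]})),
    ("cleaned", pd.DataFrame({"x": [1.0, 2.0], "y": [0.0, 1.0]})),
]
names = scaffold_each_step(
    steps, lambda df, name: scaffolded.setdefault(name, list(df.columns))
)
print(names)  # ['pipeline.raw', 'pipeline.cleaned']
```

Comparing the resulting per-step suites is then what lets you localize which step altered the data.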

I see what you mean. See the code snippet below - it’s very close to what you’d get in the Jupyter notebook after running great_expectations suite scaffold SUITE_NAME (the command basically generates a Jupyter notebook, which yields the scaffolded suite after you run all the cells). The major difference relevant to your use case is passing a pandas dataframe directly into batch_kwargs under a “dataset” key (vs. a filepath). Let me know if that’s what you were looking for or if you have any other questions.

from datetime import datetime
import great_expectations as ge
from great_expectations.profile import BasicSuiteBuilderProfiler
from great_expectations.data_context.types.resource_identifiers import (
    ValidationResultIdentifier,
)
import pandas as pd

context = ge.data_context.DataContext(
    context_root_dir='/Users/roblim/superconductive/test_ge_project_dir/great_expectations'  # You'll want to enter
    # the path to your GE project dir here
)

expectation_suite_name = "test_suite"  # You'll want to enter your own suite name here
suite = context.create_expectation_suite(
    expectation_suite_name, overwrite_existing=True
)

pandas_df = pd.DataFrame({"x": [1, 2, 3, 4, 5], "y": [1, 2, 3, 4, None]})

batch_kwargs = {
    "dataset": pandas_df,  # You can pass in your big Pandas dataframe directly here
    "datasource": "files_datasource",  # You'll want to enter the appropriate datasource name here
}
batch = context.get_batch(batch_kwargs, suite)

# The scaffold config is optional - if not provided, the profiler will just look at all columns and expectation types
included_columns = batch.get_table_columns()
scaffold_config = {
    # "included_columns": included_columns,
    # "excluded_columns": [],
    # "included_expectations": [],  # These are expectation types, e.g. "expect_column_values_to_be_of_type"
    # "excluded_expectations": [],
}
suite, evr = BasicSuiteBuilderProfiler().profile(batch, profiler_configuration=scaffold_config)

# This persists the expectation suite to the configured expectations store
context.save_expectation_suite(suite, expectation_suite_name)

# The section below is optional, if you'd like to view the scaffolded suite and sample validation results in Data Docs
# If you'd just like to see the expectation suite (and not sample validation results), you can just call
# context.build_data_docs(resource_identifiers=[expectation_suite_identifier])
# context.open_data_docs(expectation_suite_identifier)

# Let's make a simple sortable timestamp. Note this could come from your pipeline runner.
run_id = datetime.utcnow().strftime("%Y%m%dT%H%M%S.%fZ")

results = context.run_validation_operator("action_list_operator", assets_to_validate=[batch], run_id=run_id)
expectation_suite_identifier = list(results["details"].keys())[0]
validation_result_identifier = ValidationResultIdentifier(
    expectation_suite_identifier=expectation_suite_identifier,
    run_id=run_id,
    batch_identifier=batch.batch_kwargs.to_id(),
)
context.build_data_docs()
context.open_data_docs(validation_result_identifier)

What’s the actual purpose of entering a datasource name here?

batch_kwargs = {
    "dataset": pandas_df,  # You can pass in your big Pandas dataframe directly here
    "datasource": "invoices",  # You'll want to enter the appropriate datasource name here
}

Reading get_batch(), I see it is necessary, but can’t we simply go without defining it in the yml file?