Is there a way to run the scaffold CLI command in a Python script like a function?

I haven’t found a doc or example on how to call CLI commands as functions in Python; I’m especially interested in the ‘scaffold’ command.


Thanks for the great question! Can you tell me a little bit more about the intended use case? When you say you’re especially interested in the ‘scaffold’ command, do you mean just the scaffolding of a GE project directory structure? Or by ‘scaffold’, do you mean the great_expectations init command?

When you run great_expectations init in the command line and complete the guided process, a number of things happen:

  • scaffolding of a Great Expectations project directory structure and creation of a great_expectations.yml config file
  • addition of a Datasource
  • creation of an example Expectation Suite and validation from above Datasource
  • creation of a Data Docs HTML site

It’s generally preferable to go through the CLI init process, but it is possible to achieve the above steps in a script by calling the same methods directly. If you give me a bit more info, I can put together an example script you can work off of.
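For reference, the first bullet above (project scaffolding) is what produces the great_expectations.yml at the project root. A heavily trimmed sketch of its shape follows — treat every key and name here as illustrative, since the exact contents vary by GE version:

```yaml
# great_expectations.yml (trimmed, illustrative sketch only)
config_version: 1
datasources: {}                 # populated when you add a Datasource
expectations_store_name: expectations_store
validations_store_name: validations_store
stores:
  expectations_store:
    class_name: ExpectationsStore
  validations_store:
    class_name: ValidationsStore
data_docs_sites:
  local_site:
    class_name: SiteBuilder
```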

Sure! By ‘scaffold’ I meant the great_expectations suite scaffold SUITE_NAME command.

I have a big in-memory Pandas dataframe (it’s an intermediate step of a big machine learning pipeline), and I would like to use the scaffold command to automatically generate expectations.

I’ve tried using Pandas’ ‘to_csv()’ function and running the CLI scaffold command on that CSV. The problem is that over 100 out-of-the-box “expect_column_values_to_be_unique” expectations are failing because of the following issue:

In the Pandas dataframe I have many float features with the value “0.0”, but Pandas saves these features as “0” in the CSV, and GE picks up that difference as a failure. I really don’t know whether GE should consider 0.0 equal to 0; I’ll leave that to you.
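One way to illustrate (and sidestep) the dtype drift on a CSV round trip: when the CSV text contains bare “0”/“1”, read_csv infers an integer column, losing the float-ness of the original. Forcing the dtype on read preserves it. This is just a minimal sketch using io.StringIO in place of a real file:

```python
import io

import pandas as pd

# A CSV whose float column was written without decimal points comes
# back as int64 - the original float64 dtype is lost on the round trip:
csv_text = "x\n0\n1\n"
inferred = pd.read_csv(io.StringIO(csv_text))
print(inferred["x"].dtype)  # int64

# Forcing the dtype on read restores the intended float type:
forced = pd.read_csv(io.StringIO(csv_text), dtype={"x": "float64"})
print(forced["x"].dtype)  # float64
```

Passing the dataframe directly in batch_kwargs (as in the snippet later in the thread) avoids the file round trip entirely, which is the cleaner fix.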

To avoid this whole “parse to file” problem, I was hoping to call a Python function that could do the same as the CLI scaffold command.

Another use case I have: this is only one intermediate step in the data transformation pipeline, and I would like to programmatically “scaffold” every intermediate dataframe so that GE can help us identify not only what changed but also at which step the data is being altered incorrectly.
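That per-step idea can be sketched as a small driver loop. Everything here is hypothetical naming: scaffold_each_step and the scaffold_step callback are not GE APIs — in practice the callback is where the profiler code from the reply below would go (BasicSuiteBuilderProfiler().profile(...) plus context.save_expectation_suite(...)). The callback is stubbed so the sketch runs without a GE project on disk:

```python
import pandas as pd


def scaffold_each_step(steps, scaffold_step):
    """Run one scaffold per named pipeline step (hypothetical helper).

    `steps` is a list of (step_name, dataframe) pairs; `scaffold_step`
    is a callback that profiles one dataframe into one suite.
    """
    suite_names = []
    for step_name, df in steps:
        suite_name = f"pipeline.{step_name}"  # one suite per intermediate step
        scaffold_step(df, suite_name)
        suite_names.append(suite_name)
    return suite_names


# Stub callback: just records which columns each step's suite would cover.
scaffolded = {}
steps = [
    ("raw", pd.DataFrame({"x": [1.0, 2.0]})),
    ("cleaned", pd.DataFrame({"x": [1.0, 2.0], "y": [0.0, 1.0]})),
]
names = scaffold_each_step(
    steps, lambda df, name: scaffolded.setdefault(name, list(df.columns))
)
print(names)  # ['pipeline.raw', 'pipeline.cleaned']
```

Comparing the resulting per-step suites is then what lets you localize which step altered the data.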

I see what you mean. See the code snippet below - it’s very close to what you’d get in the Jupyter notebook after running great_expectations suite scaffold SUITE_NAME (the command basically generates a Jupyter notebook, which yields the scaffolded suite after you run all the cells). The major difference relevant to your use case is passing a pandas dataframe directly into batch_kwargs under a “dataset” key (vs. a filepath). Let me know if that’s what you were looking for or if you have any other questions.

from datetime import datetime
import great_expectations as ge
from great_expectations.profile import BasicSuiteBuilderProfiler
from great_expectations.data_context.types.resource_identifiers import (
    ValidationResultIdentifier,
)
import pandas as pd

context = ge.data_context.DataContext(
    context_root_dir='/Users/roblim/superconductive/test_ge_project_dir/great_expectations'  # You'll want to enter
    # the path to your GE project dir here
)

expectation_suite_name = "test_suite"  # You'll want to enter your own suite name here
suite = context.create_expectation_suite(
    expectation_suite_name, overwrite_existing=True
)

pandas_df = pd.DataFrame({"x": [1, 2, 3, 4, 5], "y": [1, 2, 3, 4, None]})

batch_kwargs = {
    "dataset": pandas_df,  # You can pass in your big Pandas dataframe directly here
    "datasource": "files_datasource",  # You'll want to enter the appropriate datasource name here
}
batch = context.get_batch(batch_kwargs, suite)

# The scaffold config is optional - if not provided, the profiler will just look at all columns and expectation types
included_columns = batch.get_table_columns()
scaffold_config = {
    # "included_columns": included_columns,
    # "excluded_columns": [],
    # "included_expectations": [],  # These are expectation types, e.g. "expect_column_values_to_be_of_type"
    # "excluded_expectations": [],
}
suite, evr = BasicSuiteBuilderProfiler().profile(batch, profiler_configuration=scaffold_config)

# This persists the expectation suite to the configured expectations store
context.save_expectation_suite(suite, expectation_suite_name)

# The section below is optional, if you'd like to view the scaffolded suite and sample validation results in Data Docs
# If you'd just like to see the expectation suite (and not sample validation results), you can just call
# context.build_data_docs(resource_identifiers=[expectation_suite_identifier])
# context.open_data_docs(expectation_suite_identifier)

# Let's make a simple sortable timestamp. Note this could come from your pipeline runner.
run_id = datetime.utcnow().strftime("%Y%m%dT%H%M%S.%fZ")

results = context.run_validation_operator("action_list_operator", assets_to_validate=[batch], run_id=run_id)
expectation_suite_identifier = list(results["details"].keys())[0]
validation_result_identifier = ValidationResultIdentifier(
    expectation_suite_identifier=expectation_suite_identifier,
    run_id=run_id,
    batch_identifier=batch.batch_kwargs.to_id(),
)
context.build_data_docs()
context.open_data_docs(validation_result_identifier)

What’s the actual purpose of entering a datasource name here?

batch_kwargs = {
    "dataset": pandas_df,  # You can pass in your big Pandas dataframe directly here
    "datasource": "invoices",  # You'll want to enter the appropriate datasource name here
}

Reading get_batch(), I see it is necessary, but can’t we simply go without defining it in the yml file?