Is there a way to run the scaffold CLI command in a Python script like a function?

I haven’t found a doc or example on how to call CLI commands as functions in Python; I’m especially interested in the ‘scaffold’ command.


Thanks for the great question! Can you tell me a little bit more about the intended use case? When you say you’re especially interested in the ‘scaffold’ command, do you mean just the scaffolding of a GE project directory structure? Or by ‘scaffold’, do you mean the great_expectations init command?

When you run great_expectations init in the command line and complete the guided process, a number of things happen:

  • scaffolding of a Great Expectations project directory structure and creation of a great_expectations.yml config file
  • addition of a Datasource
  • creation of an example Expectation Suite and validation from above Datasource
  • creation of a Data Docs HTML site

It’s generally preferable to go through the CLI init process, but it is possible to achieve the above steps in a script by calling the same methods directly. If you give me a bit more info, I can put together an example script you can work off of.
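For reference, the first bullet above (project scaffolding) is what produces the great_expectations.yml at the project root. A heavily trimmed sketch of its shape follows — treat every key and name here as illustrative, since the exact contents vary by GE version:

```yaml
# great_expectations.yml (trimmed, illustrative sketch only)
config_version: 1
datasources: {}                 # populated when you add a Datasource
expectations_store_name: expectations_store
validations_store_name: validations_store
stores:
  expectations_store:
    class_name: ExpectationsStore
  validations_store:
    class_name: ValidationsStore
data_docs_sites:
  local_site:
    class_name: SiteBuilder
```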

Sure! By ‘scaffold’ I meant the great_expectations suite scaffold SUITE_NAME command.

I have a big in-memory Pandas dataframe (it’s an intermediate step of a big machine learning pipeline), and I would like to use the scaffold command to automatically generate expectations.

I’ve tried using Pandas’ ‘to_csv()’ function and running the CLI scaffold command on that CSV. The problem is that over 100 out-of-the-box “expect_column_values_to_be_unique” expectations are failing because of the following issue:

In the Pandas dataframe I have many float features with the value “0.0”, but Pandas saves these features as “0” in the CSV, and GE picks up that difference as a failure. I really don’t know whether GE should consider 0.0 equal to 0; I’ll leave that to you.
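One way to illustrate (and sidestep) the dtype drift on a CSV round trip: when the CSV text contains bare “0”/“1”, read_csv infers an integer column, losing the float-ness of the original. Forcing the dtype on read preserves it. This is just a minimal sketch using io.StringIO in place of a real file:

```python
import io

import pandas as pd

# A CSV whose float column was written without decimal points comes
# back as int64 - the original float64 dtype is lost on the round trip:
csv_text = "x\n0\n1\n"
inferred = pd.read_csv(io.StringIO(csv_text))
print(inferred["x"].dtype)  # int64

# Forcing the dtype on read restores the intended float type:
forced = pd.read_csv(io.StringIO(csv_text), dtype={"x": "float64"})
print(forced["x"].dtype)  # float64
```

Passing the dataframe directly in batch_kwargs (as in the snippet later in the thread) avoids the file round trip entirely, which is the cleaner fix.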

To avoid this whole “parse to file” problem, I was hoping to call a Python function that could do the same as the CLI scaffold command.

Another use case I have: this is only one intermediate step in the data transformation pipeline, and I would like to programmatically “scaffold” every intermediate dataframe so that GE can help us identify not only what changed but also at which step the data is being altered incorrectly.
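That per-step idea can be sketched as a small driver loop. Everything here is hypothetical naming: scaffold_each_step and the scaffold_step callback are not GE APIs — in practice the callback is where the profiler code from the reply below would go (BasicSuiteBuilderProfiler().profile(...) plus context.save_expectation_suite(...)). The callback is stubbed so the sketch runs without a GE project on disk:

```python
import pandas as pd


def scaffold_each_step(steps, scaffold_step):
    """Run one scaffold per named pipeline step (hypothetical helper).

    `steps` is a list of (step_name, dataframe) pairs; `scaffold_step`
    is a callback that profiles one dataframe into one suite.
    """
    suite_names = []
    for step_name, df in steps:
        suite_name = f"pipeline.{step_name}"  # one suite per intermediate step
        scaffold_step(df, suite_name)
        suite_names.append(suite_name)
    return suite_names


# Stub callback: just records which columns each step's suite would cover.
scaffolded = {}
steps = [
    ("raw", pd.DataFrame({"x": [1.0, 2.0]})),
    ("cleaned", pd.DataFrame({"x": [1.0, 2.0], "y": [0.0, 1.0]})),
]
names = scaffold_each_step(
    steps, lambda df, name: scaffolded.setdefault(name, list(df.columns))
)
print(names)  # ['pipeline.raw', 'pipeline.cleaned']
```

Comparing the resulting per-step suites is then what lets you localize which step altered the data.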

I see what you mean. See the code snippet below - it’s very close to what you’d get in the Jupyter notebook after running great_expectations suite scaffold SUITE_NAME (the command basically generates a Jupyter notebook, which yields the scaffolded suite after you run all the cells). The major difference relevant to your use case is passing a pandas dataframe directly into batch_kwargs under a “dataset” key (vs. a filepath). Let me know if that’s what you were looking for or if you have any other questions.

from datetime import datetime
import great_expectations as ge
from great_expectations.profile import BasicSuiteBuilderProfiler
from great_expectations.data_context.types.resource_identifiers import (
    ValidationResultIdentifier,
)
import pandas as pd

context = ge.data_context.DataContext(
    context_root_dir='/Users/roblim/superconductive/test_ge_project_dir/great_expectations'  # You'll want to enter
    # the path to your GE project dir here
)

expectation_suite_name = "test_suite"  # You'll want to enter your own suite name here
suite = context.create_expectation_suite(
    expectation_suite_name, overwrite_existing=True
)

pandas_df = pd.DataFrame({"x": [1, 2, 3, 4, 5], "y": [1, 2, 3, 4, None]})

batch_kwargs = {
    "dataset": pandas_df,  # You can pass in your big Pandas dataframe directly here
    "datasource": "files_datasource",  # You'll want to enter the appropriate datasource name here
}
batch = context.get_batch(batch_kwargs, suite)

# The scaffold config is optional - if not provided, the profiler will just look at all columns and expectation types
included_columns = batch.get_table_columns()
scaffold_config = {
    # "included_columns": included_columns,
    # "excluded_columns": [],
    # "included_expectations": [],  # These are expectation types, e.g. "expect_column_values_to_be_of_type"
    # "excluded_expectations": [],
}
suite, evr = BasicSuiteBuilderProfiler().profile(batch, profiler_configuration=scaffold_config)

# This persists the expectation suite to the configured expectations store
context.save_expectation_suite(suite, expectation_suite_name)

# The section below is optional, if you'd like to view the scaffolded suite and sample validation results in Data Docs
# If you'd just like to see the expectation suite (and not sample validation results), you can just call
# context.build_data_docs(resource_identifiers=[expectation_suite_identifier])
# context.open_data_docs(expectation_suite_identifier)

# Let's make a simple sortable timestamp. Note this could come from your pipeline runner.
run_id = datetime.utcnow().strftime("%Y%m%dT%H%M%S.%fZ")

results = context.run_validation_operator("action_list_operator", assets_to_validate=[batch], run_id=run_id)
expectation_suite_identifier = list(results["details"].keys())[0]
validation_result_identifier = ValidationResultIdentifier(
    expectation_suite_identifier=expectation_suite_identifier,
    run_id=run_id,
    batch_identifier=batch.batch_kwargs.to_id(),
)
context.build_data_docs()
context.open_data_docs(validation_result_identifier)

What’s the actual purpose of entering a datasource name here?

batch_kwargs = {
    "dataset": pandas_df,  # You can pass in your big Pandas dataframe directly here
    "datasource": "invoices",  # You'll want to enter the appropriate datasource name here
}

Reading get_batch(), I see it is necessary, but can’t we simply go without defining it in the yml file?