Validate different dataframes with respective expectation suites using checkpoint

I am using Databricks with Great Expectations version 1.0.1.
I want to use a single checkpoint to validate two different data assets, each with its own expectation suite. I have the following:

import great_expectations as ge

# df and df2 are existing Spark dataframes, and suite and suite2 are
# existing expectation suites; their definitions are omitted here.
context = ge.get_context()

batch_parameters = {"dataframe": df}
batch_parameters2 = {"dataframe": df2}

data_source = context.data_sources.add_spark(name="source1")
data_source2 = context.data_sources.add_spark(name="source2")

data_asset = data_source.add_dataframe_asset(name="asset1")
data_asset2 = data_source2.add_dataframe_asset(name="asset2")

batch_definition_name = "my_batch_definition"
batch_definition_name2 = "my_batch_definition2"

batch_definition = data_asset.add_batch_definition_whole_dataframe(
    name=batch_definition_name
)
batch_definition2 = data_asset2.add_batch_definition_whole_dataframe(
    name=batch_definition_name2
)

# Note: these batch requests are built but their return values are
# discarded; checkpoint.run() below never sees them.
batch_definition.build_batch_request(batch_parameters)
batch_definition2.build_batch_request(batch_parameters2)

definition_name = "my_validation_definition"
validation_definition = ge.ValidationDefinition(
    data=batch_definition, suite=suite, name=definition_name
)

definition_name2 = "my_validation_definition2"
validation_definition2 = ge.ValidationDefinition(
    data=batch_definition2, suite=suite2, name=definition_name2
)

context.validation_definitions.add(validation_definition)
context.validation_definitions.add(validation_definition2)

validation_definitions = [
    validation_definition,
    validation_definition2
]

action_list = []

checkpoint_name = "my_checkpoint"
checkpoint = ge.Checkpoint(
    name=checkpoint_name,
    validation_definitions=validation_definitions,
    actions=action_list,
    result_format={"result_format": "BASIC", "unexpected_index_column_names": ["hash_col"]},
)

validation_results = checkpoint.run()

During checkpoint.run() I am getting the following error:
BuildBatchRequestError: Bad input to build_batch_request: options must contain exactly 1 key, 'dataframe'.

I get the same error. Did you find out what was wrong?

Try passing your dataframe to checkpoint.run() like this:

validation_results = checkpoint.run(batch_parameters=batch_parameters)

Do you get the error when using this?

When using dataframes as the input data, the checkpoint run method always requires you to pass the dataframe to be validated via batch_parameters; without it, the batch request is built with empty options, which is what triggers the BuildBatchRequestError above. The dataframe is then passed on as the batch that is validated. Unfortunately, the Checkpoint can't save the dataframe, nor does it offer a way to retrieve it later.
My guess is that it also can't validate both datasets: it will just pass the same dataframe to both validation definitions.
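
If that is the case, one workaround (a sketch using the objects defined in your snippet, not an officially documented pattern) is to skip the shared checkpoint and run each validation definition separately, each with its own dataframe; you could likewise create one checkpoint per validation definition:

# Sketch: run each validation definition with its own batch parameters.
results = validation_definition.run(batch_parameters={"dataframe": df})
results2 = validation_definition2.run(batch_parameters={"dataframe": df2})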

If possible, consider using Databricks SQL instead. That allows you to perform the exact workflow you have here.

I wrote a sample code for how to do that here: GX 1.0 and Databricks - #3 by ToivoMattila
Also GX documentation for that here: Connect to SQL data | Great Expectations
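
For reference, here is a minimal sketch of that approach, assuming a Databricks SQL warehouse and an existing table; the connection string fields and all names are placeholders:

connection_string = (
    "databricks://token:<token>@<host>:443"
    "?http_path=<http_path>&catalog=<catalog>&schema=<schema>"
)
sql_source = context.data_sources.add_databricks_sql(
    name="databricks_sql", connection_string=connection_string
)
sql_asset = sql_source.add_table_asset(name="asset1_table", table_name="<table_name>")
sql_batch_definition = sql_asset.add_batch_definition_whole_table(name="whole_table")

Because a SQL-backed batch definition can fetch its own data, checkpoint.run() then needs no batch_parameters and can validate both assets in a single run.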


If I run either checkpoint.run() or validation_definition.run() without any parameter I get this error:
BuildBatchRequestError: Bad input to build_batch_request: options must contain exactly 1 key, 'dataframe'.

If I run them with this parameter batch_parameters = {"dataframe": df}, I get this error: TypeError: ValidationDefinition.run() takes 1 positional argument but 2 were given, so I wonder what the correct format is.

I was unable to replicate this.
Could you post sample code that replicates this issue?

Here's the code that I used, which ran without errors for me:

import pandas as pd

import great_expectations as gx

df = pd.read_csv(
    "https://raw.githubusercontent.com/great-expectations/gx_tutorials/main/data/yellow_tripdata_sample_2019-01.csv"
)

context = gx.get_context()

data_source = context.data_sources.add_pandas("pandas")
data_asset = data_source.add_dataframe_asset(name="pd dataframe asset")

batch_definition = data_asset.add_batch_definition_whole_dataframe("batch definition")
batch = batch_definition.get_batch(batch_parameters={"dataframe": df})

expectation = gx.expectations.ExpectColumnValuesToBeBetween(
    column="passenger_count", min_value=1, max_value=6
)

validation_result = batch.validate(expectation)

suite = context.suites.add(
    gx.core.expectation_suite.ExpectationSuite(name="expectations")
)
suite.add_expectation(
    gx.expectations.ExpectColumnValuesToBeBetween(
        column="passenger_count", min_value=1, max_value=6
    )
)
suite.add_expectation(
    gx.expectations.ExpectColumnValuesToBeBetween(column="fare_amount", min_value=0)
)

validation_definition = context.validation_definitions.add(
    gx.core.validation_definition.ValidationDefinition(
        name="validation definition",
        data=batch_definition,
        suite=suite,
    )
)

validation_definition.run(
    batch_parameters={"dataframe": df}
)
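
Note that batch_parameters is accepted only as a keyword argument, which is why passing the dataframe positionally, as in validation_definition.run({"dataframe": df}), raises the TypeError: ValidationDefinition.run() takes 1 positional argument but 2 were given that you mentioned.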

Do you get an error when running this code?