How to specify expect_column_pair_values_to_be_in_set value_pairs_set input arg via json

I’m running Python 3.9.18 with great_expectations version 0.18.8.

I’m trying to use https://greatexpectations.io/expectations/expect_column_pair_values_to_be_in_set

How can the value_pairs_set inputs be specified via a json expectations file?

The sample code uses

[('apple','red'),('apple','green'),('apple','yellow'), ('banana','yellow')]

When I serialize that to JSON via json.dumps I get

[["apple", "red"], ["apple", "green"], ["apple", "yellow"], ["banana", "yellow"]]

which I plug into the JSON expectations file as follows

{
  "expectation_type": "expect_column_pair_values_to_be_in_set",
  "kwargs": {
    "column_A": "mycolA",
    "column_B": "mycolB",
    "value_pairs_set": [["apple", "red"], ["apple", "green"], ["apple", "yellow"], ["banana", "yellow"]]
  }
},
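For reference, here is a minimal sketch of the tuple-to-list conversion described above, using only the standard library. JSON has no tuple type, so json.dumps writes each pair as a two-element array; the variable names are just for illustration, and the conversion back to tuples is only needed if you want to compare against the original Python literal.

```python
import json

# The value_pairs_set from the sample code, as Python tuples.
value_pairs_set = [
    ("apple", "red"),
    ("apple", "green"),
    ("apple", "yellow"),
    ("banana", "yellow"),
]

# json.dumps serializes tuples as JSON arrays (lists).
serialized = json.dumps(value_pairs_set)
print(serialized)

# Loading the JSON back yields lists of lists; convert to tuples if needed.
restored = [tuple(pair) for pair in json.loads(serialized)]
```

So the list-of-lists form in the JSON file is the expected representation of the same pairs.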

However, when I run the validation I get an error in ge_validations_store:

    "exception_message": "Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match).",
    "exception_traceback": "Traceback (most recent call last):\n  File \"/opt/anaconda3/envs/data-eng-tools-py/lib/python3.9/site-packages/great_expectations/execution_engine/execution_engine.py\", line 548, in _process_direct_and_bundled_metric_computation_configurations\n    ] = metric_computation_configuration.metric_fn(  # type: ignore[misc] # F not callable\n  File \"/opt/anaconda3/envs/data-eng-tools-py/lib/python3.9/site-packages/great_expectations/expectations/metrics/map_metric_provider/map_condition_auxilliary_methods.py\", line 128, in _pandas_map_condition_index\n    domain_records_df = domain_records_df[boolean_mapped_unexpected_values]\n  File \"/opt/anaconda3/envs/data-eng-tools-py/lib/python3.9/site-packages/pandas/core/frame.py\", line 3884, in __getitem__\n    return self._getitem_bool_array(key)\n  File \"/opt/anaconda3/envs/data-eng-tools-py/lib/python3.9/site-packages/pandas/core/frame.py\", line 3940, in _getitem_bool_array\n    key = check_bool_indexer(self.index, key)\n  File \"/opt/anaconda3/envs/data-eng-tools-py/lib/python3.9/site-packages/pandas/core/indexing.py\", line 2575, in check_bool_indexer\n    raise IndexingError(\npandas.errors.IndexingError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match).\n\nThe above exception was the direct cause of the following exception:\n\nTraceback (most recent call last):\n  File \"/opt/anaconda3/envs/data-eng-tools-py/lib/python3.9/site-packages/great_expectations/validator/validation_graph.py\", line 285, in _resolve\n    self._execution_engine.resolve_metrics(\n  File \"/opt/anaconda3/envs/data-eng-tools-py/lib/python3.9/site-packages/great_expectations/execution_engine/execution_engine.py\", line 283, in resolve_metrics\n    return self._process_direct_and_bundled_metric_computation_configurations(\n  File \"/opt/anaconda3/envs/data-eng-tools-py/lib/python3.9/site-packages/great_expectations/execution_engine/execution_engine.py\", 
line 552, in _process_direct_and_bundled_metric_computation_configurations\n    raise gx_exceptions.MetricResolutionError(\ngreat_expectations.exceptions.exceptions.MetricResolutionError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match).\n",

Any help much appreciated!

Hi @peetee, thanks for reaching out! Welcome to the GX community.

I’m not seeing a problem with your Expectation configuration - the error suggests to me that the issue might be related to the dataframe that you’re trying to validate. Could you provide your complete code (including creation of the sample dataframe) that generates this error?


Hi @rachel.house, thanks for replying.

Strange! I stripped it all back to a bare-minimum code setup with just that and one other expectation (an expect_table_row_count_to_be_between) and it’s now working!?

I’ll try again in the full code set-up.

A minor question: if using expect_column_pair_values_to_be_in_set, how do I configure it to output a metric? For example,

  - name: store_metrics
    action:
      class_name: StoreMetricsAction
      target_store_name: metric_store # This should match the name of the store configured above
      requested_metrics:
        my_gx_suite:
          - expect_table_row_count_to_be_between.success
          - column:
              mycolA:
                - expect_column_values_to_not_be_null.success

does not seem to work?

Hi @peetee, you should be able to capture the result of expect_column_pair_values_to_be_in_set as a metric if you configure your Checkpoint action_list as shown below:

  - name: store_metrics
    action:
      class_name: StoreMetricsAction
      target_store_name: <name-of-metric-store>
      requested_metrics:
        <name-of-expectation-suite>:
          - expect_table_row_count_to_be_between.success
          - expect_column_pair_values_to_be_in_set.success

Hi @rachel.house,

Note I have pandas installed as

Name Version Build Channel

pandas 2.1.4 py39h5d65943_0 conda-forge

I’m seeing some more strange behaviour. If I run this expectation on the following data

id,mycolA,mycolB,valid
1,apple,red,pass
2,apple,green,pass
3,apple,yellow,pass
4,peach,peach,fail
5,banana,yellow,pass
6,banana,black,fail
7,fail

it runs OK and reports the invalid rows, but if the row with both columns missing appears before the last row, e.g.

id,mycolA,mycolB,valid
1,apple,red,pass
2,apple,green,pass
3,apple,yellow,pass
4,peach,peach,fail
5,banana,yellow,pass
6,banana,black,fail
7,fail
8,melon,melon,fail

an exception is raised

        "exception_info": {
          "exception_traceback": "Traceback (most recent call last):\n  File \"/opt/anaconda3/envs/my-repo/lib/python3.9/site-packages/great_expectations/execution_engine/execution_engine.py\", line 548, in _process_direct_and_bundled_metric_computation_configurations\n    ] = metric_computation_configuration.metric_fn(  # type: ignore[misc] # F not callable\n  File \"/opt/anaconda3/envs/my-repo/lib/python3.9/site-packages/great_expectations/expectations/metrics/map_metric_provider/map_condition_auxilliary_methods.py\", line 201, in _pandas_map_condition_query\n    domain_values_df_filtered = domain_records_df[boolean_mapped_unexpected_values]\n  File \"/opt/anaconda3/envs/my-repo/lib/python3.9/site-packages/pandas/core/frame.py\", line 3884, in __getitem__\n    return self._getitem_bool_array(key)\n  File \"/opt/anaconda3/envs/my-repo/lib/python3.9/site-packages/pandas/core/frame.py\", line 3940, in _getitem_bool_array\n    key = check_bool_indexer(self.index, key)\n  File \"/opt/anaconda3/envs/my-repo/lib/python3.9/site-packages/pandas/core/indexing.py\", line 2575, in check_bool_indexer\n    raise IndexingError(\npandas.errors.IndexingError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match).\n\nThe above exception was the direct cause of the following exception:\n\nTraceback (most recent call last):\n  File \"/opt/anaconda3/envs/my-repo/lib/python3.9/site-packages/great_expectations/validator/validation_graph.py\", line 285, in _resolve\n    self._execution_engine.resolve_metrics(\n  File \"/opt/anaconda3/envs/my-repo/lib/python3.9/site-packages/great_expectations/execution_engine/execution_engine.py\", line 283, in resolve_metrics\n    return self._process_direct_and_bundled_metric_computation_configurations(\n  File \"/opt/anaconda3/envs/my-repo/lib/python3.9/site-packages/great_expectations/execution_engine/execution_engine.py\", line 552, in 
_process_direct_and_bundled_metric_computation_configurations\n    raise gx_exceptions.MetricResolutionError(\ngreat_expectations.exceptions.exceptions.MetricResolutionError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match).\n",
          "exception_message": "Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match).",
          "raised_exception": true
        }

Not sure if this is a bug caused by the version of Pandas or by how this GX expectation is implemented.
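To show what the message itself means, here is a small pandas-only sketch that raises the same IndexingError. It is not a diagnosis of the GX behaviour: the assumption that a mask is built from a copy of the frame with null rows dropped is purely illustrative. It only demonstrates that pandas raises this error when a boolean Series used as a row filter is missing index labels that the dataframe has.

```python
import pandas as pd
from pandas.errors import IndexingError

# Three rows; the middle row has both paired columns missing.
df = pd.DataFrame(
    {"mycolA": ["apple", None, "melon"], "mycolB": ["red", None, "melon"]}
)

# Illustrative assumption: the mask is computed on a copy with null rows
# dropped, so its index is [0, 2] while df's index is [0, 1, 2].
mask = df.dropna()["mycolA"] == "melon"

try:
    df[mask]  # pandas cannot align the mask with every row of df
    message = None
except IndexingError as exc:
    message = str(exc)

print(message)
```

pandas may also emit a UserWarning about reindexing the boolean Series before raising the error.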

Hey @peetee, thanks for the additional details - I’m able to reproduce the same error using your sample data.

I’m not sure if this is a Pandas or GX bug, we’ll need Engineering to take a look. At your convenience, could you create a GX GitHub issue for this behavior?

As a workaround, I’d recommend using .fillna() to replace the NaNs in your dataframe with a suitable non-null value. I used the empty string (.fillna("")) and the Expectation ran without error on the sample data.

Thanks. How do I use .fillna() via an expectation or checkpoint file? I can’t see any docs on how to do this.

Note: I’ve raised an issue, “expect_column_pair_values_to_be_in_set throws exception when row has both column values to be paired missing” (Issue #9577, great-expectations/great_expectations on GitHub).

@peetee Ah, apologies, that wasn’t clear. I meant to use .fillna() on your source Pandas dataframe before running the Expectation, so that the error isn’t triggered by NaN values in your problem row(s). You wouldn’t use .fillna() in any of the GX code. For example, if we took this approach on the sample data above, you’d use:

import pandas as pd

data = [
    { "idx" : 1, "fruit" : "apple", "color" : "red" },
    { "idx" : 2, "fruit" : "apple", "color" : "green" },
    { "idx" : 3, "fruit" : "apple", "color" : "yellow" },
    { "idx" : 4, "fruit" : "peach", "color" : "peach" },
    { "idx" : 5, "fruit" : "banana", "color" : "yellow" },
    { "idx" : 6, "fruit" : "banana", "color" : "black" },
    { "idx" : 7},  # both fruit and color are missing, so they become NaN
    { "idx" : 8, "fruit" : "melon", "color" : "melon" },
]

# Replace the NaNs with empty strings before handing the dataframe to GX.
df = pd.DataFrame(data=data).fillna("")
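A variation on the workaround above, as a sketch: if the dataframe has other columns where NaN should be preserved, .fillna("") can be limited to the two paired columns. The plain-pandas pair check at the end is not GX code; it is only there to show which rows would fall outside the allowed set once the NaNs are replaced. The names pair_columns, allowed_pairs and unexpected_ids are illustrative.

```python
import pandas as pd

data = [
    {"idx": 1, "fruit": "apple", "color": "red"},
    {"idx": 2, "fruit": "apple", "color": "green"},
    {"idx": 3, "fruit": "apple", "color": "yellow"},
    {"idx": 4, "fruit": "peach", "color": "peach"},
    {"idx": 5, "fruit": "banana", "color": "yellow"},
    {"idx": 6, "fruit": "banana", "color": "black"},
    {"idx": 7},
    {"idx": 8, "fruit": "melon", "color": "melon"},
]
df = pd.DataFrame(data)

# Fill NaNs only in the columns that the pair expectation looks at.
pair_columns = ["fruit", "color"]
df[pair_columns] = df[pair_columns].fillna("")

# Plain-pandas check (not GX): which rows hold a pair outside the allowed set?
allowed_pairs = {
    ("apple", "red"), ("apple", "green"),
    ("apple", "yellow"), ("banana", "yellow"),
}
in_set = pd.Series(
    [pair in allowed_pairs for pair in zip(df["fruit"], df["color"])],
    index=df.index,
)
unexpected_ids = df.loc[~in_set, "idx"].tolist()
print(unexpected_ids)
```

With the empty-string fill, the row that had both values missing is treated as an ordinary pair that is not in the set, rather than as a null row.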