How to specify expect_column_pair_values_to_be_in_set value_pairs_set input arg via json

peetee · February 14, 2024, 2:23pm

I’m running Python 3.9.18 with great_expectations version 0.18.8.

I’m trying to use https://greatexpectations.io/expectations/expect_column_pair_values_to_be_in_set

How can the value_pairs_set inputs be specified via a json expectations file?

The sample code of

[('apple','red'),('apple','green'),('apple','yellow'), ('banana','yellow')]

when I save it as JSON via json.dumps, I get

[["apple", "red"], ["apple", "green"], ["apple", "yellow"], ["banana", "yellow"]]

and plug that into the JSON as follows

{
  "expectation_type": "expect_column_pair_values_to_be_in_set",
  "kwargs": {
    "column_A": "mycolA",
    "column_B": "mycolB",
    "value_pairs_set": [["apple", "red"], ["apple", "green"], ["apple", "yellow"], ["banana", "yellow"]]
  }
},

However, when I run the validation I get an error in ge_validations_store:

    "exception_message": "Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match).",
    "exception_traceback": "Traceback (most recent call last):\n  File \"/opt/anaconda3/envs/data-eng-tools-py/lib/python3.9/site-packages/great_expectations/execution_engine/execution_engine.py\", line 548, in _process_direct_and_bundled_metric_computation_configurations\n    ] = metric_computation_configuration.metric_fn(  # type: ignore[misc] # F not callable\n  File \"/opt/anaconda3/envs/data-eng-tools-py/lib/python3.9/site-packages/great_expectations/expectations/metrics/map_metric_provider/map_condition_auxilliary_methods.py\", line 128, in _pandas_map_condition_index\n    domain_records_df = domain_records_df[boolean_mapped_unexpected_values]\n  File \"/opt/anaconda3/envs/data-eng-tools-py/lib/python3.9/site-packages/pandas/core/frame.py\", line 3884, in __getitem__\n    return self._getitem_bool_array(key)\n  File \"/opt/anaconda3/envs/data-eng-tools-py/lib/python3.9/site-packages/pandas/core/frame.py\", line 3940, in _getitem_bool_array\n    key = check_bool_indexer(self.index, key)\n  File \"/opt/anaconda3/envs/data-eng-tools-py/lib/python3.9/site-packages/pandas/core/indexing.py\", line 2575, in check_bool_indexer\n    raise IndexingError(\npandas.errors.IndexingError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match).\n\nThe above exception was the direct cause of the following exception:\n\nTraceback (most recent call last):\n  File \"/opt/anaconda3/envs/data-eng-tools-py/lib/python3.9/site-packages/great_expectations/validator/validation_graph.py\", line 285, in _resolve\n    self._execution_engine.resolve_metrics(\n  File \"/opt/anaconda3/envs/data-eng-tools-py/lib/python3.9/site-packages/great_expectations/execution_engine/execution_engine.py\", line 283, in resolve_metrics\n    return self._process_direct_and_bundled_metric_computation_configurations(\n  File \"/opt/anaconda3/envs/data-eng-tools-py/lib/python3.9/site-packages/great_expectations/execution_engine/execution_engine.py\", 
line 552, in _process_direct_and_bundled_metric_computation_configurations\n    raise gx_exceptions.MetricResolutionError(\ngreat_expectations.exceptions.exceptions.MetricResolutionError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match).\n",

Any help much appreciated!

rachel.house · February 26, 2024, 9:45pm

Hi @peetee, thanks for reaching out! Welcome to the GX community.

I’m not seeing a problem with your Expectation configuration - the error suggests to me that the issue might be related to the dataframe that you’re trying to validate. Could you provide your complete code (including creation of the sample dataframe) that generates this error?
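As a quick sanity check on the configuration side, here is a minimal sketch (plain Python, no GX involved) of what happens to the sample pairs when they pass through JSON. JSON has no tuple type, so the list-of-lists form is the expected serialization; the variable names are only for illustration.

```python
import json

# The sample value_pairs_set from the question, as Python tuples.
pairs = [("apple", "red"), ("apple", "green"), ("apple", "yellow"), ("banana", "yellow")]

# JSON has no tuple type, so json.dumps writes each tuple as an array.
encoded = json.dumps(pairs)
print(encoded)

# Loading the JSON back yields lists; convert to tuples if tuples are needed.
decoded = [tuple(pair) for pair in json.loads(encoded)]
print(decoded == pairs)
```

The round trip loses nothing except the tuple/list distinction, which is consistent with the JSON form of value_pairs_set shown in the question.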

peetee · February 29, 2024, 7:13pm

peetee:

{
  "expectation_type": "expect_column_pair_values_to_be_in_set",
  "kwargs": {
    "column_A": "mycolA",
    "column_B": "mycolB",
    "value_pairs_set": [["apple", "red"], ["apple", "green"], ["apple", "yellow"], ["banana", "yellow"]]
  }
}

hi @rachel.house thanks for replying.

Strange! I stripped it all back to a bare-minimum code setup with just that and one other expectation (an expect_table_row_count_to_be_between) and it’s now working!?

I’ll try again in the full code set-up.

A minor question: when using expect_column_pair_values_to_be_in_set, how do I configure it to be output as a metric? For example, this

name: store_metrics
action:
  class_name: StoreMetricsAction
  target_store_name: metric_store # This should match the name of the store configured above
  requested_metrics:
    my_gx_suite:
      - expect_table_row_count_to_be_between.success
      - column:
          mycolA:
            - expect_column_values_to_not_be_null.success

does not seem to work?

rachel.house · March 1, 2024, 9:44pm

Hi @peetee, you should be able to capture the result of expect_column_pair_values_to_be_in_set as a metric if you configure your Checkpoint action_list as shown below:

  - name: store_metrics
    action:
      class_name: StoreMetricsAction
      target_store_name: <name-of-metric-store>
      requested_metrics:
        <name-of-expectation-suite>:
          - expect_table_row_count_to_be_between.success
          - expect_column_pair_values_to_be_in_set.success

peetee · March 5, 2024, 1:00pm

hi @rachel.house

Note I have pandas installed as

Name Version Build Channel

pandas 2.1.4 py39h5d65943_0 conda-forge

I’m seeing some more strange behaviour. If I run this expectation on the following data

id,mycolA,mycolB,valid
1,apple,red,pass
2,apple,green,pass
3,apple,yellow,pass
4,peach,peach,fail
5,banana,yellow,pass
6,banana,black,fail
7,,,fail

it runs OK and reports the invalid rows, but if the row with both cols missing appears before the last row, e.g.

id,mycolA,mycolB,valid
1,apple,red,pass
2,apple,green,pass
3,apple,yellow,pass
4,peach,peach,fail
5,banana,yellow,pass
6,banana,black,fail
7,,,fail
8,melon,melon,fail

an exception is raised

        "exception_info": {
          "exception_traceback": "Traceback (most recent call last):\n  File \"/opt/anaconda3/envs/my-repo/lib/python3.9/site-packages/great_expectations/execution_engine/execution_engine.py\", line 548, in _process_direct_and_bundled_metric_computation_configurations\n    ] = metric_computation_configuration.metric_fn(  # type: ignore[misc] # F not callable\n  File \"/opt/anaconda3/envs/my-repo/lib/python3.9/site-packages/great_expectations/expectations/metrics/map_metric_provider/map_condition_auxilliary_methods.py\", line 201, in _pandas_map_condition_query\n    domain_values_df_filtered = domain_records_df[boolean_mapped_unexpected_values]\n  File \"/opt/anaconda3/envs/my-repo/lib/python3.9/site-packages/pandas/core/frame.py\", line 3884, in __getitem__\n    return self._getitem_bool_array(key)\n  File \"/opt/anaconda3/envs/my-repo/lib/python3.9/site-packages/pandas/core/frame.py\", line 3940, in _getitem_bool_array\n    key = check_bool_indexer(self.index, key)\n  File \"/opt/anaconda3/envs/my-repo/lib/python3.9/site-packages/pandas/core/indexing.py\", line 2575, in check_bool_indexer\n    raise IndexingError(\npandas.errors.IndexingError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match).\n\nThe above exception was the direct cause of the following exception:\n\nTraceback (most recent call last):\n  File \"/opt/anaconda3/envs/my-repo/lib/python3.9/site-packages/great_expectations/validator/validation_graph.py\", line 285, in _resolve\n    self._execution_engine.resolve_metrics(\n  File \"/opt/anaconda3/envs/my-repo/lib/python3.9/site-packages/great_expectations/execution_engine/execution_engine.py\", line 283, in resolve_metrics\n    return self._process_direct_and_bundled_metric_computation_configurations(\n  File \"/opt/anaconda3/envs/my-repo/lib/python3.9/site-packages/great_expectations/execution_engine/execution_engine.py\", line 552, in 
_process_direct_and_bundled_metric_computation_configurations\n    raise gx_exceptions.MetricResolutionError(\ngreat_expectations.exceptions.exceptions.MetricResolutionError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match).\n",
          "exception_message": "Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match).",
          "raised_exception": true
        }

Not sure if this is a bug caused by the version of pandas or by how this GX expectation is implemented.
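For reference, the pandas message on its own can be triggered in isolation. The sketch below only illustrates what the error means at the pandas level (a boolean mask whose index does not cover the dataframe being filtered). It is not a reproduction of GX internals; the tiny dataframe and the way the mask is built are assumptions made for the illustration.

```python
import pandas as pd
from pandas.errors import IndexingError

# Small frame with one row where both paired columns are missing.
df = pd.DataFrame(
    {
        "mycolA": ["apple", None, "melon"],
        "mycolB": ["red", None, "melon"],
    }
)

# Build a boolean mask on a copy whose index no longer matches df:
# the all-null row is dropped and the index renumbered to 0..1.
compact = df.dropna(how="all").reset_index(drop=True)
mask = compact["mycolA"] == "melon"  # mask index is [0, 1]

try:
    # df has index [0, 1, 2]; label 2 has no entry in the mask.
    df[mask]
    raised = False
    message = ""
except IndexingError as exc:
    raised = True
    message = str(exc)

print(raised, message)
```

Whether something like this happens inside the expectation when a both-missing row sits in the middle of the data is not something I can confirm from the traceback alone.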

rachel.house · March 5, 2024, 4:40pm

Hey @peetee, thanks for the additional details - I’m able to reproduce the same error using your sample data.

I’m not sure if this is a Pandas or GX bug; we’ll need Engineering to take a look. At your convenience, could you create a GX GitHub issue for this behavior?

As a workaround, I’d recommend using .fillna() to replace the NaNs in your dataframe with a suitable non-null value. I used the empty string (.fillna("")) and the Expectation ran without error on the sample data.

peetee · March 5, 2024, 6:27pm

Thanks. How do I use .fillna() via an expectation or checkpoint file? I can’t see any docs on how to do this.

Note: issue raised: expect_column_pair_values_to_be_in_set throws exception when row has both column values to be paired missing · Issue #9577 · great-expectations/great_expectations · GitHub

rachel.house · March 6, 2024, 9:40pm

@peetee Ah, apologies that wasn’t clear. I meant to use .fillna() on your source Pandas dataframe before running the Expectation, so that the error isn’t triggered by NaN values in your problem row(s). You wouldn’t use .fillna() on any of the GX code. For example, if we took this approach on the sample data above, you’d use:

import pandas as pd

data = [
    { "idx" : 1, "fruit" : "apple", "color" : "red" },
    { "idx" : 2, "fruit" : "apple", "color" : "green" },
    { "idx" : 3, "fruit" : "apple", "color" : "yellow" },
    { "idx" : 4, "fruit" : "peach", "color" : "peach" },
    { "idx" : 5, "fruit" : "banana", "color" : "yellow" },
    { "idx" : 6, "fruit" : "banana", "color" : "black" },
    { "idx" : 7},
    { "idx" : 8, "fruit" : "melon", "color" : "melon" },
]

df = pd.DataFrame(data=data).fillna("")
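If the data comes from a CSV file instead, the same workaround might look like the sketch below. It assumes the row with both values missing is written as `7,,,fail`, and it fills only the two paired columns rather than the whole dataframe; the `pair_cols` name is just for illustration.

```python
import io

import pandas as pd

# Sample data from the thread; the row with both paired values missing is
# assumed to be written as `7,,,fail`.
csv_text = """id,mycolA,mycolB,valid
1,apple,red,pass
2,apple,green,pass
3,apple,yellow,pass
4,peach,peach,fail
5,banana,yellow,pass
6,banana,black,fail
7,,,fail
8,melon,melon,fail
"""

df = pd.read_csv(io.StringIO(csv_text))

# Replace NaN only in the two columns the Expectation compares, leaving
# the other columns untouched.
pair_cols = ["mycolA", "mycolB"]
df[pair_cols] = df[pair_cols].fillna("")

print(df)
```

After this step the formerly empty row holds the pair ("", ""), which is not in value_pairs_set, so it would presumably be reported as an unexpected row rather than triggering the exception.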
