Check nulls in subset of columns

andi · March 2, 2020, 2:36pm

Hi all, I wonder if I can combine batch.expect_column_values_to_not_be_null() with a subset of columns and a threshold like in pandas.

abegong · March 2, 2020, 2:39pm

Are you looking for something like this?

batch.expect_column_values_to_not_be_null("my_1st_column", mostly=.9)
batch.expect_column_values_to_not_be_null("my_2nd_column", mostly=.8)
batch.expect_column_values_to_not_be_null("my_3rd_column")

This would assert that my_1st_column is null no more than 10% of the time, my_2nd_column is null no more than 20% of the time, and my_3rd_column is never null.
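To make the threshold concrete, here is a rough pandas-only sketch of what `mostly` is checking. It assumes the comparison is "fraction of non-null values >= mostly"; the helper name `passes_mostly` and the sample data are made up for illustration and are not part of the library.

```python
import pandas as pd


def passes_mostly(series: pd.Series, mostly: float = 1.0) -> bool:
    """Return True if the non-null fraction of `series` is at least `mostly`.

    Hypothetical helper: it mimics the idea behind the `mostly` argument,
    assuming a ">=" comparison on the fraction of non-null values.
    """
    non_null_fraction = series.notna().mean()
    return bool(non_null_fraction >= mostly)


# Illustrative data: 10 rows, 2 of them null, so 80% of values are non-null.
col = pd.Series([1, 2, None, 4, 5, None, 7, 8, 9, 10], name="my_1st_column")

ok_at_75 = passes_mostly(col, mostly=0.75)  # 0.8 >= 0.75
ok_at_90 = passes_mostly(col, mostly=0.9)   # 0.8 < 0.9
ok_strict = passes_mostly(col)              # default requires no nulls at all
```

Leaving out `mostly` corresponds to the strict case, where a single null is enough to fail the check.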

andi · March 2, 2020, 2:42pm

Let’s say I have a table of route requests whose columns include fromStation, fromAddress, and fromGPS. Each request starts from either a station, an address, or a GPS point; the other two columns are null.

So for a daily validation I could estimate the percentage I expect each day for each column, but that may vary. I want to ensure that in each record exactly one column of the subset [fromStation, fromAddress, fromGPS] is not null and the others are null.

abegong · March 2, 2020, 3:02pm

I see—thanks for the clarification.

At the moment, we don’t have a single Expectation for this kind of check.

One path forward would be to define a custom Expectation. For this one, you’d want to use the multicolumn_map_expectation decorator.

If you do this and like what you end up with, we’d be happy to work with you to make it a PR into the main library.

An alternative would be to use a pattern we call a “check_df”: create an intermediate dataframe (the check_df), then apply a simple Expectation to the check_df. For example, you could sum the number of nulls in the three columns into a single column called “source_null_count” and then check_df.expect_column_values_to_be_in_set("source_null_count", [2]). Exactly two nulls out of three columns means exactly one of the sources is populated.
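A minimal sketch of the check_df idea in plain pandas, using the column names from the question. The helper `build_check_df` and the sample rows are hypothetical. With three source columns and exactly one populated, the per-row null count should be 2.

```python
import pandas as pd

# Column names taken from the question above.
SOURCE_COLUMNS = ["fromStation", "fromAddress", "fromGPS"]


def build_check_df(df: pd.DataFrame, columns=SOURCE_COLUMNS) -> pd.DataFrame:
    """Build an intermediate frame with one null count per row.

    Hypothetical helper for illustration; not part of any library.
    """
    check_df = pd.DataFrame(index=df.index)
    # Count how many of the source columns are null in each row.
    check_df["source_null_count"] = df[columns].isna().sum(axis=1)
    return check_df


# Illustrative data: rows 0 and 1 are valid, row 2 has no source,
# row 3 has two sources filled in.
requests = pd.DataFrame(
    {
        "fromStation": ["Central", None, None, "North"],
        "fromAddress": [None, "1 Main St", None, "2 Side St"],
        "fromGPS": [None, None, None, None],
    }
)

check_df = build_check_df(requests)

# Exactly one populated source means all but one column is null.
expected_nulls = len(SOURCE_COLUMNS) - 1
valid = check_df["source_null_count"] == expected_nulls
```

From here, the check_df can be wrapped as a batch or dataset in the usual way and checked with expect_column_values_to_be_in_set("source_null_count", [2]); the `valid` mask above only shows the same condition without the library.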

andi · March 4, 2020, 4:59pm

Thank you very much for the suggestions! I’ll check my schedule tomorrow and see what makes the most sense.

andi · March 18, 2020, 3:07pm

Sounds doable; I’ll check the dev guidelines. Do you prefer a fork and PR, or a dev branch on the main repo?

Edit: https://github.com/great-expectations/great_expectations/blob/develop/CONTRIBUTING.md
