Hi all, I wonder if I can combine batch.expect_column_values_to_not_be_null()
with a subset of columns and a threshold like in pandas.
Are you looking for something like this?
batch.expect_column_values_to_not_be_null("my_1st_column", mostly=.9)
batch.expect_column_values_to_not_be_null("my_2nd_column", mostly=.8)
batch.expect_column_values_to_not_be_null("my_3rd_column")
This would assert that my_1st_column is null no more than 10% of the time, my_2nd_column is null no more than 20% of the time, and my_3rd_column is never null.
Let’s say I have a table of route requests with the columns fromStation, fromAddress, and fromGPS. Each request starts with either a station, an address, or a GPS point; the other two columns are null.
So for a daily validation I could estimate the percentage I expect for each column on a given day, but that may vary. What I want to ensure is that in each record exactly one column of the subset [fromStation, fromAddress, fromGPS] is not null and the other two are null.
I see, thanks for the clarification.
At the moment, we don’t have a single Expectation for this kind of check.
One path forward would be to define a custom Expectation. For this one, you’d want to use the multicolumn_map_expectation decorator.
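Here is a minimal, untested sketch of what that could look like with the PandasDataset API; the subclass and expectation names are hypothetical:

from great_expectations.dataset import MetaPandasDataset, PandasDataset

class RouteRequestDataset(PandasDataset):
    # The decorator hands the selected columns to the function as a
    # DataFrame and expects a boolean Series back (True = row passes).
    @MetaPandasDataset.multicolumn_map_expectation
    def expect_exactly_one_column_value_to_not_be_null(self, column_list):
        # Pass when exactly one of the requested columns is non-null.
        return column_list.notnull().sum(axis=1) == 1

You'd then wrap your DataFrame in the subclass and call something like RouteRequestDataset(df).expect_exactly_one_column_value_to_not_be_null(["fromStation", "fromAddress", "fromGPS"], ignore_row_if="never"). Passing ignore_row_if="never" matters because, if I remember right, the default skips rows where all of the values are missing, and you'd want those rows to fail.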
If you do this and like what you end up with, we’d be happy to work with you to make it a PR into the main library.
An alternative would be to use a pattern we call a “check_df”: create an intermediate dataframe (the check_df), then apply a simple Expectation to it. For example, you could count the nulls across the three columns into a single column called “source_null_count”; since exactly one of the three should be set, every row should have exactly two nulls, so you'd assert check_df.expect_column_values_to_be_in_set("source_null_count", [2]).
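Roughly like this, untested, with made-up raw data just to show the shape:

import pandas as pd
import great_expectations as ge

source_cols = ["fromStation", "fromAddress", "fromGPS"]

# Stand-in for the route-request table.
raw = pd.DataFrame({
    "fromStation": ["Central", None, None],
    "fromAddress": [None, "1 Main St", None],
    "fromGPS": [None, None, "52.52,13.40"],
})

# Build the check_df: per-row null count across the three source columns.
check_df = ge.from_pandas(
    pd.DataFrame({"source_null_count": raw[source_cols].isnull().sum(axis=1)})
)

# Exactly one source per request means exactly two of the three are null.
result = check_df.expect_column_values_to_be_in_set("source_null_count", [2])
print(result)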
Thank you very much for the suggestions! I’ll check my schedule tomorrow and see what makes the most sense.
Sounds doable; I’ll check the dev guidelines. Do you prefer a fork and PR, or a dev branch on the main repo?
Edit: https://github.com/great-expectations/great_expectations/blob/develop/CONTRIBUTING.md