When adding an expect_column_kl_divergence_to_be_less_than expectation on a low-cardinality column, we need to pass a “partition_object” argument that holds the values of the column and their weights.
How can GE help get this object from a sample batch of data?
This answer assumes that you:
First, call expect_column_kl_divergence_to_be_less_than without specifying a partition object or a threshold. GE will interpret this as you having no constraints and will return the partition that it observed in the batch:
profiling_result = batch.expect_column_kl_divergence_to_be_less_than(
    COLUMN_NAME,
    bucketize_data=False,
    partition_object=None,
    threshold=None,
    result_format='COMPLETE',
)
observed_partition = profiling_result.result["details"]["observed_partition"]
The step above essentially profiled the column.
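For intuition, here is a rough sketch of what such a partition looks like for a categorical column: the distinct values plus the relative frequency of each. It is plain Python with no GE dependency; the helper name build_partition and the sample data are made up for illustration, and the "values"/"weights" keys follow the description in the question rather than anything inspected from GE itself.

```python
from collections import Counter


def build_partition(column_values):
    """Return the distinct values of a column and their relative frequencies.

    Hypothetical helper for illustration only; not part of GE.
    """
    counts = Counter(column_values)
    total = sum(counts.values())
    values = sorted(counts)  # fixed order so values and weights line up
    weights = [counts[v] / total for v in values]
    return {"values": values, "weights": weights}


# Made-up sample standing in for a low-cardinality column.
sample = ["a", "b", "a", "c", "a", "b", "a", "a"]
partition = build_partition(sample)
print(partition)
```

The weights sum to 1, so the partition can be read as an empirical probability distribution over the column's values.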
Now you can use the observed partition object to create a real expectation (don’t forget to specify a threshold as well):
batch.expect_column_kl_divergence_to_be_less_than(
    COLUMN_NAME,
    bucketize_data=False,
    partition_object=observed_partition,
    threshold=0.4,
)
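To get a feel for what a threshold such as 0.4 means, it can help to compute a KL divergence by hand. The sketch below is plain Python and is not GE's implementation: it assumes the divergence is measured from the observed weights to the expected weights, that both lists are aligned on the same values, and that the natural logarithm is used. The function name and the example weights are hypothetical.

```python
import math


def kl_divergence(observed_weights, expected_weights):
    """KL divergence D(observed || expected) for two aligned weight lists.

    Illustrative only; assumes both lists refer to the same values in the
    same order and that each list sums to 1.
    """
    total = 0.0
    for p, q in zip(observed_weights, expected_weights):
        if p == 0:
            continue  # a zero observed weight contributes nothing
        if q == 0:
            return math.inf  # observed a value the expected partition rules out
        total += p * math.log(p / q)
    return total


# Made-up weights for a two-value column.
expected = [0.25, 0.75]
observed = [0.5, 0.5]
divergence = kl_divergence(observed, expected)
print(divergence, divergence < 0.4)
```

Identical distributions give a divergence of 0, and the value grows as the new batch drifts away from the profiled partition, so a smaller threshold makes the expectation stricter.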