Is it possible to see the row numbers of records in a PySpark DataFrame that didn't pass validation?

I'm validating a big dataset in AWS Glue using a PySpark DataFrame. I have properly created a data_source with .add_spark(), a data_asset, a batch definition, an ExpectationSuite, and a validation_definition.
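Simplified, the setup looks roughly like this (a minimal sketch assuming the GX 1.x fluent API; all names are placeholders):

import great_expectations as gx

context = gx.get_context()

# Spark data source, DataFrame asset, and a whole-DataFrame batch definition
data_source = context.data_sources.add_spark(name="my_data_source")
data_asset = data_source.add_dataframe_asset(name="my_data_asset")
batch_definition = data_asset.add_batch_definition_whole_dataframe("my_batch_definition")

# Suite containing the expectation from the result shown below
suite = context.suites.add(gx.ExpectationSuite(name="my_suite"))
suite.add_expectation(
    gx.expectations.ExpectColumnValueLengthsToEqual(column="some_column", value=9)
)

# Validation definition tying the batch definition and suite together
validation_definition = context.validation_definitions.add(
    gx.ValidationDefinition(name="my_validation", data=batch_definition, suite=suite)
)

# df is the PySpark DataFrame built earlier in the Glue job
results = validation_definition.run(
    batch_parameters={"dataframe": df},
    result_format="SUMMARY",
)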
When I read the result in "SUMMARY" format I can see the validation results, e.g. for one of my expectations I see:
"success": false,
"expectation_config": {
    "type": "expect_column_value_lengths_to_equal",
    "kwargs": {
        "batch_id": "my_data_source-my_data_asset",
        "column": "some_column",
        "value": 9.0
    },
    "meta": {},
    "id": "some_id_12323-sdf324-etc"
},
"result": {
    "element_count": 64060,
    "unexpected_count": 63796,
    "unexpected_percent": 99.58788635654075,
    "partial_unexpected_list": [
        "12345678",
        "12345678",
        "12345678",
        "12345678",
        "12345678",
        "12345678",
        "12345678"
    ],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 99.58788635654075,
    "unexpected_percent_nonmissing": 99.58788635654075,
    "partial_unexpected_counts": [
        {
            "value": "12345678",
            "count": 20
        }
    ]
},
"meta": {},
"exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
}

I was wondering if there is a way to show the number of each record that didn't pass validation (in the order the records appear in the DataFrame)?
Thanks, Kuba


Hey there @woodbine, welcome to our community!

Can you try adding unexpected_index_column_names to your result format? If your DataFrame has a unique identifier column (like an ID or record number), specify that column in unexpected_index_column_names. This will include the identifiers of the failing records in the validation output.

result_format = {
   "result_format": "SUMMARY",
   "unexpected_index_column_names": ["id_column"] 
}

replace "id_column" with the name of your unique identifier column
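Since a Spark DataFrame has no built-in positional index, if you want the position of the record rather than an existing ID, you can add an index column yourself before validating and point unexpected_index_column_names at it. A minimal sketch (assuming the validation_definition from your setup; row_nr is just an illustrative name):

from pyspark.sql import Window
from pyspark.sql import functions as F

# Add a 0-based positional index. Note: an unpartitioned window pulls all
# rows onto a single partition, so this can be slow on a very large dataset.
df_with_index = df.withColumn("_mono_id", F.monotonically_increasing_id())
df_with_index = (
    df_with_index
    .withColumn("row_nr", F.row_number().over(Window.orderBy("_mono_id")) - 1)
    .drop("_mono_id")
)

results = validation_definition.run(
    batch_parameters={"dataframe": df_with_index},
    result_format={
        "result_format": "COMPLETE",  # COMPLETE returns the full list of failing indices
        "unexpected_index_column_names": ["row_nr"],
    },
)

With that in place, the failing records should show up in each expectation's result under unexpected_index_list (or partial_unexpected_index_list if you stay on SUMMARY), keyed by row_nr.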
