I'm validating a big dataset in AWS Glue using a PySpark DataFrame. I have properly created a data_source with .add_spark(), a data_asset, a batch definition, an ExpectationSuite, and a validation_definition.
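For context, this is roughly what the setup looks like (a minimal sketch, assuming the Great Expectations 1.x fluent API; names such as my_data_source, my_data_asset, my_suite and df are placeholders):

# Sketch of the setup described above (GX 1.x fluent API, names are placeholders).
import great_expectations as gx
import great_expectations.expectations as gxe

context = gx.get_context()

# Spark data source, DataFrame asset, and whole-dataframe batch definition
data_source = context.data_sources.add_spark(name="my_data_source")
data_asset = data_source.add_dataframe_asset(name="my_data_asset")
batch_definition = data_asset.add_batch_definition_whole_dataframe("my_batch_definition")

# Expectation suite with the expectation shown in the result below
suite = context.suites.add(gx.ExpectationSuite(name="my_suite"))
suite.add_expectation(
    gxe.ExpectColumnValueLengthsToEqual(column="some_column", value=9)
)

# Validation definition tying the batch definition to the suite
validation_definition = context.validation_definitions.add(
    gx.ValidationDefinition(name="my_validation", data=batch_definition, suite=suite)
)

# Run against the Glue/PySpark DataFrame and read the result in SUMMARY format
results = validation_definition.run(
    batch_parameters={"dataframe": df},
    result_format="SUMMARY",
)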
When I read the result in "SUMMARY" format I can see the validation results. E.g., for one of my expectations I can see:
{
  "success": false,
  "expectation_config": {
    "type": "expect_column_value_lengths_to_equal",
    "kwargs": {
      "batch_id": "my_data_source-my_data_asset",
      "column": "some_column",
      "value": 9.0
    },
    "meta": {},
    "id": "some_id_12323-sdf324-etc"
  },
  "result": {
    "element_count": 64060,
    "unexpected_count": 63796,
    "unexpected_percent": 99.58788635654075,
    "partial_unexpected_list": [
      "12345678",
      "12345678",
      "12345678",
      "12345678",
      "12345678",
      "12345678",
      "12345678"
    ],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 99.58788635654075,
    "unexpected_percent_nonmissing": 99.58788635654075,
    "partial_unexpected_counts": [
      {
        "value": "12345678",
        "count": 20
      }
    ]
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}
I was wondering if there is a way to also show the row number of each record that didn't pass validation, i.e. its position in the order the records appear in the DataFrame?
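To illustrate what I mean, I could attach an explicit row index to the DataFrame myself before validating (a rough sketch below; "row_idx" is just a placeholder column name), but I'd like to know whether the validation result itself can report that position for the unexpected values:

# Hypothetical workaround: number each record by its original position so a
# failing value could be traced back to its row. Assumes df has a stable order.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.orderBy(F.monotonically_increasing_id())
df_with_idx = df.withColumn("row_idx", F.row_number().over(w))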
Thanks, Kuba