Is it possible to see the row numbers of records in a PySpark DataFrame that didn't pass validation?

I'm validating a big dataset in AWS Glue using a PySpark DataFrame. I have properly created a data_source with .add_spark(), a data_asset, a batch definition, an ExpectationSuite, and a validation_definition.
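Simplified, the setup looks roughly like this (a minimal sketch assuming the GX 1.x fluent API; all names are placeholders):

import great_expectations as gx

context = gx.get_context()

# Spark data source, DataFrame asset, and a whole-DataFrame batch definition
data_source = context.data_sources.add_spark(name="my_data_source")
data_asset = data_source.add_dataframe_asset(name="my_data_asset")
batch_definition = data_asset.add_batch_definition_whole_dataframe("my_batch_definition")

# Suite containing the expectation from the result shown below
suite = context.suites.add(gx.ExpectationSuite(name="my_suite"))
suite.add_expectation(
    gx.expectations.ExpectColumnValueLengthsToEqual(column="some_column", value=9)
)

# Validation definition tying the batch definition and suite together
validation_definition = context.validation_definitions.add(
    gx.ValidationDefinition(name="my_validation", data=batch_definition, suite=suite)
)

# df is the PySpark DataFrame built earlier in the Glue job
results = validation_definition.run(
    batch_parameters={"dataframe": df},
    result_format="SUMMARY",
)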
When I read the result in "SUMMARY" format I can see the validation results, e.g. for one of my expectations I see:
"success": false,
"expectation_config": {
    "type": "expect_column_value_lengths_to_equal",
    "kwargs": {
        "batch_id": "my_data_source-my_data_asset",
        "column": "some_column",
        "value": 9.0
    },
    "meta": {},
    "id": "some_id_12323-sdf324-etc"
},
"result": {
    "element_count": 64060,
    "unexpected_count": 63796,
    "unexpected_percent": 99.58788635654075,
    "partial_unexpected_list": [
        "12345678",
        "12345678",
        "12345678",
        "12345678",
        "12345678",
        "12345678",
        "12345678"
    ],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 99.58788635654075,
    "unexpected_percent_nonmissing": 99.58788635654075,
    "partial_unexpected_counts": [
        {
            "value": "12345678",
            "count": 20
        }
    ]
},
"meta": {},
"exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
}

I was wondering if there is a way to show the number of each record that didn't pass validation (in the order the records appear in the DataFrame)?
Thanks, Kuba


Hey there @woodbine, welcome to our community!

Can you try adding unexpected_index_column_names to your result format? If your DataFrame has a unique identifier column (like an ID or record number), specify that column in unexpected_index_column_names. This will include the identifiers of the failing records in the validation output.

result_format = {
   "result_format": "SUMMARY",
   "unexpected_index_column_names": ["id_column"] 
}

replace "id_column" with the name of your unique identifier column
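Since a Spark DataFrame has no built-in positional index, if you want the position of the record rather than an existing ID, you can add an index column yourself before validating and point unexpected_index_column_names at it. A minimal sketch (assuming the validation_definition from your setup; row_nr is just an illustrative name):

from pyspark.sql import Window
from pyspark.sql import functions as F

# Add a 0-based positional index. Note: an unpartitioned window pulls all
# rows onto a single partition, so this can be slow on a very large dataset.
df_with_index = df.withColumn("_mono_id", F.monotonically_increasing_id())
df_with_index = (
    df_with_index
    .withColumn("row_nr", F.row_number().over(Window.orderBy("_mono_id")) - 1)
    .drop("_mono_id")
)

results = validation_definition.run(
    batch_parameters={"dataframe": df_with_index},
    result_format={
        "result_format": "COMPLETE",  # COMPLETE returns the full list of failing indices
        "unexpected_index_column_names": ["row_nr"],
    },
)

With that in place, the failing records should show up in each expectation's result under unexpected_index_list (or partial_unexpected_index_list if you stay on SUMMARY), keyed by row_nr.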
