Hi everyone,
I’ve been experimenting with Great Expectations on Spark, and I’ve noticed some unexpected behavior when switching from a “BASIC” validation to a “SUMMARY” validation that returns unexpected rows.
I’m working with a worst-case scenario: validating 1000 expectations (all intentionally forced to fail) against a single column of a Hive table with 86,400 rows (one row per second over a day).
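For reference, the suite is built with a loop along these lines (a minimal sketch; the suite name, column name, and bounds below are placeholders, chosen only so that every expectation fails):

import great_expectations as gx

context = gx.get_context()
suite = context.suites.add(gx.ExpectationSuite(name="stress_suite"))  # placeholder name
for i in range(1000):
    # A range no value in the column can satisfy, so every expectation fails
    suite.add_expectation(
        gx.expectations.ExpectColumnValuesToBeBetween(
            column="value",            # placeholder column name
            min_value=10_000_000 + i,
            max_value=10_000_000 + i
        )
    )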
What I’m seeing
BASIC validation
batch.validate(
    suite,
    result_format={
        "result_format": "BASIC",
        "partial_unexpected_count": 0
    }
)
In this mode, the Spark History Server shows around 8 collect() calls. This is more or less what I’d expect, since Spark is lazy and you generally want only a handful of actions triggered at the end.
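(If anyone wants to reproduce the count, a quick way to attribute these jobs in the Spark UI is to tag the run with a job group before calling validate(). Sketch below; spark, batch, and suite are the objects from above.)

# Sketch: tag the validation run so its jobs are easy to spot and count in the Spark UI
sc = spark.sparkContext
sc.setJobGroup("ge-basic-run", "GE validate() with BASIC result_format")
results = batch.validate(
    suite,
    result_format={"result_format": "BASIC", "partial_unexpected_count": 0}
)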
SUMMARY + unexpected rows
batch.validate(
    suite,
    result_format={
        "result_format": "SUMMARY",
        "unexpected_index_column_names": [timestamp_name],
        "partial_unexpected_count": 86400
    }
)
In this case, however, over 3000 collect() calls are triggered, and on a larger dataset the driver quickly runs out of Java heap memory, resulting in OOM errors. I’d expect some increase in collect calls given the additional information this result format has to gather, but I feel it should scale better than this.
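The only stopgaps I can think of are persisting the source DataFrame before building the batch (so the repeated actions at least hit cached data instead of re-reading the Hive table) and lowering partial_unexpected_count, but neither reduces the number of Spark jobs. Rough sketch, with placeholder table/column names:

from pyspark import StorageLevel

# Persist the source *before* the batch is built from it, so the ~3000 actions
# reuse cached data instead of re-scanning the Hive table each time.
df = spark.table("my_db.my_hive_table")        # placeholder Hive table
df = df.persist(StorageLevel.MEMORY_AND_DISK)
df.count()                                      # materialize the cache up front
# ... build `batch` from this cached DataFrame as before (omitted) ...

results = batch.validate(
    suite,
    result_format={
        "result_format": "SUMMARY",
        "unexpected_index_column_names": [timestamp_name],
        "partial_unexpected_count": 1000        # far fewer rows pulled back to the driver
    }
)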
Questions
- Is there any way to group those collects into a single Spark job, or at least fewer of them?
- Are there any plans to implement this kind of batching in the SparkExecutionEngine in future GE releases?
Perhaps this “brute force” approach is not the intended usage pattern for Great Expectations, but I believe this would be a great improvement to the SparkExecutionEngine’s scalability.
I’d greatly appreciate any insights, pointers to existing issues/PRs, or code snippets you might have. Thanks in advance for your help!
Álvaro
Using Great Expectations v1.5.4 with Spark v3.2.3