When validating a Spark DataFrame against multiple expectations, the Great Expectations validation process appears to trigger a separate full scan of the source data for each expectation. The observed "Input Size" in the Spark UI therefore grows as a multiple of the dataset size, leading to severe memory pressure, excessive garbage collection, and ultimately job failures, even on moderately sized datasets.
**Steps to reproduce:**

- Load a ~500MB Parquet file into a Spark DataFrame.
- Define a Great Expectations suite with just two expectations:
  - `expect_column_values_to_be_between`
  - `expect_column_pair_values_a_to_be_greater_than_b`
- Run validation using a validator created from a `RuntimeBatchRequest`.
- Observe the Spark UI for the resulting job.
**Expected behavior:** The input size is of the same order of magnitude as the original dataset (~500MB–1GB), indicating efficient, close-to-single-pass computation.
**Actual behavior:** The Spark UI reports an "Input Size" of over 7.1 GB (a ~14x increase), indicating the data was scanned multiple times.
The current validation engine appears to compile each expectation into a separate Spark query plan, with no mechanism to fuse the metric computations or to cache the intermediate batch. As a result, Spark re-reads the entire source dataset from disk once per expectation, as observed in the query plans.