When validating a Spark DataFrame against multiple expectations, the Great Expectations validation process appears to trigger a separate full scan of the source data for each expectation. The observed "Input Size" in the Spark UI therefore grows as a multiple of the dataset size, leading to severe memory pressure, excessive garbage collection, and ultimately job failures, even on moderately sized datasets.
**Steps to reproduce:**

- Load a ~500MB Parquet file into a Spark DataFrame.
- Define a Great Expectations suite with just two expectations:
  - `expect_column_values_to_be_between`
  - `expect_column_pair_values_a_to_be_greater_than_b`
- Run validation using a validator created from a `RuntimeBatchRequest`.
- Observe the Spark UI for the resulting job.
**Expected behavior:** The input size is of the same order of magnitude as the original dataset (~500MB–1GB), indicating efficient, close-to-single-pass computation.
**Actual behavior:** The Spark UI reports an "Input Size" of over 7.1 GB (a ~14x increase), indicating the data was scanned multiple times.
The current validation engine appears to compile each expectation into a separate Spark query plan, with no mechanism to fuse the metric computations or to cache the intermediate batch. As a result, Spark re-reads the entire source dataset from disk once per expectation, as observed in the query plans.