Hi everyone,
I’ve been experimenting with Great Expectations on Spark, and I’ve noticed some unexpected behavior when switching from a “BASIC” validation to a “SUMMARY” validation that returns unexpected rows.
I’m working with a worst-case scenario: validating 1000 expectations (all intentionally forced to fail) against a single column of a Hive table with 86,400 rows (one row per second over a day).
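For reference, the suite is built with a loop along these lines (a minimal sketch; the suite name, column name, and bounds below are placeholders, chosen only so that every expectation fails):

import great_expectations as gx

context = gx.get_context()
suite = context.suites.add(gx.ExpectationSuite(name="stress_suite"))  # placeholder name
for i in range(1000):
    # A range no value in the column can satisfy, so every expectation fails
    suite.add_expectation(
        gx.expectations.ExpectColumnValuesToBeBetween(
            column="value",            # placeholder column name
            min_value=10_000_000 + i,
            max_value=10_000_000 + i
        )
    )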
What I’m seeing
BASIC validation
batch.validate(
    suite,
    result_format={
        "result_format": "BASIC",
        "partial_unexpected_count": 0
    }
)
In this mode, the Spark History Server shows around 8 collect() calls. This is more or less what I’d expect, since Spark is lazy and you generally want only a handful of actions triggered at the end.
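(If anyone wants to reproduce the count, a quick way to attribute these jobs in the Spark UI is to tag the run with a job group before calling validate(). Sketch below; spark, batch, and suite are the objects from above.)

# Sketch: tag the validation run so its jobs are easy to spot and count in the Spark UI
sc = spark.sparkContext
sc.setJobGroup("ge-basic-run", "GE validate() with BASIC result_format")
results = batch.validate(
    suite,
    result_format={"result_format": "BASIC", "partial_unexpected_count": 0}
)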
SUMMARY + unexpected rows
batch.validate(
    suite,
    result_format={
        "result_format": "SUMMARY",
        "unexpected_index_column_names": [timestamp_name],
        "partial_unexpected_count": 86400
    }
)
In this case, however, over 3000 collect() calls are triggered, and on a larger dataset the driver quickly runs out of Java heap memory, resulting in OOM errors. I’d expect some increase in collect calls given the additional information this result format has to gather, but I feel it should scale better than this.
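The only stopgaps I can think of are persisting the source DataFrame before building the batch (so the repeated actions at least hit cached data instead of re-reading the Hive table) and lowering partial_unexpected_count, but neither reduces the number of Spark jobs. Rough sketch, with placeholder table/column names:

from pyspark import StorageLevel

# Persist the source *before* the batch is built from it, so the ~3000 actions
# reuse cached data instead of re-scanning the Hive table each time.
df = spark.table("my_db.my_hive_table")        # placeholder Hive table
df = df.persist(StorageLevel.MEMORY_AND_DISK)
df.count()                                      # materialize the cache up front
# ... build `batch` from this cached DataFrame as before (omitted) ...

results = batch.validate(
    suite,
    result_format={
        "result_format": "SUMMARY",
        "unexpected_index_column_names": [timestamp_name],
        "partial_unexpected_count": 1000        # far fewer rows pulled back to the driver
    }
)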
Questions
- Is there any way to group those collects into a single Spark job, or at least fewer of them?
- Are there any plans to implement this kind of batching in the SparkExecutionEngine in future GE releases?
Perhaps this “brute force” approach is not the intended usage pattern for Great Expectations, but I believe this would be a great improvement to the SparkExecutionEngine’s scalability.
I’d greatly appreciate any insights, pointers to existing issues/PRs, or code snippets you might have. Thanks in advance for your help!
Álvaro
Using Great Expectations v1.5.4 with Spark v3.2.3