Hello,
I’ve been trying for the past few weeks to implement GX Core on some infrastructure that accesses data in a non-standard way. I don’t have a SQL connection string to expose GX to, but I can run PySpark code inside a container that can access the data (hence the Spark Dataframe Datasource does work).
I have it working using a Spark DataFrame, but the problem is that so far performance has been less than ideal. Let's say, as a proof of concept, I want to validate that every `id` in a table is unique.
Without GX, I can use the following SQL query, which takes 48 minutes to run.
```sql
SELECT
    COUNT(*) AS total_rows,
    COUNT(DISTINCT id) AS unique_ids,
    CASE
        WHEN COUNT(*) = COUNT(DISTINCT id) THEN 'All IDs are unique'
        ELSE 'Duplicate IDs found'
    END AS result
FROM table;
```
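For reference, the same check in plain PySpark against the DataFrame I later hand to GX is a single aggregation pass. This is a minimal sketch, assuming `df` is already loaded:

```python
from pyspark.sql import functions as F

# `df` is assumed to be the same DataFrame that gets registered with GX.
# One aggregation pass: compare total row count to distinct id count.
agg = df.agg(
    F.count("*").alias("total_rows"),
    F.countDistinct("id").alias("unique_ids"),
).collect()[0]

all_unique = agg["total_rows"] == agg["unique_ids"]
print("All IDs are unique" if all_unique else "Duplicate IDs found")
```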
By contrast, in GX I use this expectation:

```python
gxe.ExpectColumnValuesToBeUnique(
    column="id",
)
```
Even with `result_format={"result_format": "BOOLEAN_ONLY"}`, the initial `collect()` alone takes seven hours! That performance makes GX difficult to justify for my use case.
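For context, this is roughly how the validation is wired up (simplified; the datasource and asset names are placeholders, and I'm following the GX Core 1.x fluent API as I understand it):

```python
import great_expectations as gx
import great_expectations.expectations as gxe

context = gx.get_context()

# Spark DataFrame datasource -> DataFrame asset -> whole-dataframe batch definition
data_source = context.data_sources.add_spark(name="spark")
data_asset = data_source.add_dataframe_asset(name="my_table")
batch_definition = data_asset.add_batch_definition_whole_dataframe("whole_df")

# `df` is the PySpark DataFrame built inside the container
batch = batch_definition.get_batch(batch_parameters={"dataframe": df})

result = batch.validate(
    gxe.ExpectColumnValuesToBeUnique(column="id"),
    result_format={"result_format": "BOOLEAN_ONLY"},
)
print(result.success)
```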
As a workaround, I tried monkey-patching GX to use a custom datasource (discussed here), but those efforts have been unsuccessful.
I wanted to ask why the performance of the Spark DF source is as (comparatively) poor as it is. Is it just that GX is collecting a lot of additional metrics? If so, is there any way to turn that off? If there are other reasons, what can I do to improve things?
Thanks