Hello,
I’ve been trying for the past few weeks to implement GX Core on some infrastructure that accesses data in a non-standard way. I don’t have a SQL connection string to expose GX to, but I can run PySpark code inside a container that can access the data (hence the Spark Dataframe Datasource does work).
I have it working using a Spark DataFrame, but the problem is that so far performance has been less than ideal. Let's say, as a proof of concept, I want to validate that every `id` in a table is unique.
Without GX, I can use the following SQL query, which takes 48 minutes to run.
```sql
SELECT
    COUNT(*) AS total_rows,
    COUNT(DISTINCT id) AS unique_ids,
    CASE
        WHEN COUNT(*) = COUNT(DISTINCT id) THEN 'All IDs are unique'
        ELSE 'Duplicate IDs found'
    END AS result
FROM table;
```
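For reference, the same check in plain PySpark against the DataFrame I later hand to GX is a single aggregation pass. This is a minimal sketch, assuming `df` is already loaded:

```python
from pyspark.sql import functions as F

# `df` is assumed to be the same DataFrame that gets registered with GX.
# One aggregation pass: compare total row count to distinct id count.
agg = df.agg(
    F.count("*").alias("total_rows"),
    F.countDistinct("id").alias("unique_ids"),
).collect()[0]

all_unique = agg["total_rows"] == agg["unique_ids"]
print("All IDs are unique" if all_unique else "Duplicate IDs found")
```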
By contrast, in GX I use this expectation:

```python
gxe.ExpectColumnValuesToBeUnique(
    column="id",
)
```
Even with `result_format={"result_format": "BOOLEAN_ONLY"}`, the initial `collect()` alone takes seven hours! That performance makes GX difficult to justify for my use case.
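For context, this is roughly how the validation is wired up (simplified; the datasource and asset names are placeholders, and I'm following the GX Core 1.x fluent API as I understand it):

```python
import great_expectations as gx
import great_expectations.expectations as gxe

context = gx.get_context()

# Spark DataFrame datasource -> DataFrame asset -> whole-dataframe batch definition
data_source = context.data_sources.add_spark(name="spark")
data_asset = data_source.add_dataframe_asset(name="my_table")
batch_definition = data_asset.add_batch_definition_whole_dataframe("whole_df")

# `df` is the PySpark DataFrame built inside the container
batch = batch_definition.get_batch(batch_parameters={"dataframe": df})

result = batch.validate(
    gxe.ExpectColumnValuesToBeUnique(column="id"),
    result_format={"result_format": "BOOLEAN_ONLY"},
)
print(result.success)
```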
As a workaround, I tried monkey-patching GX to use a custom datasource (discussed here), but those efforts have been unsuccessful.
I wanted to ask why the performance of the Spark DF source is as (comparatively) poor as it is. Is it just that GX is collecting a lot of additional metrics? If so, is there any way to turn that off? If there are other reasons, what can I do to improve things?
Thanks