Good morning GX Support,
I mapped a BigQuery dataset to a GX Datasource and a BigQuery table to a GX Data Asset. I then added a splitter (yearly, monthly, and daily) to the Data Asset so that I can control which data to fetch via GX Batch Requests, based on the interval of the Airflow DAG I'm running.
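For reference, my setup looks roughly like this (project, dataset, table, and column names are placeholders, and the exact method names may differ slightly depending on the GX version):

import great_expectations as gx

context = gx.get_context()

# BigQuery Datasource via the SQLAlchemy BigQuery dialect
datasource = context.sources.add_sql(
    name="my_bigquery_datasource",
    connection_string="bigquery://my-project/my_dataset",
)

# One BigQuery table mapped to a Data Asset
asset = datasource.add_table_asset(
    name="my_table_asset",
    table_name="my_table",
)

# Yearly/monthly/daily splitter on the table's partition column
asset.add_splitter_year_and_month_and_day(column_name="partition_ts")

# In the Airflow DAG, the batch for the run's interval is requested like this
batch_request = asset.build_batch_request(
    options={"year": 2024, "month": 1, "day": 15}
)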
The problem is that the final query GX generates and runs on BigQuery is not optimized to leverage partition pruning. In a nutshell, I configured the splitter with column_name set to the partition column of the table I'm working with. By doing so, I expect the resulting query to leverage pruning and load only the partition I need (not the entire table every time), so as not to waste resources. However, GX translates the Batch Request into the GoogleSQL dialect as something like this:
CREATE OR REPLACE TABLE `<GX TEMP ID>`
OPTIONS(
expiration_timestamp=TIMESTAMP_ADD(
CURRENT_TIMESTAMP(), INTERVAL 24 HOUR)
)
AS SELECT *
FROM `<NAME OF THE TABLE>`
WHERE EXTRACT(year FROM `<PARTITION COLUMN>`) = <YEAR>
  AND EXTRACT(month FROM `<PARTITION COLUMN>`) = <MONTH>
  AND EXTRACT(day FROM `<PARTITION COLUMN>`) = <DAY>
instead of something like this:
CREATE OR REPLACE TABLE `<GX TEMP ID>`
OPTIONS(
expiration_timestamp=TIMESTAMP_ADD(
CURRENT_TIMESTAMP(), INTERVAL 24 HOUR)
)
AS SELECT *
FROM `<NAME OF THE TABLE>`
WHERE TIMESTAMP_TRUNC(<PARTITION COLUMN>, DAY) = TIMESTAMP("<YEAR>-<MONTH>-<DAY>")
In my specific case, the first query processes 33 GB while the second would process 130 MB.
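(Those figures come from BigQuery's own estimates; a dry run shows the difference, roughly like this, with the two queries above filled in with real names:)

from google.cloud import bigquery

client = bigquery.Client()

def dry_run_bytes(sql: str) -> int:
    # Dry run: BigQuery estimates the bytes scanned without executing the query
    job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    job = client.query(sql, job_config=job_config)
    return job.total_bytes_processed

# extract_sql / trunc_sql hold the two SELECT statements shown above
# print(dry_run_bytes(extract_sql))  # ~33 GB
# print(dry_run_bytes(trunc_sql))    # ~130 MB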
Is there any chance this is something you're already working on? If not, is there another way to address this problem? I want to add quality checks to my ETL pipeline for each new batch I process without wasting resources, especially since the tables I work with can be very large.
Thanks in advance.