Good morning GX Support,
I mapped a BigQuery dataset to a GX Datasource and a BigQuery table to a GX Data Asset. I then added a splitter (yearly, monthly, and daily) to the Data Asset so that I can control which data to fetch via GX Batch Requests, based on the interval of the Airflow DAG I'm running.
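For reference, my setup looks roughly like this (project, dataset, table, and column names are placeholders, and the exact method names may differ slightly depending on the GX version):

import great_expectations as gx

context = gx.get_context()

# BigQuery Datasource via the SQLAlchemy BigQuery dialect
datasource = context.sources.add_sql(
    name="my_bigquery_datasource",
    connection_string="bigquery://my-project/my_dataset",
)

# One BigQuery table mapped to a Data Asset
asset = datasource.add_table_asset(
    name="my_table_asset",
    table_name="my_table",
)

# Yearly/monthly/daily splitter on the table's partition column
asset.add_splitter_year_and_month_and_day(column_name="partition_ts")

# In the Airflow DAG, the batch for the run's interval is requested like this
batch_request = asset.build_batch_request(
    options={"year": 2024, "month": 1, "day": 15}
)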
The problem is that the final query GX generates and runs on BigQuery is not optimized to leverage partition pruning. In a nutshell, I configured the splitter with column_name set to the partition column of the table I'm working with. By doing so, I expect the resulting query to leverage pruning and load only the partition I need (not the entire table every time), so as not to waste resources. However, GX translates the Batch Request into the GoogleSQL dialect as something like this:
CREATE OR REPLACE TABLE `<GX TEMP ID>`
OPTIONS(
expiration_timestamp=TIMESTAMP_ADD(
CURRENT_TIMESTAMP(), INTERVAL 24 HOUR)
)
AS SELECT *
FROM `<NAME OF THE TABLE>`
WHERE EXTRACT(year FROM `<PARTITION COLUMN>`) = <YEAR>
  AND EXTRACT(month FROM `<PARTITION COLUMN>`) = <MONTH>
  AND EXTRACT(day FROM `<PARTITION COLUMN>`) = <DAY>
instead of something like this:
CREATE OR REPLACE TABLE `<GX TEMP ID>`
OPTIONS(
expiration_timestamp=TIMESTAMP_ADD(
CURRENT_TIMESTAMP(), INTERVAL 24 HOUR)
)
AS SELECT *
FROM `<NAME OF THE TABLE>`
WHERE TIMESTAMP_TRUNC(<PARTITION COLUMN>, DAY) = TIMESTAMP("<YEAR>-<MONTH>-<DAY>")
In my specific case, the first query processes 33 GB while the second would process 130 MB.
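(Those figures come from BigQuery's own estimates; a dry run shows the difference, roughly like this, with the two queries above filled in with real names:)

from google.cloud import bigquery

client = bigquery.Client()

def dry_run_bytes(sql: str) -> int:
    # Dry run: BigQuery estimates the bytes scanned without executing the query
    job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    job = client.query(sql, job_config=job_config)
    return job.total_bytes_processed

# extract_sql / trunc_sql hold the two SELECT statements shown above
# print(dry_run_bytes(extract_sql))  # ~33 GB
# print(dry_run_bytes(trunc_sql))    # ~130 MB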
Is there any chance this is something you're already working on? If not, is there another way to address this problem? I want to add quality checks to my ETL pipeline for each new batch I process without wasting resources, especially since the tables I work with can be very large.
Thanks in advance.