Dear Team,
I am currently facing an issue while working with Great Expectations (version 0.13.37) in Airflow. In my setup, I define several Airflow tasks that run Great Expectations Checkpoints to validate data stored in AWS Redshift; the validation results are then written to an S3 bucket. However, after a couple of months, I noticed a steady increase in the execution time of the Airflow tasks that run these validations: when originally configured, each task took around 5 minutes to complete, but now each takes up to 1 hour.
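For context, each of these tasks is essentially a PythonOperator that loads the project and runs one Checkpoint. A simplified sketch is below; the project path, DAG id, task id, and Checkpoint name are placeholders rather than my real configuration:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from great_expectations.data_context import DataContext


def run_ge_checkpoint(checkpoint_name: str) -> None:
    # Load the Great Expectations project (placeholder path)
    context = DataContext("/opt/airflow/great_expectations")
    # With a pre-0.13-style Checkpoint config, .run() here is the
    # LegacyCheckpoint.run() call mentioned below
    checkpoint = context.get_checkpoint(checkpoint_name)
    result = checkpoint.run()
    if not result.success:
        raise ValueError(f"Validation failed for Checkpoint {checkpoint_name}")


with DAG(
    "ge_validations",  # placeholder DAG id
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
) as dag:
    validate_orders = PythonOperator(
        task_id="validate_orders",  # placeholder task id
        python_callable=run_ge_checkpoint,
        op_kwargs={"checkpoint_name": "orders_checkpoint"},  # placeholder name
    )
```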
After closer inspection, I noticed that the extra time is spent inside the LegacyCheckpoint.run() call in Python. So my question is: is there anything that could be holding up the completion of run()?
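In case it is useful, this is roughly how I have been timing the call, using only the standard library; checkpoint is the object whose run() is slow:

```python
import cProfile
import pstats

profiler = cProfile.Profile()
profiler.enable()
result = checkpoint.run()  # the call whose wall time grew from ~5 min to ~1 h
profiler.disable()

# Print the 20 functions with the largest cumulative time
pstats.Stats(profiler).sort_stats("cumulative").print_stats(20)
```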
As far as I understand, in this setup a Checkpoint run executes one query in Redshift per expectation and, once all of the queries have completed, stores the validation results in the configured destination (S3 in this case), but I cannot find where the process is getting stuck. For example, if my Airflow task starts at 10:00, I can see that the expectation queries finish in Redshift at 10:03 and the results are available in S3 at 10:05, yet the task keeps running and is not marked as complete until 11:00. Any hints on what the issue could be, or on how to debug the execution of the Checkpoints, are welcome.
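One thing I have started experimenting with is turning up logging before the run so that each internal step gets a timestamp. This uses standard Python logging against the library namespaces, so it should not require touching Great Expectations itself:

```python
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(name)s %(levelname)s %(message)s",
)
# Timestamped internals from Great Expectations
logging.getLogger("great_expectations").setLevel(logging.DEBUG)
# SQLAlchemy logs every statement sent to Redshift at INFO level
logging.getLogger("sqlalchemy.engine").setLevel(logging.INFO)
# botocore DEBUG shows the S3 requests, i.e. when results are actually written
logging.getLogger("botocore").setLevel(logging.DEBUG)
```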
Thanks in advance.