Dear Team,
I am currently facing an issue while working with Great Expectations (version 0.13.37) in Airflow. In my setup, I define several Airflow tasks that run Great Expectations Checkpoints to validate data stored in AWS Redshift; the validation results are then written to an S3 bucket. However, after a couple of months, I noticed a steady increase in the execution time of the Airflow tasks that run these validations: when originally configured, each task took around 5 minutes to complete, but now each takes up to 1 hour.
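For context, each of these tasks is essentially a PythonOperator that loads the project and runs one Checkpoint. A simplified sketch is below; the project path, DAG id, task id, and Checkpoint name are placeholders rather than my real configuration:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from great_expectations.data_context import DataContext


def run_ge_checkpoint(checkpoint_name: str) -> None:
    # Load the Great Expectations project (placeholder path)
    context = DataContext("/opt/airflow/great_expectations")
    # With a pre-0.13-style Checkpoint config, .run() here is the
    # LegacyCheckpoint.run() call mentioned below
    checkpoint = context.get_checkpoint(checkpoint_name)
    result = checkpoint.run()
    if not result.success:
        raise ValueError(f"Validation failed for Checkpoint {checkpoint_name}")


with DAG(
    "ge_validations",  # placeholder DAG id
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
) as dag:
    validate_orders = PythonOperator(
        task_id="validate_orders",  # placeholder task id
        python_callable=run_ge_checkpoint,
        op_kwargs={"checkpoint_name": "orders_checkpoint"},  # placeholder name
    )
```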
After closer inspection, I noticed that the extra time is spent inside the LegacyCheckpoint.run() call in Python. So my question is: is there anything that could be holding up the completion of run()?
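In case it is useful, this is roughly how I have been timing the call, using only the standard library; checkpoint is the object whose run() is slow:

```python
import cProfile
import pstats

profiler = cProfile.Profile()
profiler.enable()
result = checkpoint.run()  # the call whose wall time grew from ~5 min to ~1 h
profiler.disable()

# Print the 20 functions with the largest cumulative time
pstats.Stats(profiler).sort_stats("cumulative").print_stats(20)
```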
As far as I understand, in this setup a Checkpoint run executes one query in Redshift per expectation and, once all of the queries have completed, stores the validation results in the configured destination (S3 in this case), but I cannot find where the process is getting stuck. For example, if my Airflow task starts at 10:00, I can see that the expectation queries finish in Redshift at 10:03 and the results are available in S3 at 10:05, yet the task keeps running and is not marked as complete until 11:00. Any hints on what the issue could be, or on how to debug the execution of the Checkpoints, are welcome.
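One thing I have started experimenting with is turning up logging before the run so that each internal step gets a timestamp. This uses standard Python logging against the library namespaces, so it should not require touching Great Expectations itself:

```python
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(name)s %(levelname)s %(message)s",
)
# Timestamped internals from Great Expectations
logging.getLogger("great_expectations").setLevel(logging.DEBUG)
# SQLAlchemy logs every statement sent to Redshift at INFO level
logging.getLogger("sqlalchemy.engine").setLevel(logging.INFO)
# botocore DEBUG shows the S3 requests, i.e. when results are actually written
logging.getLogger("botocore").setLevel(logging.DEBUG)
```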
Thanks in advance.