Can we use Great Expectations to quarantine all the bad data into a separate log file or table, so it can be addressed in a cleaning process outside the normal DAG that handles the good data?
Joining threads…
Here’s a previous thread that is a bit out of date on this topic:
I was introduced to Great Expectations just yesterday so this is going to be a beginner-level, high-level question.
Right now we are focusing on cleaning our Pandas DataFrame datasets (just cleaning, no transform) by defining functions that match our criteria. It looks like we can accomplish this with “expectations”, because they return unexpected_index_list, which we can use to filter out unwanted data. We are also planning to implement custom “expectations” within Great Expec…
There’s also this more recent post about seeing all the records with anomalies:
Hey, is there a way to see all the records with detected anomalies? It's good that it shows statistics, but if we need to fix those records, we need to identify the problematic records.
Thanks!
Also, here’s a toy example for doing this using pandas. It is not implemented for SQL.
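For illustration, a minimal pandas-only sketch of the quarantine idea might look like the following. It does not call Great Expectations at all: the boolean check stands in for an expectation, and the resulting list plays the role of unexpected_index_list. The helper name `split_by_unexpected_index` and the `id`/`amount` columns are made up for the example.

```python
import pandas as pd


def split_by_unexpected_index(df, unexpected_index_list):
    """Split df into (good, quarantined) using a list of failing index labels."""
    bad_mask = df.index.isin(unexpected_index_list)
    return df.loc[~bad_mask], df.loc[bad_mask]


# Hypothetical data: one null amount and one negative amount.
df = pd.DataFrame({"id": [1, 2, 3, 4], "amount": [10.0, None, 25.5, -3.0]})

# Stand-in for an expectation such as "amount is not null and not negative".
# In a real pipeline this list would come from the validation result instead.
failed = df["amount"].isna() | (df["amount"] < 0)
unexpected_index_list = df.index[failed].tolist()

good, quarantined = split_by_unexpected_index(df, unexpected_index_list)

# The quarantined rows could then be written to a separate file or table,
# for example with quarantined.to_csv(...), while `good` continues downstream.
```

Keeping the split in a small function means the same code works whether the index list comes from one check or from several checks merged together.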
I'm looking to pass the unexpected index list as a parameter to the next task in my data pipeline in Airflow or Dagster.
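One way to think about that hand-off, sketched with plain Python functions rather than real Airflow or Dagster tasks: the upstream step returns a small JSON payload containing the failing indices, and the downstream step uses it to pull the rows to quarantine. The function names, the payload keys, and the `email` column are all hypothetical, and the orchestrator-specific wiring is left out on purpose.

```python
import json

import pandas as pd


def validate_step(df: pd.DataFrame) -> str:
    """Hypothetical upstream task: return a JSON payload of failing row indices."""
    failing = df.index[df["email"].isna()]
    payload = {
        "column": "email",
        # Cast to plain int so the list is JSON-serializable (no numpy types).
        "unexpected_index_list": [int(i) for i in failing],
    }
    return json.dumps(payload)


def quarantine_step(df: pd.DataFrame, payload_json: str) -> pd.DataFrame:
    """Hypothetical downstream task: select the rows named in the payload."""
    payload = json.loads(payload_json)
    return df.loc[payload["unexpected_index_list"]]


df = pd.DataFrame({"email": ["a@example.com", None, "c@example.com", None]})

payload_json = validate_step(df)              # what the first task would emit
bad_rows = quarantine_step(df, payload_json)  # what the next task would do with it
```

This assumes both steps can load the same DataFrame with the same index. If the index is not stable between tasks, passing a key column's values instead of positional indices would be safer.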