Can Great Expectations separate SQL database records that fail tests from my good data?

Great Expectations includes an unexpected_list as part of the validation results if you request results in COMPLETE format.

If you’re running in pandas, you also have the option of getting the unexpected_index_list and using that to bring back the failing records. See: How can I use the result_format unexpected_index_list to select rows from a PandasDataset?
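
As a minimal sketch of that pandas workflow: the Great Expectations call is shown in comments (v2 / 0.13-era PandasDataset API), and the index list is hardcoded here to stand in for the validation output, so the splitting logic itself is what's illustrated. The column name and values are made up.

```python
import pandas as pd

# With Great Expectations' PandasDataset API, a COMPLETE-format result
# exposes the failing row indices, e.g.:
#
#   result = dataset.expect_column_values_to_be_between(
#       "age", min_value=0, max_value=120, result_format="COMPLETE"
#   )
#   bad_idx = result.result["unexpected_index_list"]
#
# Here bad_idx is hardcoded to stand in for that output.
df = pd.DataFrame({"age": [25, 40, -3, 200, 61]})
bad_idx = [2, 3]  # stand-in for result.result["unexpected_index_list"]

bad_rows = df.loc[bad_idx]           # quarantine these records
good_rows = df.drop(index=bad_idx)   # keep processing these
```

From there, `bad_rows` can be written to a quarantine table or file while `good_rows` continues through the pipeline.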

As of GE 0.13.10, if you want to pull all the failing records from a SQL table after validation, you can use the following workflow:

  1. Set the result_format to COMPLETE and run validations.
  2. Load the values from the unexpected_list into a temp table, or format them as a text list, so you can use them in a SQL query.
  3. Use the following query to find your failing records:

SELECT *
FROM YourTable
WHERE ColumnWithFailingValues IN ( {pass in the values from the unexpected_list when result_format is set to COMPLETE} )

  4. Either copy or move these records to another table for triage or cleaning.
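
The steps above can be sketched end to end with SQLite standing in for your database; the table and column names are hypothetical, and the unexpected_list is hardcoded to stand in for the COMPLETE-format validation output:

```python
import sqlite3

# Stand-in for validation output: with result_format COMPLETE, the
# validation result's unexpected_list holds the failing values.
unexpected_list = ["bad-a", "bad-b"]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE source (id INTEGER, code TEXT)")
conn.executemany(
    "INSERT INTO source VALUES (?, ?)",
    [(1, "ok"), (2, "bad-a"), (3, "ok"), (4, "bad-b")],
)

# Empty quarantine table with the same schema as source
conn.execute("CREATE TABLE quarantine AS SELECT * FROM source WHERE 0")

# Parameterize the IN (...) clause rather than pasting values into SQL
placeholders = ", ".join("?" for _ in unexpected_list)

# Step 4: copy the failing records into the quarantine table for triage...
conn.execute(
    f"INSERT INTO quarantine SELECT * FROM source "
    f"WHERE code IN ({placeholders})",
    unexpected_list,
)
# ...and optionally remove them from the source table
conn.execute(
    f"DELETE FROM source WHERE code IN ({placeholders})",
    unexpected_list,
)
conn.commit()
```

Using placeholders keeps the query safe regardless of what the unexpected values contain, though with very large unexpected lists a temp table join is usually the better option.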

The above workflow might not be optimal if there are many failing records. As of GE 0.13.10, Great Expectations doesn’t have the option to return the primary key or some other index of a SQL table so you can retrieve all the failing records directly. This would be a fantastic community contribution if anyone is interested in building it. What is needed is a configuration option that tells GE to keep track of a key value for every failing record during validation.

search terms: quarantine queues, separate bad records, separate bad data


@bhcastleton , could you please give me an example of how to store the failed records in a reject table and the passed records in a separate table?

It’s urgent.

Thanks and Regards,
Ankita Hora

Hi @Anki,

You should check out the WAP pattern (write, audit, publish). The general idea is that you use an orchestration tool to run pipelines in a staging or test area or environment, then you run validation tests, and then if the tests pass you publish the data to production. Great Expectations Open Source doesn’t manage this workflow for you. You have to orchestrate the whole thing with an outside tool like Airflow/Prefect/Dagster. Here’s a post with an example of how to put it all together: Great Expectations, dbt, and AirFlow: Building a Robust Data Pipeline

The use case in that article doesn’t specifically break the failing records out into a separate table, but you could add that to your Airflow DAG as a separate node: on validation failure, run SQL to load the failed records into one table and the passing records into another. What you do with failing records is specific to your business case; in some cases you may want to continue running the pipeline with only the passing records. You can orchestrate all of that with Airflow or another orchestration tool.
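
As a rough sketch of that branch, independent of any particular orchestrator: all the function names and the validation rule below are illustrative, not part of Great Expectations or Airflow, and a real pipeline would replace `run_validation` with a GE checkpoint run.

```python
def run_validation(records):
    """Stand-in for a Great Expectations checkpoint: flag negative amounts."""
    failed = [r for r in records if r["amount"] < 0]
    passed = [r for r in records if r["amount"] >= 0]
    return failed, passed

def wap_step(staging, production, quarantine):
    """Write-audit-publish branch: publish on success, split on failure."""
    failed, passed = run_validation(staging)
    if not failed:
        production.extend(staging)    # audit passed: publish everything
    else:
        quarantine.extend(failed)     # reject table for triage
        production.extend(passed)     # optionally continue with good rows
    staging.clear()                   # staging area is emptied either way

# Example run with in-memory lists standing in for tables
staging = [{"id": 1, "amount": 10}, {"id": 2, "amount": -5}]
production, quarantine = [], []
wap_step(staging, production, quarantine)
```

In Airflow the same branch would typically be a sensor or branch operator downstream of the validation task, with the two `extend` calls replaced by SQL inserts.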

Hopefully that gives you an idea how you can put it all together. Alternatively, Great Expectations Cloud will eventually have a slick workflow to manage this process directly. We’re not there yet, but this is a feature on our roadmap for the cloud product.

Hi @bhcastleton ,

Thank you for your response. I have noted down the above points and will try to incorporate them into my framework.

I also wanted to ask: how can I convert the HTML Data Docs output to a CSV file? Are there any predefined options within the GE framework, or any external options? Our requirement is to produce a CSV file containing all the major results.

Thanks and Regards,
Ankita Hora

Hi @Anki,

Instead of converting the HTML into CSV, I would suggest using our standard method of configuring your validation results to be stored in a database (or any of the other standard backends), and then exporting or querying that data however you want before it is rendered as HTML. You can still generate the HTML Data Docs, but also build separate reports from the database, S3, etc.
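
For reference, a validation results store backed by PostgreSQL is configured in `great_expectations.yml` roughly like this; the store name and the `my_postgres_db` credentials key are placeholders, so check the linked guide for the exact, current syntax:

```yaml
stores:
  validations_postgres_store:
    class_name: ValidationsStore
    store_backend:
      class_name: DatabaseStoreBackend
      credentials: ${my_postgres_db}

validations_store_name: validations_postgres_store
```

Once results land in the database, producing a CSV is an ordinary SQL export rather than HTML scraping.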

Here’s one guide on doing that with PostgreSQL. There are also guides for other validation store backends.