Can Great Expectations separate SQL database records that fail tests from my good data?

Great Expectations includes an unexpected_list as part of the validation results if you request results in COMPLETE format.

If you’re running in pandas, you also have the option of getting the unexpected_index_list and using that to bring back the failing records. See: How can I use the result_format unexpected_index_list to select rows from a PandasDataset?
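
As a minimal sketch of that pandas workflow: the Great Expectations call is shown in comments (v2 / 0.13-era PandasDataset API), and the index list is hardcoded here to stand in for the validation output, so the splitting logic itself is what's illustrated. The column name and values are made up.

```python
import pandas as pd

# With Great Expectations' PandasDataset API, a COMPLETE-format result
# exposes the failing row indices, e.g.:
#
#   result = dataset.expect_column_values_to_be_between(
#       "age", min_value=0, max_value=120, result_format="COMPLETE"
#   )
#   bad_idx = result.result["unexpected_index_list"]
#
# Here bad_idx is hardcoded to stand in for that output.
df = pd.DataFrame({"age": [25, 40, -3, 200, 61]})
bad_idx = [2, 3]  # stand-in for result.result["unexpected_index_list"]

bad_rows = df.loc[bad_idx]           # quarantine these records
good_rows = df.drop(index=bad_idx)   # keep processing these
```

From there, `bad_rows` can be written to a quarantine table or file while `good_rows` continues through the pipeline.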

As of GE 0.13.10, if you want to pull all the failing records from a SQL table after validation, you can use the following workflow:

  1. Set the result_format to COMPLETE and run validations.
  2. Load the values from the unexpected_list into a temp table, or format them as a text list, so you can use them in a SQL query.
  3. Use the following query to find your failing records:

SELECT *
FROM YourTable
WHERE ColumnWithFailingValues IN ( {pass in the values from the unexpected_list when result_format is set to COMPLETE} )

  4. Either copy or move these records to another table for triage or cleaning.
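
The steps above can be sketched end to end with SQLite standing in for your database; the table and column names are hypothetical, and the unexpected_list is hardcoded to stand in for the COMPLETE-format validation output:

```python
import sqlite3

# Stand-in for validation output: with result_format COMPLETE, the
# validation result's unexpected_list holds the failing values.
unexpected_list = ["bad-a", "bad-b"]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE source (id INTEGER, code TEXT)")
conn.executemany(
    "INSERT INTO source VALUES (?, ?)",
    [(1, "ok"), (2, "bad-a"), (3, "ok"), (4, "bad-b")],
)

# Empty quarantine table with the same schema as source
conn.execute("CREATE TABLE quarantine AS SELECT * FROM source WHERE 0")

# Parameterize the IN (...) clause rather than pasting values into SQL
placeholders = ", ".join("?" for _ in unexpected_list)

# Step 4: copy the failing records into the quarantine table for triage...
conn.execute(
    f"INSERT INTO quarantine SELECT * FROM source "
    f"WHERE code IN ({placeholders})",
    unexpected_list,
)
# ...and optionally remove them from the source table
conn.execute(
    f"DELETE FROM source WHERE code IN ({placeholders})",
    unexpected_list,
)
conn.commit()
```

Using placeholders keeps the query safe regardless of what the unexpected values contain, though with very large unexpected lists a temp table join is usually the better option.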

The above workflow might not be optimal if there are many failing records. As of GE 0.13.10, Great Expectations doesn’t have the option to return the primary key or some other index of a SQL table so you can retrieve all the failing records directly. This would be a fantastic community contribution if anyone is interested in building it. What is needed is a configuration option that tells GE to keep track of a key value for every failing record during validation.

search terms: quarantine queues, separate bad records, separate bad data


@bhcastleton , could you please give me an example of how to store the failed records in a reject table and the passed records in a separate table?

It’s urgent.

Thanks and Regards,
Ankita Hora

Hi @Anki,

You should check out the WAP pattern (write, audit, publish). The general idea is that you use an orchestration tool to run pipelines in a staging or test area or environment, then you run validation tests, and then if the tests pass you publish the data to production. Great Expectations Open Source doesn’t manage this workflow for you. You have to orchestrate the whole thing with an outside tool like Airflow/Prefect/Dagster. Here’s a post with an example of how to put it all together: Great Expectations, dbt, and AirFlow: Building a Robust Data Pipeline

The use case in that article doesn’t specifically break the failing records out into a separate table, but you could add that to your Airflow DAG as a separate node: on validation failure, run SQL to load the failed records into one table and the passing records into another. What you do with failing records is specific to your business case; in some cases you may want to continue running the pipeline with only the passing records. You can orchestrate all of that with Airflow or another orchestration tool.
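
As a rough sketch of that branch, independent of any particular orchestrator: all the function names and the validation rule below are illustrative, not part of Great Expectations or Airflow, and a real pipeline would replace `run_validation` with a GE checkpoint run.

```python
def run_validation(records):
    """Stand-in for a Great Expectations checkpoint: flag negative amounts."""
    failed = [r for r in records if r["amount"] < 0]
    passed = [r for r in records if r["amount"] >= 0]
    return failed, passed

def wap_step(staging, production, quarantine):
    """Write-audit-publish branch: publish on success, split on failure."""
    failed, passed = run_validation(staging)
    if not failed:
        production.extend(staging)    # audit passed: publish everything
    else:
        quarantine.extend(failed)     # reject table for triage
        production.extend(passed)     # optionally continue with good rows
    staging.clear()                   # staging area is emptied either way

# Example run with in-memory lists standing in for tables
staging = [{"id": 1, "amount": 10}, {"id": 2, "amount": -5}]
production, quarantine = [], []
wap_step(staging, production, quarantine)
```

In Airflow the same branch would typically be a sensor or branch operator downstream of the validation task, with the two `extend` calls replaced by SQL inserts.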

Hopefully that gives you an idea how you can put it all together. Alternatively, Great Expectations Cloud will eventually have a slick workflow to manage this process directly. We’re not there yet, but this is a feature on our roadmap for the cloud product.

Hi @bhcastleton ,

Thank you for your response. I have noted down the above points and will try to incorporate them into my framework.

I also wanted to ask: how can I convert the HTML Data Docs output to a CSV file? Are there any predefined options within the GE framework, or any external options? Our requirement is to produce a CSV file containing all the major results.

Thanks and Regards,
Ankita Hora

Hi @Anki,

Instead of converting the HTML into CSV, I would suggest using our standard method of configuring your validation results to be stored in a database (or any of the other standard backends), and then exporting or querying that data however you want before it is rendered as HTML. You can still generate the HTML Data Docs, but also build separate reports from the database, S3, etc.
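
For reference, a validation results store backed by PostgreSQL is configured in `great_expectations.yml` roughly like this; the store name and the `my_postgres_db` credentials key are placeholders, so check the linked guide for the exact, current syntax:

```yaml
stores:
  validations_postgres_store:
    class_name: ValidationsStore
    store_backend:
      class_name: DatabaseStoreBackend
      credentials: ${my_postgres_db}

validations_store_name: validations_postgres_store
```

Once results land in the database, producing a CSV is an ordinary SQL export rather than HTML scraping.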

Here’s one guide on doing that with PostgreSQL. There are also guides for other validation store backends.