Data Cleaning with expectations

jonozbek · March 4, 2020, 9:49pm

I was introduced to Great Expectations just yesterday so this is going to be a beginner-level, high-level question.

Right now we are focusing on cleaning our Pandas DataFrame datasets (just cleaning, no transform) by defining functions that should match our criteria. It looks like we can accomplish this by using the “expectations” because they return unexpected_index_list, which we can use to filter out unwanted data. We are also planning to implement custom “expectations” within Great Expectations.

Are we gaining any benefit from cleaning our data this way? Or should we just do this cleaning process outside Great Expectations with our own pure Python functions? I am asking to plan our future workflows.

abegong · March 4, 2020, 10:09pm

Great question: you can do it either way.

To be honest, it’d be awesome if you’d try the unexpected_index_list way and see what happens. We’ve heard from at least a couple of other teams that are doing this, and we suspect that there are some powerful design patterns that could be learned and shared. But so far, no one has done a detailed exploration and brought it back to the community.

(This probably isn’t the answer you were expecting, but we’ve always had a strong “build something useful and see where the community takes it” ethos.)

I suspect you’ve got an especially strong case for this mode, since you’re (1) in pandas, not SQL or Spark, and (2) only cleaning, not transforming.

One related pattern that we’ve seen is “quarantine DFs” where bad rows are put in a separate DF for review and/or triaging.
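
For what it’s worth, here is a minimal sketch of that quarantine split in plain pandas. It is not taken from the Great Expectations codebase: the helper name split_by_unexpected_index is made up, the result dict is hand-built to stand in for a real validation result, and it assumes unexpected_index_list holds DataFrame index labels rather than row positions.

```python
import pandas as pd


def split_by_unexpected_index(df, unexpected_index_list):
    """Return (clean, quarantine) DataFrames.

    Hypothetical helper: assumes the list contains index labels of df.
    """
    is_bad = df.index.isin(unexpected_index_list)
    return df[~is_bad], df[is_bad]


df = pd.DataFrame({"age": [34, -1, 27, None, 45]})

# Hand-built stand-in for a validation result (an assumption, not real output).
result = {"success": False, "result": {"unexpected_index_list": [1, 3]}}

clean, quarantine = split_by_unexpected_index(
    df, result["result"]["unexpected_index_list"]
)
print(clean)
print(quarantine)
```

Keeping the quarantine DF around, instead of just dropping the rows, means the bad records can still be reviewed or repaired later.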

Note: since you’re using unexpected_index_list, I assume you’re running validate with result_format set to COMPLETE (https://docs.greatexpectations.io/en/latest/reference/result_format.html#behavior-for-complete). If not, you’ll get a partial_unexpected_index_list, which might not contain all the bad rows.
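
One way to guard against filtering on a truncated list is to refuse to proceed unless the full list is present. This is only a sketch: bad_row_indices is a hypothetical helper, and both result dicts below are hand-built to mimic the two shapes described above, not captured from a real run.

```python
def bad_row_indices(result):
    """Return the full list of bad-row indices, or raise if only a partial list exists.

    Hypothetical helper; assumes the details live under result["result"].
    """
    details = result["result"]
    if "unexpected_index_list" not in details:
        raise ValueError(
            "no unexpected_index_list in result; "
            "was the expectation run with result_format COMPLETE?"
        )
    return details["unexpected_index_list"]


# Hand-built examples of the two shapes (assumptions, not real output).
complete = {"result": {"unexpected_count": 2, "unexpected_index_list": [1, 3]}}
summary = {"result": {"unexpected_count": 2, "partial_unexpected_index_list": [1]}}

print(bad_row_indices(complete))
try:
    bad_row_indices(summary)
except ValueError as err:
    print("refusing to filter:", err)
```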
