Data Cleaning with expectations

I was introduced to Great Expectations just yesterday so this is going to be a beginner-level, high-level question.

Right now we are focusing on cleaning our Pandas DataFrame datasets (just cleaning, no transform) by defining functions that should match our criteria. It looks like we can accomplish this by using the “expectations” because they return unexpected_index_list, which we can use to filter out unwanted data. We are also planning to implement custom “expectations” within Great Expectations.

Are we gaining any benefit from cleaning our data this way? Or should we just do this cleaning process outside Great Expectations with our own pure Python functions? I am asking to plan our future workflows.

Great question: you can do it either way.

To be honest, it’d be awesome if you’d try the unexpected_index_list way and see what happens. We’ve heard from at least a couple of other teams that are doing this, and we suspect that there are some powerful design patterns that could be learned and shared. But so far, no one has done a detailed exploration and brought it back to the community.

(This probably isn’t the answer you were expecting, but we’ve always had a strong “build something useful and see where the community takes it” ethos.)

I suspect you’ve got an especially strong case for this mode, since you’re (1) in pandas, not SQL or Spark, and (2) only cleaning, not transforming.

One related pattern that we’ve seen is “quarantine DFs” where bad rows are put in a separate DF for review and/or triaging.
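
Here is a rough sketch of that quarantine pattern in plain pandas, in case it helps. The unexpected_index_list below is typed in by hand as a stand-in for whatever your expectation returns, and split_quarantine is a hypothetical helper name, not part of Great Expectations. It also assumes the list holds index labels of the DataFrame, not positions.

```python
import pandas as pd


def split_quarantine(df, unexpected_index_list):
    """Split df into (clean, quarantine) by the index labels of bad rows.

    Hypothetical helper for illustration; not a Great Expectations API.
    """
    bad = df.index.isin(unexpected_index_list)
    return df[~bad], df[bad]


df = pd.DataFrame({"age": [34, -1, 52, 250]})

# Hand-written stand-in for the index list an expectation would return.
unexpected_index_list = [1, 3]

clean_df, quarantine_df = split_quarantine(df, unexpected_index_list)
```

If you run several expectations, you would probably want to take the union of their index lists before splitting, so that each bad row lands in the quarantine DF once.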

Note: since you’re using unexpected_index_list, I assume you’re running validate with result_format set to COMPLETE (https://docs.greatexpectations.io/en/latest/reference/result_format.html#behavior-for-complete). If not, you’ll get a partial_unexpected_index_list, which might not contain all the bad rows.
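
One way to guard against filtering on a truncated list is sketched below. The result dicts are hand-written stand-ins shaped like the "result" part of a validation result (they are not real output), and bad_index_from_result is a hypothetical helper name. It assumes unexpected_count is present alongside the partial list.

```python
def bad_index_from_result(result):
    """Return the full list of bad-row indexes from a result dict.

    Hypothetical helper: raises if only a truncated partial list is available.
    """
    if "unexpected_index_list" in result:
        return result["unexpected_index_list"]

    partial = result.get("partial_unexpected_index_list", [])
    if len(partial) < result.get("unexpected_count", 0):
        raise ValueError(
            "Only a partial index list is available; "
            "rerun with result_format set to COMPLETE."
        )
    return partial


# Hand-written stand-ins for illustration, not real validation output.
complete_result = {"unexpected_count": 3, "unexpected_index_list": [1, 3, 7]}
partial_result = {"unexpected_count": 30, "partial_unexpected_index_list": [1, 3, 7]}

bad_rows = bad_index_from_result(complete_result)
```

Calling bad_index_from_result(partial_result) raises a ValueError instead of quietly returning 3 of the 30 bad rows, which is the failure mode to watch for when cleaning this way.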