Pros and Cons of implementing DQ within Data Flow vs after the data is loaded in Datalake

dvmishra · September 12, 2020, 7:45pm

Hello Team,

I am confused about where to implement GE. Should it run within the data pipeline (e.g. NiFi or Airflow) before the data is loaded into the target system, or should it run once all the data has been loaded into a data lake such as Snowflake or S3?

Can you highlight some of the pros and cons of each approach?

Also, would the approach change for streaming data versus batch loads, or would it be the same?

Thanks,
Deepak

abegong · September 12, 2020, 8:06pm

You’re really hitting the mark with good questions today.

Great Expectations supports both of these patterns, and they’re both widely implemented by teams that have deployed Great Expectations.

Let’s call them the Within-Pipeline versus Within-Store patterns.

Within-pipeline testing

In general, Within-Pipeline will give you more control, since you can do things like:

- halt data processing if you discover serious errors in data validation
- trigger follow-up processing depending on specific validation results (e.g. if aggregate expectations show significant feature drift in an ML model)
- directly control the processing of bad rows (e.g. kick off notifications, etc. for triage workflows)

It’s also often easier to configure things like run_ids directly within the pipeline, so that your validation records are fully integrated with your pipeline execution.
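As a rough illustration of the control points described above, here is a minimal Python sketch of a pipeline step that validates a batch before loading it. It deliberately does not call the Great Expectations API: `validate_amounts` is a hypothetical stand-in for wherever a Checkpoint run would go, and `run_step`, `ValidationResult`, `ValidationFailed`, and the `max_bad_fraction` threshold are names assumed for this example only.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


class ValidationFailed(Exception):
    """Raised to halt the pipeline when a batch fails validation badly."""


@dataclass
class ValidationResult:
    run_id: str                      # ties the validation record to the pipeline run
    success: bool
    bad_rows: List[Dict] = field(default_factory=list)


def validate_amounts(rows: List[Dict], run_id: str) -> ValidationResult:
    # Hypothetical stand-in for a real validation call (e.g. a Checkpoint run):
    # every row must carry a non-negative numeric "amount".
    bad = [
        r for r in rows
        if not isinstance(r.get("amount"), (int, float)) or r["amount"] < 0
    ]
    return ValidationResult(run_id=run_id, success=not bad, bad_rows=bad)


def run_step(
    rows: List[Dict],
    run_id: str,
    validate: Callable[[List[Dict], str], ValidationResult],
    load: Callable[[List[Dict]], None],
    max_bad_fraction: float = 0.1,   # assumed threshold for "serious" errors
) -> ValidationResult:
    result = validate(rows, run_id)

    # Halt processing if too large a share of the batch is invalid.
    if rows and len(result.bad_rows) / len(rows) > max_bad_fraction:
        raise ValidationFailed(
            f"run {run_id}: {len(result.bad_rows)} of {len(rows)} rows failed"
        )

    # Otherwise load only the good rows; the caller can route
    # result.bad_rows to a triage workflow or notification.
    good = [r for r in rows if r not in result.bad_rows]
    load(good)
    return result
```

Because the step receives the pipeline's own `run_id`, the returned result can be stored alongside other run metadata. In a real deployment the `validate` callable would wrap whatever validation mechanism the pipeline uses.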

The main downside that I’ve seen is that it sometimes takes a little more work to configure Within-Pipeline validation, since you have to configure Checkpoints and deploy them throughout your pipeline. This isn’t a huge amount of work (and we’re working to streamline it further), but it can take a little bit more to get started.

Within-store testing

The advantages on the Within-Store side are mostly about faster setup and fewer required permissions.

For example, once you’ve configured a SQL datasource, you can iterate over all the important tables and Profile them to generate candidate test suites and descriptive stats. (GE’s Profilers are still pretty rough, but we’re actively working to make them better. In the meantime, you can always extend/improve them yourself.)
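To make the "iterate over tables and profile them" idea concrete, here is a small sketch using only Python's standard `sqlite3` module. It is not GE's Profiler: `profile_tables` and `candidate_not_null_columns` are hypothetical helpers, and SQLite stands in for whatever SQL store you actually use. It only shows the general shape of collecting descriptive stats per table and turning them into candidate tests.

```python
import sqlite3
from typing import Dict, List


def profile_tables(conn: sqlite3.Connection) -> Dict[str, dict]:
    """Collect row counts and per-column null counts for every user table."""
    tables = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' AND name NOT LIKE 'sqlite_%'"
        )
    ]
    profiles = {}
    for table in tables:
        # PRAGMA table_info returns one row per column; index 1 is the name.
        columns = [row[1] for row in conn.execute(f'PRAGMA table_info("{table}")')]
        row_count = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
        null_counts = {
            col: conn.execute(
                f'SELECT COUNT(*) FROM "{table}" WHERE "{col}" IS NULL'
            ).fetchone()[0]
            for col in columns
        }
        profiles[table] = {"row_count": row_count, "null_counts": null_counts}
    return profiles


def candidate_not_null_columns(profiles: Dict[str, dict]) -> Dict[str, List[str]]:
    """Suggest columns that currently contain no nulls as 'not null' candidates."""
    return {
        table: [col for col, n in stats["null_counts"].items() if n == 0]
        for table, stats in profiles.items()
        if stats["row_count"] > 0
    }
```

The candidates are only suggestions drawn from the data as it stands today, so they would still need human review before becoming enforced tests.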

Similarly, we sometimes see teams that don’t have access to production pipeline code. It might be managed by a different team, or even a data vendor. In that case, Within-Store testing might be the only feasible option. Or it might be a quick way to set up a proof-of-concept to convince the upstream team to grant more access to enable a more powerful Within-Pipeline implementation.

To generalize a little bit, I’d say that the most common pattern is starting Within-Store, then adding Within-Pipeline checkpoints over time.

dvmishra · September 12, 2020, 8:20pm

Thanks for the prompt replies. I will get back to you if I have further questions.
