I want to understand: if I am performing data quality checks on a JDBC data source like Oracle or Snowflake, against a dataset that is huge in size, say 50 GB, how does Great Expectations do the processing? Will it read the 50 GB of data into the GE engine, perform the quality checks, and create good and bad records on the filesystem?
Or
Does it offload the DQ checks to the source database and do all of the processing within that database, rather than reading all 50 GB of data into the GE engine, and later store the good and bad records in other database tables, ensuring no data leaves the database and GE utilises the processing power of the source database?
Can you point to any documentation that talks about this, or explain how it works? I have a use case where the table on which I want to perform DQ checks is 400 GB in size, and I want to be considerate about the resource requirements when using GE.
There’s a short description of how this works in the key features section of the docs, but we should probably provide a more detailed write-up.
If you’re using a SqlAlchemy Datasource, then Great Expectations will “compile” Expectations and execute them as queries within your SQL database. That means that virtually all of the processing will be handled within Oracle/Snowflake/etc., with only a modest amount of metadata passed back to the environment where you’re running Great Expectations. That metadata can include rows that contain invalid data*. You can then feed those rows back into your database (or send them to some other system) for triage/review.
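To make that concrete, here's a minimal sketch using the SqlAlchemyDataset interface directly, which is roughly the mechanism a SqlAlchemy Datasource uses under the hood. The connection string, table name, and column name are placeholders for illustration only:

```python
import sqlalchemy as sa
from great_expectations.dataset import SqlAlchemyDataset

# Placeholder connection string -- point this at your Oracle/Snowflake instance.
engine = sa.create_engine("snowflake://<user>:<password>@<account>/<database>/<schema>")

# Wrapping a table does NOT pull rows into Python; it only records the table reference.
batch = SqlAlchemyDataset(table_name="orders", engine=engine)

# This expectation is compiled to SQL (roughly: SELECT COUNT(*) ... WHERE amount IS NULL)
# and executed inside the database; only counts/metadata and, depending on
# result_format, a set of offending values come back to the GE process.
result = batch.expect_column_values_to_not_be_null("amount")
print(result)
```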
That’s usually the best way to implement the kind of system you’re describing. However, there’s also another option: if you prefer, you can pull data from your database using something like pandas.read_sql_query and process it locally. This has a high I/O cost, since you’ll be moving all of the queried data to the environment where Great Expectations is running.
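For comparison, here's a sketch of that local-processing route (connection string, query, and names are again placeholders). Note that read_sql_query materializes every selected row in local memory before any validation runs:

```python
import pandas as pd
import sqlalchemy as sa
import great_expectations as ge

engine = sa.create_engine("oracle+cx_oracle://<user>:<password>@<host>:1521/?service_name=<svc>")

# All queried rows cross the network and land in a local DataFrame.
df = pd.read_sql_query("SELECT * FROM orders", engine)

# Expectations now run locally in pandas rather than in the database.
batch = ge.from_pandas(df)
result = batch.expect_column_values_to_not_be_null("amount")
```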
Does that answer your question?
*Note: you can configure this using the result_format parameter when validating data. If I recall, the behavior is incompletely implemented for some SQL engines. If this is something you’d like to work on implementing, please DM me on Slack and I’ll see if there’s a way for us to collaborate on it.
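For reference, here's roughly what that configuration looks like, continuing the SQL sketch above (the column name and bounds are made up). The available levels are BOOLEAN_ONLY, BASIC, SUMMARY, and COMPLETE, with the richer levels returning the unexpected values themselves where the backend supports it:

```python
result = batch.expect_column_values_to_be_between(
    "amount",
    min_value=0,
    result_format="COMPLETE",  # or "SUMMARY" for only a partial list of offenders
)

# For column-map expectations, COMPLETE includes the unexpected values,
# which you can feed back into your database (or another system) for triage.
print(result)
```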
Thanks, and yes, it does answer my question. To make sure my understanding is aligned with yours, kindly confirm:
1. The default behaviour with Oracle, or any other SQL or distributed database like Snowflake, is to push the processing down to the source system, and all of the processing will be handled within that source system.
2. If using pandas, this behaviour does not apply, as pandas.read_sql_query will pull all of the data back into GE for validation.
3. If bad records are found during the DQ checks, those records will still be returned to the GE engine, from where they can be corrected and ingested back into the source system.
4. If the set of bad records is huge, this may be I/O intensive.
5. Aggregations, group bys, etc. are also pushed down to the source system for processing, and there are no constraints on the generated SQL that would force the data to be returned to the GE engine?
Yes! And I agree there should be some more detailed write-ups on how the DQ checks take place and which data sources are fully compatible, partially compatible, or not compatible.
On 3: Yes, depending on your configuration for result_format.
On 4: Yes. This depends on your configuration, but it is something you should be cautious about.
On 5: I think so? It depends on how you execute your aggregation / group by. As a general rule, Great Expectations is set up to enable native execution, to minimize unwanted I/O costs.
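As a quick illustration of that pushdown, continuing the SqlAlchemyDataset sketch from above (column name and bounds are placeholders), an aggregate-style expectation compiles to an aggregate query:

```python
# Computed as something like SELECT AVG(amount) FROM orders inside the database;
# only the single aggregate value is returned to the GE process.
result = batch.expect_column_mean_to_be_between("amount", min_value=10, max_value=500)
```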
Also on 4: If I/O cost is an important consideration for you, one option to consider is validating your data in two passes (rough code sketch below):
Pass 1: Downsample to 1% of the data and validate the sample.
If the percentage of bad rows within the sample is below some threshold, run Pass 2: validate 100% of the data.
Otherwise, send the bad rows from the sample for inspection to diagnose the problem.
This will guard you against data processing failures that affect lots of rows at once (e.g. an upstream data team stops populating a specific column, so all rows fail validation.)
There is a small cost: when the number of errors doesn’t exceed your threshold, you’ll end up paying 1% more for data processing. Of course, you can tune this parameter, depending on your specific circumstances.
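Here's a rough sketch of that two-pass logic, not yet packaged as a ValidationOperator. The connection string, table and column names, expectations, and threshold are all placeholders, the SAMPLE (1) clause is Snowflake syntax (substitute your engine's sampling equivalent), and it assumes a GE version where expectations return result objects with a .result dict (on older releases, use r["result"] instead):

```python
import sqlalchemy as sa
from great_expectations.dataset import SqlAlchemyDataset

ENGINE = sa.create_engine("snowflake://<user>:<password>@<account>/<database>/<schema>")
TABLE = "orders"
FAILURE_THRESHOLD = 5.0  # max tolerated percentage of bad rows in the sample


def worst_unexpected_percent(dataset):
    """Run the checks you care about and return the worst unexpected_percent observed."""
    results = [
        dataset.expect_column_values_to_not_be_null("amount", result_format="SUMMARY"),
        dataset.expect_column_values_to_be_between("amount", min_value=0, result_format="SUMMARY"),
    ]
    return max((r.result.get("unexpected_percent") or 0.0) for r in results)


# Pass 1: validate a ~1% sample; custom_sql keeps the sampling inside the database.
sample = SqlAlchemyDataset(engine=ENGINE, custom_sql=f"SELECT * FROM {TABLE} SAMPLE (1)")

if worst_unexpected_percent(sample) < FAILURE_THRESHOLD:
    # Pass 2: validate the full table.
    full_table = SqlAlchemyDataset(table_name=TABLE, engine=ENGINE)
    worst_unexpected_percent(full_table)
else:
    print("Sample failed badly; inspect the sample's unexpected rows before running the full pass.")
```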
You can set up this logic as a custom ValidationOperator. Once you’ve got a working version, it’d be awesome if you could share it back here, or even submit it as a PR to the project!