Hello, I am trying to create an expectation that involves two tables and a comparison between them. Can anybody help me get started with it?
For example:
batch.expect_table_row_count_to_equal_other_table(other_table_name, result_format=None, include_config=True, catch_exceptions=None, meta=None)
How do I specify other_table_name in the above expectation using pandas/PySpark?
If that is not possible, can you give an example of how to compare two tables using an expectation in pandas/PySpark?
Thank You.
There are two possible solutions:
- To validate that the two tables have the same number of rows, you can use the Evaluation Parameters feature. It allows the number of rows in one table to become a parameter in an expectation about the number of rows in the second table. This how-to guide shows how to use it: https://docs.greatexpectations.io/en/latest/guides/how_to_guides/creating_and_editing_expectations/how_to_create_expectations_that_span_multiple_batches_using_evaluation_parameters.html
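If both tables fit in memory, a lightweight way to try this out is to register the first table's row count as an evaluation parameter and reference it from an expectation on the second table. Below is a minimal sketch using pandas and the legacy Dataset API; the data frames and the parameter name table_a_row_count are illustrative:

import great_expectations as ge
import pandas as pd

# Illustrative data frames standing in for the two tables
table_a = ge.from_pandas(pd.DataFrame({"id": range(100)}))
table_b = ge.from_pandas(pd.DataFrame({"id": range(100)}))

# Register table_a's observed row count as an evaluation parameter on table_b
table_b.set_evaluation_parameter("table_a_row_count", table_a.shape[0])

# The expectation on table_b references the parameter instead of a hard-coded number
table_b.expect_table_row_count_to_equal(value={"$PARAMETER": "table_a_row_count"})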
- We recommend a design pattern we call a "check asset" for the more general case of creating expectations across tables:
While Great Expectations has nearly 50 built-in expectations, the need for complex data assertions is common.
Check assets are a design pattern that enables even more complex and fine-grained data tests, such as:
- Assertions on slices of the data. For example: How many visits occurred last week?
- Assertions across logical tables. For example: Does my visits table join with my patients table?
A check asset is a slice of data that is only created for validation purposes and may not feed into pipeline or analytical output. It helps express your expectations against a data object that more naturally reflects the meaning of the expectation. For example, a Check Asset for event data might be built as an aggregate rollup binning events by the hour in which they occurred. Then, you can easily express expectations about how many events should happen in the day versus the night.
These check assets should be built in your pipeline's native transformation language.
For example, if your pipeline is primarily SQL, create an additional table or view that slices the data so that you can use the built-in expectations.
Using a check asset introduces a new node into your data pipeline. You should clearly name the expectations about a check asset in a way that makes it easy to understand how it is used in the pipeline, for example by creating an Expectation Suite with the name event_data.time_rollup_check.warning.
PostgreSQL example.
Let's suppose we have a visits table and we want to make an assertion about the typical number of visits in the last 30 days.
- Create a new table with an obvious name like visits.last_30_days that is populated with the following query:
SELECT *
FROM visits
WHERE visit_date > current_date - interval '30' day;
- Create a new Expectation Suite for this new table. Again, we recommend using an obvious name such as visits.last_30_days.warning.
- Add an Expectation as follows:
batch.expect_table_row_count_to_be_between(min_value=2000, max_value=5000)
Yes, it is possible to apply this to pandas or PySpark data frames/tables. You can join the original tables in your code and create expectations for the resulting table.
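For instance, here is a rough sketch of that join-and-validate approach with pandas and the legacy Dataset API; the visits/patients frames and column names are made up for illustration:

import great_expectations as ge
import pandas as pd

# Illustrative source tables
visits = pd.DataFrame({"patient_id": [1, 2, 2], "visit_date": ["2021-01-05", "2021-01-06", "2021-01-07"]})
patients = pd.DataFrame({"patient_id": [1, 2, 3], "name": ["a", "b", "c"]})

# The "check asset": a joined table built only for validation purposes
joined = visits.merge(patients, on="patient_id", how="left", indicator=True)
check_asset = ge.from_pandas(joined)

# Every visit should join to exactly one patient, and the join should not change the row count
check_asset.expect_column_values_to_be_in_set("_merge", ["both"])
check_asset.expect_table_row_count_to_equal(len(visits))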
What is the general approach to dealing with data validation at a more granular level?
For me, I have a lot of use cases that need to group by a few keys and then check the aggregate sum. Let's say I have some dynamic checks, like needing to compare the last 8 weeks of aggregated weekly sales, by product and store, across two tables.
I have these needs because we have two pipelines generating data for our machine learning model: one comes from a historical snapshot table, the other from a live production table. These two tables have identical schemas, but they don't always match each other exactly. They tend to agree on older data, but show larger differences in recent data (data latency etc.).
So far, I find GE could be useful for validating some conditions to make sure our machine learning pipeline does not mess up, but I have not figured out a nice way to handle these checks at a more granular level. GE seems to deal with data validation mostly at the table/column level.
I am interested in general, how did people handle these kind of situations.
Hi @nok,
From your use case, I'm wondering if creating distributional expectations makes sense. If you need to check the last 8 weeks (a new batch) of sales by product, that sounds like a distribution over the product dimension. Are you expecting a certain distribution? Or do you really have granular sums that have to be matched for each partition? If you need to check that a distribution matches the same distribution in a different table, then you can use evaluation parameters to accomplish that.
In pseudocode it would look something like this:
suite1 = expect_column_kl_divergence_to_be_less_than(A, None, None)
suite2 = expect_column_kl_divergence_to_be_less_than(B, suite1.kl_divergence.observed_value, threshold)
Then you have to execute suite 1 followed by suite 2 with the same run id.
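A simpler in-code variant of the same idea, without the cross-suite evaluation parameter plumbing, is to build a partition object from one table and assert the other table's distribution against it. Here is a rough sketch with pandas and the legacy Dataset API; the file names, column name, and threshold are placeholders:

import great_expectations as ge
import pandas as pd
from great_expectations.dataset.util import build_categorical_partition_object

# Illustrative frames standing in for the historical snapshot and the live table
snapshot = ge.from_pandas(pd.read_csv("snapshot_sales.csv"))
live = ge.from_pandas(pd.read_csv("live_sales.csv"))

# Describe how the snapshot's rows are distributed over the product dimension
partition = build_categorical_partition_object(snapshot, "product")

# Assert that the live table's product distribution stays close to the snapshot's
live.expect_column_kl_divergence_to_be_less_than(
    "product", partition_object=partition, threshold=0.1
)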
Should I create an expectation for every combination of dimensions that I need to check? Imagine if I have 100 countries and 1,000 products; then it's 100,000 expectations.
The A or B refers to a column. Are you suggesting I should create batches that partition the original table by product?
Besides, does GE encourage using the YAML config or Python code to build these suites? I am not clear how JSON or YAML can handle these dynamic conditions. Or should I only use notebooks for these multi-batch operations? Those notebooks would no longer be disposable, since a simple JSON file cannot describe them.
@nok - No, you don't need to partition the data separately. I was just referring to each bin of a histogram as a partition.
Please read this documentation about distributional expectations. This should help you conceptualize whatās possible with Great Expectations in this use case. https://docs.greatexpectations.io/en/0.12.0/reference/core_concepts/expectations/distributional_expectations.html?highlight=distributional%20expectations
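For reference, a categorical partition object is just a small dictionary pairing observed values with their relative weights, so for a product dimension it would look roughly like this (the values and weights here are made up):

# Hypothetical categorical partition object over the product dimension
partition = {
    "values": ["product_a", "product_b", "product_c"],
    "weights": [0.5, 0.3, 0.2],
}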
Thanks a lot @bhcastleton, I didn't realize there was a partition object before; I will look into it.