We frequently get questions from users asking how to compare two tables across different datasources, e.g. comparing a CSV file to the same data loaded into a PostgreSQL database, or verifying a database migration from MySQL to Snowflake.
Currently (as of January 2021), there is no Expectation type that can compare tables across different datasources. One possible workaround is the following workflow:
- Load a batch of data from datasource A
- Create an Expectation Suite based on metrics of that batch, either manually or by using a Profiler
- Use that Expectation Suite to validate a batch from datasource B
The code for this would look as follows:
1. Load reference batch
# Import GE and create a new data context
import great_expectations as ge
context = ge.data_context.DataContext()
# This is the Expectation Suite containing the Expectations we use for comparison
suite = context.create_expectation_suite(
    "compare_two_tables_suite",
    overwrite_existing=True
)
# This is the "reference" batch, i.e. our source of truth for metrics
reference_batch_kwargs = {
    "datasource": "my_postgres_db",
    "schema": "public",
    "table": "my_table",
    "data_asset_name": "my_table"
}
reference_batch = context.get_batch(reference_batch_kwargs, suite)
2. Create an Expectation Suite
Option 1: Manually create the Suite
You can now create any kind of Expectation on the reference_batch by getting the respective parameters (e.g. row count, columns) from the batch itself:
# Pin the row count and the ordered column list to the values
# observed in the reference batch
row_count = reference_batch.get_row_count()
reference_batch.expect_table_row_count_to_equal(row_count)

columns = reference_batch.get_table_columns()
reference_batch.expect_table_columns_to_match_ordered_list(columns)
... # create more Expectations if desired
This example only shows table-level metrics. If you want to add column-level metrics (e.g. expect_column_min_to_be_between), you might want to load the batch as a Pandas dataframe to compute more fine-grained metrics, such as the min, max, and mean of numeric columns.
You could also loop over columns in order to create Expectations of the same type for multiple columns.
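To make the looping idea concrete, here is a minimal sketch; the not-null Expectation and the numeric column name "amount" are illustrative assumptions, not prescribed by this workflow:
# Create a not-null Expectation for every column of the reference batch
for column in reference_batch.get_table_columns():
    reference_batch.expect_column_values_to_not_be_null(column)

# For a hypothetical numeric column "amount", pin its min and max
# to the values observed in the reference batch
observed_min = reference_batch.get_column_min("amount")
observed_max = reference_batch.get_column_max("amount")
reference_batch.expect_column_min_to_be_between("amount", observed_min, observed_min)
reference_batch.expect_column_max_to_be_between("amount", observed_max, observed_max)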
Option 2: Use the Profiler to create a Suite
This uses the BasicSuiteBuilderProfiler to create an Expectation Suite which can be used to compare the tables. Note that the current version of this Profiler isn’t very fine-tuned, so you might want to play around with the configuration options and edit the resulting Expectation Suite. We’re currently working on improvements to this Profiler!
from great_expectations.profile.basic_suite_builder_profiler import BasicSuiteBuilderProfiler

# profile() returns the generated Expectation Suite and the validation
# results of running that Suite against the reference batch
suite, evr = BasicSuiteBuilderProfiler.profile(reference_batch)
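If you want to narrow down what the Profiler generates, profile() also accepts a profiler_configuration argument (as used in the v2 suite scaffold workflow). A minimal sketch; the column names are illustrative:
# Only generate Expectations for selected columns (illustrative names)
suite, evr = BasicSuiteBuilderProfiler.profile(
    reference_batch,
    profiler_configuration={"included_columns": ["id", "amount"]}
)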
3. Validate second batch
# Load a batch from the second datasource
comparison_batch_kwargs = {
    "datasource": "my_snowflake_db",
    "table": "my_migrated_table",
    "data_asset_name": "my_migrated_table"
}
# Note that we're creating a batch with the Expectation Suite from the reference batch
comparison_batch = context.get_batch(
    comparison_batch_kwargs,
    reference_batch.get_expectation_suite()
)
# Then validate the comparison batch with the reference suite
results = context.run_validation_operator(
    "action_list_operator",
    assets_to_validate=[comparison_batch]
)
# This value indicates whether all Expectations passed or not
results.success
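If success is False, you can drill into the individual Expectation results. A short sketch, assuming the v2 ValidationOperatorResult API:
# Print every failed Expectation together with its observed result
for validation_result in results.list_validation_results():
    for expectation_result in validation_result.results:
        if not expectation_result.success:
            print(
                expectation_result.expectation_config.expectation_type,
                expectation_result.result
            )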
Note that this workflow doesn’t do an exact comparison of every value across tables, but uses aggregate metrics (such as row count and column names) as a proxy for identity. We definitely welcome contributions that implement exact identity checks!