We frequently get questions from users asking how to compare two tables across different datasources, e.g. comparing a CSV file to the same data loaded into a PostgreSQL database, or verifying a database migration from MySQL to Snowflake.
Currently (as of January 2021), there is no Expectation type that can compare tables across different datasources. One possible workaround is the following workflow:
- Load a batch of data from datasource A
- Create an Expectation Suite based on metrics of that batch, either manually or by using a Profiler
- Use that Expectation Suite to validate a batch from datasource B
The code for this would look as follows:
1. Load reference batch
# Import GE and create a new data context
import great_expectations as ge
context = ge.data_context.DataContext()
# This is the Expectation Suite containing the Expectations we use for comparison
suite = context.create_expectation_suite(
    "compare_two_tables_suite",
    overwrite_existing=True
)
# This is the "reference" batch, i.e. our source of truth for metrics
reference_batch_kwargs = {
    "datasource": "my_postgres_db",
    "schema": "public",
    "table": "my_table",
    "data_asset_name": "my_table"
}
reference_batch = context.get_batch(reference_batch_kwargs, suite)
2. Create an Expectation Suite
Option 1: Manually create the Suite
You can now create any kind of Expectation on the reference_batch by getting the respective parameters (e.g. row count, columns) from the batch itself:
# Pin the row count and the ordered column list to the values
# observed in the reference batch
row_count = reference_batch.get_row_count()
reference_batch.expect_table_row_count_to_equal(row_count)

columns = reference_batch.get_table_columns()
reference_batch.expect_table_columns_to_match_ordered_list(columns)
... # create more Expectations if desired
This example only shows table-level metrics. If you want to add column-level metrics (e.g. expect_column_min_to_be_between), you might want to load the batch as a Pandas dataframe to compute more fine-grained metrics, such as the min, max, and mean of numeric columns.
You could also loop over columns in order to create Expectations of the same type for multiple columns.
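To make the looping idea concrete, here is a minimal sketch; the not-null Expectation and the numeric column name "amount" are illustrative assumptions, not prescribed by this workflow:
# Create a not-null Expectation for every column of the reference batch
for column in reference_batch.get_table_columns():
    reference_batch.expect_column_values_to_not_be_null(column)

# For a hypothetical numeric column "amount", pin its min and max
# to the values observed in the reference batch
observed_min = reference_batch.get_column_min("amount")
observed_max = reference_batch.get_column_max("amount")
reference_batch.expect_column_min_to_be_between("amount", observed_min, observed_min)
reference_batch.expect_column_max_to_be_between("amount", observed_max, observed_max)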
Option 2: Use the Profiler to create a Suite
This uses the BasicSuiteBuilderProfiler to create an Expectation Suite which can be used to compare the tables. Note that the current version of this Profiler isn’t very fine-tuned, so you might want to play around with the configuration options and edit the resulting Expectation Suite. We’re currently working on improvements to this Profiler!
from great_expectations.profile.basic_suite_builder_profiler import BasicSuiteBuilderProfiler

# profile() returns the generated Expectation Suite and the validation
# results of running that Suite against the reference batch
suite, evr = BasicSuiteBuilderProfiler.profile(reference_batch)
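If you want to narrow down what the Profiler generates, profile() also accepts a profiler_configuration argument (as used in the v2 suite scaffold workflow). A minimal sketch; the column names are illustrative:
# Only generate Expectations for selected columns (illustrative names)
suite, evr = BasicSuiteBuilderProfiler.profile(
    reference_batch,
    profiler_configuration={"included_columns": ["id", "amount"]}
)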
3. Validate second batch
# Load a batch from the second datasource
comparison_batch_kwargs = {
    "datasource": "my_snowflake_db",
    "table": "my_migrated_table",
    "data_asset_name": "my_migrated_table"
}
# Note that we're creating a batch with the Expectation Suite from the reference batch
comparison_batch = context.get_batch(
    comparison_batch_kwargs,
    reference_batch.get_expectation_suite()
)
# Then validate the comparison batch with the reference suite
results = context.run_validation_operator(
    "action_list_operator",
    assets_to_validate=[comparison_batch]
)
# This value indicates whether all Expectations passed or not
results.success
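If success is False, you can drill into the individual Expectation results. A short sketch, assuming the v2 ValidationOperatorResult API:
# Print every failed Expectation together with its observed result
for validation_result in results.list_validation_results():
    for expectation_result in validation_result.results:
        if not expectation_result.success:
            print(
                expectation_result.expectation_config.expectation_type,
                expectation_result.result
            )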
Note that this workflow doesn’t do an exact comparison of every value across tables, but uses aggregate metrics (such as row count and column names) as a proxy for identity. We definitely welcome contributions that implement exact identity checks!