Comparing two tables with the user_configurable_profiler
With this workflow, we will use GE's user_configurable_profiler to gauge whether two tables are identical. The workflow is as follows:

- General GE set-up
- Creating a suite from the source data
  - Set up a batch for your source data - this is the data that you trust to be correct
  - Instantiate a profiler and run the `build_suite` method to build a suite from your source data
- Validating your migrated data with your suite
  - Set up a batch for your migrated data - this is the data that you would like to test
  - Run a validation on your migrated data using your created suite.
1. General GE set-up
Imports and setting up your data context
```python
import great_expectations as ge
import snowflake
import sqlalchemy as sa
from great_expectations.profile.user_configurable_profiler import UserConfigurableProfiler

# Make sure to update your context root directory with your own directory
context = ge.data_context.DataContext(
    context_root_dir='/my/context/root/directory/great_expectations'
)
```
2. Creating a suite from the source data
Set up a batch for your source data - this is the data that you trust to be correct
Create your blank suite
```python
expectation_suite_name = "compare_two_tables"
suite = context.create_expectation_suite(
    expectation_suite_name, overwrite_existing=True
)
```
Set up a batch with the source data you will use to create your initial expectations. This batch will consist of data that you trust to be accurate.
```python
batch_kwargs_postgres = {
    "datasource": "my_postgres_db",
    "schema": "public",
    "table": "my_table",
    "data_asset_name": "my_table",
    # If you would like to create a batch from a query rather than a full table, you can use the `query` key
    # demonstrated below instead of the `schema`, `table`, and `data_asset_name` keys.
    # "query": "SELECT * FROM my_schema.my_table WHERE '1988-01-01' <= date AND date < '1989-01-01';"
}
batch_postgres = context.get_batch(batch_kwargs_postgres, suite)
batch_postgres.head()
```
Instantiate a profiler and run the `build_suite` method to build a suite from your source data
Next, we will instantiate a profiler object, passing in our batch:

```python
profiler = UserConfigurableProfiler(dataset=batch_postgres)
```
Now we will call the build_suite method on our instantiated UserConfigurableProfiler to create our suite. This will produce the suite and print each expectation the profiler created, along with the column-type and cardinality information it used to decide which expectations to create for each column.

```python
suite = profiler.build_suite()
```
The output will look something like this:

```
Creating an expectation suite with the following expectations:

Table-Level Expectations
expect_table_columns_to_match_ordered_list
expect_table_row_count_to_be_between

Expectations by Column
Column Name: c_acctbal | Column Data Type: FLOAT | Cardinality: UNIQUE
expect_column_max_to_be_between
expect_column_mean_to_be_between
expect_column_median_to_be_between
expect_column_min_to_be_between
expect_column_proportion_of_unique_values_to_be_between
expect_column_quantile_values_to_be_between
expect_column_values_to_be_in_type_list
expect_column_values_to_not_be_null

Column Name: c_address | Column Data Type: STRING | Cardinality: UNIQUE
expect_column_proportion_of_unique_values_to_be_between
expect_column_values_to_be_in_type_list
expect_column_values_to_not_be_null

Column Name: c_comment | Column Data Type: STRING | Cardinality: UNIQUE
expect_column_proportion_of_unique_values_to_be_between
expect_column_values_to_be_in_type_list
```
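The cardinality labels in the output above (e.g. UNIQUE) reflect how many distinct values a column holds relative to its row count, which is what drives the choice of per-column expectations. As a rough, stdlib-only illustration of the idea (not the profiler's actual implementation), the bucketing might look like this:

```python
# Illustrative sketch only -- not GE's actual cardinality logic.
# Buckets a column's values into a coarse cardinality category,
# similar in spirit to the labels shown in build_suite's output.
def cardinality_category(values):
    total = len(values)
    distinct = len(set(values))
    if distinct == total:
        return "UNIQUE"   # every value is distinct
    if distinct <= 2:
        return "TWO"      # e.g. boolean-like columns
    if distinct / total < 0.1:
        return "FEW"      # low cardinality: value-set expectations make sense
    return "MANY"         # high cardinality: proportion-based expectations instead

print(cardinality_category([1, 2, 3, 4]))         # UNIQUE
print(cardinality_category([True, False, True]))  # TWO
```

The thresholds here are invented for illustration; the point is that a UNIQUE column gets proportion-of-unique-values expectations rather than a value-set expectation.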
3. Validating your migrated data with your suite
Set up a batch for your migrated data - this is the data that you would like to test
Next, we will set up the batch of data that we want to validate. This is the migrated data that we want to ensure is the same as our source data. It is important that we perform this step after creating our expectation suite, because this batch needs to include our newly created suite.
```python
batch_kwargs_snowflake = {
    "datasource": "my_snowflake_db",
    "schema": "public",
    "table": "my_table",
    "data_asset_name": "my_table",
    # If you would like to create a batch from a query rather than a full table, you can use the `query` key
    # demonstrated below instead of the `schema`, `table`, and `data_asset_name` keys.
    # "query": "SELECT * FROM my_schema.my_table WHERE '1988-01-01' <= date AND date < '1989-01-01';"
}
batch_snowflake = context.get_batch(batch_kwargs_snowflake, suite)
batch_snowflake.head()
```
Now that we have our suite (created from our source data) and the migrated batch of data that we would like to validate, we can run our validation.

```python
results = context.run_validation_operator(
    "action_list_operator", assets_to_validate=[batch_snowflake]
)
validation_result_identifier = results.list_validation_result_identifiers()[0]
print(f"The migrated suite passes validation: {results.success}")
```
Our output should look like this:

```
The migrated suite passes validation: False
```
Occasionally, suites that look the same may fail due to idiosyncrasies between back-ends. In cases like these, you may want to exclude a particular expectation or a particular column. These configuration options, and others, are described below.
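Before deciding what to exclude, it helps to see exactly which expectations failed. The dictionary below is a simplified, hypothetical stand-in for the JSON-style validation result that GE produces (field names can vary across GE versions); the point is the filtering pattern:

```python
# Hypothetical, simplified validation result -- a stand-in for GE's
# JSON-style output, used here only to show the filtering pattern.
validation_result = {
    "success": False,
    "results": [
        {"success": True,
         "expectation_config": {
             "expectation_type": "expect_table_row_count_to_be_between",
             "kwargs": {}}},
        {"success": False,
         "expectation_config": {
             "expectation_type": "expect_column_quantile_values_to_be_between",
             "kwargs": {"column": "c_acctbal"}}},
    ],
}

# Collect (expectation_type, column) pairs for every failed expectation,
# giving you candidates for excluded_expectations or ignored_columns.
failed = [
    (r["expectation_config"]["expectation_type"],
     r["expectation_config"]["kwargs"].get("column"))
    for r in validation_result["results"]
    if not r["success"]
]
print(failed)  # [('expect_column_quantile_values_to_be_between', 'c_acctbal')]
```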
If we would like, we can save our expectation suite and build data docs to take a closer look at the created suite.
```python
context.save_expectation_suite(suite, expectation_suite_name)
context.build_data_docs()
context.open_data_docs(validation_result_identifier)
```
Configuring the user_configurable_profiler
General parameters
The user_configurable_profiler can take a few different parameters to further hone the results. For the purposes of comparing two tables, the most useful of these are:

- `excluded_expectations`: Takes a list of expectation names which you want to exclude from the suite. If you find that certain expectation types are problematic across your database back-ends, you can specify an expectation to exclude.
- `ignored_columns`: Takes a list of columns for which you may not want to build expectations. You may want to use this if you have metadata which might not be the same between tables, or if you find that certain columns are problematic across your database back-ends.
- `table_expectations_only`: Takes a boolean. If True, this will only create table-level expectations (i.e., ignoring all columns). Table-level expectations include `expect_table_row_count_to_equal` and `expect_table_columns_to_match_ordered_list`.
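The effect of `excluded_expectations` and `ignored_columns` is essentially a filter over the expectations the profiler would otherwise generate. A minimal sketch of that idea, with the suite represented as plain (expectation_type, column) pairs and hypothetical data rather than GE's internal objects:

```python
# Illustrative sketch of the filtering that excluded_expectations and
# ignored_columns imply. Hypothetical data; not GE's internal representation.
expectations = [
    ("expect_table_row_count_to_be_between", None),  # table-level, no column
    ("expect_column_quantile_values_to_be_between", "c_acctbal"),
    ("expect_column_values_to_not_be_null", "c_comment"),
    ("expect_column_values_to_not_be_null", "c_address"),
]

excluded_expectations = {"expect_column_quantile_values_to_be_between"}
ignored_columns = {"c_comment"}

# Keep an expectation only if its type is not excluded and its column
# (if any) is not ignored.
kept = [
    (etype, col)
    for etype, col in expectations
    if etype not in excluded_expectations and col not in ignored_columns
]
print(kept)
```

Table-level expectations have no column, so `ignored_columns` never removes them; only `table_expectations_only=True` would restrict the suite to that level.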
If you would like to make use of these parameters, you can specify them while instantiating your profiler.
```python
excluded_expectations = ["expect_column_quantile_values_to_be_between"]
ignored_columns = ['c_comment', 'c_acctbal', 'c_mktsegment', 'c_name', 'c_nationkey', 'c_phone']
not_null_only = True
table_expectations_only = False
value_set_threshold = "unique"

suite = context.create_expectation_suite(
    expectation_suite_name, overwrite_existing=True
)
batch_postgres = context.get_batch(batch_kwargs_postgres, suite)

profiler = UserConfigurableProfiler(
    dataset=batch_postgres,
    excluded_expectations=excluded_expectations,
    ignored_columns=ignored_columns,
    not_null_only=not_null_only,
    table_expectations_only=table_expectations_only,
    value_set_threshold=value_set_threshold
)
suite = profiler.build_suite()
```
That’s it!