We got this question from multiple users who want to run their GE Data Context (deployment) against both their production environment and their dev/test environment: how should great_expectations.yml be configured to support this?
Typically teams want to use a Great Expectations Data Context (project) in production and dev environments. We will use “prod” and “dev” labels for brevity.
“prod” is an environment shared by multiple team members where the data is real and validation is crucial for operating the business.
“dev” is a personal environment used by an individual team member to develop, experiment, and test.
Which areas of Great Expectations configuration should vary between “prod” and “dev”?
Metadata stores (expectations_store and validations_store)
“Prod” environments store their validation results and/or expectations in a shared location, such as S3, GCS, or a database. A “dev” environment should have its own store, separate from “prod”, at least for validation results (and maybe for expectations).
Datasources
“Prod” and “dev” might need to validate data from different datasources, such as databases, S3 buckets, etc.
Data Docs
The “prod” Data Docs sites are usually deployed on S3, GCS, or another service that multiple users can access. “Dev” environments should not update the team’s “prod” sites; their Data Docs are best deployed on the team members’ local drives.
Notifications
Checkpoints/Validation Operators that are used for validating data have configurable lists of actions, each performing an action on the validation result. One of the actions in the chain sends notifications of the validation’s success or failure. The default class is SlackNotificationAction, but other messaging platforms can be used. “Dev” environments should not send alerts and notifications that can be mistaken for production ones.
Recommended Approach: One config file with variable substitution
Have one great_expectations.yml, but parametrize all the config properties whose values are environment-dependent. Great Expectations supports ${VAR} variables that are substituted at run time (as shown in this how-to guide).
This feature allows you to supply environment-specific values for these variables from environment variables or a file.
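As a rough sketch of the file-based option, a “dev” copy of the config variables file might look like the one below. It assumes the project keeps the default uncommitted/config_variables.yml location. The variable names are the ones referenced in great_expectations.yml in this post; every value (store names, site name, webhook placeholder) is hypothetical.

```yaml
# uncommitted/config_variables.yml on a developer's machine (hypothetical values).
# Each key fills the matching ${...} placeholder in great_expectations.yml.
EXPECTATIONS_STORE_NAME: expectations_store_dev   # assumed name of a store defined under "stores"
VALIDATIONS_STORE_NAME: validations_store_dev     # assumed name of a store defined under "stores"
ACTIVE_SITE: local_site                           # the pre-built local Data Docs site
# Assumption: a webhook for a private test channel, so "dev" alerts
# cannot be mistaken for "prod" ones.
validation_notification_slack_webhook: "<webhook URL of a test channel>"
```

The “prod” deployment would carry its own copy of this file (or equivalent environment variables) pointing at the shared stores and the team's Data Docs site.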
- In the stores section of the config, define two variations of the expectations store and of the validation results store, giving them distinct names.
- Set the default store names:
  expectations_store_name: ${EXPECTATIONS_STORE_NAME}
  validations_store_name: ${VALIDATIONS_STORE_NAME}
- Define two Data Docs sites in the data_docs_sites section.
  The local_site that comes pre-built is appropriate for “dev”, since it uses the local filesystem.
  Add the config of a site to be used for “prod”.
- Validation Operators write to the stores, update Data Docs, and send notifications. Parametrize the configuration of the actions of Validation Operators.
  For example:
  ...
  class_name: ActionListValidationOperator
  action_list:
    - name: store_validation_result
      action:
        class_name: StoreValidationResultAction
        target_store_name: ${VALIDATIONS_STORE_NAME}
    - name: update_data_docs
      action:
        class_name: UpdateDataDocsAction
        site_names:
          - ${ACTIVE_SITE}
    - name: send_slack_notification_on_validation_result
      action:
        class_name: SlackNotificationAction
        slack_webhook: ${validation_notification_slack_webhook}
        notify_on: all
        notify_with:
        renderer:
          module_name: great_expectations.render.renderer.slack_renderer
          class_name: SlackRenderer
  ...
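To make the store and Data Docs steps more concrete, here is one possible shape for the stores and data_docs_sites sections. It is a sketch, not a definitive configuration. The store and site names (expectations_store_dev, expectations_store_prod, validations_store_dev, validations_store_prod, s3_site), the bucket placeholders, and the prefixes are all hypothetical. The class names are the filesystem and S3 backends that ship with Great Expectations, assumed here to match your installed version.

```yaml
# Fragment of great_expectations.yml (names, buckets, and prefixes are placeholders).
expectations_store_name: ${EXPECTATIONS_STORE_NAME}
validations_store_name: ${VALIDATIONS_STORE_NAME}

stores:
  expectations_store_dev:          # local store for "dev"
    class_name: ExpectationsStore
    store_backend:
      class_name: TupleFilesystemStoreBackend
      base_directory: expectations/
  expectations_store_prod:         # shared store for "prod"
    class_name: ExpectationsStore
    store_backend:
      class_name: TupleS3StoreBackend
      bucket: "<your S3 bucket>"
      prefix: expectations
  validations_store_dev:
    class_name: ValidationsStore
    store_backend:
      class_name: TupleFilesystemStoreBackend
      base_directory: uncommitted/validations/
  validations_store_prod:
    class_name: ValidationsStore
    store_backend:
      class_name: TupleS3StoreBackend
      bucket: "<your S3 bucket>"
      prefix: validations

data_docs_sites:
  local_site:                      # pre-built site, selected when ACTIVE_SITE is local_site
    class_name: SiteBuilder
    store_backend:
      class_name: TupleFilesystemStoreBackend
      base_directory: uncommitted/data_docs/local_site/
    site_index_builder:
      class_name: DefaultSiteIndexBuilder
  s3_site:                         # hypothetical team site for "prod"
    class_name: SiteBuilder
    store_backend:
      class_name: TupleS3StoreBackend
      bucket: "<your S3 bucket for Data Docs>"
    site_index_builder:
      class_name: DefaultSiteIndexBuilder
```

Because both variants live in the same file, switching environments only changes which names the variables resolve to; nothing in great_expectations.yml itself has to be edited.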
Alternative approach: separate config files
Some users have two great_expectations.yml config files: one for “prod” and the other for “dev”. Each has its own configuration of Datasources, expectations store, validation results store, Data Docs site, and Validation Operator. If you are running GE in Docker, set the file to be used when starting the container. Outside of Docker, use manual (or scripted) renaming.
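For the Docker case, one way to pick the file at container start is to mount the chosen variant over the file name Great Expectations looks for. The docker-compose fragment below is only an illustration: the service name, the image, the file name great_expectations.prod.yml, and the container path /app/great_expectations are all hypothetical and depend on how your image is laid out.

```yaml
# docker-compose.yml sketch (service, image, and paths are hypothetical).
services:
  ge_validation:
    image: my-team/ge-runner:latest   # hypothetical image that runs the validation
    volumes:
      # Mount the "prod" variant, read-only, as the project's great_expectations.yml.
      - ./great_expectations.prod.yml:/app/great_expectations/great_expectations.yml:ro
```

A “dev” compose file would mount the “dev” variant at the same container path instead.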
Please use comments to ask questions and share your best practices for configuring Great Expectations for “prod” and “dev” environments.