A super-simple alternative introduction to Great Expectations

In a recent Slack conversation, Ian pointed out that the documentation for Great Expectations is kinda overwhelming.

It leans heavily into setting up GE in production, which means you’ll need abstractions like DataContexts, ValidationOperators, and so on.

However, if you’re just getting your feet wet with the library, that’s a lot to take in at once. Maybe there’s a simpler way.

Ian threw together two super-simple scripts to get started. They use only Expectations and the validate method. Nothing else.

What do you think? Would this be a good way to help people get oriented to Great Expectations at the very beginning?

MVP Great Expectations in pandas:

import great_expectations as ge

# Build up expectations on a sample dataset and save them
train = ge.read_csv("data/npi.csv")
train.expect_column_values_to_not_be_null("NPI")
train.save_expectation_suite("npi_csv_expectations.json")

# Load in a new dataset and test them
test = ge.read_csv("data/npi_new.csv")
validation_results = test.validate(expectation_suite="npi_csv_expectations.json")

if validation_results["success"]:
    print("giddy up!")
else:
    raise Exception("oh shit.")

MVP Great Expectations for SQLAlchemy:

import os
from great_expectations.dataset import SqlAlchemyDataset
from sqlalchemy import create_engine

db_string = "postgres://{user}:{password}@{host}:{port}/{dbname}".format(
    user=os.environ["DB_USER"],
    password=os.environ["DB_PASSWORD"],
    port=os.environ["DB_PORT"],
    dbname=os.environ["DB_DBNAME"],
    host=os.environ["DB_HOST"],
)

db_engine = create_engine(db_string)

# Build up expectations on a table and save them
sql_dataset = SqlAlchemyDataset(table_name='my_table', engine=db_engine)
sql_dataset.expect_column_values_to_not_be_null("id")
sql_dataset.save_expectation_suite("postgres_expectations.json")

# Load in a subset of the table and test it
sql_query = """
    select *
    from my_table
    where created_at between date'2019-11-07' and date'2019-11-08'
"""
new_sql_dataset = SqlAlchemyDataset(custom_sql=sql_query, engine=db_engine)
validation_results = new_sql_dataset.validate(
    expectation_suite="postgres_expectations.json"
)

if validation_results["success"]:
    print("giddy up!")
else:
    raise Exception("oh shit.")

I’m quite in favour of a simpler introduction. Something that lets us operate directly on DataFrames would make it much more lightweight. It seems to me that having multiple API layers, some lower-level and some higher-level, targeting everyone from power users to convenience users, would be desirable.
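For what it’s worth, the lower-level API already gets part of the way there. Here’s a rough sketch of what operating directly on an in-memory DataFrame could look like with ge.from_pandas (the toy DataFrame is just for illustration):

import pandas as pd
import great_expectations as ge

# An ordinary DataFrame that already lives in memory
df = pd.DataFrame({"id": [1, 2, 3], "amount": [9.99, 15.00, 7.50]})

# Wrap it; the expect_* methods become available directly on the wrapped object
ge_df = ge.from_pandas(df)
result = ge_df.expect_column_values_to_not_be_null("id")
print(result)  # each expect_* call returns its own validation result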


Yes, we need both.

I’d estimate that 75% of GE users want the full DataContext experience, and 25% want the simpler version. A lot of integrations and plugins for other tools end up in the 25%. So we need to treat both as first-class citizens.

@ericmjl and I are finding time to discuss this soon.

If anyone else wants to weigh in, please speak up!

25% want the simpler version

That would be me 🙂

Here’s the example I currently have (using the Boston Housing data):

import os

import great_expectations as ge
from great_expectations.profile.basic_dataset_profiler import BasicDatasetProfiler
from great_expectations.render.renderer import (
    ExpectationSuitePageRenderer,
    ProfilingResultsPageRenderer,
    ValidationResultsPageRenderer
)
from great_expectations.render.view import DefaultJinjaPageView


INPUT_PATH = "/tmp/BostonHousing.csv"
OUTPUT_DIR = "/tmp"


def _generate_html(obj, renderer, filename):
    document_model = renderer.render(obj)
    with open(os.path.join(OUTPUT_DIR, filename), "w") as writer:
        writer.write(DefaultJinjaPageView().render(document_model))


if __name__ == "__main__":
    # Load data
    ds = ge.read_csv(INPUT_PATH)

    # Run some expectations ...
    ds.expect_column_values_to_not_be_null("b")
    ds.expect_column_values_to_not_be_null("age")
    ds.expect_column_mean_to_be_between("age", 7, 16)  # This expectation will NOT be met
    ds.expect_column_stdev_to_be_between("age", 2, 20)

    # Get information about the expectations that just ran
    validation_results = ds.validate()
    expectation_suite = ds.get_expectation_suite(discard_failed_expectations=False)
    # Note: validation_results and expectation_suite are JSON serializable/deserializable

    # Generate HTML
    _generate_html(validation_results, ValidationResultsPageRenderer(), "validation_results.html")
    _generate_html(expectation_suite, ExpectationSuitePageRenderer(), "expectation_suite.html")

    # Next, let's do some profiling (i.e. let the library analyze our dataset automatically)
    profiler = BasicDatasetProfiler()
    expectation_suite, validation_results = profiler.profile(ds)
    _generate_html(validation_results, ProfilingResultsPageRenderer(), "profile.html")
    _generate_html(validation_results, ValidationResultsPageRenderer(), "validation_results_2.html")
    _generate_html(expectation_suite, ExpectationSuitePageRenderer(), "expectation_suite_2.html")
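Since the comment above notes that validation_results and expectation_suite are JSON serializable, persisting them alongside the HTML reports is just a couple of extra lines. Roughly (this assumes the objects expose to_json_dict(), as they do in recent releases; older versions return plain dicts that can be passed straight to json.dump):

import json

# Continuing from the script above: save the raw results next to the rendered HTML
with open(os.path.join(OUTPUT_DIR, "validation_results.json"), "w") as writer:
    json.dump(validation_results.to_json_dict(), writer, indent=2)

with open(os.path.join(OUTPUT_DIR, "expectation_suite.json"), "w") as writer:
    json.dump(expectation_suite.to_json_dict(), writer, indent=2)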

Nice!

I like the idea of _generate_html. That’s a nice utility function for the 25%. 🙂

Also, from your comments, it sounds like you might be thinking that ds.validate reports out the results from the expect_* commands run previously.

Strictly speaking, that’s not the case. Each expect_* method will run its own validation. ds.validate will then execute the full suite of validations again.

This is because we originally imagined the expect_* methods as exploratory methods, with the intent that expectations would be serialized and stored as a kind of test suite. Then when you want to validate new data, you can fetch the appropriate Expectations and just run validate.
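Concretely, the difference looks something like this (file and column names here are just placeholders):

import great_expectations as ge

batch = ge.read_csv("data/sample.csv")

# Exploratory call: this validates the column right now and returns its own result
single_result = batch.expect_column_values_to_not_be_null("id")
print(single_result)

# validate() does not replay the result above; it re-executes every expectation
# accumulated on this dataset against the data
full_results = batch.validate()
print(full_results["success"])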

In other words, we’ve tended to treat expectation suites more like test fixtures than assertions. Overall, I believe that will stay the more common use case (because it lets you use the expectation metadata to do other useful things), but there’s enough call for assertion-style expectations that we’re thinking about implementing them, too.

For related discussion, see https://github.com/great-expectations/great_expectations/issues/1219

Abe, thanks for the detailed reply.

sounds like you might be thinking that ds.validate reports out the results from the expect_* commands run previously

You are absolutely correct! Thank you so much for the clarification.

originally imagined the expect_* methods as exploratory … expectations would be serialized

This is exactly the clarification I was looking for!
Since I didn’t need to serialize expectations (they are specified in Python and committed to version control), I didn’t understand how the serialized expectations fit into the picture (other than as an export format).

The current design makes total sense now, but I do see an issue regarding adoption in large/enterprise systems.
For example: with JSON as the source of truth for expectation suites, we end up adding notebooks to the mix in order to edit expectations. With Python code as the source of truth, I’d be able to edit expectations in my favourite IDE (with all the plugins I have running).
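To illustrate what I mean by Python as the source of truth, here’s a rough sketch of the kind of module I’d keep in version control (the file name, column, and helper function are hypothetical):

# expectations_npi.py -- hypothetical module committed to version control
import great_expectations as ge


def apply_expectations(ds):
    """Attach our expectations to any GE dataset (pandas, SQL, ...)."""
    ds.expect_column_values_to_not_be_null("NPI")
    ds.expect_column_values_to_be_unique("NPI")
    return ds


if __name__ == "__main__":
    batch = ge.read_csv("data/npi_new.csv")
    results = apply_expectations(batch).validate()
    print(results["success"])

Editing that file is ordinary Python development; no notebook round-trip required.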

The general point is … when I evaluate tools, I focus on how many dependencies (libraries, tools, workflows, etc.) they introduce into my current workflow, and whether I can gradually adopt them.

there’s enough call for assertion-style expectations that we’re thinking about implementing them, too

I would love to be a part of the conversation and, hopefully, contribute to the effort.


@Joey, yep, we’re right with you on notebooks.

Starting in 0.9.0, we began piloting the use of auto-generated, disposable notebooks to edit Expectations. I recorded an example video here to show the workflow: https://www.loom.com/share/6ed3df93b0394b1398774c7a152eb5a2

We don’t see notebooks as the only way to create and edit Expectations, but they’re a tool that most data scientists/analysts/engineers are comfortable with, and they offer a nice combination of quick interaction and the ability to build powerful scripts.

Profilers (still in relatively early stages of development) are another tool that will help streamline the workflow for creating and editing Expectations.

I’m curious if this approach appeals to you, or if you see big holes. If you’re up for it, I’d love to set up a call to discuss.

Philosophically, we believe the right long-term strategy is a “power tools” approach that partially automates the Expectations creation/editing workflow and gives developers a ton of control by integrating smoothly with the tools they already use. Lots of details TBD.