A super-simple alternative introduction to Great Expectations

In a recent Slack conversation, Ian pointed out that the documentation for Great Expectations is kinda overwhelming.

It leans heavily into setting up GE in production, which means you’ll need abstractions like DataContexts, ValidationOperators, and so on.

However, if you’re just getting your feet wet with the library, that’s a lot to take in at once. Maybe there’s a simpler way.

Ian threw together two super-simple scripts to get started. They use only Expectations and the validate method. Nothing else.

What do you think? Would this be a good way to help people get oriented to Great Expectations at the very beginning?

MVP Great Expectations in pandas:

import great_expectations as ge

# Build up expectations on a sample dataset and save them
train = ge.read_csv("data/npi.csv")
train.expect_column_values_to_not_be_null("NPI")
train.save_expectation_suite("npi_csv_expectations.json")

# Load in a new dataset and test them
test = ge.read_csv("data/npi_new.csv")
validation_results = test.validate(expectation_suite="npi_csv_expectations.json")

if validation_results["success"]:
    print("giddy up!")
else:
    raise Exception("oh shit.")

MVP Great Expectations for SQLAlchemy:

import os
from great_expectations.dataset import SqlAlchemyDataset
from sqlalchemy import create_engine

db_string = "postgres://{user}:{password}@{host}:{port}/{dbname}".format(
    user=os.environ["DB_USER"],
    password=os.environ["DB_PASSWORD"],
    port=os.environ["DB_PORT"],
    dbname=os.environ["DB_DBNAME"],
    host=os.environ["DB_HOST"],
)

db_engine = create_engine(db_string)

# Build up expectations on a table and save them
sql_dataset = SqlAlchemyDataset(table_name='my_table', engine=db_engine)
sql_dataset.expect_column_values_to_not_be_null("id")
sql_dataset.save_expectation_suite("postgres_expectations.json")

# Load in a subset of the table and test it
sql_query = """
    select *
    from my_table
    where created_at between date'2019-11-07' and date'2019-11-08'
"""
new_sql_dataset = SqlAlchemyDataset(custom_sql=sql_query, engine=db_engine)
validation_results = new_sql_dataset.validate(
    expectation_suite="postgres_expectations.json"
)

if validation_results["success"]:
    print("giddy up!")
else:
    raise Exception("oh shit.")

I’m quite in favour of a simpler introduction. Something that lets us operate directly on DataFrames would make it much more lightweight. It seems to me that having multiple API layers, some lower-level and some higher-level, targeting everyone from power users to convenience users, would be desirable.
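For what it’s worth, the lower-level API already gets part of the way there. Here’s a rough sketch of what operating directly on an in-memory DataFrame could look like with ge.from_pandas (the toy DataFrame is just for illustration):

import pandas as pd
import great_expectations as ge

# An ordinary DataFrame that already lives in memory
df = pd.DataFrame({"id": [1, 2, 3], "amount": [9.99, 15.00, 7.50]})

# Wrap it; the expect_* methods become available directly on the wrapped object
ge_df = ge.from_pandas(df)
result = ge_df.expect_column_values_to_not_be_null("id")
print(result)  # each expect_* call returns its own validation result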


Yes, we need both.

I’d estimate that 75% of GE users want the full DataContext experience, and 25% want the simpler version. A lot of integrations and plugins for other tools end up in the 25%. So we need to treat both as first-class citizens.

@ericmjl and I are finding time to discuss this soon.

If anyone else wants to weigh in, please speak up!

25% want the simpler version

That would be me 🙂

Here’s the example I currently have (using the Boston Housing data):

import os

import great_expectations as ge
from great_expectations.profile.basic_dataset_profiler import BasicDatasetProfiler
from great_expectations.render.renderer import (
    ExpectationSuitePageRenderer,
    ProfilingResultsPageRenderer,
    ValidationResultsPageRenderer
)
from great_expectations.render.view import DefaultJinjaPageView


INPUT_PATH = "/tmp/BostonHousing.csv"
OUTPUT_DIR = "/tmp"


def _generate_html(obj, renderer, filename):
    document_model = renderer.render(obj)
    with open(os.path.join(OUTPUT_DIR, filename), "w") as writer:
        writer.write(DefaultJinjaPageView().render(document_model))


if __name__ == "__main__":
    # Load data
    ds = ge.read_csv(INPUT_PATH)

    # Run some expectations ...
    ds.expect_column_values_to_not_be_null("b")
    ds.expect_column_values_to_not_be_null("age")
    ds.expect_column_mean_to_be_between("age", 7, 16)  # This expectation will NOT be met
    ds.expect_column_stdev_to_be_between("age", 2, 20)

    # Get information about the expectations that just ran
    validation_results = ds.validate()
    expectation_suite = ds.get_expectation_suite(discard_failed_expectations=False)
    # Note: validation_results and expectation_suite are JSON serializable/deserializable

    # Generate HTML
    _generate_html(validation_results, ValidationResultsPageRenderer(), "validation_results.html")
    _generate_html(expectation_suite, ExpectationSuitePageRenderer(), "expectation_suite.html")

    # Next, let's do some profiling (i.e. let the library analyze our dataset automatically)
    profiler = BasicDatasetProfiler()
    expectation_suite, validation_results = profiler.profile(ds)
    _generate_html(validation_results, ProfilingResultsPageRenderer(), "profile.html")
    _generate_html(validation_results, ValidationResultsPageRenderer(), "validation_results_2.html")
    _generate_html(expectation_suite, ExpectationSuitePageRenderer(), "expectation_suite_2.html")
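Since the comment above notes that validation_results and expectation_suite are JSON serializable, persisting them alongside the HTML reports is just a couple of extra lines. Roughly (this assumes the objects expose to_json_dict(), as they do in recent releases; older versions return plain dicts that can be passed straight to json.dump):

import json

# Continuing from the script above: save the raw results next to the rendered HTML
with open(os.path.join(OUTPUT_DIR, "validation_results.json"), "w") as writer:
    json.dump(validation_results.to_json_dict(), writer, indent=2)

with open(os.path.join(OUTPUT_DIR, "expectation_suite.json"), "w") as writer:
    json.dump(expectation_suite.to_json_dict(), writer, indent=2)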

Nice!

I like the idea of _generate_html. That’s a nice utility function for the 25%. 🙂

Also, from your comments, it sounds like you might be thinking that ds.validate reports out the results from the expect_* commands run previously.

Strictly speaking, that’s not the case. Each expect_* method will run its own validation. ds.validate will then execute the full suite of validations again.

This is because we originally imagined the expect_* methods as exploratory methods, with the intent that expectations would be serialized and stored as a kind of test suite. Then when you want to validate new data, you can fetch the appropriate Expectations and just run validate.
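Concretely, the difference looks something like this (file and column names here are just placeholders):

import great_expectations as ge

batch = ge.read_csv("data/sample.csv")

# Exploratory call: this validates the column right now and returns its own result
single_result = batch.expect_column_values_to_not_be_null("id")
print(single_result)

# validate() does not replay the result above; it re-executes every expectation
# accumulated on this dataset against the data
full_results = batch.validate()
print(full_results["success"])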

In other words, we’ve tended to treat expectation suites more like test fixtures than assertions. Overall, I believe that will stay the more common use case (because it lets you use the expectation metadata to do other useful things), but there’s enough call for assertion-style expectations that we’re thinking about implementing them, too.

For related discussion, see https://github.com/great-expectations/great_expectations/issues/1219

Abe, thanks for the detailed reply.

sounds like you might be thinking that ds.validate reports out the results from the expect_* commands run previously

You are absolutely correct! Thank you so much for the clarification.

originally imagined the expect_* methods as exploratory … expectations would be serialized

This is exactly the clarification I was looking for!
Since I didn’t need to serialize expectations (they are specified in Python and committed to version control), I didn’t understand how the serialized expectations fit into the picture (other than as an export format).

The current design makes total sense now, but I do see an issue regarding adoption in large/enterprise systems.
For example: with JSON as the source of truth for expectation suites, we end up adding notebooks to the mix in order to edit expectations. With Python code as the source of truth, I’d be able to edit expectations in my favourite IDE (with all the plugins I have running).
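To illustrate what I mean by Python as the source of truth, here’s a rough sketch of the kind of module I’d keep in version control (the file name, column, and helper function are hypothetical):

# expectations_npi.py -- hypothetical module committed to version control
import great_expectations as ge


def apply_expectations(ds):
    """Attach our expectations to any GE dataset (pandas, SQL, ...)."""
    ds.expect_column_values_to_not_be_null("NPI")
    ds.expect_column_values_to_be_unique("NPI")
    return ds


if __name__ == "__main__":
    batch = ge.read_csv("data/npi_new.csv")
    results = apply_expectations(batch).validate()
    print(results["success"])

Editing that file is ordinary Python development; no notebook round-trip required.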

The general point is … when I evaluate tools, I focus on how many dependencies (libraries, tools, workflows, etc.) they introduce into my current workflow, and whether I can gradually adopt them.

there’s enough call for assertion-style expectations that we’re thinking about implementing them, too

I would love to be a part of the conversation and, hopefully, contribute to the effort.


@Joey, yep, we’re right with you on notebooks.

Starting in 0.9.0, we began piloting the use of auto-generated, disposable notebooks to edit Expectations. I recorded an example video here to show the workflow: https://www.loom.com/share/6ed3df93b0394b1398774c7a152eb5a2

We don’t see notebooks as the only way to create and edit Expectations, but they’re a tool that most data scientists/analysts/engineers are comfortable with, and they offer a nice combination of quick interaction and the ability to build powerful scripts.

Profilers (still in relatively early stages of development) are another tool that will help streamline the workflow for creating and editing Expectations.

I’m curious if this approach appeals to you, or if you see big holes. If you’re up for it, I’d love to set up a call to discuss.

Philosophically, we believe the right long-term strategy is a “power tools” approach that partially automates the Expectations creation/editing workflow and gives developers a ton of control by integrating smoothly with the tools they already use. Lots of details TBD.