Can I use Great Expectations with dask?

Inquiring minds want to know.

Quick copy of my post in the great expectations slack channel:

For people wondering how to use GE with dask (distributed), and dockerized, here’s a quick write-up on how I did it:

  1. quick and dirty way to install great expectations on the dask workers:
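def install_ge():
    # Runs once on each worker; installs into that worker's Python environment.
    import subprocess
    import sys
    subprocess.check_call([sys.executable, "-m", "pip", "install", "great_expectations"])

# dask_client is an existing dask.distributed.Client
dask_client.register_worker_callbacks(install_ge)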
  2. cast the pandas dataframes underlying the dask dataframe to GE's PandasDataset and run the validation suite on each partition:
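import pandas as pd
from great_expectations.dataset import PandasDataset  # legacy Dataset API

def run_suite(data, data_name, suite):
    # data: a dask dataframe; suite: the expectation suite to validate against
    def run_partition(data_in):
        # each partition arrives as a plain pandas dataframe
        return pd.Series(PandasDataset(data_in).validate(expectation_suite=suite)["results"])
    results = data.map_partitions(run_partition).persist()
    for result in results.compute():
        # do something with the results, like aggregating them
        pass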
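
For the "do something with the results" part, here is a rough sketch of one way to aggregate. It is not part of what I originally ran, and it assumes each result is a dict shaped like the legacy GE results (`success`, `expectation_config` with `expectation_type` and `kwargs`, and a `result` dict carrying `element_count` and `unexpected_count`). The helper name `aggregate_results` and the sample data are made up for illustration:

```python
from collections import defaultdict


def aggregate_results(partition_results):
    """Combine per-partition expectation results into one summary per expectation.

    Assumes legacy-style result dicts; adjust the keys if your GE version differs.
    """
    summary = defaultdict(
        lambda: {"success": True, "element_count": 0, "unexpected_count": 0, "partitions": 0}
    )
    for res in partition_results:
        config = res["expectation_config"]
        # Key on the expectation type plus its column (None for table-level checks).
        key = (config["expectation_type"], config.get("kwargs", {}).get("column"))
        entry = summary[key]
        # An expectation passes overall only if it passed on every partition.
        entry["success"] = entry["success"] and bool(res["success"])
        details = res.get("result", {})
        entry["element_count"] += details.get("element_count") or 0
        entry["unexpected_count"] += details.get("unexpected_count") or 0
        entry["partitions"] += 1
    return dict(summary)


# Hypothetical results from two partitions, for illustration only.
example = [
    {
        "success": True,
        "expectation_config": {
            "expectation_type": "expect_column_values_to_not_be_null",
            "kwargs": {"column": "id"},
        },
        "result": {"element_count": 100, "unexpected_count": 0},
    },
    {
        "success": False,
        "expectation_config": {
            "expectation_type": "expect_column_values_to_not_be_null",
            "kwargs": {"column": "id"},
        },
        "result": {"element_count": 100, "unexpected_count": 3},
    },
]

summary = aggregate_results(example)
for key, entry in summary.items():
    print(key, entry)
```

Requiring every partition to pass is the strict reading; it only makes sense for row-wise expectations, which is exactly the caveat below.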

Fair warning:
Expectations that depend on the dataset as a whole won't really work this way, e.g. ones that compare the relative amounts of things. Each partition is validated on its own, and the values might not be scattered evenly across partitions.
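
To make that concrete, here is a toy illustration in plain Python, with lists standing in for partitions (no dask or GE involved, and the helper names are hypothetical). The check "at least 50% of the values are `a`" holds for the whole dataset but fails on one of the partitions, so per-partition verdicts are misleading; summing the raw counts first gives the right answer:

```python
def share_of(values, target):
    """Fraction of entries in `values` equal to `target`."""
    return sum(v == target for v in values) / len(values)


def meets_share(values, target, min_share):
    """True if `target` makes up at least `min_share` of `values`."""
    return share_of(values, target) >= min_share


# Hypothetical data: 8 rows split into two unevenly mixed partitions.
partitions = [["a", "a", "a", "a"], ["a", "b", "b", "b"]]
all_rows = [v for part in partitions for v in part]

# Verdict per partition vs. verdict on the full dataset.
per_partition = [meets_share(part, "a", 0.5) for part in partitions]
whole_dataset = meets_share(all_rows, "a", 0.5)

# Safer: collect counts per partition, then evaluate the proportion once.
hits = sum(part.count("a") for part in partitions)
total = sum(len(part) for part in partitions)
combined = hits / total >= 0.5

print(per_partition)  # [True, False]
print(whole_dataset)  # True
print(combined)       # True
```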