Inquiring minds want to know.
Quick copy of my post in the Great Expectations Slack channel:
For anyone wondering how to use GE with Dask (distributed) in a dockerized setup, here’s a quick write-up of how I did it:
- quick and dirty way to install great expectations on the dask workers:
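def install_ge():
    # Runs on each worker when it starts; installs great_expectations with pip
    import os
    os.system("pip install great_expectations")
# dask_client is an already-created dask.distributed.Client
dask_client.register_worker_callbacks(install_ge)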
- wrap each pandas dataframe underlying the dask dataframe in a GE PandasDataset and run the validation suite on it:
import pandas as pd
from great_expectations.dataset import PandasDataset

def run_suite(data, data_name, suite):
    def run_partition(data_in):
        # Validate one pandas partition and return its results as a Series
        return pd.Series(PandasDataset(data_in).validate(expectation_suite=suite)["results"])
    results = data.map_partitions(run_partition).persist()
    for result in results.compute():
        # do something with the results, e.g. aggregate them
        pass
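One possible way to fill in the "do something with the results" step is to treat an expectation as passed only if it passed on every partition. The sketch below is plain Python and uses a simplified, hypothetical result layout (a flat dict with `expectation_type` and `success` keys); real GE results nest this information differently depending on the version, so the key access would need adapting. The name `aggregate_results` is made up for illustration.

```python
from collections import defaultdict

def aggregate_results(partition_results):
    """Combine per-partition results: pass only if every partition passed."""
    passed = defaultdict(lambda: True)
    for result in partition_results:
        # Assumed, simplified layout; adapt the keys to your GE version
        name = result["expectation_type"]
        passed[name] = passed[name] and result["success"]
    return dict(passed)

# Hypothetical results from two partitions, two expectations each
partition_results = [
    {"expectation_type": "expect_column_values_to_not_be_null", "success": True},
    {"expectation_type": "expect_column_values_to_be_between", "success": True},
    {"expectation_type": "expect_column_values_to_not_be_null", "success": False},
    {"expectation_type": "expect_column_values_to_be_between", "success": True},
]
summary = aggregate_results(partition_results)
```

This only makes sense for row-wise expectations, where passing on every partition implies passing on the whole dataset.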
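Fair warning:
Expectations that depend on the dataset as a whole won’t really work this way. If you compare, for example, the relative amounts of certain values, each partition is validated on its own, and the values might not be distributed evenly across partitions.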
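To illustrate the warning, here is a small plain-Python sketch (no GE or Dask involved) of a proportion check that holds for the whole dataset but fails on every partition. The helper `fraction_in_range` and the toy data are made up for this example and are not part of GE.

```python
def fraction_in_range(values, target, low, high):
    """Check whether the share of `target` in `values` lies within [low, high]."""
    share = values.count(target) / len(values)
    return low <= share <= high

# Toy data: the values are split very unevenly across two partitions
partitions = [["a", "a", "a", "a"], ["b", "b", "b", "b"]]
whole = [v for part in partitions for v in part]

# Over the whole dataset, "a" makes up half of the values
whole_ok = fraction_in_range(whole, "a", 0.4, 0.6)

# Per partition, the share of "a" is either all or nothing
per_partition_ok = [fraction_in_range(part, "a", 0.4, 0.6) for part in partitions]
```

Here the check passes on the full data but fails on both partitions, so per-partition results for this kind of expectation cannot simply be combined.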