Hi GE folks!
In my pipeline the datasource filename changes every day, but it follows a systematic pattern: tableXYZ_YYYYMMDD.csv.
Hence, I need to set the path dynamically, which I tried to do by adding a few lines of code at the top of the suite’s Jupyter notebook.
This is what I did:
1.) I opened the suite’s notebook via
great_expectations suite edit "Table Checker"
2.) I changed the first chunk of the notebook to
import glob
import datetime
import great_expectations as ge
import great_expectations.jupyter_ux
from great_expectations.data_context.types.resource_identifiers import (
    ValidationResultIdentifier,
)

# Load the project's data context and the suite to edit
context = ge.data_context.DataContext()
expectation_suite_name = "Table Checker"
suite = context.get_expectation_suite(expectation_suite_name)
suite.expectations = []

# Pick the first file matching the pattern.
# Note: glob.glob() does not guarantee any ordering, so [0] is only
# the intended file if exactly one match exists in the folder.
path_to_csv = glob.glob("../data/tableXYZ*.csv")[0]

batch_kwargs = {
    "data_asset_name": "Table XYZ",
    "datasource": "Data Provider ZZZ",
    "path": path_to_csv,
}
batch = context.get_batch(batch_kwargs, suite)
batch.head()
3.) I ran the Jupyter notebook (successfully) and saved it.
4.) But when I returned to the Jupyter notebook later via
great_expectations suite edit "Table Checker"
the code from 2.) had been overwritten.
Questions:
- How can I make sure that changes in the first code chunk of the suite’s Jupyter notebook do not get overwritten later?
- Is there a more elegant or sustainable solution for setting the path to the datasource dynamically?
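To make the second question more concrete, here is a minimal sketch of the kind of helper I have in mind. It uses only the Python standard library and assumes the files live in one folder and always follow the tableXYZ_YYYYMMDD.csv pattern. The names path_for_date and latest_csv are my own hypothetical helpers, not Great Expectations functions.

```python
import datetime
import re
from pathlib import Path

# Hypothetical helpers for illustration only; not part of Great Expectations.
# Assumption: file names look exactly like tableXYZ_YYYYMMDD.csv.
FILENAME_PATTERN = re.compile(r"^tableXYZ_(\d{8})\.csv$")


def path_for_date(data_dir, day):
    """Build the expected file path for a given date."""
    return Path(data_dir) / f"tableXYZ_{day:%Y%m%d}.csv"


def latest_csv(data_dir):
    """Return the file with the most recent date stamp, or None if none match."""
    dated = []
    for candidate in Path(data_dir).glob("tableXYZ_*.csv"):
        match = FILENAME_PATTERN.match(candidate.name)
        if not match:
            continue  # skip names without an 8-digit stamp
        try:
            stamp = datetime.datetime.strptime(match.group(1), "%Y%m%d").date()
        except ValueError:
            continue  # skip stamps that are not valid calendar dates
        dated.append((stamp, candidate))
    if not dated:
        return None
    # Compare on the parsed date only, not on the path
    return max(dated, key=lambda pair: pair[0])[1]
```

With something like this, the "path" entry in batch_kwargs could be set to str(latest_csv("../data")) or to str(path_for_date("../data", datetime.date.today())), instead of relying on the order in which glob.glob() returns its matches.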
Best from Berlin, and thanks for your great work,
Guido