Hello all. Newbie to GX, read the overview (v helpful) and started following Try GX Core. However, I had trouble piecing together the various code snippets on the page: I got several errors when trying to run them.
Is there a sample project/script anywhere that shows them all assembled together? I’m looking to run some basic checks on data read from a single csv file (e.g. uniqueness of column values).
Did try searching the web for such a thing but that hasn’t been successful yet.
Without knowing exactly how you wish to use GX (database, context/environment, use case), it's hard to recommend a specific sample project, since there are several different configurations you might choose when getting started with GX.
To start, here’s a demo video that introduces creating a GX project. You can also check out the code sample used in the video.
As a quick example, here’s a workflow using a local CSV file and pandas:
Create a DataFrame from the CSV.
Add the data source and asset, and define a batch representing the entire DataFrame (or partition it to create smaller batches).
Create an expectation suite with a sample expectation.
Pass the DataFrame as a batch and preview it with print(batch.head()) to confirm the structure.
Set up a checkpoint to send validation results to data docs and run the validation.
For more detail on any of these steps, feel free to refer to our glossary for definitions of GX terms.
hi @adeola, thanks very much for the response - that’s sorted me out.
In case it helps, here’s a rough explanation of my understanding as it’s evolved. Whilst I’d read the glossary, I hadn’t fully internalised the data source/asset/batch setup. My mental model was “I have data in a dataframe that I read from csv: I want to validate that”. So I expected to define a suite and run it directly against the dataframe. I understand that the source/asset/batch setup adds flexibility and is likely useful/necessary for bigger tasks: it just took a while to piece it all together.
Sharing that only in the hope it might help to hear a newbie’s experience. For avoidance of doubt, there’s no implicit criticism in there.
Using @adeola’s post I created a utility function to encapsulate setup for simple cases. Sharing here in case it’s useful for others.
import os
import shutil

import pandas
import great_expectations as gx
from great_expectations import ExpectationSuite
from great_expectations.data_context import FileDataContext


def validate_csv_file(project_path: str, src_csv: str, validation_suite: ExpectationSuite) -> FileDataContext:
    # Delete all the cached files so the script can be run from scratch every time.
    # Wouldn't do this in production as it makes things less efficient, but it's
    # helpful for dev as it means there's no cached state.
    gx_path = os.path.join(project_path, "gx")
    shutil.rmtree(gx_path, ignore_errors=True)
    context = gx.get_context(mode="file", project_root_dir=project_path)

    # read the data from file
    dataframe = pandas.read_csv(src_csv)

    # set up the gx data source, asset, and batch definition
    data_source = context.data_sources.add_pandas("data_source")
    data_asset = data_source.add_dataframe_asset("data_asset")
    batch_definition = data_asset.add_batch_definition_whole_dataframe("data_batch")
    suite = context.suites.add(validation_suite)
    batch_parameters = {"dataframe": dataframe}
    batch = batch_definition.get_batch(batch_parameters=batch_parameters)

    # show the first few lines in the batch
    print(batch.head())

    # set up the validations
    validation_definition = context.validation_definitions.add(
        gx.ValidationDefinition(name="vd", data=batch_definition, suite=suite)
    )
    cp = context.checkpoints.add(
        gx.Checkpoint(
            name="checkpoint",
            validation_definitions=[validation_definition],
            actions=[gx.checkpoint.actions.UpdateDataDocsAction(name="action")],
        )
    )

    # run the validations
    cp.run(batch_parameters=batch_parameters)
    return context
if __name__ == "__main__":
    project_path = "./data"
    csv_file_path = os.path.join(project_path, "input/sample_report.csv")
    validation_suite = gx.ExpectationSuite(
        "suite",
        expectations=[
            gx.expectations.ExpectColumnValueLengthsToBeBetween(column="UNIQUEID", min_value=1, max_value=10000),
        ],
    )
    print("calling validation")
    context = validate_csv_file(project_path, csv_file_path, validation_suite)
    print("validation call returned, opening docs")
    context.open_data_docs()