Hello all. Newbie to GX, read the overview (v helpful) and started following Try GX Core. However, I had trouble piecing together the various code snippets on the page: I got several errors when trying to run them.
Is there a sample project/script anywhere that shows them all assembled together? I’m looking to run some basic checks on data read from a single csv file (e.g. uniqueness of column values).
Did try searching the web for such a thing but that hasn’t been successful yet.
Without knowing exactly how you wish to use GX (database, context/environment, use case), it's hard to recommend a specific sample project, since there are several different configurations you might choose when getting started with GX.
To start, here’s a demo video that introduces creating a GX project. You can also check out the code sample used in the video.
As a quick example, here’s a workflow using a local CSV file and pandas:
Create a DataFrame from the CSV.
Add the data source and asset, and define a batch representing the entire DataFrame (or partition it to create smaller batches).
Create an expectation suite with a sample expectation.
Pass the DataFrame as a batch and preview it with print(batch.head()) to confirm the structure.
Set up a checkpoint to send validation results to data docs and run the validation.
For more detail on any of these steps, feel free to refer to our glossary for definitions of GX terms.
hi @adeola, thanks very much for the response - that’s sorted me out.
In case it helps, here’s a rough explanation of my understanding as it’s evolved. Whilst I’d read the glossary, I hadn’t fully internalised the data source/asset/batch setup. My mental model was “I have data in a dataframe that I read from csv: I want to validate that”. So I expected to define a suite and run it directly against the dataframe. I understand that the source/asset/batch setup adds flexibility and is likely useful/necessary for bigger tasks: it just took a while to piece it all together.
Sharing that only in the hope it might help to hear a newbie’s experience. For avoidance of doubt, there’s no implicit criticism in there.
Using @adeola’s post I created a utility function to encapsulate setup for simple cases. Sharing here in case it’s useful for others.
import os
import shutil

import pandas
import great_expectations as gx
from great_expectations import ExpectationSuite
from great_expectations.data_context import FileDataContext


def validate_csv_file(project_path: str, src_csv: str, validation_suite: ExpectationSuite) -> FileDataContext:
    # Delete all the cached files so the script can be run from scratch every time.
    # Wouldn't do this in production as it makes things less efficient, but it's
    # helpful for dev as it means there's no cached state.
    gx_path = os.path.join(project_path, "gx")
    shutil.rmtree(gx_path, ignore_errors=True)
    context = gx.get_context(mode="file", project_root_dir=project_path)

    # read the data from file
    dataframe = pandas.read_csv(src_csv)

    # set up the gx data source, asset, and batch definition
    data_source = context.data_sources.add_pandas("data_source")
    data_asset = data_source.add_dataframe_asset("data_asset")
    batch_definition = data_asset.add_batch_definition_whole_dataframe("data_batch")
    suite = context.suites.add(validation_suite)
    batch_parameters = {"dataframe": dataframe}
    batch = batch_definition.get_batch(batch_parameters=batch_parameters)

    # show the first few lines in the batch
    print(batch.head())

    # set up the validations
    validation_definition = context.validation_definitions.add(
        gx.ValidationDefinition(name="vd", data=batch_definition, suite=suite)
    )
    cp = context.checkpoints.add(
        gx.Checkpoint(
            name="checkpoint",
            validation_definitions=[validation_definition],
            actions=[gx.checkpoint.actions.UpdateDataDocsAction(name="action")],
        )
    )

    # run the validations
    cp.run(batch_parameters=batch_parameters)
    return context
if __name__ == "__main__":
    project_path = "./data"
    csv_file_path = os.path.join(project_path, "input/sample_report.csv")
    validation_suite = gx.ExpectationSuite(
        "suite",
        expectations=[
            gx.expectations.ExpectColumnValueLengthsToBeBetween(column="UNIQUEID", min_value=1, max_value=10000),
        ],
    )
    print("calling validation")
    context = validate_csv_file(project_path, csv_file_path, validation_suite)
    print("validation call returned, opening docs")
    context.open_data_docs()