I am new to Great Expectations and I have some questions to which I couldn't find a clear answer so far. I am using GX on Databricks and I would like to validate incoming data. For each dataset that I want to validate, I configure some specific expectations (which can change over time). What I don't understand so far:
Why do I need to store information (data source / data asset / batch definition) in the data context in order to run a validation? What is the advantage of storing this kind of information in the data context when I already have it configured in my script? I would like to be able to adjust it in my script whenever I like, but with the data context I have to update it there as well.
When I try to create a data context in the runtime environment using the EphemeralDataContext, my script (see script below) gives me the following output:
INFO:great_expectations.data_context.types.base:Created temporary directory '/tmp/tmpn8t19_wd' for ephemeral docs site
This sounds to me like the data context is no longer just in the runtime environment. Why doesn't it stay in memory only?
The shortest way I found to validate a pyspark dataframe with some expectations is roughly the following (a minimal sketch using the GX fluent API; df is the dataframe to validate, and the data source / asset / batch definition names and the example column are just placeholders):
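import great_expectations as gx

# in-memory data context
context = gx.get_context()

# register a Spark data source, a dataframe asset and a whole-dataframe batch definition
# (the names here are arbitrary placeholders)
data_source = context.data_sources.add_spark(name="spark_source")
data_asset = data_source.add_dataframe_asset(name="my_asset")
batch_definition = data_asset.add_batch_definition_whole_dataframe("my_batch_definition")

# get a batch for the dataframe to be validated
batch = batch_definition.get_batch(batch_parameters={"dataframe": df})

# validate a single expectation against the batch ("id" is just an example column)
expectation = gx.expectations.ExpectColumnValuesToNotBeNull(column="id")
result = batch.validate(expectation)
print(result.success)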
Is there a shorter way to validate a pyspark dataframe? Why do I have to create these empty objects (data_source, data_asset, batch_definition) when all they contain is an arbitrary string?
Just adding my two cents as a fellow user.
The GX API attempts to abstract capability from implementation. This is useful for GX, since it means objects like data sources and data assets give GX a standardized view into data regardless of the backend. The downside, IMO, is that it isn't always as useful for the user: it feels rather cumbersome if you do everything in memory, as you have found out as well.
On the other hand, it forces you to think about data validation as something that is fundamentally dependent on the source of the data, the particular view/query/table of this source (the asset), and additionally (and optionally) a partitioning of this view, including metadata such as time of execution, versions, etc.
I'm assuming you're not trying to validate static in-code Spark dataframes; data always comes from somewhere. When you save to a file-based data context, you only have to save your Spark data source and asset once in order to run a validation. You could even re-use the batch definition, so that whenever you want to validate a dataframe again, all you have to run is something like this (a sketch, assuming a whole-dataframe batch definition and an already-stored expectation suite):
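# get a batch for the current dataframe and validate it against the stored suite
batch = batch_definition.get_batch(batch_parameters={"dataframe": df})
results = batch.validate(suite)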
Asset names and some batch information will be visible in your data docs, which is another reason to think twice about how arbitrary these strings really are (if you build data docs).
Regarding your ephemeral context (use gx.get_context(mode="ephemeral")) persisting files to disk: if you absolutely don't want anything written to disk, you can pass project_config=DataContextConfig(store_backend_defaults=InMemoryStoreBackendDefaults()) (the imports can be found in great_expectations.data_context.types.base). I'm assuming the default behaviour (which passes init_temp_docs_sites=True to InMemoryStoreBackendDefaults) is there to enable building data docs.
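A fully in-memory setup could then look roughly like this (a sketch; I'm assuming init_temp_docs_sites=False is what skips the temporary docs site):

import great_expectations as gx
from great_expectations.data_context.types.base import (
    DataContextConfig,
    InMemoryStoreBackendDefaults,
)

# keep all stores in memory and skip the temporary docs site
config = DataContextConfig(
    store_backend_defaults=InMemoryStoreBackendDefaults(init_temp_docs_sites=False)
)
context = gx.get_context(mode="ephemeral", project_config=config)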
Thanks a lot!! That helps very much! Creating those objects and choosing those strings makes absolute sense in order to achieve a good data docs outcome!
The reason I didn't re-use my data source / assets etc. is that I would like to automate the code and add new datasets to be validated via my own config file, independently and without changing the code. For this I would need to do something like the following, which looks to me even more complicated than my example above:
# create or retrieve data source
if source_name in context.data_sources.all():
    data_source = context.data_sources.get(source_name)
else:
    data_source = context.data_sources.add_spark(name=source_name)

# create or retrieve asset
if asset_name in data_source.get_asset_names():
    data_asset = data_source.get_asset(asset_name)
else:
    data_asset = data_source.add_dataframe_asset(name=asset_name)

# create or retrieve expectation suite
if suite_name in [suite.name for suite in context.suites.all()]:
    suite = context.suites.get(suite_name)
else:
    # add a new Expectation Suite to the context
    suite = gx.ExpectationSuite(name=suite_name)
    suite = context.suites.add(suite)
Is there an easier way, something like a "get or create" function, to do the same thing a bit more neatly?