I am new to Great Expectations and I have some questions to which I couldn't find a clear answer so far. I am using GX on Databricks and I would like to validate incoming data. For each dataset that I want to validate, I configure some specific expectations (which can change over time). What I don't understand so far:
Why do I need to store information (data source / data asset / batch definition) in the data context in order to run a validation? What is the advantage of storing this kind of information in the data context when I already have it configured in my script? I would like to be able to adjust it in my script whenever I like, but with the data context I have to update it there as well.
When I try to create a data context in the runtime environment using the EphemeralDataContext, my script (see script below) gives me the following output:
INFO:great_expectations.data_context.types.base:Created temporary directory '/tmp/tmpn8t19_wd' for ephemeral docs site
This sounds to me like the data context is no longer just in the runtime environment. Why doesn't it stay in memory only?
The shortest way I found to validate a pyspark dataframe with some expectations is roughly the following (a minimal sketch using the GX fluent API; df is the dataframe to validate, and the data source / asset / batch definition names and the example column are just placeholders):
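import great_expectations as gx

# in-memory data context
context = gx.get_context()

# register a Spark data source, a dataframe asset and a whole-dataframe batch definition
# (the names here are arbitrary placeholders)
data_source = context.data_sources.add_spark(name="spark_source")
data_asset = data_source.add_dataframe_asset(name="my_asset")
batch_definition = data_asset.add_batch_definition_whole_dataframe("my_batch_definition")

# get a batch for the dataframe to be validated
batch = batch_definition.get_batch(batch_parameters={"dataframe": df})

# validate a single expectation against the batch ("id" is just an example column)
expectation = gx.expectations.ExpectColumnValuesToNotBeNull(column="id")
result = batch.validate(expectation)
print(result.success)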
Is there a shorter way to validate a pyspark dataframe? Why do I have to create these empty objects (data_source, data_asset, batch_definition) when all they contain is an arbitrary string?
Just adding my two cents as a fellow user.
The GX API attempts to abstract capability from implementation. This is useful for GX, since it means objects like data sources and data assets give GX a standardized view into data regardless of the backend. The downside, IMO, is that it isn't always as useful for the user: it feels rather cumbersome if you do everything in memory, as you have found out as well.
On the other hand, it forces you to think about data validation as something that is fundamentally dependent on the source of the data, the particular view/query/table of this source (the asset), and additionally (and optionally) a partitioning of this view, including metadata such as time of execution, versions, etc.
I'm assuming you're not trying to validate static in-code Spark dataframes; data always comes from somewhere. When you save to a file-based data context, you only have to save your Spark data source and asset once in order to run a validation. You could even re-use the batch definition, so that whenever you want to validate a dataframe again, all you have to run is something like this (a sketch, assuming a whole-dataframe batch definition and an already-stored expectation suite):
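# get a batch for the current dataframe and validate it against the stored suite
batch = batch_definition.get_batch(batch_parameters={"dataframe": df})
results = batch.validate(suite)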
Asset names and some batch information will be visible in your data docs, which is another reason to think twice about how arbitrary these strings really are (if you build data docs).
Regarding your ephemeral context (use gx.get_context(mode="ephemeral")) persisting files to disk: if you absolutely don't want anything written to disk, you can pass project_config=DataContextConfig(store_backend_defaults=InMemoryStoreBackendDefaults()) (the imports can be found in great_expectations.data_context.types.base). I'm assuming the default behaviour (which passes init_temp_docs_sites=True to InMemoryStoreBackendDefaults) is there to enable building data docs.
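A fully in-memory setup could then look roughly like this (a sketch; I'm assuming init_temp_docs_sites=False is what skips the temporary docs site):

import great_expectations as gx
from great_expectations.data_context.types.base import (
    DataContextConfig,
    InMemoryStoreBackendDefaults,
)

# keep all stores in memory and skip the temporary docs site
config = DataContextConfig(
    store_backend_defaults=InMemoryStoreBackendDefaults(init_temp_docs_sites=False)
)
context = gx.get_context(mode="ephemeral", project_config=config)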
Thanks a lot!! That helps very much! Creating those objects and choosing those strings makes absolute sense in order to achieve a good data docs outcome!
The reason I didn't re-use my data source / assets etc. is that I would like to automate the code and add new datasets to be validated via my own config file, independently and without changing the code. For this I would need to do something like the following, which looks to me even more complicated than my example above:
# create or retrieve data source
if source_name in context.data_sources.all():
    data_source = context.data_sources.get(source_name)
else:
    data_source = context.data_sources.add_spark(name=source_name)

# create or retrieve asset
if asset_name in data_source.get_asset_names():
    data_asset = data_source.get_asset(asset_name)
else:
    data_asset = data_source.add_dataframe_asset(name=asset_name)

# create or retrieve expectation suite
if suite_name in [suite.name for suite in context.suites.all()]:
    suite = context.suites.get(suite_name)
else:
    # add a new Expectation Suite to the context
    suite = gx.ExpectationSuite(name=suite_name)
    suite = context.suites.add(suite)
Is there an easier way, something like a "get or create" function, to do the same thing a bit more neatly?