Not able to create an expectation suite and Data Docs in Databricks using Spark

Hi Team,

We are doing a POC to implement Great Expectations in our Databricks data pipelines. As part of it, we wrote code that reads a CSV file, creates a data context on Databricks Volumes, adds expectations to an expectation suite, and runs a validator against the data.
The code runs through fine with the Pandas execution engine, but the same flow with the Spark execution engine fails with a JVM error.
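For context, the Spark variant differs from the Pandas listing below only in the execution engine, the batch request, and skipping the `toPandas()` conversion. It looks roughly like this (a sketch from memory; the datasource and asset names are placeholders, and it reuses the context, imports, and suite from the full listing):

```python
# Sketch of the failing Spark path: same RuntimeDataConnector setup as the
# Pandas version below, with SparkDFExecutionEngine swapped in.
spark_df = spark.read.csv(file_path, header=True, inferSchema=True)

if "spark_datasource" not in [ds["name"] for ds in context.list_datasources()]:
    context.add_datasource(
        name="spark_datasource",
        class_name="Datasource",
        execution_engine={"class_name": "SparkDFExecutionEngine"},
        data_connectors={
            "default_runtime_data_connector": {
                "class_name": "RuntimeDataConnector",
                "batch_identifiers": ["default_identifier"],
            }
        },
    )

batch_request = RuntimeBatchRequest(
    datasource_name="spark_datasource",
    data_connector_name="default_runtime_data_connector",
    data_asset_name="spark_data_asset",
    runtime_parameters={"batch_data": spark_df},  # Spark DataFrame, not Pandas
    batch_identifiers={"default_identifier": "default"},
)

# The JVM (Py4J) error surfaces somewhere along this path.
validator = context.get_validator(
    batch_request=batch_request,
    expectation_suite_name=suite_name,
)
validator.expect_column_values_to_not_be_null("age")
```

Here is the working Pandas version in full: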

```python
# 1. Imports (`spark` and `file_path` are defined earlier in the notebook)
import great_expectations as gx
from great_expectations.core.batch import RuntimeBatchRequest

# 2. Read data using Spark
df = spark.read.csv(file_path, header=True, inferSchema=True)

# 3. Convert the Spark DataFrame to a Pandas DataFrame (it fits in memory)
pandas_df = df.toPandas()

# 4. Initialize the GE context (the folder exists and holds our GE config)
yaml_dir = "/Volumes/fa_dev/mln_dq/dq_volume/gx/"
context_root_dir = yaml_dir
context = gx.get_context()

# 5. Add the Pandas datasource if it doesn't exist yet
if "pandas_datasource" not in [ds["name"] for ds in context.list_datasources()]:
    context.add_datasource(
        name="pandas_datasource",
        class_name="Datasource",
        execution_engine={"class_name": "PandasExecutionEngine"},
        data_connectors={
            "default_runtime_data_connector": {
                "class_name": "RuntimeDataConnector",
                "batch_identifiers": ["default_identifier"],
            }
        },
    )

# 6. Expectation suite name; delete the existing suite if present, then create it fresh
suite_name = "my_suite_test2"
if suite_name in [s.expectation_suite_name for s in context.list_expectation_suites()]:
    context.delete_expectation_suite(suite_name)
context.add_expectation_suite(suite_name)

# 7. Create a RuntimeBatchRequest with the Pandas DataFrame
batch_request = RuntimeBatchRequest(
    datasource_name="pandas_datasource",
    data_connector_name="default_runtime_data_connector",
    data_asset_name="pandas_data_asset",
    runtime_parameters={"batch_data": pandas_df},
    batch_identifiers={"default_identifier": "default"},
)

# 8. Get a validator
validator = context.get_validator(
    batch_request=batch_request,
    expectation_suite_name=suite_name,
)
print(pandas_df.columns)
print(pandas_df.head())

# 9. Add expectation(s)
try:
    validator.expect_column_values_to_not_be_null("age")
except KeyError as e:
    print(f"KeyError during expectation: {e}")

# 10. Save the expectation suite
validator.save_expectation_suite(discard_failed_expectations=False)

# 11. Run validation and print the results
results = validator.validate()
print(results)

# 12. Build Data Docs
context.build_data_docs()
print(context.expectations_store_name)
print(context.stores[context.expectations_store_name].store_backend.__class__.__name__)

# 13. Stop the pipeline if validation failed
if not results.success:
    raise ValueError("Data validation failed. Stopping pipeline.")
```

Also, when running with Pandas, the expectation suite is created in the in-memory store, while we want it persisted in the Volume where the GE context has been initialised.
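A minimal sketch of the behaviour we are after (assuming `context_root_dir` is the right way to bind the context to the Volume):

```python
import great_expectations as gx

# Assumption: pointing get_context at the Volume yields a FileDataContext
# backed by the great_expectations.yml under that folder, so suites and
# Data Docs persist to the Volume rather than an in-memory store.
context = gx.get_context(context_root_dir="/Volumes/fa_dev/mln_dq/dq_volume/gx/")

print(type(context).__name__)  # expected: FileDataContext
print(context.root_directory)  # expected: the Volume path above

# Expected a filesystem backend (e.g. TupleFilesystemStoreBackend) here;
# with our current setup the suite still lands in the in-memory store.
store = context.stores[context.expectations_store_name]
print(store.store_backend.__class__.__name__)
```

Is there anything else we need to configure (e.g. the `stores` or `data_docs_sites` sections of `great_expectations.yml`) so that the suite and Data Docs are written to the Volume? Any pointers on the Spark JVM error would also be much appreciated.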