GX 1.0 and Databricks

I’m trying out the new 1.0 release on Databricks and going through the related documentation.
What’s the recommended approach for working with GX 1.0 in Databricks and validating data in Unity Catalog tables?

Hi Toivo, we’re working on providing GX Cloud integration with Databricks SQL warehouses; we can let you know when it launches! :slight_smile:


Awesome, I’m looking forward to it!

I got things working with GX Core using SQLAlchemy.
I had some problems with numpy and SQLAlchemy versioning, but they were solved by explicitly pinning the versions; I’ll include the errors after the code.

Here’s my code:

%pip install great_expectations[databricks]
%pip uninstall -y numpy
%pip install "numpy<2"
%pip install "sqlalchemy>=2"
%restart_python   # Not sure if this is actually needed

import great_expectations as gx

# An Ephemeral Data Context: everything below lives in memory for this session
context = gx.get_context()

# Connection details for the Databricks SQL warehouse
token = ...
host = ...
port = ...
http_path = ...
catalog = ...
schema = ...

# Register the warehouse as a Data Source
datasource = context.data_sources.add_or_update_databricks_sql(
    name="my datasource",
    connection_string=f"databricks://token:{token}@{host}:{port}?http_path={http_path}&catalog={catalog}&schema={schema}",
)

# Point a Data Asset at the Unity Catalog table
asset_name = "yellow_tripdata_sample_2019_01"
database_table_name = "yellow_tripdata_sample_2019_01"
table_data_asset = datasource.add_table_asset(
    table_name=database_table_name, name=asset_name
)

# A Batch Definition covering the whole table; the Batch itself is handy
# for ad-hoc checks, though it isn't strictly needed for the Checkpoint run
full_table_batch_definition = table_data_asset.add_batch_definition_whole_table(
    name="FULL_TABLE"
)
full_table_batch = full_table_batch_definition.get_batch()

# Build an Expectation Suite with a single preset Expectation
preset_expectation = gx.expectations.ExpectColumnMaxToBeBetween(
    column="passenger_count", min_value=1, max_value=6
)
suite_name = "my_expectation_suite"
suite = gx.ExpectationSuite(name=suite_name)
suite = context.suites.add(suite)
suite.add_expectation(preset_expectation)

# Tie the Batch Definition and the Suite together in a Validation Definition
definition_name = "my_validation_definition"
validation_definition = gx.ValidationDefinition(
    data=full_table_batch_definition, suite=suite, name=definition_name
)
validation_definition = context.validation_definitions.add(validation_definition)
validation_results = validation_definition.run()

# Wrap the Validation Definition in a Checkpoint (no actions) and run it
validation_definitions = [validation_definition]
action_list = []
checkpoint_name = "my_checkpoint"
checkpoint = gx.Checkpoint(
    name=checkpoint_name,
    validation_definitions=validation_definitions,
    actions=action_list,
    result_format={"result_format": "COMPLETE"},
)
context.checkpoints.add(checkpoint)
checkpoint_result = checkpoint.run()
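
To sanity-check the outcome, the result object can be inspected directly; as far as I can tell, the 1.0 results expose a success flag and a describe() summary:

# Overall pass/fail for the Checkpoint run
print(checkpoint_result.success)

# Human-readable summary of each Expectation's outcome
print(checkpoint_result.describe())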

Note that this does not save the Checkpoints, Data Sources, etc. to disk, so the setup can’t easily be reused later for automagically validating data.
I’m still having trouble figuring out how to do that. Help would be appreciated, though I’m under the impression that it’s quite hard when working with clusters.
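
One direction I haven’t tested yet is a File Data Context, which persists Data Sources, Suites, Validation Definitions, and Checkpoints to disk instead of keeping them in memory. A rough sketch (the project_root_dir path is just a placeholder):

import great_expectations as gx

# Untested sketch: mode="file" creates (or reloads) a File Data Context that
# persists configuration under project_root_dir instead of holding it in memory.
# The path is a placeholder; on Databricks it would need to be somewhere
# durable, e.g. a DBFS or workspace path.
context = gx.get_context(mode="file", project_root_dir="/dbfs/great_expectations")

# On a later run, the same call reloads the saved configuration, so a
# previously added Checkpoint can be fetched by name and re-run:
checkpoint = context.checkpoints.get("my_checkpoint")
checkpoint_result = checkpoint.run()

If that works on a cluster, the validation could then be scheduled as a plain notebook job.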

The error I got related to the NumPy version (there’s a StackOverflow thread on the topic):
ValueError: numpy.dtype size changed, may indicate binary incompatibility. Expected 96 from C header, got 88 from PyObject

Error related to SQLAlchemy:
TestConnectionError: Attempt to connect to datasource failed: due to SQLAlchemyCreateEngineError('Unable to create SQLAlchemy Engine: due to AttributeError("module 'sqlalchemy.types' has no attribute 'Uuid'")')

This is fixed in SQLAlchemy 2.0, see discussion here.

It seems the errors have nothing to do with GX itself; pip just resolved incompatible versions. The numpy.dtype error is the usual symptom of a package compiled against NumPy 1.x being imported under NumPy 2.x, and sqlalchemy.types.Uuid only exists as of SQLAlchemy 2.0, hence the version pins above.
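
In hindsight, a single pinned install should also avoid the problem, since pip then resolves all constraints in one pass; an untested variation on the commands above:

%pip install "great_expectations[databricks]" "numpy<2" "sqlalchemy>=2"
%restart_python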