How to connect to data on Databricks Unity Catalog using Spark

I have my data in Databricks Unity Catalog as

main.checkout_catalog.checkout_schema.checkout_orders_data

How can I connect to the data at the path given above?

Hi @sant-singh, thanks for reaching out!

If you’d like to use Spark to connect to a Databricks table, you can use spark.read.table("main.checkout_catalog.checkout_schema.checkout_orders_data") to read your table into a Spark dataframe. From there, you can use GX to add a Spark data source and a dataframe asset using the newly created dataframe from your table.

Here’s a link to our relevant docs page on connecting to in-memory Data Assets: Connect to in-memory Data Assets | Great Expectations, along with some sample code adapted for the spark.read.table() usage:

import great_expectations as gx

# Get a Data Context; on Databricks the spark session is already available in the notebook
context = gx.get_context()

# Register a Spark Data Source
data_source = context.sources.add_or_update_spark(name="spark_datasource")

# Read the Unity Catalog table into a Spark DataFrame
df = spark.read.table("main.checkout_catalog.checkout_schema.checkout_orders_data")

# Add the DataFrame as a DataFrame Asset on the Spark Data Source
data_asset = data_source.add_dataframe_asset(
    name="spark_dataframe_asset",
    dataframe=df
)

# Build a Batch Request to use in a Validator or Checkpoint
batch_request = data_asset.build_batch_request()
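
If it helps, here is one way to use that Batch Request. This is just a minimal sketch: the suite name "checkout_orders_suite" and the "order_id" column are illustrative placeholders, not part of your setup.

# Create (or update) an Expectation Suite and get a Validator for the batch
context.add_or_update_expectation_suite(expectation_suite_name="checkout_orders_suite")

validator = context.get_validator(
    batch_request=batch_request,
    expectation_suite_name="checkout_orders_suite"
)

# Example expectation; "order_id" is a hypothetical column name
validator.expect_column_values_to_not_be_null("order_id")
validator.save_expectation_suite(discard_failed_expectations=False)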

I have Delta tables only. I have to use the tables directly rather than reading them into a DataFrame, because I have to send the validation results to DataHub, and the data asset type has to match what DataHub can ingest.

I am getting this error while executing the provided sample code:

DataHubValidationAction does not recognize this GE data asset type - <class 'great_expectations.validator.validator.Validator'>. This is either using v2-api or execution engine other than sqlalchemy.
Metadata not sent to datahub. No datasets found.

I have to use the SqlAlchemyExecutionEngine, because DataHub only supports the SqlAlchemyExecutionEngine when integrating Great Expectations with DataHub.

Hi @sant-singh, if you’re not using Spark, an alternative way to connect to a Databricks table is by creating a Databricks SQL source and table asset.
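
A minimal sketch of what that could look like is below. The token, host, http_path, port, and the exact connection-string format are assumptions you will need to confirm against your Databricks SQL warehouse settings, and the <catalog>/<schema> placeholders and asset name should be replaced with the values from your own table path.

import great_expectations as gx

context = gx.get_context()

# Assumed connection details - copy these from your Databricks SQL warehouse
token = "<databricks-personal-access-token>"
host = "<workspace-hostname>"
http_path = "<sql-warehouse-http-path>"

connection_string = (
    f"databricks://token:{token}@{host}:443"
    f"?http_path={http_path}&catalog=<catalog>&schema=<schema>"
)

# SQL Data Sources run on the SqlAlchemyExecutionEngine rather than Spark
datasource = context.sources.add_databricks_sql(
    name="databricks_sql_datasource",
    connection_string=connection_string
)

# Register the table itself as a Table Asset
table_asset = datasource.add_table_asset(
    name="checkout_orders_asset",
    table_name="checkout_orders_data"
)

batch_request = table_asset.build_batch_request()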

With regard to your DataHub error, I see you’ve also raised it in another post - let’s keep the conversation around the DataHub error in its original thread. This will help prevent duplicate effort and keep all the communication & responses consolidated in one place. Thanks!