Struggling to identify correct base directory for add_spark_filesystem

I have some code which successfully creates a batch request using Pandas. However, I would like to use Spark instead (for scalability, and to solve another problem).

Here is my working Pandas code:

# Copy the data to be validated from Azure storage into DBFS
dbutils.fs.cp(f"abfss://{blob_container_name}@{datavalidation_storage_account_name}.dfs.core.windows.net/{for_validation_data_folder}/", f"/data-in/{run_name}/", True)
dbutils.fs.ls(f"/data-in/{run_name}/")  # confirm the files arrived

# Create pandas batch request
pandas_datasource = context.sources.add_pandas_filesystem(name="pandas_data_in", base_directory=f"/dbfs/data-in/{run_name}/")
pandas_asset = pandas_datasource.add_csv_asset("csv", batching_regex=r"(?P<file_name>.*).csv")
pandas_batch_request = pandas_asset.build_batch_request({"file_name": file_name})
pandas_batches = pandas_asset.get_batch_list_from_batch_request(pandas_batch_request)
print(pandas_batches)
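
As an aside, I believe the pandas datasource works here because Databricks exposes DBFS on the driver's local filesystem under the /dbfs FUSE mount, so plain pandas file I/O can see the copied files. A quick sanity check (reusing the file_name variable from above):

import pandas as pd
# Read one copied file through the local /dbfs mount, outside Great Expectations
print(pd.read_csv(f"/dbfs/data-in/{run_name}/{file_name}.csv").head())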

Here are some failed attempts to use Spark instead:

  1. Simple substitution of add_spark_filesystem for add_pandas_filesystem
datasource = context.sources.add_spark_filesystem(name="data_in", base_directory=f"/dbfs/data-in/{run_name}/")
asset = datasource.add_csv_asset("csv", batching_regex=r"(?P<file_name>.*).csv")
batch_request = asset.build_batch_request({"file_name": file_name})
batches = asset.get_batch_list_from_batch_request(batch_request)
# Line above fails with "Unable to read in batch from the following path: /dbfs/data-in/Run42/MyFileName.csv. Please check your configuration."
  2. Trying the dbfs: scheme
datasource = context.sources.add_spark_filesystem(name="data_in", base_directory=f"dbfs:/data-in/{run_name}/")
# Fails with TestConnectionError: base_directory path: /databricks/driver/dbfs:/data-in/Run43 does not exist.
  3. Using a relative path
datasource = context.sources.add_spark_filesystem(name="data_in", base_directory=f"../../dbfs/data-in/{run_name}/")
asset = datasource.add_csv_asset("csv", batching_regex=r"(?P<file_name>.*).csv")
batch_request = asset.build_batch_request({"file_name": file_name})
batches = asset.get_batch_list_from_batch_request(batch_request)
# Line above fails with "IllegalArgumentException: Path must be absolute: ../../dbfs/data-in/Run40/MyFileName.csv"
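
For what it's worth, Spark itself can read the same files when given the dbfs:/ URI scheme directly (a quick check, reusing run_name from above):

# Spark addresses DBFS via the dbfs:/ scheme rather than the local /dbfs mount
check_df = spark.read.option("header", "true").csv(f"dbfs:/data-in/{run_name}/")
print(check_df.count())

So the data is reachable; the trouble seems to be which form of the path the filesystem datasource hands to Spark.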

It seems this may come down to identifying an absolute path that Spark will accept. Confusingly, ../../dbfs/ resolved from the working directory /databricks/driver/ is just /dbfs/ again - which already failed in attempt 1 - so perhaps the issue is how Spark resolves the path rather than which directory it points at.
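
A quick check confirms those two spellings collapse to the same place:

import os
# The relative path from the driver's working directory normalizes to /dbfs/...
print(os.path.normpath("/databricks/driver/../../dbfs/data-in/Run40/"))
# prints: /dbfs/data-in/Run40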

Hoping someone can help!

Many thanks.

After a flash of inspiration overnight, I thought I should try using a Spark DataFrame, which turns out to work a treat, and avoids the need to copy my data into DBFS first.

Sharing my solution here in the hope it will help others. (I’m very new to Spark and Databricks, which doubtless shows!)

# Read the CSV straight from Azure storage - no copy into DBFS needed
data_frame = (spark.read.format("csv").option("header", "true").option("inferSchema", "true")
    .load(f"abfss://{blob_container_name}@{datavalidation_storage_account_name}.dfs.core.windows.net/{for_validation_data_folder}/MyFileName.csv"))
display(data_frame)

# Build the batch request directly from the in-memory Spark DataFrame
datasource = context.sources.add_spark("spark_datasource")
asset = datasource.add_dataframe_asset(name="spark_dataframe_asset")
batch_request = asset.build_batch_request(dataframe=data_frame)

batches = asset.get_batch_list_from_batch_request(batch_request)
print(batches)
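
From here, the batch request plugs into a validator as usual. A minimal sketch - the suite name and column are just placeholders:

context.add_expectation_suite(expectation_suite_name="my_suite")  # placeholder name
validator = context.get_validator(batch_request=batch_request, expectation_suite_name="my_suite")
validator.expect_column_values_to_not_be_null("some_column")  # placeholder expectation
print(validator.validate())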