Struggling to identify correct base directory for add_spark_filesystem

I have some code which successfully creates a batch request using Pandas. However, I would like to use Spark instead (for scalability, and to solve another problem).

Here is my working Pandas code:

# Copy the data to be validated from Azure storage into DBFS
dbutils.fs.cp(f"abfss://{blob_container_name}@{datavalidation_storage_account_name}.dfs.core.windows.net/{for_validation_data_folder}/", f"/data-in/{run_name}/", True)
dbutils.fs.ls(f"/data-in/{run_name}/")  # confirm the files arrived

# Create pandas batch request
pandas_datasource = context.sources.add_pandas_filesystem(name="pandas_data_in", base_directory=f"/dbfs/data-in/{run_name}/")
pandas_asset = pandas_datasource.add_csv_asset("csv", batching_regex=r"(?P<file_name>.*).csv")
pandas_batch_request = pandas_asset.build_batch_request({"file_name": file_name})
pandas_batches = pandas_asset.get_batch_list_from_batch_request(pandas_batch_request)
print(pandas_batches)
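
As an aside, I believe the pandas datasource works here because Databricks exposes DBFS on the driver's local filesystem under the /dbfs FUSE mount, so plain pandas file I/O can see the copied files. A quick sanity check (reusing the file_name variable from above):

import pandas as pd
# Read one copied file through the local /dbfs mount, outside Great Expectations
print(pd.read_csv(f"/dbfs/data-in/{run_name}/{file_name}.csv").head())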

Here are some failed attempts to use Spark instead:

  1. Simple substitution of add_spark_filesystem for add_pandas_filesystem
datasource = context.sources.add_spark_filesystem(name="data_in", base_directory=f"/dbfs/data-in/{run_name}/")
asset = datasource.add_csv_asset("csv", batching_regex=r"(?P<file_name>.*).csv")
batch_request = asset.build_batch_request({"file_name": file_name})
batches = asset.get_batch_list_from_batch_request(batch_request)
# Line above fails with "Unable to read in batch from the following path: /dbfs/data-in/Run42/MyFileName.csv. Please check your configuration."
  2. Trying the dbfs: scheme
datasource = context.sources.add_spark_filesystem(name="data_in", base_directory=f"dbfs:/data-in/{run_name}/")
# Fails with TestConnectionError: base_directory path: /databricks/driver/dbfs:/data-in/Run43 does not exist.
  3. Using a relative path
datasource = context.sources.add_spark_filesystem(name="data_in", base_directory=f"../../dbfs/data-in/{run_name}/")
asset = datasource.add_csv_asset("csv", batching_regex=r"(?P<file_name>.*).csv")
batch_request = asset.build_batch_request({"file_name": file_name})
batches = asset.get_batch_list_from_batch_request(batch_request)
# Line above fails with "IllegalArgumentException: Path must be absolute: ../../dbfs/data-in/Run40/MyFileName.csv"
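
For what it's worth, Spark itself can read the same files when given the dbfs:/ URI scheme directly (a quick check, reusing run_name from above):

# Spark addresses DBFS via the dbfs:/ scheme rather than the local /dbfs mount
check_df = spark.read.option("header", "true").csv(f"dbfs:/data-in/{run_name}/")
print(check_df.count())

So the data is reachable; the trouble seems to be which form of the path the filesystem datasource hands to Spark.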

It seems this may come down to identifying an absolute path that Spark will accept. Confusingly, ../../dbfs/ resolved from the working directory /databricks/driver/ is just /dbfs/ again - which already failed in attempt 1 - so perhaps the issue is how Spark resolves the path rather than which directory it points at.
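
A quick check confirms those two spellings collapse to the same place:

import os
# The relative path from the driver's working directory normalizes to /dbfs/...
print(os.path.normpath("/databricks/driver/../../dbfs/data-in/Run40/"))
# prints: /dbfs/data-in/Run40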

Hoping someone can help!

Many thanks.

After a flash of inspiration overnight, I thought I should try using a Spark DataFrame, which turns out to work a treat, and avoids the need to copy my data into DBFS first.

Sharing my solution here in the hope it will help others. (I’m very new to Spark and Databricks, which doubtless shows!)

# Read the CSV straight from Azure storage - no copy into DBFS needed
data_frame = (spark.read.format("csv").option("header", "true").option("inferSchema", "true")
    .load(f"abfss://{blob_container_name}@{datavalidation_storage_account_name}.dfs.core.windows.net/{for_validation_data_folder}/MyFileName.csv"))
display(data_frame)

# Build the batch request directly from the in-memory Spark DataFrame
datasource = context.sources.add_spark("spark_datasource")
asset = datasource.add_dataframe_asset(name="spark_dataframe_asset")
batch_request = asset.build_batch_request(dataframe=data_frame)

batches = asset.get_batch_list_from_batch_request(batch_request)
print(batches)
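
From here, the batch request plugs into a validator as usual. A minimal sketch - the suite name and column are just placeholders:

context.add_expectation_suite(expectation_suite_name="my_suite")  # placeholder name
validator = context.get_validator(batch_request=batch_request, expectation_suite_name="my_suite")
validator.expect_column_values_to_not_be_null("some_column")  # placeholder expectation
print(validator.validate())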