I have some code which successfully creates a batch request using Pandas. However, I would like to use Spark instead (for scalability, and to solve another problem).
Here is my working Pandas code:
# Copy data for validation to DataBricks cluster
dbutils.fs.cp(f"abfss://{blob_container_name}@{datavalidation_storage_account_name}.dfs.core.windows.net/{for_validation_data_folder}/", f"/data-in/{run_name}/", True)
dbutils.fs.ls(f"/data-in/{run_name}/")
# Create pandas batch request
pandas_datasource = context.sources.add_pandas_filesystem(name="pandas_data_in", base_directory=f"/dbfs/data-in/{run_name}/")
pandas_asset = pandas_datasource.add_csv_asset("csv", batching_regex=r"(?P<file_name>.*).csv")
pandas_batch_request = pandas_asset.build_batch_request({"file_name": file_name})
pandas_batches = pandas_asset.get_batch_list_from_batch_request(pandas_batch_request)
print (pandas_batches)
Here are some failed attempts to use Spark instead:
- Simple substitution of add_spark_filesystem for add_pandas_filesystem
datasource = context.sources.add_spark_filesystem(name="data_in", base_directory=f"/dbfs/data-in/{run_name}/")
asset = datasource.add_csv_asset("csv", batching_regex=r"(?P<file_name>.*).csv")
batch_request = asset.build_batch_request({"file_name": file_name})
batches = asset.get_batch_list_from_batch_request(batch_request)
# Line above fails with "Unable to read in batch from the following path: /dbfs/data-in/Run42/MyFileName.csv. Please check your configuration."
- Trying dbfs:
datasource = context.sources.add_spark_filesystem(name="data_in", base_directory=f"dbfs:/data-in/{run_name}/")
# Fails with TestConnectionError: base_directory path: /databricks/driver/dbfs:/data-in/Run43 does not exist.
- Using relative path
datasource = context.sources.add_spark_filesystem(name="data_in", base_directory=f"../../dbfs/data-in/{run_name}/")
asset = datasource.add_csv_asset("csv", batching_regex=r"(?P<file_name>.*).csv")
batch_request = asset.build_batch_request({"file_name": file_name})
batches = asset.get_batch_list_from_batch_request(batch_request)
# Line above fails with "IllegalArgumentException: Path must be absolute: ../../dbfs/data-in/Run40/MyFileName.csv"
It seems this may be a simple question of identifying an absolute path which is equivalent to /databricks/driver/../../dbfs/
- but not /dbfs/
.
Hoping someone can help!
Many thanks.