Struggling to identify correct base directory for add_spark_filesystem

After a flash of inspiration overnight, I thought I should try using a Spark DataFrame, which turns out to work a treat and avoids the need to copy my data onto the Databricks cluster.

Sharing my solution here in the hope it will help others. (I’m very new to Spark and Databricks, which doubtless shows!)

data_frame = (
    spark.read.format("csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .load(
        f"abfss://{blob_container_name}@{datavalidation_storage_account_name}"
        f".dfs.core.windows.net/{for_validation_data_folder}/MyFileName.csv"
    )
)
display(data_frame)
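
Note that the abfss:// read only works if the cluster can authenticate to the storage account. Mine was already set up, but as a minimal sketch, assuming account-key access stored in a Databricks secret scope (the scope name "my-secret-scope" and key name "storage-account-key" are hypothetical; service principals or credential passthrough are more common in practice):

# Hypothetical sketch of account-key auth for abfss:// paths.
# The secret scope and key names are placeholders, not from my notebook.
spark.conf.set(
    f"fs.azure.account.key.{datavalidation_storage_account_name}.dfs.core.windows.net",
    dbutils.secrets.get(scope="my-secret-scope", key="storage-account-key"),
)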

import great_expectations as gx

context = gx.get_context()  # Fluent API data context (GX 0.16+)
datasource = context.sources.add_spark("spark_datasource")
asset = datasource.add_dataframe_asset(name="spark_dataframe_asset")
batch_request = asset.build_batch_request(dataframe=data_frame)

# Sanity check: a dataframe asset should resolve to a single batch
batches = asset.get_batch_list_from_batch_request(batch_request)
print(batches)
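
From there the batch request can feed straight into a validator. A minimal follow-on sketch (the suite name "my_suite" and the column "Id" are placeholders, not part of my original notebook):

# Hypothetical continuation: run an expectation against the batch.
# "my_suite" and the column name "Id" are placeholders.
context.add_or_update_expectation_suite("my_suite")
validator = context.get_validator(
    batch_request=batch_request,
    expectation_suite_name="my_suite",
)
validator.expect_column_values_to_be_not_null("Id")
print(validator.validate())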