Spark Connect to S3

xavy · November 9, 2021, 1:09am

Hi All,

I’m using InferredAssetS3DataConnector to connect to S3, but it is not working: it fails with the message “ValueError: S3 query may not have been configured correctly.”
After debugging your code, I found a bug related to the prefix transformation, in the line:

self._prefix = os.path.join(prefix, "")

That line adds a trailing slash to the prefix. You can easily replicate the issue using this code:

import os
prefix = "weather.csv"

prefix_new = os.path.join(prefix, "")
print(prefix_new)

It returns weather.csv/ instead of weather.csv as expected…
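
To show why the trailing slash breaks the lookup, here is a minimal sketch. It assumes S3 filters keys by a plain "starts with" match on the prefix; the key names and the list_keys helper are made up for illustration and are not part of Great Expectations or boto3.

```python
import os

# Hypothetical object keys in a bucket (illustration only).
keys = ["weather.csv", "weather.csv.bak", "data/weather.csv"]


def list_keys(keys, prefix):
    """Mimic prefix filtering: keep keys that start with the given prefix."""
    return [k for k in keys if k.startswith(prefix)]


prefix = "weather.csv"
joined = os.path.join(prefix, "")  # appends a path separator to the prefix

matches_plain = list_keys(keys, prefix)   # the file itself is found
matches_joined = list_keys(keys, joined)  # nothing matches the altered prefix

print(matches_plain)
print(matches_joined)
```

With the unmodified prefix the file is matched, while the joined prefix matches nothing, which would explain the empty query result.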
The same issue occurs in the class “ConfiguredAssetS3DataConnector”. Could you please consider changing it to:

self._bucket = bucket
# self._prefix = os.path.join(prefix, "")  # causes the issue
self._prefix = prefix  # potential solution
self._delimiter = delimiter
self._max_keys = max_keys

Can you please take a look and fix it in the next release?

With Best Regards
Xavier
