How to load data from S3 for validation with Pandas as a batch

This article is for comments to:

Please comment +1 if this how-to is important to you.

I can’t work out how to define an S3 datasource in 0.11.0 so that it works. I have tried multiple versions of the following:

        class_name: S3GlobReaderBatchKwargsGenerator
        bucket: mybucket/
        reader_method: parquet
            prefix: device_ip_sample/
            regex_filter: .*
            dictionary_assets: True
    module_name: great_expectations.datasource
      module_name: great_expectations.dataset
      class_name: SparkDFDataset
    class_name: SparkDFDatasource

but running great_expectations datasource profile devices_s3 gives me this error:

Unrecognized batch_parameter(s): {'data_asset_name'}
(...omitted errors...)
  File "/home/kris/anaconda3/lib/python3.7/site-packages/great_expectations/datasource/batch_kwargs_generator/", line 208, in _build_batch_kwargs
    "s3": "s3a://" + self.bucket + "/" + key,
TypeError: can only concatenate str (not "dict") to str

Thanks for your help!

@krisp we’ll work on reproducing this! Thanks for the report. Just to confirm, is this on version 0.11.0?

@krisp, have you had a chance to confirm the version? I believe this issue has since been fixed…
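
For anyone reading the traceback above: its last line shows a string concatenation that fails because key arrived as a dict rather than a str. The snippet below is a minimal standalone sketch of that failure mode only. It is not the library's code; build_s3_path and the shape of the dict are hypothetical, chosen for illustration, and it does not establish which part of the configuration produces the dict.

```python
def build_s3_path(bucket: str, key) -> str:
    """Hypothetical stand-in for the concatenation shown in the traceback."""
    return "s3a://" + bucket + "/" + key


# A plain string key concatenates cleanly into an S3 URL.
ok_path = build_s3_path("mybucket", "device_ip_sample/part-0000.parquet")

# A dict in place of the key (assumed shape, illustration only) raises TypeError,
# because Python will not implicitly convert a dict when concatenating to a str.
try:
    build_s3_path("mybucket", {"Key": "device_ip_sample/part-0000.parquet"})
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(ok_path)
print(error_message)
```

If the same error appears on a current version, checking the type of the value that reaches the concatenation is a reasonable first debugging step.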