How to load data from S3 for validation with Pandas as a batch

This article is for comments to: https://docs.greatexpectations.io/en/latest/how_to_guides/creating_batches/how_to_load_data_from_s3_for_validation_with_pandas_as_a_batch.html

Please comment +1 if this How to is important to you.
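For context, a minimal sketch of what "load data from S3 into pandas for validation" means in practice (the bucket, key, and column names here are hypothetical; the S3 read is simulated with an in-memory buffer, but with s3fs installed the same pd.read_csv call accepts an s3:// URL):

```python
import io
import pandas as pd

# In a real setup this would be:
#   df = pd.read_csv("s3://mybucket/device_ip_sample/data.csv")
buf = io.StringIO("device_ip,count\n10.0.0.1,3\n10.0.0.2,5\n")
df = pd.read_csv(buf)

# A simple batch-level check, analogous to an expectation on the batch:
assert df["count"].between(0, 100).all()
```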


+1

I don’t see how to define an S3 datasource in 0.11.0 so that it works. I have tried multiple versions of:

  devices_s3:
    batch_kwargs_generators:
      s3gen:
        class_name: S3GlobReaderBatchKwargsGenerator
        bucket: mybucket/
        reader_method: parquet
        assets: 
          device_ip_sample: 
            prefix: device_ip_sample/
            regex_filter: .*
            dictionary_assets: True
    module_name: great_expectations.datasource
    data_asset_type:
      module_name: great_expectations.dataset
      class_name: SparkDFDataset
    class_name: SparkDFDatasource

but running great_expectations datasource profile devices_s3 gives me this error:

Unrecognized batch_parameter(s): {'data_asset_name'}
(...omitted errors...)
  File "/home/kris/anaconda3/lib/python3.7/site-packages/great_expectations/datasource/batch_kwargs_generator/s3_batch_kwargs_generator.py", line 208, in _build_batch_kwargs
    "s3": "s3a://" + self.bucket + "/" + key,
TypeError: can only concatenate str (not "dict") to str
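The traceback suggests that somewhere a dict (an asset definition) is being used where a string (an S3 object key) was expected. A minimal, illustrative reproduction of that TypeError, with hypothetical names standing in for the generator's internals:

```python
# Hypothetical stand-ins for the config values from the YAML above.
bucket = "mybucket/"
assets = {
    "device_ip_sample": {"prefix": "device_ip_sample/", "regex_filter": ".*"},
}

# If code iterates over asset *definitions* (dicts) where it expects
# S3 object keys (strings), the concatenation fails exactly as reported:
for key in assets.values():  # each value is a dict, not a str
    try:
        url = "s3a://" + bucket + "/" + key
    except TypeError as exc:
        message = str(exc)  # can only concatenate str (not "dict") to str
```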

Thanks for your help!

@krisp we’ll work on reproducing this! Thanks for the report. Just to confirm: is this on version 0.11.0?

@krisp Have you confirmed the version by chance? I believe this was an issue that has since been fixed…