How to load data from S3 for validation with Pandas as a batch

This article is for comments to: https://docs.greatexpectations.io/en/latest/how_to_guides/creating_batches/how_to_load_data_from_s3_for_validation_with_pandas_as_a_batch.html

Please comment +1 if this How to is important to you.
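For context, a minimal sketch of what "load data from S3 into pandas for validation" means in practice (the bucket, key, and column names here are hypothetical; the S3 read is simulated with an in-memory buffer, but with s3fs installed the same pd.read_csv call accepts an s3:// URL):

```python
import io
import pandas as pd

# In a real setup this would be:
#   df = pd.read_csv("s3://mybucket/device_ip_sample/data.csv")
buf = io.StringIO("device_ip,count\n10.0.0.1,3\n10.0.0.2,5\n")
df = pd.read_csv(buf)

# A simple batch-level check, analogous to an expectation on the batch:
assert df["count"].between(0, 100).all()
```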


+1

I don’t see how to define an S3 datasource in 0.11.0 so that it works. I have tried multiple versions of:

  devices_s3:
    batch_kwargs_generators:
      s3gen:
        class_name: S3GlobReaderBatchKwargsGenerator
        bucket: mybucket/
        reader_method: parquet
        assets: 
          device_ip_sample: 
            prefix: device_ip_sample/
            regex_filter: .*
            dictionary_assets: True
    module_name: great_expectations.datasource
    data_asset_type:
      module_name: great_expectations.dataset
      class_name: SparkDFDataset
    class_name: SparkDFDatasource

but running great_expectations datasource profile devices_s3 gives me this error:

Unrecognized batch_parameter(s): {'data_asset_name'}
(...omitted errors...)
  File "/home/kris/anaconda3/lib/python3.7/site-packages/great_expectations/datasource/batch_kwargs_generator/s3_batch_kwargs_generator.py", line 208, in _build_batch_kwargs
    "s3": "s3a://" + self.bucket + "/" + key,
TypeError: can only concatenate str (not "dict") to str
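The traceback suggests that somewhere a dict (an asset definition) is being used where a string (an S3 object key) was expected. A minimal, illustrative reproduction of that TypeError, with hypothetical names standing in for the generator's internals:

```python
# Hypothetical stand-ins for the config values from the YAML above.
bucket = "mybucket/"
assets = {
    "device_ip_sample": {"prefix": "device_ip_sample/", "regex_filter": ".*"},
}

# If code iterates over asset *definitions* (dicts) where it expects
# S3 object keys (strings), the concatenation fails exactly as reported:
for key in assets.values():  # each value is a dict, not a str
    try:
        url = "s3a://" + bucket + "/" + key
    except TypeError as exc:
        message = str(exc)  # can only concatenate str (not "dict") to str
```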

Thanks for your help!

@krisp we’ll work on reproducing this! Thanks for the report. Just to confirm: is this on version 0.11.0?

@krisp Have you confirmed the version by chance? I believe this was an issue that has since been fixed…