The current version of the Great Expectations framework documentation (0.9.4) does not contain any example of how to configure a PySpark datasource to access files on AWS S3. It would be really helpful to have an example of its configuration.
You’re right! In fact, it’s very similar to the example for pandas, since Spark’s reader methods also know how to process S3 paths:
datasources:
  nyc_taxi:
    class_name: SparkDFDatasource
    generators:
      s3:
        class_name: S3GlobReaderBatchKwargsGenerator
        bucket: nyc-tlc
        delimiter: '/'
        reader_options:
          sep: ','
          engine: python
        assets:
          taxi-green:
            prefix: 'trip data/'
            regex_filter: 'trip data/green.*\.csv'
          taxi-fhv:
            prefix: 'trip data/'
            regex_filter: 'trip data/fhv.*\.csv'
    data_asset_type:
      class_name: SparkDFDataset
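To illustrate how each asset's `regex_filter` selects objects under its `prefix`, here is a small self-contained sketch of that filtering step using Python's `re` module. The S3 keys below are made up for illustration; the actual generator lists real keys from the `nyc-tlc` bucket:

```python
import re

# Hypothetical S3 object keys under the nyc-tlc bucket (illustrative only).
keys = [
    "trip data/green_tripdata_2019-01.csv",
    "trip data/fhv_tripdata_2019-01.csv",
    "trip data/yellow_tripdata_2019-01.csv",
]

# The generator keeps only the keys matching each asset's regex_filter.
green = [k for k in keys if re.match(r"trip data/green.*\.csv", k)]
fhv = [k for k in keys if re.match(r"trip data/fhv.*\.csv", k)]

print(green)  # ['trip data/green_tripdata_2019-01.csv']
print(fhv)    # ['trip data/fhv_tripdata_2019-01.csv']
```

Note that the yellow-taxi key matches neither asset, so it would not be picked up unless you add a third asset with a matching `regex_filter`.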