The current version of the Great Expectations framework documentation (0.9.4) does not contain any example of how to configure a PySpark datasource to access files on AWS S3. It would be really helpful to have an example of its configuration.
You’re right! In fact, it’s very similar to the example for pandas, since Spark’s reader methods also know how to process S3 paths:
datasources:
  nyc_taxi:
    class_name: SparkDFDatasource
    generators:
      s3:
        class_name: S3GlobReaderBatchKwargsGenerator
        bucket: nyc-tlc
        delimiter: '/'
        reader_options:
          sep: ','
          engine: python
        assets:
          taxi-green:
            prefix: 'trip data/'
            regex_filter: 'trip data/green.*\.csv'
          taxi-fhv:
            prefix: 'trip data/'
            regex_filter: 'trip data/fhv.*\.csv'
    data_asset_type:
      class_name: SparkDFDataset
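To illustrate how each asset's `regex_filter` selects objects under its `prefix`, here is a small self-contained sketch of that filtering step using Python's `re` module. The S3 keys below are made up for illustration; the actual generator lists real keys from the `nyc-tlc` bucket:

```python
import re

# Hypothetical S3 object keys under the nyc-tlc bucket (illustrative only).
keys = [
    "trip data/green_tripdata_2019-01.csv",
    "trip data/fhv_tripdata_2019-01.csv",
    "trip data/yellow_tripdata_2019-01.csv",
]

# The generator keeps only the keys matching each asset's regex_filter.
green = [k for k in keys if re.match(r"trip data/green.*\.csv", k)]
fhv = [k for k in keys if re.match(r"trip data/fhv.*\.csv", k)]

print(green)  # ['trip data/green_tripdata_2019-01.csv']
print(fhv)    # ['trip data/fhv_tripdata_2019-01.csv']
```

Note that the yellow-taxi key matches neither asset, so it would not be picked up unless you add a third asset with a matching `regex_filter`.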