Connecting GE with S3

Thanks for reaching out! I think the easiest way forward would be to complete the great_expectations init process as you’ve been doing, selecting "1. Files on a filesystem (for processing with Pandas or Spark)" for the first option. But instead of entering an S3 path when prompted for a filepath, enter the path of a small sample CSV saved on your local machine. Since this CSV will only be used to generate a sample Expectation Suite and validation results, you can delete those artifacts later. This way, you’ll get the Great Expectations project scaffold, with its directory structure and config file, without having to set everything up manually.

Once you’ve completed the init process, open the great_expectations.yml file that was created; this is where we’ll add the configuration for the S3GlobReaderBatchKwargsGenerator. The datasources section should look something like this:

datasources:
  files_datasource:
    module_name: great_expectations.datasource
    data_asset_type:
      module_name: great_expectations.dataset
      class_name: PandasDataset
    class_name: PandasDatasource

Since this datasource was set up during init, it doesn’t have a batch_kwargs_generators section yet. After adding the appropriate config as outlined in https://docs.greatexpectations.io/en/latest/module_docs/generator_module.html?highlight=s3%20glob#s3globreaderbatchkwargsgenerator, the datasources section should look something like this (filled in with your own info, of course):

datasources:
  files_datasource:
    module_name: great_expectations.datasource
    data_asset_type:
      module_name: great_expectations.dataset
      class_name: PandasDataset
    class_name: PandasDatasource
    batch_kwargs_generators:
      my_s3_generator:
        class_name: S3GlobReaderBatchKwargsGenerator
        bucket: my_bucket.my_organization.priv
        reader_method: parquet  # This will be automatically inferred from suffix where possible, but can be explicitly specified as well
        reader_options:  # Note that reader options can be specified globally or per-asset
          sep: ","
        delimiter: "/"  # Note that this is the delimiter for the BUCKET KEYS. By default it is "/"
        boto3_options:
          endpoint_url: $S3_ENDPOINT  # Use the S3_ENDPOINT environment variable to determine which endpoint to use
        max_keys: 100  # The maximum number of keys to fetch in a single list_objects request to s3. When accessing batch_kwargs through an iterator, the iterator will silently refetch if more keys were available
        assets:
          my_first_asset:
            prefix: my_first_asset/
            regex_filter: .*  # The regex filter will filter the results returned by S3 for the key and prefix to only those matching the regex
            dictionary_assets: True
          access_logs:
            prefix: access_logs
            regex_filter: access_logs/2019.*\.csv.gz
            sep: "~"
            max_keys: 100

You can also check out this post for another config example: "How to configure a PySpark datasource for accessing the data from AWS S3?" (Alternatively, you can do the above programmatically if you have a data_context object, using data_context.add_batch_kwargs_generator(datasource_name, batch_kwargs_generator_name, class_name, **kwargs), passing the config in as kwargs.)
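
As a rough sketch, the programmatic route could look something like the following. The datasource, generator, bucket, and asset names below are just placeholders taken from the YAML example above, so adjust them to match your own setup:

import great_expectations as ge

# Load the project scaffold created by `great_expectations init`
context = ge.data_context.DataContext()

# Attach an S3 batch kwargs generator to the existing Pandas datasource.
# The kwargs mirror the YAML config shown above.
context.add_batch_kwargs_generator(
    "files_datasource",                    # datasource_name
    "my_s3_generator",                     # batch_kwargs_generator_name
    "S3GlobReaderBatchKwargsGenerator",    # class_name
    bucket="my_bucket.my_organization.priv",
    assets={"my_first_asset": {"prefix": "my_first_asset/", "regex_filter": ".*"}},
)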

Once you have the generator set up properly, you can run great_expectations datasource profile YOUR_DATASOURCE_NAME in the CLI to generate expectation suites using batches yielded by the S3 generator.

Next, you can start playing with the sample validation notebooks found at great_expectations/notebooks/pandas/validation_playground.ipynb (for Pandas datasources). Since you’ll have a batch_kwargs_generator configured, instead of providing batch_kwargs manually as shown in the notebook, you can call context.build_batch_kwargs(datasource="datasource_name", batch_kwargs_generator="my_batch_kwargs_generator_name") to yield batch_kwargs, which you can then pass to context.get_batch(batch_kwargs, expectation_suite_name).
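
Put together, the notebook workflow might look roughly like this sketch (the datasource, generator, asset, and expectation suite names are placeholders):

import great_expectations as ge

context = ge.data_context.DataContext()

# Ask the configured S3 generator to build batch_kwargs for one of its assets
batch_kwargs = context.build_batch_kwargs(
    datasource="files_datasource",
    batch_kwargs_generator="my_s3_generator",
    data_asset_name="my_first_asset",
)

# Load the batch and validate it against an existing expectation suite
batch = context.get_batch(batch_kwargs, expectation_suite_name="my_first_asset.warning")
results = batch.validate()
print(results.success)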

If for some reason the batch_kwargs_generators configuration is still giving you issues, you can always construct batch_kwargs yourself (since that’s all a batch_kwargs_generator does anyway). For S3, batch_kwargs should have the form:

{
    "s3": "s3a://BUCKET/KEY",
    "reader_options": {...},
    "reader_method": "...",
    "limit": ...
}
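
For example, a hand-built batch_kwargs dict can be passed straight to get_batch. The bucket, key, and suite name below are made-up placeholders, and depending on your version you may also need a "datasource" entry naming the datasource to read the batch with:

import great_expectations as ge

context = ge.data_context.DataContext()

# Manually constructed batch_kwargs pointing at a single CSV object in S3
batch_kwargs = {
    "datasource": "files_datasource",  # may be required when calling context.get_batch directly
    "s3": "s3a://my_bucket.my_organization.priv/my_first_asset/my_file.csv",
    "reader_method": "csv",
    "reader_options": {"sep": ","},
}

batch = context.get_batch(batch_kwargs, expectation_suite_name="my_first_asset.warning")
results = batch.validate()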

(if you’re curious, you can check out the source for building the batch_kwargs here: https://github.com/great-expectations/great_expectations/blob/f5abb426d3837c587846d91157c9a663d6698c4d/great_expectations/datasource/batch_kwargs_generator/s3_batch_kwargs_generator.py#L188)