Configuring S3 as a Datasource

Hi all, I’m trying to configure an S3 bucket as a datasource. I’m following this guide: How to configure a Pandas/S3 Datasource — great_expectations documentation, but running into errors, specifically: “If s3 returned common prefixes it may not have been able to identify desired keys”. I’m almost sure I’m stuck on the regex_filter part. The basic challenge is that I have a prefix specified, but under that prefix there are one or two additional layers of prefixes, because we’re using a Hive partitioning strategy:

prefix: path/to/top/
partitions: version={version}/dt={date}

Any suggestions on how to handle the partitions line above?

@bradleyfay I think you are right - you need to tinker with the regex. It is difficult to say what the regex should be without seeing the exact directory structure and knowing which files you want to be counted as batches of that data asset. As a workaround, are you sure that your use case requires configuring a BatchKwargsGenerator? If not, section 2 of this how-to guide shows how you can specify a particular file on S3 to be used in a batch; a rough sketch follows below.
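
For reference, this is roughly what that workaround looks like with the legacy batch_kwargs API that the guide describes. It's a minimal sketch, assuming a Pandas datasource; the datasource name, suite name, and S3 path are placeholders to adapt:

import great_expectations as ge

context = ge.data_context.DataContext()

# Point batch_kwargs directly at one object on S3 instead of relying on
# a BatchKwargsGenerator to discover keys. All names here are placeholders.
batch_kwargs = {
    "datasource": "my_pandas_datasource",
    "s3": "s3a://s3_bucket/pre/fix/dataset_1/version=1/dt=2021-02-01/file.parquet",
    "reader_method": "parquet",
}

batch = context.get_batch(batch_kwargs, expectation_suite_name="dataset_1.warning")
print(batch.head())  # a Batch from a PandasDatasource behaves like a pandas DataFrame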

The directory structure looks like this:

s3_bucket/pre/fix/dataset_1/version=1/dt=2021-02-01/file.parquet
s3_bucket/pre/fix/dataset_1/version=1/dt=2021-02-02/file.parquet
s3_bucket/pre/fix/dataset_2/version=1/dt=2021-02-01/file.parquet
s3_bucket/pre/fix/dataset_2/version=1/dt=2021-02-02/file.parquet

dataset_1 and dataset_2 are the top levels of the datasets; version and dt are Hive partitions; and the file.parquet objects are the files I want to execute the expectations against. All the expectations will be written at the dataset entity level, so each dataset should be one data asset and each partitioned file a batch of it.

Based on the tutorial, I’m using this in my YAML:

assets:
  dataset_1:
    prefix: pre/fix/dataset_1/
    regex_filter: .*

I’ve also tried (.*)/(.*)/.* as the regex_filter, and it doesn’t work either.
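
Reading the S3GlobReaderBatchKwargsGenerator docstring again, my current guess is that the default delimiter of / is the problem: with it, S3 returns the version=... layer as common prefixes rather than keys, so the regex_filter never sees the full paths, which would match the warning I’m getting. It also looks like the regex is matched against the whole key, prefix included. This is what I plan to try next (untested guesswork on my part):

assets:
  dataset_1:
    prefix: pre/fix/dataset_1/
    # match whole keys, including the prefix and both partition layers
    regex_filter: pre/fix/dataset_1/version=[^/]*/dt=[^/]*/[^/]*\.parquet

plus delimiter: '' at the generator level (next to bucket), so that the listing returns the nested keys recursively instead of stopping at the first /.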