Configuring S3 as a Datasource

Hi all, I’m trying to configure an S3 bucket as a datasource. I’m following this guide: How to configure a Pandas/S3 Datasource — great_expectations documentation, but running into errors, specifically: “If s3 returned common prefixes it may not have been able to identify desired keys”. I’m almost sure I’m stuck on the regex_filter part. The basic challenge is that I have a prefix specified, but under that prefix there are one or two additional layers of prefixes, because we’re using a Hive partitioning strategy:

prefix: path/to/top/
partitions: version={version}/dt={date}

Any suggestions on how to handle the partitions line above?

@bradleyfay I think you are right - you need to tinker with the regex. It is difficult to say what the regex should be without seeing the exact directory structure and knowing which files you want to be counted as batches of that data asset. As a workaround, are you sure that your use case requires configuring a BatchKwargsGenerator? If not, section 2 of this how-to guide shows how you can specify a particular file on S3 to be used in a batch; a rough sketch follows below.
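
For reference, this is roughly what that workaround looks like with the legacy batch_kwargs API that the guide describes. It's a minimal sketch, assuming a Pandas datasource; the datasource name, suite name, and S3 path are placeholders to adapt:

import great_expectations as ge

context = ge.data_context.DataContext()

# Point batch_kwargs directly at one object on S3 instead of relying on
# a BatchKwargsGenerator to discover keys. All names here are placeholders.
batch_kwargs = {
    "datasource": "my_pandas_datasource",
    "s3": "s3a://s3_bucket/pre/fix/dataset_1/version=1/dt=2021-02-01/file.parquet",
    "reader_method": "parquet",
}

batch = context.get_batch(batch_kwargs, expectation_suite_name="dataset_1.warning")
print(batch.head())  # a Batch from a PandasDatasource behaves like a pandas DataFrame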

The directory structure looks like this:

s3_bucket/pre/fix/dataset_1/version=1/dt=2021-02-01/file.parquet
s3_bucket/pre/fix/dataset_1/version=1/dt=2021-02-02/file.parquet
s3_bucket/pre/fix/dataset_2/version=1/dt=2021-02-01/file.parquet
s3_bucket/pre/fix/dataset_2/version=1/dt=2021-02-02/file.parquet

dataset_1 and dataset_2 are the top levels of the datasets; version and dt are Hive partitions; and the file.parquet objects are the files I want to execute the expectations against. All the expectations will be written at the dataset entity level, so each dataset should be one data asset and each partitioned file a batch of it.

Based on the tutorial, I’m using this in my YAML:

assets:
  dataset_1:
    prefix: pre/fix/dataset_1/
    regex_filter: .*

I’ve also tried (.*)/(.*)/.* as the regex_filter, and it doesn’t work either.
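
Reading the S3GlobReaderBatchKwargsGenerator docstring again, my current guess is that the default delimiter of / is the problem: with it, S3 returns the version=... layer as common prefixes rather than keys, so the regex_filter never sees the full paths, which would match the warning I’m getting. It also looks like the regex is matched against the whole key, prefix included. This is what I plan to try next (untested guesswork on my part):

assets:
  dataset_1:
    prefix: pre/fix/dataset_1/
    # match whole keys, including the prefix and both partition layers
    regex_filter: pre/fix/dataset_1/version=[^/]*/dt=[^/]*/[^/]*\.parquet

plus delimiter: '' at the generator level (next to bucket), so that the listing returns the nested keys recursively instead of stopping at the first /.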