We are using GE to profile our ingested datasets on S3. It mostly works, but we have an issue with the scenario where data is ingested into multiple prefixes. I am wondering how I can specify the prefixes in my regex_filter. This is what I have:
"assets": {
  "name_1": {
    "prefix": "prefix_1/",
    "regex_filter": "prefix_1/exportdate=.*/.*parquet"
  }
}
And I would like to profile these files from these prefixes:
prefix_1/exportdate=2020/1.parquet
prefix_1/exportdate=2020/2.parquet
prefix_1/exportdate=2019/10.parquet
prefix_1/exportdate=2019/20.parquet
But I get this error: great_expectations.exceptions.BatchKwargsError: Unrecognized batch_kwargs for spark_source
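For what it's worth, the regex itself does seem to match all four paths when I check it with Python's re module (a quick sketch using the paths above), so the filter pattern doesn't appear to be the problem:

import re

# The regex_filter value from my asset config above.
pattern = re.compile(r"prefix_1/exportdate=.*/.*parquet")

paths = [
    "prefix_1/exportdate=2020/1.parquet",
    "prefix_1/exportdate=2020/2.parquet",
    "prefix_1/exportdate=2019/10.parquet",
    "prefix_1/exportdate=2019/20.parquet",
]

for p in paths:
    # Prints True for every path, so all four files pass the filter.
    print(p, bool(pattern.match(p)))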
@falizadeh - Clarifying question:
Do these files have the same schema so you want them all to be profiled together and eventually tested in a single batch?
@bhcastleton Yes, they all have the same schema; they are just placed in different prefixes.
@falizadeh - Thanks. Unfortunately, the current answer is that Great Expectations doesn't have a native way to handle this internally unless the computing framework you're working in can handle it. For example, this post discusses how to use Spark to load multiple files into one batch: Validate foreign keys / load multiple files to a single spark dataframe with batch generator
Spark supports specifying a path that covers multiple files and will read them all into a single DataFrame, so this works. Other frameworks may not support this. Great Expectations currently doesn't handle it internally, so you'd have to get all the files into a single file or DataFrame before pointing Great Expectations at it.
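As a rough illustration only (the bucket name, column name, and expectation are placeholders, and this assumes the legacy SparkDFDataset API), something like this would load all of the exportdate partitions into one DataFrame and validate them as a single batch:

from pyspark.sql import SparkSession
from great_expectations.dataset import SparkDFDataset

spark = SparkSession.builder.getOrCreate()

# Spark expands the glob itself, so every exportdate=... partition under
# prefix_1/ is read into a single DataFrame.
df = spark.read.parquet("s3a://my-bucket/prefix_1/exportdate=*/*.parquet")

# Wrap the combined DataFrame so expectations run against one batch.
batch = SparkDFDataset(df)
batch.expect_column_values_to_not_be_null("some_column")  # placeholder expectation

Once the DataFrame is built this way, you can point Great Expectations at it directly instead of relying on the S3 generator to stitch the prefixes together.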