We are using GE to profile our ingested datasets on S3. It mostly works, but we have an issue with the scenario where data is ingested into multiple prefixes. I am wondering how I can specify the prefixes in my regex_filter. This is what I have:
"assets": {
  "name_1": {
    "prefix": "prefix_1/",
    "regex_filter": "prefix_1/exportdate=.*/.*parquet"
  }
}
And I would like to profile these files from these prefixes:
prefix_1/exportdate=2020/1.parquet
prefix_1/exportdate=2020/2.parquet
prefix_1/exportdate=2019/10.parquet
prefix_1/exportdate=2019/20.parquet
But I get this error: great_expectations.exceptions.BatchKwargsError: Unrecognized batch_kwargs for spark_source
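For what it's worth, the regex itself does seem to match all four paths when I check it with Python's re module (a quick sketch using the paths above), so the filter pattern doesn't appear to be the problem:

import re

# The regex_filter value from my asset config above.
pattern = re.compile(r"prefix_1/exportdate=.*/.*parquet")

paths = [
    "prefix_1/exportdate=2020/1.parquet",
    "prefix_1/exportdate=2020/2.parquet",
    "prefix_1/exportdate=2019/10.parquet",
    "prefix_1/exportdate=2019/20.parquet",
]

for p in paths:
    # Prints True for every path, so all four files pass the filter.
    print(p, bool(pattern.match(p)))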
@falizadeh - Clarifying question:
Do these files have the same schema so you want them all to be profiled together and eventually tested in a single batch?
@bhcastleton Yes, they all have the same schema; they are just placed in different prefixes.
@falizadeh - Thanks. Unfortunately, the current answer is that Great Expectations doesn't have a native way to handle this internally unless the computing framework you're working in can handle it. For example, this post discusses how to use Spark to load multiple files into one batch: Validate foreign keys / load multiple files to a single spark dataframe with batch generator
Spark supports specifying a path that covers multiple files and will read them all into a single DataFrame, so this works. Other frameworks may not support this. Great Expectations currently doesn't handle it internally, so you'd have to get all the files into a single file or DataFrame before pointing Great Expectations at it.
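As a rough illustration only (the bucket name, column name, and expectation are placeholders, and this assumes the legacy SparkDFDataset API), something like this would load all of the exportdate partitions into one DataFrame and validate them as a single batch:

from pyspark.sql import SparkSession
from great_expectations.dataset import SparkDFDataset

spark = SparkSession.builder.getOrCreate()

# Spark expands the glob itself, so every exportdate=... partition under
# prefix_1/ is read into a single DataFrame.
df = spark.read.parquet("s3a://my-bucket/prefix_1/exportdate=*/*.parquet")

# Wrap the combined DataFrame so expectations run against one batch.
batch = SparkDFDataset(df)
batch.expect_column_values_to_not_be_null("some_column")  # placeholder expectation

Once the DataFrame is built this way, you can point Great Expectations at it directly instead of relying on the S3 generator to stitch the prefixes together.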