CSVAsset only showing one Batch even though batching_regex matches two files

I need some pointers as to what I’m doing wrong, please.

Following the process here and in docs for setting up a multi-file Pandas filesystem datasource, I’ve created a datasource and a CSVAsset with a batching_regex which I’ve verified matches two files in the same data directory.

When I then run a checkpoint, only one file is processed. Working backward to find the problem, I see that only one file is listed when I run

mybatchrequest = asset.build_batch_request()
for batch in asset.get_batch_list_from_batch_request(mybatchrequest):
  print(batch.batch_spec)

There’s not something special that has to be done to get all the files when the regex includes group names, is there? My regex in the asset is batching_regex=re.compile('customer_(?P<datetime>\\d{14})\\.csv')
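For what it's worth, the pattern itself does match multiple filenames when I test it directly with re (these example filenames are stand-ins for my real files):

```python
import re

# The same batching_regex used in the asset
pattern = re.compile(r"customer_(?P<datetime>\d{14})\.csv")

# Hypothetical example filenames from the data directory
files = ["customer_20230101000000.csv", "customer_20230201000000.csv"]

matches = [f for f in files if pattern.match(f)]
print(matches)  # both files match
```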

Are there other likely sources of this that I should be looking into?

Hi Barret,

The format looks OK to me. The only thing I find strange is the use of re.compile when defining the batching_regex.

According to the Batch Request documentation, you define the regex as:

asset = datasource.add_csv_asset(
    name="csv_asset",
    batching_regex=r"yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2}).csv",
    order_by=["year", "month"],
)

So instead of passing it as re.compile, could you try passing it as an r string?

Another thing to consider is the way you are building your batch request. According to the docs it’s done using a call like this:

batch_request = asset.build_batch_request(options={"year": "2019", "month": "02"})

In your code, you define the group based on datetime, but then in asset.build_batch_request you are not passing datetime as an option.

So as a general rule, first define the regex as an r-string, naming the group "tag". Then you can select which of the matching files to use: either all of them, or one specific file.
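To illustrate the idea outside of GX (stdlib only, with hypothetical filenames): the named group acts as the key you later pass in options, so selecting one batch is essentially filtering the matched files by the group's value:

```python
import re

pattern = re.compile(r"yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})\.csv")
files = [
    "yellow_tripdata_sample_2019-01.csv",
    "yellow_tripdata_sample_2019-02.csv",
    "yellow_tripdata_sample_2020-01.csv",
]

# No options: every matching file becomes a batch
all_batches = [f for f in files if pattern.match(f)]

# options={"year": "2019", "month": "02"}: exactly one batch
selected = [
    f for f in files
    if (m := pattern.match(f)) and m["year"] == "2019" and m["month"] == "02"
]
print(all_batches)  # all three files
print(selected)     # only the 2019-02 file
```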

As a side note, while playing around with batch requests I discovered that build_batch_request() gives you either one specific file (when you pass a matching option) or the complete list of elements (when you pass no arguments), but nothing in between. For example, given a dir with 12 monthly files, you can't pass a list or a regex as an option value: neither options={"month": ["01", "02", "03"]} nor options={"month": "^1\d*"} (to choose all months starting with 1) is accepted.

I hope it helps!
César

I think I’ve figured this out. After rebuilding all the code from scratch, what I’m seeing is that the batches in a batch request do not change when the request is recreated after the contents of the filesystem change.

To replicate:

  1. Create a filesystem datasource and asset using a batching_regex which will match some number of existing files.
  2. Generate a batch request and print the batches in the request. Observe the count matches the number of files.
  3. Add an additional file matching the batching_regex parameter to the filesystem
  4. Generate a new batch request and print the batches in the request. Observe the new file is not listed in the batch.

Is that expected behavior? I thought I understood that the batching_regex was re-evaluated each time you created a batch request.
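For reference, the behavior I expected is what a naive re-scan would do. A stdlib-only sketch (hypothetical filenames, no GX involved): re-applying the regex to a fresh directory listing picks up files added since the previous scan:

```python
import re
import tempfile
from pathlib import Path

pattern = re.compile(r"customer_(?P<datetime>\d{14})\.csv")

def matching_files(data_dir: Path) -> list:
    """Re-scan the directory and re-apply the regex on every call."""
    return sorted(p.name for p in data_dir.iterdir() if pattern.match(p.name))

with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp)
    (data_dir / "customer_20230101000000.csv").touch()
    (data_dir / "customer_20230201000000.csv").touch()
    before = matching_files(data_dir)   # 2 files

    # A new file arrives after the first "request" was built
    (data_dir / "customer_20230301000000.csv").touch()
    after = matching_files(data_dir)    # a fresh scan sees all 3 files
    print(len(before), len(after))
```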

[edit]
The opposite case is apparently also true: If you create an asset which includes some number of files and create a batch request from that asset, then remove one or more of the files and create a new batch request, running the code below raises a FileNotFoundError: [Errno 2] No such file or directory: <the removed file>

for batch in asset.get_batch_list_from_batch_request(mybatchrequest):
  print(batch.batch_spec)

Hi @Barrett, this is a common issue raised by our users. We’ve just worked on a GX-recommended method for processing multiple batches using one checkpoint. The docs are here:

Thank you, @HaebichanGX !

Actually, @HaebichanGX , follow up question, if I may:

The solution you linked to covers getting the validation to run against all the files in the batch, but it doesn’t address the issue that new files added to the file system are not recognized by the batch even after generating a new request. The only way I’ve been able to solve for that is to create a new Asset.

Shouldn’t a batch request find all the files matching the batching_regex of the Asset at the time the batch request is created?

Hi @HaebichanGX, validating multiple batches within a single checkpoint using a batch request and validations list is a very good improvement. However, it still comes with a challenge, and I think it would be a good candidate for another GX feature.
Validation with multiple batches creates separate validation results for each batch. Often all the relevant data from a data source sits in the same checkpoint, so it would be good to have the option of a single consolidated validation result. Why would one want to see many entries in the Data Docs site for the same expectations?

Thoughts? Will the GX architecture allow this?

Hi @vipinadm, first, welcome to GX discourse! Yes, I understand your point. For that specific use case, my suggestion is to treat it at the asset level. Let’s say you’re working with a SQL datasource and have multiple sets of data. You can use add_query_asset to merge the data (with a union or otherwise) into one mega batch, then run your expectations against that single mega batch. For a filesystem datasource, the process would be similar.
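To illustrate the "one mega batch" idea, here is a sketch using plain sqlite3 rather than the GX API (the table names and query are made up for the example); in GX, a query of this shape is what you would hand to add_query_asset:

```python
import sqlite3

# Stand-ins for two "batches" of data that would otherwise be validated separately
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers_jan (id INTEGER, amount REAL)")
conn.execute("CREATE TABLE customers_feb (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO customers_jan VALUES (?, ?)", [(1, 10.0), (2, 20.0)])
conn.executemany("INSERT INTO customers_feb VALUES (?, ?)", [(3, 30.0)])

# Union everything into one result set, so a single validation run
# (and a single entry in Data Docs) covers all the data
query = "SELECT * FROM customers_jan UNION ALL SELECT * FROM customers_feb"
rows = conn.execute(query).fetchall()
print(len(rows))  # 3 rows in the single merged "mega batch"
```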

The example we gave, matching one expectation suite to multiple batch requests inside a validations list, was meant to keep things simple in terms of code and explanation and to focus on checkpoints. I hope that clarifies our intentions and addresses the potential issue you raised.

Hi @HaebichanGX, could you please share an example of add_query_asset? I am reading data from a GCS bucket.
Note: @Vipin Admulwar originally posted this reply in Slack. It might not have transferred perfectly.

… or an HQL (Hive metastore) example would be helpful
Note: @Vipin Admulwar originally posted this reply in Slack. It might not have transferred perfectly.

Hi @vipinadm, there is no doc on the Hive metastore. The doc on add_query_asset is here: Manage SQL Data Assets | Great Expectations

You can use this advanced GX flow chart to visualize the workflow: GX Advanced Used Case Flow Chart (INCOMPLETE) | Lucidspark

Hi @HaebichanGX, can you please confirm whether a Hive datasource is supported in GX? I am using Spark to read the data from a GCS bucket, so add_query_asset won’t help.