CSVAsset only showing one Batch even though batching_regex matches two files

I need some pointers as to what I’m doing wrong, please.

Following the process here and in docs for setting up a multi-file Pandas filesystem datasource, I’ve created a datasource and a CSVAsset with a batching_regex which I’ve verified matches two files in the same data directory.

When I then run a checkpoint, only one file is processed. Working backward to find the problem, I see that only one file is listed when I run

mybatchrequest = asset.build_batch_request()
for batch in asset.get_batch_list_from_batch_request(mybatchrequest):
  print(batch.batch_spec)

There’s not something special that has to be done to get all the files when the regex includes group names, is there? My regex in the asset is batching_regex=re.compile('customer_(?P<datetime>\\d{14})\\.csv')
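For what it's worth, the pattern itself does match multiple filenames when I test it directly with re (these example filenames are stand-ins for my real files):

```python
import re

# The same batching_regex used in the asset
pattern = re.compile(r"customer_(?P<datetime>\d{14})\.csv")

# Hypothetical example filenames from the data directory
files = ["customer_20230101000000.csv", "customer_20230201000000.csv"]

matches = [f for f in files if pattern.match(f)]
print(matches)  # both files match
```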

Are there other likely sources of this that I should be looking into?

Hi Barret,

The format looks OK to me. The only thing I find strange is the use of re.compile when defining the batching_regex.

According to the Batch Request documentation, you define the regex as:

asset = datasource.add_csv_asset(
    name="csv_asset",
    batching_regex=r"yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2}).csv",
    order_by=["year", "month"],
)

So instead of passing it as re.compile, could you try passing it as an r string?

Another thing to consider is the way you are building your batch request. According to the docs it’s done using a call like this:

batch_request = asset.build_batch_request(options={"year": "2019", "month": "02"})

In your code, you define the group based on datetime, but then in asset.build_batch_request you are not passing datetime as an option.

So as a general rule, first define the regex as an r-string, naming the group "tag". Then you can select which of the matching files to use: either all of them, or one specific file.
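To illustrate the idea outside of GX (stdlib only, with hypothetical filenames): the named group acts as the key you later pass in options, so selecting one batch is essentially filtering the matched files by the group's value:

```python
import re

pattern = re.compile(r"yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})\.csv")
files = [
    "yellow_tripdata_sample_2019-01.csv",
    "yellow_tripdata_sample_2019-02.csv",
    "yellow_tripdata_sample_2020-01.csv",
]

# No options: every matching file becomes a batch
all_batches = [f for f in files if pattern.match(f)]

# options={"year": "2019", "month": "02"}: exactly one batch
selected = [
    f for f in files
    if (m := pattern.match(f)) and m["year"] == "2019" and m["month"] == "02"
]
print(all_batches)  # all three files
print(selected)     # only the 2019-02 file
```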

As a side note, while playing around with batch requests I discovered that build_batch_request() gives you either one specific file (when you pass a matching option) or the complete list of elements (when you pass no arguments), but nothing in between. For example, given a dir with 12 monthly files, you can't pass a list or a regex as an option value: neither options={"month": ["01", "02", "03"]} nor options={"month": "^1\d*"} (to choose all months starting with 1) is accepted.

I hope it helps!
César

I think I’ve figured this out. After rebuilding all the code from scratch, what I’m seeing is that the batches in a batch request do not change when the request is recreated after the contents of the filesystem change.

To replicate:

  1. Create a filesystem datasource and asset using a batching_regex which will match some number of existing files.
  2. Generate a batch request and print the batches in the request. Observe the count matches the number of files.
  3. Add an additional file matching the batching_regex parameter to the filesystem
  4. Generate a new batch request and print the batches in the request. Observe the new file is not listed in the batch.

Is that expected behavior? I thought I understood that the batching_regex was re-evaluated each time you created a batch request.
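For reference, the behavior I expected is what a naive re-scan would do. A stdlib-only sketch (hypothetical filenames, no GX involved): re-applying the regex to a fresh directory listing picks up files added since the previous scan:

```python
import re
import tempfile
from pathlib import Path

pattern = re.compile(r"customer_(?P<datetime>\d{14})\.csv")

def matching_files(data_dir: Path) -> list:
    """Re-scan the directory and re-apply the regex on every call."""
    return sorted(p.name for p in data_dir.iterdir() if pattern.match(p.name))

with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp)
    (data_dir / "customer_20230101000000.csv").touch()
    (data_dir / "customer_20230201000000.csv").touch()
    before = matching_files(data_dir)   # 2 files

    # A new file arrives after the first "request" was built
    (data_dir / "customer_20230301000000.csv").touch()
    after = matching_files(data_dir)    # a fresh scan sees all 3 files
    print(len(before), len(after))
```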

[edit]
The opposite case is apparently also true: If you create an asset which includes some number of files and create a batch request from that asset, then remove one or more of the files and create a new batch request, running the code below raises a FileNotFoundError: [Errno 2] No such file or directory: <the removed file>

for batch in asset.get_batch_list_from_batch_request(mybatchrequest):
  print(batch.batch_spec)

Hi @Barrett, this is a common issue raised by our users. We’ve just worked on a GX-recommended method for processing multiple batches using one checkpoint. The docs are here:

Thank you, @HaebichanGX !

Actually, @HaebichanGX , follow up question, if I may:

The solution you linked to covers getting the validation to run against all the files in the batch, but it doesn’t address the issue that new files added to the file system are not recognized by the batch even after generating a new request. The only way I’ve been able to solve for that is to create a new Asset.

Shouldn’t a batch request find all the files matching the batching_regex of the Asset at the time the batch request is created?

Hi @HaebichanGX, validating multiple batches within a single checkpoint using a batch request and validations list is a very good improvement. However, it still comes with a challenge, and I think it would be a good candidate for another GX feature.
Validation with multiple batches creates separate validation results for each batch. Often all the relevant data from a data source sits in the same checkpoint, so it would be good to have the option of a single consolidated validation result. Why would one want to see many entries in the Data Docs site for the same expectations?

Thoughts? Will the GX architecture allow this?

Hi @vipinadm, first, welcome to GX discourse! Yes, I understand your point. For that specific use case, my suggestion is to treat it at the asset level. Let’s say you’re working with a SQL datasource and have multiple sets of data. You can use add_query_asset to merge the data (with a union or otherwise) into one mega batch, then run your expectations against that single mega batch. For a filesystem datasource, the process would be similar.
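To illustrate the "one mega batch" idea, here is a sketch using plain sqlite3 rather than the GX API (the table names and query are made up for the example); in GX, a query of this shape is what you would hand to add_query_asset:

```python
import sqlite3

# Stand-ins for two "batches" of data that would otherwise be validated separately
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers_jan (id INTEGER, amount REAL)")
conn.execute("CREATE TABLE customers_feb (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO customers_jan VALUES (?, ?)", [(1, 10.0), (2, 20.0)])
conn.executemany("INSERT INTO customers_feb VALUES (?, ?)", [(3, 30.0)])

# Union everything into one result set, so a single validation run
# (and a single entry in Data Docs) covers all the data
query = "SELECT * FROM customers_jan UNION ALL SELECT * FROM customers_feb"
rows = conn.execute(query).fetchall()
print(len(rows))  # 3 rows in the single merged "mega batch"
```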

The example we gave, matching one expectation suite to multiple batch requests inside a validations list, was meant to keep things simple in terms of code and explanation and to focus on checkpoints. I hope that clarifies our intentions and addresses the potential issue you raised.

Hi @HaebichanGX, could you please share an example of add_query_asset? I am reading data from a GCS bucket.
Note: @Vipin Admulwar originally posted this reply in Slack. It might not have transferred perfectly.

… or an HQL (Hive metastore) example would be helpful
Note: @Vipin Admulwar originally posted this reply in Slack. It might not have transferred perfectly.

Hi @vipinadm, there is no doc on the Hive metastore. The doc on add_query_asset is here: Manage SQL Data Assets | Great Expectations

You can use this advanced GX flow chart to visualize the workflow: GX Advanced Used Case Flow Chart (INCOMPLETE) | Lucidspark

Hi @HaebichanGX, can you please confirm whether a Hive datasource is supported in GX? I am using Spark to read the data from a GCS bucket, so add_query_asset won’t help.