We got a question about validating Excel files without having to export them to CSV. I can think of two options:
-
One quick option would be to follow the instructions provided in this tutorial to validate Excel files in-line in notebook. You can use
ge.read_excel
instead ofread_csv
(make sure the headers are loaded correctly). However, this means you won’t be able to save the expectation suite or generate data docs since you won’t have a data context. -
Another option would be to create a data context with a “dummy” datasource (e.g. pointing at any directory), then loading the Excel data as a pandas dataframe, creating a batch, and using the standard workflow that’s shown in the
great_expectations suite new
notebook. The only difference is how you set up the batch_kwargs: 1) Pass the ‘dataset’ key to point to the dataframe. 2) Unfortunately you’ll need the ‘datasource’ key to point to a dummy datasource (this isn’t quite intuitive). Here’s some code that worked for me: https://gist.github.com/spbail/6723728aecf0295bf50622d07e09840e