Configure datasource for JSON Lines files

Hey there,

Does anyone know how to properly configure a datasource for JSON Lines files (one JSON object per line)?
Here’s my datasource configuration at the moment:

datasources:
  raw:
    class_name: PandasDatasource
    data_asset_type:
      class_name: PandasDataset
      module_name: great_expectations.dataset
    batch_kwargs_generators:
      subdir_reader:
        class_name: SubdirReaderBatchKwargsGenerator
        base_directory: ../data/raw
    module_name: great_expectations.datasource

Using pandas I can read the file with pd.read_json('data/raw/file.json', lines=True). How can I configure a file like this in the datasource?

I tried configuring batch_kwargs_generators like this with no luck:

batch_kwargs_generators:
  subdir_reader:
    class_name: SubdirReaderBatchKwargsGenerator
    base_directory: ../data/raw
    reader_method: read_json
    reader_options:
      lines: true 

Your second configuration snippet is correct: with "lines: true" under reader_options, that option is passed through to pandas.read_json, which will read a file with one JSON object per line correctly.

I suspect the only reason this is not working yet is the relative path in base_directory. If you are using a relative path to your data, make sure it is relative to the "great_expectations" directory in your project (the directory where great_expectations.yml is located).
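If you want to confirm that the reader options are being passed through, you can also hand batch_kwargs to the data context directly instead of going through the generator. A rough sketch with the legacy batch_kwargs API is below; the suite name "check_json" is just a placeholder, and the path is the file from your post (adjust it relative to where you run the script):

from great_expectations.data_context import DataContext

# Load the project context (run this from the project root,
# i.e. the directory that contains the great_expectations/ folder).
context = DataContext()

# Placeholder suite name; create it if it does not exist yet.
context.create_expectation_suite("check_json", overwrite_existing=True)

# Explicit batch_kwargs: reader_method and reader_options are
# forwarded to pandas.read_json by the PandasDatasource.
batch_kwargs = {
    "datasource": "raw",
    "path": "data/raw/file.json",          # file from the original post
    "reader_method": "read_json",
    "reader_options": {"lines": True},
}

batch = context.get_batch(batch_kwargs, "check_json")
print(batch.head())  # the batch is a PandasDataset, so DataFrame methods work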


I managed to fix it with your suggestion. Thanks!


I have the same issue, but my raw JSON files are in an Azure Data Lake filesystem.
Would it be possible to use Great Expectations to validate them?
