Configure datasource for JSON Lines files

Hey there,

Does anyone know how to properly configure a datasource for JSON Lines files (one JSON object per line)?
Here’s my datasource configuration at the moment:

datasources:
  raw:
    class_name: PandasDatasource
    data_asset_type:
      class_name: PandasDataset
      module_name: great_expectations.dataset
    batch_kwargs_generators:
      subdir_reader:
        class_name: SubdirReaderBatchKwargsGenerator
        base_directory: ../data/raw
    module_name: great_expectations.datasource

Using pandas I can read the file with pd.read_json('data/raw/file.json', lines=True). How can I configure a file like this in the datasource?

I tried configuring batch_kwargs_generators like this with no luck:

batch_kwargs_generators:
  subdir_reader:
    class_name: SubdirReaderBatchKwargsGenerator
    base_directory: ../data/raw
    reader_method: read_json
    reader_options:
      lines: true 

Your second configuration snippet is correct: with "lines: true" under reader_options, that option is passed through to pandas.read_json, which will read a file with one JSON object per line correctly.

I suspect the only reason this is not working yet is the relative path in base_directory. If you are using a relative path to your data, make sure it is relative to the "great_expectations" directory in your project (the directory where great_expectations.yml is located).
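If you want to confirm that the reader options are being passed through, you can also hand batch_kwargs to the data context directly instead of going through the generator. A rough sketch with the legacy batch_kwargs API is below; the suite name "check_json" is just a placeholder, and the path is the file from your post (adjust it relative to where you run the script):

from great_expectations.data_context import DataContext

# Load the project context (run this from the project root,
# i.e. the directory that contains the great_expectations/ folder).
context = DataContext()

# Placeholder suite name; create it if it does not exist yet.
context.create_expectation_suite("check_json", overwrite_existing=True)

# Explicit batch_kwargs: reader_method and reader_options are
# forwarded to pandas.read_json by the PandasDatasource.
batch_kwargs = {
    "datasource": "raw",
    "path": "data/raw/file.json",          # file from the original post
    "reader_method": "read_json",
    "reader_options": {"lines": True},
}

batch = context.get_batch(batch_kwargs, "check_json")
print(batch.head())  # the batch is a PandasDataset, so DataFrame methods work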


I managed to fix it with your suggestion. Thanks!


I have the same issue, but my raw JSON files are in an Azure Data Lake filesystem.
Would it be possible to use Great Expectations to validate them?
