Can I use Great Expectations to validate non-tabular data?

Can I use Great Expectations to validate non-tabular data? For example, graphs, nested fields, etc.?

Currently, Great Expectations can be used to validate only tabular data - Pandas dataframes, Spark dataframes, database tables and query results.

There is no theoretical reason why Great Expectations’s approach cannot be extended to non-tabular data, but it has not been done yet.

As a workaround, some users who want to use Great Expectations to validate their non-tabular data transform it into a tabular format.
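To make that workaround concrete, here is a minimal sketch of flattening nested records into flat rows whose dotted keys can act as column names. It uses only the Python standard library. The `flatten` helper and the sample records are hypothetical illustrations and are not part of Great Expectations.

```python
from typing import Any, Dict


def flatten(record: Dict[str, Any], parent_key: str = "", sep: str = ".") -> Dict[str, Any]:
    """Flatten nested dicts into a single-level dict with dotted keys."""
    flat: Dict[str, Any] = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested objects, extending the dotted prefix.
            flat.update(flatten(value, full_key, sep))
        else:
            # Scalars (and lists, in this sketch) are kept as cell values.
            flat[full_key] = value
    return flat


# Hypothetical sample data, for illustration only.
records = [
    {"id": 1, "user": {"name": "ada", "address": {"city": "London"}}},
    {"id": 2, "user": {"name": "bob", "address": {"city": "Paris"}}},
]

rows = [flatten(r) for r in records]
columns = sorted({key for row in rows for key in row})
```

Rows shaped like this could then be loaded into a Pandas dataframe and validated as ordinary tabular data. Arrays nested inside records would need extra handling (for example, one row per array element), which this sketch does not attempt.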

Very interested here! I see great potential for great_expectations in the JSON and RDF world.
Let’s start with object/hierarchical/JSON/XML data. The nesting and fields are dynamic, which makes them unpredictable. This makes the data hard to stream, so it is much harder to reach the performance you get with tabular data. We might want to think about what restrictions to introduce. Next, you would need some path language (XPath or JSONPath) to traverse the object, because to define an expectation you need a reference to the value (in a table, this is simply the column. Easy).

I did some work on a mapping language called RML (http://rml.io) that transforms hierarchical formats to RDF, and it had the same challenges. In the end, we ended up with two required constructs:

  • an iterator: a path expression that selects an iterable list of JSON objects. This turns an unpredictable object into a series of predictable objects with a consistent shape.
  • a reference: a path expression that selects the value within one of the objects produced by the iterator.

Example:

{
  "a": "test",
  "b": [{ "c": [{ "d": 1 }, { "d": 2 }, { "d": 3 }] }]
}

Let’s say we want to define an expectation over { "d": 1 }, { "d": 2 }, { "d": 3 }. Then we would need (in JSONPath):

iterator: $.b[0].c -> returns the array { "d": 1 }, { "d": 2 }, { "d": 3 }
reference: $.d (evaluated against each object from the iterator) -> returns the value of "d", which is 1, 2, 3
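The iterator and reference constructs can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: paths are written as plain lists of keys and indices rather than parsed JSONPath strings, and `resolve` and `iterate` are hypothetical helpers, not part of Great Expectations or RML.

```python
import json
from typing import Any, Iterator, List, Union

# A path is a list of dict keys (str) and list indices (int).
Path = List[Union[str, int]]


def resolve(node: Any, path: Path) -> Any:
    """Walk the path step by step and return the node it points to."""
    for step in path:
        node = node[step]
    return node


def iterate(document: Any, iterator_path: Path) -> Iterator[Any]:
    """Yield each object in the list selected by the iterator path."""
    yield from resolve(document, iterator_path)


document = json.loads("""
{
  "a": "test",
  "b": [{ "c": [{ "d": 1 }, { "d": 2 }, { "d": 3 }] }]
}
""")

iterator_path: Path = ["b", 0, "c"]  # plays the role of the iterator
reference_path: Path = ["d"]         # plays the role of the reference

# The iterator yields objects of consistent shape; the reference picks a value
# from each, which gives something comparable to a column of values.
values = [resolve(item, reference_path) for item in iterate(document, iterator_path)]

# A column-style check, e.g. "all values are between 1 and 3".
all_in_range = all(1 <= v <= 3 for v in values)
```

Once the values are extracted this way, they behave like a single column, so a column-level expectation can be evaluated over them.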

I think this makes complete sense and probably covers 90% of the use cases. On the other hand, I am currently dealing with a pipeline that pulls raw data from a GCS bucket and loads it into a BigQuery table.

In this case, supporting nested JSON structures would allow me to set expectations on raw data before it even touches the table. Sadly, I think the only workaround for this would probably be:

  1. Grab data from the bucket and load it in a temp table
  2. Run a given set of expectations
  3. Perform actions based on the results (reject data or write it to prod table)

Sadly, this is not viable, as the overhead of creating temp tables (the data loaded is massive) would be unbearable. I am left with the only option of running expectations after the data is ingested and performing some cleanup actions after the fact.

Any suggestions on better approaches are highly welcome.

@anddt I can’t readily think of a better approach than what you are describing. At least, not until we support nested structures.

Is there any update on JSON support in GE? How does the JsonSchemaProfiler() play into innovations in this realm?