Can I use Great Expectations to validate non-tabular data?

eugene.mandel · June 8, 2020, 5:25am

Can I use Great Expectations to validate non-tabular data? For example, graphs, nested fields, etc.?

eugene.mandel · June 8, 2020, 5:30am

Currently, Great Expectations can validate only tabular data: Pandas dataframes, Spark dataframes, database tables, and query results.

There is no theoretical reason why Great Expectations’s approach cannot be extended to non-tabular data, but it has not been done yet.

As a workaround, some users who want to use Great Expectations to validate their non-tabular data transform it into a tabular format.
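
A minimal sketch of that workaround, assuming the nesting consists only of dictionaries (lists would need an additional rule, such as one row per element). The `flatten` helper and the sample records are hypothetical and not part of Great Expectations:

```python
def flatten(record, parent_key="", sep="."):
    """Flatten nested dicts into a single-level dict with dotted keys."""
    flat = {}
    for key, value in record.items():
        name = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested objects, extending the dotted prefix.
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value
    return flat


# Hypothetical nested records, e.g. parsed from JSON.
records = [
    {"id": 1, "user": {"name": "ann", "address": {"city": "Oslo"}}},
    {"id": 2, "user": {"name": "bob", "address": {"city": "Gent"}}},
]

# Each flattened record is one row; the dotted keys act as column names.
rows = [flatten(r) for r in records]
columns = sorted(rows[0])
```

The resulting `rows` can be passed to `pandas.DataFrame(rows)` and validated like any other Pandas dataframe; `pandas.json_normalize` performs a similar flattening if you prefer a library call.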

mielvds · August 27, 2020, 3:18pm

Very interested here! I see great potential for great_expectations in the JSON and RDF world.
Let’s start with object/hierarchical/JSON/XML data. The nesting and fields are dynamic, which makes the shape of the data unpredictable. That makes the data hard to stream, so it is much harder to match the performance achievable with tabular data. We might want to think about which restrictions to introduce. Next, you would need a path language (XPath or JSONPath) to traverse the object, because defining an expectation requires a reference to the value. In a table, this is simply the column, which is easy.

I did some work on RML (http://rml.io), a mapping language for transforming hierarchical formats to RDF, which faced the same challenges. In the end, we ended up with two required constructs:

an iterator: a path expression that selects an iterable list of JSON objects. This turns an unpredictable object into a series of predictable objects with a consistent shape.
a reference: a path expression that selects the value within one of the objects produced by the iterator.

Example:

{
  "a": "test",
  "b": [{ "c": [{ "d": 1 }, { "d": 2 }, { "d": 3 }] }]
}

Let’s say we want to define an expectation over { "d": 1 }, { "d": 2 }, { "d": 3 }. Then we would need (in JSONPath):
iterator: $.b[0].c[*] -> returns the objects { "d": 1 }, { "d": 2 }, { "d": 3 }
reference: $.d -> returns the value of "d" in each object: 1, 2, 3
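
To make the iterator/reference idea concrete, here is a small Python sketch over the example document. It is not RML or a real JSONPath engine: paths are written as plain lists of keys and indices, and `resolve` and `iterate` are hypothetical helper names chosen for illustration.

```python
import json

doc = json.loads("""
{
  "a": "test",
  "b": [{ "c": [{ "d": 1 }, { "d": 2 }, { "d": 3 }] }]
}
""")


def resolve(obj, path):
    """Walk obj along a list of dict keys and list indices."""
    for step in path:
        obj = obj[step]
    return obj


def iterate(document, iterator_path, reference_path):
    """Yield the referenced value from each object selected by the iterator."""
    for item in resolve(document, iterator_path):
        yield resolve(item, reference_path)


# Iterator selects the list under b[0].c; reference picks "d" from each object.
values = list(iterate(doc, ["b", 0, "c"], ["d"]))

# With the values extracted, a column-style check becomes straightforward.
all_positive = all(isinstance(v, int) and v > 0 for v in values)
```

Once the iterator has produced objects of a consistent shape, the referenced values behave much like a single column, which is what a tabular expectation operates on.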

anddt · April 22, 2021, 7:15pm

I think this makes complete sense and probably covers 90% of the use cases. On the other hand, I am currently dealing with a pipeline that pulls raw data from a GCS bucket and loads it into a BigQuery table.

In this case, supporting nested JSON structures would allow me to set expectations on raw data before it even touches the table. Sadly, I think the only workaround for this would probably be:

Grab data from the bucket and load it in a temp table
Run a given set of expectations
Perform actions based on the results (reject data or write it to prod table)
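
The three steps above can be sketched as a single gate function. This is only an outline under assumptions: `gated_load` and its `stage`, `validate`, `promote`, and `reject` callables are hypothetical placeholders for the real temp-table load, expectation run, and follow-up actions, and the usage below substitutes in-memory lists for tables.

```python
from typing import Callable, Iterable


def gated_load(
    records: Iterable[dict],
    stage: Callable[[list], object],
    validate: Callable[[object], bool],
    promote: Callable[[object], None],
    reject: Callable[[object], None],
) -> bool:
    """Stage the data, validate it, then promote or reject it."""
    staged = stage(list(records))        # step 1: load into a temporary location
    ok = validate(staged)                # step 2: run the set of expectations
    (promote if ok else reject)(staged)  # step 3: act on the result
    return ok


# In-memory stand-ins for the temp table and the prod table.
prod, rejected = [], []
ok = gated_load(
    [{"d": 1}, {"d": 2}],
    stage=lambda rows: rows,
    validate=lambda rows: all("d" in r for r in rows),
    promote=prod.extend,
    reject=rejected.extend,
)
```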

Sadly, this is not viable: the data being loaded is massive, so the overhead of creating temp tables would be unbearable. I am left with the only option of running expectations after the data is ingested and performing cleanup actions after the fact.

Any suggestions on better approaches are highly welcome.

eugene.mandel · April 29, 2021, 11:50pm

@anddt I can’t readily think of a better approach than what you are describing. At least, not until we support nested structures.

pbjolsby · June 15, 2021, 9:37pm

Is there any update on JSON support in GE? How does the JsonSchemaProfiler() play into innovations in this realm?
