No support for Spark DF in Result Format COMPLETE Mode

It seems that when the DataFrame is a Spark DataFrame (i.e. NOT a Pandas DF), result format COMPLETE mode returns neither:

a) unexpected_index_query
b) unexpected_list (only a partial list is returned)

Therefore it can be inferred that COMPLETE mode has NOT been implemented for the case where the DF is a Spark DF. Is that correct?

And therefore it is not possible to perform Automated Data Cleansing of a Spark DF with GX? Only Automated Data Quality Reporting on a Spark DF is possible with GX?

Adding some new info about the issue: it seems that the bug reported above manifests ONLY when the result format is specified from the Validator class, e.g.:

validation_result = my_validator.expect_column_values_to_be_in_set(
    column="my_var",
    value_set=["A", "B"],
    result_format={
        "result_format": "COMPLETE",
        "unexpected_index_column_names": ["pk_column"],
        "return_unexpected_index_query": True,
    },
)

However, if the result format is specified in the Checkpoint class (and therefore applies to ALL Validators / validation rules referenced by the Checkpoint), COMPLETE mode works as expected and produces a Spark DF query which can be used to filter out bad records, i.e. records which have failed the validation rules.
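A minimal sketch of the Checkpoint-level configuration I mean, assuming the 0.18-style fluent API; the checkpoint name, suite name, and batch_request are placeholders:

import great_expectations as gx

context = gx.get_context()

# Hypothetical names; batch_request is assumed to be defined elsewhere
checkpoint = context.add_or_update_checkpoint(
    name="my_checkpoint",
    validations=[
        {
            "batch_request": batch_request,
            "expectation_suite_name": "my_suite",
        }
    ],
    # This result format applies to ALL validations run by this Checkpoint
    runtime_configuration={
        "result_format": {
            "result_format": "COMPLETE",
            "unexpected_index_column_names": ["pk_column"],
            "return_unexpected_index_query": True,
        }
    },
)
checkpoint_result = checkpoint.run()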

My investigation and explanation: when the Validator object is persisted/saved, the result format does not appear in the JSON file for some reason, which may be a bug in GX. Hence the Checkpoint referencing the Expectation (validation rule) subsequently retrieves the Expectation WITHOUT the COMPLETE result format.

I’ve found your post while following up on a similar issue (bug?) that I’m experiencing with GE validation results.
My backend storage is SQLite (similar config to PostgreSQL). I’m running a checkpoint with the result format set to ‘COMPLETE’ and the following runtime config options:

runtime_configuration={
    "result_format": {
        "result_format": "COMPLETE",
        "unexpected_index_column_names": ["EQUNR"],
        "unexpected_list": True,
        "unexpected_metrics_with_values": True,
        "unexpected_index_list": True,
    }
}
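For reference, a sketch of how such a runtime configuration would be passed when running an existing checkpoint; the context and the checkpoint name "my_checkpoint" are placeholders, assuming the 0.18-style API:

checkpoint_result = context.run_checkpoint(
    checkpoint_name="my_checkpoint",
    runtime_configuration={
        "result_format": {
            "result_format": "COMPLETE",
            "unexpected_index_column_names": ["EQUNR"],
        }
    },
)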

The documentation specifies the following for COMPLETE results.
I’ve tried multiple options. ‘EQUNR’ is a PK in my SQLite table, but not the column I’m performing the validation against. I’ve also tried creating a ROWID (equivalent to a Pandas DF index) on my table and using it in the config above, but I’m always getting the same three issues:

  1. The ‘unexpected_list’ yields all the failed rows, but without my PK or the index, which is not useful at all for tracking failed rows. My goal was to pick up the query and create a trigger in my SQLite that will load these failed rows into a separate table (a ‘failed records’ table); see the sketch after this list.
  2. The ‘unexpected_index_list’ only yields the ‘partial_unexpected_index_list’ results, showing both my PK and the failed expectation column.
  3. The ‘unexpected_index_query’ is not properly formatted (at least for use in SQLite). It comes with backslashes (\) and other formatting characters that make it unusable, whether via the console or when copying it as a SQL string to be grabbed by a function and used by the dbengine.
    Bonus. Is there a way (or an easy way) to deJSONize the ‘value’ column and put it in different columns based on user config, i.e. with my config above I’d get (at least) 3 columns: unexpected_index_column_names, unexpected_list and unexpected_index_list.
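A minimal sketch of the failed-records idea, assuming the unexpected_index_query can be extracted as a usable SELECT statement; the result-navigation keys, the already-cleaned query, and the table name are all assumptions:

import sqlite3

# Assumption: checkpoint_result comes from a prior checkpoint run and the
# query string has already been cleaned of stray escape characters
suite_result = checkpoint_result.list_validation_results()[0]
query = suite_result["results"][0]["result"]["unexpected_index_query"]

conn = sqlite3.connect("path_to_sqliteDB")
# Materialize the failed rows into a dedicated 'failed records' table
conn.execute("CREATE TABLE IF NOT EXISTS failed_records AS " + query)
conn.commit()
conn.close()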

Would it be possible for someone from GE to clearly specify what is and isn’t supported for the ‘COMPLETE’ option, per datasource/store type combination? That way we at least don’t spend too much time on something that isn’t fully supported (yet).

"unexpected_list" : [A list of all values that violate the Expectation]       
 'unexpected_index_list': [A list of the indices of the unexpected values in the column, as defined by the columns in `unexpected_index_column_names`]        
'unexpected_index_query': [A query that can be used to retrieve all unexpected values (SQL and Spark), or the full list of unexpected indices (Pandas)]

@igalech
Your configuration looks fine.

The behavior you are seeing in 1 and 2 with ‘unexpected_list’ and ‘unexpected_index_list’ is consistent with the documentation. There’s an example output near the bottom of the documentation page you linked. Based on that, I’d assume it is not a bug.

Your 3rd point (the unusable query) does seem like either a bug, or SQLite being an unsupported backend for that behavior. As you mentioned, there isn’t clear documentation on which backends are supported.
Hopefully this will be improved in the future, especially after the GX 1.0 release. Just to confirm, even though this might not make any difference: did you create your data source using add_sql or add_sqlite?

As for the bonus, I didn’t quite understand what you meant by the ‘value’ column and couldn’t find one in the Checkpoint results. In any case, one option is to write a Python script that reshapes the Checkpoint results into the desired format, as sketched below.
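A minimal sketch of such a script, assuming a checkpoint_result from a prior run; the flattened column names mirror the result-format keys and are otherwise assumptions:

import pandas as pd

rows = []
for suite_result in checkpoint_result.list_validation_results():
    for expectation_result in suite_result["results"]:
        res = expectation_result["result"]
        rows.append(
            {
                # Assumption: dict-style access works on these result objects
                "expectation_type": expectation_result["expectation_config"]["expectation_type"],
                "unexpected_list": res.get("unexpected_list"),
                "unexpected_index_list": res.get("unexpected_index_list"),
            }
        )

# One row per expectation, one column per result-format key
flat = pd.DataFrame(rows)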

@evo
Yes, this is known behavior, though perhaps documented unclearly: the result format persists only when it is specified in the Checkpoint.
Directly from the documentation:

GX recommends that you use an Expectation-level configuration for exploratory analysis, and add the final configuration at the Checkpoint-level.

As for your other point: yes, GX overall is geared towards automated data quality reporting; this is not limited to Spark.
The results from GX’s data quality reporting / data validation can be used for data cleansing, but users have to implement that functionality themselves.
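For example, a minimal sketch of user-implemented cleansing on a Spark DF, assuming the failed primary keys were collected from unexpected_index_list; df, pk_column, and unexpected_index_list are placeholders:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Assumption: unexpected_index_list was pulled from a validation result and
# contains dicts keyed by the configured unexpected_index_column_names
failed_pks = [row["pk_column"] for row in unexpected_index_list]

bad_keys = spark.createDataFrame([(pk,) for pk in failed_pks], ["pk_column"])

# Keep only the rows whose primary key did not fail validation
clean_df = df.join(bad_keys, on="pk_column", how="left_anti")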

Thanks for your response, Toivo.

I added my data source using context.sources.add_sqlite(name='my_sqlite_ds', connection_string='path_to_sqliteDB')

With respect to my 2nd point, I think this is either missing info in the documentation or a bug (for the SQLite store type at least), or maybe it is simply not supported (see below).

If I follow the documentation (and the example) for the COMPLETE result format with the options I mentioned, I should obtain, for ‘unexpected_index_list’, all the values (rows) that have failed a given expectation, along with my PK.

'unexpected_index_list': [A list of the indices of the unexpected values in the column, as defined by the columns in `unexpected_index_column_names`]

In my case, with SQLite as the store backend, this is not what happens: ‘unexpected_index_list’ equals ‘partial_unexpected_index_list’, with only 20 records (rows). Nowhere in the documentation (at least to my knowledge and searches) does it say that ‘unexpected_index_list’ would yield only a partial list of failed records.

If I use a Pandas DF and the filesystem store type for the same data, however, I do obtain all the failed rows.

Based on your response, I have to assume that only certain supported stores yield (as of 0.18.8) the entire ‘unexpected_index_list’ for ‘COMPLETE’ results? For my project it would be relatively easy to move from SQLite to PostgreSQL if that means getting all the errors listed out.

I look forward to 1.0. Traceability and error/correction tagging in data quality projects is of the utmost importance when remediation and DQM are in scope, IMO.

Thanks.

Inaki.