Integration with DataHub does not work

Hi all :slightly_smiling_face:

I am trying to integrate GE with DataHub, following the documentation (Great Expectations | DataHub), which involves adding an action to checkpoint.yml.

Has anyone been able to integrate them?

When I execute my script, checkpoint.yml is reverted and the action is deleted.

Hey @Ric_Denmark ! Thanks for reaching out.

Which backend & version of GX are you using? I believe the DataHub integration only supports SQLAlchemy backends, and may only support up to v0.15.50, possibly 0.16.16.

Additionally, the GX <> DataHub integration is maintained and supported by the folks over at DataHub – they may be able to provide more help for you over in their community.

Hi Austin. Thank you for writing back.

I use GE 0.15.50 and SQLAlchemy - no pandas here :slight_smile:

Since you wrote, I have started over using the gx-tutorials and my code is no longer reverted. I can run the GE checkpoint from the CLI, but the metadata is not ingested into DataHub.

The metadata is sent, but nothing shows up in the Validation tab.

I set the DATAHUB_DEBUG variable to true, and the output suggests that the Dataset URN might be the issue.
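For anyone else debugging this, the variable can be set before running the checkpoint (a sketch; the checkpoint name here is a placeholder):

```shell
# Enable debug output from the DataHub validation action
export DATAHUB_DEBUG=true
# then run the checkpoint from the GE CLI, e.g.:
# great_expectations checkpoint run my_checkpoint
```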

  - name: datahub_action
    action:
      module_name: datahub.integrations.great_expectations.action
      class_name: DataHubValidationAction
      server_url: https://meta.test-data.domain.it/api/gms
      token: 'string_of_token'
      env: TEST
      platform_instance_map: {"name_my_datasource": "Synapse.Data_Platform"}
      graceful_exceptions: true

When I run my checkpoint I get:

* Using v3 (Batch Request) API
* Calculating Metrics: 100%|███████████████████████████████████████████████████████████████████████████████████| 25/25 [00:06<00:00,  3.92it/s]
* Finding datasets being validated
* GE expectation_suite_name - name_of_my_suite, expectation_type - expect_table_columns_to_match_set, Assertion URN - urn:li:assertion:bebcf033345475640f99b73e34873ad1
* GE expectation_suite_name - name_of_my_suite, expectation_type - expect_column_values_to_not_be_null, Assertion URN - urn:li:assertion:da4173832d02233a39d60186a6deed59
* GE expectation_suite_name - name_of_my_suite, expectation_type - expect_column_values_to_not_be_null, Assertion URN - urn:li:assertion:a64d6234eb91f4e8cdfd250584032c24
* GE expectation_suite_name - name_of_my_suite, expectation_type - expect_column_values_to_be_between, Assertion URN - urn:li:assertion:158ae3d98a2e227e126a226746524418
* Sending metadata to datahub ...
* Dataset URN - urn:li:dataset:(urn:li:dataPlatform:mssql,Synapse.Data_Platform..name_container.name_dataset,TEST)
* Assertion URN - urn:li:assertion:bebcf033345475640f99b73e34873ad1
* Assertion URN - urn:li:assertion:da4173832d02233a39d60186a6deed59
* Assertion URN - urn:li:assertion:a64d6234eb91f4e8cdfd250584032c24
* Assertion URN - urn:li:assertion:158ae3d98a2e227e126a226746524418
* Metadata sent to datahub.
* Validation succeeded!

* Suite Name           Status     Expectations met
* - name_of_my_suite   ✔ Passed   4 of 4 (100.0 %)

There are 2 dots in the Dataset URN where there should be 1, judging by the URL of the dataset where I try to view the Validation tab.
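For illustration, here is a minimal Python sketch (not DataHub's actual URN-building code; `make_dataset_urn` is a hypothetical helper) showing how an empty segment in the dataset name, e.g. from a mismatched `platform_instance_map` entry, produces the double dot:

```python
def make_dataset_urn(platform, name_parts, env):
    # Segments are joined with "."; an empty segment yields ".."
    name = ".".join(name_parts)
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"

# With an accidental empty segment (e.g. a wrong instance/schema mapping):
bad = make_dataset_urn(
    "mssql",
    ["Synapse.Data_Platform", "", "name_container.name_dataset"],
    "TEST",
)
# With the segments filled in correctly:
good = make_dataset_urn(
    "mssql",
    ["Synapse.Data_Platform", "name_container.name_dataset"],
    "TEST",
)
```

Here `bad` reproduces the `Synapse.Data_Platform..name_container.name_dataset` URN from the log above, while `good` matches the URL DataHub actually serves the dataset under.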

Solution found.

After activating DATAHUB_DEBUG it was possible to see which Dataset URN was used to send metadata.

This made it possible to correct the arguments in:
platform_instance_map
platform_alias
env
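As a sketch, the corrected action then looks something like the following; the instance name, alias, and env values are placeholders here, since the right values depend on the Dataset URN DataHub expects for your dataset:

```yaml
  - name: datahub_action
    action:
      module_name: datahub.integrations.great_expectations.action
      class_name: DataHubValidationAction
      server_url: https://meta.test-data.domain.it/api/gms
      token: 'string_of_token'   # single quotes, no "Authorization: Bearer" prefix
      env: TEST                  # must match the env segment of the Dataset URN
      platform_instance_map: {"name_my_datasource": "Synapse.Data_Platform"}
      # platform_alias: mssql    # set if the platform in the URN differs
      graceful_exceptions: true
```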

By the way, the token should be wrapped in single quotes and given without the `Authorization: Bearer` prefix.

In case you still run into issues, make sure every yml file references the right GE version (0.15.50).
