GX with Databricks and Azure Blob Storage

Hello All,

I have been working with the GX package for a few days now, and I have a couple of questions:

  1. Is it possible to initiate a filesystem data context directly in a container in Azure Blob Storage instead of using the DBFS root in Databricks? If this is possible, can someone guide me on how to implement it?
  2. If the above is not possible: when I want to host and share Data Docs in Azure Blob Storage, MUST I configure the other artifacts (expectation store, checkpoint store, etc.) to be in Azure Blob Storage too, OR can I leave them in the DBFS root and ONLY configure the hosting and sharing of the docs in Blob Storage?

Hi @Chijioke ,

have you seen my post here:

This should answer your 1st question I guess. :slight_smile:

If you have more detailed questions, please ask.

Hello @hdamczy , Thank you for your response.

Correct me if I am misreading the code:
Your context_root_dir is still pointing to the DBFS root, and the Data Docs are the only part that is actually stored and hosted on Azure Storage, right?

I’m not sure I can answer your question 100% precisely, but afaik, because I create an EphemeralDataContext, no GX YAML file is stored anywhere at the beginning.
context_root_dir is a mount point to an ADLS container and looks like this: /dbfs/mnt/myproject/DataQuality/GX/
Afaik you can’t have a direct link to your ADLS (maybe it’s possible these days?), so we solved this problem using mount points. All data is stored in ADLS containers: the expectation suite .json files and the checkpoint .yml files, as well as all Data Docs files in the $web container.
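
For reference, creating such a mount point in a Databricks notebook looks roughly like this (a sketch only: the account, container and secret names are placeholders, for ADLS Gen2 you would use an abfss:// source with OAuth configs instead, and on Unity Catalog workspaces Volumes replace mounts):

# Sketch only -- placeholder names; dbutils is predefined in Databricks notebooks.
storage_account = "<account_name>"
container = "myproject"

dbutils.fs.mount(
    source=f"wasbs://{container}@{storage_account}.blob.core.windows.net",
    mount_point="/mnt/myproject",
    extra_configs={
        f"fs.azure.account.key.{storage_account}.blob.core.windows.net":
            dbutils.secrets.get(scope="<secret_scope>", key="<secret_key>"),
    },
)

# The mounted container is then reachable via the local FUSE path,
# which is what context_root_dir points to:
context_root_dir = "/dbfs/mnt/myproject/DataQuality/GX/"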

Hello @hdamczy

I don’t think we can have a direct link to ADLS; I used a mounted volume as well. I noticed something after I configured the data context.

  1. The config_variables.yml and great_expectations.yml files are both missing in the ADLS container where the other GX contents are visible. Is this a problem?

  2. I configured the context to also keep the contents of the Data Docs inside the $web container, which it actually does, but when I copy the primary endpoint URL and try to open it in a browser, the page does not load. It displays an authentication error message.

Can you check whether I am missing something?
Below is the config:

root_dir="/Volumes/test/dq/"

data_context_config = gx.data_context.types.base.DataContextConfig(
        ## Local storage backend
        store_backend_defaults=gx.data_context.types.base.FilesystemStoreBackendDefaults(
            root_directory=root_dir
        ),

        ## Data docs site storage
        data_docs_sites={
            "az_site": {
            "class_name": "SiteBuilder",
            "store_backend": {
                "class_name": "TupleAzureBlobStoreBackend",
                "container":  r"\$web",
                "connection_string":  "DefaultEndpointsProtocol=https;AccountName=<account_name>;AccountKey=<account_key>;EndpointSuffix=core.windows.net",
            },
            "site_index_builder": {
                "class_name": "DefaultSiteIndexBuilder",
                "show_cta_footer": True,
            },
            }
        },
     )

context = gx.get_context(project_config=data_context_config)
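
A quick way to sanity-check the site once the context exists (method names from the public DataContext API, which may differ slightly between GX versions):

# Optional: rebuild the Data Docs in the $web container and print the
# site URLs the context reports for the configured sites.
context.build_data_docs()
for site in context.get_docs_sites_urls():
    print(site["site_name"], site["site_url"])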

Hello @Chijioke ,

I do not have config_variables.yml and great_expectations.yml either.
Everything GX needs at runtime is created by the function get_gx_context() in the class GX_Context. You can put the code from the class directly into your Databricks notebook for testing purposes.

I just had my mount to DataQuality/GX/ on ADLS, where 4 directories were created (checkpoints/, expectations/, profilers/ and uncommitted/).

After profiling my dataframes (see the code below “To get the expectation-suite JSON files and the checkpoint YML files you can do a profiling…”), I had a .yml file in the checkpoints/ dir for each dataframe I profiled, as well as a .json file in the expectations/ dir.

Both files are loaded at runtime.
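
In case it helps, a rough sketch of how those files get created through the context API (assuming GX 0.16+; batch_request is a placeholder for whatever batch you build from your dataframe):

# Rough sketch (GX 0.16+ context API; names are placeholders).
# Creating a suite writes a .json file into expectations/, and creating a
# checkpoint writes a .yml file into checkpoints/ under root_directory.
suite = context.add_or_update_expectation_suite(
    expectation_suite_name="my_table_suite"
)

checkpoint = context.add_or_update_checkpoint(
    name="my_table_checkpoint",
    validations=[
        {
            "batch_request": batch_request,  # built from your dataframe / asset
            "expectation_suite_name": "my_table_suite",
        }
    ],
)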
Maybe this diagram helps a bit:

On your storage account you have to activate the static website like this:

Then you can access the website via the “Primary endpoint” URL (see pic (4)).
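
If you prefer doing that step in code rather than in the portal, something along these lines should work with the azure-storage-blob SDK (a sketch, untested here; the connection string is a placeholder):

# Sketch: enable the static website feature (the $web container) via the
# azure-storage-blob SDK instead of the portal UI.
from azure.storage.blob import BlobServiceClient, StaticWebsite

service = BlobServiceClient.from_connection_string("<connection_string>")
service.set_service_properties(
    static_website=StaticWebsite(enabled=True, index_document="index.html")
)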

Your “root_dir” seems to be a Volume (is Unity Catalog enabled?).
I do not have Unity enabled, and my “root_dir” (= context_root_dir) is a mount point to ADLS.

Can you send what’s inside the context variable at the end?
print(context)

Hello @hdamczy,
Apologies for my late response. Thank you so much for your well-detailed last response.
Below is the image of the context.

I found out that the issue with the endpoint URL was related to a firewall in our network and not to the way it was set up.

I actually have a use case where I need to create multiple directories in the $web container. I think you mentioned an issue where you created a subdirectory inside the $web container and were unable to view the index.html file properly. Were you later able to resolve this issue?

Hello @Chijioke ,

regarding your “late” response - no pressure :slight_smile:

The content of your context object looks pretty much the same as mine, except my
“root_directory”: “/dbfs/mnt/sdl/ … DataQuality/GX/”
is a mount point, whereas your root_directory leads to a Volume because you have a Unity-Catalog-enabled Databricks workspace (and you can’t use mount points any more, afaik!?).

Regarding your context object, a follow-up question:
Is there a “checkpoints” directory in your “/Volumes/…” folder?
Have you tried profiling any dataframe, which would create a .yml file in your checkpoints subdir and a .json file in your expectations subdir, by executing the last (3rd) code snippet in the post GX-Databricks:Datasource-Data asset - Validator - #4 by hdamczy?

Regarding the multiple directories inside the $web container: I’m afraid the issue I created for this, which can be found in the post Data Docs in Azure ADLS $web subdirectory not working - #10 by hdamczy, has still not been fixed.

Greetings, Holger


Hello @hdamczy
Thank you :slightly_smiling_face:
Yes, there is a checkpoints directory and an expectations directory created in the container on Azure Blob Storage. I created my expectations for each data asset interactively with Python, all in a Databricks notebook. I have not tried profiling yet…
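
In case it is useful to anyone else, the interactive flow looks roughly like this with the fluent Spark dataframe API (a sketch assuming GX 0.17+; signatures differ slightly between versions, e.g. older releases take the dataframe in add_dataframe_asset instead of build_batch_request, and all names are placeholders):

# Sketch of building a suite interactively against a Spark dataframe (df).
datasource = context.sources.add_spark(name="spark_src")
asset = datasource.add_dataframe_asset(name="my_asset")
batch_request = asset.build_batch_request(dataframe=df)

context.add_or_update_expectation_suite(expectation_suite_name="my_asset_suite")
validator = context.get_validator(
    batch_request=batch_request,
    expectation_suite_name="my_asset_suite",
)
validator.expect_column_values_to_not_be_null("id")
validator.save_expectation_suite(discard_failed_expectations=False)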

Regarding the multiple directories inside the $web container, I will take a look at it some more, and if I find anything tangible, I will let you know.