I have been working with the GX package for a few days now, and I have a couple of questions:
Is it possible to do the following?
Initialize a Filesystem Data Context directly in a container in Azure Blob Storage instead of using the DBFS root in Databricks. If this is possible, can someone guide me on how to implement it?
If the above is not possible, and we want to host and share Data Docs in Azure Blob Storage, MUST I configure the other artifacts (expectations store, checkpoints store, etc.) to live in Azure Blob Storage too, OR can I leave them in the DBFS root and ONLY configure the hosting and sharing of the docs in Blob Storage?
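To illustrate what I am picturing: all the stores stay on the DBFS root and only the Data Docs site points at the $web container. This is just a sketch based on my reading of the docs; the root directory, container and connection string are placeholders, and I am not sure this is the right way to wire it up:

```python
import great_expectations as gx
from great_expectations.data_context.types.base import (
    DataContextConfig,
    FilesystemStoreBackendDefaults,
)

# Expectations, checkpoints, etc. stay on the DBFS root (placeholder path)
project_config = DataContextConfig(
    store_backend_defaults=FilesystemStoreBackendDefaults(
        root_directory="/dbfs/great_expectations/"
    ),
    # Only the Data Docs site is written to the $web container of the storage account
    data_docs_sites={
        "az_site": {
            "class_name": "SiteBuilder",
            "store_backend": {
                "class_name": "TupleAzureBlobStoreBackend",
                # the backslash keeps GX from treating $web as a config variable
                "container": "\\$web",
                "connection_string": "<AZURE_STORAGE_CONNECTION_STRING>",  # placeholder
            },
            "site_index_builder": {"class_name": "DefaultSiteIndexBuilder"},
        }
    },
)

context = gx.get_context(project_config=project_config)
context.build_data_docs()  # pushes the rendered docs to $web
```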
Correct me if I am misreading the code.
Your context_root_dir is still pointing to the DBFS root, and the Data Docs are the only part that is actually stored and hosted on Azure Storage, right?
I’m not sure I can answer your question 100% precisely, but AFAIK, because I create an EphemeralDataContext, no GX .yml file is stored anywhere at the beginning.
context_root_dir is a mount point to an ADLS container, which looks like this: /dbfs/mnt/myproject/DataQuality/GX/
AFAIK you can’t have a direct link to your ADLS (maybe it’s possible these days?), so we solved this problem using mount points. All data is stored in ADLS containers: the expectation suite .json files and the checkpoint .yml files, as well as all Data Docs files in the $web container.
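In case it helps, the mount itself is the standard Databricks call. A minimal sketch with placeholder storage account, container and secret names (for ADLS Gen2 you would typically use abfss:// with OAuth configs instead of an account key):

```python
# Minimal sketch: mount a storage container so GX can use it as a local path.
# dbutils is only available inside a Databricks notebook/cluster.
dbutils.fs.mount(
    source="wasbs://dataquality@mystorageaccount.blob.core.windows.net",
    mount_point="/mnt/myproject/DataQuality",
    extra_configs={
        "fs.azure.account.key.mystorageaccount.blob.core.windows.net":
            dbutils.secrets.get(scope="my-scope", key="storage-account-key")
    },
)
# Afterwards the container is reachable as /dbfs/mnt/myproject/DataQuality/
```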
I don’t think we can have a direct link to ADLS either; I used a mounted volume as well. I noticed something after I configured the data context.
The config_variables.yml and great_expectations.yml files are both missing from the ADLS container where the other GX contents are visible. Is this a problem?
I also configured the context to keep the Data Docs contents inside the $web container, which it actually does, but when I copy the primary endpoint URL
and open it in a browser, the page does not load. It displays an authentication error message.
Can you check whether I am missing something?
Below is the config:
I do not have config_variables.yml and great_expectations.yml either.
Everything GX needs at runtime is created by the function get_gx_context() in the class GX_Context. You can put the code from the class directly into your Databricks notebook for testing purposes.
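Roughly, the helper boils down to something like this (heavily simplified sketch; the exact store configuration in my code differs, and the Data Docs site on $web is configured in addition, so adjust the mount path and store details to your setup):

```python
import great_expectations as gx
from great_expectations.data_context.types.base import (
    DataContextConfig,
    FilesystemStoreBackendDefaults,
)


class GX_Context:
    """Builds the in-memory context GX needs at runtime."""

    def __init__(self, root_directory: str = "/dbfs/mnt/myproject/DataQuality/GX/"):
        self.root_directory = root_directory

    def get_gx_context(self):
        # All stores (expectations, checkpoints, validations, data docs) live
        # under the mounted ADLS directory. Because the config is built in
        # memory, no great_expectations.yml or config_variables.yml is written.
        project_config = DataContextConfig(
            store_backend_defaults=FilesystemStoreBackendDefaults(
                root_directory=self.root_directory
            )
        )
        return gx.get_context(project_config=project_config)


# In a notebook cell:
# context = GX_Context().get_gx_context()
# print(context)
```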
I just had my mount to DataQuality/GX/ on ADLS, where 4 directories were created (checkpoints/, expectations/, profilers/ and uncommitted/).
After profiling my dataframes (see the code below “To get the expectation-suite JSON files and the checkpoint YML files you can do a profiling…”), I had a .yml file in the checkpoints/ dir for each dataframe I profiled, as well as a .json file in the expectations/ dir.
Both files are loaded during runtime.
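Roughly, that profiling snippet has the following shape (simplified here; the datasource, asset, suite and checkpoint names are examples, and it assumes the GX 0.16/0.17 fluent Spark API with `context` and a dataframe `df` already defined in earlier cells):

```python
from great_expectations.profile.user_configurable_profiler import UserConfigurableProfiler

# Register the Spark dataframe as a data asset (names are examples)
datasource = context.sources.add_or_update_spark(name="spark_src")
asset = datasource.add_dataframe_asset(name="my_table", dataframe=df)
batch_request = asset.build_batch_request()

# Profile the batch into an expectation suite -> expectations/my_table_suite.json
context.add_or_update_expectation_suite("my_table_suite")
validator = context.get_validator(
    batch_request=batch_request, expectation_suite_name="my_table_suite"
)
UserConfigurableProfiler(profile_dataset=validator).build_suite()
validator.save_expectation_suite(discard_failed_expectations=False)

# Persist a checkpoint for the asset -> checkpoints/my_table_checkpoint.yml
checkpoint = context.add_or_update_checkpoint(
    name="my_table_checkpoint",
    validations=[
        {"batch_request": batch_request, "expectation_suite_name": "my_table_suite"}
    ],
)
checkpoint.run()
context.build_data_docs()  # regenerates the Data Docs site(s)
```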
Maybe this diagram helps a bit:
Then you can access the website via the “Primary endpoint” URL (see pic (4)).
Your “root_dir” seems to be a Volume (is Unity Catalog enabled?).
I do not have Unity Catalog enabled, and my “root_dir” (= context_root_dir) is a mount point to ADLS.
Can you send what’s inside the context variable at the end?
print(context)
I found out that the issue with the endpoint URL was related to a firewall in our network and not to the way it was set up.
I actually have a use case where I need to create multiple directories in the $web container, and I think you mentioned an issue where you created a subdirectory inside the $web container and were unable to view the index.html file properly. Were you later able to resolve this issue?
The content of your context object looks pretty much the same as mine, except that my
“root_directory”: “/dbfs/mnt/sdl/ … DataQuality/GX/”
is a mount point, whereas your root_directory leads to a Volume because you have a UC-enabled Databricks workspace (and you can’t use mount points any more, AFAIK!?).
Regarding your context object, I have the following questions:
Is there a “checkpoints” directory in your “/Volumes/…” folder?
Have you tried profiling any dataframe (which would create a .yml file in your checkpoints subdir and a .json file in your expectations subdir) by executing the last (3rd) code snippet in the post GX-Databricks:Datasource-Data asset - Validator - #4 by hdamczy?
Hello @hdamczy
Thank you
Yes, there are checkpoints and expectations directories created in the container on Azure Blob Storage. I created my expectations for each data asset interactively with Python, all in a Databricks notebook. I have not tried profiling yet…
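For context, the interactive approach I mean looks roughly like this (the column names and the suite name are placeholders, and `batch_request` is built for the asset beforehand):

```python
# Build a validator for the data asset and add expectations interactively
context.add_or_update_expectation_suite("orders_suite")
validator = context.get_validator(
    batch_request=batch_request, expectation_suite_name="orders_suite"
)
validator.expect_column_values_to_not_be_null("order_id")
validator.expect_column_values_to_be_between("amount", min_value=0)
# Persist the suite as a .json file in the expectations/ store
validator.save_expectation_suite(discard_failed_expectations=False)
```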
Regarding the multiple directories inside the $web container, I will look into it some more, and if I find anything tangible, I will let you know.