Using ADLS instead of DBFS in Azure Databricks for all GX artefacts, especially data docs

Hello everyone,

I am trying to use GX in an Azure Databricks environment to validate (new) data sources and generate profiles in DEV, and to execute checkpoints in PROD.
We are using Azure DevOps for our code and ADLS to store all our data and files.

I worked through the tutorial described in Get started with Great Expectations and Databricks | Great Expectations. However, at the end of this tutorial the Data Docs are written to DBFS (Databricks File System), and I have to copy them via the Databricks CLI to ADLS (Azure Data Lake Storage).

My question is: is there a way to store the Data Docs (and also all the other GX files) directly in an ADLS folder (which is accessible from Databricks via a mount point) instead of a DBFS folder?

For example, by overriding the base_directory of the data_docs_sites entry in great_expectations.yml? Are there methods to modify this base_directory that I could call from a Databricks notebook?

Thanks in advance for your help.
Holger

Hi @hdamczy, thanks for your question (and first Discourse post)!

If your ADLS folder is mounted in Databricks, then I think you should be able to just modify your Data Context setup to write your GX project files to your ADLS folder instead of DBFS.

If you modify your Data Context root directory to point to your ADLS folder instead of the /dbfs/great_expectations/ path used in the tutorial (I’ve added an example below), does this resolve the issue?

import great_expectations as gx

# Point the Data Context at your (mounted) ADLS folder instead of DBFS
context_root_dir = "/path/to/adls/folder"
context = gx.get_context(context_root_dir=context_root_dir)
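
For a mounted ADLS container, the folder is usually visible to local file APIs on the driver under /dbfs/mnt/..., so the path might look something like this (my_adls_mount is a hypothetical mount point name - substitute your own):

# Hypothetical mount point name - replace with your actual mount
context_root_dir = "/dbfs/mnt/my_adls_mount/great_expectations"
context = gx.get_context(context_root_dir=context_root_dir)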

Hello @rachel.house,

thank you for your response. I’m happy that you’re trying to help me. :slight_smile:
Your suggestion is what I tried first, but when I pass the context_root_dir value to the get_context method and then print the context, no context_root_dir shows up in the output.

try:

  context_root_dir = f"{project.MNT_PATH}/tests/gx"
  context = gx.get_context(context_root_dir=context_root_dir)

  print("context_root_dir: "+str(context_root_dir))
  print("context: "+str(context))

  dataframe_datasource = context.sources.add_or_update_spark(
      name="my_spark_in_memory_datasource",
  )
except Exception as exception:
  handle_exception(exception, dbutils.notebook.entry_point.getDbutils().notebook().getContext())
  raise exception

This results in the following console printout:

context_root_dir: /mnt/sdl/control-tower-ccu/tests/gx
context: {
  "anonymous_usage_statistics": {
    "explicit_url": false,
    "explicit_id": true,
    "enabled": true,
    "data_context_id": "9189d9aa-ed79-481f-8817-3cd5b9c3b15f",
    "usage_statistics_url": "https://stats.greatexpectations.io/great_expectations/v1/usage_statistics"
  },
  "checkpoint_store_name": "checkpoint_store",
  "config_version": 3,
  "data_docs_sites": {
    "local_site": {
      "class_name": "SiteBuilder",
      "show_how_to_buttons": true,
      "store_backend": {
        "class_name": "TupleFilesystemStoreBackend",
        "base_directory": "/tmp/tmpra0l_8jx"
      },
      "site_index_builder": {
        "class_name": "DefaultSiteIndexBuilder"
      }
    }
  },
  "datasources": {},
  "evaluation_parameter_store_name": "evaluation_parameter_store",
  "expectations_store_name": "expectations_store",
  "fluent_datasources": {},
  "include_rendered_content": {
    "expectation_suite": false,
    "expectation_validation_result": false,
    "globally": false
  },
  "profiler_store_name": "profiler_store",
  "stores": {
    "expectations_store": {
      "class_name": "ExpectationsStore",
      "store_backend": {
        "class_name": "InMemoryStoreBackend"
      }
    },
    "validations_store": {
      "class_name": "ValidationsStore",
      "store_backend": {
        "class_name": "InMemoryStoreBackend"
      }
    },
    "evaluation_parameter_store": {
      "class_name": "EvaluationParameterStore"
    },
    "checkpoint_store": {
      "class_name": "CheckpointStore",
      "store_backend": {
        "class_name": "InMemoryStoreBackend"
      }
    },
    "profiler_store": {
      "class_name": "ProfilerStore",
      "store_backend": {
        "class_name": "InMemoryStoreBackend"
      }
    }
  },
  "validations_store_name": "validations_store"
}

I assume this is why my directory /mnt/sdl/control-tower-ccu/tests/gx remains empty after I follow the whole tutorial Get started with Great Expectations and Databricks along the DataFrame path.

After I created a GX context and then ran

context = gx.get_context(context_root_dir=context_root_dir)
dataframe_datasource = context.sources.add_or_update_spark  ...
dataframe_asset = dataframe_datasource.add_dataframe_asset ...
batch_request = dataframe_asset.build_batch_request()
context.add_or_update_expectation_suite ...
validator = context.get_validator ...
# with the two expectations
validator.expect_column_values_to_not_be_null ...
validator.expect_column_values_to_be_between ...
validator.save_expectation_suite
checkpoint = Checkpoint ...
context.add_or_update_checkpoint(checkpoint=checkpoint)
checkpoint_result = checkpoint.run()

I cannot find any results anywhere. I can't even find the /tmp/tmpra0l_8jx directory.
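
For completeness, the checkpoint part of my notebook roughly follows the tutorial like this (a sketch - the suite and checkpoint names are the tutorial's placeholders, not my real ones):

from great_expectations.checkpoint import Checkpoint

checkpoint = Checkpoint(
    name="my_databricks_checkpoint",
    run_name_template="%Y%m%d-%H%M%S-my-run-name-template",
    data_context=context,
    batch_request=batch_request,
    expectation_suite_name="my_expectation_suite",
    action_list=[
        {"name": "store_validation_result", "action": {"class_name": "StoreValidationResultAction"}},
        {"name": "update_data_docs", "action": {"class_name": "UpdateDataDocsAction"}},
    ],
)
context.add_or_update_checkpoint(checkpoint=checkpoint)
checkpoint_result = checkpoint.run()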

Hi @hdamczy, apologies for the delay.

I dug into this some more and was able to get the tutorial writing to my directory of choice on Databricks - I think this is what you're looking for. There are two things I had to adjust to get it working.

  1. Set the context project root directory. The project_dir here can be a little tricky: if you're using a DBFS directory (one that you might otherwise access via dbutils.fs.ls("/tmp/discourse1427")), make sure you begin the path with /dbfs, as shown below. If the directory you're using is not on DBFS but is available at some other path on the Databricks machine you're running the code on, just use the full path name.
project_dir = "/dbfs/tmp/discourse1427/"

context = gx.get_context(project_root_dir=project_dir)
  2. The Data Docs config automatically creates the default local_site site with a temp-directory path - this is what you're seeing in the most recent output you included. Instead, create a new site config with your desired file path as the base_directory and remove local_site, like this:
context.add_data_docs_site(
    site_config={
        "class_name": "SiteBuilder",
        "store_backend": {
            "class_name": "TupleFilesystemStoreBackend",
            "base_directory": "/dbfs/tmp/discourse1427/data_docs",
        },
        "site_index_builder": {"class_name": "DefaultSiteIndexBuilder"},
    },
    site_name="my_new_data_docs_site",
)

context.delete_data_docs_site(site_name="local_site")

print(f"context_root_dir: {context_root_dir}")
print(f"context: {context}")

After I made these changes, I was able to run through the tutorial and verify that Data Docs output was written to the folder I specified in the site config base_directory.
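
If you want to double-check where the files land, one quick way (a sketch using the same /dbfs/tmp/discourse1427 example directory) is to rebuild the Data Docs explicitly and list the target folder:

context.build_data_docs()

# dbutils.fs paths refer to DBFS, so no /dbfs prefix here
dbutils.fs.ls("/tmp/discourse1427/data_docs")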


Thank you, @rachel.house, for the hints.
Based on these I figured out the following configuration, which lets me write the GX project files to DBFS and the Data Docs to an Azure Blob Storage container:

  context_root_dir = f"/dbfs{project.MNT_PATH}/GX/"

  project_config = DataContextConfig(
    ## Local storage backend
    store_backend_defaults=FilesystemStoreBackendDefaults(
      root_directory=context_root_dir
    ),
    ## Data docs site storage
    data_docs_sites={
      "az_site": {
        "class_name": "SiteBuilder",
        "store_backend": {
          "class_name": "TupleAzureBlobStoreBackend",
          "container":  "\$web",
          "connection_string":  "DefaultEndpointsProtocol=https;AccountName=<AccountName>;AccountKey=<AccountKey>;EndpointSuffix=core.windows.net",
        },
        "site_index_builder": {
          "class_name": "DefaultSiteIndexBuilder",
          "show_cta_footer": True,
        },
      }
    },
  )

  context = gx.get_context(project_config=project_config)```
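
As far as I understand, the $web container is the one Azure Storage serves as a static website, so after a checkpoint run (or an explicit rebuild, sketched below) the Data Docs should be reachable via the storage account's static-website endpoint:

  # Rebuild the Data Docs; with the TupleAzureBlobStoreBackend above, the
  # rendered HTML is uploaded to the $web container of the storage account
  context.build_data_docs()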

But it's also good to know your way!
Greetings and Kudos!