GitHub Actions with Python and GX

Hi GX!

I am opening this ticket to seek assistance with integrating Great Expectations (GX) into our GitHub Actions CI/CD pipeline. We use Python to set up configurations and connections, and then execute GX. Our goal, within the pipeline, is to share the generated report file. However, I have encountered difficulties due to the lack of documentation/online resources on this specific approach.

Here is a mock/example of the GitHub Actions workflow file:

name:  Run GX Validation

on:
  workflow_call:
    inputs:
      env:
        required: true
        type: string

jobs:
  run-gx-validation:
    runs-on: self-hosted
    if: ${{ inputs.env == 'stg' }}
    environment: stg
    env:
      SAMPLE_KEY: ${{ secrets.SAMPLE_KEY }}

    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.x'

      - name: Install requirements
        run: pip install -r requirements.txt
    
      - name: Run main
        run: python main.py $SAMPLE_KEY

After the Run main step, we aim to generate and share the GX report files, which are produced by main.py.
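For example, we could imagine a follow-up step along these lines (a sketch only - the path is a placeholder, since we haven't configured a save location yet):

```yaml
      # Hypothetical step: publish the generated Data Docs as a workflow
      # artifact so they can be downloaded from the run summary page.
      # The path is a placeholder for wherever main.py writes the report.
      - name: Upload GX report
        uses: actions/upload-artifact@v4
        with:
          name: gx-data-docs
          path: gx/uncommitted/data_docs/local_site/
```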

Usually we’d use self.context.open_data_docs() to open the report, but this is not possible in GitHub Actions. I was hoping to use get_docs_sites_urls but the return value is None, even though I am able to see the generated report in a browser.

Any ideas/examples? Thanks!

Hi @luis.briceno, thanks for reaching out!

You are correct, this is not a use case that we have documentation for or (anecdotally) encounter frequently. Let me see what I can find that might be helpful; here are two initial questions for you:

  1. You mentioned you are able to access the generated Data Docs using your browser - where are you persisting your rendered Data Docs files (e.g. locally on the GitHub runner or another filesystem, or in S3/Azure Blob Storage/GCS)?

  2. Are you using GX Cloud in any capacity? If not, I’m just going to bump this post over to the OSS Support forum.

Hi @rachel.house thank you for looking into this. Here are my responses to your questions:

  1. I am currently relying on the default behavior for persisting the generated Data Docs files. As I’m new to this, I haven’t made any custom configuration to override the default save location, so the files are being stored locally on the GitHub runner’s filesystem.
  2. No, we are not using GX Cloud at the moment, and thank you.

I would like to understand the best practices for configuring the save location for these reports. Initially I felt I had to take the “Filesystem” approach, but I am also interested in learning more about the AWS approach, such as saving the reports to an S3 bucket and displaying the link in GitHub Actions or in notifications. Any detailed guidance or examples on how to set this up would be extremely helpful.

Hi @luis.briceno, thanks for the extra context.

  • First, here is a link to our docs on hosting and sharing Data Docs in AWS S3/MS Azure Blob Storage/GCS/Specified filesystem: Host and share Data Docs | Great Expectations.

    • Data Docs are a static website, so you can write the files to where you want to serve the pages - by default GX will write them to the local filesystem. I’d agree that’s problematic when using a temporary GitHub runner - to my knowledge, the filesystem contents of the runner aren’t preserved outside of the CI/CD pipeline run.
    • If you have a permanent/external filesystem mounted to or available from your GitHub runner, you could write your Data Docs files to that specific external filesystem using the appropriate path.
  • The docs on hosting and sharing Data Docs guide you through setting up the configuration by editing the great_expectations.yml file. great_expectations.yml is created only if you are using a Filesystem Data Context. If you are using an Ephemeral Data Context, note that you can add a Data Docs site using the Python API, for example:

import great_expectations as gx

context = gx.get_context()

context.add_data_docs_site(
    site_config={
        "class_name": "SiteBuilder",
        "store_backend": {
            "class_name": "TupleFilesystemStoreBackend",
            "base_directory": "/path/to/where/you/want_to_write_data_docs",
        },
        "site_index_builder": {"class_name": "DefaultSiteIndexBuilder"},
    },
    site_name="Data Docs written to specific filepath",
)
  • If you’re interested generally in hosted validation results, another potential solution to look at is using GX Cloud - it is intended to make hosting and sharing your validation results easy! If you went this route, you’d need to make your cloud credentials available in the GitHub runner environment, which would trigger GX to create and use a Cloud Data Context. The Cloud Data Context uses your GX Cloud organization as its backend store for Data Sources & Assets, Expectation Suites, Checkpoints, and Validations.
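As a sketch of what an S3-backed site configuration could look like, the same add_data_docs_site pattern can point at an S3 store backend instead of the filesystem one. The bucket name and prefix below are placeholders, not values from this thread:

```python
# Hypothetical sketch: an S3-backed Data Docs site config, mirroring the
# filesystem example above. Bucket and prefix are placeholder values.
s3_site_config = {
    "class_name": "SiteBuilder",
    "store_backend": {
        "class_name": "TupleS3StoreBackend",
        "bucket": "my-data-docs-bucket",  # placeholder bucket name
        "prefix": "data_docs/",
    },
    "site_index_builder": {"class_name": "DefaultSiteIndexBuilder"},
}

# context.add_data_docs_site(
#     site_config=s3_site_config,
#     site_name="Data Docs written to S3",
# )
```

With a site like this, the rendered pages land in the bucket rather than on the ephemeral runner filesystem, so they survive the pipeline run and can be served (e.g. via S3 static website hosting) to anyone with access.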

This is great!

I will look more into the information you provided, thank you @rachel.house!