I have data docs stored in MinIO. Every time I run validations with an in-memory checkpoint I have to run
context.build_data_docs()
which rebuilds the entire data docs. This is fast when I have only one validation, but becomes progressively slower as more validations pile up. I saw there is also the option of
context.update_data_docs_site()
This takes 2 arguments which I have passed like this
However this doesn’t seem to work and the data docs aren’t being updated. How can I tell gx to only update the data docs with the newest validations rather than building it always from scratch?
Hey @erman! To update the data docs with only the newest validations, you can use the UpdateDataDocs Checkpoint Action or the validation_results_limit option.
The UpdateDataDocs Checkpoint Action allows you to render new validations only for the specific checkpoints containing the action. This is useful if you’re experiencing performance degradation due to too many validations or if the validations in your checkpoints are smaller in terms of expectation or data volumes compared to other validations in your environment. However, please note that this action is limited to the active checkpoint and changes made to GX or your local environment may not be captured in other data docs.
Alternatively, you can use the validation_results_limit option to specify the number of historical data docs that GX retains. This option is part of the site_index_builder parameter in your great_expectations.yml file. By setting a limit, only the specified number of validation results from previous checkpoints will be rendered and indexed. This can help improve performance without sacrificing data doc creation in your environment.
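For reference, the nesting described above can be sketched as a Python dict that mirrors the YAML layout. This is only an illustration: the site name `minio_site` and the limit of 5 are placeholders, and the key names follow the 0.x-style `data_docs_sites` config as I understand it, so please check them against your own great_expectations.yml.

```python
# Sketch of a data_docs_sites entry, written as a Python dict that mirrors
# the YAML structure. "minio_site" and the limit value are placeholders;
# the store_backend settings are omitted and would stay as they are in your file.
data_docs_sites = {
    "minio_site": {
        "class_name": "SiteBuilder",
        "site_index_builder": {
            "class_name": "DefaultSiteIndexBuilder",
            # Only the most recent N validation results are rendered and indexed.
            "validation_results_limit": 5,
        },
    }
}

limit = data_docs_sites["minio_site"]["site_index_builder"]["validation_results_limit"]
print(f"Index limited to the {limit} most recent validation results")
```

Note that this caps how many results show up in the docs, so it only helps if you don't need the full history displayed.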
Thank you for your answer, but my use case is different. I need the data docs to display all the validations that have been made. When I make new validations I want them to be appended to the rest in the data docs. Both of your options only show a subset of all validations available. Is there an option to tell GX to check if there are new validations to be shown in data docs and to simply append them without redoing everything from scratch?
context.update_data_docs() looks like it could be the solution, but it doesn’t do anything when I run it. Perhaps the parameters I give it are wrong? Thank you for your help.
Hey @erman, I replied to your message on the gx-community-support Slack; posting the same reply here:
We no longer use config-driven methods, so first we recommend that you fully upgrade to the Fluent Data Source (FDS) method that you are currently implementing only partially. For instance, to set up a Checkpoint, you should use the context.add_or_update_checkpoint method. If you need visual assistance, here’s a flowchart: Miro | Online Whiteboard for Visual Collaboration
Now, as to your general issue: this is pretty common, particularly for folks with a huge number of historical validations or folks running massive validations regularly. If you’re using a BuildDataDocs action (which I can’t tell in your case, as I’m not familiar with the deprecated config method), every one of those HTML files will be rebuilt, which can eventually take quite a long time. Swapping this for an UpdateDataDocs action will resolve that issue. When you use the FDS method, I believe it should default to the UpdateDataDocs action in the checkpoint yml file that gets generated once the checkpoint is run (the yml will be under your checkpoint folder). If not, change to this action. UpdateDataDocs will only build docs for new validations, rather than rebuilding all docs. Update Data Docs after Validating a Checkpoint | Great Expectations
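As a rough sketch of what that looks like in Python, here is an action list that stores each result and then updates the docs incrementally. The class names `StoreValidationResultAction` and `UpdateDataDocsAction` are the ones I know from the 0.x releases; the checkpoint name in the commented call is a placeholder, and the call itself is left commented out because it needs a real context and validations.

```python
# Actions run in order after each validation in the checkpoint.
action_list = [
    {
        "name": "store_validation_result",
        "action": {"class_name": "StoreValidationResultAction"},
    },
    {
        # Renders docs only for the validation that just ran,
        # instead of rebuilding every page of the site.
        "name": "update_data_docs",
        "action": {"class_name": "UpdateDataDocsAction"},
    },
]

action_classes = [entry["action"]["class_name"] for entry in action_list]
print(action_classes)

# Assumed usage with an existing context (placeholder checkpoint name):
# checkpoint = context.add_or_update_checkpoint(
#     name="my_checkpoint",
#     validations=validations,
#     action_list=action_list,
# )
```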
Alternatively, you could remove both data docs actions from your workflow, and run context.build_data_docs() off cadence outside of that workflow to isolate and minimize that process.
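A minimal sketch of that second approach, assuming your checkpoints have no data docs action in their action lists and that your context exposes run_checkpoint(checkpoint_name=...) as in the 0.x releases. The helper name `run_then_build_once` is made up for illustration.

```python
from typing import Any, Iterable, List


def run_then_build_once(context: Any, checkpoint_names: Iterable[str]) -> List[Any]:
    """Run every checkpoint first, then build the data docs a single time."""
    results = []
    for name in checkpoint_names:
        # No docs are rendered here as long as the checkpoint itself
        # carries no data docs action.
        results.append(context.run_checkpoint(checkpoint_name=name))
    # One build at the end instead of one per validation.
    context.build_data_docs()
    return results
```

The build still renders the whole site, so this only moves the cost to the end of the run; it does not make the build itself incremental.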
But when I run it, it still generates the data docs after each validation in the validations_df1 list instead of after all validations have run, which results in even slower run times. Is there another way to generate the new data docs after running all the validations?