Hi All,
We’ve implemented a generic approach in our GX context that creates pandas DataFrame assets via SQL and runs validations over that data.
We did this to avoid maintaining validation support per backend, where an expectation works for one database type but not another.
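Roughly, the setup looks like the sketch below (simplified; the connection string, query, and datasource/asset names are placeholders, and the exact fluent-API calls may differ slightly depending on GX version):

```python
import great_expectations as gx
import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string and query; real values come from our config.
engine = create_engine("postgresql://user:pass@host:5432/db")
query = "SELECT * FROM some_large_table"

# The whole result set is materialised as a pandas DataFrame in memory.
df = pd.read_sql(query, engine)

context = gx.get_context()

# Generic pandas datasource/asset so the same validation code works against any backend.
datasource = context.sources.add_pandas(name="generic_pandas")
asset = datasource.add_dataframe_asset(name="some_large_table")
batch_request = asset.build_batch_request(dataframe=df)
```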
We’ve hit an issue where validating very large volumes of data (GBs) consumes a lot of memory, because all of the data is loaded into memory at once, and this causes OOM errors.
Is there a recommended approach to memory management for pandas assets? I can’t seem to find any docs on this particular topic.
We also found that when a Checkpoint is run, it persists the batch data for every asset validated in that Checkpoint via the BatchManager class. This reduces load on the targeted backends, but has the counter-effect of increasing the memory GX uses to cache the data.
This means that if a Checkpoint has enough validations associated with it, it can consume a lot of memory when run, so there seems to be a memory-management issue at this level as well.
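To make the scale concrete, a Checkpoint on our side looks roughly like this (a sketch only; the checkpoint/suite names are placeholders and `batch_requests` stands for a list built per asset as in the earlier snippet):

```python
# One validation per asset, all run through a single Checkpoint.
validations = [
    {"batch_request": br, "expectation_suite_name": "generic_suite"}
    for br in batch_requests  # each br built as in the snippet above, one per asset
]

checkpoint = context.add_or_update_checkpoint(
    name="bulk_validation_checkpoint",
    validations=validations,
)

# Each batch's DataFrame appears to stay cached by the BatchManager for the duration of the run.
results = checkpoint.run()
```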
Any tips/doc links would be greatly appreciated; ideally we don’t want to throw money at the problem by just giving it more memory.
Thanks,
Sidney