Hi All,
We’ve implemented a generic approach in our GX context that creates pandas DataFrame assets via SQL and runs validations over that data.
We did this to avoid maintaining validation support per backend, where an expectation works for one database type but not another.
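Roughly, the setup looks like the sketch below (simplified; the connection string, query, and datasource/asset names are placeholders, and the exact fluent-API calls may differ slightly depending on GX version):

```python
import great_expectations as gx
import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string and query; real values come from our config.
engine = create_engine("postgresql://user:pass@host:5432/db")
query = "SELECT * FROM some_large_table"

# The whole result set is materialised as a pandas DataFrame in memory.
df = pd.read_sql(query, engine)

context = gx.get_context()

# Generic pandas datasource/asset so the same validation code works against any backend.
datasource = context.sources.add_pandas(name="generic_pandas")
asset = datasource.add_dataframe_asset(name="some_large_table")
batch_request = asset.build_batch_request(dataframe=df)
```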
We’ve hit an issue where validating very large volumes of data (GBs) consumes a lot of memory, because all of the data is loaded into memory at once, and this causes OOM errors.
Is there a recommended approach to memory management for pandas assets? I can’t seem to find any docs on this particular topic.
We also found that when a Checkpoint is run, it persists the batch data for every asset validated in that Checkpoint via the BatchManager class. This reduces load on the targeted backends, but has the counter-effect of increasing the memory GX uses to cache the data.
This means that if a Checkpoint has enough validations associated with it, it can consume a lot of memory when run, so there seems to be a memory-management issue at this level as well.
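To make the scale concrete, a Checkpoint on our side looks roughly like this (a sketch only; the checkpoint/suite names are placeholders and `batch_requests` stands for a list built per asset as in the earlier snippet):

```python
# One validation per asset, all run through a single Checkpoint.
validations = [
    {"batch_request": br, "expectation_suite_name": "generic_suite"}
    for br in batch_requests  # each br built as in the snippet above, one per asset
]

checkpoint = context.add_or_update_checkpoint(
    name="bulk_validation_checkpoint",
    validations=validations,
)

# Each batch's DataFrame appears to stay cached by the BatchManager for the duration of the run.
results = checkpoint.run()
```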
Any tips/doc links would be greatly appreciated; ideally we don’t want to throw money at the problem by just giving it more memory.
Thanks,
Sidney