When validating new data that depends on old data, do you need to re-validate the old data?

Hi, I'm new to gx and curious about memory use / efficiency.

Say there's a table with 2+ years of data being used as a data source (or say a pandas DataFrame). With an expectation like expect_column_values_to_be_unique, do you have to rerun it against the entire data source every time you want to verify there are no duplicates? Or is there a way to save resources by validating the full 2+ years once, and afterwards running the expectation only on the rows newly added to the table? My hunch is that you have to run it against the entire table, since the only way to know a new record isn't a duplicate is to compare it against the old records.
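
To make that concrete, here's roughly the kind of full-table run I have in mind (a minimal sketch following the pandas Quickstart pattern; the file name and the `order_id` column are just placeholders, and apologies if the API has moved since the version I've been reading):

```python
import great_expectations as gx

context = gx.get_context()

# Read the whole table: 2+ years of history plus whatever was just added.
# "orders_full_history.csv" and "order_id" are placeholder names.
validator = context.sources.pandas_default.read_csv("orders_full_history.csv")

# The uniqueness check runs over every row in this batch, old and new alike.
result = validator.expect_column_values_to_be_unique("order_id")
print(result.success)
```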

I'm asking because the batch / checkpoint workflow still confuses me a bit.
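
For context, the checkpoint piece I've copied so far looks roughly like this (again following the Quickstart and continuing from the snippet above, so treat the names as placeholders). What I can't tell is whether a checkpoint like this always re-scans the whole data source, or whether it can be pointed at a batch containing only the new rows:

```python
# Continues from the snippet above: `context` and `validator` already exist.

# Persist the suite that now contains the uniqueness expectation.
validator.save_expectation_suite(discard_failed_expectations=False)

# Wrap it in a checkpoint so the same validation can be rerun later.
checkpoint = context.add_or_update_checkpoint(
    name="orders_uniqueness_checkpoint",  # placeholder name
    validator=validator,
)
checkpoint_result = checkpoint.run()
print(checkpoint_result.success)
```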