Hi everyone! We are designing a data validation pipeline using Great Expectations that uses the information from one dataset to set expectations for another in PySpark. So far everything has been working great, so I want to congratulate the team on the awesome work! There is one thing that seems a bit weird to me that I would appreciate some help with.
We are taking advantage of the data_docs generated by the package, but it seems that we are running the validation twice. I think there should be a way to avoid this, but currently our setup is:
- We call all the expectations for our target dataset (`df.expect_something`; a simplified sketch is below). This takes about 40 min for our current expectation suite.
- Then we call `ge_env.context.run_validation_operator`, which I would expect to only store the results from step 1. However, looking at the `store_validation_result` action's code, it seems to validate again before storing the results. This adds a hopefully unnecessary extra 20 min.
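In case it helps, this is roughly what our pipeline does. It's a minimal sketch of the batch-kwargs / validation-operator workflow; the datasource, suite, path, and column names here are placeholders, not our real ones:

```python
import great_expectations as ge
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Placeholder path -- our real target dataset is larger
target_df = spark.read.parquet("path/to/target")

context = ge.data_context.DataContext()

# Wrap the Spark DataFrame in a batch so the expect_* methods are available on it
batch = context.get_batch(
    batch_kwargs={"dataset": target_df, "datasource": "spark_datasource"},
    expectation_suite_name="target_suite",
)

# Step 1: run the expectations interactively (~40 min for our current suite)
batch.expect_column_values_to_not_be_null("some_column")
batch.expect_column_values_to_be_between("other_column", min_value=0, max_value=100)

# Step 2: store the results and rebuild data docs -- this is where the
# extra ~20 min shows up, apparently from a second validation pass
results = context.run_validation_operator(
    "action_list_operator",
    assets_to_validate=[batch],
)
```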
Is there any way we can change this behaviour so that we can avoid the extra 20 min?