Connection Timeout and Suite/Checkpoint Not Found in Multi-threaded Execution (Using Separate Contexts per Thread)

Hi Great Expectations team and community,

I’m using Great Expectations with PySpark DataFrames in Databricks for data validation, following this general flow, all wrapped in a single function (a simplified sketch follows the list):

  1. Create a new EphemeralDataContext
  2. Create an Expectation Suite, add rules, and register it with the context
  3. Create a Checkpoint and add it to the context
  4. Run validation using the defined rules
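
For concreteness, the function looks roughly like this (assuming the GX 1.x fluent API and heavily simplified; the data source/asset/suite/checkpoint names and the single not-null expectation are placeholders for what the real code builds per DataFrame):

```python
import great_expectations as gx


def validate_dataframe(name, df):
    """Build an ephemeral context, suite, and checkpoint, then validate `df`.

    All names below are placeholders; the real code derives them from `name`
    and registers many more expectations.
    """
    # 1. Fresh in-memory context per call (and therefore per thread)
    context = gx.get_context(mode="ephemeral")

    # 2. Expectation Suite holding the validation rules
    suite = context.suites.add(gx.ExpectationSuite(name=f"{name}_suite"))
    suite.add_expectation(
        gx.expectations.ExpectColumnValuesToNotBeNull(column="id")
    )

    # Spark data source / asset / batch definition for the in-memory DataFrame
    data_source = context.data_sources.add_spark(name=f"{name}_source")
    asset = data_source.add_dataframe_asset(name=f"{name}_asset")
    batch_definition = asset.add_batch_definition_whole_dataframe(f"{name}_batch")

    validation_definition = context.validation_definitions.add(
        gx.ValidationDefinition(
            name=f"{name}_validation", data=batch_definition, suite=suite
        )
    )

    # 3. Checkpoint registered with the same context
    checkpoint = context.checkpoints.add(
        gx.Checkpoint(
            name=f"{name}_checkpoint",
            validation_definitions=[validation_definition],
        )
    )

    # 4. Run validation against the DataFrame
    return checkpoint.run(batch_parameters={"dataframe": df})
```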

:white_check_mark: Sequential Execution

When I run this sequentially across multiple DataFrames, everything works as expected. However, I still see occasional ConnectionError (timeout) messages in the logs, even though all validations pass.
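For reference, the sequential driver is just a loop over that function (simplified; `dataframes` stands in for my dict of {name: Spark DataFrame}):

```python
# Sequential driver: validate each Spark DataFrame one at a time.
results = {name: validate_dataframe(name, df) for name, df in dataframes.items()}

for name, result in results.items():
    print(name, "passed" if result.success else "failed")
```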

Q1: What causes these timeouts? Are they related to usage tracking, Data Docs, or any API interactions?


:warning: Multi-threaded Execution

To improve performance, I switched to multi-threading using concurrent.futures.ThreadPoolExecutor. I ensure that each thread creates its own fresh DataContext.
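
The threaded driver is essentially this (simplified; the worker count is arbitrary):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

# Threaded driver: each validate_dataframe call builds its own
# EphemeralDataContext, so no GX objects should be shared across threads.
results = {}
with ThreadPoolExecutor(max_workers=4) as executor:
    futures = {
        executor.submit(validate_dataframe, name, df): name
        for name, df in dataframes.items()
    }
    for future in as_completed(futures):
        results[futures[future]] = future.result()
```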

Still, for some dataframes, I get intermittent errors like:

  • Expectation suite not found in the context
  • Checkpoint not found in the context
  • Validation definition not found

These errors suggest that even though the context is isolated per thread, GE sometimes fails to locate suites or checkpoints created moments earlier in the same thread.

Q2: Is there any known thread-safety issue or race condition in how suites, checkpoints, and validation definitions are added to and looked up from an in-memory (ephemeral) context?

Q3: Would switching to multiprocessing (instead of threading) be more reliable for this use case?


Any advice, patterns, or configuration suggestions for safely running GE in a concurrent environment (especially with PySpark) would be highly appreciated.

Thanks in advance!