parse_strings_as_datetimes fails when using PySpark

I am using Great Expectations 0.18.8, running PySpark code in an AWS SageMaker Studio JupyterLab notebook. When I configure expect_column_min_to_be_between or expect_column_max_to_be_between with parse_strings_as_datetimes=True, I receive the following error regardless of whether I set max_value or min_value to a datetime.datetime (via datetime.strptime) or to a string:

"'>=' not supported between instances of 'str' and 'datetime.datetime'"

Here is an example configuration:


import datetime

from great_expectations.core.expectation_configuration import ExpectationConfiguration

expectation_config = ExpectationConfiguration(
    expectation_type="expect_column_min_to_be_between",
    kwargs={
        "column": "date_of_service",
        "min_value": datetime.datetime.strptime("2020-01-01", "%Y-%m-%d"),
        "max_value": datetime.datetime.strptime("2020-01-01", "%Y-%m-%d"),
        "parse_strings_as_datetimes": True,
    },
)
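
For context, a rough sketch of how a configuration like this can be added to a suite in GX 0.18 (the suite name is a placeholder, and these are the context methods I believe are standard in that version):

import great_expectations as gx

# Sketch only: the suite name is a placeholder.
context = gx.get_context()
suite = context.add_or_update_expectation_suite(expectation_suite_name="claims_suite")
suite.add_expectation(expectation_configuration=expectation_config)
context.update_expectation_suite(expectation_suite=suite)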

I have cross-posted this in Slack, and I know this issue has also been raised on GitHub. I would definitely appreciate a speedy update to this function. Otherwise, I may just create a custom expectation to handle this.

Hi @molly-srour, thanks for reaching out, and welcome to the GX community!

I’m trying to replicate the behavior you are seeing, but have not been able to reproduce the same error (currently getting a warning, but no error). To help us investigate the issue on our end:

  • Can you provide your full GX code that generates the error?
  • What is the data type of the column (date_of_service) that you’re validating with the problematic Expectation?

Hi Rachel! Thanks for looking into my question. My date_of_service column contains datetime values and is stored in a parquet file. If we look at a single value when loading the parquet file via Spark, we’d see the following for one row:

datetime.datetime(2020, 1, 1, 0, 0)
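
As a rough sketch of how that can be confirmed from the Spark side (the parquet path is a placeholder, and an existing SparkSession named spark is assumed):

# Sketch only: the parquet path is a placeholder.
df = spark.read.parquet("s3://bucket/claims/")
df.printSchema()                        # date_of_service shows as a timestamp column
df.select("date_of_service").first()    # Row(date_of_service=datetime.datetime(2020, 1, 1, 0, 0))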

I’m running GX using a checkpoint with many expectation suite JSON files. The portion that uses this column was configured with the ExpectationConfiguration I sent above, and looks like this in the JSON:

{
  "expectation_type": "expect_column_min_to_be_between",
  "kwargs": {
    "column": "date_of_service",
    "max_value": "2020-01-01T00:00:00",
    "min_value": "2020-01-01T00:00:00",
    "parse_strings_as_datetimes": true
  },
  "meta": {}
}

Then I would execute using the checkpoint: context.run_checkpoint(checkpoint_name="claim_checkpoint_2023")
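
In full, the run looks roughly like this (the result handling is just a sketch):

result = context.run_checkpoint(checkpoint_name="claim_checkpoint_2023")
print(result.success)  # overall pass/fail of the checkpoint run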

Let me know if I can provide any further information! I’m using spark as the engine.

Got it, thanks! I’ll take another look and will let you know if I have any further questions.

Hi @molly-srour, I was able to reproduce your error and checked in with Engineering about the findings.

The short answer is that the parse_strings_as_datetimes argument is deprecated; remove it from your Expectation definition (see the example after the points below).

A couple of related points:

  • In my reproduction testing, the expect_column_min_to_be_between Expectation worked as expected (without the parse_strings_as_datetimes arg) on a column containing datetime objects, nulls, and string datetimes.

  • Using the parse_strings_as_datetimes arg surfaced a DeprecationWarning on my end. Was this surfaced for you in your SageMaker notebook?

    DeprecationWarning: The parameter "parse_strings_as_datetimes" is deprecated as of v0.13.41 in v0.16. As part of the V3 API transition, we've moved away from input transformation.
    
  • Your post has surfaced a disconnect between our codebase and the Expectations Gallery in reflecting deprecated args. We will discuss internally how this can be addressed to reduce confusion in the future.
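
For reference, here is the ExpectationConfiguration from the top of the thread with the deprecated argument dropped (a sketch; everything else is unchanged):

import datetime

from great_expectations.core.expectation_configuration import ExpectationConfiguration

expectation_config = ExpectationConfiguration(
    expectation_type="expect_column_min_to_be_between",
    kwargs={
        "column": "date_of_service",
        "min_value": datetime.datetime(2020, 1, 1),
        "max_value": datetime.datetime(2020, 1, 1),
    },
)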

Let us know if you run into further errors using expect_column_min_to_be_between with datetime data in a Spark dataframe.

Hi <@U05PNEM25C0>,

How do I remove parse_strings_as_datetimes from my expectation? I am passing this argument in my case and am getting the same error.

Please refer to https://greatexpectationstalk.slack.com/archives/CUTCNHN82/p1709549641061479