Can we specify the data schema in the YAML config?

We currently set up our CSV batch requests with “inferSchema: True”. However, we sometimes receive a data file where a datetime column contains only nulls. As one might expect, the engine then infers the column as string rather than datetime, which causes our expect_column_max_to_be_between expectation to fail, as it can’t compare a datetime with a string.
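For illustration, the same behaviour can be reproduced outside of GX (file and column names made up):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A small CSV where the datetime column is null on every row
path = "/tmp/all_null_dates.csv"
with open(path, "w") as f:
    f.write("patient_id,discharge_datetime\n1,\n2,\n")

df = spark.read.csv(path, header=True, inferSchema=True)
df.printSchema()
# discharge_datetime comes back as string, not timestamp,
# because there are no values to infer a type from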

Is there a way we can specify the schema we are expecting, so that the column will be treated as a datetime even when all values are null? (We’re struggling to find any documentation for the correct syntax.)

  - batch_request:
      datasource_name: blob_datasource
      data_connector_name: inbound_blob_data_connector
      data_asset_name: Patient
      batch_spec_passthrough:
        reader_method: csv
        reader_options:
          header: True
          inferSchema: True

We are using the SparkDFExecutionEngine execution engine.

Many thanks

Update: We have tried the following:

  - batch_request:
      datasource_name: blob_datasource
      data_connector_name: inbound_blob_data_connector
      data_asset_name: Patient
      batch_spec_passthrough:
        reader_method: csv
        reader_options:
          header: True
          inferSchema: False
          enforceSchema: False    # Have also tried True
          schema:
            FieldName1: StringType
            FieldName2: StringType
            FieldName3: DateType
            FieldName4: BooleanType
            FieldName5: StringType

But we are seeing the error:

Exception occurred during checkpoint: 'fields'

I have no idea if this works, but consider writing the data connector in Python and explicitly specifying a schema for the CSV.
Something like this:

from pyspark.sql.types import StructType, StructField, TimestampType

schema = StructType([
    StructField("event_time", TimestampType(), True),
    # Add other fields as necessary
    # StructField("column_name", DataType(), True/False),
])

asset = datasource.add_csv_asset(
    asset_name,
    filepath_or_buffer=path_to_data,
    schema=schema,
)

If I remember correctly, GX just passes any additional arguments straight through to Spark, so this would translate roughly to the Spark code below:

df = spark.read.csv(
    path_to_data,
    schema=schema,
)
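As for the 'fields' error from your YAML attempt: that looks like a KeyError, and my guess (I haven’t verified it) is that GX tries to rebuild a Spark StructType from the mapping you supplied, and that rebuild expects the JSON shape that StructType itself produces (with a top-level "fields" key) rather than a flat name-to-type mapping. In Python that shape is easy to generate:

from pyspark.sql.types import (
    StructType, StructField, StringType, DateType, BooleanType
)

schema = StructType([
    StructField("FieldName1", StringType(), True),
    StructField("FieldName2", StringType(), True),
    StructField("FieldName3", DateType(), True),
    StructField("FieldName4", BooleanType(), True),
    StructField("FieldName5", StringType(), True),
])

# StructType's own JSON representation has the top-level "fields" key, e.g.
# {'type': 'struct', 'fields': [{'name': 'FieldName1', 'type': 'string',
#  'nullable': True, 'metadata': {}}, ...]}
print(schema.jsonValue())

If you would rather stay in YAML, it might be worth trying that jsonValue() structure under reader_options as the schema (untested on my side); with the Python route above you can simply pass the StructType object itself.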

Alternatively, you could create a Custom Expectation that handles this special case where the column contains only nulls.

Thanks Toivo. I’ll give that a try.

I suppose I’ll need to translate our YAML config into Python. (Is the YAML route no longer recommended?)

Yes, as far as I know, GX recommends developing and maintaining Checkpoints etc. programmatically in Python. Under the hood, GX then stores the configurations as YAML, which is easier to, for example, push to production.
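For example, a Checkpoint can be created from Python roughly like this (block-config style, names taken from your config; the exact arguments vary a bit between GX versions), after which a file-based project stores the corresponding YAML under checkpoints/:

import great_expectations as gx

context = gx.get_context()

# Register a Checkpoint programmatically; GX persists it as YAML.
# "patient_suite" is a placeholder for whatever your Expectation Suite is called.
checkpoint = context.add_checkpoint(
    name="patient_checkpoint",
    config_version=1.0,
    class_name="SimpleCheckpoint",
    validations=[
        {
            "batch_request": {
                "datasource_name": "blob_datasource",
                "data_connector_name": "inbound_blob_data_connector",
                "data_asset_name": "Patient",
            },
            "expectation_suite_name": "patient_suite",
        }
    ],
)

result = context.run_checkpoint(checkpoint_name="patient_checkpoint")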

I personally use Jupyter Notebooks when working with GX. That way I get immediate feedback when something I’ve written doesn’t work, for example when there’s something wrong with the connection to the data or when the data doesn’t actually conform to the Expectations I’m writing.
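A rough sketch of that notebook loop, using the block-config batch request from this thread (assuming an Expectation Suite named "patient_suite" already exists; calls may differ slightly between GX versions):

import great_expectations as gx
from great_expectations.core.batch import BatchRequest

context = gx.get_context()

batch_request = BatchRequest(
    datasource_name="blob_datasource",
    data_connector_name="inbound_blob_data_connector",
    data_asset_name="Patient",
    batch_spec_passthrough={
        "reader_method": "csv",
        "reader_options": {"header": True, "inferSchema": True},
    },
)

validator = context.get_validator(
    batch_request=batch_request,
    expectation_suite_name="patient_suite",  # assumed to exist already
)

# Each expectation runs against the batch immediately and returns a result,
# so problems with the connection or the data show up right away.
result = validator.expect_column_values_to_not_be_null(column="FieldName3")
print(result.success)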