I found the root cause and identified a workaround.
The issue is with the logic used to filter rows, controlled by the ignore_row_if parameter. It can take one of the following values:
- both_values_are_missing (default)
- either_value_is_missing
- neither
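To make the three settings concrete, here is a hypothetical pure-Python illustration of the row-keeping rule each value implies (this is not Great Expectations' actual implementation, just a sketch of the semantics described above):

```python
# Sketch: which (A, B) row pairs each ignore_row_if setting keeps.
# keep_row is a hypothetical helper, not a Great Expectations function.
def keep_row(a, b, ignore_row_if="both_values_are_missing"):
    if ignore_row_if == "both_values_are_missing":
        # Drop a row only when both columns are missing (the default).
        return not (a is None and b is None)
    if ignore_row_if == "either_value_is_missing":
        # Drop a row when at least one column is missing.
        return a is not None and b is not None
    if ignore_row_if == "neither":
        # Keep every row; the DataFrame is left unmodified.
        return True
    raise ValueError(f"unknown ignore_row_if: {ignore_row_if}")

rows = [(3, 1), (None, 2), (None, None)]
print([keep_row(a, b) for a, b in rows])                             # → [True, True, False]
print([keep_row(a, b, "either_value_is_missing") for a, b in rows])  # → [True, False, False]
print([keep_row(a, b, "neither") for a, b in rows])                  # → [True, True, True]
```

With the default, only the all-missing row is dropped; with neither, nothing is dropped, which is what the workaround below relies on.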
With the default value, some rows are filtered out, which modifies the DataFrame. This happens in sparkdf_execution_engine.py, inside the get_domain_records method.
The modified DataFrame then causes errors later, such as the one we saw for df1.select(df2.col("a")).
This likely results from operations being applied to DataFrames of different sizes. I suspect the issue comes from this part of the code:
data = df.withColumn("__unexpected", unexpected_condition)
filtered = data.filter(F.col("__unexpected") == True).drop(F.col("__unexpected"))
Workaround:
Set ignore_row_if to neither. This prevents rows from being filtered out, so the DataFrame is not modified and the code and tests run as expected.
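As a minimal sketch, the workaround amounts to passing ignore_row_if="neither" in the expectation's kwargs. The column names below are hypothetical, and the surrounding suite/validator wiring is assumed rather than shown:

```python
# Expectation configuration sketch; the expectation type follows
# Great Expectations' snake_case naming convention.
expectation_config = {
    "expectation_type": "expect_column_pair_values_a_to_be_greater_than_b",
    "kwargs": {
        "column_A": "a",  # hypothetical column names
        "column_B": "b",
        # The workaround: never drop rows, so the DataFrame handed to
        # downstream operations is unchanged.
        "ignore_row_if": "neither",
    },
}
print(expectation_config["kwargs"]["ignore_row_if"])  # → neither
```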
I think the issue also depends on the access mode of the Databricks cluster. I can successfully run ExpectColumnPairValuesAToBeGreaterThanB expectations when the access mode is "Dedicated (formerly: Single user)". If the access mode is changed to "Standard (formerly: Shared)", I get exceptions of the type "[CANNOT_RESOLVE_DATAFRAME_COLUMN] Cannot resolve dataframe column".
I am currently using Databricks Runtime 17.0, but the same behavior existed with previous runtimes too.