How to create custom Expectations for Spark

This article is for comments to: https://docs.greatexpectations.io/en/latest/guides/how_to_guides/creating_and_editing_expectations/how_to_create_custom_expectations_for_spark.html

Please comment +1 if this How to is important to you.


Is any sample code available for custom Spark expectations? It does not work the same way the Pandas one does.

The article links are not working. Please point us in the right direction to find a sample custom expectation for Spark.

+1 for this - it would be very useful!

Probably a stupid question, but how do you actually run/add the custom expectations you create in 0.13? Say I have a SparkDFDataset in a notebook and I want to add the expectation. With out-of-the-box expectations I can just do: dataset.expect_column_values…


I would like to use the GE pre-0.13 release to create custom expectations for a SparkDFDataset. I'm working in an Azure Databricks Notebook environment and I have a pre-existing data pipeline which loads data from my data lake into a Spark dataframe and then performs custom business validations. I'm looking for a code example which shows me how to create, instantiate, and use custom expectations on my Spark dataframe.


Since this how-to guide is a placeholder, we will put some pointers in the following comment:

If you want to implement a pre-0.13-style Expectation for Spark:

Here is an example of implementing a new Expectation: expect_column_value_lengths_to_equal_5. This expectation is not terribly useful in itself, but it shows the mechanics of adding a custom expectation:

  1. If you configure your Data Context using a great_expectations.yml file, create a new Python file in the plugins directory of your GE project. I'll call mine "my_custom_expectations.py" in this example. GE automatically imports all modules defined in that directory.
    If you instantiate your Data Context in your code without storing its configuration in the yml file (most likely because you are running AWS EMR Spark, Databricks, or a similar environment), you can implement your custom expectations in a file of your choosing, but don't forget to import your module, since GE will not do it automatically.

The actual code for the expectation is here:

from great_expectations.dataset import SparkDFDataset, MetaSparkDFDataset

from pyspark.sql.functions import length as length_
from pyspark.sql.functions import (
    lit,
    when,
)


class CustomSparkDFDataset(SparkDFDataset):

    _data_asset_type = "CustomSparkDFDataset"

    @MetaSparkDFDataset.column_map_expectation
    def expect_column_value_lengths_to_equal_5(
        self,
        column,
        mostly=None,
        result_format=None,
        include_config=True,
        catch_exceptions=None,
        meta=None,
    ):
        # The decorator passes in a single-column DataFrame; the expectation
        # returns it with an added boolean "__success" column marking each row.
        return column.withColumn(
            "__success",
            when(length_(column[0]) == 5, lit(True)).otherwise(lit(False)),
        )
  2. Update the configuration of your Datasource. For your new expectations to be picked up, you have to tell the Datasource to use your new class as its data_asset_type, instead of the default SparkDFDataset. Since you extended that class, your new class (CustomSparkDFDataset in this example) has both the built-in expectations and your custom ones:

data_asset_type:
  module_name: my_custom_expectations
  class_name: CustomSparkDFDataset

Now you can use the new expectations defined in that module in all the usual workflows: creating suites, editing suites, etc.
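For example (a minimal sketch, assuming a Spark DataFrame named spark_df and the module and class names above), in a notebook you can also wrap the dataframe directly in your custom class and call the expectation on it:

from my_custom_expectations import CustomSparkDFDataset

# Wrap an existing Spark DataFrame in the custom dataset class.
dataset = CustomSparkDFDataset(spark_df)

# Both the built-in and the custom expectations are now available on it:
dataset.expect_column_value_lengths_to_equal_5("my_column")

When you go through the Data Context and the data_asset_type configuration above instead, batches returned by context.get_batch are instances of this class automatically.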

For more examples of implementing particular expectations for Spark, follow the example of an existing expectation, e.g., expect_column_values_to_not_be_null in sparkdf_dataset.py.

For implementing new-style (modular) expectations, please consult this how-to guide:
How to create custom Expectations — great_expectations documentation (be sure to click the "Show Docs for Experimental API" tab).
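For reference, here is a rough sketch of the shape a new-style (modular) Expectation backed by a Spark metric can take. The class, metric, and import names below are illustrative assumptions; check the guide above for the exact API of the version you are on:

from great_expectations.execution_engine import SparkDFExecutionEngine
from great_expectations.expectations.expectation import ColumnMapExpectation
from great_expectations.expectations.metrics import (
    ColumnMapMetricProvider,
    column_condition_partial,
)
import pyspark.sql.functions as F


class ColumnValuesLengthEqualFive(ColumnMapMetricProvider):
    # Name used to wire this metric to the Expectation class below.
    condition_metric_name = "column_values.length_equal_five"

    @column_condition_partial(engine=SparkDFExecutionEngine)
    def _spark(cls, column, **kwargs):
        # Row-level boolean condition evaluated on the Spark column.
        return F.length(column) == 5


class ExpectColumnValueLengthsToEqualFive(ColumnMapExpectation):
    map_metric = "column_values.length_equal_five"
    success_keys = ("mostly",)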

@eugene.mandel It's still not clear to me how to work with the new-style modular expectations. I can create a new expectation based on the docs you posted, but how do I actually use it for validation on a batch of data?


@CasperBojer To use your custom expectation (implemented using Modular Expectations), you just need to import the module that contains your expectation class; once imported, it is ready to be used like any other expectation. If you have further questions or would like to do a user testing session on this, please reach out in Slack.
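For example (a sketch, assuming your Expectation class lives in a module called my_custom_expectations and that you already have a Validator for your batch, here named validator):

# Importing the module registers the Expectation class with GE.
import my_custom_expectations  # noqa: F401

# The expectation is then available on the Validator under its snake_case name:
result = validator.expect_column_value_lengths_to_equal_five("my_column", mostly=0.95)
print(result.success)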

+1

Hi,

I have implemented the sample custom expectation in our framework. Now we want to set up a real-world custom expectation: the daily count should be within a percentage of the average of the last 30 days.
In this case, do I need to use the column_map_expectation decorator or the column_aggregate_expectation one?
I am confused about these two decorators and in which situations each should be used.
I didn't see any expectation use the latter one in sparkdf_dataset.py.
Can you shed some light on this?

Thanks,
Lei