Can GE access / validate data from Spark, stored in an S3 bucket?

I’d like to be able to run Great Expectations at several steps in an Airflow pipeline. Between those steps, Spark is used to cleanse / transform data and store Parquet files in S3 buckets. Can Great Expectations access these as a data source? I can find only references to Spark on a filesystem, and to S3 in conjunction with Pandas.

Any help appreciated.


@alexc Yes, you can validate a Parquet file in an S3 bucket as a step in your Airflow DAG. We will create a documentation article for this case, but in the meantime,

  1. please use this article to see how to validate a Spark dataframe: How to load a Spark DataFrame as a Batch — great_expectations documentation
  2. Replace the df in runtime_parameters with {"path": "s3://my.s3.path"}
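
For illustration, the two steps above might look roughly like the sketch below. It assumes a Great Expectations version that provides `RuntimeBatchRequest` (the V3 API) and a Spark datasource with a runtime data connector already configured. The datasource, connector, and asset names and the `run_id` batch identifier are placeholders, not values from the linked article, and the helper function names are made up for this example. Depending on your Spark setup, the path may need the `s3a://` scheme rather than `s3://`.

```python
def s3_runtime_parameters(s3_path: str) -> dict:
    """Build runtime_parameters that point at a file path instead of an in-memory DataFrame."""
    # Accept both schemes; which one Spark can read depends on the cluster setup.
    if not s3_path.startswith(("s3://", "s3a://")):
        raise ValueError(f"Expected an s3:// or s3a:// path, got: {s3_path!r}")
    return {"path": s3_path}


def build_batch_request(s3_path: str):
    """Create a RuntimeBatchRequest for a Parquet file in S3 (all names are placeholders)."""
    # Imported here so the helper above works without Great Expectations installed.
    from great_expectations.core.batch import RuntimeBatchRequest

    return RuntimeBatchRequest(
        datasource_name="my_spark_datasource",            # placeholder: your Spark datasource
        data_connector_name="my_runtime_data_connector",  # placeholder: your runtime connector
        data_asset_name="my_parquet_asset",               # placeholder: any name for this asset
        runtime_parameters=s3_runtime_parameters(s3_path),
        # Keys must match the batch identifiers configured on the data connector.
        batch_identifiers={"run_id": "airflow_run"},
    )
```

You would then pass the result of `build_batch_request("s3://my.s3.path")` to your validator or checkpoint in the same way the article passes the DataFrame-based batch request.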