Can GE access / validate data from Spark, stored in an S3 bucket?

I’d like to be able to run Great Expectations at several steps in an Airflow pipeline. Between those steps, Spark is used to cleanse and transform data and store Parquet files in S3 buckets. Can Great Expectations access these as a data source? I can find only references to Spark on the filesystem, and to S3 in conjunction with Pandas.

Any help appreciated.

Thanks,
Alex.

@alexc Yes, you can validate a Parquet file in an S3 bucket as a step in your Airflow DAG. We will create a documentation article for this case, but in the meantime:

  1. Please see this article for how to validate a Spark DataFrame: How to load a Spark DataFrame as a Batch — great_expectations documentation
  2. Replace the `df` in `runtime_parameters` with `{"path": "s3://my.s3.path"}` (a hedged sketch of the full batch request follows below).
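To make that concrete, here is a minimal sketch of what the batch request might look like in the V3 (Batch Request) API. The datasource, data connector, asset, suite names, and the S3 path are all placeholders; they assume you already have a Spark datasource with a runtime data connector configured in your `great_expectations.yml`:

```python
from great_expectations.core.batch import RuntimeBatchRequest
from great_expectations.data_context import DataContext

context = DataContext()

batch_request = RuntimeBatchRequest(
    datasource_name="my_spark_datasource",            # placeholder: your configured Spark datasource
    data_connector_name="default_runtime_data_connector_name",
    data_asset_name="cleansed_parquet_asset",         # arbitrary label for this batch
    # Pass an S3 path instead of an in-memory df; Spark typically
    # uses the s3a:// scheme (requires hadoop-aws on the classpath).
    runtime_parameters={"path": "s3a://my-bucket/cleansed/data.parquet"},
    batch_identifiers={"default_identifier_name": "airflow_run"},
    batch_spec_passthrough={"reader_method": "parquet"},  # tell Spark how to read the file
)

validator = context.get_validator(
    batch_request=batch_request,
    expectation_suite_name="my_expectation_suite",    # placeholder suite name
)
results = validator.validate()
print(results.success)
```

Note that the Spark session behind the datasource needs S3 access configured (e.g. credentials and the hadoop-aws package), which is independent of Great Expectations itself.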