Leveraging an AWS DevOps Pipeline with Great Expectations

Hello everyone :hugs:,

I’m currently working on a project and am looking for advice and best practices from anyone who has built a data validation pipeline using Great Expectations in an AWS DevOps environment.

My project centers on a data pipeline that ingests, transforms, and loads data into an Amazon Redshift cluster. Here’s a quick rundown of my setup (with a simplified sketch below):

  • Data Ingestion: Data is ingested into a staging area in S3 from several sources, including RDS and other S3 buckets.
  • Data Transformation: The data is transformed using AWS Glue jobs.
  • Data Loading: The transformed data is loaded into Amazon Redshift.
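
To make that concrete, the transform and load steps currently run roughly along these lines. This is only a simplified sketch: the job name, cluster details, table, and COPY statement are placeholders, and the real orchestration is more involved.

```python
# Simplified sketch of the existing pipeline steps (all names are placeholders).
import boto3

glue = boto3.client("glue")
redshift_data = boto3.client("redshift-data")

# 1) Transform: run the Glue job that reads from the S3 staging area.
run = glue.start_job_run(JobName="transform-staging-data")  # placeholder job name
print("Started Glue job run:", run["JobRunId"])

# 2) Load: COPY the transformed data from S3 into Redshift.
redshift_data.execute_statement(
    ClusterIdentifier="my-redshift-cluster",   # placeholder
    Database="analytics",                      # placeholder
    DbUser="etl_user",                         # placeholder
    Sql=(
        "COPY analytics.events "
        "FROM 's3://my-bucket/transformed/' "
        "IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role' "
        "FORMAT AS PARQUET;"
    ),
)
```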

I want to incorporate Great Expectations to guarantee data quality and integrity at several points in this pipeline. More specifically, I want to:

  • Validate the raw data as soon as it lands in S3 (see the sketch after this list).
  • Validate the transformed data before it is loaded into Redshift.
  • Set up ongoing validation and monitoring of the data within Redshift.
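
For the first point, here’s a rough sketch of what I have in mind for validating a raw file right after it lands in S3. The bucket, key, and column names are placeholders, and I’m assuming the classic pandas-style GX API (`gx.from_pandas`), which may differ depending on the GX version you’re on.

```python
# Rough sketch: validate a raw CSV from the S3 staging area with Great Expectations.
# Bucket, key, and column names are placeholders; assumes the classic pandas API.
import boto3
import pandas as pd
import great_expectations as gx

s3 = boto3.client("s3")
obj = s3.get_object(Bucket="my-staging-bucket", Key="raw/orders/2024-01-01.csv")
df = pd.read_csv(obj["Body"])

batch = gx.from_pandas(df)
batch.expect_column_values_to_not_be_null("order_id")
batch.expect_column_values_to_be_unique("order_id")
batch.expect_column_values_to_be_between("amount", min_value=0)

results = batch.validate()
if not results.success:
    raise ValueError(f"Raw data validation failed: {results}")
```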

I have also looked at this for reference: DevOps pipeline example - AWS CodePipeline, but it doesn’t give me the clarity I’m after.

Here are my questions:

What are the best practices for integrating Great Expectations into an AWS DevOps workflow?

How can I automate the validation steps using AWS CodePipeline and CodeBuild?
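
For that part, I’m picturing a small script like the one below that a CodeBuild step would run, failing the pipeline stage when expectations fail. The checkpoint name is a placeholder, and I’m assuming an existing GX project with a configured checkpoint (the `run_checkpoint` call is from the 0.x API).

```python
# Sketch of a validation step CodeBuild could run; exits nonzero so the
# CodePipeline stage fails when expectations fail. Assumes an existing GX
# project with a configured checkpoint; names are placeholders.
import sys
import great_expectations as gx

context = gx.get_context()
result = context.run_checkpoint(checkpoint_name="raw_orders_checkpoint")  # placeholder

if not result.success:
    print("Great Expectations validation failed")
    sys.exit(1)

print("Validation passed")
```

Is that roughly how people wire it up, with the buildspec installing GX and running a script like this as a build command before the Redshift load stage?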

Any advice from people who have tackled similar challenges, or suggestions for improving this setup, would be greatly appreciated.

Thanks in advance :+1: for your guidance and support!

Hey @Oliver,

I’d first get a sense of the GX components and their workflow: Great Expectations overview | Great Expectations

Then I’d check out the tutorials showing how you’d integrate with AWS: Integrate Great Expectations with AWS | Great Expectations

I’m not sure whether GX works with CodePipeline and CodeBuild, so I’d experiment with that first to make sure your proposed solution is viable.
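
A minimal experiment could be a CodeBuild project whose buildspec installs great_expectations and runs a tiny script along these lines (just a sketch; the data and expectation are made up purely to exercise the library):

```python
# Tiny smoke test to confirm GX runs inside CodeBuild; the data and
# expectation are made up just to exercise the library.
import pandas as pd
import great_expectations as gx

df = pd.DataFrame({"id": [1, 2, 3]})
batch = gx.from_pandas(df)
result = batch.expect_column_values_to_not_be_null("id")
assert result.success, "Expected the smoke-test expectation to pass"
print("GX ran successfully inside CodeBuild")
```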

Josh