Hello everyone,
I’m looking for advice and best practices from anyone who has built a data validation pipeline with Great Expectations in an AWS DevOps environment.
My project is a data pipeline that ingests, transforms, and loads data into an Amazon Redshift cluster. Here’s a quick rundown of my setup:
- Data Ingestion: Data is ingested into a staging area in S3 from several sources, including RDS and S3.
- Data Transformation: The data is transformed by AWS Glue jobs.
- Data Loading: The transformed data is loaded into Amazon Redshift.
I want to incorporate Great Expectations to ensure data quality and integrity at several points in this pipeline. Specifically, I want to:
- Validate the raw data as soon as it lands in S3 (see the first sketch after this list).
- Validate the transformed data before it is loaded into Redshift (second sketch).
- Set up ongoing monitoring and validation of the data already in Redshift (third sketch).
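For context, here’s roughly what I have in mind for the first point. It’s only a minimal sketch using the legacy pandas-backed Great Expectations API (`ge.from_pandas`); newer releases (0.13+/1.x) replace this with Data Contexts and Checkpoints. The bucket, key, and column names are placeholders:

```python
import io
import sys

import boto3
import pandas as pd
import great_expectations as ge

# Pull the freshly ingested raw file from the S3 staging area
# (bucket and key are placeholders).
s3 = boto3.client("s3")
obj = s3.get_object(Bucket="my-staging-bucket", Key="raw/orders.csv")
df = pd.read_csv(io.BytesIO(obj["Body"].read()))

# Wrap the DataFrame so expectation methods become available.
gdf = ge.from_pandas(df)

# Basic structural checks on the raw extract (placeholder columns).
gdf.expect_column_values_to_not_be_null("order_id")
gdf.expect_column_values_to_be_unique("order_id")
gdf.expect_column_values_to_be_between("amount", min_value=0)

# validate() re-runs every expectation declared above and returns
# an overall success flag plus per-expectation results.
results = gdf.validate()
if not results["success"]:
    print(results)
    sys.exit(1)  # non-zero exit fails the surrounding pipeline stage
```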
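For the second point, my current thinking is to run the checks inside the Glue job itself, right before the load. Again just a sketch: it assumes the legacy `SparkDFDataset` wrapper and that the `great_expectations` package is shipped to Glue via `--additional-python-modules` (both assumptions on my part); the path and column names are placeholders:

```python
from pyspark.sql import SparkSession
from great_expectations.dataset import SparkDFDataset

spark = SparkSession.builder.getOrCreate()

# Transformed output written by the earlier Glue step (placeholder path).
df = spark.read.parquet("s3://my-staging-bucket/transformed/orders/")

gdf = SparkDFDataset(df)
gdf.expect_table_row_count_to_be_between(min_value=1)
gdf.expect_column_values_to_not_be_null("order_id")
gdf.expect_column_values_to_be_of_type("amount", "DoubleType")

results = gdf.validate()
if not results["success"]:
    # Fail the Glue job so the COPY into Redshift never runs.
    raise RuntimeError(f"Validation failed: {results['statistics']}")
```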
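For the third point, a scheduled job (say, an EventBridge rule triggering a Lambda or a small CodeBuild project) could run checks directly against Redshift. This sketch assumes the legacy `SqlAlchemyDataset` plus the `sqlalchemy-redshift` dialect and `psycopg2` driver; the connection string and table name are placeholders:

```python
from sqlalchemy import create_engine
from great_expectations.dataset import SqlAlchemyDataset

# Placeholder connection string; in practice I'd pull credentials
# from Secrets Manager rather than hard-coding them.
engine = create_engine(
    "redshift+psycopg2://user:password@my-cluster.example."
    "us-east-1.redshift.amazonaws.com:5439/analytics"
)

# Wrap an existing Redshift table so expectations run as SQL.
table = SqlAlchemyDataset(table_name="orders", engine=engine, schema="public")

table.expect_table_row_count_to_be_between(min_value=1)
table.expect_column_values_to_not_be_null("order_id")

results = table.validate()
print(results["statistics"])
```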
I’ve also looked at this for reference: DevOps pipeline example - AWS CodePipeline, but it doesn’t clarify exactly what I need for data validation.
Here are my questions:
What are the best practices for integrating Great Expectations into an AWS DevOps workflow?
How can I automate the validation steps using AWS CodePipeline and CodeBuild? (My current idea for the CodeBuild step is sketched below.)
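To make the second question concrete, here’s the kind of entrypoint I imagine CodeBuild running (e.g. `python run_validation.py` in the buildspec’s build phase), so a failed validation fails the build and halts the pipeline. It assumes a Great Expectations project with a configured Checkpoint, using the v3-style `get_context`/`run_checkpoint` API; the checkpoint name is a placeholder:

```python
import sys

import great_expectations as ge

# Load the GE project config checked into the repo alongside the code.
context = ge.get_context()

# Run a pre-configured Checkpoint (placeholder name).
result = context.run_checkpoint(checkpoint_name="staging_validation_checkpoint")

if not result.success:
    print("Data validation failed; failing this CodeBuild stage.")
    sys.exit(1)  # CodeBuild marks the build failed and CodePipeline stops

print("Data validation passed.")
```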
Input from anyone who has overcome similar obstacles, or suggestions for improving this setup, would be greatly appreciated.
Thanks in advance for your guidance and support!