Hello everyone,
I’m new to this and would like to ask whether it’s possible to set up the GX Core library so that the same suite of checks can run across multiple, unrelated files.
We handle several projects that may ingest dozens of files each day, and I need to create expectations to validate data integrity. My current process imports each file, cleans it up by removing fully null rows and standardising column names, and saves it as a Parquet file (a rough sketch of this step is below). The Parquet files are later loaded into a Lakehouse Delta table, where I run further checks: that the column names are correct, that the row counts match between the raw file and the landed table (accounting for the deleted empty rows), and so on.
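For context, here is a minimal sketch of what that cleanup step looks like today. It's plain pandas; the folder names (`raw`, `staged`) and the CSV input format are placeholders for our actual layout:

```python
import pandas as pd
from pathlib import Path

RAW_DIR = Path("raw")        # placeholder input folder
STAGED_DIR = Path("staged")  # placeholder Parquet output folder

def clean_and_stage(path: Path) -> tuple[int, int]:
    """Clean one raw file and stage it as Parquet.

    Returns (raw_row_count, staged_row_count) so the counts can later
    be reconciled against the landed Delta table.
    """
    df = pd.read_csv(path)
    raw_rows = len(df)

    # Drop rows where every column is null
    df = df.dropna(how="all")

    # Standardise column names: lowercase, underscores instead of spaces
    df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]

    df.to_parquet(STAGED_DIR / f"{path.stem}.parquet", index=False)
    return raw_rows, len(df)

for f in sorted(RAW_DIR.glob("*.csv")):
    raw_rows, staged_rows = clean_and_stage(f)
    print(f"{f.name}: {raw_rows} raw rows, {staged_rows} after cleanup")
```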
From my reading of the GX Core documentation, it seems that columns need to be explicitly defined and that multi-file processing isn’t directly supported; my rough idea for working around this is sketched below. I do like the features offered by GX Core, particularly the documentation capabilities, which would be very useful for our workflow.
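To frame the question, this is roughly what I had in mind: reuse one dataframe asset and treat each file as a batch. It's a sketch against the GX Core 1.x fluent API as I understand it from the docs, so the data source/asset names (`pandas_src`, `staged_files`) and the column list are my assumptions, not a working setup:

```python
import great_expectations as gx
import pandas as pd
from pathlib import Path

context = gx.get_context()

# One dataframe asset reused for every file; each file becomes a batch.
data_source = context.data_sources.add_pandas(name="pandas_src")
asset = data_source.add_dataframe_asset(name="staged_files")
batch_def = asset.add_batch_definition_whole_dataframe("per_file")

# Expectations shared across all files; column set is per-project config.
expected_columns = ["order_id", "order_date", "amount"]  # placeholder
expectations = [
    gx.expectations.ExpectTableColumnsToMatchSet(column_set=expected_columns),
    gx.expectations.ExpectColumnValuesToNotBeNull(column="order_id"),
]

for f in sorted(Path("staged").glob("*.parquet")):
    df = pd.read_parquet(f)
    batch = batch_def.get_batch(batch_parameters={"dataframe": df})
    for expectation in expectations:
        result = batch.validate(expectation)
        if not result.success:
            print(f"{f.name}: failed {expectation.__class__.__name__}")
```

I don't know whether this is idiomatic, or whether there is a better-supported way to point GX at a whole directory of files, which is part of what I'm asking.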
I’m also considering writing a custom expectation in Python that scans a column, infers the most frequently used format within it, treats that as the expected format, and then reports any anomalies against it (a plain-Python starting point is sketched below).
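As a starting point for that idea, before wrapping it in GX's custom-expectation machinery, I was picturing something like this. The masking scheme (digits become `9`, letters become `A`) is just one way to fingerprint a format and is entirely my assumption:

```python
import re
import pandas as pd

def infer_format_anomalies(series: pd.Series) -> pd.Series:
    """Mask each value (digits -> 9, letters -> A), take the most common
    mask as the expected format, and return the values that deviate."""
    def mask(value: str) -> str:
        value = re.sub(r"\d", "9", value)
        return re.sub(r"[A-Za-z]", "A", value)

    masks = series.dropna().astype(str).map(mask)
    expected = masks.mode().iat[0]  # most frequent format wins
    bad_idx = masks[masks != expected].index
    return series.loc[bad_idx]

s = pd.Series(["2024-01-05", "2024-02-17", "05/03/2024", None])
print(infer_format_anomalies(s))  # flags "05/03/2024"
```

If anyone has built a comparable expectation in GX Core, I'd be keen to hear whether this inference approach held up in practice.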
Before I proceed further, I’d be interested to know how others have implemented GX Core in similar projects. Thank you.
I’m currently working on the ingestion process, while the rest of the work is handled by our data engineering team.