I am looking for documentation on how to use the Spark datasource when the files are stored remotely (say, on S3).
I found the two articles below. Is the first article about using a local Spark instance for validation, while the second method uses a Spark cluster? So if I want the validation to be scalable (i.e., handle larger amounts of data), should I follow the second article? Is it true that the first article's method is only useful for validating small amounts of data, since it seems to use a local Spark instance and would involve downloading the data from the remote location during validation?