I’m running into the following incompatibility with GE and Airflow:
Great Expectations requires marshmallow >3.0, while Airflow wants marshmallow <3.0.
Specifically, pip install great_expectations installs marshmallow 3.5.2, and pip install apache-airflow uninstalls marshmallow 3.5.2 and installs marshmallow 2.21.0.
Then when I try to run great_expectations (even just great_expectations --version), I get:
ImportError: cannot import name 'INCLUDE' from 'marshmallow'
If, after installing apache-airflow, I uninstall and reinstall great_expectations, it uninstalls marshmallow 2.21.0 and installs marshmallow 3.5.2.
Then when I try to import great_expectations in a Python script, it attempts to fill up the DagBag and write grammar tables, which fails. I can still run Airflow, but it gives the same “writing failed” message every time.
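A quick way to confirm which side of the conflict an environment is on is to check the installed marshmallow version before importing GE. This is only a sketch: the helper names are made up for illustration, and it assumes Python 3.8+ for importlib.metadata.

```python
import re
from typing import Optional


def marshmallow_major(version: str) -> int:
    """Return the major version number from a string such as '3.5.2'."""
    match = re.match(r"(\d+)\.", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return int(match.group(1))


def supports_include(version: str) -> bool:
    """marshmallow 3.x exports INCLUDE; the 2.x line does not."""
    return marshmallow_major(version) >= 3


def installed_marshmallow_version() -> Optional[str]:
    """Return the installed marshmallow version, or None if it is absent."""
    from importlib import metadata  # assumption: Python 3.8 or newer

    try:
        return metadata.version("marshmallow")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_marshmallow_version()
    if version is None:
        print("marshmallow is not installed")
    elif supports_include(version):
        print(f"marshmallow {version}: 'INCLUDE' should be importable")
    else:
        print(f"marshmallow {version}: expect the ImportError on 'INCLUDE'")
```

Running it right after each pip install step shows when the version flips between the 2.x and 3.x lines.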
Steps to reproduce the behavior:
1. pip install great_expectations
2. pip install apache-airflow
3. great_expectations --version
4. pip uninstall great_expectations
5. pip install great_expectations
6. In a Python script: import great_expectations as ge
It sounds like the workaround right now is to not run them in the same virtualenv, although a GE package compatible with marshmallow 2.3 might be possible.
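For anyone going the separate-virtualenv route, one possible shape is to call the GE CLI from the Airflow side as a subprocess, so the two packages never share a marshmallow install. A rough sketch, assuming a POSIX virtualenv layout; the path /opt/ge-venv and the helper names are hypothetical:

```python
import subprocess
from pathlib import PurePosixPath
from typing import List, Sequence


def build_ge_command(venv_dir: str, args: Sequence[str]) -> List[str]:
    """Build the argv for the great_expectations CLI installed in venv_dir.

    Assumes a POSIX virtualenv, where console scripts live in <venv>/bin.
    """
    executable = PurePosixPath(venv_dir) / "bin" / "great_expectations"
    return [str(executable), *args]


def run_ge(venv_dir: str, args: Sequence[str]) -> subprocess.CompletedProcess:
    """Run the GE CLI from its own virtualenv and capture its output."""
    return subprocess.run(
        build_ge_command(venv_dir, args),
        capture_output=True,
        text=True,
        check=False,  # let the caller inspect returncode instead of raising
    )


if __name__ == "__main__":
    # Hypothetical location of a virtualenv that only has GE installed.
    completed = run_ge("/opt/ge-venv", ["--version"])
    print(completed.returncode, completed.stdout or completed.stderr)
```

The console script in the virtualenv's bin directory runs with that virtualenv's interpreter, so no activation step is needed.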
It seems that the incompatibility is resolved in some versions of Airflow. I was unable to set up GE with Airflow 1.10.6, but it did work with Airflow 1.10.3. Checking with the latest version of Airflow (1.10.11), it seems like there are no constraints on the marshmallow version.
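To see what a given Airflow install actually declares, the package metadata can be inspected directly. This is a sketch rather than an official tool: the function names are made up, and it assumes Python 3.8+ for importlib.metadata.

```python
import re
from typing import Iterable, List

# Match the marshmallow package itself, not marshmallow-enum or similar.
_MARSHMALLOW = re.compile(r"^marshmallow(?![A-Za-z0-9._-])", re.IGNORECASE)


def marshmallow_constraints(requirements: Iterable[str]) -> List[str]:
    """Keep only the requirement strings that constrain marshmallow."""
    return [req for req in requirements if _MARSHMALLOW.match(req.strip())]


def declared_constraints(distribution: str) -> List[str]:
    """Return the marshmallow requirements declared by an installed package."""
    from importlib import metadata  # assumption: Python 3.8 or newer

    try:
        requires = metadata.requires(distribution) or []
    except metadata.PackageNotFoundError:
        return []
    return marshmallow_constraints(requires)


if __name__ == "__main__":
    for name in ("apache-airflow", "great_expectations"):
        print(name, declared_constraints(name))
```

An empty list means either the package is not installed or it declares no direct marshmallow requirement.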
We just ran into this same issue ourselves and solved it by wrapping GE inside a Docker image, which should also prevent future incompatibility issues. Here’s our Dockerfile:
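# Slim Python 3.7 base image
FROM python:3.7.7-slim
# git is installed as a system dependency
RUN apt-get update && apt-get install -y git
# Great Expectations (any 0.x release from 0.11 on) plus boto3
RUN pip install great_expectations~=0.11 boto3
# Copy in the GE project directory and our scripts
COPY ./great_expectations/ /app/great_expectations
COPY ./scripts/ /app/
WORKDIR /app
# Default command: show the validation script's help text
CMD ["./validate_data", "--help"]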
Beyond decoupling the dependencies, we get a couple of other benefits out of this:
We’ve been able to customize the validation operator. In a deployed setting it’s great to be able to run our validations, but when a validation fails and we want to dig into the cause, we want to quickly see some output in the logs. Our script prints a link to the generated data docs as well as the failed validation metadata.
We are able to run this in tandem with the regular CLI so we can easily develop locally without even thinking about the Docker image. It’s only used in a deployed context.
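For readers wondering what the failure reporting in the first point might look like, here is a rough sketch. It is not the validate_data script from the image above. It assumes the validation result has already been converted to a plain dictionary with the usual "results", "success" and "expectation_config" keys, and the function names and the data docs URL are hypothetical.

```python
from typing import Any, Dict, List


def failed_expectations(validation_result: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Collect the metadata of every expectation that did not succeed."""
    failures = []
    for item in validation_result.get("results", []):
        if item.get("success"):
            continue
        config = item.get("expectation_config", {})
        failures.append(
            {
                "expectation_type": config.get("expectation_type"),
                "kwargs": config.get("kwargs", {}),
                "result": item.get("result", {}),
            }
        )
    return failures


def format_report(validation_result: Dict[str, Any], data_docs_url: str) -> str:
    """Build a log-friendly report: data docs link first, then each failure."""
    lines = [f"Data docs: {data_docs_url}"]
    for failure in failed_expectations(validation_result):
        lines.append(
            f"FAILED {failure['expectation_type']} "
            f"kwargs={failure['kwargs']} result={failure['result']}"
        )
    return "\n".join(lines)
```

Keeping the formatting in a pure function like this makes it easy to print the same report from both the deployed image and a local run.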