GE with Airflow -- Marshmallow incompatibility

I’m running into the following incompatibility between GE and Airflow:
Great Expectations requires marshmallow > 3.0, while Airflow wants marshmallow < 3.0.
Specifically, pip install great_expectations installs marshmallow 3.5.2, and pip install apache-airflow then uninstalls marshmallow 3.5.2 and installs marshmallow 2.21.0.
Then when I try to run great_expectations (even just great_expectations --version), I get:
ImportError: cannot import name 'INCLUDE' from 'marshmallow'
If, after installing apache-airflow, I uninstall and reinstall great_expectations, it uninstalls marshmallow 2.21.0 and installs marshmallow 3.5.2.
Then when I try to import great_expectations in a Python script, it attempts to fill up the DagBag and write grammar tables, which fails. I can still run Airflow; it just gives the same “writing failed” message every time.

Steps to reproduce the behavior:

  1. pip install great_expectations
  2. pip install apache-airflow
  3. great_expectations --version
  4. pip uninstall great_expectations
  5. pip install great_expectations
  6. In a Python script: import great_expectations as ge

It sounds like the workaround for now is to not run them in the same virtualenv, although a GE package compatible with marshmallow 2.3 might be possible.
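
For anyone who wants to check which side of the conflict their environment landed on, here is a minimal sketch (not part of GE or Airflow). It reads the installed marshmallow version and reports whether it is a 3.x release, which is what the failing INCLUDE import needs. The function names are made up for illustration, and importlib.metadata assumes Python 3.8 or newer.

```python
from importlib import metadata  # standard library, Python 3.8+


def marshmallow_major(version):
    """Return the major version number from a string like '3.5.2'."""
    return int(version.split(".")[0])


def works_with_ge(version):
    """True if this marshmallow version is 3.x or newer (has marshmallow.INCLUDE)."""
    return marshmallow_major(version) >= 3


def installed_marshmallow():
    """Return the installed marshmallow version string, or None if it is absent."""
    try:
        return metadata.version("marshmallow")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_marshmallow()
    if found is None:
        print("marshmallow is not installed")
    elif works_with_ge(found):
        print("marshmallow", found, "is a 3.x release, as great_expectations expects")
    else:
        print("marshmallow", found, "is older than 3.0, so the INCLUDE import will fail")
```

Running this right after step 2 and again after step 5 should show the version flipping between 2.21.0 and 3.5.2, assuming pip behaves as described above.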

The best answer at the moment is to not use Airflow and Great Expectations in the same virtual environment.

This could take the shape of virtual environments you manage yourself, or perhaps the airflow.operators.python_operator.PythonVirtualenvOperator: https://airflow.apache.org/docs/stable/_api/airflow/operators/python_operator/index.html#airflow.operators.python_operator.PythonVirtualenvOperator
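
A rough sketch of the PythonVirtualenvOperator route, assuming the Airflow 1.10 import path from the link above. The task id, function names, and the unpinned requirement are placeholders, not anything prescribed by GE or Airflow. The callable imports great_expectations inside its body because the operator runs it in a freshly built virtualenv, and system_site_packages=False keeps Airflow's own marshmallow out of that environment.

```python
# Packages installed into the throwaway virtualenv (placeholder: pin as needed).
GE_REQUIREMENTS = ["great_expectations"]


def run_ge_in_venv():
    """Runs inside the isolated virtualenv, so imports must live in here."""
    import great_expectations as ge

    print("great_expectations version:", ge.__version__)
    # Real validation logic would go here.


def build_ge_task(dag):
    """Attach the isolated GE task to an existing DAG (Airflow 1.10 import path)."""
    # Imported lazily so this module can be loaded without Airflow installed.
    from airflow.operators.python_operator import PythonVirtualenvOperator

    return PythonVirtualenvOperator(
        task_id="ge_validation",       # hypothetical task name
        python_callable=run_ge_in_venv,
        requirements=GE_REQUIREMENTS,
        system_site_packages=False,    # do not inherit Airflow's marshmallow
        dag=dag,
    )
```

In a DAG file this would be used as build_ge_task(dag) after the DAG object is created. The virtualenv is rebuilt on every task run, so expect some install overhead.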

Internally, Great Expectations uses features found only in recent versions of marshmallow, and we are not likely to reverse that decision in the near future.


It seems that the incompatibility is resolved in some versions of Airflow. I was unable to set up GE with Airflow 1.10.6, but it did work with Airflow 1.10.3. Checking with the latest version of Airflow (1.10.11), it seems like there are no constraints on the marshmallow version.


We just ran into this same issue ourselves and ended up solving it by wrapping GE inside a Docker image, which should prevent any future incompatibility issues as well. Here’s our Dockerfile:

FROM python:3.7.7-slim

RUN apt-get update && apt-get install -y git

RUN pip install great_expectations~=0.11 boto3

COPY ./great_expectations/ /app/great_expectations
COPY ./scripts/ /app/
WORKDIR /app

CMD ["./validate_data", "--help"]

A couple of benefits we get out of this, beyond decoupling the dependencies:

  • We’ve been able to customize the validation operator. In a deployed setting it’s great to be able to run our validations, but when a validation fails and we want to dig into the issue, we want to quickly see some output in the logs. Our script prints a link to the generated data docs as well as the failed validation metadata.
  • We are able to run this in tandem with the regular CLI so we can easily develop locally without even thinking about the Docker image. It’s only used in a deployed context.

As of version 0.12.1, GE now bundles the marshmallow code, so there is no longer a dependency on the package. This should resolve a lot of the issues we’ve seen here! https://greatexpectationstalk.slack.com/archives/CUUN5R2RZ/p1599087206011000


@dlc Would you mind sharing your validate_data script?

I was so excited to see this. Thank you!
