GE run fails on AWS Airflow with OSError [Errno 30] Read-only file system: '/usr/local/airflow/dags/data-quality/gx'

Hi,

We are upgrading Great Expectations to 0.18.10 and are currently blocked by a read-only file system error thrown while the Data Context is instantiated.

Can someone please help us resolve this error?

config file:

datasources:   
  sandbox:
    data_connectors:
      default_runtime_data_connector_name:
        batch_identifiers:
          - default_identifier_name
        name: default_runtime_data_connector_name
        class_name: RuntimeDataConnector
        module_name: great_expectations.datasource.data_connector
      default_inferred_data_connector_name:
        include_schema_name: true
        name: default_inferred_data_connector_name
        class_name: InferredAssetSqlDataConnector
        module_name: great_expectations.datasource.data_connector
    execution_engine:
      credentials:
        host: ${host}
        port: 4343
        username: secret
        password: secret
        database: db
        query:
          sslmode: prefer
        drivername: postgresql+psycopg2
      module_name: great_expectations.execution_engine
      class_name: SqlAlchemyExecutionEngine
    module_name: great_expectations.datasource
    class_name: Datasource

config_variables_file_path: ${AIRFLOW_HOME}/dags/data-quality/config_variables/sandbox/config_variables.yml


stores:
  expectations_store:
    class_name: ExpectationsStore
    store_backend:
      class_name: TupleFilesystemStoreBackend
      base_directory: expectations/

  validations_filesystem_store:
    class_name: ValidationsStore
    store_backend:
      class_name: TupleS3StoreBackend
      bucket: data-sandbox-great-expectations
      prefix: data-quality/validations/

  evaluation_parameter_store:
    class_name: EvaluationParameterStore
  checkpoint_store:
    class_name: CheckpointStore
    store_backend:
      class_name: TupleFilesystemStoreBackend
      suppress_store_backend_id: true
      base_directory: checkpoints/

  profiler_store:
    class_name: ProfilerStore
    store_backend:
      class_name: TupleFilesystemStoreBackend
      suppress_store_backend_id: true
      base_directory: /tmp/great_expectations/profilers/

expectations_store_name: expectations_store
validations_store_name: validations_filesystem_store
evaluation_parameter_store_name: evaluation_parameter_store
checkpoint_store_name: checkpoint_store

data_docs_sites:
  local_site:
    class_name: SiteBuilder
    show_how_to_buttons: true
    store_backend:
      class_name: TupleS3StoreBackend
      bucket: data-sandbox-great-expectations
      prefix: data-quality/data_docs/local_site/
    site_index_builder:
      class_name: DefaultSiteIndexBuilder

anonymous_usage_statistics:
  data_context_id: ddf20563-8d61-4ebe-9814-2hg5115c4710
  enabled: false
notebooks:
include_rendered_content:
  globally: false
  expectation_suite: false
  expectation_validation_result: false
progress_bars:
  globally: false
  profilers: false
  metric_calculations: false

Error Logs:

[2024-04-12, 03:09:46 UTC] {{taskinstance.py:2480}} INFO - Exporting env vars: AIRFLOW_CTX_DAG_OWNER='airflow' AIRFLOW_CTX_DAG_ID='data-model-quality' AIRFLOW_CTX_TASK_ID='accounts' AIRFLOW_CTX_EXECUTION_DATE='2024-04-12T03:08:55.420641+00:00' AIRFLOW_CTX_TRY_NUMBER='1' AIRFLOW_CTX_DAG_RUN_ID='manual__2024-04-12T03:08:55.420641+00:00'
[2024-04-12, 03:09:46 UTC] {{great_expectations.py:580}} INFO - Running validation with Great Expectations...
[2024-04-12, 03:09:46 UTC] {{great_expectations.py:582}} INFO - Instantiating Data Context...
[2024-04-12, 03:09:46 UTC] {{taskinstance.py:2698}} ERROR - Task failed with exception
Traceback (most recent call last):
  File "/usr/local/airflow/.local/lib/python3.11/site-packages/airflow/models/taskinstance.py", line 433, in _execute_task
    result = execute_callable(context=context, **execute_callable_kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/airflow/.local/lib/python3.11/site-packages/great_expectations_provider/operators/great_expectations.py", line 586, in execute
    self.data_context = ge.data_context.FileDataContext(
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/airflow/.local/lib/python3.11/site-packages/great_expectations/data_context/data_context/file_data_context.py", line 64, in __init__
    self._scaffold_project()
  File "/usr/local/airflow/.local/lib/python3.11/site-packages/great_expectations/data_context/data_context/file_data_context.py", line 98, in _scaffold_project
    self._scaffold(
  File "/usr/local/airflow/.local/lib/python3.11/site-packages/great_expectations/data_context/data_context/serializable_data_context.py", line 219, in _scaffold
    gx_dir.mkdir(parents=True, exist_ok=True)
  File "/usr/local/lib/python3.11/pathlib.py", line 1116, in mkdir
    os.mkdir(self, mode)
OSError: [Errno 30] Read-only file system: '/usr/local/airflow/dags/data-model-schema-data-quality/gx'
[2024-04-12, 03:09:46 UTC] {{taskinstance.py:1138}} INFO - Marking task as FAILED. dag_id=data-model-quality, task_id=accounts, execution_date=20240412T030855, start_date=20240412T030946, end_date=20240412T030946
[2024-04-12, 03:09:47 UTC] {{standard_task_runner.py:107}} ERROR - Failed to execute job 37515 for task accounts ([Errno 30] Read-only file system: '/usr/local/airflow/dags/data-quality/gx'; 3080)
[2024-04-12, 03:09:47 UTC] {{local_task_job_runner.py:234}} INFO - Task exited with return code 1
[2024-04-12, 03:09:47 UTC] {{taskinstance.py:3280}} INFO - 0 downstream tasks scheduled from follow-on schedule check

The line that throws the error, gx_dir.mkdir(parents=True, exist_ok=True), is trying to create the root directory.

It seems that GX is not finding the project root directory for some reason.
Are the GX project files stored on the machine you are running this on?
If yes, what’s the folder structure?

If the root folder exists just fine, you can try specifying project_root_dir in gx.get_context()

Something like gx.get_context(project_root_dir="path/to/project"), where the project directory is the one that contains the gx folder.

The root directory is either “gx” or “great_expectations”, depending on your GX version.
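
For example, a minimal sketch (assuming your project lives under the DAGs folder; adjust the path to your layout):

import great_expectations as gx

# project_root_dir should point at the folder that CONTAINS the gx/
# (or great_expectations/) directory, not at the gx folder itself
context = gx.get_context(project_root_dir="/usr/local/airflow/dags/data-quality")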

Thank you for your reply. Despite checking everything you suggested, we are still experiencing the issue.

Our version information: Great Expectations 0.18.10.

I double-checked the following:

  1. The gx directory is located in the correct folder.
  2. .gitignore already exists in the gx folder.

DAG file code:

import datetime
import os
from functools import partial
from pathlib import Path
from typing import List
import shutil
import uuid

import pendulum
import great_expectations as gx
from airflow import DAG
from airflow.models import DagRun, TaskInstance
from airflow.operators.python import PythonOperator
from airflow.utils.state import TaskInstanceState
from airflow.utils.trigger_rule import TriggerRule
from great_expectations_provider.operators.great_expectations import (
    GreatExpectationsOperator,
    GreatExpectationsDataDocsLink,
)
from airflow.operators.empty import EmptyOperator
from pycommons.airflow.settings import EnvVars, get_timezone
from pycommons.airflow.utils import callback_notification, send_notification

env_vars = EnvVars.load()
base_path = Path(__file__).parents[1]
ge_root_dir = os.path.join(base_path, "gx")


def print_dir_configuration():
    # Body omitted in the original post; assumed to log the DAG folder layout.
    ...


def check_checkpoint_execution():
    # Body omitted in the original post; assumed to inspect checkpoint results.
    ...

with DAG(
    dag_id="gx-dag-file",
    schedule_interval="30 5 * * 1-5",
    start_date=pendulum.datetime(2023, 2, 7, 7, tz=get_timezone(env_vars)),
    catchup=False,
    dagrun_timeout=datetime.timedelta(minutes=120),
    tags=["great-expectation", "airflow", "dataquality", "raw"],
    on_failure_callback=partial(
        callback_notification,
        subject="[ERROR] The Great-expectations AirFlow to load RAW layer data from Redshift failed",
    ),
) as dag:
    review_dag_folder = PythonOperator(
        task_id="review_dag_folder_configuration",
        python_callable=print_dir_configuration,
        trigger_rule=TriggerRule.ALL_DONE,
    )

    email_accounts = GreatExpectationsOperator(
        task_id="email_accounts",
        data_context_root_dir=ge_root_dir,
        checkpoint_name="raw.email_accounts_chk",
        return_json_dict=True,
        fail_task_on_validation_failure=True,
    )

    finish_checks = PythonOperator(
        task_id="finish_data_quality_checks",
        python_callable=check_checkpoint_execution,
        trigger_rule=TriggerRule.ALL_DONE,
    )


(
    review_dag_folder
    >> [
        email_accounts
     ]
    >> finish_checks
)

Latest Log Messages

ip-172-20-83-2.ap-southeast-2.compute.internal
*** Reading remote log from Cloudwatch log_group: airflow-redshift-data-airflow-testing_env-Task log_stream: dag_id=dag-file-quality/run_id=manual__2024-04-24T03_24_48.087894+00_00/task_id=email_accounts/attempt=1.log.
[2024-04-24T03:25:13.676+0000] {{taskinstance.py:1956}} INFO - Dependencies all met for dep_context=non-requeueable deps ti=<TaskInstance: dag-file-quality.email_accounts manual__2024-04-24T03:24:48.087894+00:00 [queued]>
[2024-04-24T03:25:13.701+0000] {{taskinstance.py:1956}} INFO - Dependencies all met for dep_context=requeueable deps ti=<TaskInstance: dag-file-quality.email_accounts manual__2024-04-24T03:24:48.087894+00:00 [queued]>
[2024-04-24T03:25:13.702+0000] {{taskinstance.py:2170}} INFO - Starting attempt 1 of 1
[2024-04-24T03:25:13.740+0000] {{taskinstance.py:2191}} INFO - Executing <Task(GreatExpectationsOperator): email_accounts> on 2024-04-24 03:24:48.087894+00:00
[2024-04-24T03:25:13.747+0000] {{standard_task_runner.py:60}} INFO - Started process 22022 to run task
[2024-04-24T03:25:13.753+0000] {{standard_task_runner.py:87}} INFO - Running: ['airflow', 'tasks', 'run', 'dag-file-quality', 'email_accounts', 'manual__2024-04-24T03:24:48.087894+00:00', '--job-id', '38995', '--raw', '--subdir', 'DAGS_FOLDER/gx-dag-file/dags/dag-file-quality.py', '--cfg-path', '/tmp/tmpey_y8cdh']
[2024-04-24T03:25:13.755+0000] {{standard_task_runner.py:88}} INFO - Job 38995: Subtask email_accounts
[2024-04-24T03:25:14.220+0000] {{taskinstance.py:2480}} INFO - Exporting env vars: AIRFLOW_CTX_DAG_OWNER='airflow' AIRFLOW_CTX_DAG_ID='dag-file-quality' AIRFLOW_CTX_TASK_ID='email_accounts' AIRFLOW_CTX_EXECUTION_DATE='2024-04-24T03:24:48.087894+00:00' AIRFLOW_CTX_TRY_NUMBER='1' AIRFLOW_CTX_DAG_RUN_ID='manual__2024-04-24T03:24:48.087894+00:00'
[2024-04-24T03:25:14.221+0000] {{great_expectations.py:580}} INFO - Running validation with Great Expectations...
[2024-04-24T03:25:14.222+0000] {{great_expectations.py:582}} INFO - Instantiating Data Context...
[2024-04-24T03:25:14.223+0000] {{taskinstance.py:2698}} ERROR - Task failed with exception
Traceback (most recent call last):
  File "/usr/local/airflow/.local/lib/python3.11/site-packages/great_expectations/data_context/data_context/serializable_data_context.py", line 295, in _scaffold_directories
    cls._scaffold_gitignore(base_dir)
  File "/usr/local/airflow/.local/lib/python3.11/site-packages/great_expectations/data_context/data_context/serializable_data_context.py", line 340, in _scaffold_gitignore
    with gitignore.open("a") as f:
         ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/pathlib.py", line 1044, in open
    return io.open(self, mode, buffering, encoding, errors, newline)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
PermissionError: [Errno 13] Permission denied: '/usr/local/airflow/dags/gx-dag-file/gx/.gitignore'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/airflow/.local/lib/python3.11/site-packages/airflow/models/taskinstance.py", line 433, in _execute_task
    result = execute_callable(context=context, **execute_callable_kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/airflow/.local/lib/python3.11/site-packages/great_expectations_provider/operators/great_expectations.py", line 586, in execute
    self.data_context = ge.data_context.FileDataContext(
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/airflow/.local/lib/python3.11/site-packages/great_expectations/data_context/data_context/file_data_context.py", line 64, in __init__
    self._scaffold_project()
  File "/usr/local/airflow/.local/lib/python3.11/site-packages/great_expectations/data_context/data_context/file_data_context.py", line 98, in _scaffold_project
    self._scaffold(
  File "/usr/local/airflow/.local/lib/python3.11/site-packages/great_expectations/data_context/data_context/serializable_data_context.py", line 220, in _scaffold
    cls._scaffold_directories(gx_dir)
  File "/usr/local/airflow/.local/lib/python3.11/site-packages/great_expectations/data_context/data_context/serializable_data_context.py", line 297, in _scaffold_directories
    raise gx_exceptions.GitIgnoreScaffoldingError(
great_expectations.exceptions.exceptions.GitIgnoreScaffoldingError: Could not create .gitignore in /usr/local/airflow/dags/gx-dag-file/gx because of an error: [Errno 13] Permission denied: '/usr/local/airflow/dags/gx-dag-file/gx/.gitignore'
[2024-04-24T03:25:14.246+0000] {{taskinstance.py:1138}} INFO - Marking task as FAILED. dag_id=dag-file-quality, task_id=email_accounts, execution_date=20240424T032448, start_date=20240424T032513, end_date=20240424T032514
[2024-04-24T03:25:14.274+0000] {{standard_task_runner.py:107}} ERROR - Failed to execute job 38995 for task email_accounts (Could not create .gitignore in /usr/local/airflow/dags/gx-dag-file/gx because of an error: [Errno 13] Permission denied: '/usr/local/airflow/dags/gx-dag-file/gx/.gitignore'; 22022)
[2024-04-24T03:25:14.606+0000] {{local_task_job_runner.py:234}} INFO - Task exited with return code 1
[2024-04-24T03:25:14.644+0000] {{taskinstance.py:3280}} INFO - 0 downstream tasks scheduled from follow-on schedule check

It works fine if we create a custom operator that inherits from GreatExpectationsOperator and GreatExpectationsDataDocsLink. Could there be an issue in GreatExpectationsOperator itself?

class MyGXOperator(GreatExpectationsOperator, GreatExpectationsDataDocsLink):
    def __init__(self, *args, **kwargs):
        self.unique_id = uuid.uuid4()
        self.source_dir = ge_root_dir
        # Stage the project under /tmp, which is writable, unlike the DAGs folder
        self.target_dir = f"/tmp/{self.unique_id}/{self.source_dir.split('/')[-2]}/{self.source_dir.split('/')[-1]}"
        # Assumed step (omitted in the original post): copy the GX project
        # into the writable location before pointing the operator at it
        shutil.copytree(self.source_dir, self.target_dir)
        kwargs['data_context_root_dir'] = self.target_dir
        super().__init__(*args, **kwargs)
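
For reference, a sketch of how this operator would be used in the DAG above (since __init__ injects data_context_root_dir itself, the caller omits it):

email_accounts = MyGXOperator(
    task_id="email_accounts",
    checkpoint_name="raw.email_accounts_chk",
    return_json_dict=True,
    fail_task_on_validation_failure=True,
)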

Hi!

Great that the custom operator works!

I’m not familiar with Airflow or DAGs, so I’m simply going by what the error message says.
As you probably noticed, GX is trying to create a .gitignore file but doesn’t have write permission to the folder. Have you checked that whatever process runs GX has permission to create files there?
Regardless, it’s odd that the stock operator fails while the custom one works, since I’d assume both of them try to create the .gitignore file.
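
If it helps, here is a quick diagnostic sketch (GX_DIR is a stand-in; use whatever path your project lives at) you could run from the same process that executes your tasks, to see whether it can write to the folder:

import os

GX_DIR = "/usr/local/airflow/dags/data-quality/gx"  # stand-in for your project path

# os.access checks the effective permissions of the current process
print("writable:", os.access(GX_DIR, os.W_OK))
print("mode:", oct(os.stat(GX_DIR).st_mode))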

Hi @ToivoMattila - I appreciate your further help in resolving this issue. Despite giving the correct path to the great_expectations project directory and the great_expectations.yml config file, we are still experiencing the permission error.

Why is it creating a .gitignore file? Is there any way to skip creating these files?

Code line throwing an error: context = gx.get_context(project_root_dir=base_path)

[2024-04-30, 05:30:18 AEST] {{taskinstance.py:2191}} INFO - Executing <Task(PythonOperator): train_tickets> on 2024-04-28 19:30:00+00:00
[2024-04-30, 05:30:18 AEST] {{standard_task_runner.py:60}} INFO - Started process 524 to run task
[2024-04-30, 05:30:18 AEST] {{standard_task_runner.py:87}} INFO - Running: ['airflow', 'tasks', 'run', 'tickets_data_quality', 'train_tickets', 'scheduled__2024-04-28T19:30:00+00:00', '--job-id', '39590', '--raw', '--subdir', 'DAGS_FOLDER/data-quality/dags/tickets_data_quality.py', '--cfg-path', '/tmp/tmpmub2p8iz']
[2024-04-30, 05:30:18 AEST] {{standard_task_runner.py:88}} INFO - Job 39590: Subtask train_tickets
[2024-04-30, 05:30:19 AEST] {{taskinstance.py:2480}} INFO - Exporting env vars: AIRFLOW_CTX_DAG_OWNER='airflow' AIRFLOW_CTX_DAG_ID='tickets_data_quality' AIRFLOW_CTX_TASK_ID='train_tickets' AIRFLOW_CTX_EXECUTION_DATE='2024-04-28T19:30:00+00:00' AIRFLOW_CTX_TRY_NUMBER='1' AIRFLOW_CTX_DAG_RUN_ID='scheduled__2024-04-28T19:30:00+00:00'
[2024-04-30, 05:30:19 AEST] {{logging_mixin.py:188}} INFO - ['expectations', '.gitignore', 'plugins', 'great_expectations_local.yml', 'checkpoints', 'great_expectations.yml']
[2024-04-30, 05:30:19 AEST] {{taskinstance.py:2698}} ERROR - Task failed with exception
Traceback (most recent call last):
  File "/usr/local/airflow/.local/lib/python3.11/site-packages/great_expectations/data_context/data_context/serializable_data_context.py", line 295, in _scaffold_directories
    cls._scaffold_gitignore(base_dir)
  File "/usr/local/airflow/.local/lib/python3.11/site-packages/great_expectations/data_context/data_context/serializable_data_context.py", line 340, in _scaffold_gitignore
    with gitignore.open("a") as f:
         ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/pathlib.py", line 1044, in open
    return io.open(self, mode, buffering, encoding, errors, newline)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
PermissionError: [Errno 13] Permission denied: '/usr/local/airflow/dags/data-quality/gx/.gitignore'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/airflow/.local/lib/python3.11/site-packages/airflow/models/taskinstance.py", line 433, in _execute_task
    result = execute_callable(context=context, **execute_callable_kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/airflow/.local/lib/python3.11/site-packages/airflow/operators/python.py", line 199, in execute
    return_value = self.execute_callable()
                   ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/airflow/.local/lib/python3.11/site-packages/airflow/operators/python.py", line 216, in execute_callable
    return self.python_callable(*self.op_args, **self.op_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/airflow/dags/data-model-schema-data-quality/dags/tickets_data_quality.py", line 57, in runGECheck
    context = gx.get_context(project_root_dir=base_path)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/airflow/.local/lib/python3.11/site-packages/great_expectations/data_context/data_context/context_factory.py", line 263, in get_context
    context = _get_context(**kwargs)
              ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/airflow/.local/lib/python3.11/site-packages/great_expectations/data_context/data_context/context_factory.py", line 302, in _get_context
    file_context = _get_file_context(
                   ^^^^^^^^^^^^^^^^^^
  File "/usr/local/airflow/.local/lib/python3.11/site-packages/great_expectations/data_context/data_context/context_factory.py", line 383, in _get_file_context
    return FileDataContext(
           ^^^^^^^^^^^^^^^^
  File "/usr/local/airflow/.local/lib/python3.11/site-packages/great_expectations/data_context/data_context/file_data_context.py", line 64, in __init__
    self._scaffold_project()
  File "/usr/local/airflow/.local/lib/python3.11/site-packages/great_expectations/data_context/data_context/file_data_context.py", line 98, in _scaffold_project
    self._scaffold(
  File "/usr/local/airflow/.local/lib/python3.11/site-packages/great_expectations/data_context/data_context/serializable_data_context.py", line 220, in _scaffold
    cls._scaffold_directories(gx_dir)
  File "/usr/local/airflow/.local/lib/python3.11/site-packages/great_expectations/data_context/data_context/serializable_data_context.py", line 297, in _scaffold_directories
    raise gx_exceptions.GitIgnoreScaffoldingError(
great_expectations.exceptions.exceptions.GitIgnoreScaffoldingError: Could not create .gitignore in /usr/local/airflow/dags/data-quality/gx because of an error: [Errno 13] Permission denied: '/usr/local/airflow/dags/data-quality/gx/.gitignore'