Hello,
I am using the notebooks produced by the CLI and trying to adapt them to run in Databricks, so I run the following:
import great_expectations as gx
import pandas as pd
import great_expectations.jupyter_ux
from great_expectations.core.batch import BatchRequest
from great_expectations.checkpoint import SimpleCheckpoint
from great_expectations.exceptions import DataContextError
from great_expectations.cli.datasource import sanitize_yaml_and_save_datasource, check_if_datasource_name_exists
from great_expectations.data_context import FileDataContext
from pyspark.sql import SparkSession
from great_expectations.datasource import SparkDatasource
from great_expectations.execution_engine.sparkdf_execution_engine import SparkDFExecutionEngine
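# Create (or load) a file-backed Data Context at the given project root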
context_root_dir = 'path'
context = FileDataContext.create(project_root_dir=context_root_dir)
datasource_name = "my_datasource"
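# Spark datasource configuration, mirroring the YAML generated by the CLI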
example_yaml = f"""
name: {datasource_name}
class_name: Datasource
execution_engine:
  class_name: SparkDFExecutionEngine
data_connectors:
  default_inferred_data_connector_name:
    class_name: InferredAssetFilesystemDataConnector
    base_directory: ..
    default_regex:
      group_names:
        - data_asset_name
      pattern: (.*)
  default_runtime_data_connector_name:
    class_name: RuntimeDataConnector
    assets:
      my_runtime_asset_name:
        batch_identifiers:
          - runtime_batch_identifier_name
"""
context.test_yaml_config(yaml_config=example_yaml)
sanitize_yaml_and_save_datasource(context, example_yaml, overwrite_existing=True)
context.list_datasources()
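# Load the table to validate as a Spark DataFrame and convert it to pandas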
df_spark = spark.table(table_name_to_validate)
df_pandas = df_spark.toPandas()
df_pandas_gx = gx.from_pandas(df_pandas)
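# Attempt to create the Spark datasource and attach the dataframe as an asset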
dataframe_datasource = SparkDatasource(name="my_datasource")
data_asset = dataframe_datasource.add_dataframe_asset(
name="dataframe_asset",
dataframe=df_pandas_gx
)
The line defining dataframe_datasource, whether as SparkDatasource or as
dataframe_datasource = context.sources.add_or_update_spark(name="case_list_datasource")
produces an error:
ImportError: cannot import name 'SparkDatasource' from 'great_expectations.datasource'
so the datasource is not properly created, the data asset cannot be added, and consequently the validator cannot run.
In the CLI, the datasource and data asset are declared manually, so what is the corresponding command in Python?
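For reference, this is the flow I am trying to reproduce, pieced together from the fluent datasource documentation. The datasource, asset, and suite names here are placeholders, and I am assuming a GX version in which context.sources.add_or_update_spark exists; depending on the release, the dataframe seems to be passed either to add_dataframe_asset or to build_batch_request:

# Hedged sketch only: placeholder names, assuming a GX release with the
# fluent "sources" API; not verified against my installed version.
dataframe_datasource = context.sources.add_or_update_spark(name="case_list_datasource")
data_asset = dataframe_datasource.add_dataframe_asset(name="dataframe_asset")

# Build a batch request directly from the in-memory Spark dataframe
batch_request = data_asset.build_batch_request(dataframe=df_spark)

# Create an (empty) expectation suite and get a validator for the batch
context.add_or_update_expectation_suite(expectation_suite_name="my_suite")
validator = context.get_validator(
    batch_request=batch_request,
    expectation_suite_name="my_suite",
)

Is something like this the intended replacement for the CLI steps?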
Thank you,
Katerina