We’re looking to hear anyone’s experience using Great Expectations with ClickHouse. Any tips you can share with the community would be awesome!
Hi I’m currently working on using GE with Clickhouse with further integration with Airflow. I successfully managed to connect to CH via conn string clickhouse+native://{user}:{password}@{hostname}:{port}/?session_id=‘qwerty123’
But I’m stuck with executing f.i. batch.head (the problem is described in my post).
Hi, you probably have solved the issue. However, here is how it could be fixed.
On the step creating BatchRequest in the kwargs define the following property:
"batch_spec_passthrough": {"create_temp_table": False}
Hi. Great tool. I’m making myself familiar with great expectations and using clickhouse for in-house analytics.
The issue I face so far is that many expectations use the same alias for multiple columns in SQL query.
For instance, trying to check expect_column_values_to_not_be_null produce the query like that:
SELECT sum(CASE WHEN (column_1 IS NULL) THEN 1 ELSE 0 END) AS "column_values.nonnull.unexpected_count",
sum(CASE WHEN (column_2 IS NULL) THEN 1 ELSE 0 END) AS "column_values.nonnull.unexpected_count"
FROM (SELECT *
FROM schema.my_table
WHERE 1 = 1) AS great_expectations_sub_selection
ClickHouse does not allow using the same alias for different columns and throws an exception. If anyone knows how to avoid it will be much appreciated for any help.
Hey @Sergii
This expectation works well for me.
validator.expect_column_values_to_not_be_null(“type”)
and the result:
{
“success”: true,
“meta”: {},
“result”: {
“element_count”: 8762,
“unexpected_count”: 0,
“unexpected_percent”: 0.0,
“partial_unexpected_list”: []
},
“exception_info”: {
“raised_exception”: false,
“exception_traceback”: null,
“exception_message”: null
}
}
What GE and ClickHouse version?
I have GE 0.14.5 and CH 21.11.4.14.
But I faced this problem while saving the results in ClickHouse ValidationStore:
It tries to make cross join and fails:
SELECT count(value) AS count_1
FROM ge_validations_store
, ge_validations_store
WHERE ge_validations_store.expectation_suite_name = ‘test_suite’
AND ge_validations_store.run_name = ‘20220205-153837-my-run-name-template’
AND ge_validations_store.run_time = ‘20220205T153837.328884Z’
AND ge_validations_store.batch_identifier = ‘6e263a1470219e3ef29360ea86dd4508’
Greetings @mrkoloev
-
Running the checkpoint with validations against more than 1 column produced the ‘MULTIPLE_ALIASES_FOR_EXPRESSION’ exception. Now it is quick-fixed in a forked repo by randomizing the aliases and works well.
-
Your problem is very likely the same our team faced recently with SQLAlchemy. It was fixed by explicitly defining ALL the columns that are used in a query. Also defining the key column as well.
Otherwise, SQLAlchemy produced the cross joins.
Not much help, you might expect. However, try debugging the issue near these lines of code.
great_expectations/validations_store.py at 342abb4cffc71435059a8af62562527a5d8a99ae · great-expectations/great_expectations · GitHub
and
great_expectations/database_store_backend.py at 342abb4cffc71435059a8af62562527a5d8a99ae · great-expectations/great_expectations · GitHub
Good luck.
Wow, thank you! Hope this helps.
Hi there!
Is anyone else having issues with expectations expect_column_values_to_be_in_type_list and expect_column_quantile_values_to_be_between with ClickHouse?
Regarding expectation expect_column_values_to_be_in_type_list I’ve opened this issue Expectation expect_column_values_to_be_in_type_list not working as expected · Issue #5832 · great-expectations/great_expectations · GitHub
For the expect_column_quantile_values_to_be_between expectation I’m getting a Syntax Error
**`Orig exception: Code: 62.
DB::Exception: Syntax error: failed at position 43 ('(') (line 1, col
43): (ORDER BY "AVGLISTINGPOSITION" ASC) AS anon_1,
percentile_disc(0.25) WITHIN GROUP (ORDER BY "AVGLISTINGPOSITION" ASC)
AS anon_2, percentile_disc(0.5) WITHIN GRO`**
Using GE version: 0.15.19 and CH 20.7.2.30
Hi @mrkoloev .
Can you tell me please what particular version of GX are you using? We are trying to setup clickhouse connection, but unsuccessful for a pity using the latest version. with
clickhouse://{user}:{password}@{hostname}:{port}/?session_id=‘qwerty123’
the db connection was success but in expectation suite there is an error no session id
hey @Halina . I no longer use GX. It was quite difficult to integrate gx with CH so we moved to sodacore.
For gx try mysql connection with 9004 port. MySQL Interface | ClickHouse Docs
Hello. Try using a user with the CREATE TEMPORARY TABLE permissions granted.
Another workaround would be - to edit the library source code and make all create_temp_table
variables = False as default.
Or define the create_temp_table = False in the BatchRequest config.
WITHIN GROUP
clause seems to be not supported by Clickhouse. Thus I would exclude this expectation.
Hey @Halina
I’m facing the same issue with ClickHouse and tried the solution which @Sergii described, but no luck.
But I checked in clickhouse-sqlalchemy pypi docs (Connection configuration — clickhouse-sqlalchemy 0.2.1 documentation) that it’s a way to inform to SQLAlchemy that you need to use Sessions, like this.
from sqlalchemy import create_engine
from requests import Session
engine = create_engine(
'clickhouse+http://localhost/test',
connect_args={'http_session': Session()}
)
I read the GE code about how to invoke SQLAlchemy create_engine method and there’s no option for using connect_args option, but I think we can discover a solution for this, if you didn’t solve yet.
Hi, @fulviomascara
I actually did the fix for sqlalchemy-clickhouse library
I tried to persuade the plugin author to merge it, but then we have switched to openmetadata.
Hi @Halina,
Could you use your version internally, specifying the code below?
"batch_spec_passthrough": {"create_temp_table": False}
Or is there another way to use your version with GE?
Thanks a lot !