Skip to main content
Version: 1.2.0

ClickHouse catalog

Introduction

Apache Gravitino can manage ClickHouse metadata through a JDBC catalog. This document describes the capabilities and limitations of the ClickHouse catalog, as well as the supported operations and properties for catalogs, schemas, and tables.

caution

ClickHouse catalog is not included in the standard Gravitino server distribution due to the large size of the ClickHouse JDBC driver and potential licensing issues. To use the ClickHouse catalog, you can build from source code, and refer to document how-to-build for more details.

Catalog

Catalog capabilities

ItemDescription
ScopeOne catalog maps to one ClickHouse instance
Metadata/DDLSupports JDBC-based metadata management and DDL
Column defaultsSupports column default values
DriversRequires user-provided ClickHouse JDBC driver in ${GRAVITINO_HOME}/catalogs/jdbc-clickhouse/libs, please download the jar from link
Supported versionAll the codes are tested by ClickHouse 24.8.14, newer versions like 25.x may also work but we did not conduct thorough tests. Report to the community if something does not work as expected.
note

ClickHouse Driver 0.7.1 is recommended for better compatibility. Some older versions of the ClickHouse JDBC driver have issues such as incorrect metadata reporting that may cause problems with Gravitino's metadata management. If you encounter issues, please check the driver version and consider upgrading to 0.7.1 or later. If the problem still persists, please report it to the Gravitino community for further investigation.

Catalog properties

You can pass any JDBC pool property that Gravitino does not define by adding the gravitino.bypass. prefix (for example gravitino.bypass.maxWaitMillis). See commons-dbcp configuration for details.

When using the JDBC catalog you must provide jdbc-url, jdbc-driver, jdbc-user, and jdbc-password. Common catalog properties are listed here; ClickHouse adds no extra catalog-scoped keys.

Configuration itemDescriptionDefault valueRequiredSince Version
jdbc-urlJDBC URL, for example jdbc:clickhouse://localhost:8123(none)Yes1.2.0
jdbc-driverJDBC driver class, for example com.clickhouse.jdbc.ClickHouseDriver(none)Yes1.2.0
jdbc-userJDBC user name(none)Yes1.2.0
jdbc-passwordJDBC password(none)Yes1.2.0
jdbc.pool.min-sizeMinimum pool size2No1.2.0
jdbc.pool.max-sizeMaximum pool size10No1.2.0
jdbc.pool.max-wait-msMax wait time for a connection30000No1.2.0

Create a ClickHouse catalog

The following example creates a ClickHouse catalog with the required JDBC properties and optional connection pool settings. Note that the jdbc-driver class must be available in the Gravitino classpath (for example by placing the ClickHouse JDBC driver JAR in ${GRAVITINO_HOME}/catalogs/jdbc-clickhouse/libs). Description about some of the properties:

  • provider: must be jdbc-clickhouse for Gravitino to recognize the catalog as ClickHouse;
  • type: must be RELATIONAL since ClickHouse is a relational database;
curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \
-H "Content-Type: application/json" -d '{
"name": "ck",
"type": "RELATIONAL",
"comment": "ClickHouse catalog",
"provider": "jdbc-clickhouse",
"properties": {
"jdbc-url": "jdbc:clickhouse://localhost:8123",
"jdbc-driver": "com.clickhouse.jdbc.ClickHouseDriver",
"jdbc-user": "default",
"jdbc-password": "password"
}
}' http://localhost:8090/api/metalakes/metalake/catalogs

See Manage Relational Metadata Using Gravitino for other catalog operations.

Schema

Schema capabilities

ItemDescription
MappingGravitino schema maps to a ClickHouse database
OperationsCreate / drop / load / list (ClickHouse supports cascade drop)
CommentsSchema comments supported
Cluster modeOptional ON CLUSTER for creation when cluster-name is provided

Schema properties

Property NameDescriptionDefault ValueRequiredImmutableSince version
on-clusterUse ON CLUSTER when creating the databasefalseNoNo1.2.0
cluster-nameCluster name used with ON CLUSTER (must align with table-level cluster settings)(none)NoNo1.2.0

Create a schema

curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \
-H "Content-Type: application/json" -d '{
"name": "sales",
"comment": "Sales database",
"properties": {
"on-cluster": "true",
"cluster-name": "ck_cluster"
}
}' http://localhost:8090/api/metalakes/metalake/catalogs/ck/schemas

See Manage Relational Metadata Using Gravitino for more schema operations.

Table

Table capabilities

AreaDetails
MappingGravitino table maps to a ClickHouse table
EnginesLocal engines: MergeTree family (MergeTree default, ReplacingMergeTree, SummingMergeTree, AggregatingMergeTree, CollapsingMergeTree, VersionedCollapsingMergeTree, GraphiteMergeTree), Tiny/Stripe/Log, Memory, File, Null, Set, Join, View, Buffer, KeeperMap, etc. Distributed engine supports cluster mode with remote database/table and sharding key.
Ordering/PartitionMergeTree-family requires exactly one ORDER BY column; only single-column identity PARTITION BY is supported on MergeTree engines. Other engines reject ORDER BY/PARTITION BY.
IndexesPrimary key; data-skipping indexes DATA_SKIPPING_MINMAX and DATA_SKIPPING_BLOOM_FILTER (fixed granularities).
DistributionGravitino enforces Distributions.NONE; no custom distribution strategies.
Column defaultsSupported.
UnsupportedEngine change after creation; removing table properties; auto-increment columns.

Table column types

Gravitino TypeClickHouse Type
ByteInt8
Unsigned ByteUInt8
ShortInt16
Unsigned ShortUInt16
IntegerInt32
Unsigned IntegerUInt32
LongInt64
Unsigned LongUInt64
FloatFloat32
DoubleFloat64
Decimal(p,s)Decimal(p,s)
String/VarCharString
FixedChar(n)FixedString(n)
DateDate
Timestamp[(p)]DateTime (precision defaults to 0)
BOOLEANBool
UUIDUUID

Other ClickHouse types are exposed as External Type.

Table properties

note
  • settings.* keys are passed to the ClickHouse SETTINGS clause verbatim.
  • The engine value is immutable after creation.
  • When loading table metadata, Gravitino cannot determine whether it is a cluster table or a local table, because properties such as cluster-name and on-cluster are not available from the JDBC metadata.
Property NameDescriptionDefault ValueRequiredReservedImmutableSince version
engineTable engine (for example MergeTree, ReplacingMergeTree, Distributed, Memory, etc.)MergeTreeNoNoYes1.2.0
cluster-nameCluster name used with ON CLUSTER and Distributed engine(none)No*NoNo1.2.0
on-clusterUse ON CLUSTER when creating the table(none)NoNoNo1.2.0
cluster-remote-databaseRemote database for Distributed engine(none)No**NoNo1.2.0
cluster-remote-tableRemote table for Distributed engine(none)No**NoNo1.2.0
cluster-sharding-keySharding key for Distributed engine (expression allowed; referenced columns must be non-null integral)(none)No**NoNo1.2.0
settings.<name>ClickHouse engine setting forwarded as SETTINGS <name>=<value>(none)NoNoNo1.2.0

* Required when on-cluster=true or engine=Distributed.
** Required when engine=Distributed.

Table indexes

  • PRIMARY_KEY
  • Data-skipping indexes:
    • DATA_SKIPPING_MINMAX (GRANULARITY fixed to 1)
    • DATA_SKIPPING_BLOOM_FILTER (GRANULARITY fixed to 3)

Partitioning, sorting, and distribution

  • ORDER BY: required for MergeTree-family engines and only columns identity are supported;

    • Accept format: id, (id, name), (func(id), name), func(id);
    • Reject format: (id + 1), (func(id) + 1), etc.
  • PARTITION BY: single-column identity and some functions are supported only, and only for MergeTree-family engines. For example PARTITION BY created_at or PARTITION BY toYYYYMM(created_at) are supported, but PARTITION BY (created_at + 1) are not supported. In all, the following partitioning expressions are supported:

    • Identity: PARTITION BY column_name
    • Functions: PARTITION BY toDate(column_name), PARTITION BY toYear(column_name), PARTITION BY toYYYYMM(column_name). Other functions are not supported.
    • Not support: PARTITION BY (column_name + 1), PARTITION BY (toYear(column_name) + 1), etc. (Note: ClickHouse itself does support arbitrary partitioning expressions, but Gravitino currently only supports the above patterns for partitioning).
  • Distribution: fixed to Distributions.NONE. For a Distributed engine table, you can specify the sharding key and remote database/table through table properties to fulfill the same use cases. We will later consider adding more flexible distribution strategies if there is demand.

Create a table

The following example creates a MergeTree table with ORDER BY, partitioning, indexes, comments, and properties including ON CLUSTER. Note that the engine property is required for MergeTree-family tables, and that the cluster properties must align with the schema-level cluster settings if on-cluster=true.

This is a create table statement that would be executed in ClickHouse to create the corresponding table:

CREATE TABLE sales.orders ON CLUSTER ck_cluster (
order_id Int32,
user_id Int32,
amount Decimal(18,2),
created_at DateTime,
primary key (order_id),
) ENGINE = MergeTree order BY order_id PARTITION BY created_at;

The same table can be created through the API as follows:

curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \
-H "Content-Type: application/json" -d '{
"name": "orders",
"comment": "Orders table",
"columns": [
{"name": "order_id", "type": "int", "nullable": false},
{"name": "user_id", "type": "int", "nullable": false},
{"name": "amount", "type": "decimal(18,2)", "nullable": false},
{"name": "created_at", "type": "timestamp", "nullable": false}
],
"properties": {
"engine": "MergeTree",
"on-cluster": "true",
"cluster-name": "ck_cluster"
},
"sortOrders": [
{"expression": "order_id", "direction": "ASCENDING"}
],
"partitioning": ["created_at"],
"indexes": [
{"indexType": "primary_key", "name": "pk_order", "fieldNames": [["order_id"]]}
]
}' http://localhost:8090/api/metalakes/metalake/catalogs/ck/schemas/sales/tables

Table operations

Supported:

  • Create table with engine, ORDER BY, optional partition, indexes, comments, default values, and SETTINGS.
  • Add column (with nullable flag, default, comment, position).
  • Rename column.
  • Update column type/comment/default/position/nullability.
  • Delete columns (with IF EXISTS support).
  • Add primary or data-skipping indexes; drop data-skipping indexes.
  • Update table comment.

Unsupported:

  • Changing engine after creation.
  • Removing table properties or arbitrary ALTER TABLE ... SETTINGS.
  • Auto-increment columns.

See Manage Relational Metadata Using Gravitino for common JDBC semantics.