Version: 1.2.0

ClickHouse catalog

Introduction

Apache Gravitino can manage ClickHouse metadata through a JDBC catalog. This document describes the capabilities and limitations of the ClickHouse catalog, as well as the supported operations and properties for catalogs, schemas, and tables.

caution

ClickHouse catalog is not included in the standard Gravitino server distribution due to the large size of the ClickHouse JDBC driver and potential licensing issues. To use the ClickHouse catalog, you can build from source code, and refer to document how-to-build for more details.

Catalog

Catalog capabilities

Item	Description
Scope	One catalog maps to one ClickHouse instance
Metadata/DDL	Supports JDBC-based metadata management and DDL
Column defaults	Supports column default values
Drivers	Requires user-provided ClickHouse JDBC driver in `${GRAVITINO_HOME}/catalogs/jdbc-clickhouse/libs`, please download the jar from link
Supported version	All the codes are tested by ClickHouse `24.8.14`, newer versions like `25.x` may also work but we did not conduct thorough tests. Report to the community if something does not work as expected.

note

ClickHouse Driver 0.7.1 is recommended for better compatibility. Some older versions of the ClickHouse JDBC driver have issues such as incorrect metadata reporting that may cause problems with Gravitino's metadata management. If you encounter issues, please check the driver version and consider upgrading to 0.7.1 or later. If the problem still persists, please report it to the Gravitino community for further investigation.

Catalog properties

You can pass any JDBC pool property that Gravitino does not define by adding the gravitino.bypass. prefix (for example gravitino.bypass.maxWaitMillis). See commons-dbcp configuration for details.

When using the JDBC catalog you must provide jdbc-url, jdbc-driver, jdbc-user, and jdbc-password. Common catalog properties are listed here; ClickHouse adds no extra catalog-scoped keys.

Configuration item	Description	Default value	Required	Since Version
`jdbc-url`	JDBC URL, for example `jdbc:clickhouse://localhost:8123`	(none)	Yes	1.2.0
`jdbc-driver`	JDBC driver class, for example `com.clickhouse.jdbc.ClickHouseDriver`	(none)	Yes	1.2.0
`jdbc-user`	JDBC user name	(none)	Yes	1.2.0
`jdbc-password`	JDBC password	(none)	Yes	1.2.0
`jdbc.pool.min-size`	Minimum pool size	`2`	No	1.2.0
`jdbc.pool.max-size`	Maximum pool size	`10`	No	1.2.0
`jdbc.pool.max-wait-ms`	Max wait time for a connection	`30000`	No	1.2.0

Create a ClickHouse catalog

The following example creates a ClickHouse catalog with the required JDBC properties and optional connection pool settings. Note that the jdbc-driver class must be available in the Gravitino classpath (for example by placing the ClickHouse JDBC driver JAR in ${GRAVITINO_HOME}/catalogs/jdbc-clickhouse/libs). Description about some of the properties:

provider: must be jdbc-clickhouse for Gravitino to recognize the catalog as ClickHouse;
type: must be RELATIONAL since ClickHouse is a relational database;

Shell
Java

curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \
-H "Content-Type: application/json" -d '{
  "name": "ck",
  "type": "RELATIONAL",
  "comment": "ClickHouse catalog",
  "provider": "jdbc-clickhouse",
  "properties": {
    "jdbc-url": "jdbc:clickhouse://localhost:8123",
    "jdbc-driver": "com.clickhouse.jdbc.ClickHouseDriver",
    "jdbc-user": "default",
    "jdbc-password": "password"
  }
}' http://localhost:8090/api/metalakes/metalake/catalogs

GravitinoClient client = GravitinoClient.builder("http://localhost:8090")
    .withMetalake("metalake")
    .build();

Map<String, String> ckProps = ImmutableMap.<String, String>builder()
    .put("jdbc-url", "jdbc:clickhouse://localhost:8123")
    .put("jdbc-driver", "com.clickhouse.jdbc.ClickHouseDriver")
    .put("jdbc-user", "default")
    .put("jdbc-password", "password")
    .build();

Catalog catalog =
    client.createCatalog("ck", Catalog.Type.RELATIONAL, "jdbc-clickhouse", "ClickHouse catalog", ckProps);

See Manage Relational Metadata Using Gravitino for other catalog operations.

Schema

Schema capabilities

Item	Description
Mapping	Gravitino schema maps to a ClickHouse database
Operations	Create / drop / load / list (ClickHouse supports cascade drop)
Comments	Schema comments supported
Cluster mode	Optional `ON CLUSTER` for creation when `cluster-name` is provided

Schema properties

Property Name	Description	Default Value	Required	Immutable	Since version
`on-cluster`	Use `ON CLUSTER` when creating the database	`false`	No	No	1.2.0
`cluster-name`	Cluster name used with `ON CLUSTER` (must align with table-level cluster settings)	(none)	No	No	1.2.0

Create a schema

Shell
Java

curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \
-H "Content-Type: application/json" -d '{
  "name": "sales",
  "comment": "Sales database",
  "properties": {
    "on-cluster": "true",
    "cluster-name": "ck_cluster"
  }
}' http://localhost:8090/api/metalakes/metalake/catalogs/ck/schemas

Catalog catalog = client.loadCatalog("ck");
Schema schema = catalog.asTableCatalog()
    .createSchema("sales", "Sales database",
        ImmutableMap.of("on-cluster", "true", "cluster-name", "ck_cluster"));

See Manage Relational Metadata Using Gravitino for more schema operations.

Table

Table capabilities

Area	Details
Mapping	Gravitino table maps to a ClickHouse table
Engines	Local engines: MergeTree family (`MergeTree` default, `ReplacingMergeTree`, `SummingMergeTree`, `AggregatingMergeTree`, `CollapsingMergeTree`, `VersionedCollapsingMergeTree`, `GraphiteMergeTree`), Tiny/Stripe/Log, Memory, File, Null, Set, Join, View, Buffer, KeeperMap, etc. Distributed engine supports cluster mode with remote database/table and sharding key.
Ordering/Partition	MergeTree-family requires exactly one `ORDER BY` column; only single-column identity `PARTITION BY` is supported on MergeTree engines. Other engines reject `ORDER BY`/`PARTITION BY`.
Indexes	Primary key; data-skipping indexes `DATA_SKIPPING_MINMAX` and `DATA_SKIPPING_BLOOM_FILTER` (fixed granularities).
Distribution	Gravitino enforces `Distributions.NONE`; no custom distribution strategies.
Column defaults	Supported.
Unsupported	Engine change after creation; removing table properties; auto-increment columns.

Table column types

Gravitino Type	ClickHouse Type
`Byte`	`Int8`
`Unsigned Byte`	`UInt8`
`Short`	`Int16`
`Unsigned Short`	`UInt16`
`Integer`	`Int32`
`Unsigned Integer`	`UInt32`
`Long`	`Int64`
`Unsigned Long`	`UInt64`
`Float`	`Float32`
`Double`	`Float64`
`Decimal(p,s)`	`Decimal(p,s)`
`String`/`VarChar`	`String`
`FixedChar(n)`	`FixedString(n)`
`Date`	`Date`
`Timestamp[(p)]`	`DateTime` (precision defaults to `0`)
`BOOLEAN`	`Bool`
`UUID`	`UUID`

Other ClickHouse types are exposed as External Type.

Table properties

note

settings.* keys are passed to the ClickHouse SETTINGS clause verbatim.
The engine value is immutable after creation.
When loading table metadata, Gravitino cannot determine whether it is a cluster table or a local table, because properties such as cluster-name and on-cluster are not available from the JDBC metadata.

Property Name	Description	Default Value	Required	Reserved	Immutable	Since version
`engine`	Table engine (for example `MergeTree`, `ReplacingMergeTree`, `Distributed`, `Memory`, etc.)	`MergeTree`	No	No	Yes	1.2.0
`cluster-name`	Cluster name used with `ON CLUSTER` and Distributed engine	(none)	No*	No	No	1.2.0
`on-cluster`	Use `ON CLUSTER` when creating the table	(none)	No	No	No	1.2.0
`cluster-remote-database`	Remote database for `Distributed` engine	(none)	No**	No	No	1.2.0
`cluster-remote-table`	Remote table for `Distributed` engine	(none)	No**	No	No	1.2.0
`cluster-sharding-key`	Sharding key for `Distributed` engine (expression allowed; referenced columns must be non-null integral)	(none)	No**	No	No	1.2.0
`settings.<name>`	ClickHouse engine setting forwarded as `SETTINGS <name>=<value>`	(none)	No	No	No	1.2.0

* Required when on-cluster=true or engine=Distributed.
** Required when engine=Distributed.

Table indexes

PRIMARY_KEY
Data-skipping indexes:
- DATA_SKIPPING_MINMAX (GRANULARITY fixed to 1)
- DATA_SKIPPING_BLOOM_FILTER (GRANULARITY fixed to 3)

Partitioning, sorting, and distribution

ORDER BY: required for MergeTree-family engines and only columns identity are supported;
- Accept format: id, (id, name), (func(id), name), func(id);
- Reject format: (id + 1), (func(id) + 1), etc.
PARTITION BY: single-column identity and some functions are supported only, and only for MergeTree-family engines. For example PARTITION BY created_at or PARTITION BY toYYYYMM(created_at) are supported, but PARTITION BY (created_at + 1) are not supported. In all, the following partitioning expressions are supported:
- Identity: PARTITION BY column_name
- Functions: PARTITION BY toDate(column_name), PARTITION BY toYear(column_name), PARTITION BY toYYYYMM(column_name). Other functions are not supported.
- Not support: PARTITION BY (column_name + 1), PARTITION BY (toYear(column_name) + 1), etc. (Note: ClickHouse itself does support arbitrary partitioning expressions, but Gravitino currently only supports the above patterns for partitioning).
Distribution: fixed to Distributions.NONE. For a Distributed engine table, you can specify the sharding key and remote database/table through table properties to fulfill the same use cases. We will later consider adding more flexible distribution strategies if there is demand.

Create a table

The following example creates a MergeTree table with ORDER BY, partitioning, indexes, comments, and properties including ON CLUSTER. Note that the engine property is required for MergeTree-family tables, and that the cluster properties must align with the schema-level cluster settings if on-cluster=true.

This is a create table statement that would be executed in ClickHouse to create the corresponding table:

CREATE TABLE sales.orders ON CLUSTER ck_cluster (
  order_id Int32,
  user_id Int32,
  amount Decimal(18,2),
  created_at DateTime,
  primary key (order_id),
) ENGINE = MergeTree order BY order_id PARTITION BY created_at;

The same table can be created through the API as follows:

Shell
Java

curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \
-H "Content-Type: application/json" -d '{
  "name": "orders",
  "comment": "Orders table",
  "columns": [
    {"name": "order_id", "type": "int", "nullable": false},
    {"name": "user_id", "type": "int", "nullable": false},
    {"name": "amount", "type": "decimal(18,2)", "nullable": false},
    {"name": "created_at", "type": "timestamp", "nullable": false}
  ],
  "properties": {
    "engine": "MergeTree",
    "on-cluster": "true",
    "cluster-name": "ck_cluster"
  },
  "sortOrders": [
    {"expression": "order_id", "direction": "ASCENDING"}
  ],
  "partitioning": ["created_at"],
  "indexes": [
    {"indexType": "primary_key", "name": "pk_order", "fieldNames": [["order_id"]]}
  ]
}' http://localhost:8090/api/metalakes/metalake/catalogs/ck/schemas/sales/tables

TableCatalog tableCatalog = client.loadCatalog("ck").asTableCatalog();

Column[] columns = new Column[] {
    Column.of("order_id", Types.IntegerType.get(), "Order ID", false),
    Column.of("user_id", Types.IntegerType.get(), "User ID", false),
    Column.of("amount", Types.DecimalType.of(18, 2), "Amount", false),
    Column.of("created_at", Types.TimestampType.withoutTimeZone(), "Created time", false)
};

Index[] indexes =
    new Index[] {Indexes.of(Index.IndexType.PRIMARY_KEY, "pk_order", new String[][] {{"order_id"}})};

SortOrder[] sortOrders =
    new SortOrder[] {SortOrder.builder("order_id").withDirection(SortDirection.ASCENDING).build()};

Transform[] partitions = new Transform[] {Transforms.identity("created_at")};

tableCatalog.createTable(
    NameIdentifier.of("sales", "orders"),
    columns,
    "Orders table",
    ImmutableMap.of("engine", "MergeTree", "on-cluster", "true", "cluster-name", "ck_cluster"),
    partitions,
    Distributions.NONE,
    indexes,
    sortOrders);

Table operations

Supported:

Create table with engine, ORDER BY, optional partition, indexes, comments, default values, and SETTINGS.
Add column (with nullable flag, default, comment, position).
Rename column.
Update column type/comment/default/position/nullability.
Delete columns (with IF EXISTS support).
Add primary or data-skipping indexes; drop data-skipping indexes.
Update table comment.

Unsupported:

Changing engine after creation.
Removing table properties or arbitrary ALTER TABLE ... SETTINGS.
Auto-increment columns.

See Manage Relational Metadata Using Gravitino for common JDBC semantics.

Introduction​

Catalog​

Catalog capabilities​

Catalog properties​

Create a ClickHouse catalog​

Schema​

Schema capabilities​

Schema properties​

Create a schema​

Table​

Table capabilities​

Table column types​

Table properties​

Table indexes​

Partitioning, sorting, and distribution​

Create a table​

Table operations​