Version: 1.3.0

Iceberg Catalog

Introduction

Apache Gravitino provides the ability to manage Apache Iceberg metadata.

Requirements and Limitations

info

Built with Apache Iceberg 1.11.0. The default Apache Iceberg table format version is 2.

Flink and Spark clients may use a different Iceberg version than the server.

caution

Mixing Iceberg JARs from different versions on the client classpath is not supported and may cause runtime errors.

Catalog

Catalog Capabilities

  • Works as a catalog proxy, supporting Hive, JDBC, and REST as catalog backends.
  • Supports DDL operations for Iceberg schemas and tables.
  • Doesn't support snapshot or table management operations.
  • Supports multiple storage systems, including S3, GCS, ADLS, OSS, and HDFS.
  • Supports Kerberos or simple authentication for Iceberg catalog with Hive backend.
  • Supports table metadata cache.

Catalog Properties

| Property name | Description | Default value | Required | Since Version |
| --- | --- | --- | --- | --- |
| catalog-backend | Catalog backend of Gravitino Iceberg catalog. Supports hive or jdbc or rest. | (none) | Yes | 0.2.0 |
| uri | The URI configuration of the Iceberg catalog. thrift://127.0.0.1:9083 or jdbc:postgresql://127.0.0.1:5432/db_name or jdbc:mysql://127.0.0.1:3306/metastore_db or http://127.0.0.1:9001/iceberg. | (none) | Yes | 0.2.0 |
| warehouse | Warehouse location of catalog. Use a physical S3 or HDFS location for hive or jdbc catalog backend, use catalog name for REST catalog backend. | (none) | Yes for hive and jdbc catalog backend | 0.2.0 |
| catalog-backend-name | The catalog name passed to underlying Iceberg catalog backend. Catalog name in JDBC backend is used to isolate namespace and tables. | The property value of catalog-backend, like jdbc for JDBC catalog backend. | No | 0.5.2 |

Any property that Gravitino does not define and that carries the gravitino.bypass. prefix is passed to the Iceberg catalog properties and the HDFS configuration. For example, if you specify gravitino.bypass.list-all-tables, list-all-tables is passed to the Iceberg catalog properties.

If you use Gravitino with Trino, you can pass Trino Iceberg connector configuration using the trino.bypass. prefix. For example, use trino.bypass.iceberg.table-statistics-enabled to pass iceberg.table-statistics-enabled to the Gravitino Iceberg catalog in the Trino runtime.

If you use Gravitino with Spark, you can pass Spark Iceberg connector configuration using the spark.bypass. prefix. For example, use spark.bypass.io-impl to pass io-impl to the Spark Iceberg connector in the Spark runtime.
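
As a rough sketch of how the three prefixes look in a catalog definition, the payload below sets one example property for each. The catalog name iceberg_bypass_demo, the metalake name, the Hive URI, the warehouse path, and every property value are placeholders chosen for illustration, not recommended settings.

```shell
# Placeholder payload: names, URIs, and values are assumptions for illustration.
payload=$(cat <<'EOF'
{
  "name": "iceberg_bypass_demo",
  "type": "RELATIONAL",
  "comment": "Iceberg catalog with bypass properties",
  "provider": "lakehouse-iceberg",
  "properties": {
    "catalog-backend": "hive",
    "uri": "thrift://127.0.0.1:9083",
    "warehouse": "hdfs://127.0.0.1:9000/user/hive/warehouse",
    "gravitino.bypass.list-all-tables": "true",
    "trino.bypass.iceberg.table-statistics-enabled": "true",
    "spark.bypass.io-impl": "org.apache.iceberg.hadoop.HadoopFileIO"
  }
}
EOF
)

# Sends the payload to a Gravitino server assumed to listen on localhost:8090.
create_catalog() {
  curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \
    -H "Content-Type: application/json" \
    -d "$payload" "http://localhost:8090/api/metalakes/metalake/catalogs"
}
```

Call create_catalog once the Gravitino server is reachable.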

JDBC Backend

If you use the JDBC backend, you must provide the jdbc-user, jdbc-password, and jdbc-driver properties.

| Property name | Description | Default value | Required | Since Version |
| --- | --- | --- | --- | --- |
| jdbc-user | JDBC user name | (none) | Yes | 0.2.0 |
| jdbc-password | JDBC password | (none) | Yes | 0.2.0 |
| jdbc-driver | com.mysql.jdbc.Driver or com.mysql.cj.jdbc.Driver for MySQL, org.postgresql.Driver for PostgreSQL | (none) | Yes | 0.3.0 |
| jdbc-initialize | Whether to initialize meta tables when creating the JDBC catalog | true | No | 0.2.0 |

If you already have a JDBC Iceberg catalog, you must set catalog-backend-name to the name of that JDBC Iceberg catalog so that Gravitino can operate on its existing namespaces and tables.

caution

If you use the JDBC backend, download the corresponding JDBC driver and place it in the catalogs/lakehouse-iceberg/libs directory. If you use multiple JDBC catalog backends, setting jdbc-initialize to true may not take effect for an RDBMS such as MySQL; in that case, create the Iceberg meta tables explicitly.
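
The properties above can be combined into a catalog definition roughly as follows. This is a minimal sketch assuming a PostgreSQL database at 127.0.0.1:5432; the catalog name, credentials, and warehouse path are placeholders, and catalog-backend-name is only needed when reusing an existing JDBC Iceberg catalog.

```shell
# Placeholder payload for a JDBC-backed Iceberg catalog (all values are assumptions).
payload=$(cat <<'EOF'
{
  "name": "iceberg_jdbc",
  "type": "RELATIONAL",
  "comment": "Iceberg catalog with JDBC backend",
  "provider": "lakehouse-iceberg",
  "properties": {
    "catalog-backend": "jdbc",
    "uri": "jdbc:postgresql://127.0.0.1:5432/db_name",
    "warehouse": "hdfs://127.0.0.1:9000/user/iceberg/warehouse",
    "jdbc-user": "iceberg_user",
    "jdbc-password": "iceberg_password",
    "jdbc-driver": "org.postgresql.Driver",
    "jdbc-initialize": "true",
    "catalog-backend-name": "existing_jdbc_catalog"
  }
}
EOF
)

# Sends the payload to a Gravitino server assumed to listen on localhost:8090.
create_jdbc_catalog() {
  curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \
    -H "Content-Type: application/json" \
    -d "$payload" "http://localhost:8090/api/metalakes/metalake/catalogs"
}
```

Call create_jdbc_catalog after the JDBC driver JAR is in place and the server has been restarted.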

REST Catalog Backend

For the REST catalog backend, warehouse identifies the catalog in the Iceberg REST spec. In the Gravitino Iceberg REST server, warehouse maps to the catalog name. An empty value means the default catalog.

The following properties tune REST backend behavior:

| Property name | Description | Default value | Required | Since Version |
| --- | --- | --- | --- | --- |
| data-access | Data access mode for REST catalog backend. Supported values are vended-credentials and remote-signing. | (none) | No | 1.3.0 |
| rest-client-connection-timeout-ms | The HTTP connection timeout in milliseconds for requests to the REST catalog backend. | 10000 | No | 1.3.0 |
| rest-client-socket-timeout-ms | The HTTP socket timeout in milliseconds for requests to the REST catalog backend. | 60000 | No | 1.3.0 |

  • vended-credentials: request credential vending from the Iceberg REST server.
  • remote-signing: Gravitino doesn't support this mode yet.

Example: create an Iceberg catalog with the REST backend. This targets the default catalog and uses a REST path like http://127.0.0.1:9001/iceberg/v1/namespaces/db/tables/table.

curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \
-H "Content-Type: application/json" -d '{
  "name": "iceberg_rest",
  "type": "RELATIONAL",
  "comment": "Iceberg REST catalog",
  "provider": "lakehouse-iceberg",
  "properties": {
    "catalog-backend": "rest",
    "uri": "http://localhost:9001/iceberg",
    "rest-client-connection-timeout-ms": "10000",
    "rest-client-socket-timeout-ms": "60000",
    "data-access": "vended-credentials"
  }
}' http://localhost:8090/api/metalakes/metalake/catalogs

To access a non-default catalog, set warehouse to the catalog name. This uses a REST path like http://127.0.0.1:9001/iceberg/v1/catalog/namespaces/db/tables/table. See Multi-Catalog Configuration for details.
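
A sketch of the non-default case, assuming the Iceberg REST server exposes a catalog named my_catalog (a hypothetical name); the only difference from the default-catalog request is the warehouse property.

```shell
# Placeholder payload: "my_catalog" is a hypothetical catalog name on the REST server.
payload=$(cat <<'EOF'
{
  "name": "iceberg_rest_my_catalog",
  "type": "RELATIONAL",
  "comment": "Iceberg REST catalog targeting a non-default catalog",
  "provider": "lakehouse-iceberg",
  "properties": {
    "catalog-backend": "rest",
    "uri": "http://localhost:9001/iceberg",
    "warehouse": "my_catalog"
  }
}
EOF
)

# Sends the payload to a Gravitino server assumed to listen on localhost:8090.
create_rest_catalog() {
  curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \
    -H "Content-Type: application/json" \
    -d "$payload" "http://localhost:8090/api/metalakes/metalake/catalogs"
}
```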

S3

If io-impl is not configured, the Iceberg catalog uses org.apache.iceberg.io.ResolvingFileIO, which selects a FileIO implementation based on the URI scheme:

  • S3: s3, s3a, or s3n
  • OSS: oss
  • GCS: gs or gcs
  • ADLS: abfs, abfss, wasb, or wasbs

To override the default, explicitly configure io-impl. Ensure that the corresponding storage bundle is available in the Iceberg catalog classpath.
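
The scheme list above can be read as a simple lookup. The helper below is an illustrative sketch of that mapping only; storage_for_location is a hypothetical name and is not part of Iceberg or Gravitino.

```shell
# Maps a warehouse or table location to the storage family listed above,
# based only on its URI scheme. Illustrative helper, not Iceberg code.
storage_for_location() {
  scheme="${1%%://*}"
  case "$scheme" in
    s3|s3a|s3n) echo "S3" ;;
    oss) echo "OSS" ;;
    gs|gcs) echo "GCS" ;;
    abfs|abfss|wasb|wasbs) echo "ADLS" ;;
    *) echo "other" ;;
  esac
}

storage_for_location "s3a://bucket/warehouse"   # prints S3
storage_for_location "abfss://container@account.dfs.core.windows.net/path"   # prints ADLS
```

A location that prints "other" here falls outside the schemes listed above, so check whether io-impl needs to be set explicitly.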

The Iceberg catalog supports using a static access-key-id and secret-access-key to access S3 data.

| Configuration item | Description | Default value | Required | Since Version |
| --- | --- | --- | --- | --- |
| io-impl | The IO implementation for FileIO in Iceberg. Set it to org.apache.iceberg.aws.s3.S3FileIO to explicitly use S3FileIO. | org.apache.iceberg.io.ResolvingFileIO | No | 0.6.0-incubating |
| s3-access-key-id | The static access key ID used to access S3 data. | (none) | No | 0.6.0-incubating |
| s3-secret-access-key | The static secret access key used to access S3 data. | (none) | No | 0.6.0-incubating |
| s3-endpoint | An alternative endpoint of the S3 service. Use it for S3FileIO with any S3-compatible object storage service that has a different endpoint, or to access a private S3 endpoint in a virtual private cloud. | (none) | No | 0.6.0-incubating |
| s3-region | The region of the S3 service, like us-west-2. | (none) | No | 0.6.0-incubating |
| s3-path-style-access | Whether to use path style access for S3. | false | No | 0.9.0-incubating |

For other Iceberg S3 properties that Gravitino does not manage, such as s3.sse.type, configure them directly with the gravitino.bypass. prefix, for example gravitino.bypass.s3.sse.type.

info

  • For the JDBC catalog backend, set the warehouse parameter to s3://{bucket_name}/{prefix_name}.
  • For the Hive catalog backend, set warehouse to s3a://{bucket_name}/{prefix_name}.
  • Additionally, download the Gravitino Iceberg AWS bundle and place it in the catalogs/lakehouse-iceberg/libs/ directory.

note

Since Gravitino 1.1.0, the Gravitino Iceberg AWS bundle JAR already includes the Iceberg AWS bundle JAR, so you don't need to download and add it separately.
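
Putting the S3 settings together, a JDBC-backed catalog that stores data in S3 might be defined roughly as below. The bucket name, credentials, region, database details, and the s3.sse.type value are placeholders assumed for illustration.

```shell
# Placeholder payload for an S3 warehouse with a JDBC backend (all values are assumptions).
payload=$(cat <<'EOF'
{
  "name": "iceberg_s3",
  "type": "RELATIONAL",
  "comment": "Iceberg catalog with an S3 warehouse",
  "provider": "lakehouse-iceberg",
  "properties": {
    "catalog-backend": "jdbc",
    "uri": "jdbc:postgresql://127.0.0.1:5432/db_name",
    "jdbc-user": "iceberg_user",
    "jdbc-password": "iceberg_password",
    "jdbc-driver": "org.postgresql.Driver",
    "warehouse": "s3://example-bucket/warehouse",
    "io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
    "s3-access-key-id": "EXAMPLE_ACCESS_KEY_ID",
    "s3-secret-access-key": "EXAMPLE_SECRET_ACCESS_KEY",
    "s3-region": "us-west-2",
    "gravitino.bypass.s3.sse.type": "s3"
  }
}
EOF
)

# Sends the payload to a Gravitino server assumed to listen on localhost:8090.
create_s3_catalog() {
  curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \
    -H "Content-Type: application/json" \
    -d "$payload" "http://localhost:8090/api/metalakes/metalake/catalogs"
}
```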

OSS

Gravitino Iceberg REST service supports using static access-key-id and secret-access-key to access OSS data.

| Configuration item | Description | Default value | Required | Since Version |
| --- | --- | --- | --- | --- |
| io-impl | The IO implementation for FileIO in Iceberg. Set it to org.apache.iceberg.aliyun.oss.OSSFileIO to explicitly use OSSFileIO. | org.apache.iceberg.io.ResolvingFileIO | No | 0.6.0-incubating |
| oss-access-key-id | The static access key ID used to access OSS data. | (none) | No | 0.7.0-incubating |
| oss-secret-access-key | The static secret access key used to access OSS data. | (none) | No | 0.7.0-incubating |
| oss-endpoint | The endpoint of Aliyun OSS service. | (none) | No | 0.7.0-incubating |

For other Iceberg OSS properties that Gravitino does not manage, such as client.security-token, configure them directly with the gravitino.bypass. prefix, for example gravitino.bypass.client.security-token.

info

Set the warehouse parameter to oss://{bucket_name}/{prefix_name}. Additionally, download the Gravitino Iceberg Aliyun bundle and place it in the catalogs/lakehouse-iceberg/libs/ directory.

note

Since Gravitino 1.1.0, the Gravitino Iceberg Aliyun bundle JAR already includes the required Iceberg Aliyun dependency JARs, so you don't need to download and add them separately.

GCS

The Iceberg catalog supports using a Google credential file to access GCS data.

| Configuration item | Description | Default value | Required | Since Version |
| --- | --- | --- | --- | --- |
| io-impl | The IO implementation for FileIO in Iceberg. Set it to org.apache.iceberg.gcp.gcs.GCSFileIO to explicitly use GCSFileIO. | org.apache.iceberg.io.ResolvingFileIO | No | 0.6.0-incubating |

For other Iceberg GCS properties that Gravitino does not manage, such as gcs.project-id, configure them directly with the gravitino.bypass. prefix, for example gravitino.bypass.gcs.project-id.

Make sure the credential file is accessible to Gravitino, for example by running export GOOGLE_APPLICATION_CREDENTIALS=/xx/application_default_credentials.json before the Gravitino server starts.

info

Set warehouse to gs://{bucket_name}/{prefix_name}. Additionally, download the Gravitino Iceberg GCP bundle JAR and place it in the catalogs/lakehouse-iceberg/libs/ directory.

note

Since Gravitino 1.1.0, the Gravitino Iceberg GCP bundle JAR already includes the Iceberg GCP bundle JAR, so you don't need to download and add it separately.
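
For GCS the credential comes from the server environment rather than from catalog properties, so a setup might look roughly like the sketch below. The credential path, bucket, project ID, and Hive URI are placeholders assumed for illustration.

```shell
# Must be set in the environment of the Gravitino server before it starts.
# The path is a placeholder for your own credential file.
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/application_default_credentials.json

# Placeholder payload for a GCS warehouse with a Hive backend (all values are assumptions).
payload=$(cat <<'EOF'
{
  "name": "iceberg_gcs",
  "type": "RELATIONAL",
  "comment": "Iceberg catalog with a GCS warehouse",
  "provider": "lakehouse-iceberg",
  "properties": {
    "catalog-backend": "hive",
    "uri": "thrift://127.0.0.1:9083",
    "warehouse": "gs://example-bucket/warehouse",
    "io-impl": "org.apache.iceberg.gcp.gcs.GCSFileIO",
    "gravitino.bypass.gcs.project-id": "example-project"
  }
}
EOF
)

# Sends the payload to a Gravitino server assumed to listen on localhost:8090.
create_gcs_catalog() {
  curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \
    -H "Content-Type: application/json" \
    -d "$payload" "http://localhost:8090/api/metalakes/metalake/catalogs"
}
```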

ADLS

The Iceberg catalog supports using an Azure storage account name and account key to access ADLS data.

| Configuration item | Description | Default value | Required | Since Version |
| --- | --- | --- | --- | --- |
| io-impl | The IO implementation for FileIO in Iceberg. Set it to org.apache.iceberg.azure.adlsv2.ADLSFileIO to explicitly use ADLSFileIO. | org.apache.iceberg.io.ResolvingFileIO | No | 0.6.0-incubating |
| azure-storage-account-name | The static storage account name used to access ADLS data. | (none) | No | 0.8.0-incubating |
| azure-storage-account-key | The static storage account key used to access ADLS data. | (none) | No | 0.8.0-incubating |

For other Iceberg ADLS properties that Gravitino does not manage, such as adls.read.block-size-bytes, configure them directly with the gravitino.bypass. prefix, for example gravitino.bypass.adls.read.block-size-bytes.

info

Set warehouse to abfs[s]://{container-name}@{storage-account-name}.dfs.core.windows.net/{path}. Additionally, download the Gravitino Iceberg Azure bundle and place it in the catalogs/lakehouse-iceberg/libs/ directory.

note

Since Gravitino 1.1.0, the Gravitino Iceberg Azure bundle JAR already includes the Iceberg Azure bundle JAR, so you don't need to download and add it separately.

Other Storage

For other storages that are not managed by Gravitino directly, you can manage them through custom catalog properties.

| Configuration item | Description | Default value | Required | Since Version |
| --- | --- | --- | --- | --- |
| io-impl | The IO implementation for FileIO in Iceberg. Use the fully qualified class name to override the default implementation. | org.apache.iceberg.io.ResolvingFileIO | No | 0.6.0-incubating |

To pass a custom property such as security-token to your custom FileIO, configure it with the gravitino.bypass. prefix, for example gravitino.bypass.security-token. The security-token property is then included in the properties passed to the initialize method of FileIO.

info

Set the warehouse parameter to {storage_prefix}://{bucket_name}/{prefix_name}. Additionally, place the corresponding JARs in the catalogs/lakehouse-iceberg/libs/ directory.

Catalog Backend Security

Users can use the following properties to configure the security of the catalog backend if needed. For example, if you are using a Kerberos Hive catalog backend, you must set authentication.type to Kerberos and provide authentication.kerberos.principal and authentication.kerberos.keytab-uri.

| Property name | Description | Default value | Required | Since Version |
| --- | --- | --- | --- | --- |
| authentication.type | The type of authentication for the Iceberg catalog backend. This configuration applies only to the Hive backend and currently supports only Kerberos and simple. The JDBC backend currently supports only username/password authentication. | simple | No | 0.6.0-incubating |
| authentication.impersonation-enable | Whether to enable impersonation for the Iceberg catalog. | false | No | 0.6.0-incubating |
| hive.metastore.sasl.enabled | Whether to enable the SASL authentication protocol when connecting to a Kerberos Hive metastore. This is a raw Hive configuration. | false | No. In most cases this value should be true if authentication.type is Kerberos (some deployments use the SSL protocol instead, but that is rare). | 0.6.0-incubating |
| authentication.kerberos.principal | The principal of the Kerberos authentication. | (none) | Required if the value of authentication.type is Kerberos. | 0.6.0-incubating |
| authentication.kerberos.keytab-uri | The URI of the keytab for the Kerberos authentication. | (none) | Required if the value of authentication.type is Kerberos. | 0.6.0-incubating |
| authentication.kerberos.check-interval-sec | The check interval of the Kerberos credential for the Iceberg catalog. | 60 | No | 0.6.0-incubating |
| authentication.kerberos.keytab-fetch-timeout-sec | The timeout for fetching the Kerberos keytab from authentication.kerberos.keytab-uri. | 60 | No | 0.6.0-incubating |
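
As a minimal sketch of a Kerberos-secured Hive backend, the payload below sets the required authentication properties. The principal, keytab URI, metastore URI, and warehouse path are placeholders, and a real deployment may need additional raw Hive settings that are not shown here.

```shell
# Placeholder payload for a Kerberos Hive backend (all values are assumptions).
payload=$(cat <<'EOF'
{
  "name": "iceberg_hive_kerberos",
  "type": "RELATIONAL",
  "comment": "Iceberg catalog with a Kerberos Hive backend",
  "provider": "lakehouse-iceberg",
  "properties": {
    "catalog-backend": "hive",
    "uri": "thrift://127.0.0.1:9083",
    "warehouse": "hdfs://127.0.0.1:9000/user/hive/warehouse",
    "authentication.type": "Kerberos",
    "authentication.kerberos.principal": "gravitino@EXAMPLE.COM",
    "authentication.kerberos.keytab-uri": "file:///etc/security/keytabs/gravitino.keytab",
    "hive.metastore.sasl.enabled": "true"
  }
}
EOF
)

# Sends the payload to a Gravitino server assumed to listen on localhost:8090.
create_kerberos_catalog() {
  curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \
    -H "Content-Type: application/json" \
    -d "$payload" "http://localhost:8090/api/metalakes/metalake/catalogs"
}
```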

Table Metadata Cache

Gravitino features a pluggable cache system for updating or retrieving table metadata in the cache. It validates the location of table metadata against the catalog backend to ensure the correctness of cached data.

| Configuration item | Description | Default value | Required | Since Version |
| --- | --- | --- | --- | --- |
| table-metadata-cache-impl | The implementation of the table metadata cache. Set to empty string ("") if catalog-backend is rest catalog, or custom catalog without the SupportsMetadataLocation interface. | org.apache.gravitino.iceberg.common.cache.LocalTableMetadataCache | No | 1.1.0 |
| table-metadata-cache-capacity | The capacity of the table metadata cache. | 1000 | No | 1.1.0 |
| table-metadata-cache-expire-minutes | The expiration time (in minutes) of the table metadata cache. | 60 | No | 1.1.0 |

Gravitino provides the built-in org.apache.gravitino.iceberg.common.cache.LocalTableMetadataCache, which stores cached data in memory. You can also provide your own table metadata cache by implementing the org.apache.gravitino.iceberg.common.cache.TableMetadataCache interface.
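
Because the cache should be disabled for a REST backend, a REST catalog definition might set table-metadata-cache-impl to an empty string, roughly as sketched below. The catalog name and REST URI are placeholders assumed for illustration.

```shell
# Placeholder payload: disables the table metadata cache for a REST backend.
payload=$(cat <<'EOF'
{
  "name": "iceberg_rest_no_cache",
  "type": "RELATIONAL",
  "comment": "Iceberg REST catalog without table metadata cache",
  "provider": "lakehouse-iceberg",
  "properties": {
    "catalog-backend": "rest",
    "uri": "http://localhost:9001/iceberg",
    "table-metadata-cache-impl": ""
  }
}
EOF
)

# Sends the payload to a Gravitino server assumed to listen on localhost:8090.
create_uncached_catalog() {
  curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \
    -H "Content-Type: application/json" \
    -d "$payload" "http://localhost:8090/api/metalakes/metalake/catalogs"
}
```

For Hive or JDBC backends, the same property block could instead tune table-metadata-cache-capacity and table-metadata-cache-expire-minutes.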

Catalog Operations

Refer to Manage Relational Metadata Using Gravitino for more details.

note

Sensitive catalog properties such as s3-access-key-id, s3-secret-access-key, oss-access-key-id, and oss-secret-access-key are hidden from the load catalog response since Gravitino 1.3.0. Use the credential vending API to retrieve them at runtime.

Schema

Schema Capabilities

  • Doesn't support cascade drop schema.
  • Supports hierarchical (multi-level) schemas, mapping each level to an Iceberg namespace level. See Hierarchical schema.

Schema Properties

You can set any schema property except comment.

Schema Operations

Refer to Manage Relational Metadata Using Gravitino for more details.

Hierarchical schema

The Iceberg catalog supports a hierarchical (multi-level) schema, where a schema can be nested under another schema, mapping each level to an Iceberg multi-level namespace.

A hierarchical schema name is a path whose levels are joined by the configured separator gravitino.schema.separator (default :, see Gravitino server configuration). For example, with the default separator the name a:b:c denotes a schema c nested under a:b, which in turn is nested under a. The separator is only used at the API boundary; Gravitino stores the name internally using a physical separator that never collides with user input.

To create a hierarchical schema, just supply its full hierarchical name. Any missing ancestor schemas are created automatically, so creating a:b:c also creates a and a:b if they don't already exist. The following example creates the schema a:b:c:

curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \
-H "Content-Type: application/json" -d '{
"name": "a:b:c",
"comment": "a hierarchical schema",
"properties": {}
}' http://localhost:8090/api/metalakes/metalake/catalogs/iceberg_catalog/schemas
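
To make the automatic creation of ancestors concrete, the helper below lists every schema that a hierarchical name implies. It is an illustrative sketch only; schema_lineage is a hypothetical name and does not call Gravitino.

```shell
# Prints every schema implied by a hierarchical name, from the top-level
# ancestor down to the schema itself. The separator defaults to ":"
# (the documented default); pass a different one as the second argument.
schema_lineage() {
  name="$1"
  sep="${2:-:}"
  prefix=""
  rest="$name"
  while [ -n "$rest" ]; do
    level="${rest%%"$sep"*}"
    if [ "$level" = "$rest" ]; then
      rest=""
    else
      rest="${rest#*"$sep"}"
    fi
    if [ -z "$prefix" ]; then
      prefix="$level"
    else
      prefix="$prefix$sep$level"
    fi
    echo "$prefix"
  done
}

schema_lineage "a:b:c"   # prints a, a:b, a:b:c on separate lines
```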

To list the schemas directly under a parent schema, pass the parent schema name. Over REST this is the optional parentSchema query parameter; in the clients it is an argument to the list-schemas method. Given the schemas a, a:b and a:b:c, listing the children of a:b returns [a:b:c]. When the parent is omitted, only the top-level schemas under the catalog are returned (the direct children of the catalog root, e.g. a), not the nested ones.

curl -X GET -H "Accept: application/vnd.gravitino.v1+json" \
-H "Content-Type: application/json" \
"http://localhost:8090/api/metalakes/metalake/catalogs/iceberg_catalog/schemas?parentSchema=a:b"

Table

Table Capabilities

  • Doesn't support column default value.

Table Partitions

Supports transforms:

  • IdentityTransform
  • BucketTransform
  • TruncateTransform
  • YearTransform
  • MonthTransform
  • DayTransform
  • HourTransform

info

Iceberg doesn't support multiple fields in BucketTransform. Iceberg also doesn't support ApplyTransform, RangeTransform, or ListTransform.
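
A table that uses two of the supported transforms might be created roughly as sketched below. The JSON field names (partitioning, strategy, fieldName) reflect an assumption about Gravitino's relational REST API, and the schema db, table events, and column names are placeholders; confirm the request shape in Manage Relational Metadata Using Gravitino before relying on it.

```shell
# Assumed request shape for table creation; verify against the Gravitino
# relational metadata documentation. All names are placeholders.
payload=$(cat <<'EOF'
{
  "name": "events",
  "comment": "Partitioned Iceberg table",
  "columns": [
    {"name": "id", "type": "long", "comment": "event id", "nullable": false},
    {"name": "region", "type": "string", "comment": "region code", "nullable": true},
    {"name": "event_date", "type": "date", "comment": "event date", "nullable": true}
  ],
  "partitioning": [
    {"strategy": "identity", "fieldName": ["region"]},
    {"strategy": "day", "fieldName": ["event_date"]}
  ],
  "properties": {}
}
EOF
)

# Sends the payload to a Gravitino server assumed to listen on localhost:8090.
create_partitioned_table() {
  curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \
    -H "Content-Type: application/json" \
    -d "$payload" \
    "http://localhost:8090/api/metalakes/metalake/catalogs/iceberg_catalog/schemas/db/tables"
}
```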

Table Sort Orders

Supports expressions:

  • FieldReference
  • FunctionExpression
    • bucket
    • truncate
    • year
    • month
    • day
    • hour

info

For bucket and truncate, the first argument must be an integer literal, and the second argument must be a field reference.

Table Distributions

  • Supports HashDistribution, which distributes data by partition key.
  • Supports RangeDistribution, which distributes data by partition key or sort key for a table with a sort order.
  • Doesn't support EvenDistribution.

info

If you don't specify distribution expressions, the table distribution is adjusted to RangeDistribution for a table with a sort order and to HashDistribution for a partitioned table.

Table Column Types

| Gravitino Type | Apache Iceberg Type |
| --- | --- |
| Struct | Struct |
| Map | Map |
| List | Array |
| Boolean | Boolean |
| Integer | Integer |
| Long | Long |
| Float | Float |
| Double | Double |
| String | String |
| Date | Date |
| Time(6) | Time |
| Timestamp(6) | TimestampType withoutZone |
| Timestamp_tz(6) | TimestampType withZone |
| Decimal | Decimal |
| Fixed | Fixed |
| Binary | Binary |
| UUID | UUID |

info

Apache Iceberg doesn't support the Gravitino Varchar, Fixedchar, Byte, Short, and Union types. Since 0.6.0-incubating, data types other than those listed above are mapped to the Gravitino External Type, which represents an unresolvable data type.

Table Properties

Pass Iceberg table properties to Gravitino when creating an Iceberg table.

note

Reserved: Fields that cannot be passed to the Gravitino server.

Immutable: Fields that cannot be modified once set.

| Configuration item | Description | Default value | Required | Reserved | Immutable | Since Version |
| --- | --- | --- | --- | --- | --- | --- |
| location | Iceberg location for table storage. | (none) | No | No | Yes | 0.2.0 |
| provider | The storage provider for table storage. | (none) | No | No | Yes | 0.2.0 |
| format | The format of table storage. | (none) | No | No | Yes | 0.2.0 |
| format-version | The format version of table storage. | (none) | No | No | Yes | 0.2.0 |
| comment | The table comment; use the comment field in table meta instead. | (none) | No | Yes | No | 0.2.0 |
| creator | The table creator. | (none) | No | Yes | No | 0.2.0 |
| current-snapshot-id | The snapshot represents the current state of the table. | (none) | No | Yes | No | 0.2.0 |
| cherry-pick-snapshot-id | Selecting a specific snapshot in a merge operation. | (none) | No | Yes | No | 0.2.0 |
| sort-order | Iceberg table sort order; use SortOrder in table meta instead. | (none) | No | Yes | No | 0.2.0 |
| identifier-fields | The identifier fields for defining the table. | (none) | No | Yes | No | 0.2.0 |
| write.distribution-mode | Defines distribution of write data; use distribution in table meta instead. | (none) | No | Yes | No | 0.2.0 |

Table Indexes

  • Doesn't support table indexes.

Table Operations

Refer to Manage Relational Metadata Using Gravitino for more details.

Alter Table Operations

Supports operations:

  • RenameTable
  • SetProperty
  • RemoveProperty
  • UpdateComment
  • AddColumn
  • DeleteColumn
  • RenameColumn
  • UpdateColumnType
  • UpdateColumnPosition
  • UpdateColumnNullability
  • UpdateColumnComment

info

The default column position is LAST when you add a column. Adding a non-nullable column may cause compatibility issues.

caution

Changing a nullable column to non-nullable may cause compatibility issues.
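
As a sketch of the SetProperty and RemoveProperty operations, the request below changes two table properties in one call. The updates array and @type discriminators reflect an assumption about Gravitino's relational REST API, and the schema db, table events, and property names are placeholders; confirm the request shape in Manage Relational Metadata Using Gravitino.

```shell
# Assumed request shape for alter table; verify against the Gravitino
# relational metadata documentation. All names are placeholders.
payload=$(cat <<'EOF'
{
  "updates": [
    {
      "@type": "setProperty",
      "property": "write.format.default",
      "value": "parquet"
    },
    {
      "@type": "removeProperty",
      "property": "obsolete-key"
    }
  ]
}
EOF
)

# Sends the payload to a Gravitino server assumed to listen on localhost:8090.
alter_table_properties() {
  curl -X PUT -H "Accept: application/vnd.gravitino.v1+json" \
    -H "Content-Type: application/json" \
    -d "$payload" \
    "http://localhost:8090/api/metalakes/metalake/catalogs/iceberg_catalog/schemas/db/tables/events"
}
```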

View

View Capabilities

  • Supports list, create, load, alter, and drop for views managed by the underlying Iceberg backend.
  • Accepts any dialect name (e.g. trino, spark, flink, hive). No restriction on which dialects are used.
  • Can preserve multiple SQL representations for the same logical view; the full set of representations round-trips through Gravitino.
  • defaultCatalog and defaultSchema are stored and returned as-is by the backend.
  • View support depends on the Iceberg catalog backend: REST and Hive backends generally support views; JDBC backend support is in continuous validation.

note

Rename cannot be combined with other changes in a single alterView call. Submit rename as a standalone request.

View Operations

Refer to Manage view metadata using Gravitino for more details.

HDFS Configuration

Place core-site.xml and hdfs-site.xml in the catalogs/lakehouse-iceberg/conf directory. Gravitino loads them automatically as the default HDFS configuration.

info

Built with Hadoop 2.10.x, so there may be compatibility issues when accessing Hadoop 3.x clusters. When writing to HDFS, the Gravitino Iceberg REST server can only operate as the specified HDFS user and doesn't support proxying to other HDFS users. See How to access Apache Hadoop for more details.
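
Copying the two Hadoop files into place can be scripted along these lines. This is a minimal sketch: install_hdfs_conf is a hypothetical helper, and both the Hadoop configuration directory and the Gravitino home directory are supplied by the caller.

```shell
# Copies core-site.xml and hdfs-site.xml from a Hadoop conf directory into
# the Iceberg catalog conf directory under the given Gravitino home.
install_hdfs_conf() {
  hadoop_conf_dir="$1"
  gravitino_home="$2"
  target="$gravitino_home/catalogs/lakehouse-iceberg/conf"
  mkdir -p "$target" || return 1
  for f in core-site.xml hdfs-site.xml; do
    cp "$hadoop_conf_dir/$f" "$target/" || return 1
  done
}
```

For example, install_hdfs_conf /etc/hadoop/conf /opt/gravitino would apply it to those two assumed paths; restart the Gravitino server afterwards so the files are picked up.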