Skip to main content
Version: 1.3.0

Spark Connector: Hive Catalog

Introduction

With the Apache Gravitino Spark connector, accessing data or managing metadata in Hive catalogs becomes straightforward, enabling seamless federation queries across different Hive catalogs.

Capabilities

Supports most DDL and DML operations in SparkSQL, except such operations:

  • Function operations (Gravitino UDFs are supported, see Spark connector - User-defined functions)
  • Partition operations
  • View DDL operations (CREATE VIEW, DROP VIEW, ALTER VIEW)
  • LOAD clause
  • CREATE TABLE LIKE clause
  • TRUNCATE TABLE clause
info

Don't support reading and writing tables with org.apache.hadoop.hive.serde2.OpenCSVSerde row format.

Prerequisites

  • Hive metastore 2.x
  • HDFS 2.x or 3.x

SQL Example


// Suppose hive_a is the Hive catalog name managed by Gravitino
USE hive_a;

CREATE DATABASE IF NOT EXISTS mydatabase;
USE mydatabase;

// Create table
CREATE TABLE IF NOT EXISTS employees (
id INT,
name STRING,
age INT
)
PARTITIONED BY (department STRING)
STORED AS PARQUET;
DESC TABLE EXTENDED employees;

INSERT OVERWRITE TABLE employees PARTITION(department='Engineering') VALUES (1, 'John Doe', 30), (2, 'Jane Smith', 28);
INSERT OVERWRITE TABLE employees PARTITION(department='Marketing') VALUES (3, 'Mike Brown', 32);

SELECT * FROM employees WHERE department = 'Engineering';

View

Spark DDL view operations (CREATE VIEW, DROP VIEW, ALTER VIEW) are not supported. However, existing views stored in the Hive Metastore can be read with SELECT. When SELECT references a view name, the connector loads the view's SQL definition from Gravitino, resolves the query, and executes it.

caution

The current implementation materializes all view results on the Spark driver using LocalScan. This is suitable only for small or bounded views. Querying a large or unbounded view may exhaust driver memory and cause an OOM error.

-- Assumes a view was created via Gravitino API, Flink, or Hive directly
SELECT * FROM employee_view;

Catalog Properties

Gravitino spark connector will transform below property names which are defined in catalog properties to Spark Hive connector configuration.

Property name in Gravitino catalog propertiesSpark Hive connector configurationDescriptionSince Version
metastore.urishive.metastore.urisHive metastore uri address0.5.0

Gravitino catalog property names with the prefix spark.bypass. are passed to Spark Hive connector. For example, using spark.bypass.hive.exec.dynamic.partition.mode to pass the hive.exec.dynamic.partition.mode to the Spark Hive connector.

caution

When using the spark-sql shell client, you must explicitly set the spark.bypass.spark.sql.hive.metastore.jars in the Gravitino Hive catalog properties. Replace the default builtin value with the appropriate setting for your setup.

Storage

S3

Refer to Hive catalog with s3 to set up a Hive catalog with s3 storage. To query the data stored in s3, you need to add s3 secret to the Spark configuration using spark.sql.catalog.${hive_catalog_name}.fs.s3a.access.key and spark.sql.catalog.${hive_catalog_name}.fs.s3a.secret.key. Additionally, download hadoop aws jar, aws java sdk jar and place them in the classpath of Spark.