Version: 0.8.0-incubating

Hadoop catalog with OSS

This document explains how to configure a Hadoop catalog with Aliyun OSS (Object Storage Service) in Gravitino.

Prerequisites

To set up a Hadoop catalog with OSS, follow these steps:

Download the gravitino-aliyun-bundle-${gravitino-version}.jar file.
Place the downloaded file into the Gravitino Hadoop catalog classpath at ${GRAVITINO_HOME}/catalogs/hadoop/libs/.
Start the Gravitino server by running the following command:

$ ${GRAVITINO_HOME}/bin/gravitino-server.sh start

Once the server is up and running, you can proceed to configure the Hadoop catalog with OSS. In the rest of this document we will use http://localhost:8090 as the Gravitino server URL, please replace it with your actual server URL.

Configurations for creating a Hadoop catalog with OSS

Configuration for an OSS Hadoop catalog

In addition to the basic configurations mentioned in Hadoop-catalog-catalog-configuration, the following properties are required to configure a Hadoop catalog with OSS:

Configuration item	Description	Default value	Required	Since version
`filesystem-providers`	The file system providers to add. Set it to `oss` if it's a OSS fileset, or a comma separated string that contains `oss` like `oss,gs,s3` to support multiple kinds of fileset including `oss`.	(none)	Yes	0.7.0-incubating
`default-filesystem-provider`	The name default filesystem providers of this Hadoop catalog if users do not specify the scheme in the URI. Default value is `builtin-local`, for OSS, if we set this value, we can omit the prefix 'oss://' in the location.	`builtin-local`	No	0.7.0-incubating
`oss-endpoint`	The endpoint of the Aliyun OSS.	(none)	Yes	0.7.0-incubating
`oss-access-key-id`	The access key of the Aliyun OSS.	(none)	Yes	0.7.0-incubating
`oss-secret-access-key`	The secret key of the Aliyun OSS.	(none)	Yes	0.7.0-incubating
`credential-providers`	The credential provider types, separated by comma, possible value can be `oss-token`, `oss-secret-key`. As the default authentication type is using AKSK as the above, this configuration can enable credential vending provided by Gravitino server and client will no longer need to provide authentication information like AKSK to access OSS by GVFS. Once it's set, more configuration items are needed to make it works, please see oss-credential-vending	(none)	No	0.8.0-incubating

Configurations for a schema

To create a schema, refer to Schema configurations.

Configurations for a fileset

For instructions on how to create a fileset, refer to Fileset configurations for more details.

Example of creating Hadoop catalog/schema/fileset with OSS

This section will show you how to use the Hadoop catalog with OSS in Gravitino, including detailed examples.

Step1: Create a Hadoop catalog with OSS

First, you need to create a Hadoop catalog for OSS. The following examples demonstrate how to create a Hadoop catalog with OSS:

Shell
Java
Python

curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \
-H "Content-Type: application/json" -d '{
  "name": "test_catalog",
  "type": "FILESET",
  "comment": "This is a OSS fileset catalog",
  "provider": "hadoop",
  "properties": {
    "location": "oss://bucket/root",
    "oss-access-key-id": "access_key",
    "oss-secret-access-key": "secret_key",
    "oss-endpoint": "http://oss-cn-hangzhou.aliyuncs.com",
    "filesystem-providers": "oss"
  }
}' http://localhost:8090/api/metalakes/metalake/catalogs

GravitinoClient gravitinoClient = GravitinoClient
    .builder("http://localhost:8090")
    .withMetalake("metalake")
    .build();

Map<String, String> ossProperties = ImmutableMap.<String, String>builder()
    .put("location", "oss://bucket/root")
    .put("oss-access-key-id", "access_key")
    .put("oss-secret-access-key", "secret_key")
    .put("oss-endpoint", "http://oss-cn-hangzhou.aliyuncs.com")
    .put("filesystem-providers", "oss")
    .build();

Catalog ossCatalog = gravitinoClient.createCatalog("test_catalog",
    Type.FILESET,
    "hadoop", // provider, Gravitino only supports "hadoop" for now.
    "This is a OSS fileset catalog",
    ossProperties);
// ...

gravitino_client: GravitinoClient = GravitinoClient(uri="http://localhost:8090", metalake_name="metalake")
oss_properties = {
    "location": "oss://bucket/root",
    "oss-access-key-id": "access_key"
    "oss-secret-access-key": "secret_key",
    "oss-endpoint": "ossProperties",
    "filesystem-providers": "oss"
}

oss_catalog = gravitino_client.create_catalog(name="test_catalog",
                                              type=Catalog.Type.FILESET,
                                              provider="hadoop",
                                              comment="This is a OSS fileset catalog",
                                              properties=oss_properties)

Step 2: Create a Schema

Once the Hadoop catalog with OSS is created, you can create a schema inside that catalog. Below are examples of how to do this:

Shell
Java
Python

curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \
-H "Content-Type: application/json" -d '{
  "name": "test_schema",
  "comment": "This is a OSS schema",
  "properties": {
    "location": "oss://bucket/root/schema"
  }
}' http://localhost:8090/api/metalakes/metalake/catalogs/test_catalog/schemas

Catalog catalog = gravitinoClient.loadCatalog("test_catalog");

SupportsSchemas supportsSchemas = catalog.asSchemas();

Map<String, String> schemaProperties = ImmutableMap.<String, String>builder()
    .put("location", "oss://bucket/root/schema")
    .build();
Schema schema = supportsSchemas.createSchema("test_schema",
    "This is a OSS schema",
    schemaProperties
);
// ...

gravitino_client: GravitinoClient = GravitinoClient(uri="http://localhost:8090", metalake_name="metalake")
catalog: Catalog = gravitino_client.load_catalog(name="test_catalog")
catalog.as_schemas().create_schema(name="test_schema",
                                   comment="This is a OSS schema",
                                   properties={"location": "oss://bucket/root/schema"})

Step3: Create a fileset

Now that the schema is created, you can create a fileset inside it. Here’s how:

Shell
Java
Python

curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \
-H "Content-Type: application/json" -d '{
  "name": "example_fileset",
  "comment": "This is an example fileset",
  "type": "MANAGED",
  "storageLocation": "oss://bucket/root/schema/example_fileset",
  "properties": {
    "k1": "v1"
  }
}' http://localhost:8090/api/metalakes/metalake/catalogs/test_catalog/schemas/test_schema/filesets

GravitinoClient gravitinoClient = GravitinoClient
    .builder("http://localhost:8090")
    .withMetalake("metalake")
    .build();

Catalog catalog = gravitinoClient.loadCatalog("test_catalog");
FilesetCatalog filesetCatalog = catalog.asFilesetCatalog();

Map<String, String> propertiesMap = ImmutableMap.<String, String>builder()
        .put("k1", "v1")
        .build();

filesetCatalog.createFileset(
    NameIdentifier.of("test_schema", "example_fileset"),
    "This is an example fileset",
    Fileset.Type.MANAGED,
    "oss://bucket/root/schema/example_fileset",
    propertiesMap,
);

gravitino_client: GravitinoClient = GravitinoClient(uri="http://localhost:8090", metalake_name="metalake")

catalog: Catalog = gravitino_client.load_catalog(name="test_catalog")
catalog.as_fileset_catalog().create_fileset(ident=NameIdentifier.of("test_schema", "example_fileset"),
                                            type=Fileset.Type.MANAGED,
                                            comment="This is an example fileset",
                                            storage_location="oss://bucket/root/schema/example_fileset",
                                            properties={"k1": "v1"})

Accessing a fileset with OSS

Using the GVFS Java client to access the fileset

To access fileset with OSS using the GVFS Java client, based on the basic GVFS configurations, you need to add the following configurations:

Configuration item	Description	Default value	Required	Since version
`oss-endpoint`	The endpoint of the Aliyun OSS.	(none)	Yes	0.7.0-incubating
`oss-access-key-id`	The access key of the Aliyun OSS.	(none)	Yes	0.7.0-incubating
`oss-secret-access-key`	The secret key of the Aliyun OSS.	(none)	Yes	0.7.0-incubating

note

If the catalog has enabled credential vending, the properties above can be omitted. More details can be found in Fileset with credential vending.

Configuration conf = new Configuration();
conf.set("fs.AbstractFileSystem.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.Gvfs");
conf.set("fs.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem");
conf.set("fs.gravitino.server.uri", "http://localhost:8090");
conf.set("fs.gravitino.client.metalake", "test_metalake");
conf.set("oss-endpoint", "http://localhost:8090");
conf.set("oss-access-key-id", "minio");
conf.set("oss-secret-access-key", "minio123"); 
Path filesetPath = new Path("gvfs://fileset/test_catalog/test_schema/test_fileset/new_dir");
FileSystem fs = filesetPath.getFileSystem(conf);
fs.mkdirs(filesetPath);
...

Similar to Spark configurations, you need to add OSS (bundle) jars to the classpath according to your environment. If your wants to custom your hadoop version or there is already a hadoop version in your project, you can add the following dependencies to your pom.xml:

  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>${HADOOP_VERSION}</version>
  </dependency>

  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-aliyun</artifactId>
    <version>${HADOOP_VERSION}</version>
  </dependency>

  <dependency>
    <groupId>org.apache.gravitino</groupId>
    <artifactId>filesystem-hadoop3-runtime</artifactId>
    <version>${GRAVITINO_VERSION}</version>
  </dependency>

  <dependency>
    <groupId>org.apache.gravitino</groupId>
    <artifactId>gravitino-aliyun</artifactId>
    <version>${GRAVITINO_VERSION}</version>
  </dependency>

Or use the bundle jar with Hadoop environment if there is no Hadoop environment:

  <dependency>
    <groupId>org.apache.gravitino</groupId>
    <artifactId>gravitino-aliyun-bundle</artifactId>
    <version>${GRAVITINO_VERSION}</version>
  </dependency>

  <dependency>
    <groupId>org.apache.gravitino</groupId>
    <artifactId>filesystem-hadoop3-runtime</artifactId>
    <version>${GRAVITINO_VERSION}</version>
  </dependency>

Using Spark to access the fileset

The following code snippet shows how to use PySpark 3.1.3 with Hadoop environment(Hadoop 3.2.0) to access the fileset:

Before running the following code, you need to install required packages:

pip install pyspark==3.1.3
pip install apache-gravitino==${GRAVITINO_VERSION}

Then you can run the following code:

from pyspark.sql import SparkSession
import os

gravitino_url = "http://localhost:8090"
metalake_name = "test"

catalog_name = "your_oss_catalog"
schema_name = "your_oss_schema"
fileset_name = "your_oss_fileset"

os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars /path/to/gravitino-aliyun-{gravitino-version}.jar,/path/to/gravitino-filesystem-hadoop3-runtime-{gravitino-version}.jar,/path/to/aliyun-sdk-oss-2.8.3.jar,/path/to/hadoop-aliyun-3.2.0.jar,/path/to/jdom-1.1.jar --master local[1] pyspark-shell"
spark = SparkSession.builder
    .appName("oss_fileset_test")
    .config("spark.hadoop.fs.AbstractFileSystem.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.Gvfs")
    .config("spark.hadoop.fs.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem")
    .config("spark.hadoop.fs.gravitino.server.uri", "${_URL}")
    .config("spark.hadoop.fs.gravitino.client.metalake", "test")
    .config("spark.hadoop.oss-access-key-id", os.environ["OSS_ACCESS_KEY_ID"])
    .config("spark.hadoop.oss-secret-access-key", os.environ["OSS_SECRET_ACCESS_KEY"])
    .config("spark.hadoop.oss-endpoint", "http://oss-cn-hangzhou.aliyuncs.com")
    .config("spark.driver.memory", "2g")
    .config("spark.driver.port", "2048")
    .getOrCreate()

data = [("Alice", 25), ("Bob", 30), ("Cathy", 45)]
columns = ["Name", "Age"]
spark_df = spark.createDataFrame(data, schema=columns)
gvfs_path = f"gvfs://fileset/{catalog_name}/{schema_name}/{fileset_name}/people"

spark_df.coalesce(1).write
    .mode("overwrite")
    .option("header", "true")
    .csv(gvfs_path)

If your Spark without Hadoop environment, you can use the following code snippet to access the fileset:

## Replace the following code snippet with the above code snippet with the same environment variables

os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars /path/to/gravitino-aliyun-bundle-{gravitino-version}.jar,/path/to/gravitino-filesystem-hadoop3-runtime-{gravitino-version}.jar, --master local[1] pyspark-shell"

gravitino-aliyun-bundle-${gravitino-version}.jar is the Gravitino Aliyun jar with Hadoop environment(3.3.1) and hadoop-oss jar.
gravitino-aliyun-${gravitino-version}.jar is a condensed version of the Gravitino Aliyun bundle jar without Hadoop environment and hadoop-aliyun jar. -hadoop-aliyun-3.2.0.jar and aliyun-sdk-oss-2.8.3.jar can be found in the Hadoop distribution in the ${HADOOP_HOME}/share/hadoop/tools/lib directory.

Please choose the correct jar according to your environment.

note

In some Spark versions, a Hadoop environment is needed by the driver, adding the bundle jars with '--jars' may not work. If this is the case, you should add the jars to the spark CLASSPATH directly.

Accessing a fileset using the Hadoop fs command

The following are examples of how to use the hadoop fs command to access the fileset in Hadoop 3.1.3.

Adding the following contents to the ${HADOOP_HOME}/etc/hadoop/core-site.xml file:

  <property>
    <name>fs.AbstractFileSystem.gvfs.impl</name>
    <value>org.apache.gravitino.filesystem.hadoop.Gvfs</value>
  </property>

  <property>
    <name>fs.gvfs.impl</name>
    <value>org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem</value>
  </property>

  <property>
    <name>fs.gravitino.server.uri</name>
    <value>http://localhost:8090</value>
  </property>

  <property>
    <name>fs.gravitino.client.metalake</name>
    <value>test</value>
  </property>

  <property>
    <name>oss-endpoint</name>
    <value>http://oss-cn-hangzhou.aliyuncs.com</value>
  </property>

  <property>
    <name>oss-access-key-id</name>
    <value>access-key</value>
  </property>
  
  <property>
  <name>oss-secret-access-key</name>
    <value>secret-key</value>
  </property>

Add the necessary jars to the Hadoop classpath.

For OSS, you need to add gravitino-filesystem-hadoop3-runtime-${gravitino-version}.jar, gravitino-aliyun-${gravitino-version}.jar and hadoop-aliyun-${hadoop-version}.jar located at ${HADOOP_HOME}/share/hadoop/tools/lib/ to Hadoop classpath.

Run the following command to access the fileset:

./${HADOOP_HOME}/bin/hadoop dfs -ls gvfs://fileset/oss_catalog/oss_schema/oss_fileset
./${HADOOP_HOME}/bin/hadoop dfs -put /path/to/local/file gvfs://fileset/oss_catalog/schema/oss_fileset

Using the GVFS Python client to access a fileset

In order to access fileset with OSS using the GVFS Python client, apart from basic GVFS configurations, you need to add the following configurations:

Configuration item	Description	Default value	Required	Since version
`oss_endpoint`	The endpoint of the Aliyun OSS.	(none)	Yes	0.7.0-incubating
`oss_access_key_id`	The access key of the Aliyun OSS.	(none)	Yes	0.7.0-incubating
`oss_secret_access_key`	The secret key of the Aliyun OSS.	(none)	Yes	0.7.0-incubating

note

If the catalog has enabled credential vending, the properties above can be omitted.

Please install the gravitino package before running the following code:

pip install apache-gravitino==${GRAVITINO_VERSION}

from gravitino import gvfs
options = {
    "cache_size": 20,
    "cache_expired_time": 3600,
    "auth_type": "simple",
    "oss_endpoint": "http://localhost:8090",
    "oss_access_key_id": "minio",
    "oss_secret_access_key": "minio123"
}
fs = gvfs.GravitinoVirtualFileSystem(server_uri="http://localhost:8090", metalake_name="test_metalake", options=options)

fs.ls("gvfs://fileset/{catalog_name}/{schema_name}/{fileset_name}/")

Using fileset with pandas

The following are examples of how to use the pandas library to access the OSS fileset

import pandas as pd

storage_options = {
    "server_uri": "http://localhost:8090", 
    "metalake_name": "test",
    "options": {
        "oss_access_key_id": "access_key",
        "oss_secret_access_key": "secret_key",
        "oss_endpoint": "http://oss-cn-hangzhou.aliyuncs.com"
    }
}
ds = pd.read_csv(f"gvfs://fileset/${catalog_name}/${schema_name}/${fileset_name}/people/part-00000-51d366e2-d5eb-448d-9109-32a96c8a14dc-c000.csv",
                 storage_options=storage_options)
ds.head()

For other use cases, please refer to the Gravitino Virtual File System document.

Fileset with credential vending

Since 0.8.0-incubating, Gravitino supports credential vending for OSS fileset. If the catalog has been configured with credential, you can access OSS fileset without providing authentication information like oss-access-key-id and oss-secret-access-key in the properties.

How to create an OSS Hadoop catalog with credential vending

Apart from configuration method in create-oss-hadoop-catalog, properties needed by oss-credential should also be set to enable credential vending for OSS fileset. Take oss-token credential provider for example:

curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \
-H "Content-Type: application/json" -d '{
  "name": "oss-catalog-with-token",
  "type": "FILESET",
  "comment": "This is a OSS fileset catalog",
  "provider": "hadoop",
  "properties": {
    "location": "oss://bucket/root",
    "oss-access-key-id": "access_key",
    "oss-secret-access-key": "secret_key",
    "oss-endpoint": "http://oss-cn-hangzhou.aliyuncs.com",
    "filesystem-providers": "oss",
    "credential-providers": "oss-token",
    "oss-region":"oss-cn-hangzhou",
    "oss-role-arn":"The ARN of the role to access the OSS data"
  }
}' http://localhost:8090/api/metalakes/metalake/catalogs

How to access OSS fileset with credential vending

If the catalog has been configured with credential, you can access OSS fileset without providing authentication information via GVFS Java/Python client and Spark. Let's see how to access OSS fileset with credential:

GVFS Java client:

Configuration conf = new Configuration();
conf.set("fs.AbstractFileSystem.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.Gvfs");
conf.set("fs.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem");
conf.set("fs.gravitino.server.uri", "http://localhost:8090");
conf.set("fs.gravitino.client.metalake", "test_metalake");
// No need to set oss-access-key-id and oss-secret-access-key
Path filesetPath = new Path("gvfs://fileset/oss_test_catalog/test_schema/test_fileset/new_dir");
FileSystem fs = filesetPath.getFileSystem(conf);
fs.mkdirs(filesetPath);
...

Spark:

spark = SparkSession.builder
    .appName("oss_fileset_test")
    .config("spark.hadoop.fs.AbstractFileSystem.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.Gvfs")
    .config("spark.hadoop.fs.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem")
    .config("spark.hadoop.fs.gravitino.server.uri", "http://localhost:8090")
    .config("spark.hadoop.fs.gravitino.client.metalake", "test")
    # No need to set oss-access-key-id and oss-secret-access-key
    .config("spark.driver.memory", "2g")
    .config("spark.driver.port", "2048")
    .getOrCreate()

Python client and Hadoop command are similar to the above examples.

Prerequisites​

Configurations for creating a Hadoop catalog with OSS​

Configuration for an OSS Hadoop catalog​

Configurations for a schema​

Configurations for a fileset​

Example of creating Hadoop catalog/schema/fileset with OSS​

Step1: Create a Hadoop catalog with OSS​

Step 2: Create a Schema​

Step3: Create a fileset​

Accessing a fileset with OSS​

Using the GVFS Java client to access the fileset​

Using Spark to access the fileset​

Accessing a fileset using the Hadoop fs command​

Using the GVFS Python client to access a fileset​

Using fileset with pandas​

Fileset with credential vending​

How to create an OSS Hadoop catalog with credential vending​

How to access OSS fileset with credential vending​

Prerequisites

Configurations for creating a Hadoop catalog with OSS

Configuration for an OSS Hadoop catalog

Configurations for a schema

Configurations for a fileset

Example of creating Hadoop catalog/schema/fileset with OSS

Step1: Create a Hadoop catalog with OSS

Step 2: Create a Schema

Step3: Create a fileset

Accessing a fileset with OSS

Using the GVFS Java client to access the fileset

Using Spark to access the fileset

Accessing a fileset using the Hadoop fs command

Using the GVFS Python client to access a fileset

Using fileset with pandas

Fileset with credential vending

How to create an OSS Hadoop catalog with credential vending

How to access OSS fileset with credential vending