Skip to main content
Version: 0.8.0-incubating

Hadoop catalog with ADLS

This document describes how to configure a Hadoop catalog with ADLS (aka. Azure Blob Storage (ABS), or Azure Data Lake Storage (v2)).

Prerequisites

To set up a Hadoop catalog with ADLS, follow these steps:

  1. Download the gravitino-azure-bundle-${gravitino-version}.jar file.
  2. Place the downloaded file into the Gravitino Hadoop catalog classpath at ${GRAVITINO_HOME}/catalogs/hadoop/libs/.
  3. Start the Gravitino server by running the following command:
$ ${GRAVITINO_HOME}/bin/gravitino-server.sh start

Once the server is up and running, you can proceed to configure the Hadoop catalog with ADLS. In the rest of this document we will use http://localhost:8090 as the Gravitino server URL, please replace it with your actual server URL.

Configurations for creating a Hadoop catalog with ADLS

Configuration for a ADLS Hadoop catalog

Apart from configurations mentioned in Hadoop-catalog-catalog-configuration, the following properties are required to configure a Hadoop catalog with ADLS:

Configuration itemDescriptionDefault valueRequiredSince version
filesystem-providersThe file system providers to add. Set it to abs if it's a Azure Blob Storage fileset, or a comma separated string that contains abs like oss,abs,s3 to support multiple kinds of fileset including abs.(none)Yes0.8.0-incubating
default-filesystem-providerThe name default filesystem providers of this Hadoop catalog if users do not specify the scheme in the URI. Default value is builtin-local, for Azure Blob Storage, if we set this value, we can omit the prefix 'abfss://' in the location.builtin-localNo0.8.0-incubating
azure-storage-account-name The account name of Azure Blob Storage.(none)Yes0.8.0-incubating
azure-storage-account-keyThe account key of Azure Blob Storage.(none)Yes0.8.0-incubating
credential-providersThe credential provider types, separated by comma, possible value can be adls-token, azure-account-key. As the default authentication type is using account name and account key as the above, this configuration can enable credential vending provided by Gravitino server and client will no longer need to provide authentication information like account_name/account_key to access ADLS by GVFS. Once it's set, more configuration items are needed to make it works, please see adls-credential-vending(none)No0.8.0-incubating

Configurations for a schema

Refer to Schema configurations for more details.

Configurations for a fileset

Refer to Fileset configurations for more details.

Example of creating Hadoop catalog with ADLS

This section demonstrates how to create the Hadoop catalog with ADLS in Gravitino, with a complete example.

Step1: Create a Hadoop catalog with ADLS

First, you need to create a Hadoop catalog with ADLS. The following example shows how to create a Hadoop catalog with ADLS:

curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \
-H "Content-Type: application/json" -d '{
"name": "example_catalog",
"type": "FILESET",
"comment": "This is a ADLS fileset catalog",
"provider": "hadoop",
"properties": {
"location": "abfss://container@account-name.dfs.core.windows.net/path",
"azure-storage-account-name": "The account name of the Azure Blob Storage",
"azure-storage-account-key": "The account key of the Azure Blob Storage",
"filesystem-providers": "abs"
}
}' http://localhost:8090/api/metalakes/metalake/catalogs

Step2: Create a schema

Once the catalog is created, you can create a schema. The following example shows how to create a schema:

curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \
-H "Content-Type: application/json" -d '{
"name": "test_schema",
"comment": "This is a ADLS schema",
"properties": {
"location": "abfss://container@account-name.dfs.core.windows.net/path"
}
}' http://localhost:8090/api/metalakes/metalake/catalogs/test_catalog/schemas

Step3: Create a fileset

After creating the schema, you can create a fileset. The following example shows how to create a fileset:

curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \
-H "Content-Type: application/json" -d '{
"name": "example_fileset",
"comment": "This is an example fileset",
"type": "MANAGED",
"storageLocation": "abfss://container@account-name.dfs.core.windows.net/path/example_fileset",
"properties": {
"k1": "v1"
}
}' http://localhost:8090/api/metalakes/metalake/catalogs/test_catalog/schemas/test_schema/filesets

Accessing a fileset with ADLS

Using the GVFS Java client to access the fileset

To access fileset with Azure Blob Storage(ADLS) using the GVFS Java client, based on the basic GVFS configurations, you need to add the following configurations:

Configuration itemDescriptionDefault valueRequiredSince version
azure-storage-account-nameThe account name of Azure Blob Storage.(none)Yes0.8.0-incubating
azure-storage-account-keyThe account key of Azure Blob Storage.(none)Yes0.8.0-incubating
note

If the catalog has enabled credential vending, the properties above can be omitted. More details can be found in Fileset with credential vending.

Configuration conf = new Configuration();
conf.set("fs.AbstractFileSystem.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.Gvfs");
conf.set("fs.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem");
conf.set("fs.gravitino.server.uri", "http://localhost:8090");
conf.set("fs.gravitino.client.metalake", "test_metalake");
conf.set("azure-storage-account-name", "account_name_of_adls");
conf.set("azure-storage-account-key", "account_key_of_adls");
Path filesetPath = new Path("gvfs://fileset/test_catalog/test_schema/test_fileset/new_dir");
FileSystem fs = filesetPath.getFileSystem(conf);
fs.mkdirs(filesetPath);
...

Similar to Spark configurations, you need to add ADLS (bundle) jars to the classpath according to your environment.

If your wants to custom your hadoop version or there is already a hadoop version in your project, you can add the following dependencies to your pom.xml:

  <dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-common</artifactId>
<version>${HADOOP_VERSION}</version>
</dependency>

<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-azure</artifactId>
<version>${HADOOP_VERSION}</version>
</dependency>

<dependency>
<groupId>org.apache.gravitino</groupId>
<artifactId>filesystem-hadoop3-runtime</artifactId>
<version>${GRAVITINO_VERSION}</version>
</dependency>

<dependency>
<groupId>org.apache.gravitino</groupId>
<artifactId>gravitino-azure</artifactId>
<version>${GRAVITINO_VERSION}</version>
</dependency>

Or use the bundle jar with Hadoop environment if there is no Hadoop environment:

  <dependency>
<groupId>org.apache.gravitino</groupId>
<artifactId>gravitino-azure-bundle</artifactId>
<version>${GRAVITINO_VERSION}</version>
</dependency>

<dependency>
<groupId>org.apache.gravitino</groupId>
<artifactId>filesystem-hadoop3-runtime</artifactId>
<version>${GRAVITINO_VERSION}</version>
</dependency>

Using Spark to access the fileset

The following code snippet shows how to use PySpark 3.1.3 with Hadoop environment(Hadoop 3.2.0) to access the fileset:

Before running the following code, you need to install required packages:

pip install pyspark==3.1.3
pip install apache-gravitino==${GRAVITINO_VERSION}

Then you can run the following code:

from pyspark.sql import SparkSession
import os

gravitino_url = "http://localhost:8090"
metalake_name = "test"

catalog_name = "your_adls_catalog"
schema_name = "your_adls_schema"
fileset_name = "your_adls_fileset"

os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars /path/to/gravitino-azure-{gravitino-version}.jar,/path/to/gravitino-filesystem-hadoop3-runtime-{gravitino-version}.jar,/path/to/hadoop-azure-3.2.0.jar,/path/to/azure-storage-7.0.0.jar,/path/to/wildfly-openssl-1.0.4.Final.jar --master local[1] pyspark-shell"
spark = SparkSession.builder
.appName("adls_fileset_test")
.config("spark.hadoop.fs.AbstractFileSystem.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.Gvfs")
.config("spark.hadoop.fs.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem")
.config("spark.hadoop.fs.gravitino.server.uri", "http://localhost:8090")
.config("spark.hadoop.fs.gravitino.client.metalake", "test")
.config("spark.hadoop.azure-storage-account-name", "azure_account_name")
.config("spark.hadoop.azure-storage-account-key", "azure_account_name")
.config("spark.hadoop.fs.azure.skipUserGroupMetadataDuringInitialization", "true")
.config("spark.driver.memory", "2g")
.config("spark.driver.port", "2048")
.getOrCreate()

data = [("Alice", 25), ("Bob", 30), ("Cathy", 45)]
columns = ["Name", "Age"]
spark_df = spark.createDataFrame(data, schema=columns)
gvfs_path = f"gvfs://fileset/{catalog_name}/{schema_name}/{fileset_name}/people"

spark_df.coalesce(1).write
.mode("overwrite")
.option("header", "true")
.csv(gvfs_path)

If your Spark without Hadoop environment, you can use the following code snippet to access the fileset:

## Replace the following code snippet with the above code snippet with the same environment variables

os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars /path/to/gravitino-azure-bundle-{gravitino-version}.jar,/path/to/gravitino-filesystem-hadoop3-runtime-{gravitino-version}.jar --master local[1] pyspark-shell"

Please choose the correct jar according to your environment.

note

In some Spark versions, a Hadoop environment is necessary for the driver, adding the bundle jars with '--jars' may not work. If this is the case, you should add the jars to the spark CLASSPATH directly.

Accessing a fileset using the Hadoop fs command

The following are examples of how to use the hadoop fs command to access the fileset in Hadoop 3.1.3.

  1. Adding the following contents to the ${HADOOP_HOME}/etc/hadoop/core-site.xml file:
  <property>
<name>fs.AbstractFileSystem.gvfs.impl</name>
<value>org.apache.gravitino.filesystem.hadoop.Gvfs</value>
</property>

<property>
<name>fs.gvfs.impl</name>
<value>org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem</value>
</property>

<property>
<name>fs.gravitino.server.uri</name>
<value>http://localhost:8090</value>
</property>

<property>
<name>fs.gravitino.client.metalake</name>
<value>test</value>
</property>

<property>
<name>azure-storage-account-name</name>
<value>account_name</value>
</property>
<property>
<name>azure-storage-account-key</name>
<value>account_key</value>
</property>
  1. Add the necessary jars to the Hadoop classpath.

For ADLS, you need to add gravitino-filesystem-hadoop3-runtime-${gravitino-version}.jar, gravitino-azure-${gravitino-version}.jar and hadoop-azure-${hadoop-version}.jar located at ${HADOOP_HOME}/share/hadoop/tools/lib/ to the Hadoop classpath.

  1. Run the following command to access the fileset:
./${HADOOP_HOME}/bin/hadoop dfs -ls gvfs://fileset/adls_catalog/adls_schema/adls_fileset
./${HADOOP_HOME}/bin/hadoop dfs -put /path/to/local/file gvfs://fileset/adls_catalog/adls_schema/adls_fileset

Using the GVFS Python client to access a fileset

In order to access fileset with Azure Blob storage (ADLS) using the GVFS Python client, apart from basic GVFS configurations, you need to add the following configurations:

Configuration itemDescriptionDefault valueRequiredSince version
azure_storage_account_nameThe account name of Azure Blob Storage(none)Yes0.8.0-incubating
azure_storage_account_keyThe account key of Azure Blob Storage(none)Yes0.8.0-incubating
note

If the catalog has enabled credential vending, the properties above can be omitted.

Please install the gravitino package before running the following code:

pip install apache-gravitino==${GRAVITINO_VERSION}
from gravitino import gvfs
options = {
"cache_size": 20,
"cache_expired_time": 3600,
"auth_type": "simple",
"azure_storage_account_name": "azure_account_name",
"azure_storage_account_key": "azure_account_key"
}
fs = gvfs.GravitinoVirtualFileSystem(server_uri="http://localhost:8090", metalake_name="test_metalake", options=options)
fs.ls("gvfs://fileset/{adls_catalog}/{adls_schema}/{adls_fileset}/")

Using fileset with pandas

The following are examples of how to use the pandas library to access the ADLS fileset

import pandas as pd

storage_options = {
"server_uri": "http://localhost:8090",
"metalake_name": "test",
"options": {
"azure_storage_account_name": "azure_account_name",
"azure_storage_account_key": "azure_account_key"
}
}
ds = pd.read_csv(f"gvfs://fileset/${catalog_name}/${schema_name}/${fileset_name}/people/part-00000-51d366e2-d5eb-448d-9109-32a96c8a14dc-c000.csv",
storage_options=storage_options)
ds.head()

For other use cases, please refer to the Gravitino Virtual File System document.

Fileset with credential vending

Since 0.8.0-incubating, Gravitino supports credential vending for ADLS fileset. If the catalog has been configured with credential, you can access ADLS fileset without providing authentication information like azure-storage-account-name and azure-storage-account-key in the properties.

How to create an ADLS Hadoop catalog with credential vending

Apart from configuration method in create-adls-hadoop-catalog, properties needed by adls-credential should also be set to enable credential vending for ADLS fileset. Take adls-token credential provider for example:

curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \
-H "Content-Type: application/json" -d '{
"name": "adls-catalog-with-token",
"type": "FILESET",
"comment": "This is a ADLS fileset catalog",
"provider": "hadoop",
"properties": {
"location": "abfss://container@account-name.dfs.core.windows.net/path",
"azure-storage-account-name": "The account name of the Azure Blob Storage",
"azure-storage-account-key": "The account key of the Azure Blob Storage",
"filesystem-providers": "abs",
"credential-providers": "adls-token",
"azure-tenant-id":"The Azure tenant id",
"azure-client-id":"The Azure client id",
"azure-client-secret":"The Azure client secret key"
}
}' http://localhost:8090/api/metalakes/metalake/catalogs

How to access ADLS fileset with credential vending

If the catalog has been configured with credential, you can access ADLS fileset without providing authentication information via GVFS Java/Python client and Spark. Let's see how to access ADLS fileset with credential:

GVFS Java client:

Configuration conf = new Configuration();
conf.set("fs.AbstractFileSystem.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.Gvfs");
conf.set("fs.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem");
conf.set("fs.gravitino.server.uri", "http://localhost:8090");
conf.set("fs.gravitino.client.metalake", "test_metalake");
// No need to set azure-storage-account-name and azure-storage-account-name
Path filesetPath = new Path("gvfs://fileset/adls_test_catalog/test_schema/test_fileset/new_dir");
FileSystem fs = filesetPath.getFileSystem(conf);
fs.mkdirs(filesetPath);
...

Spark:

spark = SparkSession.builder
.appName("adls_fielset_test")
.config("spark.hadoop.fs.AbstractFileSystem.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.Gvfs")
.config("spark.hadoop.fs.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem")
.config("spark.hadoop.fs.gravitino.server.uri", "http://localhost:8090")
.config("spark.hadoop.fs.gravitino.client.metalake", "test")
# No need to set azure-storage-account-name and azure-storage-account-name
.config("spark.driver.memory", "2g")
.config("spark.driver.port", "2048")
.getOrCreate()

Python client and Hadoop command are similar to the above examples.