Version: 0.7.0-incubating

How to use the playground

Playground introduction

The playground is a complete Apache Gravitino Docker runtime environment with Apache Hive, HDFS, Trino, MySQL, PostgreSQL, Jupyter, and a Apache Gravitino server.

Depending on your network and computer, startup time may take 3-5 minutes. Once the playground environment has started, you can open http://localhost:8090 in a browser to access the Gravitino Web UI.

Prerequisites

Install Git (optional), Docker, Docker Compose. Docker Desktop (or Orbstack) with Kubenetes enabled, and helm CLI are required if you use helm-chart to deploy services.

System Resource Requirements

2 CPU cores, 8 GB RAM, 25 GB disk storage, MacOS or Linux OS (Verified Ubuntu22.04 Ubuntu24.04 AmazonLinux).

TCP ports used

The playground runs several services. The TCP ports used may clash with existing services you run, such as MySQL or Postgres.

Docker container	Ports used
playground-gravitino	8090 9001
playground-hive	3307 19000 19083 60070
playground-mysql	13306
playground-postgresql	15342
playground-trino	18080
playground-jupyter	18888
playground-prometheus	19090
playground-grafana	13000

Playground usage

One curl command launch playground

/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/apache/gravitino-playground/HEAD/install.sh)"

Use git to download and launch playground

git clone git@github.com:apache/gravitino-playground.git
cd gravitino-playground

Docker

Start

./playground.sh docker start

Check status

./playground.sh docker status

Stop playground

./playground.sh docker stop

Kubernetes

Enable Kubernetes in Docker Desktop or Orbstack.

In Project root directory, execute this command:

helm upgrade --install gravitino-playground ./helm-chart/ --create-namespace --namespace gravitino-playground --set projectRoot=$(pwd)

Start

./playground.sh k8s start

Check status

./playground.sh k8s status

Port Forwarding

To access pods or services at localhost, you needs to do these steps:

TRINO_POD=$(kubectl get pods --namespace gravitino-playground -l app=trino -o jsonpath="{.items[0].metadata.name}")
kubectl exec $TRINO_POD -n gravitino-playground -it -- /bin/bash

SPARK_POD=$(kubectl get pods --namespace gravitino-playground -l app=spark -o jsonpath="{.items[0].metadata.name}")
kubectl exec $SPARK_POD -n gravitino-playground -it -- /bin/bash

Port-forwarding Gravitino Service, so that you can access it at localhost:8090.

kubectl port-forward svc/gravitino -n gravitino-playground 8090:8090

Port-forwarding Jupyter Notebook Service, so that you can access it at localhost:8888.

kubectl port-forward svc/jupyternotebook -n gravitino-playground 8888:8888

Stop playground

./playground.sh k8s stop

Experiencing Apache Gravitino with Trino SQL

Using Trino CLI in Docker Container

docker exec -it playground-trino bash

Open the Trino CLI in the container.

trino@container_id:/$ trino

Using Jupyter Notebook

Open the Jupyter Notebook in the browser at http://localhost:18888.
Open the gravitino-trino-example.ipynb notebook.
Start the notebook and run the cells.

Using Spark client

docker exec -it playground-spark bash

Open the Spark SQL client in the container.

spark@container_id:/$ cd /opt/spark && /bin/bash bin/spark-sql

Monitoring Gravitino

Open the Grafana in the browser at http://localhost:13000.
In the navigation menu, click Dashboards -> Gravitino Playground.
Experiment with the default template.

Example

Simple Trino queries

You can use simple queries to test in the Trino CLI.

SHOW CATALOGS;

CREATE SCHEMA catalog_hive.company
  WITH (location = 'hdfs://hive:9000/user/hive/warehouse/company.db');

SHOW CREATE SCHEMA catalog_hive.company;

CREATE TABLE catalog_hive.company.employees
(
  name varchar,
  salary decimal(10,2)
)
WITH (
  format = 'TEXTFILE'
);

INSERT INTO catalog_hive.company.employees (name, salary) VALUES ('Sam Evans', 55000);

SELECT * FROM catalog_hive.company.employees;

SHOW SCHEMAS from catalog_hive;

DESCRIBE catalog_hive.company.employees;

SHOW TABLES from catalog_hive.company;

Cross-catalog queries

In a company, there may be different departments using different data stacks. In this example, the HR department uses Apache Hive to store its data, and the sales department uses PostgreSQL. You can run some interesting queries by joining the two departments' data together with Gravitino.

To know which employee has the largest sales amount, run this SQL:

SELECT given_name, family_name, job_title, sum(total_amount) AS total_sales
FROM catalog_hive.sales.sales as s,
  catalog_postgres.hr.employees AS e
where s.employee_id = e.employee_id
GROUP BY given_name, family_name, job_title
ORDER BY total_sales DESC
LIMIT 1;

To know the top customers who bought the most by state, run this SQL:

SELECT customer_name, location, SUM(total_amount) AS total_spent
FROM catalog_hive.sales.sales AS s,
  catalog_hive.sales.stores AS l,
  catalog_hive.sales.customers AS c
WHERE s.store_id = l.store_id AND s.customer_id = c.customer_id
GROUP BY location, customer_name
ORDER BY location, SUM(total_amount) DESC;

To know the employee's average performance rating and total sales, run this SQL:

SELECT e.employee_id, given_name, family_name, AVG(rating) AS average_rating, SUM(total_amount) AS total_sales
FROM catalog_postgres.hr.employees AS e,
  catalog_postgres.hr.employee_performance AS p,
  catalog_hive.sales.sales AS s
WHERE e.employee_id = p.employee_id AND p.employee_id = s.employee_id
GROUP BY e.employee_id,  given_name, family_name;

Using Spark and Trino

You might consider generating data with SparkSQL and then querying this data using Trino. Give it a try with Gravitino:

// using Hive catalog to create Hive table
USE catalog_hive;
CREATE DATABASE product;
USE product;

CREATE TABLE IF NOT EXISTS employees (
    id INT,
    name STRING,
    age INT
)
PARTITIONED BY (department STRING)
STORED AS PARQUET;
DESC TABLE EXTENDED employees;

INSERT OVERWRITE TABLE employees PARTITION(department='Engineering') VALUES (1, 'John Doe', 30), (2, 'Jane Smith', 28);
INSERT OVERWRITE TABLE employees PARTITION(department='Marketing') VALUES (3, 'Mike Brown', 32);

SELECT * FROM catalog_hive.product.employees WHERE department = 'Engineering';

The demo is located in the jupyter folder, you can open the gravitino-spark-trino-example.ipynb demo via Jupyter Notebook by http://localhost:18888.

Using Apache Iceberg REST service

Suppose you want to migrate your business from Hive to Iceberg. Some tables will use Hive, and the other tables will use Iceberg. Gravitino provides an Iceberg REST catalog service. You can use Spark to access the REST catalog to write the table data. Then, you can use Trino to read the data from the Hive table and join it with the Iceberg table.

spark-defaults.conf is as follows (It's already configured in the playground):

spark.sql.extensions org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
spark.sql.catalog.catalog_rest org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.catalog_rest.type rest
spark.sql.catalog.catalog_rest.uri http://gravitino:9001/iceberg/
spark.locality.wait.node 0

Please note that catalog_rest in SparkSQL and catalog_iceberg in Gravitino and Trino share the same Iceberg JDBC backend, which implies that they can access the same dataset.

docker exec -it playground-spark bash

spark@container_id:/$ cd /opt/spark && /bin/bash bin/spark-sql

use catalog_rest;
create database sales;
use sales;
create table customers (customer_id int, customer_name varchar(100), customer_email varchar(100));
describe extended customers;
insert into customers (customer_id, customer_name, customer_email) values (11,'Rory Brown','rory@123.com');
insert into customers (customer_id, customer_name, customer_email) values (12,'Jerry Washington','jerry@dt.com');

docker exec -it playground-trino bash

trino@container_id:/$ trino

select * from catalog_hive.sales.customers
union
select * from catalog_iceberg.sales.customers;

The demo is located in the jupyter folder, you can open the gravitino-spark-trino-example.ipynb demo via Jupyter Notebook by http://localhost:18888.

Using Gravitino with LlamaIndex

The Gravitino playground also provides a simple RAG demo with LlamaIndex. This demo will show you the ability to use Gravitino to manage both tabular and non-tabular datasets, connecting to LlamaIndex as a unified data source, then use LlamaIndex and LLM to query both tabular and non-tabular data with one natural language query.

The demo is located in the jupyter folder, and you can open the gravitino_llama_index_demo.ipynb demo via Jupyter Notebook by http://localhost:18888.

The scenario of this demo is that basic structured city statistics data is stored in MySQL, and detailed city introductions are stored in PDF files. The user wants to find answers about cities in the structured data and the PDF files.

In this demo, you will use Gravitino to manage the MySQL table using a relational catalog, pdf files using a fileset catalog, treating Gravitino as a unified data source for LlamaIndex to build indexes on both tabular and non-tabular data. Then you will use LLM to query the data using natural language queries.

Note: to run this demo, you need to set OPENAI_API_KEY in the gravitino_llama_index_demo.ipynb, like below, OPENAI_API_BASE is optional.

import os

os.environ["OPENAI_API_KEY"] = ""
os.environ["OPENAI_API_BASE"] = ""

Playground introduction​

Prerequisites​

System Resource Requirements​

TCP ports used​

Playground usage​

One curl command launch playground​

Use git to download and launch playground​

Docker​

Start​

Check status​

Stop playground​

Kubernetes​

Start​

Check status​

Port Forwarding​

Stop playground​

Experiencing Apache Gravitino with Trino SQL​

Using Trino CLI in Docker Container​

Using Jupyter Notebook​

Using Spark client​

Monitoring Gravitino​

Example​

Simple Trino queries​

Cross-catalog queries​

Using Spark and Trino​

Using Apache Iceberg REST service​

Using Gravitino with LlamaIndex​

Playground introduction

Prerequisites

System Resource Requirements

TCP ports used

Playground usage

One curl command launch playground

Use git to download and launch playground

Docker

Start

Check status

Stop playground

Kubernetes

Start

Check status

Port Forwarding

Stop playground

Experiencing Apache Gravitino with Trino SQL

Using Trino CLI in Docker Container

Using Jupyter Notebook

Using Spark client

Monitoring Gravitino

Example

Simple Trino queries

Cross-catalog queries

Using Spark and Trino

Using Apache Iceberg REST service

Using Gravitino with LlamaIndex