
Apache Gravitino 1.0.0 - From Metadata Management to Contextual Engineering

· 8 min read
Jerry Shao
PMC Member

Apache Gravitino was designed from day one to provide a unified framework for metadata management across heterogeneous sources, regions, and clouds—what we define as the metadata lake (or metalake). Throughout its evolution, Gravitino has extended support to multiple data modalities, including tabular metadata from Apache Hive, Apache Iceberg, MySQL, and PostgreSQL; unstructured assets from HDFS and S3; streaming and messaging metadata from Apache Kafka; and metadata for machine learning models. To further strengthen governance in Gravitino, we have also integrated advanced capabilities, including tagging, audit logging, and end-to-end lineage capture.

Once all enterprise metadata has been centralized through Gravitino, it forms a data brain: a structured, queryable, and semantically enriched representation of data assets. This enables not only consistent metadata access but also knowledge grounding, contextual reasoning, tool use, and more. As we approach the 1.0 milestone, our focus shifts from pure metadata storage to metadata-driven contextual engineering, built on what we call the metadata-driven action system, which provides the building blocks for contextual engineering.

The release of Apache Gravitino 1.0.0 marks a significant engineering step forward, with robust APIs, extensible connectors, enhanced governance primitives, and improved scalability and reliability in distributed environments. In the following sections, I will dive into the new features and architectural improvements introduced in Gravitino 1.0.0.

Metadata-driven action system

In version 1.0.0, we introduced three new components that enable us to build jobs to accomplish metadata-driven actions, such as table compaction, TTL data management, and PII identification. These three new components are: the statistics system, the policy system, and the job system.

Taking table compaction as an example:

  • First, define a table compaction policy in Gravitino and associate it with the tables that need to be compacted.
  • Next, save the table's statistics to Gravitino.
  • Then, define a job template for the compaction.
  • Finally, combine the statistics with the defined policy to generate the compaction parameters, and use these parameters to trigger a compaction job based on the registered job template. Each of these building blocks is sketched in its subsection below.

Statistics system

The statistics system is a new component for storing and retrieving statistics. You can define and store table- and partition-level statistics in Gravitino and fetch them through Gravitino for different purposes.
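
To make this concrete, recording and reading back table statistics over the REST API could look roughly like the sketch below. The endpoint path and payload fields are illustrative assumptions rather than the exact API; please consult the statistics documentation for the real interface.

```python
import requests

GRAVITINO = "http://localhost:8090/api"  # default Gravitino server port
TABLE = "metalakes/demo/catalogs/lakehouse/schemas/sales/tables/orders"

# Record a few table-level statistics (endpoint path and field names are
# illustrative assumptions, not the exact Gravitino API).
requests.put(
    f"{GRAVITINO}/{TABLE}/statistics",
    json={"updates": {"row_count": 1250000, "small_file_count": 430}},
).raise_for_status()

# Read them back, e.g. to feed a compaction decision.
stats = requests.get(f"{GRAVITINO}/{TABLE}/statistics").json()
print(stats)
```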

For details on how we designed this component, please see #7268. For instructions on using the statistics system, refer to the documentation here.

Policy system

The policy system enables you to define action rules in Gravitino, such as compaction rules or TTL rules. A policy can be associated with metadata objects, which means the rules will be enforced on those objects. Users can leverage these policies to decide how to trigger an action on the associated metadata.
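
A rough sketch of defining a compaction policy and attaching it to a table is shown below. The endpoint paths and payload fields are assumptions for illustration only; the policy documentation describes the actual API.

```python
import requests

GRAVITINO = "http://localhost:8090/api"
METALAKE = "demo"

# Define a compaction policy (endpoint path and field names are illustrative
# assumptions, not the exact Gravitino API).
requests.post(
    f"{GRAVITINO}/metalakes/{METALAKE}/policies",
    json={
        "name": "small_file_compaction",
        "policyType": "compaction",
        "content": {"target_file_size_mb": 128, "min_small_files": 100},
    },
).raise_for_status()

# Associate the policy with a table so the rule applies to that object
# (again an illustrative endpoint).
requests.post(
    f"{GRAVITINO}/metalakes/{METALAKE}/objects/table/lakehouse.sales.orders/policies",
    json={"policiesToAdd": ["small_file_compaction"]},
).raise_for_status()
```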

Please refer to the policy system documentation to learn how to use it. For implementation details, please refer to #7139.

Job system

The job system allows you to submit and run jobs through Gravitino. Users register a job template and then trigger jobs based on that template. Gravitino submits the job to the configured job executor, such as Apache Airflow, manages the job lifecycle, and records the job status. With the job system, users can run self-defined jobs to accomplish metadata-driven actions.
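
For illustration, registering a job template and triggering a run might look roughly like this; the endpoint paths, field names, and response shape are assumptions rather than the exact API, so please check the job system documentation.

```python
import requests

GRAVITINO = "http://localhost:8090/api"
METALAKE = "demo"

# Register a shell-based job template (endpoint path and field names are
# illustrative assumptions, not the exact Gravitino API).
requests.post(
    f"{GRAVITINO}/metalakes/{METALAKE}/jobs/templates",
    json={
        "name": "iceberg_compaction",
        "jobType": "shell",
        "executable": "/opt/jobs/compact_table.sh",
        "arguments": ["{{table}}", "{{target_file_size_mb}}"],
    },
).raise_for_status()

# Trigger a run of the template with concrete parameters, then poll its status.
run = requests.post(
    f"{GRAVITINO}/metalakes/{METALAKE}/jobs/runs",
    json={
        "jobTemplateName": "iceberg_compaction",
        "jobConf": {"table": "lakehouse.sales.orders", "target_file_size_mb": "128"},
    },
).json()

status = requests.get(
    f"{GRAVITINO}/metalakes/{METALAKE}/jobs/runs/{run['jobId']}"  # assumed response field
).json()
print(status)
```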

In version 1.0.0, the initial implementation supports running jobs as a local process. If you want to know more about the design details, you can follow issue #7154. User-facing documentation can be found here.

The whole metadata-driven action system is still in an alpha phase in version 1.0.0. The community will continue to evolve the code and use Iceberg table maintenance as a reference implementation in the next version. Please stay tuned.

Agent-ready through the MCP server

MCP (Model Context Protocol) is a powerful protocol for bridging the gap between human language and machine interfaces. With MCP, users can communicate with an LLM using natural language, and the LLM can understand the context and invoke the appropriate tools.

In version 1.0.0, the community officially delivered the MCP server for Gravitino. Users can launch it as a remote or local MCP server and connect to various MCP applications, such as Cursor and Claude Desktop. Additionally, we exposed all metadata-related interfaces as tools that MCP clients can call.

With the Gravitino MCP server, users can manage and govern metadata, as well as perform metadata-driven actions using natural language. Please follow issue #7483 for more details. Additionally, you can refer to the documentation for instructions on how to start the MCP server locally or in Docker.
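
For example, an MCP-capable client is usually pointed at the server with a small configuration entry. The sketch below writes a Claude Desktop-style mcpServers entry; the launch command and module name are placeholders rather than the actual Gravitino MCP server command, which the documentation describes.

```python
import json
from pathlib import Path

# Claude Desktop-style MCP configuration. The command and arguments below are
# placeholders, not the actual Gravitino MCP server launch command; consult the
# MCP server documentation for the real startup instructions.
config = {
    "mcpServers": {
        "gravitino": {
            "command": "python",
            "args": ["-m", "gravitino_mcp_server", "--uri", "http://localhost:8090"],
        }
    }
}

Path("claude_desktop_config.json").write_text(json.dumps(config, indent=2))
```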

Unified access control framework

Gravitino introduced the RBAC system in a previous version, but it only offered the ability to grant privileges to roles and users, without enforcing access control when manipulating securable objects. In 1.0.0, we complete this missing piece in Gravitino.

Users can now set access control policies through the RBAC system, and these controls are enforced when accessing securable objects. For details, refer to the umbrella issue #6762.
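
As an illustration, granting a user read access to a single table could look roughly like the following. The endpoint paths and payload fields are assumptions loosely modeled on the access control documentation and may differ in detail.

```python
import requests

GRAVITINO = "http://localhost:8090/api"
METALAKE = "demo"

# Create a role scoped to one table with read privileges (endpoint paths and
# field names are assumptions, not the exact Gravitino API).
requests.post(
    f"{GRAVITINO}/metalakes/{METALAKE}/roles",
    json={
        "name": "sales_readers",
        "securableObjects": [
            {
                "fullName": "lakehouse.sales.orders",
                "type": "TABLE",
                "privileges": [{"name": "SELECT_TABLE", "condition": "ALLOW"}],
            }
        ],
    },
).raise_for_status()

# Grant the role to a user; with enforcement enabled, requests outside the
# role's privileges are rejected by the server.
requests.put(
    f"{GRAVITINO}/metalakes/{METALAKE}/permissions/users/alice/grant",
    json={"roleNames": ["sales_readers"]},
).raise_for_status()
```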

Support for multiple locations in model management

Model management was introduced in Gravitino 0.9.0. Users have since requested support for multiple storage locations within a single model version, allowing them to choose a preferred location when accessing a model version.

In 1.0.0, the community added support for multiple locations in model management, similar to the fileset's support for multiple locations. Users can check the documentation here for more information; for implementation details, please refer to issue #7363.
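
A rough sketch of linking a model version with several named storage locations is shown below; the endpoint path and payload fields are illustrative assumptions, not the exact API.

```python
import requests

GRAVITINO = "http://localhost:8090/api"
MODEL = "metalakes/demo/catalogs/models/schemas/nlp/models/sentiment"

# Link a new model version with several named storage locations (endpoint path
# and field names are illustrative assumptions, not the exact Gravitino API).
requests.post(
    f"{GRAVITINO}/{MODEL}/versions",
    json={
        "uris": {
            "us-east": "s3://models-us-east/sentiment/v3/",
            "eu-west": "s3://models-eu-west/sentiment/v3/",
            "on-prem": "hdfs://namenode:8020/models/sentiment/v3/",
        },
        "aliases": ["v3"],
        "comment": "fine-tuned on Q2 data",
    },
).raise_for_status()
```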

Support the latest Apache Iceberg and Paimon versions

In Gravitino 1.0.0, we have upgraded the supported Iceberg version to 1.9.0; this upgrade paves the way for more feature support in the next release. Additionally, we have upgraded the supported Paimon version to 1.2.0, which introduces new features for Paimon support.

See issue #6719 for the Iceberg upgrade and issue #8163 for the Paimon upgrade.

Various core features

Core:

  • Add the cache system in the Gravitino entity store #7175.
  • Add Marquez integration as a lineage sink in Gravitino #7396.

Server:

  • Add Azure AD login support for OAuth authentication #7538.

Catalogs:

  • Support StarRocks catalog management in Gravitino #3302.

Clients:

Spark connector:

  • Upgrade the supported Kyuubi version #7480.

UI:

  • Add web UI for listing files / directories under a fileset #7477.

Deployment:

  • Add Helm chart deployment for the Iceberg REST catalog #7159.

Behavior changes

Compatible changes:

  • Rename the Hadoop catalog to fileset catalog #7184.
  • Allow the event listener to change the Iceberg create table request #6486.
  • Support returning aliases when listing model versions #7307.

Breaking changes:

  • Change the supported Java version to JDK 17 for the Gravitino server.
  • Remove Python 3.8 support from the Gravitino Python client #7491.
  • Fix the unnecessary double encoding and decoding issue for the fileset get-location and list-files interfaces #8335. This change is incompatible with older versions of the Java and Python clients; using an old client with a new server may run into decoding issues in some scenarios.

Overall

There are many more features, improvements, and bug fixes not mentioned here. We thank the community for their continued support and valuable contributions.

Apache Gravitino 1.0.0 opens a new chapter, moving from a data catalog to a smart catalog. We will continue to innovate and build, adding more Data and AI features. Please stay tuned!

Credits

This release acknowledges the hard work and dedication of all contributors who have helped make this release possible.

1161623489@qq.com, Aamir, Aaryan Kumar Sinha, Ajax, Akshat Tiwari, Akshat kumar gupta, Aman Chandra Kumar, AndreVale69, Ashwil-Colaco, BIN, Ben Coke, Bharath Krishna, Brijesh Thummar, Bryan Maloyer, Cyber Star, Danhua Wang, Daniel, Daniele Carpentiero, Dentalkart399, Drinkaiii, Edie, Eric Chang, FANNG, Gagan B Mishra, George T. C. Lai, Guilherme Santos, Hatim Kagalwala, Jackeyzhe, Jarvis, JeonDaehong, Jerry Shao, Jimmy Lee, Joonha, Joonseo Lee, Joseph C., Justin Mclean, KWON TAE HEON, Kang, KeeProMise, Khawaja Abdullah Ansar, Kwon Taeheon, Kyle Lin, KyleLin0927, Lord of Abyss, MaAng, Mathieu Baurin, Maxspace1024, Mikshakecere, Mini Yu, Minji Kim, Minji Ryu, Nithish Kumar S, Pacman, Peidian li, Praveen, Qian Xia, Qiang-Liu, Qiming Teng, Raj Gupta, Ratnesh Rastogi, Raveendra Pujari, Reuben George, RickyMa, Rory, Sambhavi Pandey, Sébastien Brochet, Shaofeng Shi, Spiritedswordsman, Sua Bae, Surya B, Tarun, Tian Lu, Tianhang, Timur, Viral Kachhadiya, Will Guo, XiaoZ, Xiaojian Sun, Xun, Yftach Zur, Yuhui, Yujiang Zhong, Yunchi Pang, Zhengke Zhou, _.mung, ankamde, arjun, danielyyang, dependabot[bot], fad, fanng, gavin.wang, guow34, jackeyzhe, kaghatim, keepConcentration, kerenpas, kitoha, lipeidian, liuxian, liuxian131, lsyulong, mchades, mingdaoy, predator4ann, qbhan, raveendra11, roryqi, senlizishi, slimtom95, taylor.fan, taylor12805, teo, tian bao, vishnu, yangyang zhong, youngseojeon, yuhui, yunchi, yuqi, zacsun, zhanghan, zhanghan18, 梁自强, 박용현, 배수아, 신동재, 이승주, 이준하

Apache, Apache Flink, Apache Hive, Apache Hudi, Apache Iceberg, Apache Ranger, Apache Spark, Apache Paimon and Apache Gravitino are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries.

Apache Gravitino 0.9.1

· 2 min read
Rory Qi
committer

Model Management

  • Support updating aliases for model versions #6814, #7158
  • Add file viewer support for Filesets #6860
  • Implement ListFilesEvent in FilesetEventDispatcher #7314
  • Support setOwner/getOwner event operations #7646

Trino Connector

  • Auto-load multiple metalakes in Trino connector #7288

JDBC Validation

  • Validate JDBC URLs during store initialization #7547

Bug Fixes

Core & Catalogs

  • Fix H2 backend file lock issues during deletion #7406
  • Prevent SQL session commit errors #7403
  • Correct OAuth token refresh in web UI #7426
  • Validate namespace string conversions #7516
  • Improve server force-kill shutdown logic #7513
  • Fix bypass key handling in Hive catalog #7416
  • Filter empty Hadoop storage locations #7190
  • Fix model catalog error messages #7346

Connectors

Spark Connector

  • Remove conflicting slf4j dependency #7287
  • Fix S3 credential test errors #7432

Trino Connector

  • Handle unsupported catalog providers #7322

Python Client

  • Fix storage handler mappings for S3/OSS/ABS #7225
  • Improve Java client error messages #7344

Filesets

  • Fix multi-location file paths #7371

Improvements

Core & Catalogs

  • Optimize column deletion logic #7415
  • Auto-register mappers via SPI #7529
  • Validate JDBC entity store URLs #7614
  • Fix catalog index existence checks #7660

CLI & Clients

  • Remove duplicate owner field in CLI #7639
  • URL-encode paths in Java client #7686

Testing

  • Refactor Hadoop catalog test stubbing #7280
  • Fix precondition message mismatches #7521

Documentation

  • Add Trino REST catalog example #7121
  • Iceberg IRC guides for StarRocks/Doris #7368
  • OpenAPI specs for Fileset/File #6860
  • Fix access control docs #7195
  • Update model privilege docs #7555
  • Typo fixes #7448, #7647
  • Remove incubating status markers #7492
  • Add 0.9.1 release notes #7485

Build & Infra

  • Fix Helm chart versioning #7129, #7134
  • Upgrade Kyuubi dependency #7480

Credits

FANNG1 Abyss-lord jerqi jerryshao slimtom95 flaming-archer yunchipang KyleLin0927 xiaozcy diqiu50 yuqi1129 ziqiangliang carl239 LauraXia123 guov100 senlizishi fivedragon5 justinmclean Jackeyzhe Spiritedswordsman su8y

Apache Gravitino 0.9.0 - Focus on AI, data governance, and security with multi-dimensional feature upgrade

· 4 min read
Rory Qi
committer

Gravitino 0.9.0 focuses on advancements in AI, data governance, and security. Many of its new features are already being used in production environments. The release has attracted strong interest from users at well-known companies, with the AI and security capabilities drawing particular attention.

In this version, the community optimized the user experience for fileset catalogs and model catalogs, making it easier for users to manage their unstructured AI data and model data.

The community added a new data lineage interface. Users can now implement a custom data lineage plugin to integrate with their own systems.

For security, the community has corrected some privilege semantics and fixed authorization plugin corner cases to make the entire system more robust.

Model Catalog

Before 0.9.0, the model catalog was immutable, which limited flexibility. In the new version, users can alter models and model versions and add tags #6626 #6222.

Fileset Catalog

Gravitino now supports multiple named storage locations within a single fileset and placeholder-based path generation.

With multiple location support, users can reference data across different file systems (HDFS, S3, GCS, local, etc.) through a unified fileset interface, with each location identified by a unique name.

The placeholder feature allows dynamic storage path generation using the {{placeholder}} syntax, automatically replacing placeholders with corresponding fileset properties.
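
Conceptually, placeholder resolution is simple template substitution. The snippet below is a self-contained illustration; the property and placeholder names are made up for the example, and the fileset documentation defines the exact rules.

```python
import re

def resolve_location(template: str, props: dict) -> str:
    """Replace {{name}} placeholders in a storage-location template."""
    return re.sub(r"\{\{(\w+)\}\}", lambda m: props[m.group(1)], template)

# Hypothetical fileset properties supplying the placeholder values.
props = {"catalog": "lakehouse", "schema": "ml", "fileset": "training_images"}
template = "s3://my-bucket/{{catalog}}/{{schema}}/{{fileset}}"

print(resolve_location(template, props))
# s3://my-bucket/lakehouse/ml/training_images
```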

These enhancements significantly improve flexibility for multi-cloud environments and complex data organization patterns while maintaining a clean abstraction layer for data asset management #6681.

GVFS (Gravitino Virtual File System)

GVFS has been enhanced to support accessing multiple locations within filesets. Users can now select which location to use through configuration properties, environment variables, or fileset default settings.

GVFS has also been refactored with a pluggable architecture allowing custom operations and hooks. This enables users to extend functionality through operations_class and hook_class configuration options for more flexible integration with their specific infrastructure #6938.
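
A minimal sketch of reading a fileset through GVFS from Python is shown below. The import path, constructor arguments, and the option used to select a named location are stated loosely and should be treated as assumptions; the GVFS documentation has the authoritative usage.

```python
# Illustrative GVFS usage from Python; the exact module path, constructor
# arguments, and the option for picking a named location may differ - see the
# GVFS documentation.
from gravitino import gvfs

fs = gvfs.GravitinoVirtualFileSystem(
    server_uri="http://localhost:8090",
    metalake_name="demo",
    options={"current_location_name": "s3_backup"},  # hypothetical option name
)

# fsspec-style calls against a virtual fileset path.
print(fs.ls("fileset/lakehouse/ml/training_images/"))
with fs.open("fileset/lakehouse/ml/training_images/labels.csv", "rb") as f:
    head = f.read(1024)
```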

Security

The new version adds privileges for the data model and corrects some privilege semantics. It also fixes some bugs in the Ranger path-based plugin #6620 #6575 #6821 #6864. All user-related, group-related, and role-related events are now supported by the event system #2969.

Data Lineage

The community added a data lineage interface that follows the OpenLineage API specification. Users can implement a custom data lineage plugin to integrate with their own systems #6617.

Core

The community also paid attention to performance, which was improved by reducing the scope of locks and batch-reading data from storage #6744 #6560 #2870.

CLI

One more change is worth mentioning: users no longer need to rely on the alias command to use the CLI. Instead, the community provides a convenient script located at ./bin/gcli.sh so that users can invoke the CLI client directly #5383.

Connector

Both the Flink connector and the Spark connector added JDBC support #6233 #6164.

Chart

The new Helm chart supports deploying Gravitino on Kubernetes with a fully customizable configuration #6594.

Overall

Gravitino 0.9.0 focuses on advancements in AI, data governance, and security. We thank the Gravitino community for their continued support and valuable contributions. We can continue to innovate and build thanks to all our users' feedback. Thank you for taking the time to read this! To dive deeper into the Gravitino 0.9.0 release, explore the full documentation. Your feedback is greatly valued and helps shape the future of the Gravitino project and community.

Credits

JavedAbdullah AndreVale69 Brijeshthummar02 cool9850311 liuchunhao danhuawang unknowntpo FANNG1 tsungchih jerryshao justinmclean zhoukangcn Abyss-lord amazingLyche yuqi1129 Pranaykarvi puchengy LauraXia123 tengqm rud9192 antony0016 frankvicky TEOTEO520 TungYuChiang sunxiaojian xunliu LuciferYang diqiu50 zhengkezhou1 caican00 granewang yunchipang jerqi mchades rickyma Xander-run flaming-archer waukin lsyulong luoshipeng FourFriends this-user vitamin43 hdygxsj liangyouze

Apache, Apache Flink, Apache Hive, Apache Hudi, Apache Iceberg, Apache Ranger, Apache Spark, Apache Paimon and Apache Gravitino are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries.