Version: 1.0.0

Gravitino server Lineage support

Overview

Gravitino server provides a pluggable lineage framework to receive, process, and sink OpenLineage events. With it, you can apply custom processing to lineage events and sink them to your own systems.

Lineage Configuration

| Configuration item | Description | Default value | Required | Since Version |
|---|---|---|---|---|
| gravitino.lineage.source | The name of the lineage event source. | http | No | 0.9.0-incubating |
| gravitino.lineage.${sourceName}.sourceClass | The name of the lineage source class, which should implement the org.apache.gravitino.lineage.source.LineageSource interface. | (none) | No | 0.9.0-incubating |
| gravitino.lineage.processorClass | The name of the lineage processor class, which should implement the org.apache.gravitino.lineage.processor.LineageProcessor interface. The default noop processor does nothing with the run event. | org.apache.gravitino.lineage.processor.NoopProcessor | No | 0.9.0-incubating |
| gravitino.lineage.sinks | The lineage event sink names (multiple sinks are supported, separated by commas). | log | No | 0.9.0-incubating |
| gravitino.lineage.${sinkName}.sinkClass | The name of the lineage sink class, which should implement the org.apache.gravitino.lineage.sink.LineageSink interface. | (none) | No | 0.9.0-incubating |
| gravitino.lineage.queueCapacity | The total capacity of the lineage event queues. When there are multiple lineage sinks, each sink uses an isolated event queue. The capacity of each queue is the value of gravitino.lineage.queueCapacity divided by the number of sinks. | 10000 | No | 0.9.0-incubating |
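
To make the queue capacity rule concrete, the following sketch divides the total capacity by the number of configured sinks. It assumes integer division, which the description above does not state, and `per_sink_capacity` is a hypothetical helper written for illustration, not part of Gravitino.

```shell
#!/bin/sh
# Hypothetical helper: per-sink queue capacity, assuming integer division.
per_sink_capacity() {
  total="$1"   # value of gravitino.lineage.queueCapacity
  sinks="$2"   # comma-separated value of gravitino.lineage.sinks
  # Count the sinks by counting comma-separated fields.
  count=$(printf '%s\n' "$sinks" | awk -F',' '{ print NF }')
  echo $(( total / count ))
}

per_sink_capacity 10000 "log,http"   # prints 5000
```

With the default capacity of 10000 and two sinks, each sink would get a queue of 5000 events under this assumption.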

Lineage HTTP source

The HTTP source provides an endpoint that follows the OpenLineage API spec to receive OpenLineage run events. The following is a usage example:

cat <<EOF >source.json
{
  "eventType": "START",
  "eventTime": "2023-10-28T19:52:00.001+10:00",
  "run": {
    "runId": "0176a8c2-fe01-7439-87e6-56a1a1b4029f"
  },
  "job": {
    "namespace": "gravitino-namespace",
    "name": "gravitino-job1"
  },
  "inputs": [{
    "namespace": "gravitino-namespace",
    "name": "gravitino-table-identifier"
  }],
  "producer": "https://github.com/OpenLineage/OpenLineage/blob/v1-0-0/client",
  "schemaURL": "https://openlineage.io/spec/1-0-5/OpenLineage.json#/definitions/RunEvent"
}
EOF

curl -X POST \
  -i -H 'Content-Type: application/json' \
  -d '@source.json' \
  http://localhost:8090/api/lineage

Lineage log sink

The log sink writes lineage events to a separate log file, gravitino_lineage.log. You can change the default behavior in conf/log4j2.properties.

Lineage HTTP sink

The HTTP sink sends lineage events to an HTTP server that follows the OpenLineage REST specification, such as Marquez.

| Property Name | Description | Default Value | Required | Since Version |
|---|---|---|---|---|
| gravitino.lineage.sinks | Specifies the lineage sink implementation to use. Set to http for the HTTP sink. | log | Yes | 0.9.0 |
| gravitino.lineage.http.sinkClass | Fully qualified class name of the HTTP sink implementation (org.apache.gravitino.lineage.sink.LineageHttpSink). | org.apache.gravitino.lineage.sink.LineageLogSink | Yes | 0.9.0 |
| gravitino.lineage.http.url | URL of the HTTP sink server endpoint for lineage collection (e.g., http://localhost:5000). | none | Yes | 1.0.0 |
| gravitino.lineage.http.authType | Authentication type for the HTTP sink (options: apiKey or none). | none | Yes | 1.0.0 |
| gravitino.lineage.http.apiKey | API key for authenticating with the HTTP sink (required if authType=apiKey). | none | No | 1.0.0 |
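
A minimal sketch of an HTTP sink configuration, using only the properties listed in the table above. The URL `http://localhost:5000` is the example value from the table, and `lineage-http-sink.conf` is a hypothetical scratch file name used for illustration; the lines would need to be copied into your Gravitino server configuration file.

```shell
#!/bin/sh
# Write the HTTP sink settings to a scratch file (hypothetical file name).
# No authentication is configured here; for authType=apiKey you would also
# set gravitino.lineage.http.apiKey.
cat <<'EOF' > lineage-http-sink.conf
gravitino.lineage.sinks = http
gravitino.lineage.http.sinkClass = org.apache.gravitino.lineage.sink.LineageHttpSink
gravitino.lineage.http.url = http://localhost:5000
gravitino.lineage.http.authType = none
EOF
```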

High watermark status

When a lineage sink operates slowly, lineage events accumulate in its async queue. Once the queue size exceeds 90% of its capacity (the high watermark threshold), the lineage system enters high watermark status. In this state, the lineage source must implement retry and logging mechanisms for rejected events to prevent system overload. The HTTP source returns the 429 Too Many Requests status code to the client.
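
Because the HTTP source answers with 429 in high watermark status, a client may want to back off and resend the event. The following is a minimal sketch of such a client-side retry, not a Gravitino feature: `send_event` and `post_with_retry` are hypothetical helpers, and `MAX_ATTEMPTS` and `BASE_DELAY` are illustrative variables rather than Gravitino settings. The endpoint is the one used in the HTTP source example.

```shell
#!/bin/sh
# Endpoint from the HTTP source example; adjust host and port for your server.
LINEAGE_URL="${LINEAGE_URL:-http://localhost:8090/api/lineage}"

# Send one event file and print only the HTTP status code.
send_event() {
  curl -s -o /dev/null -w '%{http_code}' \
    -X POST -H 'Content-Type: application/json' \
    -d "@$1" "$LINEAGE_URL"
}

# Hypothetical retry wrapper: resend while the server answers 429.
post_with_retry() {
  file="$1"
  max_attempts="${MAX_ATTEMPTS:-5}"
  delay="${BASE_DELAY:-1}"
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    status=$(send_event "$file")
    if [ "$status" != "429" ]; then
      echo "attempt $attempt: HTTP $status"
      # Treat any 2xx as success; other codes are not retried here.
      case "$status" in 2??) return 0 ;; *) return 1 ;; esac
    fi
    # Log the rejected event, then wait before the next attempt.
    echo "attempt $attempt: HTTP 429, retrying in ${delay}s" >&2
    sleep "$delay"
    delay=$(( delay * 2 ))   # exponential backoff
    attempt=$(( attempt + 1 ))
  done
  echo "giving up after $max_attempts attempts" >&2
  return 1
}
```

Calling `post_with_retry source.json` would send the event file from the HTTP source example, doubling the wait after each 429 response until the attempt limit is reached.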