2.4.2.2.2 Monitoring Metrics

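Kafka exposes its internal metrics through JMX. For the full list of metrics, see https://kafka.apache.org/documentation/#monitoring

Listed below are the metrics worth watching, given as JMX MBean ObjectNames, together with the PromQL queries that match them after jmx_exporter has converted them: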

Message rate

Rate at which messages are written and read between clients and the broker.

kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec[,topic=*]

kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec[,topic=*]

kafka.server:type=BrokerTopicMetrics,name=BytesOutPerSec[,topic=*]

kafka_server_BrokerTopicMetrics_OneMinuteRate{name="MessagesInPerSec",topic=""}

kafka_server_BrokerTopicMetrics_OneMinuteRate{name="BytesInPerSec",topic=""}

kafka_server_BrokerTopicMetrics_OneMinuteRate{name="BytesOutPerSec",topic=""}
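The correspondence between the MBean names and the series names above can be sketched in a few lines of Python. This assumes jmx_exporter runs without custom rename rules, so that its default naming applies: the domain and the value of the first key property form the metric name, the attribute name is appended, and the remaining key properties become labels. `default_series` is a hypothetical helper written for illustration; it is not part of jmx_exporter.

```python
import re


def default_series(object_name: str, attribute: str) -> str:
    """Approximate the series name produced when no rename rule matches."""
    domain, _, properties = object_name.partition(":")
    pairs = [item.split("=", 1) for item in properties.split(",")]
    # The first key property contributes its value to the metric name.
    first_value = pairs[0][1]
    raw_name = f"{domain}_{first_value}_{attribute}"
    # Characters that are not valid in a metric name become underscores.
    metric = re.sub(r"[^a-zA-Z0-9_:]", "_", raw_name)
    # The remaining key properties become labels.
    labels = ",".join(f'{key}="{value}"' for key, value in pairs[1:])
    return f"{metric}{{{labels}}}"


print(default_series(
    "kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec",
    "OneMinuteRate",
))
```

If the exporter is configured with its own rules, the resulting names can differ, so the queries in this section should be checked against the exporter's actual output.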

Write and read rate caused by replica synchronization between brokers.

kafka.server:type=BrokerTopicMetrics,name=ReplicationBytesInPerSec

kafka.server:type=BrokerTopicMetrics,name=ReplicationBytesOutPerSec

kafka_server_BrokerTopicMetrics_OneMinuteRate{name="ReplicationBytesInPerSec"}

kafka_server_BrokerTopicMetrics_OneMinuteRate{name="ReplicationBytesOutPerSec"}

Request rate

kafka.network:type=RequestMetrics,name=RequestsPerSec,request={Produce|FetchConsumer|FetchFollower}

kafka_network_RequestMetrics_OneMinuteRate{name="RequestsPerSec",request="Produce"}

Request rate by topic

Fetch and produce request rates, counted per topic.

kafka.server:type=BrokerTopicMetrics,name=TotalFetchRequestsPerSec

kafka.server:type=BrokerTopicMetrics,name=TotalProduceRequestsPerSec

Failed request rate

Rate of failed requests/responses. error=NONE indicates that the response succeeded.

kafka.network:type=RequestMetrics,name=ErrorsPerSec,request=([-.\w]+),error=([-.\w]+)

kafka_network_RequestMetrics_OneMinuteRate{name="ErrorsPerSec", error!="NONE"}
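Because successful responses are also counted under error=NONE, a failure rate has to exclude that label value. The sketch below applies the same filter as the query above to samples that have already been fetched from Prometheus. The sample layout (pairs of a label dictionary and a value) and the numbers are made up for illustration.

```python
from collections import defaultdict


def failed_rate_by_request(samples):
    """Sum ErrorsPerSec rates per request type, skipping error="NONE"."""
    totals = defaultdict(float)
    for labels, value in samples:
        if labels.get("name") != "ErrorsPerSec":
            continue
        if labels.get("error") == "NONE":
            continue  # NONE marks a successful response
        totals[labels["request"]] += value
    return dict(totals)


# Invented sample data, not real broker output.
samples = [
    ({"name": "ErrorsPerSec", "request": "Produce", "error": "NONE"}, 120.0),
    ({"name": "ErrorsPerSec", "request": "Produce", "error": "NOT_LEADER_OR_FOLLOWER"}, 0.5),
    ({"name": "ErrorsPerSec", "request": "Fetch", "error": "NONE"}, 300.0),
]
print(failed_rate_by_request(samples))
```

Request types that only produced successful responses do not appear in the result at all, which mirrors how the error!="NONE" matcher drops their series.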

Request size

kafka.network:type=RequestMetrics,name=RequestBytes,request=([-.\w]+)

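kafka_network_RequestMetrics_Mean{name="RequestBytes"}

RequestBytes is a histogram rather than a rate meter, so it exposes attributes such as Mean and Count instead of OneMinuteRate.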

Network processor thread idle ratio

kafka.network:type=SocketServer,name=NetworkProcessorAvgIdlePercent

Request handler thread idle ratio

kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent

Leader count

kafka.server:type=ReplicaManager,name=LeaderCount

kafka_server_ReplicaManager_Value{name="LeaderCount"}

Partition count

kafka.server:type=ReplicaManager,name=PartitionCount

kafka_server_ReplicaManager_Value{name="PartitionCount"}

Under-replicated partition count

Number of partitions where |ISR| < |all replicas|.

kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions

kafka_server_ReplicaManager_Value{name="UnderReplicatedPartitions"}

Number of partitions where |ISR| = min.insync.replicas.

kafka.server:type=ReplicaManager,name=AtMinIsrPartitionCount
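The two ISR conditions above can be written out per partition as follows. This only illustrates the definitions; it is not how the broker computes the gauges, and the function name and example numbers are hypothetical.

```python
def isr_flags(isr_size: int, replica_count: int, min_insync_replicas: int) -> dict:
    """Evaluate both ISR conditions for a single partition."""
    return {
        # Counted by UnderReplicatedPartitions: some replica is missing from the ISR.
        "under_replicated": isr_size < replica_count,
        # Counted by AtMinIsrPartitionCount: losing one more in-sync replica
        # would drop the ISR below min.insync.replicas.
        "at_min_isr": isr_size == min_insync_replicas,
    }


print(isr_flags(isr_size=2, replica_count=3, min_insync_replicas=2))
```

A partition can satisfy both conditions at once, as in the example call, so the two gauges are not mutually exclusive.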

Offline replica count

kafka.server:type=ReplicaManager,name=OfflineReplicaCount

Lag in messages

Maximum number of messages by which follower replicas lag behind the leader replica.

kafka.server:type=ReplicaFetcherManager,name=MaxLag,clientId=Replica

kafka_server_ReplicaFetcherManager_Value{name="MaxLag", clientId="Replica"}

Lag in messages at the topic-partition level.

kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=([-.\w]+),topic=([-.\w]+),partition=([0-9]+)

kafka_server_FetcherLagMetrics_Value{name="ConsumerLag",clientId="ReplicaFetcherThread-0-4",topic="test",partition="0"}

ISR shrink/expand rate

kafka.server:type=ReplicaManager,name=IsrShrinksPerSec

kafka.server:type=ReplicaManager,name=IsrExpandsPerSec

Whether the current broker is the controller

The value is 1 only while the broker is acting as the cluster controller. The whole cluster should have exactly one controller.

kafka.controller:type=KafkaController,name=ActiveControllerCount
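Since each broker reports its own value, the "exactly one controller" rule has to be checked on the sum across brokers. A minimal sketch, assuming the per-broker values have already been collected into a dictionary keyed by a broker identifier; the identifiers and the status strings are invented for this example.

```python
def active_controller_status(counts_by_broker: dict) -> str:
    """Check that exactly one broker reports ActiveControllerCount == 1."""
    total = sum(counts_by_broker.values())
    if total == 1:
        return "ok"
    if total == 0:
        return "no active controller"
    return "multiple active controllers"


print(active_controller_status({"broker-1": 0, "broker-2": 1, "broker-3": 0}))
```

A sum of 0 or a sum above 1 both call for attention, which is why the check distinguishes the two cases instead of returning a plain boolean.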

Leader election rate

kafka.controller:type=ControllerStats,name=LeaderElectionRateAndTimeMs

Unclean leader election rate

kafka.controller:type=ControllerStats,name=UncleanLeaderElectionsPerSec

Request/response queues

kafka.network:type=RequestChannel,name=RequestQueueSize

Size of the request queue. A congested request queue will not be able to process incoming or outgoing requests.

kafka.network:type=RequestChannel,name=ResponseQueueSize

Size of the response queue. The response queue is unbounded. A congested response queue can result in delayed response times and memory pressure on the broker.

Request latency

Total time in ms to serve the specified request.

kafka.network:type=RequestMetrics,name=TotalTimeMs,request={Produce|FetchConsumer|FetchFollower}

Time the request waits in the request queue.

kafka.network:type=RequestMetrics,name=RequestQueueTimeMs,request={Produce|FetchConsumer|FetchFollower}

Time the request is processed at the leader.

kafka.network:type=RequestMetrics,name=LocalTimeMs,request={Produce|FetchConsumer|FetchFollower}

Time the request waits for the follower. This is non-zero for produce requests when acks=all.

kafka.network:type=RequestMetrics,name=RemoteTimeMs,request={Produce|FetchConsumer|FetchFollower}

Time the request waits in the response queue.

kafka.network:type=RequestMetrics,name=ResponseQueueTimeMs,request={Produce|FetchConsumer|FetchFollower}

Time to send the response.

kafka.network:type=RequestMetrics,name=ResponseSendTimeMs,request={Produce|FetchConsumer|FetchFollower}
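The stage timers above break TotalTimeMs down into the phases a request passes through, so comparing them shows where the time is going. A minimal sketch, assuming the per-stage values (for example a mean or a percentile for one request type) have already been collected into a dictionary. The numbers are invented, and the listed stages need not add up exactly to TotalTimeMs.

```python
STAGES = (
    "RequestQueueTimeMs",
    "LocalTimeMs",
    "RemoteTimeMs",
    "ResponseQueueTimeMs",
    "ResponseSendTimeMs",
)


def dominant_stage(timings: dict) -> tuple:
    """Return the slowest stage and its share of TotalTimeMs."""
    stage = max(STAGES, key=lambda name: timings.get(name, 0.0))
    total = timings.get("TotalTimeMs", 0.0)
    share = timings.get(stage, 0.0) / total if total else 0.0
    return stage, share


# Invented values for a Produce request, in milliseconds.
produce = {
    "TotalTimeMs": 50.0,
    "RequestQueueTimeMs": 2.0,
    "LocalTimeMs": 8.0,
    "RemoteTimeMs": 38.0,
    "ResponseQueueTimeMs": 1.0,
    "ResponseSendTimeMs": 1.0,
}
print(dominant_stage(produce))
```

In this made-up case RemoteTimeMs dominates, which for produce requests with acks=all would point to time spent waiting for followers rather than to the leader's own processing.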