2.4.2.2.2 监控项
kafka 通过 jmx 暴露内部监控,具体监控项可以查看:https://kafka.apache.org/documentation/#monitoring
这里列出一些需要关注的监控项的 jmx MBean
ObjectName
,以及 jmx_exporter 将其转换后对应的 promql 查询语句:
消息速率
客户端与 broker 之间消息写入/读取速率。
kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec[,topic=*]
kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec[,topic=*]
kafka.server:type=BrokerTopicMetrics,name=BytesOutPerSec[,topic=*]
kafka_server_BrokerTopicMetrics_OneMinuteRate{name="MessagesInPerSec",topic=""} kafka_server_BrokerTopicMetrics_OneMinuteRate{name="BytesInPerSec", topic=""} kafka_server_BrokerTopicMetrics_OneMinuteRate{name="BytesOutPerSec", topic=""}
broker 与 broker 之间副本同步带来的消息写入/读取速率。
kafka.server:type=BrokerTopicMetrics,name=ReplicationBytesInPerSec
kafka.server:type=BrokerTopicMetrics,name=ReplicationBytesOutPerSec
kafka_server_BrokerTopicMetrics_OneMinuteRate{name="ReplicationBytesInPerSec",} kafka_server_BrokerTopicMetrics_OneMinuteRate{name="ReplicationBytesOutPerSec",}
请求速率
kafka.network:type=RequestMetrics,name=RequestsPerSec,request={Produce|FetchConsumer|FetchFollower}
kafka_network_RequestMetrics_OneMinuteRate{name="RequestsPerSec",request="Produce",}
请求速率
以 topic 维度统计请求速率
kafka.server:type=BrokerTopicMetrics,name=TotalFetchRequestsPerSec
kafka.server:type=BrokerTopicMetrics,name=TotalProduceRequestsPerSec
失败请求速率
统计失败的请求/响应,如果 error=NONE
表示响应是成功的。
kafka.network:type=RequestMetrics,name=ErrorsPerSec,request=([-.\w]+),error=([-.\w]+)
kafka_network_RequestMetrics_OneMinuteRate{name="ErrorsPerSec", error!="NONE"}
请求大小
kafka.network:type=RequestMetrics,name=RequestBytes,request=([-.\w]+)
kafka_network_RequestMetrics_OneMinuteRate{name="RequestBytes", }
网络处理线程空闲率
kafka.network:type=SocketServer,name=NetworkProcessorAvgIdlePercent
请求处理线程空闲率
kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent
leader 数量
kafka.server:type=ReplicaManager,name=LeaderCount
kafka_server_ReplicaManager_Value{name="LeaderCount"}
分区数量
kafka.server:type=ReplicaManager,name=PartitionCount
kafka_server_ReplicaManager_Value{name="PartitionCount"}
有落后副本的分区数量
|isr| < |all replicas|
的 partition 数
kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions
kafka_server_ReplicaManager_Value{name="UnderReplicatedPartitions"}
|isr| = min.insync.replicas
的 partition 数
kafka.server:type=ReplicaManager,name=AtMinIsrPartitionCount
非同步的 replica 数量
kafka.server:type=ReplicaManager,name=OfflineReplicaCount
落后消息数量
follower 和 leader replica 之间落后的消息数量。
kafka.server:type=ReplicaFetcherManager,name=MaxLag,clientId=Replica
kafka_server_ReplicaFetcherManager_Value{name="MaxLag", clientId="Replica"}
具体到 topic partition 级别落后的消息数量
kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=([-.\w]+),topic=([-.\w]+),partition=([0-9]+)
kafka_server_FetcherLagMetrics_Value{name="ConsumerLag",clientId="ReplicaFetcherThread-0-4",topic="test",partition="0",}
Isr 列表扩缩容速率
kafka.server:type=ReplicaManager,name=IsrShrinksPerSec
kafka.server:type=ReplicaManager,name=IsrExpandsPerSec
当前 broker 是否为 controller
只有当 broker 作为集群 controller 时为 1,整个集群应该只有 1 个 controller
kafka.controller:type=KafkaController,name=ActiveControllerCount
Leader 选举速率
kafka.controller:type=ControllerStats,name=LeaderElectionRateAndTimeMs
Unclean leader 选举速率
kafka.controller:type=ControllerStats,name=UncleanLeaderElectionsPerSec
请求/响应队列
kafka.network:type=RequestChannel,name=RequestQueueSize
Size of the request queue. A congested request queue will not be able to process incoming or outgoing requests.
kafka.network:type=RequestChannel,name=ResponseQueueSize
Size of the response queue. The response queue is unbounded. A congested response queue can result in delayed response times and memory pressure on the broker.
请求耗时
Total time in ms to serve the specified request.
kafka.network:type=RequestMetrics,name=TotalTimeMs,request={Produce|FetchConsumer|FetchFollower}
Time the request waits in the request queue.
kafka.network:type=RequestMetrics,name=RequestQueueTimeMs,request={Produce|FetchConsumer|FetchFollower}
Time the request is processed at the leader.
kafka.network:type=RequestMetrics,name=LocalTimeMs,request={Produce|FetchConsumer|FetchFollower}
Time the request waits for the follower. This is non-zero for produce requests when acks=all.
kafka.network:type=RequestMetrics,name=RemoteTimeMs,request={Produce|FetchConsumer|FetchFollower}
Time the request waits in the response queue.
kafka.network:type=RequestMetrics,name=ResponseQueueTimeMs,request={Produce|FetchConsumer|FetchFollower}
Time to send the response.
kafka.network:type=RequestMetrics,name=ResponseSendTimeMs,request={Produce|FetchConsumer|FetchFollower}