2.4.2.2.2 监控项
kafka 通过 jmx 暴露内部监控,具体监控项可以查看:https://kafka.apache.org/documentation/#monitoring
这里列出一些需要关注的监控项的 jmx MBean
ObjectName
,以及 jmx_exporter 将其转换后对应的 promql 查询语句:
消息速率(KafkaRequestHandler 统计)
客户端与 broker 之间消息写入/读取速率。
kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec[,topic=*]
kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec[,topic=*]
kafka.server:type=BrokerTopicMetrics,name=BytesOutPerSec[,topic=*]
kafka_server_BrokerTopicMetrics_OneMinuteRate{name="MessagesInPerSec",topic=""} kafka_server_BrokerTopicMetrics_OneMinuteRate{name="BytesInPerSec", topic=""} kafka_server_BrokerTopicMetrics_OneMinuteRate{name="BytesOutPerSec", topic=""}
broker 与 broker 之间副本同步带来的消息写入/读取速率。
kafka.server:type=BrokerTopicMetrics,name=ReplicationBytesInPerSec
kafka.server:type=BrokerTopicMetrics,name=ReplicationBytesOutPerSec
kafka_server_BrokerTopicMetrics_OneMinuteRate{name="ReplicationBytesInPerSec",} kafka_server_BrokerTopicMetrics_OneMinuteRate{name="ReplicationBytesOutPerSec",}
broker 重新分区数据迁移速率:
kafka.server:type=BrokerTopicMetrics,name=ReassignmentBytesInPerSec
kafka.server:type=BrokerTopicMetrics,name=ReassignmentBytesOutPerSec
拒绝写入速率
kafka.server:type=BrokerTopicMetrics,name=BytesRejectedPerSec
请求频率(KafkaRequestHandler 统计)
以 topic 维度统计请求频率
kafka.server:type=BrokerTopicMetrics,name=TotalProduceRequestsPerSec,topic=([-.\w]+)
kafka.server:type=BrokerTopicMetrics,name=TotalFetchRequestsPerSec,topic=([-.\w]+)
kafka_server_BrokerTopicMetrics_OneMinuteRate{name="TotalProduceRequestsPerSec", } kafka_server_BrokerTopicMetrics_OneMinuteRate{name="TotalFetchRequestsPerSec", }
TotalProduceRequestsPerSec 会在消息加入本地副本日志时统计一次
TotalFetchRequestsPerSec 会在从本地读取消息时统计一次
失败请求频率
kafka.server:type=BrokerTopicMetrics,name=FailedProduceRequestsPerSec,topic=([-.\w]+)
kafka.server:type=BrokerTopicMetrics,name=FailedFetchRequestsPerSec,topic=([-.\w]+)
kafka_server_BrokerTopicMetrics_OneMinuteRate{name="FailedProduceRequestsPerSec", } kafka_server_BrokerTopicMetrics_OneMinuteRate{name="FailedFetchRequestsPerSec", }
这里的 Fetch 包含 FetchConsumer 与 FetchFollower,即客户端的拉取与副本的拉取请求速率之和。
请求频率(RequestChannel 统计)
RequestChannel 会在处理响应时 processCompletedSends()
时记录请求频率
kafka.network:type=RequestMetrics,name=RequestsPerSec,request={Produce|FetchConsumer|FetchFollower}
kafka_network_RequestMetrics_OneMinuteRate{name="RequestsPerSec",request="Produce",}
Kafka 的 API 定义读取消息只有 FETCH,RequestChannel 在统计时会将请求分为 FetchConsumer
和 FetchFollower
,即客户端消费和副本同步消费。
请求错误频率,统计失败的请求/响应,如果 error=NONE
表示响应是成功的。
kafka.network:type=RequestMetrics,name=ErrorsPerSec,request=([-.\w]+),error=([-.\w]+)
kafka_network_RequestMetrics_OneMinuteRate{name="ErrorsPerSec", error!="NONE"}
一般 Channel 统计数会小于 Handler 统计数
请求大小
kafka.network:type=RequestMetrics,name=RequestBytes,request=([-.\w]+)
kafka_network_RequestMetrics_OneMinuteRate{name="RequestBytes", }
网络处理线程空闲率
kafka.network:type=SocketServer,name=NetworkProcessorAvgIdlePercent
请求处理线程空闲率
kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent
Topic 数量
只有 Controller 节点上有,表示集群的 topic 数量
kafka.controller:type=KafkaController,name=GlobalTopicCount
leader 数量
kafka.server:type=ReplicaManager,name=LeaderCount
kafka_server_ReplicaManager_Value{name="LeaderCount"}
只有 Controller 节点上有,表示集群的 topic leader partition 数量
kafka.controller:type=KafkaController,name=GlobalPartitionCount
分区数量
kafka.server:type=ReplicaManager,name=PartitionCount
kafka_server_ReplicaManager_Value{name="PartitionCount"}
有落后副本的分区数量
|isr| < |all replicas|
的 partition 数
kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions
kafka_server_ReplicaManager_Value{name="UnderReplicatedPartitions"}
|isr| = min.insync.replicas
的 partition 数
kafka.server:type=ReplicaManager,name=AtMinIsrPartitionCount
|isr| < min.insync.replicas
的 partition 数
kafka.server:type=ReplicaManager,name=UnderMinIsrPartitionCount
下线分区数
kafka.controller:type=KafkaController,name=OfflinePartitionsCount
落后的 replica 数
kafka.cluster:type=Partition,name=UnderReplicated,topic=*,partition=*
掉线的 replica 数量
当副本因为日志目录文件系统错误而无法访问时,视为 OfflineReplica。
每个 broker 上的 OfflineReplica 数量:
kafka.server:type=ReplicaManager,name=OfflineReplicaCount
每个 broker 上的无法访问的 LogDir 数量:
kafka.server:name=OfflineLogDirectoryCount,type=LogManager
当前 leader 不是最优的分区数量
kafka.controller:type=KafkaController,name=PreferredReplicaImbalanceCount
落后消息数量
follower 和 leader replica 之间落后的消息数量。
kafka.server:type=ReplicaFetcherManager,name=MaxLag,clientId=Replica
kafka_server_ReplicaFetcherManager_Value{name="MaxLag", clientId="Replica"}
具体到 topic partition 级别落后的消息数量
kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=([-.\w]+),topic=([-.\w]+),partition=([0-9]+)
kafka_server_FetcherLagMetrics_Value{name="ConsumerLag",clientId="ReplicaFetcherThread-0-4",topic="test",partition="0",}
Isr 列表扩缩容速率
kafka.server:type=ReplicaManager,name=IsrShrinksPerSec
kafka_server_ReplicaManager_OneMinuteRate{name="IsrExpandsPerSec"}
kafka.server:type=ReplicaManager,name=IsrExpandsPerSec
kafka_server_ReplicaManager_OneMinuteRate{name="IsrShrinksPerSec"}
当前 broker 是否为 controller
只有当 broker 作为集群 controller 时为 1,整个集群应该只有 1 个 controller
kafka.controller:type=KafkaController,name=ActiveControllerCount
Leader 选举速率
kafka.controller:type=ControllerStats,name=LeaderElectionRateAndTimeMs
Unclean leader 选举速率
kafka.controller:type=ControllerStats,name=UncleanLeaderElectionsPerSec
请求/响应队列
kafka.network:type=RequestChannel,name=RequestQueueSize
Size of the request queue. A congested request queue will not be able to process incoming or outgoing requests.
kafka.network:type=RequestChannel,name=ResponseQueueSize
Size of the response queue. The response queue is unbounded. A congested response queue can result in delayed response times and memory pressure on the broker.
请求耗时
Total time in ms to serve the specified request.
kafka.network:type=RequestMetrics,name=TotalTimeMs,request={Produce|FetchConsumer|FetchFollower}
kafka_network_RequestMetrics_75thPercentile{name="TotalTimeMs",request=~"Produce|FetchConsumer",}
Time the request waits in the request queue.
kafka.network:type=RequestMetrics,name=RequestQueueTimeMs,request={Produce|FetchConsumer|FetchFollower}
Time the request is processed at the leader.
kafka.network:type=RequestMetrics,name=LocalTimeMs,request={Produce|FetchConsumer|FetchFollower}
Time the request waits for the follower. This is non-zero for produce requests when acks=all.
kafka.network:type=RequestMetrics,name=RemoteTimeMs,request={Produce|FetchConsumer|FetchFollower}
Time the request waits in the response queue.
kafka.network:type=RequestMetrics,name=ResponseQueueTimeMs,request={Produce|FetchConsumer|FetchFollower}
Time to send the response.
kafka.network:type=RequestMetrics,name=ResponseSendTimeMs,request={Produce|FetchConsumer|FetchFollower}
消息格式转换
转换频率
kafka.server:type=BrokerTopicMetrics,name=ProduceMessageConversionsPerSec,topic=([-.\w]+)
耗时
kafka.network:type=RequestMetrics,name=MessageConversionsTimeMs,request={Produce|Fetch}
gc
java.lang:name=G1 Young Generation,type=GarbageCollector CollectionCount
java.lang:name=G1 Young Generation,type=GarbageCollector CollectionTime
kafka purgatory
kafka.server:type=DelayedOperationPurgatory,delayedOperation=Produce,name=PurgatorySize
Number of requests waiting in the producer purgatory. This should be non-zero when acks=all
is used on the producer.
kafka.server:type=DelayedOperationPurgatory,delayedOperation=Fetch,name=PurgatorySize
Number of requests waiting in the fetch purgatory. This is high if consumers use a large value for fetch.wait.max.ms
.
partition log
kafka.log:type=Log,name=Size,topic=([-.\w]+),partition=([0-9]+)
kafka.log:type=Log,name=NumLogSegments,topic=([-.\w]+),partition=([0-9]+)
kafka.log:type=Log,name=LogStartOffset,topic=([-.\w]+),partition=([0-9]+)
kafka.log:type=Log,name=LogEndOffset,topic=([-.\w]+),partition=([0-9]+)