Hey all,
This is Xinyao from Dropbox. I am trying to get Kafka adopted widely inside Dropbox.
I set up five nodes with 2x replication. The cluster handles all of our traffic well. The problem I am seeing now is that one host (probably a leader) gets kicked out of the ISR. Here is some related log output:
[2014-04-16 15:31:09,596] INFO Closing socket connection to /10.12.99.184.
(kafka.network.Processor)
[2014-04-16 15:31:09,921] INFO Closing socket connection to /10.12.99.184.
(kafka.network.Processor)
[2014-04-16 15:31:13,142] INFO Closing socket connection to /10.12.99.153.
(kafka.network.Processor)
[2014-04-16 15:31:24,101] INFO Closing socket connection to /10.12.62.181.
(kafka.network.Processor)
[2014-04-16 15:31:27,157] INFO Rolled new log segment for
'photos-mobile_request_user0_logs-0003-4' in 1 ms. (kafka.log.Log)
[2014-04-16 15:31:30,709] INFO Rolled new log segment for
'api-oauth-unnecessary-token-with-nonempty-secret-3' in 1 ms.
(kafka.log.Log)
[2014-04-16 15:31:32,310] INFO Rolled new log segment for
'api-files-post-unnecessary-params-3' in 1 ms. (kafka.log.Log)
[2014-04-16 15:31:42,150] INFO Closing socket connection to /10.12.99.184.
(kafka.network.Processor)
[2014-04-16 15:31:42,151] INFO Closing socket connection to /10.12.99.184.
(kafka.network.Processor)
[2014-04-16 15:31:47,557] INFO Closing socket connection to /10.16.105.60.
(kafka.network.Processor)
[2014-04-16 15:31:47,571] INFO Closing socket connection to /10.16.103.58.
(kafka.network.Processor)
[2014-04-16 15:31:47,632] INFO Closing socket connection to /10.16.54.38.
(kafka.network.Processor)
[2014-04-16 15:31:47,879] INFO Closing socket connection to /10.16.102.22.
(kafka.network.Processor)
[2014-04-16 15:31:54,954] INFO Rolled new log segment for
'api-not_your_cursor-2' in 1 ms. (kafka.log.Log)
[2014-04-16 15:31:56,353] INFO Closing socket connection to /10.12.98.160.
(kafka.network.Processor)
[2014-04-16 15:31:56,353] INFO Closing socket connection to /10.12.98.160.
(kafka.network.Processor)
[2014-04-16 15:31:57,136] INFO Closing socket connection to /10.12.62.181.
(kafka.network.Processor)
[2014-04-16 15:31:57,137] INFO Closing socket connection to /10.12.62.181.
(kafka.network.Processor)
[2014-04-16 15:32:05,540] INFO Partition
[client-fsinfo-from-formatted-traces-0012,0] on broker 1: Shrinking ISR for
partition [client-fsinfo-from-formatted-traces-0012,0] from 1,2 to 1
(kafka.cluster.Partition)
[2014-04-16 15:32:05,579] ERROR Conditional update of path
/brokers/topics/client-fsinfo-from-formatted-traces-0012/partitions/0/state
with data
{"controller_epoch":1,"leader":1,"version":1,"leader_epoch":0,"isr":[1]}
and expected
version 0 failed due to
org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode =
BadVersion for
/brokers/topics/client-fsinfo-from-formatted-traces-0012/partitions/0/state
(kafka.utils.ZkUtils$)
[2014-04-16 15:32:05,581] INFO Partition
[client-fsinfo-from-formatted-traces-0012,0] on broker 1: Cached zkVersion
[0] not equal to that in zookeeper, skip updating ISR
(kafka.cluster.Partition)
[2014-04-16 15:32:05,582] INFO Partition [api-shmodel-non-ssl,4] on broker
1: Shrinking ISR for partition [api-shmodel-non-ssl,4] from 1,2 to 1
(kafka.cluster.Partition)
[2014-04-16 15:32:05,596] ERROR Conditional update of path
/brokers/topics/api-shmodel-non-ssl/partitions/4/state with data
{"controller_epoch":1,"leader":1,"version":1,"leader_epoch":0,"isr":[1]}
and expected version 0 failed due
to org.apache.zookeeper.KeeperException$BadVersionException:
KeeperErrorCode = BadVersion for
/brokers/topics/api-shmodel-non-ssl/partitions/4/state
(kafka.utils.ZkUtils$)
[2014-04-16 15:32:05,596] INFO Partition [api-shmodel-non-ssl,4] on broker
1: Cached zkVersion [0] not equal to that in zookeeper, skip updating ISR
(kafka.cluster.Partition)
[2014-04-16 15:32:05,596] INFO Partition
[server-network_logs_errors-0012,0] on broker 1: Shrinking ISR for
partition [server-network_logs_errors-0012,0] from 1,2 to 1
(kafka.cluster.Partition)
[2014-04-16 15:32:05,605] ERROR Conditional update of path
/brokers/topics/server-network_logs_errors-0012/partitions/0/state with
data
{"controller_epoch":1,"leader":1,"version":1,"leader_epoch":0,"isr":[1]}
and expected version 0
failed due to org.apache.zookeeper.KeeperException$BadVersionException:
KeeperErrorCode = BadVersion for
/brokers/topics/server-network_logs_errors-0012/partitions/0/state
(kafka.utils.ZkUtils$)
[2014-04-16 15:32:05,605] INFO Partition
[server-network_logs_errors-0012,0] on broker 1: Cached zkVersion [0] not
equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
You can see that something went wrong around 15:31-15:32. Here is the current topic/replica information (truncated):
Topic:analytics-ab_raw_metric PartitionCount:5 ReplicationFactor:2 Configs:
Topic: analytics-ab_raw_metric Partition: 0 Leader: 3 Replicas: 3,5 Isr: 3,5
Topic: analytics-ab_raw_metric Partition: 1 Leader: 4 Replicas: 4,1 Isr: 1,4
Topic: analytics-ab_raw_metric Partition: 2 Leader: 5 Replicas: 5,2 Isr: 5,2
Topic: analytics-ab_raw_metric Partition: 3 Leader: 3 Replicas: 1,3 Isr: 3
Topic: analytics-ab_raw_metric Partition: 4 Leader: 2 Replicas: 2,4 Isr: 2,4
Topic:analytics-api_log2-0000 PartitionCount:5 ReplicationFactor:2 Configs:
Topic: analytics-api_log2-0000 Partition: 0 Leader: 3 Replicas: 3,5 Isr: 3,5
Topic: analytics-api_log2-0000 Partition: 1 Leader: 4 Replicas: 4,1 Isr: 1,4
Topic: analytics-api_log2-0000 Partition: 2 Leader: 5 Replicas: 5,2 Isr: 5,2
Topic: analytics-api_log2-0000 Partition: 3 Leader: 3 Replicas: 1,3 Isr: 3
Topic: analytics-api_log2-0000 Partition: 4 Leader: 2 Replicas: 2,4 Isr: 2,4
Topic:analytics-api_log2-0001 PartitionCount:5 ReplicationFactor:2 Configs:
Topic: analytics-api_log2-0001 Partition: 0 Leader: 3 Replicas: 1,3 Isr: 3
Topic: analytics-api_log2-0001 Partition: 1 Leader: 2 Replicas: 2,4 Isr: 2,4
Topic: analytics-api_log2-0001 Partition: 2 Leader: 3 Replicas: 3,5 Isr: 3,5
Topic: analytics-api_log2-0001 Partition: 3 Leader: 4 Replicas: 4,1 Isr: 1,4
Topic: analytics-api_log2-0001 Partition: 4 Leader: 5 Replicas: 5,2 Isr: 5,2
Topic:analytics-api_log2-0002 PartitionCount:5 ReplicationFactor:2 Configs:
Topic: analytics-api_log2-0002 Partition: 0 Leader: 5 Replicas: 5,1 Isr: 5,1
Topic: analytics-api_log2-0002 Partition: 1 Leader: 2 Replicas: 1,2 Isr: 2
Topic: analytics-api_log2-0002 Partition: 2 Leader: 2 Replicas: 2,3 Isr: 2,3
Topic: analytics-api_log2-0002 Partition: 3 Leader: 3 Replicas: 3,4 Isr: 3,4
Topic: analytics-api_log2-0002 Partition: 4 Leader: 4 Replicas: 4,5 Isr: 5,4
Topic:analytics-api_log2-0003 PartitionCount:5 ReplicationFactor:2 Configs:
Topic: analytics-api_log2-0003 Partition: 0 Leader: 5 Replicas: 5,4 Isr: 5,4
Topic: analytics-api_log2-0003 Partition: 1 Leader: 5 Replicas: 1,5 Isr: 5
Topic: analytics-api_log2-0003 Partition: 2 Leader: 2 Replicas: 2,1 Isr: 2,1
Topic: analytics-api_log2-0003 Partition: 3 Leader: 3 Replicas: 3,2 Isr: 3,2
Topic: analytics-api_log2-0003 Partition: 4 Leader: 4 Replicas: 4,3 Isr: 3,4
....
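For reference, the listing above comes from the topics tool; this is roughly the command I used (ZooKeeper address is a placeholder):

# full per-partition leader/ISR detail, as shown above
bin/kafka-topics.sh --zookeeper zkhost:2181 --describe

# only partitions whose ISR is smaller than the replica set,
# if I am reading the tool's options right
bin/kafka-topics.sh --zookeeper zkhost:2181 --describe --under-replicated-partitions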
I suspect that broker 1 lost leadership for all of its partitions around that timestamp, but it is still in sync as a follower. I followed the "How to reduce churn" advice and increased replica.lag.max.messages. However, I don't think that helps here, since I suspect it is the leader itself being kicked out of the ISR rather than a lagging follower.
I cannot reproduce this consistently; it has happened twice, roughly once a day over the past two days. These are the only configs I changed in server.properties:
default.replication.factor=2
auto.leader.rebalance.enable=true
log.roll.hours=1
replica.lag.max.messages=50000
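For completeness, I have not touched the time-based lag setting, so it should still be at its default (10 seconds, if I am reading the 0.8 docs correctly); a follower is dropped from the ISR if it sends no fetch request for that long:

replica.lag.time.max.ms=10000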
Let me know if you need more logs to debug.