Quantcast
Channel: Kafka Timeline
Viewing all articles
Browse latest Browse all 1519

Issue with kafka cluster with broker not able to send messages.

$
0
0
*Issue : *Kafka cluster goes to an un-responsive state after some time with
producers getting Socket time-outs on every request made.

*Kafka Version* - 0.8.1

*Machines* : VMs , 2 cores 8gb RAM, linux , 3 node cluster.

ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 16382
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 10240
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) unlimited
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited

*Broker Properties :*

3 topics, 5 partitions, 5 i/o , network threads.
socket.send.buffer.bytes=1048576
socket.receive.buffer.bytes=1048576
socket.request.max.bytes=104857600
zookeeper.connection.timeout.ms=1000000
No replication amongst brokers.

*Producer Properties * -
sync producer,
required.ack = -1 ( Waits for ack from broker, most durable).
kafka.network.BlockingChannel - Created socket with SO_TIMEOUT = 10000
(requested 10000), SO_RCVBUF = 43690 (requested -1), SO_SNDBUF = 102400
(requested 102400).

*NO Consumers*

*QPS From producer to broker* ..
From 4 boxes, every 10-15 seconds, a batch of 200 messages are sent. Each
Message size is about 600 bytes

From 1 box, every 2-3 seconds , a batch of 400 messages are sent. Each
message size is about 1 kb.

*Exception seen on producer after 9-10 hours of seamless message send.*

Exception
kafka.common.FailedToSendMessageException: Failed to send messages after 3
tries.

ERROR : kafka.producer.async.DefaultEventHandler - Failed to send requests
for topics user with correlation ids in [25617,25932]

WARN [2014-04-10 02:45:49,183] [IrisQueue-writer][]
kafka.producer.async.DefaultEventHandler - Failed to send producer request
with correlation id 25931 to broker 4 with data for partitions
[userservice,2]
java.net.SocketTimeoutException
at
sun.nio.ch.SocketAdaptor$SocketInputStream.read(SocketAdaptor.java:201)
at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:86)
at
java.nio.channels.Channels$ReadableByteChannelImpl.read(Channels.java:221)
at kafka.utils.Utils$.read(Utils.scala:375)

*No of Open FDs in one of the broker machines*- 999

*Netstat of kafka broker :*

Lot of sockets in CLOSE_WAIT with high receive queue nos.

Sample :
tcp6 45 0 cp-analytics-kafka:9092 w3-web2.:43628 CLOSE_WAIT
tcp6 198549 0 cp-analytics-kafka:9092 w3-web2.:52064 CLOSE_WAIT
tcp6 45 0 cp-analytics-kafka:9092 w3-web2.:60668 CLOSE_WAIT
tcp6 44 0 cp-analytics-kafka:9092 user-svc4.w3-ops-:44909
CLOSE_WAIT
tcp6 45 0 cp-analytics-kafka:9092 w3-web2.:44970 CLOSE_WAIT
tcp6 209647 0 cp-analytics-kafka:9092 w3-web2.:54775 CLOSE_WAIT
tcp6 45 0 cp-analytics-kafka:9092 w3-web2.:35412 CLOSE_WAIT
tcp6 200731 0 cp-analytics-kafka:9092 w3-web2.:43435 CLOSE_WAIT
tcp6 196401 0 cp-analytics-kafka:9092 w3-web2.:51673 CLOSE_WAIT
tcp6 45 0 cp-analytics-kafka:9092 w3-web2.:59574 CLOSE_WAIT
tcp6 34133 0 cp-analytics-kafka:9092 user-svc5.w3-ops-:36767
CLOSE_WAIT
tcp6 45 0 cp-analytics-kafka:9092 w3-web2.:41807 CLOSE_WAIT
tcp6 208136 0 cp-analytics-kafka:9092 w3-web2.:33401 CLOSE_WAIT
tcp6 44 0 cp-analytics-kafka:9092 user-svc5.w3-ops-:43154
CLOSE_WAIT
tcp6 45 0 cp-analytics-kafka:9092 w3-web2.:42944 CLOSE_WAIT
tcp6 199587 0 cp-analytics-kafka:9092 w3-web2.:47651 CLOSE_WAIT
tcp6 45 0 cp-analytics-kafka:9092 w3-web2.:41392 CLOSE_WAIT
tcp6 44 0 cp-analytics-kafka:9092 user-svc6.w3-ops-:39080
CLOSE_WAIT
tcp6 45 0 cp-analytics-kafka:9092 w3-web2.:43379 CLOSE_WAIT
tcp6 196401 0 cp-analytics-kafka:9092 w3-web2.:49943 CLOSE_WAIT
tcp6 186606 0 cp-analytics-kafka:9092 w3-web2.:40921 CLOSE_WAIT
tcp6 44 0 cp-analytics-kafka:9092 user-svc5.w3-ops-:32971
CLOSE_WAIT
tcp6 196856 0 cp-analytics-kafka:9092 w3-web2.:38064 CLOSE_WAIT
tcp6 44 0 cp-analytics-kafka:9092 user-svc3.w3-ops-:46069
CLOSE_WAIT

*On producer boxes :*

Lot of FIN_WAIT 1 ,2 with high send queues.
tcp6 0 44 user-svc6.w3.nm.f:38114 cp-analytics-kafka:9092
FIN_WAIT1
tcp6 0 44 user-svc6.w3.nm.f:37938 cp-analytics-kafka:9092
FIN_WAIT1
tcp6 0 0 user-svc6.w3.nm.f:59956 cp-analytics-kafka:9092
FIN_WAIT2
tcp6 0 0 user-svc6.w3.nm.f:60047 cp-analytics-kafka:9092
FIN_WAIT2
tcp6 0 60308 user-svc6.w3.nm.f:38010 cp-analytics-kafka:9092
FIN_WAIT1
tcp6 0 44 user-svc6.w3.nm.f:33129 cp-analytics-kafka:9092
FIN_WAIT1
tcp6 0 0 user-svc6.w3.nm.f:60143 cp-analytics-kafka:9092
ESTABLISHED
tcp6 0 32819 user-svc6.w3.nm.f:33198 cp-analytics-kafka:9092
FIN_WAIT1
tcp6 0 44 user-svc6.w3.nm.f:33314 cp-analytics-kafka:9092
FIN_WAIT1

tcp6 0 192584 w3-web2.:37797 cp-analytics-kafka:9092 ESTABLISHED

tcp6 0 0 user-svc5.w3.nm.f:37790 cp-analytics-kafka:9092
FIN_WAIT2
tcp6 0 63018 user-svc5.w3.nm.f:46990 cp-analytics-kafka:9092
FIN_WAIT1
tcp6 0 63018 user-svc5.w3.nm.f:47506 cp-analytics-kafka:9092
FIN_WAIT1
tcp6 0 0 user-svc5.w3.nm.f:37928 cp-analytics-kafka:9092
FIN_WAIT2
tcp6 0 44 user-svc5.w3.nm.f:40756 cp-analytics-kafka:9092
FIN_WAIT1
tcp6 0 44 user-svc5.w3.nm.f:47403 cp-analytics-kafka:9092
FIN_WAIT1
tcp6 0 28894 user-svc5.w3.nm.f:40319 cp-analytics-kafka:9092
FIN_WAIT1
tcp6 0 28893 user-svc5.w3.nm.f:40921 cp-analytics-kafka:9092
ESTABLISHED
tcp6 0 44 user-svc5.w3.nm.f:46920 cp-analytics-kafka:9092
FIN_WAIT1

*Broker Logs : *
In server.log, mostly INFO [2014-04-10 03:01:42,230]
[kafka-processor-9092-0][] kafka.network.Processor - Closing socket
connection to /10.65.40.137.

EXCEPT in 1 machine :

INFO [2014-04-10 02:29:35,492] [kafka-processor-9092-2][]
kafka.network.Processor - Closing socket connection to /10.65.40.137.

No logs in-between. First fatal exception at 07.11.

FATAL [2014-04-10 07:11:55,472] [kafka-scheduler-0][]
kafka.server.ReplicaManager - [Replica Manager on Broker 4]: Error writing
to highwatermark file:
java.io.FileNotFoundException:
/var/lib/fk-3p-kafka/logs/replication-offset-checkpoint.tmp (Too many open
files)
at java.io.FileOutputStream.open(Native Method)
at java.io.FileOutputStream.<init>(FileOutputStream.java:194)
at java.io.FileOutputStream.<init>(FileOutputStream.java:145)
at java.io.FileWriter.<init>(FileWriter.java:73)
at kafka.server.OffsetCheckpoint.write(OffsetCheckpoint.scala:37)
at
kafka.server.ReplicaManager$$anonfun$checkpointHighWatermarks$1.apply(ReplicaManager.scala:447)
at
kafka.server.ReplicaManager$$anonfun$checkpointHighWatermarks$1.apply(ReplicaManager.scala:444)
at scala.collection.immutable.Map$Map1.foreach(Map.scala:105)
at
kafka.server.ReplicaManager.checkpointHighWatermarks(ReplicaManager.scala:444)
at
kafka.server.ReplicaManager$$anonfun$startHighWaterMarksCheckPointThread$1.apply$mcV$sp(ReplicaManager.scala:94)
at kafka.utils.KafkaScheduler$$anon$1.run(KafkaScheduler.scala:100)
at
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
at
java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:317)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:150)
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:98)
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(ScheduledThreadPoolExecutor.java:180)
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:204)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
ERROR [2014-04-10 07:11:58,592] [kafka-acceptor][] kafka.network.Acceptor -
Error in acceptor
java.io.IOException: Too many open files
at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
at
sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:152)
at kafka.network.Acceptor.accept(SocketServer.scala:200)
at kafka.network.Acceptor.run(SocketServer.scala:154)
at java.lang.Thread.run(Thread.java:662)
ERROR [2014-04-10 07:12:07,726] [kafka-scheduler-3][]
kafka.utils.KafkaScheduler - Uncaught exception in scheduled task
'kafka-recovery-point-checkpoint'
java.io.FileNotFoundException:
/var/lib/fk-3p-kafka/logs/recovery-point-offset-checkpoint.tmp (Too many
open files)

*No EXCEPTIONS in any other logs in any other broker machine..*

Attached is the thread dump from leader.

Viewing all articles
Browse latest Browse all 1519

Trending Articles