DeepStream kafka shows "Receive failed: Disconnected" and cannot reconnect

Please provide complete information as applicable to your setup.

• Hardware Platform (Jetson / GPU) Tesla T4 GPU
• DeepStream Version 5.0 (5.0-dp-20.04-devel docker)
• JetPack Version (valid for Jetson only)
• TensorRT Version
• NVIDIA GPU Driver Version (valid for GPU only) 10.2
• Issue Type( questions, new requirements, bugs) bugs
• How to reproduce the issue ? (This is for bugs. Including which sample app is using, the configuration files content, the command line used and other details for reproducing)

I am using my modified version of deepstream-test5-app and use Kafka broker to receive and send messages into and from deepstream application. After couple of hours, the application will output error message like this, and cannot receive any new messages anymore.

%3|1602260329.352|FAIL|rdkafka#producer-1| [thrd:kafka:9092/bootstrap]: kafka:9092/bootstrap: Receive failed: Disconnected
%3|1602260329.352|ERROR|rdkafka#producer-1| [thrd:kafka:9092/bootstrap]: kafka:9092/bootstrap: Receive failed: Disconnected
%3|1602260329.363|FAIL|rdkafka#producer-1| [thrd:172.17.0.1:9092/1001]: 172.17.0.1:9092/1001: Receive failed: Disconnected
%3|1602260329.363|ERROR|rdkafka#producer-1| [thrd:172.17.0.1:9092/1001]: 172.17.0.1:9092/1001: Receive failed: Disconnected
%3|1602260329.363|ERROR|rdkafka#producer-1| [thrd:172.17.0.1:9092/1001]: 2/2 brokers are down
%3|1602260929.454|FAIL|rdkafka#producer-1| [thrd:kafka:9092/bootstrap]: kafka:9092/bootstrap: Receive failed: Disconnected
%3|1602260929.455|ERROR|rdkafka#producer-1| [thrd:kafka:9092/bootstrap]: kafka:9092/bootstrap: Receive failed: Disconnected
%3|1602260929.465|FAIL|rdkafka#producer-1| [thrd:172.17.0.1:9092/1001]: 172.17.0.1:9092/1001: Receive failed: Disconnected
%3|1602260929.465|ERROR|rdkafka#producer-1| [thrd:172.17.0.1:9092/1001]: 172.17.0.1:9092/1001: Receive failed: Disconnected
%3|1602260929.465|ERROR|rdkafka#producer-1| [thrd:172.17.0.1:9092/1001]: 2/2 brokers are down
%3|1602261229.356|FAIL|rdkafka#producer-3| [thrd:172.17.0.1:9092/1001]: 172.17.0.1:9092/1001: Receive failed: Disconnected
%3|1602261229.356|ERROR|rdkafka#producer-3| [thrd:172.17.0.1:9092/1001]: 172.17.0.1:9092/1001: Receive failed: Disconnected
%3|1602261529.561|FAIL|rdkafka#producer-1| [thrd:kafka:9092/bootstrap]: kafka:9092/bootstrap: Receive failed: Disconnected
%3|1602261529.561|ERROR|rdkafka#producer-1| [thrd:kafka:9092/bootstrap]: kafka:9092/bootstrap: Receive failed: Disconnected
%3|1602261529.577|FAIL|rdkafka#producer-1| [thrd:172.17.0.1:9092/1001]: 172.17.0.1:9092/1001: Receive failed: Disconnected
%3|1602261529.577|ERROR|rdkafka#producer-1| [thrd:172.17.0.1:9092/1001]: 172.17.0.1:9092/1001: Receive failed: Disconnected
%3|1602261529.577|ERROR|rdkafka#producer-1| [thrd:172.17.0.1:9092/1001]: 2/2 brokers are down
%3|1602261829.353|FAIL|rdkafka#consumer-2| [thrd:kafka:9092/bootstrap]: kafka:9092/bootstrap: Receive failed: Disconnected
%3|1602261829.353|ERROR|rdkafka#consumer-2| [thrd:kafka:9092/bootstrap]: kafka:9092/bootstrap: Receive failed: Disconnected
%3|1602262129.670|FAIL|rdkafka#producer-1| [thrd:kafka:9092/bootstrap]: kafka:9092/bootstrap: Receive failed: Disconnected
%3|1602262129.670|ERROR|rdkafka#producer-1| [thrd:kafka:9092/bootstrap]: kafka:9092/bootstrap: Receive failed: Disconnected
%3|1602262129.681|FAIL|rdkafka#producer-1| [thrd:172.17.0.1:9092/1001]: 172.17.0.1:9092/1001: Receive failed: Disconnected
%3|1602262129.681|ERROR|rdkafka#producer-1| [thrd:172.17.0.1:9092/1001]: 172.17.0.1:9092/1001: Receive failed: Disconnected
%3|1602262129.681|ERROR|rdkafka#producer-1| [thrd:172.17.0.1:9092/1001]: 2/2 brokers are down

FYI, the Kafka related configuration is shown here:

[sink1]
enable=1
gpu-id=0
#Type - 1=FakeSink 2=EglSink 3=File 4=UDPSink 5=nvoverlaysink 6=MsgConvBroker
type=6
msg-conv-config=dstest5_msgconv_config.txt
disable-msgconv=1
#(0): PAYLOAD_DEEPSTREAM - Deepstream schema payload
#(1): PAYLOAD_DEEPSTREAM_MINIMAL - Deepstream schema payload minimal
#(256): PAYLOAD_RESERVED - Reserved type
#(257): PAYLOAD_CUSTOM   - Custom schema payload
msg-conv-payload-type=0
msg-broker-proto-lib=/opt/nvidia/deepstream/deepstream-5.0/lib/libnvds_kafka_proto.so
#Provide your msg-broker-conn-str here
msg-broker-conn-str=kafka;9092;metromind-start
topic=metromind-start
#msg-broker-conn-str=kafka;9092;metromind-start-test
#topic=metromind-start-test
#Optional:
msg-broker-config=cfg_kafka.txt

# sink type = 6 by default creates msg converter + broker.
# To use multiple brokers use this group for converter and use
# sink type = 6 with disable-msgconv = 1
[message-converter]
enable=1
msg-conv-config=dstest5_msgconv_sample_config.txt
#(0): PAYLOAD_DEEPSTREAM - Deepstream schema payload
#(1): PAYLOAD_DEEPSTREAM_MINIMAL - Deepstream schema payload minimal
#(256): PAYLOAD_RESERVED - Reserved type
#(257): PAYLOAD_CUSTOM   - Custom schema payload
msg-conv-payload-type=0
# Name of library having custom implementation.
#msg-conv-msg2p-lib=<val>
# Id of component in case only selected message to parse.
#msg-conv-comp-id=<val>

# Configure this group to enable cloud message consumer.
[message-consumer0]
enable=1
proto-lib=/opt/nvidia/deepstream/deepstream-5.0/lib/libnvds_kafka_proto.so
conn-str=kafka;9092
config-file=cfg_kafka.txt
subscribe-topic-list=source-control;add-source;delete-source;mp4-segment

The cfg_kafka.txt is set as follows:

[message-broker]
partition-key=sensor.id
proto-cfg="socket.keepalive.enable=true;retries=5;connection.max.idle.ms=100000"

I find this link for kafka related to this problem. It says Kafka will automatically reconnect whenever the connection is closed, however it seems fail to reconnect.

•Requirement

Is there a way to quickly fix this problem?

If not, how can I modify the deepstream-test5-app so it can raise error and exit upon this situation?

Hi
Sorry for a late reply,
Do you have this issue using original test5 sample?

I will try to do experiment on original test5 application. Since the issue usually appears after hours of normal running, it needs time to verify if the original version have similar issue.

Strangely, I find the application output the error log in RTSP mode but is still running well. Maybe the above bug is related to message consumer?

%3|1602659258.804|FAIL|rdkafka#producer-1| [thrd:kafka:9092/bootstrap]: kafka:9092/bootstrap: Receive failed: Disconnected
%3|1602659258.804|ERROR|rdkafka#producer-1| [thrd:kafka:9092/bootstrap]: kafka:9092/bootstrap: Receive failed: Disconnected
%3|1602659258.804|FAIL|rdkafka#producer-1| [thrd:172.17.0.1:9092/1001]: 172.17.0.1:9092/1001: Receive failed: Disconnected
%3|1602659258.804|ERROR|rdkafka#producer-1| [thrd:172.17.0.1:9092/1001]: 172.17.0.1:9092/1001: Receive failed: Disconnected
%3|1602659258.804|ERROR|rdkafka#producer-1| [thrd:172.17.0.1:9092/1001]: 2/2 brokers are down

I am wondering if this is related to kafka message broker itself or not. If it is, have you ever seen that before in other situations?

Hi, any updates on above comment?

Hi,
I did some test on my side using original sample test5, but i can not repro your issue, from below log, you can see it will retry connection after broker down(i close the broker manually), and when the broker up, it will continually receive message, never mind the parsing error, since it not in correct format, but it did receive the messages.
but not gurantee cover your case, please check your code, if any findings, you can share here, thanks.

%3|1603772334.571|FAIL|rdkafka#producer-3| [thrd:tsestation:9092/0]: tsestation:9092/0: Receive failed: Disconnected
%3|1603772334.571|ERROR|rdkafka#producer-3| [thrd:tsestation:9092/0]: tsestation:9092/0: Receive failed: Disconnected
%3|1603772334.573|FAIL|rdkafka#producer-1| [thrd:tsestation:9092/0]: tsestation:9092/0: Receive failed: Disconnected
%3|1603772334.573|ERROR|rdkafka#producer-1| [thrd:tsestation:9092/0]: tsestation:9092/0: Receive failed: Disconnected
%3|1603772334.573|FAIL|rdkafka#producer-1| [thrd:10.23.136.84:9092/bootstrap]: 10.23.136.84:9092/bootstrap: Receive failed: Disconnected
%3|1603772334.574|ERROR|rdkafka#producer-1| [thrd:10.23.136.84:9092/bootstrap]: 10.23.136.84:9092/bootstrap: Receive failed: Disconnected
%3|1603772334.574|ERROR|rdkafka#producer-1| [thrd:10.23.136.84:9092/bootstrap]: 2/2 brokers are down
%3|1603772334.574|FAIL|rdkafka#producer-3| [thrd:10.23.136.84:9092/bootstrap]: 10.23.136.84:9092/bootstrap: Receive failed: Disconnected
%3|1603772334.574|ERROR|rdkafka#producer-3| [thrd:10.23.136.84:9092/bootstrap]: 10.23.136.84:9092/bootstrap: Receive failed: Disconnected
%3|1603772334.574|ERROR|rdkafka#producer-3| [thrd:10.23.136.84:9092/bootstrap]: 2/2 brokers are down
%3|1603772334.575|FAIL|rdkafka#consumer-2| [thrd:10.23.136.84:9092/bootstrap]: 10.23.136.84:9092/bootstrap: Receive failed: Disconnected
%3|1603772334.575|ERROR|rdkafka#consumer-2| [thrd:10.23.136.84:9092/bootstrap]: 10.23.136.84:9092/bootstrap: Receive failed: Disconnected
%3|1603772334.575|FAIL|rdkafka#consumer-2| [thrd:tsestation:9092/0]: tsestation:9092/0: Receive failed: Disconnected
%3|1603772334.575|ERROR|rdkafka#consumer-2| [thrd:tsestation:9092/0]: tsestation:9092/0: Receive failed: Disconnected
%3|1603772334.575|ERROR|rdkafka#consumer-2| [thrd:tsestation:9092/0]: 2/2 brokers are down
%3|1603772334.673|FAIL|rdkafka#producer-3| [thrd:tsestation:9092/0]: tsestation:9092/0: Connect to ipv4#10.23.136.84:9092 failed: Connection refused
%3|1603772334.673|ERROR|rdkafka#producer-3| [thrd:tsestation:9092/0]: tsestation:9092/0: Connect to ipv4#10.23.136.84:9092 failed: Connection refused
%3|1603772334.674|FAIL|rdkafka#producer-1| [thrd:10.23.136.84:9092/bootstrap]: 10.23.136.84:9092/bootstrap: Connect to ipv4#10.23.136.84:9092 failed: Connection refused
%3|1603772334.674|ERROR|rdkafka#producer-1| [thrd:10.23.136.84:9092/bootstrap]: 10.23.136.84:9092/bootstrap: Connect to ipv4#10.23.136.84:9092 failed: Connection refused
%3|1603772334.675|FAIL|rdkafka#producer-3| [thrd:10.23.136.84:9092/bootstrap]: 10.23.136.84:9092/bootstrap: Connect to ipv4#10.23.136.84:9092 failed: Connection refused
%3|1603772334.675|ERROR|rdkafka#producer-3| [thrd:10.23.136.84:9092/bootstrap]: 10.23.136.84:9092/bootstrap: Connect to ipv4#10.23.136.84:9092 failed: Connection refused
%3|1603772334.676|FAIL|rdkafka#producer-1| [thrd:tsestation:9092/0]: tsestation:9092/0: Connect to ipv4#10.23.136.84:9092 failed: Connection refused
%3|1603772334.676|ERROR|rdkafka#producer-1| [thrd:tsestation:9092/0]: tsestation:9092/0: Connect to ipv4#10.23.136.84:9092 failed: Connection refused
%3|1603772334.676|FAIL|rdkafka#consumer-2| [thrd:10.23.136.84:9092/bootstrap]: 10.23.136.84:9092/bootstrap: Connect to ipv4#10.23.136.84:9092 failed: Connection refused
%3|1603772334.676|ERROR|rdkafka#consumer-2| [thrd:10.23.136.84:9092/bootstrap]: 10.23.136.84:9092/bootstrap: Connect to ipv4#10.23.136.84:9092 failed: Connection refused
%3|1603772334.677|FAIL|rdkafka#consumer-2| [thrd:tsestation:9092/0]: tsestation:9092/0: Connect to ipv4#10.23.136.84:9092 failed: Connection refused
%3|1603772334.677|ERROR|rdkafka#consumer-2| [thrd:tsestation:9092/0]: tsestation:9092/0: Connect to ipv4#10.23.136.84:9092 failed: Connection refused
Tue Oct 27 12:18:56 2020
**PERF: 23.59 (24.39) 23.36 (24.41)
Tue Oct 27 12:19:01 2020
**PERF: 23.64 (24.36) 23.07 (24.37)
Tue Oct 27 12:19:06 2020
**PERF: 23.12 (24.31) 23.93 (24.35)
Tue Oct 27 12:19:11 2020
**PERF: 23.19 (24.27) 22.99 (24.30)
Tue Oct 27 12:19:16 2020
**PERF: 22.85 (24.22) 23.45 (24.27)
Tue Oct 27 12:19:21 2020
**PERF: 23.39 (24.20) 22.99 (24.23)
Tue Oct 27 12:19:26 2020
**PERF: 23.19 (24.16) 22.79 (24.18)
Tue Oct 27 12:19:31 2020
**PERF: 22.84 (24.12) 23.25 (24.15)
Tue Oct 27 12:19:36 2020
**PERF: 23.27 (24.09) 23.27 (24.12)
Tue Oct 27 12:19:41 2020
**PERF: 22.44 (24.04) 22.84 (24.08)
Tue Oct 27 12:19:46 2020
**PERF: 23.04 (24.01) 22.84 (24.05)
Tue Oct 27 12:19:51 2020
**PERF: 22.97 (23.98) 22.57 (24.00)
Tue Oct 27 12:19:56 2020
**PERF: 22.39 (23.94) 22.59 (23.97)
Tue Oct 27 12:20:01 2020
**PERF: 22.35 (23.90) 22.95 (23.94)
Tue Oct 27 12:20:06 2020
**PERF: 22.17 (23.85) 22.77 (23.91)
Tue Oct 27 12:20:11 2020
**PERF: 22.37 (23.81) 22.37 (23.87)

**PERF: FPS 0 (Avg) FPS 1 (Avg)
Tue Oct 27 12:20:16 2020
**PERF: 22.90 (23.79) 22.59 (23.83)
Tue Oct 27 12:20:21 2020
**PERF: 22.10 (23.75) 22.20 (23.80)
Tue Oct 27 12:20:26 2020
**PERF: 21.89 (23.71) 21.89 (23.75)
Tue Oct 27 12:20:31 2020
**PERF: 21.68 (23.66) 22.48 (23.73)
Tue Oct 27 12:20:36 2020
**PERF: 22.25 (23.63) 22.25 (23.69)
Tue Oct 27 12:20:41 2020
**PERF: 21.95 (23.59) 21.75 (23.65)
Tue Oct 27 12:20:46 2020
**PERF: 21.79 (23.55) 21.79 (23.61)
Tue Oct 27 12:20:51 2020
**PERF: 21.82 (23.51) 21.82 (23.57)
Tue Oct 27 12:20:56 2020
**PERF: 21.60 (23.47) 22.01 (23.53)
Tue Oct 27 12:21:01 2020
**PERF: 24.40 (23.49) 22.43 (23.47)
Tue Oct 27 12:21:06 2020
**PERF: 28.71 (23.60) 11.78 (23.26)
Tue Oct 27 12:21:11 2020
**PERF: 25.03 (23.62) 25.03 (23.30)
Tue Oct 27 12:21:16 2020
**PERF: 24.59 (23.64) 24.39 (23.32)
Tue Oct 27 12:21:21 2020
**PERF: 24.57 (23.66) 24.97 (23.35)
Tue Oct 27 12:21:26 2020
**PERF: 25.10 (23.68) 24.70 (23.37)
Tue Oct 27 12:21:31 2020
**PERF: 25.13 (23.72) 24.90 (23.40)
Tue Oct 27 12:21:36 2020
**PERF: 24.94 (23.74) 24.57 (23.42)
Tue Oct 27 12:21:41 2020
**PERF: 24.64 (23.75) 24.64 (23.44)
Tue Oct 27 12:21:46 2020
**PERF: 24.27 (23.76) 25.06 (23.47)
Tue Oct 27 12:21:51 2020
**PERF: 24.25 (23.77) 24.65 (23.49)

**PERF: FPS 0 (Avg) FPS 1 (Avg)
Tue Oct 27 12:21:56 2020
**PERF: 24.60 (23.78) 24.40 (23.51)
** ERROR: <nvds_c2d_parse_cloud_message:99>: Error in parsing json message :1:9: Parse error: unexpected identifier `jahdjfhaf’, expected value
** WARN: <subscribe_cb:63>: error in message parsing

** ERROR: <nvds_c2d_parse_cloud_message:99>: Error in parsing json message :1:5: Parse error: unexpected identifier `ajdfh’, expected value
** WARN: <subscribe_cb:63>: error in message parsing

Tue Oct 27 12:22:01 2020
**PERF: 24.42 (23.79) 24.42 (23.52)
** ERROR: <nvds_c2d_parse_cloud_message:99>: Error in parsing json message :1:4: Parse error: unexpected identifier `ahfd’, expected value
** WARN: <subscribe_cb:63>: error in message parsing

Sorry for late reply, the problem still exists, but I have to work on somethng else for a while… I will notice you if I make any progress.

Thank you!

It looks like we’re also having trouble reconnecting.

Actually after debuging, I find the problem is not related to kafka reconnection. It actually tried to reconnect successfully. But my pipeline itself was stuck so the kafka get timeout again after a period…

I wander how to solve it?? By giving more producer buffer??

I mean your porblem may be not related to Kafka. Try to debug more carefully your DeepStream pipeline itself.

Hi, I have met the same problem, and I also tried close kafka message to test my DS pipeline, it works well for more than 24 hours. So I am not sure the key problem is kafka or pipeline.