OOM of kafka?

Please provide complete information as applicable to your setup.

• Hardware Platform (Jetson / GPU) T4
• DeepStream Version 5.0
• JetPack Version (valid for Jetson only)
• TensorRT Version 7.0
• NVIDIA GPU Driver Version (valid for GPU only) 440
• Issue Type( questions, new requirements, bugs) bugs
• How to reproduce the issue ? (This is for bugs. Including which sample app is using, the configuration files content, the command line used and other details for reproducing)
Hi, we used test5 with Kafka, and we found our program always shuts down after several hours. It looks like this:


Eventually we suspected a memory leak. We ran a set of comparative experiments and were surprised to find that the memory footprint of the Kafka-enabled run had grown to an incomprehensible size.
[image: memory footprint comparison]
config:

```
enable-perf-measurement=1
perf-measurement-interval-sec=5
#gie-kitti-output-dir=streamscl
[tiled-display]
enable=1
rows=1
columns=1
width=1280
height=720
gpu-id=4
nvbuf-memory-type=0
[sink0]
enable=0
#Type - 1=FakeSink 2=EglSink 3=File
type=2
sync=0
source-id=0
gpu-id=4
nvbuf-memory-type=0
[sink1]
enable=0
#Type - 1=FakeSink 2=EglSink 3=File 4=UDPSink 5=nvoverlaysink 6=MsgConvBroker
type=6
msg-conv-config=dstest5_msgconv_sample_config.txt
#(0): PAYLOAD_DEEPSTREAM - Deepstream schema payload
#(1): PAYLOAD_DEEPSTREAM_MINIMAL - Deepstream schema payload minimal
#(256): PAYLOAD_RESERVED - Reserved type
#(257): PAYLOAD_CUSTOM   - Custom schema payload
msg-conv-payload-type=1
msg-broker-proto-lib=/opt/nvidia/deepstream/deepstream-5.0/lib/libnvds_kafka_proto.so
#Provide your msg-broker-conn-str here
#msg-broker-conn-str=<host>;<port>;<topic>
#topic=<topic>
#msg-broker-conn-str=221.236.26.29;9092;test
msg-broker-conn-str=127.0.0.1;9092;foreign_matter_stat
#msg-broker-conn-str=192.168.1.216;9092;head_count
topic=foreign_matter_stat
#Optional:
#msg-broker-config=../../deepstream-test4/cfg_kafka.txt
source-id=0
[sink2]
enable=0
type=3
#1=mp4 2=mkv
container=1
#1=h264 2=h265 3=mpeg4
## only SW mpeg4 is supported right now.
codec=3
sync=1
bitrate=2000000
output-file=out.mp4
source-id=0

# sink type = 6 by default creates msg converter + broker.
# To use multiple brokers use this group for converter and use
# sink type = 6 with disable-msgconv = 1
[message-converter]
enable=0
msg-conv-config=dstest5_msgconv_sample_config.txt
#(0): PAYLOAD_DEEPSTREAM - Deepstream schema payload
#(1): PAYLOAD_DEEPSTREAM_MINIMAL - Deepstream schema payload minimal
#(256): PAYLOAD_RESERVED - Reserved type
#(257): PAYLOAD_CUSTOM   - Custom schema payload
msg-conv-payload-type=0
# Name of library having custom implementation.
#msg-conv-msg2p-lib=<val>
# Id of component in case only selected message to parse.
#msg-conv-comp-id=<val>
# Configure this group to enable cloud message consumer.
[message-consumer0]
enable=0
proto-lib=/opt/nvidia/deepstream/deepstream-5.0/lib/libnvds_kafka_proto.so
conn-str=172.19.207.65;9092
config-file=cfg_kafka.txt
subscribe-topic-list=broadcast
[osd]
enable=1
gpu-id=4
border-width=1
text-size=15
text-color=1;1;1;1;
text-bg-color=0.3;0.3;0.3;1
font=Arial
show-clock=0
clock-x-offset=800
clock-y-offset=820
clock-text-size=12
clock-color=1;0;0;0
nvbuf-memory-type=0
[streammux]
gpu-id=4
##Boolean property to inform muxer that sources are live
live-source=0
batch-size=1
##time out in usec, to wait after the first buffer is available
##to push the batch even if the complete batch is not formed
batched-push-timeout=40000
## Set muxer output width and height
width=1920
height=1080
##Enable to maintain aspect ratio wrt source, and allow black borders, works
##along with width, height properties
enable-padding=0
nvbuf-memory-type=0
## If set to TRUE, system timestamp will be attached as ntp timestamp
## If set to FALSE, ntp timestamp from rtspsrc, if available, will be attached
attach-sys-ts-as-ntp=0
# config-file property is mandatory for any gie section.
# Other properties are optional and if set will override the properties set in
# the infer config file.
[primary-gie]
enable=1
gpu-id=4
#Required to display the PGIE labels, should be added even when using config-file
#property
batch-size=1
#Required by the app for OSD, not a plugin property
bbox-border-color0=1;0;0;1
bbox-border-color1=0;1;1;1
bbox-border-color2=0;0;1;1
bbox-border-color3=0;1;0;1
interval=0
#Required by the app for SGIE, when used along with config-file property
gie-unique-id=1
nvbuf-memory-type=0
model-engine-file=../models/resnet18_detector.engine
labelfile-path=labelzawu.txt
config-file=config_zawu.txt
#infer-raw-output-dir=../../../../../samples/primary_detector_raw_output/
[tracker]
enable=0
tracker-width=600
tracker-height=288
ll-lib-file=/opt/nvidia/deepstream/deepstream-5.0/lib/libnvds_mot_klt.so
#ll-config-file required for DCF/IOU only
#ll-config-file=tracker_config.yml
#ll-config-file=iou_config.txt
gpu-id=4
#enable-batch-process applicable to DCF only
enable-batch-process=0
[secondary-gie0]
enable=0
gpu-id=4
gie-unique-id=4
operate-on-gie-id=1
operate-on-class-ids=0;
batch-size=1
config-file=../../../../../samples/configs/deepstream-app/config_infer_secondary_vehicletypes.txt
labelfile-path=../../../../../samples/models/Secondary_VehicleTypes/labels.txt
model-engine-file=../../../../../samples/models/Secondary_VehicleTypes/resnet18.caffemodel_b16_gpu0_int8.engine
[secondary-gie1]
enable=0
gpu-id=4
gie-unique-id=5
operate-on-gie-id=1
operate-on-class-ids=0;
batch-size=1
config-file=../../../../../samples/configs/deepstream-app/config_infer_secondary_carcolor.txt
labelfile-path=../../../../../samples/models/Secondary_CarColor/labels.txt
model-engine-file=../../../../../samples/models/Secondary_CarColor/resnet18.caffemodel_b16_gpu0_int8.engine
[secondary-gie2]
enable=0
gpu-id=4
gie-unique-id=6
operate-on-gie-id=1
operate-on-class-ids=0;
batch-size=1
config-file=../../../../../samples/configs/deepstream-app/config_infer_secondary_carmake.txt
labelfile-path=../../../../../samples/models/Secondary_CarMake/labels.txt
model-engine-file=../../../../../samples/models/Secondary_CarMake/resnet18.caffemodel_b16_gpu0_int8.engine
[tests]
file-loop=0
[source0]
enable=45
type=3
uri=rtsp://xxx.xxx
#Type - 1=CameraV4L2 2=URI 3=MultiURI 4=RTSP
#uri=rtsp://192.168.1.112:8554/test
num-sources=1
gpu-id=4
nvbuf-memory-type=0
#pad_index=11
#camera_id=500
# smart record specific fields, valid only for source type=4
smart-record=0
# 0 = mp4, 1 = mkv
smart-rec-container=0
smart-rec-file-prefix=rtsp_record
smart-rec-dir-path=/opt/nvidia/deepstream/deepstream-5.0/sources/apps/sample_apps/zawu/output
# video cache size in seconds
smart-rec-video-cache=100
# default duration of recording in seconds.s
smart-rec-default-duration=30
# duration of recording in seconds.
# this will override default value.
smart-rec-duration=30
# seconds before the current time to start recording.
smart-rec-start-time=30
# value in seconds to dump video stream.
smart-rec-interval=30
drop-frame-interval=20
```

Btw, we just found that after setting tracker=1, memory increased more slowly than before. How could it be like this?!

Are you running the config file with deepstream-app?

Thanks!

Well, as I mentioned before, we ran a modified test5 app with Kafka.

Hi @yohoohhh,
Got it, sorry!
Could you refer to DeepStream SDK FAQ - #14 by mchi to capture the memory monitor log?

Many thanks!

Thanks for your answer, but I notice the link you shared is about Jetson. Will it work on my Tesla T4?

):, my apologies! It does not work for Tesla.

Per your observation, it’s a CPU memory leak, right?

Can it be reproduced with the test5 sample?

Yes, when we run the original test5 with Kafka it seems to have the same problem, and after closing Kafka it goes back to normal. Besides, we also found that memory increased more slowly when we turned the tracker off.

Just want to confirm with you: for the issue you mentioned, is your sink0 enabled or disabled? I see all the sinks disabled in the config you provided.

And I just used the broker sink and ran an around 20-minute test, using the script dump_RSS_mem_runningtime_pid.sh.log (365 Bytes) to capture the real memory loaded into RAM. After two minutes it stabilized at 1119908.
Getting RSS details for process
1119908, 14:28
Getting RSS details for process
1119908, 15:28
Getting RSS details for process
1119908, 16:28
Getting RSS details for process
1119908, 17:28
Getting RSS details for process
1119908, 18:28
Getting RSS details for process
1119908, 19:28
Getting RSS details for process
1119908, 20:28
Getting RSS details for process
1119908, 21:28
Getting RSS details for process
1119908, 22:28
Getting RSS details for process
1119908, 23:28
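(For reference, a minimal sketch of what such an RSS-polling loop might look like; the actual dump_RSS_mem_runningtime_pid.sh may differ, and the PID argument and 60-second interval here are assumptions:)

```bash
#!/bin/bash
# Minimal sketch of an RSS polling loop (the real dump_RSS_mem_runningtime_pid.sh
# may differ); the target PID is passed as the first argument.
PID=$1
while kill -0 "$PID" 2>/dev/null; do
    echo "Getting RSS details for process"
    # VmRSS = resident set size actually held in RAM, in kB
    rss=$(awk '/VmRSS/ {print $2}' /proc/"$PID"/status)
    echo "${rss}, $(date +%H:%M:%S)"
    sleep 60
done
```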

And GPU memory stayed at 633M:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P4            On   | 00000000:AE:00.0 Off |                  Off |
| N/A   53C    P0    24W /  75W |    633MiB /  8121MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

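(If it helps, GPU memory can also be logged over time with nvidia-smi's built-in query/loop options; the 60-second interval and the log file name below are just examples:)

```bash
# Log GPU memory usage once a minute while the test runs (Ctrl+C to stop).
nvidia-smi --query-gpu=timestamp,memory.used --format=csv -l 60 >> gpu_mem.log
```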
And about the leak you mentioned, is it the RES value of the java process that keeps growing that you are talking about? Since Kafka runs on Java.
Will do more tests and get back to you.

Did an around four-and-a-half-hour test; it is stable at 1109120, no memory leak from the test5 sample. test5_config_file_src_infer_tracker_sgie.txt (6.8 KB)
Getting RSS details for process
1109120, 04:26:27

And GPU memory used was stable at 803M:
Mon Dec 7 22:22:31 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.32.00    Driver Version: 455.32.00    CUDA Version: 11.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla P4             Off | 00000000:AE:00.0 Off |                  Off |
| N/A   60C    P0    28W /  75W |    805MiB /  8121MiB |     39%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                       Usage  |
|=============================================================================|
|    0   N/A  N/A     13970      C   deepstream-test5-app              803MiB |
+-----------------------------------------------------------------------------+

This is the Java process memory during that time (forgot to capture the last state):
1356 tse 20 0 6927608 0.980g 28724 S 15.6 6.4 1:16.42 java
1356 tse 20 0 6930664 1.015g 28920 S 0.7 6.6 2:20.62 java
It had around 30M of growth within around an hour; I guess the leak is from Kafka.
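(To keep tracking the broker-side JVM over a longer run, a simple hedged way to sample it periodically with ps could look like the sketch below; it assumes the Kafka broker is the only java process on the machine:)

```bash
# Sample the Kafka broker JVM's resident memory (RSS, in kB) once a minute.
# Assumes the broker is the only "java" process on this machine.
while true; do
    date +%H:%M:%S
    ps -C java -o pid,rss,etime,comm --no-headers
    sleep 60
done
```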

Thanks!
Is there any solution to avoid this leak from Kafka, especially since we feed more than 40 RTSP streams into one program? We also noticed that some Kafka broker configuration files, like server.properties, have parameters such as socket.receive.buffer.bytes that we can set ourselves, and turning the value up and down seems to have an effect on the leak.
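(For reference, these are the broker-side settings we mean; the path is an assumption and the values shown are just Kafka's stock defaults from server.properties, not a recommended fix:)

```bash
# Broker-side socket buffer settings live in Kafka's server.properties
# (the path below is an assumption; adjust to your install).
grep -E '^socket\.(send|receive|request\.max)\.' /opt/kafka/config/server.properties
# Typical stock defaults shipped with Kafka:
#   socket.send.buffer.bytes=102400
#   socket.receive.buffer.bytes=102400
#   socket.request.max.bytes=104857600
```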

This is something beyond the scope of DeepStream. Not sure if installing the latest Kafka package would help solve the issue.