I’m working on a hardware plan for a client who has asked me to deploy 10 DeepStream object-detection models to analyze video from 100+ IP cameras, basically 1 model per 10 cameras.
I’m not sure what kind of hardware rig can support 100+ video-analytics streams. Here are my assumptions:
Assume all OD models are YOLOv4, which consumes about 1.8 GB of GPU memory per instance, so 10 models need roughly 18 GB; that means choosing either 2x NVIDIA T4 (16 GB each) or 1x V100 (32 GB).
100 1080p streams need about 400 Mbps of network bandwidth (roughly 4 Mbps each), so a single 1 GbE NIC can handle it.
I don’t know how many FPS YOLOv4 can achieve on an NVIDIA T4 or V100.
CPU: 2x Xeon E5, 20 cores / 40 threads in total. Maybe that’s enough?
Memory: 64 GB should be OK?
My budget is about $4500.
NVIDIA doesn’t want us to use gaming GPUs, but 4 gaming GPUs really does seem attractive.
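A quick back-of-envelope check of the numbers above (the 1.8 GB per model and ~4 Mbps per 1080p stream figures are the assumptions from this post, not measured values):

```python
# Back-of-envelope sizing for the 100-camera plan.
MODELS = 10
MEM_PER_MODEL_GB = 1.8   # assumed YOLOv4 GPU-memory footprint per instance
STREAMS = 100
MBPS_PER_STREAM = 4      # assumed bitrate of one 1080p H.264 stream

total_mem_gb = MODELS * MEM_PER_MODEL_GB   # 18 GB -> 2x T4 (16 GB) or 1x V100 (32 GB)
total_bw_mbps = STREAMS * MBPS_PER_STREAM  # 400 Mbps -> fits on a 1 GbE NIC

print(total_mem_gb, total_bw_mbps)
```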
We have YOLOv4 perf data for T4 and Xavier @ Max-N.
YOLO 416x416 on Tesla T4 16GB (YOLOv4)

| Max batch size | Batch size | FP16 (fps) | INT8 (fps) |
|----------------|------------|------------|------------|
| 1              | 1          | 164        | 250        |
| 4              | 4          | 225        | 370        |
| 8              | 8          | 232        | 396        |
| 16             | 16         | 234        | 412        |
| 32             | 32         | 234        | 412        |
YOLO 416x416 on AGX Xavier (YOLOv4)

| Max batch size | Batch size | FP16 (fps) | INT8 (fps) |
|----------------|------------|------------|------------|
| 1              | 1          | 60         | 95         |
| 4              | 4          | 72         | 124        |
| 8              | 8          | 75         | 131        |
| 16             | 16         | 77         | 135        |
We can achieve close to 400 fps end-to-end performance with batch size > 8. If your streams run at 30 fps, you can handle about 13 streams per T4, so for 100 streams you would need around 8 T4s. Here’s a GitHub link showing how to run YOLOv4 with DeepStream.
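The stream-count estimate above works out as follows (a quick sketch using the ~400 fps INT8 figure from the T4 table):

```python
import math

# Rough stream-capacity estimate from the T4 INT8 numbers above.
engine_fps = 400   # ~end-to-end YOLOv4 INT8 throughput on a T4 at batch size > 8
stream_fps = 30    # per-camera frame rate

streams_per_t4 = engine_fps // stream_fps     # 13 streams per T4
t4s_needed = math.ceil(100 / streams_per_t4)  # 8 T4s for 100 streams
print(streams_per_t4, t4s_needed)
```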
What is the application that you are building for your client?
If you are trying to increase the number of channels per GPU, we have the Transfer Learning Toolkit (TLT), which offers training on many model architectures, ranging from high accuracy to high performance. It also offers model pruning to improve inference performance. I would suggest taking a look at the models in TLT.
I’ve set the batch size in [primary-gie] equal to the one in [property] in the pgie configuration file (8), but it keeps telling me that the max batch size is 1. However, if I export the model engine file with a static batch size of 4, it runs successfully.
The number of sources in [source0] and the batch size in [streammux] are both set to 8.
The second question is: can a T4 decode over 100 live camera streams simultaneously?
[application]
enable-perf-measurement=1
perf-measurement-interval-sec=5
#gie-kitti-output-dir=streamscl
[tiled-display]
enable=0
rows=1
columns=1
width=1280
height=720
gpu-id=0
#(0): nvbuf-mem-default - Default memory allocated, specific to particular platform
#(1): nvbuf-mem-cuda-pinned - Allocate Pinned/Host cuda memory, applicable for Tesla
#(2): nvbuf-mem-cuda-device - Allocate Device cuda memory, applicable for Tesla
#(3): nvbuf-mem-cuda-unified - Allocate Unified cuda memory, applicable for Tesla
#(4): nvbuf-mem-surface-array - Allocate Surface Array memory, applicable for Jetson
nvbuf-memory-type=0
[source0]
enable=1
#Type - 1=CameraV4L2 2=URI 3=MultiURI
type=3
uri=file:/opt/nvidia/deepstream/deepstream-5.0/samples/streams/sample_1080p_h264.mp4
#uri=file:/opt/nvidia/deepstream/deepstream-5.0/samples/streams/sample_720p.h264
num-sources=16
gpu-id=0
cudadec-memtype=0
[sink0]
enable=1
#Type - 1=FakeSink 2=EglSink 3=File
type=1
sync=0
source-id=0
gpu-id=0
nvbuf-memory-type=0
[osd]
enable=0
gpu-id=0
border-width=1
text-size=12
text-color=1;1;1;1;
text-bg-color=0.3;0.3;0.3;1
font=Serif
show-clock=0
clock-x-offset=800
clock-y-offset=820
clock-text-size=12
clock-color=1;0;0;0
nvbuf-memory-type=0
[streammux]
gpu-id=0
##Boolean property to inform muxer that sources are live
live-source=0
batch-size=16
##time out in usec, to wait after the first buffer is available
##to push the batch even if the complete batch is not formed
batched-push-timeout=40000
## Set muxer output width and height
width=1280
height=720
##Enable to maintain aspect ratio wrt source, and allow black borders, works
##along with width, height properties
enable-padding=0
nvbuf-memory-type=0
# config-file property is mandatory for any gie section.
# Other properties are optional and if set will override the properties set in
# the infer config file.
[primary-gie]
enable=1
gpu-id=0
model-engine-file=yolov4-hat_8.engine
labelfile-path=hat_labels.txt
batch-size=8
#Required by the app for OSD, not a plugin property
bbox-border-color0=1;0;0;1
bbox-border-color1=0;1;1;1
bbox-border-color2=0;0;1;1
bbox-border-color3=0;1;0;1
interval=0
gie-unique-id=1
nvbuf-memory-type=0
config-file=config_infer_primary_yoloV4.txt
[tracker]
enable=0
tracker-width=512
tracker-height=320
ll-lib-file=/opt/nvidia/deepstream/deepstream-5.0/lib/libnvds_mot_klt.so
[tests]
file-loop=0
Thanks for your reply @mchi, I found the problem.
It turns out that my pytorch-YOLOv4 code was not the latest; after pulling the latest code, I can export the ONNX model with a dynamic batch size:
# Setting the last argument (batch size) to 0 exports the ONNX model with a dynamic batch dimension.
python demo_darknet2onnx.py cfg/yolov4-hat.cfg yolov4-hat_7000.weights 233.png 0
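With a dynamic-batch ONNX, the TensorRT engine still needs to be built with an explicit optimization profile, otherwise the engine defaults to max batch 1. A hedged sketch of the `trtexec` invocation (the ONNX filename, the input-tensor name `input`, and the 1..16 batch range are assumptions; check the actual tensor name in your exported model):

```shell
# Build an FP16 engine whose optimization profile covers batch sizes 1..16.
# Assumes the ONNX input tensor is named "input" -- verify against your model.
/usr/src/tensorrt/bin/trtexec \
    --onnx=yolov4-hat_dynamic.onnx \
    --explicitBatch \
    --minShapes=input:1x3x416x416 \
    --optShapes=input:8x3x416x416 \
    --maxShapes=input:16x3x416x416 \
    --fp16 \
    --saveEngine=yolov4-hat_b16.engine
```

The batch-size values in [streammux] and [primary-gie] should then stay within the profile’s min/max range.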
But the top FPS is just 144 (9 fps/stream x 16 streams); it doesn’t reach the performance that @kayccc posted. What’s the best config for peak performance? @mchi
(base) ubuntu@ip-172-31-13-21:~$ sudo nvidia-smi -pm ENABLED -i 0
Persistence mode is already Enabled for GPU 00000000:00:1E.0.
All done.
(base) ubuntu@ip-172-31-13-21:~$ sudo nvidia-smi -ac "5001,1590" -i 0
Applications clocks set to "(MEM 5001, SM 1590)" for GPU 00000000:00:1E.0
All done.
(base) ubuntu@ip-172-31-13-21:~$ nvidia-smi -q -d CLOCK -i 0
==============NVSMI LOG==============
Timestamp : Thu Oct 29 04:46:05 2020
Driver Version : 440.33.01
CUDA Version : 10.2
Attached GPUs : 1
GPU 00000000:00:1E.0
Clocks
Graphics : 300 MHz
SM : 300 MHz
Memory : 405 MHz
Video : 540 MHz
Applications Clocks
Graphics : 1590 MHz
Memory : 5001 MHz
Default Applications Clocks
Graphics : 585 MHz
Memory : 5001 MHz
Max Clocks
Graphics : 1590 MHz
SM : 1590 MHz
Memory : 5001 MHz
Video : 1470 MHz
Max Customer Boost Clocks
Graphics : 1590 MHz
SM Clock Samples
Duration : Not Found
Number of Samples : Not Found
Max : Not Found
Min : Not Found
Avg : Not Found
Memory Clock Samples
Duration : Not Found
Number of Samples : Not Found
Max : Not Found
Min : Not Found
Avg : Not Found
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Hey @kayccc, how do you create the INT8 TensorRT model? I have tried using the DeepStream YOLOv4 NVIDIA GitHub repo and following the instructions, but it just says “If you want to use int8 mode in conversion, extra int8 calibration is needed.”
When I run it, I get: “Calibrator not being used. Users must provide dynamic range for all tensors that are not Int32.”
So I would need to create the calibration file based on images from the cameras I will be using? I don’t have much experience with TensorRT. Do you have the steps you used for the performance results above, or the code you used to create the INT8 calibration files for your testing?
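For reference, the usual way to produce that calibration file with the TensorRT Python API is to implement an entropy calibrator that feeds preprocessed frames from your own cameras. A hedged sketch (the class and method names follow the real `trt.IInt8EntropyCalibrator2` interface; the 1x3x416x416 shape, the file layout, and the preprocessing are assumptions for a YOLOv4-416 model and must match your pipeline):

```python
import os
import numpy as np
import pycuda.driver as cuda
import pycuda.autoinit  # noqa: F401  (creates a CUDA context)
import tensorrt as trt

class YoloEntropyCalibrator(trt.IInt8EntropyCalibrator2):
    """Feeds calibration batches to TensorRT and caches the dynamic ranges."""

    def __init__(self, image_dir, cache_file="calib.cache", batch_size=1):
        trt.IInt8EntropyCalibrator2.__init__(self)
        self.cache_file = cache_file
        self.batch_size = batch_size
        self.files = [os.path.join(image_dir, f) for f in os.listdir(image_dir)]
        self.index = 0
        # One device buffer for the input tensor (NCHW float32, 416x416 assumed).
        self.device_input = cuda.mem_alloc(batch_size * 3 * 416 * 416 * 4)

    def get_batch_size(self):
        return self.batch_size

    def get_batch(self, names):
        if self.index + self.batch_size > len(self.files):
            return None  # no more data: calibration ends
        batch = np.stack([self._load(f) for f in
                          self.files[self.index:self.index + self.batch_size]])
        self.index += self.batch_size
        cuda.memcpy_htod(self.device_input, np.ascontiguousarray(batch))
        return [int(self.device_input)]

    def read_calibration_cache(self):
        # Reuse an existing cache so calibration only runs once.
        if os.path.exists(self.cache_file):
            with open(self.cache_file, "rb") as f:
                return f.read()

    def write_calibration_cache(self, cache):
        with open(self.cache_file, "wb") as f:
            f.write(cache)

    def _load(self, path):
        # Placeholder preprocessing: replace with the exact resize/normalize
        # steps your inference pipeline uses (resize to 416x416, /255, HWC->CHW).
        import cv2
        img = cv2.resize(cv2.imread(path), (416, 416)).astype(np.float32) / 255.0
        return img.transpose(2, 0, 1)
```

At build time the calibrator is attached to the builder config with `config.set_flag(trt.BuilderFlag.INT8)` and `config.int8_calibrator = YoloEntropyCalibrator(...)`; using frames captured from the deployment cameras generally gives better INT8 accuracy than unrelated images.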