DeepStream Python SSD: not utilising the GPU and running slowly

Device: Jetson Xavier NX
JetPack: 4.5.1 [L4T 32.5.1]

I tried to run the sample Python app from here:
[deepstream_python_apps/apps/deepstream-ssd-parser at master · NVIDIA-AI-IOT/deepstream_python_apps · GitHub](https://github.com/NVIDIA-AI-IOT/deepstream_python_apps/tree/master/apps/deepstream-ssd-parser)

I followed the instructions in that repository.

The GPU instance group was set as follows:
instance_group {
  kind: KIND_GPU
  count: 1
  gpus: 0
}

Here is the log from loading the model and starting the pipeline:

thukhi@thukhi:/opt/nvidia/deepstream/deepstream-5.1/sources/deepstream_python_apps/apps/deepstream-ssd-parser$ sudo python3 deepstream_ssd_parser.py 
../../../../samples/streams/sample_720p.h264 
Creating Pipeline 
 
Creating Source
Creating H264Parser
Creating Decoder
Creating NvStreamMux
Creating Nvinferserver
Creating Nvvidconv
Creating OSD (nvosd)
Creating Queue
Creating Converter 2 (nvvidconv2)
Creating capsfilter
Creating Encoder
Creating Code Parser
Creating Container
Creating Sink
Playing file ../../../../samples/streams/sample_720p.h264 
Adding elements to Pipeline 

Linking elements in the Pipeline 

Starting pipeline 

Opening in BLOCKING MODE
Opening in BLOCKING MODE 
I0519 06:46:02.314171 15820 pinned_memory_manager.cc:199] Pinned memory pool is created at '0x2030ba000' with size 67108864
I0519 06:46:02.314522 15820 cuda_memory_manager.cc:99] CUDA memory pool is created on device 0 with size 67108864
I0519 06:46:02.317155 15820 server.cc:141] 
+---------+--------+------+
| Backend | Config | Path |
+---------+--------+------+
+---------+--------+------+

I0519 06:46:02.317284 15820 server.cc:184] 
+-------+---------+--------+
| Model | Version | Status |
+-------+---------+--------+
+-------+---------+--------+

I0519 06:46:02.317763 15820 tritonserver.cc:1620] 
+----------------------------------+----------------------------------------------------------------------------------------------------------------+
| Option                           | Value                                                                                                          |
+----------------------------------+----------------------------------------------------------------------------------------------------------------+
| server_id                        | triton                                                                                                         |
| server_version                   | 2.5.0                                                                                                          |
| server_extensions                | classification sequence model_repository schedule_policy model_configuration system_shared_memory cuda_shared_ |
|                                  | memory binary_tensor_data statistics                                                                           |
| model_repository_path[0]         | /opt/nvidia/deepstream/deepstream-5.1/samples/trtis_model_repo                                                 |
| model_control_mode               | MODE_EXPLICIT                                                                                                  |
| strict_model_config              | 0                                                                                                              |
| pinned_memory_pool_byte_size     | 67108864                                                                                                       |
| cuda_memory_pool_byte_size{0}    | 67108864                                                                                                       |
| min_supported_compute_capability | 5.3                                                                                                            |
| strict_readiness                 | 1                                                                                                              |
| exit_timeout                     | 30                                                                                                             |
+----------------------------------+----------------------------------------------------------------------------------------------------------------+

I0519 06:46:02.321476 15820 model_repository_manager.cc:810] loading: ssd_inception_v2_coco_2018_01_28:1
2021-05-19 08:46:03.025265: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.2
I0519 06:46:03.543636 15820 tensorflow.cc:1876] TRITONBACKEND_Initialize: tensorflow
I0519 06:46:03.543728 15820 tensorflow.cc:1889] Triton TRITONBACKEND API version: 1.0
I0519 06:46:03.543797 15820 tensorflow.cc:1895] 'tensorflow' TRITONBACKEND API version: 1.0
I0519 06:46:03.543833 15820 tensorflow.cc:1916] backend configuration:
{"cmdline":{"allow-soft-placement":"true","gpu-memory-fraction":"0.400000"}}
I0519 06:46:03.544064 15820 tensorflow.cc:1978] TRITONBACKEND_ModelInitialize: ssd_inception_v2_coco_2018_01_28 (version 1)
I0519 06:46:03.549827 15820 tensorflow.cc:2028] TRITONBACKEND_ModelInstanceInitialize: ssd_inception_v2_coco_2018_01_28_0 (GPU device 0)
2021-05-19 08:46:13.322728: W tensorflow/core/platform/profile_utils/cpu_utils.cc:98] Failed to find bogomips in /proc/cpuinfo; cannot determine CPU frequency
2021-05-19 08:46:13.324289: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7f400508b0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2021-05-19 08:46:13.324404: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2021-05-19 08:46:13.324787: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2021-05-19 08:46:13.325027: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1049] ARM64 does not support NUMA - returning NUMA node zero
2021-05-19 08:46:13.325221: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1665] Found device 0 with properties: 
name: Xavier major: 7 minor: 2 memoryClockRate(GHz): 1.109
pciBusID: 0000:00:00.0
2021-05-19 08:46:13.325322: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.2
2021-05-19 08:46:13.325506: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.10
2021-05-19 08:46:13.347415: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2021-05-19 08:46:13.386696: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2021-05-19 08:46:13.406021: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2021-05-19 08:46:13.432913: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.10
2021-05-19 08:46:13.433305: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2021-05-19 08:46:13.433513: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1049] ARM64 does not support NUMA - returning NUMA node zero
2021-05-19 08:46:13.433738: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1049] ARM64 does not support NUMA - returning NUMA node zero
2021-05-19 08:46:13.433826: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1793] Adding visible gpu devices: 0
2021-05-19 08:46:13.434233: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1206] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-05-19 08:46:13.434330: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1212]      0 
2021-05-19 08:46:13.434400: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1225] 0:   N 
2021-05-19 08:46:13.434682: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1049] ARM64 does not support NUMA - returning NUMA node zero
2021-05-19 08:46:13.434913: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1049] ARM64 does not support NUMA - returning NUMA node zero
2021-05-19 08:46:13.435118: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1049] ARM64 does not support NUMA - returning NUMA node zero
2021-05-19 08:46:13.435317: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1351] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3106 MB memory) -> physical GPU (device: 0, name: Xavier, pci bus id: 0000:00:00.0, compute capability: 7.2)
2021-05-19 08:46:13.441723: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7f40057b60 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2021-05-19 08:46:13.441828: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Xavier, Compute Capability 7.2
I0519 06:46:15.267015 15820 model_repository_manager.cc:983] successfully loaded 'ssd_inception_v2_coco_2018_01_28' version 1
INFO: TrtISBackend id:5 initialized model: ssd_inception_v2_coco_2018_01_28
2021-05-19 08:46:28.736767: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2021-05-19 08:47:07.804698: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.10
NvMMLiteOpen : Block : BlockType = 261 
NVMEDIA: Reading vendor.tegra.display-size : status: 6 
NvMMLiteBlockCreate : Block : BlockType = 261 
Frame Number=0 Number of Objects=5 Vehicle_count=2 Person_count=2
Frame Number=1 Number of Objects=5 Vehicle_count=2 Person_count=2
Frame Number=2 Number of Objects=5 Vehicle_count=2 Person_count=2
Frame Number=3 Number of Objects=5 Vehicle_count=2 Person_count=2
Frame Number=4 Number of Objects=5 Vehicle_count=2 Person_count=2
Frame Number=5 Number of Objects=5 Vehicle_count=2 Person_count=2
Frame Number=6 Number of Objects=5 Vehicle_count=2 Person_count=2

The log shows that the GPU is detected and that the model instance is created on GPU device 0, but the GPU does not appear to be utilised while the pipeline is running.

Could you please help?

Thanks in advance.

Thanks for your question.

We are trying to reproduce this issue internally
and will update you when we have more information.

Thanks, awaiting your reply!

There has been no update from you for a while, so we assume this is no longer an issue.
We are therefore closing this topic. If you need further support, please open a new one.
Thanks.

Hi,

This is the tegrastats output from our testing:

...
... EMC_FREQ 23%@1600 GR3D_FREQ 4%@1109 
... EMC_FREQ 23%@1600 GR3D_FREQ 51%@1109 
... EMC_FREQ 23%@1600 GR3D_FREQ 90%@1109 
... EMC_FREQ 23%@1600 GR3D_FREQ 82%@1109 
... EMC_FREQ 23%@1600 GR3D_FREQ 52%@1109 
... EMC_FREQ 22%@1600 GR3D_FREQ 11%@1109 
... EMC_FREQ 22%@1600 GR3D_FREQ 1%@1109 
... EMC_FREQ 23%@1600 GR3D_FREQ 2%@1109 
...

The sample does use the GPU (GR3D_FREQ peaks at 90%), but it does not keep the GPU fully occupied the whole time; the load drops back to 1-4% between bursts.
The cause might be data bandwidth or the TensorFlow implementation.
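
To compare your own run against these figures, here is a minimal sketch that pulls the GR3D_FREQ percentages out of tegrastats lines and summarises them. The function names (`gpu_utilisation`, `summarise`) are made up for illustration, and the sample lines are the truncated ones from this post. It assumes each tegrastats line contains a field of the form `GR3D_FREQ <n>%`, as it does here.

```python
import re

# Matches the GPU load field in a tegrastats line, e.g. "GR3D_FREQ 51%@1109".
# The "@<MHz>" suffix is not required, so "GR3D_FREQ 51%" also matches.
GR3D_PATTERN = re.compile(r"GR3D_FREQ (\d+)%")

# Truncated sample lines copied from the tegrastats output in this post.
SAMPLE = [
    "... EMC_FREQ 23%@1600 GR3D_FREQ 4%@1109",
    "... EMC_FREQ 23%@1600 GR3D_FREQ 51%@1109",
    "... EMC_FREQ 23%@1600 GR3D_FREQ 90%@1109",
    "... EMC_FREQ 23%@1600 GR3D_FREQ 82%@1109",
    "... EMC_FREQ 23%@1600 GR3D_FREQ 52%@1109",
    "... EMC_FREQ 22%@1600 GR3D_FREQ 11%@1109",
    "... EMC_FREQ 22%@1600 GR3D_FREQ 1%@1109",
    "... EMC_FREQ 23%@1600 GR3D_FREQ 2%@1109",
]


def gpu_utilisation(lines):
    """Return the GR3D_FREQ percentages found in the given lines, in order."""
    values = []
    for line in lines:
        match = GR3D_PATTERN.search(line)
        if match:  # lines without a GR3D_FREQ field are skipped
            values.append(int(match.group(1)))
    return values


def summarise(values):
    """Return sample count, min, max and mean load, or None if there is no data."""
    if not values:
        return None
    return {
        "samples": len(values),
        "min": min(values),
        "max": max(values),
        "mean": sum(values) / len(values),
    }


if __name__ == "__main__":
    print(summarise(gpu_utilisation(SAMPLE)))
```

The same two functions can be applied to the lines of a saved tegrastats capture taken while deepstream_ssd_parser.py is running. A high maximum with a low mean would match the bursty pattern shown above.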

Is this consistent with your observation?

Thanks.