Performance improvement on Jetson Nano

Hi,
I have a mask-detection deep learning model that I need to run on a Jetson Nano. I created a Django application and am able to run it successfully. However, the response time is far too long: it takes around 40-45 seconds for the model to make predictions and return a response.

When I enter the command python manage.py runserver, it first takes around 25-30 seconds just to get the application hosted. When the API is called via Postman, the console prints the messages below before showing the output.

2020-09-15 04:53:44.411896: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1
2020-09-15 04:53:44.428168: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:948] ARM64 does not support NUMA - returning NUMA node zero
2020-09-15 04:53:44.428311: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties:
pciBusID: 0000:00:00.0 name: NVIDIA Tegra X1 computeCapability: 5.3
coreClock: 0.9216GHz coreCount: 1 deviceMemorySize: 3.86GiB deviceMemoryBandwidth: 194.55MiB/s
2020-09-15 04:53:44.428431: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.2
2020-09-15 04:53:44.507829: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
2020-09-15 04:53:44.564077: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10
2020-09-15 04:53:44.723041: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10
2020-09-15 04:53:44.783545: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10
2020-09-15 04:53:44.847698: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.10
2020-09-15 04:53:44.852282: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.8
2020-09-15 04:53:44.853122: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:948] ARM64 does not support NUMA - returning NUMA node zero
2020-09-15 04:53:44.854005: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:948] ARM64 does not support NUMA - returning NUMA node zero
2020-09-15 04:53:44.854210: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1703] Adding visible gpu devices: 0
2020-09-15 04:53:44.887233: W tensorflow/core/platform/profile_utils/cpu_utils.cc:106] Failed to find bogomips or clock in /proc/cpuinfo; cannot determine CPU frequency
2020-09-15 04:53:44.887838: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7f6cab6850 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-09-15 04:53:44.887887: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2020-09-15 04:53:44.962824: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:948] ARM64 does not support NUMA - returning NUMA node zero
2020-09-15 04:53:44.963116: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7f6d13da30 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-09-15 04:53:44.963160: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): NVIDIA Tegra X1, Compute Capability 5.3
2020-09-15 04:53:44.963656: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:948] ARM64 does not support NUMA - returning NUMA node zero
2020-09-15 04:53:44.963775: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties:
pciBusID: 0000:00:00.0 name: NVIDIA Tegra X1 computeCapability: 5.3
coreClock: 0.9216GHz coreCount: 1 deviceMemorySize: 3.86GiB deviceMemoryBandwidth: 194.55MiB/s
2020-09-15 04:53:44.963959: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.2
2020-09-15 04:53:44.964064: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
2020-09-15 04:53:44.964148: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10
2020-09-15 04:53:44.964222: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10
2020-09-15 04:53:44.964296: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10
2020-09-15 04:53:44.964367: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.10
2020-09-15 04:53:44.964438: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.8
2020-09-15 04:53:44.964688: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:948] ARM64 does not support NUMA - returning NUMA node zero
2020-09-15 04:53:44.964962: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:948] ARM64 does not support NUMA - returning NUMA node zero
2020-09-15 04:53:44.965033: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1703] Adding visible gpu devices: 0
2020-09-15 04:53:44.965145: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.2
2020-09-15 04:53:47.513224: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-09-15 04:53:47.513326: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1108] 0
2020-09-15 04:53:47.513351: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] 0: N
2020-09-15 04:53:47.513897: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:948] ARM64 does not support NUMA - returning NUMA node zero
2020-09-15 04:53:47.514272: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:948] ARM64 does not support NUMA - returning NUMA node zero
2020-09-15 04:53:47.514447: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1247] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 799 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X1, pci bus id: 0000:00:00.0, compute capability: 5.3)
2020-09-15 04:54:08.433900: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
2020-09-15 04:54:09.585552: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.8
2020-09-15 04:54:48.340647: W tensorflow/core/common_runtime/bfc_allocator.cc:245] Allocator (GPU_0_bfc) ran out of memory trying to allocate 918.27MiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.

[15/Sep/2020 04:54:50] "POST /PredictMask/ HTTP/1.1" 200 98

Can you help me optimize the performance here?

Hi,

Have you maximized the device performance first?

$ sudo nvpmodel -m 0
$ sudo jetson_clocks

Thanks.

@AastaLLL Okay. Will try that and get back.

@AastaLLL Thank you for your suggestion. I used those commands to maximize the device performance first, but didn't see any difference in the response time. I tested the response multiple times and got nearly the same results.

There are 5 major things my API does:

  1. Writing the file to disk.
  2. Loading the model and image from disk.
  3. Pre-processing the file.
  4. Making predictions on the image.
  5. Deleting the locally written file.
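As an aside, steps 1 and 5 (writing the upload to disk and deleting it afterwards) can often be skipped entirely by keeping the upload in memory. A minimal sketch that assumes only that the uploaded object exposes a `.read()` method, as Django's `UploadedFile` does; the downstream decoder call is hypothetical:

```python
import io

def preprocess_upload(django_file):
    """Read an uploaded file into memory instead of round-tripping
    it through the filesystem. `django_file` is assumed to be any
    file-like object with a .read() method (e.g. a Django UploadedFile).
    """
    buf = io.BytesIO(django_file.read())
    # Hand `buf` straight to the image decoder, e.g. (hypothetical):
    #   img = PIL.Image.open(buf)
    return buf

# Usage with a file-like stand-in for request.FILES["image"]:
fake_upload = io.BytesIO(b"\x89PNG fake image bytes")
buf = preprocess_upload(fake_upload)
assert buf.getvalue() == b"\x89PNG fake image bytes"
```

This removes two disk operations per request, though on its own it only saves a fraction of a second here.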

I measured the time for each of these steps in my program, and I noticed it takes around 20 seconds to load the model and image from disk, and around 6 seconds to make predictions on the image. These two steps consume most of the time; the rest of the operations complete within a second. This cost is paid every time the API is called, so if I could reduce the time taken to load the model and image, that would reduce my response time as well. I am open to trying any suggestions.
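Since reloading the model dominates the per-request time, one common fix is to load it once per process and reuse the cached object on every request. A minimal sketch of the pattern; the real load call is shown only as a comment with a hypothetical path, and a short sleep stands in for the expensive disk load:

```python
import functools
import time

@functools.lru_cache(maxsize=1)
def get_model():
    # The expensive load runs only on the first call per process;
    # every later request reuses the cached object.
    # In the real view this would be something like (hypothetical path):
    #   from tensorflow.keras.models import load_model
    #   return load_model("/path/to/mask_model.h5")
    time.sleep(0.1)  # stand-in for the ~20 s disk load
    return {"name": "mask_model"}

# First call pays the load cost; later calls hit the cache.
t0 = time.time(); m1 = get_model(); first_call = time.time() - t0
t0 = time.time(); m2 = get_model(); second_call = time.time() - t0
assert m1 is m2               # same cached object on every request
assert second_call < first_call
```

In a Django app the same effect can be had by loading the model at module import time (e.g. in the app's `views.py` or `AppConfig.ready()`), so only the first request after server start pays the ~20 s cost.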

Thanks

Hi,

Which model do you use? Is it Mask RCNN?

We have a similar but TLT-based Mask R-CNN model.
It can achieve around 1 fps on the Jetson Nano, based on this doc:
https://developer.nvidia.com/blog/training-instance-segmentation-models-using-maskrcnn-on-the-transfer-learning-toolkit/

So it seems there is some room for your app to optimize.
We recommend converting the model into a TensorRT engine first.
You can find the detailed steps in this sample:
https://github.com/NVIDIA/TensorRT/tree/master/samples/opensource/sampleUffMaskRCNN
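If the model is already a TensorFlow SavedModel, a lighter-weight route than the full UFF sample is TF-TRT, which rewrites the supported subgraphs into TensorRT engines from within TensorFlow. A rough sketch under those assumptions; the directory names are placeholders, and the conversion must run on the Jetson itself with a TensorRT-enabled TensorFlow build:

```python
from tensorflow.python.compiler.tensorrt import trt_convert as trt

# Placeholder paths for the original and converted SavedModels.
SAVED_MODEL_DIR = "mask_model_saved"
TRT_MODEL_DIR = "mask_model_trt"

# FP16 roughly halves the memory footprint, which matters given the
# allocator warning above, and the Tegra X1 GPU supports it natively.
params = trt.DEFAULT_TRT_CONVERSION_PARAMS._replace(
    precision_mode=trt.TrtPrecisionMode.FP16)

converter = trt.TrtGraphConverterV2(
    input_saved_model_dir=SAVED_MODEL_DIR,
    conversion_params=params)
converter.convert()
converter.save(TRT_MODEL_DIR)  # load this converted model at startup instead
```

Combined with loading the converted model once at process start, this should cut both the per-request load time and the ~6 s inference time.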

Thanks.

@AastaLLL I used a CNN model. I will convert the model using TensorRT. Thank you for the help.