Running Multiple DetectNets on Jetson TX1


I have trained a DetectNet model using DIGITS and am using the dusty-nv jetson-inference repo as the base for my code. I'm running a slightly modified version of the detectnet-console code, which acts as an API: a Python script sends images over HTTP to a specified port. My test script feeds 40 images to the DetectNet, times each call, and prints the average time over the 40 calls. I am experimenting with running more than one DetectNet instance at the same time on different ports, and running multiple copies of the same script in parallel, each sending images to a different port. Below are the results of my tests with 1, 2, and 3 DetectNets running.
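For reference, the timing loop in that script can be reduced to a small helper like the one below. This is only a sketch — `average_call_time` and the callable passed to it are hypothetical names; in the real script the callable would wrap the HTTP request to the DetectNet port:

```python
import time

def average_call_time(call, images):
    """Call `call(image)` once per image and return the mean latency in seconds."""
    timings = []
    for image in images:
        start = time.time()
        call(image)  # e.g. an HTTP POST of the image to the DetectNet port
        timings.append(time.time() - start)
    return sum(timings) / len(timings)
```

With a list of 40 images, this reproduces the "avg over 40 images" figure reported below.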

I am using TensorRT 1.0, CUDA 8.0, and cuDNN 6.0.

Testing Speed of DetectNet on Jetson TX1
1 DetectNet

  • Port 9081: 0.50647 sec (avg over 40 images) = 1.97 images per second
  • tegrastats: RAM 1738/3994MB (lfb 359x4MB) cpu [19%,5%,27%,5%]@204 GR3D 0%@76 EDP limit 0

2 DetectNets

  • Port 9081: 0.55517 sec, Port 9181: 0.54280 sec (avg over 40 images) = 3.64 images per second
  • tegrastats: RAM 2469/3994MB (lfb 165x4MB) cpu [25%,13%,21%,43%]@518 GR3D 99%@998 EDP limit 0

3 DetectNets

  • Port 9081: 0.75225 sec, Port 9181: 0.73613 sec, Port 9281: 0.69476 sec (avg over 40 images) = 4.12 images per second
  • tegrastats: RAM 3269/3994MB (lfb 67x4MB) cpu [30%,27%,35%,32%]@614 GR3D 99%@998 EDP limit 0

4 DetectNets — ‘Killed’ <-- I believe there isn’t enough memory to run 4 at once
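As a sanity check on the numbers above, the aggregate images-per-second figure is just the sum of the reciprocal of each port's average latency. A quick sketch (`throughput` is a hypothetical helper name):

```python
def throughput(latencies_sec):
    """Aggregate images/sec for instances running concurrently."""
    return sum(1.0 / t for t in latencies_sec)

one   = throughput([0.50647])                    # ~1.97 images/sec
two   = throughput([0.55517, 0.54280])           # ~3.64 images/sec
three = throughput([0.75225, 0.73613, 0.69476])  # ~4.1 images/sec
```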

So based on these results, it seems I can run 3 DetectNets in parallel on one TX1. While each individual inference takes a little longer, I can process 3 images at a time, so overall I will process more images in the same amount of time.

I am wondering a few things:

  1. Is there a way to make one instance of the DetectNet use more of the GPU so I don’t need to run 2 or 3 in parallel?
  2. Do you have any recommendations for optimizations to decrease the inference time of the DetectNet, apart from running multiple instances at once?
  3. Why do I get an inconsistent inference time each time I call the DetectNet? It ranges anywhere from 200ms to 700ms.



You can change execute() to enqueue() and push images to the same queue for inference.
Another alternative is to handle multiple inputs in a single batch.
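A minimal sketch of that single-queue pattern: one worker drains a shared queue and runs inference on batches, so all clients share one loaded network instead of loading it 2-3 times. The actual inference in jetson-inference is C++ (enqueue() is the asynchronous counterpart of execute() on TensorRT's IExecutionContext), but the pattern is the same; here `infer_batch` is a stand-in for the real inference call and all names are hypothetical:

```python
import queue
import threading

def batching_worker(q, infer_batch, batch_size, results):
    """Drain the shared queue and run inference on batches of images."""
    stop = False
    while not stop:
        item = q.get()          # block until the first image arrives
        if item is None:        # sentinel: shut down
            break
        batch = [item]
        while len(batch) < batch_size:
            try:
                nxt = q.get_nowait()  # opportunistically fill the batch
            except queue.Empty:
                break
            if nxt is None:
                stop = True
                break
            batch.append(nxt)
        results.extend(infer_batch(batch))

# Usage with a dummy "model" that just labels each image:
q = queue.Queue()
results = []
worker = threading.Thread(
    target=batching_worker,
    args=(q, lambda batch: [f"det:{x}" for x in batch], 4, results))
worker.start()
for img in range(10):       # all clients push into the same queue
    q.put(img)
q.put(None)
worker.join()
```

A single FIFO queue with one consumer also preserves request order, so responses can be matched back to the clients that submitted them.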

Please try our latest TensorRT 3.0 package for acceleration.

Please remember to lock GPU frequency to max.

sudo ./