Difference in performance of running two NNs with TRT in 2 threads vs 2 processes


I have two different neural networks that I run (inference) on the Jetson TX2, and I need them to run at the same time.

When running each of them in a dedicated process (written in C++ using TRT 6 since it is JetPack 4.3, both running in FP16 mode), I managed to get consistent FPS in both processes (~10 FPS in the heavy network and ~10 FPS in the smaller network).

But when I take the exact same code and run it in two threads, the smaller network runs at an inconsistent 6-10 FPS.

Both threads/processes create their own runtime (I tried sharing it; that didn't change anything) and their own execution context, so I would assume they would behave the same way. Is there something shared when you run the code in the same process?


You will need to use a single process with multiple threads to allow the GPU to run the networks concurrently.
This is a constraint of Jetson's GPU context.

So in the two-process scenario, the two processes share the GPU resource in a time-slicing manner and the performance is stable.
But in the two-thread implementation, the GPU tries to fully occupy its resources, which may cause the inconsistent speed you mentioned.

Is your inference job occupying ~99% of the GPU resources?
If yes, you can just use the two-process approach, since there is not much performance to gain from running them concurrently.


I prefer to use one process with two threads because of the memory overhead of the CUDA/TensorRT libraries (the way they are built, the CUDA kernels are not part of shared memory), which is why I am trying to move from two processes to one process with two threads.

A few follow-up questions:

  1. Who does the time-slicing when running in multi-process mode? Is it the driver?
  2. If I understand you correctly about two processes: if I use one thread that receives jobs (inference requests) from, say, two other threads but serializes the execution of the inference, will it create the same time-slicing experience as two processes?

Another option: if I use a single stream from two threads, will it do the same time-slicing as using two processes?


  1. The GPU driver/scheduler.
  2. The two processes I mentioned means two different applications.
    For example, running the same code in two consoles with different models.
  3. No. Tasks launched on the same stream are executed sequentially.
    But you can submit the jobs in turn to get similar behavior (if the inference times are similar and fast enough).


This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.