Runtime improvement when running several AI algorithms together on a single GPU

Hi All,

We are running a few AI algorithms on a single NVIDIA GPU (a PC with a GeForce card, or an NVIDIA Jetson AGX) and we are interested in improving runtime performance.
We have a few questions regarding this.
Q1. Does the GPU keep the memory of all the AI programs resident in GPU memory at the same time, or does it reload it each time we switch from one AI algorithm to another?
Q2. How can we tell which takes more time: running the AI algorithm itself, or transferring data to/from GPU memory?
Q3. Are there known strategies for improving runtime performance when running several TensorRT-based AI algorithms together?



This depends on the framework you use.
TensorFlow by default occupies all available GPU memory.
With TensorRT, however, you can control the usage via a parameter.

For TensorRT, there is a parameter called workspace that specifies the maximum scratch memory TensorRT is allowed to use when building and running an engine.
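As a rough sketch (assuming the TensorRT C++ API; the surrounding engine-building boilerplate is omitted, and the function name is illustrative), the workspace limit is set on the builder config:

```cpp
#include "NvInfer.h"

// Illustrative helper: cap TensorRT's scratch workspace at 256 MiB.
// Note that a limit that is too small may exclude some kernel tactics
// and slow inference down, so tune it per model.
void limitWorkspace(nvinfer1::IBuilder* builder)
{
    nvinfer1::IBuilderConfig* config = builder->createBuilderConfig();

    // TensorRT 8.4 and later:
    config->setMemoryPoolLimit(nvinfer1::MemoryPoolType::kWORKSPACE,
                               256ULL << 20);  // 256 MiB

    // Older TensorRT releases used the (now deprecated) equivalent:
    // config->setMaxWorkspaceSize(256ULL << 20);
}
```

Limiting the workspace of each engine is one way to let several TensorRT engines coexist in GPU memory at the same time.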

For Xavier, you can try INT8 mode and the DLA to leverage the Tensor Cores and the dedicated inference hardware.
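A minimal sketch of enabling both (TensorRT C++ API assumed; INT8 also requires a calibrator or explicit per-tensor dynamic ranges, which are omitted here):

```cpp
#include "NvInfer.h"

// Illustrative helper: build the engine in INT8 precision and place
// supported layers on DLA core 0, with GPU fallback for the rest.
void enableInt8AndDla(nvinfer1::IBuilderConfig* config)
{
    // INT8 precision makes use of the Tensor Cores.
    config->setFlag(nvinfer1::BuilderFlag::kINT8);

    // Offload supported layers to the Deep Learning Accelerator.
    config->setDefaultDeviceType(nvinfer1::DeviceType::kDLA);
    config->setDLACore(0);

    // Layers the DLA cannot run fall back to the GPU.
    config->setFlag(nvinfer1::BuilderFlag::kGPU_FALLBACK);
}
```

Running some networks on the DLA also frees the GPU to run the other algorithms concurrently, which can help when several engines share one Xavier.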


On Jetson, if you use CUDA mapped memory or CUDA managed memory, you don’t need to perform CPU<->GPU transfers, because the CPU and GPU share the same physical memory on Jetson. Here is a simple wrapper function that allocates CUDA mapped memory (aka zero-copy):
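For example (a minimal sketch; the helper name is illustrative and error handling is kept simple):

```cpp
#include <cuda_runtime.h>
#include <cstring>

// Allocate zero-copy (CUDA mapped) memory. On Jetson the CPU and GPU
// share the same physical pages, so no cudaMemcpy is needed:
// cpuPtr is used by host code, gpuPtr is passed to CUDA kernels.
bool allocMapped(void** cpuPtr, void** gpuPtr, size_t size)
{
    if (!cpuPtr || !gpuPtr || size == 0)
        return false;

    // Pinned host allocation that is mapped into the GPU address space.
    if (cudaHostAlloc(cpuPtr, size, cudaHostAllocMapped) != cudaSuccess)
        return false;

    // Retrieve the device-side pointer for the same memory.
    if (cudaHostGetDevicePointer(gpuPtr, *cpuPtr, 0) != cudaSuccess)
        return false;

    memset(*cpuPtr, 0, size);
    return true;
}
```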

Then you don’t need to perform any cudaMemcpy() when using it.