Multiple models on DLAs in AGX Xavier 32TOPs

Can the DLAs on the AGX Xavier run multiple models (networks) at the same time? Or are the DLAs bound to a specific network and impractical to swap in at a fast enough rate? Given there are 2 DLA cores, how will it work as the number of models in use on the Xavier increases?

Hi,

It’s recommended to keep the same model deployed on the same DLA (rather than swapping models in and out of a DLA core).

Since there are two DLAs on the Xavier, you can deploy two models, one on each DLA.
You can use setDLACore(…) to specify which DLA core to use.
See: TensorRT: nvinfer1::IRuntime Class Reference
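For example, a minimal sketch of the runtime-side call, assuming you have your own ILogger implementation and a plan file that was already built for DLA (the function and variable names here are just placeholders):

#include <NvInfer.h>
#include <cstddef>

// Sketch: select the DLA core on the IRuntime before deserializing a plan
// that was built for DLA. `logger`, `planData` and `planSize` stand in for
// your own ILogger and serialized engine data.
nvinfer1::ICudaEngine* loadOnDla(nvinfer1::ILogger& logger,
                                 const void* planData, std::size_t planSize,
                                 int dlaCore /* 0 or 1 on Xavier */)
{
    nvinfer1::IRuntime* runtime = nvinfer1::createInferRuntime(logger);
    runtime->setDLACore(dlaCore);
    return runtime->deserializeCudaEngine(planData, planSize);
}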

Thanks

Thank you, but this does not quite answer the question. Assume I want to run inference with 4 models at the same time. Can the DLAs be used for all 4 models?
For example:
model1.setDLACore(0)
model2.setDLACore(1)
model3.setDLACore(0)
model4.setDLACore(1)
Then perform inference using all 4 models.

Hi,

This depends on the use case.
There is some overhead each time a model is deployed on a DLA.
To give a more specific suggestion, we would need to know the exact execution time of each model first.

By the way, how about the GPU?
If the GPU is not occupied, it’s recommended to use the following setting for minimal latency.

model1: DLA 0
model2: DLA 1
model3: GPU 0
model4: GPU 0

Also, you can choose which models to put on the DLAs based on the ratio of layers the DLA supports (unsupported layers fall back to the GPU).
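For reference, a minimal sketch of how a build can be pointed at a DLA core or at the GPU through the TensorRT C++ builder config (network creation and parsing are omitted, so this shows only the device-selection part):

#include <NvInfer.h>

// Sketch: route an engine build to DLA core `dlaCore`, or to the GPU when
// dlaCore < 0. DLA builds require FP16 or INT8, and kGPU_FALLBACK lets the
// layers the DLA cannot run fall back to the GPU.
void targetDevice(nvinfer1::IBuilderConfig& config, int dlaCore)
{
    if (dlaCore >= 0)
    {
        config.setDefaultDeviceType(nvinfer1::DeviceType::kDLA);
        config.setDLACore(dlaCore);
        config.setFlag(nvinfer1::BuilderFlag::kFP16);          // DLA needs FP16/INT8
        config.setFlag(nvinfer1::BuilderFlag::kGPU_FALLBACK);  // unsupported layers -> GPU
    }
    else
    {
        config.setDefaultDeviceType(nvinfer1::DeviceType::kGPU);
    }
}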
Thanks.

@AastaLLL ,
I have one model (the PeopleNet TLT model), and I want to run the same model on the GPU and on both DLAs.
Do I need to load the model three times for this?
Is it possible to load it only twice, one copy for the GPU and one shared copy for the two DLAs? When I load three copies of the model, one for each, the memory of the Jetson Xavier NX runs out, but with only two copies the memory is sufficient.

Hi,
You need to create and load an engine per DLA/GPU. So in your case you should create 3 engines, load a plan file into each one of them, and execute the 3 engines.
You also need to build and run inference for each of the hardware components on a different CPU thread and CUDA stream (as far as I could tell). A minimal sketch of the loading part is below.
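A minimal sketch of the “one engine per device” part, assuming one plan file per target was already built (the file names here are hypothetical) and you have your own ILogger; error handling is omitted:

#include <NvInfer.h>
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

// Sketch: deserialize one engine per hardware unit. A plan built for a DLA
// core must be loaded through a runtime whose DLA core matches.
nvinfer1::ICudaEngine* loadEngine(nvinfer1::IRuntime& runtime,
                                  const std::string& planPath)
{
    std::ifstream file(planPath, std::ios::binary);
    std::vector<char> blob((std::istreambuf_iterator<char>(file)),
                           std::istreambuf_iterator<char>());
    return runtime.deserializeCudaEngine(blob.data(), blob.size());
}

void loadAll(nvinfer1::ILogger& logger)
{
    nvinfer1::IRuntime* gpuRt  = nvinfer1::createInferRuntime(logger);
    nvinfer1::IRuntime* dla0Rt = nvinfer1::createInferRuntime(logger);
    nvinfer1::IRuntime* dla1Rt = nvinfer1::createInferRuntime(logger);
    dla0Rt->setDLACore(0);
    dla1Rt->setDLACore(1);

    nvinfer1::ICudaEngine* gpuEngine  = loadEngine(*gpuRt,  "peoplenet_gpu.plan");
    nvinfer1::ICudaEngine* dla0Engine = loadEngine(*dla0Rt, "peoplenet_dla0.plan");
    nvinfer1::ICudaEngine* dla1Engine = loadEngine(*dla1Rt, "peoplenet_dla1.plan");

    // Create one IExecutionContext per engine and run each from its own
    // CPU thread and CUDA stream (see the sketch further down the thread).
    (void)gpuEngine; (void)dla0Engine; (void)dla1Engine;
}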

thanks
Eyal

Thanks, @eyalhir74 ,
Did you test with deepstream-python-apps? If yes, do I have to run three Python apps with three different configs in separate terminals?

Hi,
Sorry, no. I’ve only done C++ and the C++ TensorRT API.

thanks
Eyal

@eyalhir74 ,
Is it possible to share your code or a reference GitHub repo?

Sorry, it’s proprietary.
However, you can look at the code of trtexec under /usr/src/tensorrt/samples/trtexec.
Basically, create a CPU thread + TRT objects (builder/context/engine/runtime) + a CUDA stream per hardware component you want (GPU, DLA 0, DLA 1) and you’re set. A sketch of that structure is below.
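A minimal sketch of that structure, assuming the engines were already loaded (one per device) and that `bindings` points at device buffers you allocated for each engine’s inputs/outputs; the frame loop and error checking are placeholders:

#include <NvInfer.h>
#include <cuda_runtime_api.h>
#include <thread>

// Sketch: one CPU thread + one CUDA stream + one execution context per
// hardware unit (GPU, DLA 0, DLA 1).
void inferenceWorker(nvinfer1::ICudaEngine* engine, void** bindings)
{
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    nvinfer1::IExecutionContext* context = engine->createExecutionContext();

    for (int i = 0; i < 100; ++i)   // replace with your own frame loop
    {
        context->enqueueV2(bindings, stream, nullptr);
        cudaStreamSynchronize(stream);
    }
    cudaStreamDestroy(stream);
}

void runAll(nvinfer1::ICudaEngine* gpuEngine,  void** gpuBindings,
            nvinfer1::ICudaEngine* dla0Engine, void** dla0Bindings,
            nvinfer1::ICudaEngine* dla1Engine, void** dla1Bindings)
{
    // One worker thread per hardware component; they run concurrently.
    std::thread tGpu (inferenceWorker, gpuEngine,  gpuBindings);
    std::thread tDla0(inferenceWorker, dla0Engine, dla0Bindings);
    std::thread tDla1(inferenceWorker, dla1Engine, dla1Bindings);
    tGpu.join();
    tDla0.join();
    tDla1.join();
}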

thanks
Eyal

@eyalhir74,
Did you run TLT models with your suggested solution?

CPU thread + TRT objects (builder/context/engine/runtime) + CUDA stream per Hardware component.

Both a proprietary version of ResNet and a few other public networks, just to test the solution.
However, it doesn’t matter. Any model that you can build for the DLA/GPU and use with trtexec, you can run in the manner I’ve described (unless, I guess, there are memory issues etc., which I didn’t see).

thanks
Eyal

I’m on the Xavier NX.

How about Python? E.g., if I want to run deepstream_test_1_usb.py on a specific DLA or the GPU? Thanks.