Why does IExecutionContext::setDeviceMemory() take longer when the context belongs to DLA?

Hi,

I am wondering why the API IExecutionContext::setDeviceMemory() takes longer when the execution context belongs to a DLA engine.

I measured the time for IExecutionContext::setDeviceMemory() with 20 torchvision models.

For contexts belonging to a GPU engine, the average time for setDeviceMemory() is 0.001.

However, for contexts belonging to a DLA engine, the average time for setDeviceMemory() is 0.262.


Why does a context with a DLA engine require much more time than a context with a GPU engine, even though both engines come from the same torchvision model?

I also compared the averages of ‘GPU engine->getDeviceMemorySize()’ and ‘DLA engine->getDeviceMemorySize()’.
The former is 75,046,067 bytes and the latter is 5,738,419 bytes, so the device memory required by the DLA engine is much smaller. Nevertheless, setting the device memory takes longer on average.

I read the explanation of setDeviceMemory() in the TensorRT documentation, but I couldn’t find any information about why a context with a DLA engine requires much more time.

Or is there some additional processing in setDeviceMemory() when the context belongs to a DLA engine?

====

setDeviceMemory()
Set the device memory for use by this execution context.
The memory must be aligned with the CUDA memory alignment property (using cudaGetDeviceProperties()), and its size must be at least that returned by getDeviceMemorySize(). Setting memory to nullptr is acceptable if getDeviceMemorySize() returns 0. If using enqueue() to run the network, the memory is in use from the invocation of enqueue() until network execution is complete.

====

Any help will be greatly appreciated.

Thanks.

yjkim.

Hi,

DLA has its own memory. It may not always use the external DRAM the way the GPU does.
So it’s possible to get different bandwidth results between DLA and GPU.

Below is the DLA hardware document for your reference:
http://nvdla.org/hw/v1/hwarch.html

Thanks.

Hi, @AastaLLL,
I understand that those differences could come from DLA’s own memory.
Thanks for the reply.

However, I have a few more questions.

  1. Does this mean there is data movement for the model configuration (such as weights) during setDeviceMemory()?

  2. How does the data for the model configuration move? Which of the following is correct?

    1. Main memory → external memory for GPU --(if it is for DLA)→ CVSRAM (DLA’s own memory)
    2. Main memory --(if it is for GPU)→ external memory for GPU
       Main memory --(if it is for DLA)→ CVSRAM

  3. Setting device memory for the DLA engine takes about 100 times longer than for the GPU engine. Is this caused only by memory bandwidth, or does setting device memory for a DLA engine involve extra processing, such as some kind of conversion?

Thank you for taking the time to read this.

Regards,

yjkim