What exactly does TensorRT do when calling IExecutionContext::setDeviceMemory()?


I am wondering what TensorRT does when IExecutionContext::setDeviceMemory() is called.

I found that a context that belongs to the DLA spends about 100 times longer in setDeviceMemory() than a context that belongs to the GPU. (link)

According to the above link, the DLA context is expected to take more time because the DLA's own memory has much lower bandwidth.

I had assumed that setDeviceMemory() only involves allocation-related work, with no copying.

However, the fact that the DLA context is much slower than the GPU context because of its low memory bandwidth suggests that some data copying might happen during setDeviceMemory().

So, what I want to know is:

When calling setDeviceMemory(), does the DLA context copy data needed for inference (such as weights) from main memory to the DLA's own memory?

Also, does the GPU context involve any data copying, or only allocation, during setDeviceMemory()?
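For context, the call I'm asking about is the one in the standard externally-managed-memory pattern below. This is a minimal sketch of that flow (assuming a deserialized engine; error handling and the actual enqueue are omitted), not my exact benchmark code:

```cpp
#include <NvInfer.h>
#include <cuda_runtime_api.h>

void runWithExternalMemory(nvinfer1::ICudaEngine* engine)
{
    // Create a context without its own internal scratch allocation.
    nvinfer1::IExecutionContext* context =
        engine->createExecutionContextWithoutDeviceMemory();

    // Ask the engine how much scratch memory inference needs.
    size_t const size = engine->getDeviceMemorySize();

    // Allocate the memory ourselves and hand it to the context.
    // This setDeviceMemory() call is the one whose cost differs
    // ~100x between GPU and DLA contexts in my measurements.
    void* deviceMem = nullptr;
    cudaMalloc(&deviceMem, size);
    context->setDeviceMemory(deviceMem);

    // ... enqueue inference here ...

    cudaFree(deviceMem);
}
```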

Thanks in advance.



Is this a duplicate of "Why IExecutionContext::SetDeviceMemory() takes longer time when the context belongs to DLA?"