I am wondering why the API IExecutionContext::setDeviceMemory() takes longer when the context belongs to a DLA engine.
I measured the time for IExecutionContext::setDeviceMemory() with 20 torchvision models.
For contexts created from GPU engines, the average time for setDeviceMemory() is 0.001.
However, for contexts created from DLA engines, the average time for setDeviceMemory() is 0.262.
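For context, here is roughly how I timed each call (a minimal sketch; engine deserialization, memory allocation, and error handling are omitted, the context is assumed to come from createExecutionContextWithoutDeviceMemory(), and timeSetDeviceMemory is just an illustrative name of mine):

```cpp
#include <chrono>
#include <NvInfer.h>

// Illustrative helper: times a single setDeviceMemory() call on an
// already-created execution context. deviceMem must point to a CUDA
// allocation of at least getDeviceMemorySize() bytes.
double timeSetDeviceMemory(nvinfer1::IExecutionContext& context, void* deviceMem)
{
    auto start = std::chrono::steady_clock::now();
    context.setDeviceMemory(deviceMem);
    auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(end - start).count(); // seconds
}
```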
Why does a context with a DLA engine require so much more time than a context with a GPU engine, even though both engines come from the same torchvision model?
I also compared the average of GPU engine->getDeviceMemorySize() and DLA engine->getDeviceMemorySize().
The former is 75,046,067 bytes and the latter is 5,738,419 bytes, so the device memory required by the DLA engine is much smaller. Yet, on average, setting the device memory takes longer.
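The sizes come straight from the engines, e.g. (a sketch; gpuEngine and dlaEngine stand for the already-deserialized ICudaEngine instances for the same model, and the function name is mine):

```cpp
#include <iostream>
#include <NvInfer.h>

// Assumes gpuEngine and dlaEngine are nvinfer1::ICudaEngine* built from
// the same torchvision model.
void printDeviceMemorySizes(nvinfer1::ICudaEngine* gpuEngine, nvinfer1::ICudaEngine* dlaEngine)
{
    std::cout << "GPU engine: " << gpuEngine->getDeviceMemorySize() << " bytes\n";
    std::cout << "DLA engine: " << dlaEngine->getDeviceMemorySize() << " bytes\n";
}
```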
I read the explanation of setDeviceMemory() in the TensorRT documentation, but I couldn't find any information about why a context with a DLA engine requires much more time.
Or is there some additional processing in setDeviceMemory() when the context belongs to a DLA engine?
For reference, the documentation for setDeviceMemory() says:

Set the device memory for use by this execution context.
The memory must be aligned with the CUDA memory alignment property (using cudaGetDeviceProperties()), and its size must be at least that returned by getDeviceMemorySize(). Setting memory to nullptr is acceptable if getDeviceMemorySize() returns 0. If using enqueue() to run the network, the memory is in use from the invocation of enqueue() until network execution is complete; releasing or otherwise using the memory for other purposes during this time will result in undefined behavior.
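For completeness, this is how I allocate and attach the memory, which should satisfy both the size and alignment requirements since cudaMalloc() returns suitably aligned pointers (a sketch without error checking; attachDeviceMemory is an illustrative name of mine):

```cpp
#include <cuda_runtime_api.h>
#include <NvInfer.h>

// Sketch: allocate and attach the execution context's scratch memory.
// cudaMalloc() returns pointers aligned to at least 256 bytes, which
// satisfies the CUDA memory alignment requirement.
void attachDeviceMemory(nvinfer1::ICudaEngine& engine, nvinfer1::IExecutionContext& context, void*& deviceMem)
{
    size_t size = engine.getDeviceMemorySize();
    deviceMem = nullptr;
    if (size > 0)
    {
        cudaMalloc(&deviceMem, size);
    }
    context.setDeviceMemory(deviceMem); // nullptr is acceptable when size == 0
}
```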
Any help will be greatly appreciated.