Why does IExecutionContext::setDeviceMemory() take longer when the context belongs to DLA?

Hi,

I am wondering why the API IExecutionContext::setDeviceMemory() takes much longer when the context belongs to a DLA engine.

I measured the time taken by IExecutionContext::setDeviceMemory() across 20 torchvision models.

For contexts created from GPU engines, the average time for setDeviceMemory() is about 0.001 s.

However, for contexts created from DLA engines, the average time for setDeviceMemory() is about 0.262 s.


Why does the context with a DLA engine require so much more time than the context with a GPU engine, even though both engines are built from the same torchvision model?

I compared the average of GPU engine->getDeviceMemorySize() and DLA engine->getDeviceMemorySize().
The former is 75,046,067 bytes and the latter is 5,738,419 bytes, so the device memory required by the DLA engine is much smaller. Nevertheless, setting the device memory takes longer on average.

I read the explanation of setDeviceMemory() in the TensorRT documentation, but I couldn't find any information about why the context with a DLA engine requires much more time.

Or is there some additional processing in setDeviceMemory() when the context belongs to a DLA engine?

====

setDeviceMemory()
Set the device memory for use by this execution context.
The memory must be aligned with cuda memory alignment property (using cudaGetDeviceProperties()), and its size must be at least that returned by getDeviceMemorySize(). Setting memory to nullptr is acceptable if getDeviceMemorySize() returns 0. If using enqueue() to run the network, the memory is in use from the invocation of enqueue() until network execution is complete. Releasing or otherwise using the memory for other purposes during this time will result in undefined behavior.

===

Any help will be greatly appreciated.

Thanks.

yjkim.

Hi,

DLA has its own memory; it may not always use external DRAM the way the GPU does.
So it's possible to see different bandwidth results between DLA and GPU.

Below is the DLA hardware document for your reference:
http://nvdla.org/hw/v1/hwarch.html

Thanks.

Hi, @AastaLLL,
I understand that those differences could come from DLA's own memory.
Thanks for the reply.

However, I have a few more questions.

  1. Does this mean there is a data transfer of the model configuration (such as weights) during setDeviceMemory()?

  2. How does the model configuration data move?

    1. Main memory → external memory for GPU → (if it is for DLA) CVSRAM (DLA's own memory).
    2. Main memory → (if it is for GPU) external memory for GPU;
       Main memory → (if it is for DLA) CVSRAM.
       Which one is correct?
  3. Setting device memory for a DLA engine takes about 100 times longer than for a GPU engine. Is this caused only by memory bandwidth, or does setting device memory for a DLA engine require extra processing, such as some kind of conversion?

Thank you for taking the time to read this.

Regards,

yjkim

Hi,

Do you have a test example that reproduces the long latency of setDeviceMemory() for DLA?
We would like to double-check this issue with our internal team first.

Thanks.

Hi, @AastaLLL

I uploaded the example code to Google Drive because it contains too many engine files.
(Click here to download full example files)

If you want to see just the code without the engine files, see the following zip file.
codes.zip (203.6 KB)

You can run the sample code as follows.

make
./test

Execution log:

### DLA ###
Load engine from :torchvision_engine/alexnet_dla0.engine
size: 4178432 / 0.00342932 2.528e-06 0.000198728 0.0983222 0.101953
Load engine from :torchvision_engine/googlenet_dla0.engine
size: 10036736 / 0.00287438 2.272e-06 0.000496179 0.111201 0.114574
Load engine from :torchvision_engine/mnasnet0_5_dla0.engine
size: 4578816 / 0.00257229 2.304e-06 0.00025329 0.22472 0.227547
Load engine from :torchvision_engine/mnasnet1_0_dla0.engine
size: 5916160 / 0.00287429 2.112e-06 0.000209288 0.233002 0.236088
Load engine from :torchvision_engine/mobilenet_v2_dla0.engine
size: 4039680 / 0.00310303 1.984e-06 0.000173127 0.153894 0.157172
Load engine from :torchvision_engine/resnet18_dla0.engine
size: 4035584 / 0.00251757 1.889e-06 0.000186343 0.169834 0.17254
Load engine from :torchvision_engine/resnet34_dla0.engine
size: 4035584 / 0.00276174 2.24e-06 0.000166727 0.315595 0.318526
Load engine from :torchvision_engine/resnet50_dla0.engine
size: 4055552 / 0.00271067 2.592e-06 0.00019028 0.296937 0.29984
Load engine from :torchvision_engine/resnet101_dla0.engine
size: 4055552 / 0.00251837 2.112e-06 0.000185767 0.575983 0.578689
Load engine from :torchvision_engine/resnet152_dla0.engine
size: 4056064 / 0.0030605 3.104e-06 0.000197576 0.805471 0.808733
Load engine from :torchvision_engine/resnext50_32x4d_dla0.engine
size: 5612032 / 0.00318316 2.08e-06 0.00020788 0.531934 0.535327
Load engine from :torchvision_engine/resnext101_32x8d_dla0.engine
size: 152822272 / 0.002608 1.92e-06 0.00266353 0.505333 0.510607
Load engine from :torchvision_engine/shufflenet_v2_x0_5_dla0.engine
size: 6438400 / 0.00256576 2.113e-06 0.000255049 0.155892 0.158715
Load engine from :torchvision_engine/shufflenet_v2_x1_0_dla0.engine
size: 10034688 / 0.00262532 2.08e-06 0.000277291 0.16341 0.166314
Load engine from :torchvision_engine/squeezenet1_0_dla0.engine
size: 4035584 / 0.00274356 2.24e-06 0.000175463 0.0473381 0.0502594
Load engine from :torchvision_engine/squeezenet1_1_dla0.engine
size: 4035584 / 0.00232908 1.824e-06 0.000129093 0.0449632 0.0474232
Load engine from :torchvision_engine/vgg11_dla0.engine
size: 4178432 / 0.00271143 2.272e-06 0.00021812 0.202659 0.205591
Load engine from :torchvision_engine/vgg13_dla0.engine
size: 4178432 / 0.00272993 2.08e-06 0.000182055 0.214595 0.217509
Load engine from :torchvision_engine/vgg16_dla0.engine
size: 4178432 / 0.00249866 2.273e-06 0.000203271 0.293288 0.295992
Load engine from :torchvision_engine/vgg19_dla0.engine
size: 4178432 / 0.00272548 2.304e-06 0.000174919 0.359905 0.362808

### GPU ###
Load engine from :torchvision_engine/alexnet_gpu.engine
size: 12888064 / 0.00274228 2.368e-06 0.000357294 0.000294508 0.00339645
Load engine from :torchvision_engine/googlenet_gpu.engine
size: 37899776 / 0.00284299 2.112e-06 0.000756574 0.00133215 0.00493382
Load engine from :torchvision_engine/mnasnet0_5_gpu.engine
size: 24085504 / 0.00286177 1.984e-06 0.000599128 0.00115835 0.00462124
Load engine from :torchvision_engine/mnasnet1_0_gpu.engine
size: 29230080 / 0.00255075 1.632e-06 0.000608312 0.00109143 0.00425213
Load engine from :torchvision_engine/mobilenet_v2_gpu.engine
size: 40933376 / 0.0029221 2.208e-06 0.000997927 0.00137676 0.00529899
Load engine from :torchvision_engine/resnet18_gpu.engine
size: 46546944 / 0.00302767 2.368e-06 0.000894178 0.000870146 0.00479436
Load engine from :torchvision_engine/resnet34_gpu.engine
size: 68833792 / 0.00285269 2.112e-06 0.00139445 0.00135682 0.00560607
Load engine from :torchvision_engine/resnet50_gpu.engine
size: 89426944 / 0.00254375 1.76e-06 0.0018041 0.00180305 0.00615265
Load engine from :torchvision_engine/resnet101_gpu.engine
size: 129444352 / 0.00288299 2.432e-06 0.00253965 0.00343805 0.00886313
Load engine from :torchvision_engine/resnet152_gpu.engine
size: 158797824 / 0.0029461 2.464e-06 0.00306329 0.00493174 0.0109436
Load engine from :torchvision_engine/resnext50_32x4d_gpu.engine
size: 96128000 / 0.00283416 2.656e-06 0.00181514 0.00343389 0.00808585
Load engine from :torchvision_engine/resnext101_32x8d_gpu.engine
size: 257940992 / 0.00269537 2.016e-06 0.0046533 0.00555452 0.0129052
Load engine from :torchvision_engine/shufflenet_v2_x0_5_gpu.engine
size: 11915776 / 0.00248688 2.4e-06 0.000299372 0.00115518 0.00394383
Load engine from :torchvision_engine/shufflenet_v2_x1_0_gpu.engine
size: 14624256 / 0.00278094 2.304e-06 0.000453457 0.00114622 0.00438292
Load engine from :torchvision_engine/squeezenet1_0_gpu.engine
size: 31676416 / 0.00263575 1.984e-06 0.000681147 0.000781214 0.0041001
Load engine from :torchvision_engine/squeezenet1_1_gpu.engine
size: 22952960 / 0.00251571 2.336e-06 0.000514516 0.000796383 0.00382895
Load engine from :torchvision_engine/vgg11_gpu.engine
size: 115208704 / 0.00253082 2.145e-06 0.00223356 0.000380527 0.00514705
Load engine from :torchvision_engine/vgg13_gpu.engine
size: 156099584 / 0.00236672 2.112e-06 0.00313497 0.000440561 0.00594436
Load engine from :torchvision_engine/vgg16_gpu.engine
size: 166744576 / 0.00277963 1.92e-06 0.00316089 0.000563638 0.00650608
Load engine from :torchvision_engine/vgg19_gpu.engine
size: 169361920 / 0.00379096 2.464e-06 0.0298464 0.000702012 0.0343419

If you find out anything about this, please let me know.
I look forward to hearing from you.

Regards.

yjkim.

Hi,

Thanks for your source.
We confirmed that we can also reproduce this issue in our environment.

$ ./test
### DLA ###
Load engine from :googlenet.dla
size: 403456 / 0.0057505 2.624e-06 5.6672e-05 0.176056 0.181865

### GPU ###
Load engine from :googlenet.gpu
size: 44965888 / 0.0154719 2.4e-06 0.0118249 0.003223 0.0305222

We are checking this issue with our internal team.
Will let you know for any progress.

Thanks.


Thanks for the notification.

I look forward to hearing from you.

Regards.

yjkim.

Hi,

We got some update from our internal team.

The underlying mechanisms for DLA and GPU differ in TensorRT 7.1.
For DLA, setDeviceMemory() triggers some evaluation and therefore requires more execution time.

In the next TensorRT release, the behavior for DLA and GPU will be similar,
so you won't see the long latency when calling setDeviceMemory() on a DLA-based engine.

Thanks.
