Model timing impacted when using both DLA & GPU simultaneously

Hi,
I have created 3 TensorRT models on Xavier NX; let's call them M1, M2, and M3.
M1 is built for DLA 0 using --useDLACore=0.
M2 is built for DLA 1 using --useDLACore=1.
M3 is built for the GPU.
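For reference, the build commands look roughly like this (model paths and --fp16 are illustrative; DLA engines require --fp16 or --int8):

$ trtexec --onnx=m1.onnx --fp16 --useDLACore=0 --allowGPUFallback --saveEngine=m1_dla0.engine
$ trtexec --onnx=m2.onnx --fp16 --useDLACore=1 --allowGPUFallback --saveEngine=m2_dla1.engine
$ trtexec --onnx=m3.onnx --fp16 --saveEngine=m3_gpu.engine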

When I run only M1, inference takes an average of 56 ms per image.
When I run only M2, inference takes an average of 56 ms per image.
When I run only M3, inference takes an average of 31 ms per image.

When I run M1 & M2 simultaneously, there is only a small impact on timing: each inference takes about 60 ms.
When I run M1 & M3 simultaneously, there is also an impact: they take 62 ms and 49 ms per inference, respectively.
When I run M2 & M3 simultaneously, there is also an impact: they take 62 ms and 49 ms per inference, respectively.

The main issue is when I run M1, M2, and M3 all simultaneously: M1 takes 66 ms, M2 takes 66 ms, and M3 is affected the most, taking 77 to 80 ms per inference.
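For reproduction, the simultaneous runs can be approximated by launching the engines as concurrent trtexec processes (engine names are the placeholders from the build sketch above):

$ trtexec --loadEngine=m1_dla0.engine --useDLACore=0 &
$ trtexec --loadEngine=m2_dla1.engine --useDLACore=1 &
$ trtexec --loadEngine=m3_gpu.engine &
$ wait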

Since M1 & M2 both run on DLA, I can understand them affecting each other when run simultaneously. But when all three models run simultaneously, M3, which runs on an entirely different device, is affected the most.

Can you explain this behavior? Am I doing anything wrong? If so, can you please correct me? It's urgent for me to understand this.

Adding more information:
We are seeing low GPU occupancy when DLA inference runs alongside GPU inference. For the same model, GPU occupancy is high when the DLAs are not running.
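For context, GPU occupancy can be observed with something like tegrastats while inference runs; the GR3D_FREQ field in its output reflects GPU load:

$ sudo tegrastats --interval 500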

Adding more information:
It looks like GPU starvation is happening when both DLAs are running.

Hi,

Please increase the environment variable below to see if it helps.

$ export CUDA_DEVICE_MAX_CONNECTIONS=32
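It can also be set per process, for example (engine name is a placeholder):

$ CUDA_DEVICE_MAX_CONNECTIONS=32 trtexec --loadEngine=m3_gpu.engine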

Thanks.


Thanks AastaLLL.
We are seeing an improvement. We will do detailed profiling and update you.
