Questions about implementation of trtexec, especially when using both DLA and GPU

Hi, I have some questions about trtexec.

  1. I’m confused by the option named ‘useSpinWait’.

It is written here that if I use cudaEventBlockingSync, the CPU thread will busy-wait, that is, spin-wait.
https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__EVENT.html#group__CUDART__EVENT_1g949aa42b30ae9e622f6ba0787129ff22

But in trtexec.cpp, the flag is chosen like this:
unsigned int cudaEventFlags = gParams.useSpinWait ? cudaEventDefault : cudaEventBlockingSync;

So it looks like we have to use the ‘useSpinWait’ option to avoid busy-waiting.
Could you explain this?

  2. Why do I have to use “cudaEventDefault” to achieve the expected sum of performance?
    (It is written here: https://docs.nvidia.com/jetson/jetpack/release-notes/index.html#early-access-notes)

Could you please explain why using both DLA and GPU makes a difference if I use cudaEventBlockingSync?
And how does cudaEventDefault resolve this problem?

Thank you.

Hi,

1. The default value of useSpinWait is off. Turn it on when you need it.

2. --useSpinWait should set the CU_CTX_SCHED_SPIN flag for the CUDA context, which instructs the CPU to actively spin while waiting for GPU synchronization and so reduces latency. However, spinning impacts other CPU threads running in parallel.
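To make the trade-off concrete, here is a minimal sketch in plain C++ that contrasts the two waiting styles. It is only an analogy and does not call CUDA: FakeEvent, waitSpin, waitBlocking and runOnce are hypothetical names made up for illustration, and a sleeping std::thread stands in for the GPU/DLA work.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical stand-in for a device event; this is NOT a CUDA type.
struct FakeEvent {
    std::atomic<bool> done{false};
    std::mutex m;
    std::condition_variable cv;

    // Called by the "device" thread once its work has finished.
    void signal() {
        {
            std::lock_guard<std::mutex> lock(m);
            done.store(true, std::memory_order_release);
        }
        cv.notify_all();
    }
};

// Spin-wait: the CPU thread polls the flag in a tight loop.
// It notices completion as soon as possible but keeps a core busy meanwhile.
void waitSpin(FakeEvent& e) {
    while (!e.done.load(std::memory_order_acquire)) {
        // busy loop, no yielding
    }
}

// Blocking wait: the CPU thread sleeps until it is notified.
// The core is free for other threads, at the cost of a wake-up delay.
void waitBlocking(FakeEvent& e) {
    std::unique_lock<std::mutex> lock(e.m);
    e.cv.wait(lock, [&] { return e.done.load(std::memory_order_acquire); });
}

// Runs one fake workload of the given duration and waits for it
// in the selected mode. Returns true once completion was observed.
bool runOnce(bool useSpinWait, std::chrono::milliseconds work) {
    FakeEvent e;
    std::thread device([&] {
        std::this_thread::sleep_for(work);  // pretend the device is busy
        e.signal();
    });
    if (useSpinWait) {
        waitSpin(e);
    } else {
        waitBlocking(e);
    }
    device.join();
    assert(e.done.load());
    return e.done.load();
}
```

Calling runOnce(true, std::chrono::milliseconds(5)) takes the spinning path and runOnce(false, std::chrono::milliseconds(5)) takes the blocking path. Both return once the fake work is done; what differs is how the waiting CPU thread spends its time, which is the same trade-off described above for spinning versus blocking synchronization.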

Thanks.

  1. I completely misread the documentation. Thanks.

  2. So, cudaEventDefault should set CU_CTX_SCHED_SPIN, and cudaEventBlockingSync should set CU_CTX_BLOCKING_SYNC, am I right?
    But why is it important when using both DLA and GPU? I thought these were totally different devices, but it seems that they have to wait for each other.

Hi,

DLA is a hardware-based deep learning engine, so its support scope is limited.
TensorRT falls back to the GPU for layers that DLA does not support, which makes synchronization between the two important.

Thanks.