Xavier: calling a CUDA function is too slow


I’m processing some data on a Jetson AGX Xavier using a CUDA function.
According to my measurements, calling the function takes too much time.

Duration of the call: 1-32 ms (mean: 11 ms)
Duration of execution after the call: less than 1 ms.

How I measure:
I read the current steady clock using the C++ chrono library:

  1. before the call
  2. after the call
  3. after “cudaDeviceSynchronize”
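
The three timestamps above can be sketched roughly as follows. This is only a minimal illustration of the measurement pattern, not the actual code from my project: `LaunchTiming` and `time_launch` are hypothetical names, and the two callables stand in for the CUDA function call and for `cudaDeviceSynchronize`.

```cpp
#include <chrono>
#include <utility>

// Hypothetical helper (not from the original code): times an asynchronous
// call and the wait that follows it, using three steady_clock timestamps.
struct LaunchTiming {
    double call_ms;  // timestamp 1 -> 2: time spent inside the call itself
    double exec_ms;  // timestamp 2 -> 3: time spent waiting in the sync step
};

template <typename Launch, typename Sync>
LaunchTiming time_launch(Launch&& launch, Sync&& sync) {
    using clock = std::chrono::steady_clock;
    using ms = std::chrono::duration<double, std::milli>;

    const auto t_before = clock::now();       // 1. before the call
    std::forward<Launch>(launch)();           // assumed: the CUDA function call
    const auto t_after_call = clock::now();   // 2. after the call
    std::forward<Sync>(sync)();               // assumed: cudaDeviceSynchronize()
    const auto t_after_sync = clock::now();   // 3. after the synchronization

    return {ms(t_after_call - t_before).count(),
            ms(t_after_sync - t_after_call).count()};
}
```

In a `.cu` file this would be called as something like `time_launch([&] { my_kernel<<<grid, block>>>(d_in, d_out); }, [] { cudaDeviceSynchronize(); });`, where `my_kernel`, `grid`, `block`, `d_in` and `d_out` are placeholders. `call_ms` then corresponds to the "duration of the call" and `exec_ms` to the "duration of execution after the call".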

Input data: 9 MB
Output data: 4 MB
Number of threads: 800k
The same function executed on a PC takes 120 microseconds.

Varying the input data size changes the call duration linearly (maybe some internal copy is executed?).
Increasing the complexity of the CUDA function increases the execution duration, but the call duration stays the same.

Do you have any suggestions as to why the call takes so much time and how to solve this issue?

Hello,

Can you post some example code so we can take a look at it?