I'm processing some data on a Jetson AGX Xavier using a CUDA function.
According to my measurements, calling the function takes too much time.
Duration of the call: 1-32 ms (mean: 11 ms).
Duration of execution after the call returns: less than 1 ms.
How I measure:
I read the current steady clock using the C++ chrono library at three points:
- before the call
- after the call
- after “cudaDeviceSynchronize”
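
For reference, the three timestamps listed above could be taken with a small helper like the one below. This is only a sketch of the measurement scheme, not my actual code: the names `LaunchTiming` and `time_launch_and_sync` are made up for illustration, and the two callables stand in for the CUDA function call and for cudaDeviceSynchronize.

```cpp
#include <chrono>

// Durations in microseconds for the two phases described above.
struct LaunchTiming {
    double call_us;  // from "before the call" until the call returns
    double exec_us;  // from the call returning until synchronization completes
};

// Hypothetical helper: `launch` stands in for the CUDA function call and
// `sync` for cudaDeviceSynchronize(). Both are passed as callables so the
// timing logic itself has no CUDA dependency.
template <typename Launch, typename Sync>
LaunchTiming time_launch_and_sync(Launch&& launch, Sync&& sync) {
    using clock = std::chrono::steady_clock;
    using us = std::chrono::duration<double, std::micro>;

    const auto t0 = clock::now();  // before the call
    launch();
    const auto t1 = clock::now();  // after the call
    sync();
    const auto t2 = clock::now();  // after synchronization

    return {us(t1 - t0).count(), us(t2 - t1).count()};
}
```

In a .cu file this might be used as `time_launch_and_sync([&]{ my_kernel<<<grid, block>>>(d_in, d_out); }, []{ cudaDeviceSynchronize(); });`, where `my_kernel`, `grid`, `block`, `d_in` and `d_out` are placeholders. `call_us` then corresponds to the "duration of call" and `exec_us` to the "duration of execution after the call".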
Input data: 9 MB
Output data: 4 MB
Number of threads: 800k
The same function executed on a PC takes 120 microseconds.
Varying the input data size changes the call duration linearly (maybe some internal copy is executed?).
Increasing the complexity of the CUDA function increases the execution duration, but the call duration stays the same.
Do you have any suggestions why the call can take so much time and how to solve this issue?