Hi,
Thanks for your patience.
Please note that GPU execution time can depend on many factors, e.g. the current GPU workload, resource availability for the work, etc.
So we cannot guarantee a tight bound on the GPU execution time of any kernel once it reaches the GPU.
However, below are some experiments we have tried (attached as change.patch, 3.4 KB):
1. Run some warm-up loops of the kernel before timing.
2. Measure the GPU execution time via CUDA events.
With the above changes, we see very similar (even slightly better) performance on JetPack 4.6.
Would you mind also checking this on your side?
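For reference, below is a minimal sketch of these two changes. It is not the actual change.patch: the kernel body, data sizes, and launch configuration are placeholders inferred from the kernel signature in the profile, so please adapt it to your application.

// Sketch only: warm-up iterations followed by CUDA-event timing around the kernel.
// "ComplexMult", "complex_float", the sizes, and the launch configuration are
// placeholders, not the original implementation.
#include <cstdio>
#include <cuda_runtime.h>

struct complex_float { float re, im; };

__global__ void ComplexMult(complex_float* a, complex_float* b,
                            complex_float* c, int rows, int cols)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < rows * cols) {
        c[i].re = a[i].re * b[i].re - a[i].im * b[i].im;
        c[i].im = a[i].re * b[i].im + a[i].im * b[i].re;
    }
}

int main()
{
    const int rows = 1024, cols = 1024, n = rows * cols;
    complex_float *a, *b, *c;
    cudaMalloc(&a, n * sizeof(complex_float));
    cudaMalloc(&b, n * sizeof(complex_float));
    cudaMalloc(&c, n * sizeof(complex_float));

    dim3 block(256);
    dim3 grid((n + block.x - 1) / block.x);

    // 1. Warm-up loops so clocks and caches reach a steady state before measuring.
    for (int i = 0; i < 10; ++i)
        ComplexMult<<<grid, block>>>(a, b, c, rows, cols);
    cudaDeviceSynchronize();

    // 2. Measure the GPU execution time with CUDA events.
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    const int iterations = 10100;   // matches the instance count in the reports below
    cudaEventRecord(start);
    for (int i = 0; i < iterations; ++i)
        ComplexMult<<<grid, block>>>(a, b, c, rows, cols);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float total_ms = 0.f;
    cudaEventElapsedTime(&total_ms, start, stop);
    printf("Total elapsed time = %.3fms, average time = %.3fms\n",
           total_ms, total_ms / iterations);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}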
JetPack 4.3:
Total elapsed time = 18983.760ms, average time = 1.898ms
...
Time(%)  Total Time (ns)  Instances  Average      Minimum    Maximum    StdDev    Name
-------  ---------------  ---------  -----------  ---------  ---------  --------  ---------------------------------------------------------------------
100.0    19,179,070,656   10,100     1,898,917.9  1,822,208  4,362,752  89,854.0  ComplexMult(complex_float*, complex_float*, complex_float*, int, int)
JetPack 4.6:
Total elapsed time = 18558.316ms, average time = 1.856ms
...
Time(%)  Total Time (ns)  Instances  Average      Minimum    Maximum    StdDev    Name
-------  ---------------  ---------  -----------  ---------  ---------  --------  ---------------------------------------------------------------------
100.0    18,738,775,456   10,100     1,855,324.3  1,806,496  5,056,928  84,044.6  ComplexMult(complex_float*, complex_float*, complex_float*, int, int)
Thanks.