Yes, thanks. With “nvcc -G”, kernel execution time is 35.28 ms. That is far slower than the RTX1660 at 0.2 ms. What directions should I investigate, and why is it so slow?
Your Orin (3.87 ms) and the RTX1660 (0.2 ms) are two quite different GPUs resource-wise.
I don’t think I can give a reasonable assessment as to whether the Orin is underperforming, relatively speaking. It has 8 SMs vs 22, GPU/memory clock speeds of 625/1067 MHz vs 1530 to 1785/2001 MHz, and memory bandwidth of 68.29 GB/s vs 192.1 GB/s.
Nsight Compute comparisons between the two should help in locating any bottleneck between them.
Edit: I notice you’re optimising at -O2. You may benefit from -O3.