Why is my Jetson Orin Nano program so slow, and why is there no difference between Debug and Release mode?

Release: nvcc -O2 -DNDEBUG -arch=sm_87 -o subpixel_RK3566_release dmmalvar5x5.cu main.cu -I/usr/include/opencv4 -L/usr/lib -lopencv_core -lopencv_imgproc -lopencv_highgui -lopencv_imgcodecs -lopencv_videoio -L/usr/local/cuda/lib64 -lnppc -lnppial -lcuda -lcudart -lnppif -I/home/dmg_wym/code/subpixel_s2/include
Debug: nvcc -g -Ddebug -o subpixel_RK3566_debug dmmalvar5x5.cu main.cu -I/usr/include/opencv4 -L/usr/lib -lopencv_core -lopencv_imgproc -lopencv_highgui -lopencv_imgcodecs -lopencv_videoio -L/usr/local/cuda/lib64 -lnppc -lnppial -lcuda -lcudart -lnppif -I/home/dmg_wym/code/subpixel_s2/include
./subpixel_RK3566_release

Kernel execution time: 3.87389 ms
./subpixel_RK3566_debug

Kernel execution time: 3.89082 ms

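Possibly because “nvcc -g” means “Generate debug information for host code”, whereas “nvcc -G” means “Generate debug information for device code”. With only -g, the device code is still compiled with optimizations, so the kernel time would not be expected to change.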
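To see the effect of the two flags directly, a small test program can be built three ways and timed. The following is a minimal sketch, not the code from this thread: the `saxpy` kernel, the `time_saxpy_ms` helper, and the file name `timing_demo.cu` are placeholders invented for illustration. The kernel is timed with CUDA events after a warm-up launch.

```cuda
// Hypothetical file name: timing_demo.cu
// Build variants (only the debug flags differ):
//   nvcc -O2 -arch=sm_87 -o timing_release timing_demo.cu   (optimized device code)
//   nvcc -g  -arch=sm_87 -o timing_g       timing_demo.cu   (host debug info only)
//   nvcc -G  -arch=sm_87 -o timing_G       timing_demo.cu   (device debug info, device optimizations off)
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                                \
    do {                                                                \
        cudaError_t err_ = (call);                                      \
        if (err_ != cudaSuccess) {                                      \
            std::fprintf(stderr, "CUDA error %s at %s:%d\n",            \
                         cudaGetErrorString(err_), __FILE__, __LINE__); \
            std::exit(EXIT_FAILURE);                                    \
        }                                                               \
    } while (0)

// Stand-in kernel (not the kernel from the thread): y = a * x + y.
__global__ void saxpy(int n, float a, const float* x, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

// Returns the average time of one kernel launch in milliseconds.
float time_saxpy_ms(int n, int iterations) {
    float* x = nullptr;
    float* y = nullptr;
    CUDA_CHECK(cudaMalloc(&x, n * sizeof(float)));
    CUDA_CHECK(cudaMalloc(&y, n * sizeof(float)));
    CUDA_CHECK(cudaMemset(x, 0, n * sizeof(float)));
    CUDA_CHECK(cudaMemset(y, 0, n * sizeof(float)));

    const int block = 256;
    const int grid = (n + block - 1) / block;

    // Warm-up launch so one-time initialization is not part of the measurement.
    saxpy<<<grid, block>>>(n, 2.0f, x, y);
    CUDA_CHECK(cudaGetLastError());
    CUDA_CHECK(cudaDeviceSynchronize());

    cudaEvent_t start, stop;
    CUDA_CHECK(cudaEventCreate(&start));
    CUDA_CHECK(cudaEventCreate(&stop));

    // Events are recorded on the same stream as the kernels,
    // so the elapsed time covers only the GPU work between them.
    CUDA_CHECK(cudaEventRecord(start));
    for (int it = 0; it < iterations; ++it) {
        saxpy<<<grid, block>>>(n, 2.0f, x, y);
    }
    CUDA_CHECK(cudaEventRecord(stop));
    CUDA_CHECK(cudaEventSynchronize(stop));
    CUDA_CHECK(cudaGetLastError());

    float total_ms = 0.0f;
    CUDA_CHECK(cudaEventElapsedTime(&total_ms, start, stop));

    CUDA_CHECK(cudaEventDestroy(start));
    CUDA_CHECK(cudaEventDestroy(stop));
    CUDA_CHECK(cudaFree(x));
    CUDA_CHECK(cudaFree(y));
    return total_ms / iterations;
}

int main() {
    std::printf("Kernel execution time: %f ms\n", time_saxpy_ms(1 << 22, 100));
    return 0;
}
```

If the -g build reports roughly the same time as the release build while the -G build is noticeably slower, that would be consistent with -g affecting host code only.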

Yes, thanks. Using “nvcc -G”, I get: Kernel execution time: 35.28 ms. It is much slower than the GTX 1660 (0.2 ms). Why is it so slow, and what directions should I investigate?

Your Orin (3.87 ms) and the GTX 1660 (0.2 ms) are two quite different GPUs resource-wise.

I don’t think I can give a reasonable assessment as to whether the Orin is underperforming, relatively speaking. It has 8 SMs vs 22, GPU/memory clock speeds of 625/1067 MHz vs 1530 to 1785/2001 MHz, and memory bandwidth of 68.29 GB/s vs 192.1 GB/s.
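The figures above can be checked on each machine by asking the CUDA runtime for the corresponding device attributes. This is a small sketch, assuming device index 0 is the GPU of interest; `query_attr` is a helper name made up for this example. The clocks it reports are the peak values the runtime advertises, which may not match what the board actually runs at under its current power settings.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: read one integer device attribute or abort on error.
int query_attr(cudaDeviceAttr attr, int device) {
    int value = 0;
    cudaError_t err = cudaDeviceGetAttribute(&value, attr, device);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaDeviceGetAttribute failed: %s\n",
                     cudaGetErrorString(err));
        std::exit(EXIT_FAILURE);
    }
    return value;
}

int main() {
    const int device = 0;  // assumption: the GPU of interest is device 0

    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, device);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaGetDeviceProperties failed: %s\n",
                     cudaGetErrorString(err));
        return EXIT_FAILURE;
    }

    // Clock attributes are reported in kHz, the bus width in bits.
    const int sm_count   = query_attr(cudaDevAttrMultiProcessorCount, device);
    const int gpu_khz    = query_attr(cudaDevAttrClockRate, device);
    const int mem_khz    = query_attr(cudaDevAttrMemoryClockRate, device);
    const int bus_bits   = query_attr(cudaDevAttrGlobalMemoryBusWidth, device);

    std::printf("Device:            %s\n", prop.name);
    std::printf("SM count:          %d\n", sm_count);
    std::printf("Peak GPU clock:    %.0f MHz\n", gpu_khz / 1000.0);
    std::printf("Peak memory clock: %.0f MHz\n", mem_khz / 1000.0);
    std::printf("Memory bus width:  %d bits\n", bus_bits);
    return 0;
}
```

Running the same program on both machines gives a like-for-like view of SM count and clocks before moving on to a profiler.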

Nsight Compute comparisons between the two should help in locating any bottleneck between them.

Edit: I notice you’re optimising at -O2. You may benefit from -O3.

35.28 ms is with -G; is 0.2ms with or without -G?
3.9 ms is without -G.

It only makes sense to compare release builds with each other.
