Good evening everyone,
I’m new to CUDA and currently developing a C/C++ CUDA application using Eclipse Nsight. The software has a single, very simple kernel, and strangely, when I do a release build, the kernel’s performance drops drastically (by orders of magnitude) compared to the debug build.
From my online research I only found reports of people experiencing performance problems with the debug build (which makes some sense to me); for me, however, the opposite is the case.
Some parameters:
- Device: RTX 2060
- Eclipse Nsight version: 10.2
- CUDA Toolchain: 10.2
- Builder: 10.1
- Linker options: compute_62, compute_75, sm_62, sm_75 (it currently runs on the RTX 2060 mentioned above, compute capability 7.5, but is supposed to run remotely on a TX2, compute capability 6.2, later)
- Compiler options: -O3 (Release), -O0 (Debug)
(All settings are basically the IDE presets, except for my additional includes.)
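For reference, I believe the effective nvcc invocations look roughly like this (reconstructed from the presets above rather than copied from the build log, so the exact flags may differ):

```
# Debug build (Nsight presets; -G disables most device-side optimization)
nvcc -G -g -O0 -gencode arch=compute_62,code=sm_62 \
     -gencode arch=compute_75,code=sm_75 -o app_debug main.cu

# Release build
nvcc -O3 -gencode arch=compute_62,code=sm_62 \
     -gencode arch=compute_75,code=sm_75 -o app_release main.cu
```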
Nsight Compute profiling results, release build relative to debug build:
- SOL (Speed Of Light) values: -98%
- Duration/Elapsed Cycles/SM active cycles: +730%
- Compute workload: -99%
- Memory workload: -98%
- Active warps: nearly same, >7.5
- Eligible warps: -99%
- Warp cycles per instruction: +10600%
- executed/issued instructions: -92%
- Stall long scoreboard: +23200%
- Stall wait: -85%
In the debug build, both stall values are roughly even at around 13 cycles each and are the main contributors (I assume, because the kernel mainly moves data from one RAM location to another).
About the kernel (just in case it’s due to its simplicity; the actual implementation doesn’t seem to have an impact on this problem, though):
Call configuration: <<<16, 16>>>
What it does:
- Given two 2D lookup tables A1, A2 (global memory, constant) and two 2D textures T1, T2 (pinned memory, non-constant).
- Get value a1 from A1 according to the “location” of the thread.
- Get value a2 from A2 according to a1.
- Fetch both textures at location a2.
- Write the fetch result to the output according to the thread’s location.
Basically all it does is a simple texture fetch according to a table lookup.
The table is split into A1 and A2, because it would be sparse and extremely large otherwise.
A2 is indexed by A1 in order and without pitch, except for some cases (< 10%) where one thread processes more than one contiguous value from A2.
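To make the description concrete, here is a minimal sketch of the kernel as described above — all names, the element types of A1/A2, and the int2 texture-coordinate encoding are assumptions for illustration, not my actual implementation:

```
#include <cuda_runtime.h>

// Minimal sketch, not the real kernel: A1/A2 element types and the int2
// texture-coordinate encoding are assumptions made for illustration.
__global__ void lookupFetch(const int*  __restrict__ A1,  // 1st-level table (global, read-only)
                            const int2* __restrict__ A2,  // 2nd-level table: texture coordinates
                            cudaTextureObject_t T1,
                            cudaTextureObject_t T2,
                            float2* out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    int  a1 = A1[tid];  // lookup 1: indexed by the thread's location
    int2 a2 = A2[a1];   // lookup 2: indexed by a1 (a dependent load)

    // Fetch both textures at location a2; these fetches depend on the two
    // loads above, which is where long-scoreboard stalls would accumulate.
    float v1 = tex2D<float>(T1, a2.x, a2.y);
    float v2 = tex2D<float>(T2, a2.x, a2.y);

    out[tid] = make_float2(v1, v2);  // write by the thread's location
}

// Call configuration from above: 16 blocks of 16 threads.
// lookupFetch<<<16, 16>>>(dA1, dA2, t1, t2, dOut, 16 * 16);
```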
I hope someone can help me with this, as I can’t figure out what has gone wrong here.
Thank you very much in advance!