Nsight Eclipse Extreme Performance Drop When Compiling Using Release Build Config

Good evening everyone,

I’m new to CUDA and am currently developing a C/C++ CUDA application using Nsight Eclipse. The software has a single, very simple kernel, and when I do the release build, strangely, the performance of the kernel drops drastically (by orders of magnitude) compared to the debug build.
In my online research I only found reports of performance problems with the debug build (which makes some sense to me); for me, however, the opposite is the case.

Some parameters:

  • Device: RTX 2060
  • Eclipse Nsight version: 10.2
  • CUDA toolchain: 10.2
  • Builder: 10.1
  • Linker options: compute_62, compute_75, sm_62, sm_75 (currently runs on the mentioned 7.5 device, but is supposed to run remotely on a TX2 later)
  • Compiler options: -O3 (Release), -O0 (Debug)
    (All settings are basically as preset by the IDE, except for my additional includes.)

Nsight Compute profiling results - Release build in relation to Debug build:

  • SOL (Speed of Light) values: -98%
  • Duration / elapsed cycles / SM active cycles: +730%
  • Compute workload: -99%
  • Memory workload: -98%
  • Active warps: nearly the same, >7.5
  • Eligible warps: -99%
  • Warp cycles per instruction: +10600%
  • Executed/issued instructions: -92%
  • Stall long scoreboard: +23200%
  • Stall wait: -85%
    In the debug build, both stall values are roughly equal at around 13 and are the main contributors (I assume this is because the kernel mainly moves data from one memory location to another).

About the kernel (in case the problem is due to its simplicity; the actual implementation doesn’t seem to have an impact on this problem, though):
Call configuration: <<<16, 16>>>
What it does:

  1. Given two 2D lookup tables A1, A2 (global memory, constant) and two 2D textures T1, T2 (pinned memory, non-constant).
  2. Get value a1 from A1 according to the “location” of the thread.
  3. Get value a2 from A2 according to a1.
  4. Fetch both textures at location a2.
  5. Write the result of the fetch to the output according to the thread’s location.
    Basically, all it does is a simple texture fetch driven by a table lookup.
    The table is split into A1 and A2 because it would otherwise be sparse and extremely large.
    A2 is indexed by A1 in order and without pitch, except in some cases (< 10%) where one thread processes more than one contiguous value from A2.
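
For readers who want something concrete, the steps above could be sketched roughly as follows. This is a hypothetical reconstruction, not the actual kernel, which was not posted. All names (lookupFetch, makeTexture, runLookupDemo, a1Table, a2Table) are made up for illustration. It also simplifies: both tables are flattened to 1D, each thread handles exactly one value, the textures are backed by CUDA arrays rather than pinned memory, the output goes to ordinary device memory, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel: table lookup (A1 -> A2) followed by two texture fetches.
__global__ void lookupFetch(const int* a1Table, const int2* a2Table,
                            cudaTextureObject_t t1, cudaTextureObject_t t2,
                            uchar4* out1, uchar4* out2, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;
    int  a1 = a1Table[tid];   // step 2: lookup by thread location
    int2 a2 = a2Table[a1];    // step 3: second lookup, indexed by a1
    // Step 4: unnormalized coordinates; +0.5f addresses the texel centre.
    uchar4 v1 = tex2D<uchar4>(t1, a2.x + 0.5f, a2.y + 0.5f);
    uchar4 v2 = tex2D<uchar4>(t2, a2.x + 0.5f, a2.y + 0.5f);
    out1[tid] = v1;           // step 5: write by thread location
    out2[tid] = v2;
}

// Builds a point-sampled uchar4 texture object backed by a CUDA array.
static cudaTextureObject_t makeTexture(const std::vector<uchar4>& texels,
                                       int w, int h, cudaArray_t* arr)
{
    cudaChannelFormatDesc fmt = cudaCreateChannelDesc<uchar4>();
    cudaMallocArray(arr, &fmt, w, h);
    cudaMemcpy2DToArray(*arr, 0, 0, texels.data(), w * sizeof(uchar4),
                        w * sizeof(uchar4), h, cudaMemcpyHostToDevice);
    cudaResourceDesc res = {};
    res.resType = cudaResourceTypeArray;
    res.res.array.array = *arr;
    cudaTextureDesc td = {};                  // normalizedCoords stays 0
    td.addressMode[0] = cudaAddressModeClamp;
    td.addressMode[1] = cudaAddressModeClamp;
    td.filterMode = cudaFilterModePoint;
    td.readMode = cudaReadModeElementType;
    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &res, &td, nullptr);
    return tex;
}

// Runs the kernel on synthetic data and checks the result on the host.
bool runLookupDemo()
{
    const int w = 16, h = 16, n = w * h;      // 256 threads, as in <<<16, 16>>>
    std::vector<uchar4> texA(n), texB(n);
    std::vector<int> a1(n);
    std::vector<int2> a2(n);
    for (int i = 0; i < n; ++i) {
        texA[i] = make_uchar4(i % w, i / w, 1, 255);
        texB[i] = make_uchar4(i % w, i / w, 2, 255);
        a1[i] = n - 1 - i;                    // arbitrary permutation
        a2[i] = make_int2(i % w, i / w);      // texel coordinate of entry i
    }
    cudaArray_t arrA, arrB;
    cudaTextureObject_t t1 = makeTexture(texA, w, h, &arrA);
    cudaTextureObject_t t2 = makeTexture(texB, w, h, &arrB);
    int* dA1; int2* dA2; uchar4* dOut1; uchar4* dOut2;
    cudaMalloc((void**)&dA1, n * sizeof(int));
    cudaMalloc((void**)&dA2, n * sizeof(int2));
    cudaMalloc((void**)&dOut1, n * sizeof(uchar4));
    cudaMalloc((void**)&dOut2, n * sizeof(uchar4));
    cudaMemcpy(dA1, a1.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dA2, a2.data(), n * sizeof(int2), cudaMemcpyHostToDevice);

    lookupFetch<<<16, 16>>>(dA1, dA2, t1, t2, dOut1, dOut2, n);

    std::vector<uchar4> o1(n), o2(n);         // blocking copies also synchronize
    cudaMemcpy(o1.data(), dOut1, n * sizeof(uchar4), cudaMemcpyDeviceToHost);
    cudaMemcpy(o2.data(), dOut2, n * sizeof(uchar4), cudaMemcpyDeviceToHost);
    bool ok = true;
    for (int i = 0; i < n; ++i) {
        int src = a1[i];                      // texel index this thread fetched
        ok = ok && o1[i].x == texA[src].x && o1[i].y == texA[src].y
                && o1[i].z == 1 && o2[i].z == 2;
    }
    cudaDestroyTextureObject(t1); cudaDestroyTextureObject(t2);
    cudaFreeArray(arrA); cudaFreeArray(arrB);
    cudaFree(dA1); cudaFree(dA2); cudaFree(dOut1); cudaFree(dOut2);
    return ok;
}
```

Calling runLookupDemo() from main and compiling once with -O3 and once with -O0 would give a small test case to compare in the profiler, although this simplified version is not guaranteed to show the slowdown.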

I hope someone can help me with this, as I can’t figure out what has gone wrong here.
Thank you very much in advance!

Solved.
The reason was something entirely different. Writing an immediate value into one element of the uchar4 variable that is then written to pinned memory caused the problem. Apparently the compiler generates inefficient code in that case, which causes a particularly large performance drop when writing to pinned memory.
The workaround is to use constant-memory variables or kernel arguments instead of immediates.
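
To illustrate, here is a minimal sketch of the two variants, assuming the output buffer is mapped pinned memory allocated with cudaHostAlloc on a system with unified addressing. The kernel and function names (writeImmediate, writeArgument, runWriteDemo) and the value 255 are hypothetical, since the original code was not posted. The sketch only shows the pattern and the workaround; whether the immediate variant is actually slower depends on the compiler version and target, and error checking is omitted.

```cuda
#include <cuda_runtime.h>

// Pattern reported as slow in the release build: one component of the
// uchar4 is set from an immediate before the store to pinned memory.
__global__ void writeImmediate(const uchar4* in, uchar4* out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;
    uchar4 v = in[tid];
    v.w = 255;                 // immediate value
    out[tid] = v;
}

// Workaround: the same value arrives as a kernel argument instead.
__global__ void writeArgument(const uchar4* in, uchar4* out, int n,
                              unsigned char alpha)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;
    uchar4 v = in[tid];
    v.w = alpha;               // value supplied at launch time
    out[tid] = v;
}

// Runs both variants against mapped pinned memory and checks that they
// produce the same result.
bool runWriteDemo()
{
    const int n = 256;                         // as in <<<16, 16>>>
    const size_t bytes = n * sizeof(uchar4);

    uchar4* dIn = nullptr;
    cudaMalloc((void**)&dIn, bytes);
    cudaMemset(dIn, 7, bytes);                 // every byte of the input is 7

    // Pinned host buffers that the kernels write to directly (zero-copy).
    uchar4 *hOutA = nullptr, *hOutB = nullptr;
    uchar4 *dOutA = nullptr, *dOutB = nullptr;
    cudaHostAlloc((void**)&hOutA, bytes, cudaHostAllocMapped);
    cudaHostAlloc((void**)&hOutB, bytes, cudaHostAllocMapped);
    cudaHostGetDevicePointer((void**)&dOutA, hOutA, 0);
    cudaHostGetDevicePointer((void**)&dOutB, hOutB, 0);

    writeImmediate<<<16, 16>>>(dIn, dOutA, n);
    writeArgument<<<16, 16>>>(dIn, dOutB, n, 255);
    cudaDeviceSynchronize();                   // make the writes visible to the host

    bool ok = true;
    for (int i = 0; i < n; ++i) {
        ok = ok && hOutA[i].x == 7 && hOutA[i].w == 255
                && hOutB[i].x == 7 && hOutB[i].w == 255;
    }
    cudaFreeHost(hOutA);
    cudaFreeHost(hOutB);
    cudaFree(dIn);
    return ok;
}
```

The other option mentioned, a constant-memory variable, would follow the same idea: declare a `__constant__ unsigned char` at file scope, set it from the host with cudaMemcpyToSymbol, and read it in the kernel instead of the literal.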
I’ve opened a new thread asking about this: