Good evening everyone,
I’m new to CUDA and currently developing a C/C++ CUDA application using Eclipse Nsight. The software has a single, very simple kernel, and strangely, when I do a release build, the kernel’s performance drops drastically (by orders of magnitude) compared to the debug build.
From my online research I only found reports of people experiencing performance problems with the debug build (which makes some sense to me); for me, however, the opposite is the case.
Some parameters:
- Device: RTX 2060
- Eclipse Nsight version: 10.2
- CUDA Toolchain: 10.2
- Builder: 10.1
- Linker options: compute_62, compute_75, sm_62, sm_75 (it currently runs on the RTX 2060 mentioned above, compute capability 7.5, but is supposed to run remotely on a TX2, compute capability 6.2, later)
- Compiler options: -O3 (Release), -O0 (Debug)
(All settings are basically the IDE presets, except for my additional includes.)
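For reference, I believe the effective nvcc invocations look roughly like this (reconstructed from the presets above rather than copied from the build log, so the exact flags may differ):

```
# Debug build (Nsight presets; -G disables most device-side optimization)
nvcc -G -g -O0 -gencode arch=compute_62,code=sm_62 \
     -gencode arch=compute_75,code=sm_75 -o app_debug main.cu

# Release build
nvcc -O3 -gencode arch=compute_62,code=sm_62 \
     -gencode arch=compute_75,code=sm_75 -o app_release main.cu
```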
Nsight Compute profiling results, release build relative to debug build:
- SOL (Speed Of Light) values: -98%
- Duration/Elapsed Cycles/SM active cycles: +730%
- Compute workload: -99%
- Memory workload: -98%
- Active warps: nearly same, >7.5
- Eligible warps: -99%
- Warp cycles per instruction: +10600%
- executed/issued instructions: -92%
- Stall long scoreboard: +23200%
- Stall wait: -85%
In the debug build, both stall values are roughly even at around 13 cycles each and are the main contributors (I assume, because the kernel mainly moves data from one RAM location to another).
About the kernel (just in case it’s due to its simplicity; the actual implementation doesn’t seem to have an impact on this problem, though):
Call configuration: <<<16, 16>>>
What it does:
- Given two 2D lookup tables A1, A2 (global memory, constant) and two 2D textures T1, T2 (pinned memory, non-constant).
- Get value a1 from A1 according to the “location” of the thread.
- Get value a2 from A2 according to a1.
- Fetch both textures at location a2.
- Write the fetch result to the output according to the thread’s location.
Basically all it does is a simple texture fetch according to a table lookup.
The table is split into A1 and A2, because it would be sparse and extremely large otherwise.
A2 is indexed by A1 in order and without pitch, except for some cases (< 10%) where one thread processes more than one contiguous value from A2.
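To make the description concrete, here is a minimal sketch of the kernel as described above — all names, the element types of A1/A2, and the int2 texture-coordinate encoding are assumptions for illustration, not my actual implementation:

```
#include <cuda_runtime.h>

// Minimal sketch, not the real kernel: A1/A2 element types and the int2
// texture-coordinate encoding are assumptions made for illustration.
__global__ void lookupFetch(const int*  __restrict__ A1,  // 1st-level table (global, read-only)
                            const int2* __restrict__ A2,  // 2nd-level table: texture coordinates
                            cudaTextureObject_t T1,
                            cudaTextureObject_t T2,
                            float2* out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    int  a1 = A1[tid];  // lookup 1: indexed by the thread's location
    int2 a2 = A2[a1];   // lookup 2: indexed by a1 (a dependent load)

    // Fetch both textures at location a2; these fetches depend on the two
    // loads above, which is where long-scoreboard stalls would accumulate.
    float v1 = tex2D<float>(T1, a2.x, a2.y);
    float v2 = tex2D<float>(T2, a2.x, a2.y);

    out[tid] = make_float2(v1, v2);  // write by the thread's location
}

// Call configuration from above: 16 blocks of 16 threads.
// lookupFetch<<<16, 16>>>(dA1, dA2, t1, t2, dOut, 16 * 16);
```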
I hope someone can help me with this, as I can’t figure out what has gone wrong here.
Thank you very much in advance!