Strange Performance issue

Hi,

I have a kernel that I want to optimize. One thing I want to do is remove unrequired initialization of data with a simple if/else. Here is what I did - just parts of it, but maybe someone already sees what the problem is.

This is the original code section (Version 1):

const uint32_t l_ui32Index = p_nSelector % p_nIndex;
	switch( l_ui32Index )
	{
	case 0:
	case 1:
	case 2:
	case 3:
		return p_ui64Header[l_ui32Index];

	default:
		return p_ui64Header2[l_ui32Index - 4];
	}

and I replaced it by this code (Version 2):

const uint32_t l_ui32Index = p_nSelector % p_nIndex;
	switch( l_ui32Index )
	{
	case 0:
	case 1:
	case 2:
	case 3:
		return p_ui64Header[l_ui32Index];

	default:
		return NULL == p_ui64Header2 ? 0 : p_ui64Header2[l_ui32Index - 4];
	}

You see, the only change is in the ‘default’ case of the switch. ALL threads have the SAME data, so all will go through the same case/default path.

The first version shows this register information - Version 1:

ptxas info : 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 128 registers, 18432 bytes smem, 376 bytes cmem[0]

This is Version 2:
It uses 25% fewer registers, 96 instead of 128.

ptxas info : 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 96 registers, 18432 bytes smem, 376 bytes cmem[0]

Version 2 performs 20% worse than Version 1! So I do not understand how this simple if/else (this is the ONLY modification) can save 25% of the registers yet perform more than 20% worse. I was under the impression that saving registers may have a positive impact on performance.

Please let me know if this is a typical newbie mistake.

CUDA 10.2, Windows 10, GTX 1070 / RTX 2060

Thanks.

Make sure you are compiling a release project, not a debug project.

Hi,

I compile ‘release’ only. My application ‘creates’ the kernel at runtime and compiles it. Both versions are compiled identically; the only difference is the if/else in the default path of the switch - nothing else. Same command line to invoke the just-in-time compiler (‘nvrtcCreateProgram’) and the other required functions.

I use: -arch=compute_61 for the 1070, -arch=compute_75 for the 2060
-std=c++11
-restrict

I would like to see more ‘diagnostic’ and ‘informational’ output from the compiler that tells me which function might need modification to enable better compiler optimization, or that tells me whether a device function causes use of the stack. Things like that.

I’d like to see that too. You can file a bug report (enhancement request) using the instructions linked in a sticky post at the top of this sub-forum. In the meantime, there are various GPU performance analysis tools (profilers) available.

Will do. Is there a way to profile the application-generated kernel file? I do not have the code available BEFORE the program starts, but during runtime I have the .cu and the .ptx file available. Can I use Nsight during the profiling session and point it at these files to find issues and bottlenecks?

More generally, I’m asking about a ‘best practice’ and ‘how to’ for profiling a kernel that is compiled at application runtime.

Thanks.

I’m not aware of anything that prevents profiling a kernel that is compiled at runtime vs. one that is available at compile time. Perhaps you should try it. This isn’t really any different from driver API usage, where the kernel is also compiled at runtime. I can use the profilers to profile CUDA kernels that are compiled at runtime and launched via pycuda or numba, for example.

What I mean is: how can I assign the ‘source’ code to the profiler so that I can see which areas of the source code are a bottleneck? I work with VS, and when I use Nsight I can only see the compiled code, not the ‘underlying’ source code. Is it somehow possible to let the profiler use the source code that was generated and used to compile the profiled kernel? Is there a setting somewhere where I can add source code directories, or register the generated code as the source of the kernel?