CUDA compiler for Visual C++ not recognizing template constants?

I recently added a BIGint arbitrary-precision class to my gxLibrary ( https://sourceforge.net/projects/gxlibrary ), which allows working with large integers on CUDA (and AMP and CPU too), like:

intB<128> A=7, B=99, C=A/B;
A>>=1;  B++;

Internally, the intB class defines a constant integer that represents how many 32-bit unsigned ints are used. Something like:

static const int N= Nbits/32;
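
For context, intB itself is a template over the bit width, so N is known at compile time; roughly this layout (member names here are only illustrative, not the exact gxLibrary declarations):

template <int Nbits>
class intB {
	static const int N = Nbits/32;	// number of 32-bit words used
	unsigned int d[N];		// d[0] is the least significant word
public:
	intB& operator++();
	// ... rest of the arithmetic operators
};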

When I decided to optimize some operations for low Ns (for example N==4 for 128-bit, or N==3 for 96-bit), I used something like this in the code:

intB& operator++(){
	switch (N){
		case 4: if (!++d[0]) if (!++d[1]) if (!++d[2]) ++d[3]; break; 
		case 3: if (!++d[0]) if (!++d[1]) ++d[2]; break;
		case 2: if (!++d[0]) ++d[1]; break;
		case 1: ++d[0]; break;
		default:
			// generic path for any N: propagate the carry from the lowest word upward
			for (int i=0; i<N; i++){
				++d[i];
				if (d[i]) break;
			}
	}
	return *this;
}

Since ‘N’ used above is a constant (templates are resolved at compile time), compilers should remove any unneeded code paths at compile time, so switch(N) effectively becomes just the code for the given N, and it works exactly like that when compiled for CPU or AMP (gxLibrary compiles code for all three: CUDA/AMP/CPU).

But in the case of the CUDA compiler, it appears not to recognize that N is constant, since it gives multiple “subscript out of range” warnings (the code contains d[3] even when N==2, but the part where d[3] is used should have been eliminated at compile time).

While I could ignore the warnings, my main question is whether those are only warnings, or whether the CUDA compiler also failed to remove the unneeded code paths and left those “if (N==xyz)” comparisons or “switch(N)” code in, in which case it would also have a slight performance impact.

Are you compiling a debug or a release project? Stated another way, are you compiling with the -G switch?

You can figure out what code is actually present by inspecting the PTX or SASS code generated.
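
For a single-file test case, something along these lines works from the command line (file names here are just examples):

nvcc -ptx test.cu -o test.ptx		# emit PTX for inspection
nvcc -cubin -arch=sm_35 test.cu -o test.cubin
cuobjdump -sass test.cubin		# disassemble the cubin to SASS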

The compiler warnings appear with both debug and release compiles.

It was not possible for me to easily check the PTX or SASS, since Nsight is not able to map source code to PTX/SASS if the global function header is part of a #define macro (which I use in gxLibrary), or I would have done it sooner.

But I decided to make a minimal test CUDA app with a similar case and check with that:

const int CONST=2;

__global__ void test_cu(int* gpuData){
	int res, d[2]={2,3};
	switch (CONST){
		case 1: res=d[0]; break;
		case 2: res=d[1]; break;
		default:
			res=d[2];	// out of range on purpose; unreachable when CONST==2
	}
	gpuData[ threadIdx.x]=res;
}

The above code should always resolve to gpuData[threadIdx.x]=3.
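
The host side is not important for the test; a minimal launcher along these lines is enough (illustrative only, error checking omitted):

#include <cstdio>

int main(){
	int host[32], *gpuData;
	cudaMalloc((void**)&gpuData, sizeof(host));
	test_cu<<<1, 32>>>(gpuData);
	cudaMemcpy(host, gpuData, sizeof(host), cudaMemcpyDeviceToHost);
	printf("gpuData[0]=%d\n", host[0]);	// expected output: gpuData[0]=3
	cudaFree(gpuData);
	return 0;
}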

The good news is that, when compiled for Release, the CUDA compiler does a good job and optimizes this. The entire SASS code looks like:

     MOV R1, c[0x0][0x44];
     S2R R0, SR_TID.X;
     MOV32I R2, 0x3;
     ISCADD R0, R0, c[0x0][0x140], 0x2;
     ST [R0], R2;
     EXIT;

The bad news is that, even when compiling for Release, where the compiler ‘eventually’ removes things like ‘res=d[2]’, the CUDA compiler still shows the warnings (I guess it emits warnings before the optimization stage).

More bad news is that when compiled for Debug, there is no apparent optimization and the complete switch(CONST) code remains, which indicates that the CUDA compiler does not do these optimizations for Debug builds (and also explains why debug builds are significantly slower). I wonder if there is a CUDA option in Visual Studio that allows a CUDA debug build with optimizations, but that is not related to my initial question.

The conclusion here is that the NVIDIA CUDA compiler correctly optimizes out switch() parts based on a constant only in Release builds, and even then it issues warnings for code paths that will never be reached (i.e. those it will optimize out), while in Debug builds the compiler does not seem to optimize these things at all.

The fact that the device compiler does not optimize when compiled with -G is a common observation. There are a variety of reasons for this. One is that when attempting to debug code, it’s convenient to have something that approximately represents the original source code, so as to facilitate things like source-level debugging (e.g. setting breakpoints at particular lines of source code). The fully optimized code can make wholesale changes to program organization at the assembly level such that it is nearly impossible to map it back to the original source code.
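
A common middle ground, if the goal is mainly source correlation in the profiler or Nsight rather than full source-level debugging, is to compile with -lineinfo instead of -G; that keeps the optimizations while still embedding source line information, e.g.:

nvcc -lineinfo -o test test.cu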

True, but not directly the issue here, even if it would be nice to have an option to build a CUDA debug configuration with or without optimizations.

The issue for me is that, even when building for Release, the CUDA compiler reports warnings about lines of code that it will remove. But at least now I know those are false warnings.
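
If those false warnings ever become too noisy, one possible workaround (assuming the code can be compiled as C++17, which newer CUDA toolkits accept via -std=c++17) would be to replace switch(N) with if constexpr, so the dead branches are discarded during template instantiation and the out-of-range subscripts are never even seen by the warning pass. A sketch of the idea:

intB& operator++(){
	if constexpr (N == 1){
		++d[0];
	} else if constexpr (N == 2){
		if (!++d[0]) ++d[1];
	} else {
		// generic carry-propagation loop for any N
		for (int i=0; i<N; i++){
			++d[i];
			if (d[i]) break;
		}
	}
	return *this;
}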