False information from occupancy calculator?

Hello,

I was recently debugging on my graphic card (“Evga GeForce GTX 1050 Ti FTW” with driver “390.65” and the CUDA “9.1.85” on Windows 10) when i remark this:

SM(6) * maxThreadBlockPerSM(32) = 192.

With that we can launch a kernel of 192 blocks and 64 threads “Kernel_X <<< 192, 64 >>> ()”. Theoricaly this will produce a occupancy of 100%. Because

192 * 64 = 12’288 Threads. The maximum allocatable threads for the device. (SM(6) * maxThreadsPerSM(2048) = 12’288).

But no, this produce a occupancy of 50% with 32Warps/SM active and not a 64Warps/SM active. The cause is maybe the limitation of registers or the size of shared memory but again no. Because the kernel launch only 8 registers and 0 shared memory. Here is the code:

__global__ void kernel_T_E_S_T_batch(void)
{
	volatile unsigned int b(0u);

	for(unsigned int i(0u); i != 100000u; ++i)
	{
		++b;
	}
}

int main(void)
{
	// TEST
	printf("LAUNCH" NEW_LINE);
	kernel_T_E_S_T_batch <<< 192u, 64u >>> (); // Theoretical 100%, Achieved 50%
	//kernel_T_E_S_T_batch <<< 96u, 128u >>> (); // Theoretical 100%, Achieved 100%

	HostCudaCheckError();
	HostCudaSafeCall(cudaDeviceReset());
	printf("FINISH" NEW_LINE);
	// |END| TEST |END|

	system("PAUSE");

	return(0);
}

With the occupancy calculator we can see a 64threads per block with 32 blocks launch use 100% occupancy. So what is wrong with my experiment?

If i take the half of the maxThreadBlockDevice(192) so 96 and i multiply by two the number of threads 128 I get “Kernel_X <<< 96, 128 >>> () == 12’288 Threads” and now the active warps per sm equal 64. This produce a 100% occupancy. With 128 threads per blocks this reduce the number of resident blocks per sm to 16. Why can’t i use the real equation for having a 100% occupancy with the maximum resident blocks per sm.

It may be that the compiler is optimizing away most of the kernel code. This will make profiling difficult. You may need to modify some global state in your kernel.