Hello,
I was recently debugging on my graphic card (“Evga GeForce GTX 1050 Ti FTW” with driver “390.65” and the CUDA “9.1.85” on Windows 10) when i remark this:
SM(6) * maxThreadBlockPerSM(32) = 192.
With that we can launch a kernel of 192 blocks and 64 threads “Kernel_X <<< 192, 64 >>> ()”. Theoricaly this will produce a occupancy of 100%. Because
192 * 64 = 12’288 Threads. The maximum allocatable threads for the device. (SM(6) * maxThreadsPerSM(2048) = 12’288).
But no, this produce a occupancy of 50% with 32Warps/SM active and not a 64Warps/SM active. The cause is maybe the limitation of registers or the size of shared memory but again no. Because the kernel launch only 8 registers and 0 shared memory. Here is the code:
__global__ void kernel_T_E_S_T_batch(void)
{
volatile unsigned int b(0u);
for(unsigned int i(0u); i != 100000u; ++i)
{
++b;
}
}
int main(void)
{
// TEST
printf("LAUNCH" NEW_LINE);
kernel_T_E_S_T_batch <<< 192u, 64u >>> (); // Theoretical 100%, Achieved 50%
//kernel_T_E_S_T_batch <<< 96u, 128u >>> (); // Theoretical 100%, Achieved 100%
HostCudaCheckError();
HostCudaSafeCall(cudaDeviceReset());
printf("FINISH" NEW_LINE);
// |END| TEST |END|
system("PAUSE");
return(0);
}
With the occupancy calculator we can see a 64threads per block with 32 blocks launch use 100% occupancy. So what is wrong with my experiment?
If i take the half of the maxThreadBlockDevice(192) so 96 and i multiply by two the number of threads 128 I get “Kernel_X <<< 96, 128 >>> () == 12’288 Threads” and now the active warps per sm equal 64. This produce a 100% occupancy. With 128 threads per blocks this reduce the number of resident blocks per sm to 16. Why can’t i use the real equation for having a 100% occupancy with the maximum resident blocks per sm.