Hi,
I have a kernel that I want to optimize. One thing I want to do is to remove unneeded initialisation of data by adding a simple if/else. Here is what I did. These are just parts of the code, but maybe someone already sees what the problem is.
This is the original code section (Version 1):
    const uint32_t l_ui32Index = p_nSelector % p_nIndex;
    switch( l_ui32Index )
    {
        case 0:
        case 1:
        case 2:
        case 3:
            return p_ui64Header[l_ui32Index];
        default:
            return p_ui64Header2[l_ui32Index - 4];
    }
I replaced it with this code (Version 2):
    const uint32_t l_ui32Index = p_nSelector % p_nIndex;
    switch( l_ui32Index )
    {
        case 0:
        case 1:
        case 2:
        case 3:
            return p_ui64Header[l_ui32Index];
        default:
            return NULL == p_ui64Header2 ? 0 : p_ui64Header2[l_ui32Index - 4];
    }
As you can see, the only change is the ‘default’ case of the switch. ALL threads see the SAME data, so they all take the same case/default path.
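To make the comparison easier to reproduce, here is a minimal host-side sketch of the two variants. The function names `selectV1` and `selectV2`, the parameter types, and the `uint64_t` return type are assumptions made for illustration (guessed from the variable prefixes); only the switch bodies are taken from the snippets above. In the real kernel these would presumably be `__device__` functions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Version 1: the default case reads p_ui64Header2 unconditionally.
// Hypothetical wrapper; name and signature are assumptions.
inline uint64_t selectV1(const uint64_t* p_ui64Header, const uint64_t* p_ui64Header2,
                         uint32_t p_nSelector, uint32_t p_nIndex)
{
    assert(p_nIndex != 0); // guard added for the sketch: avoid modulo by zero
    const uint32_t l_ui32Index = p_nSelector % p_nIndex;
    switch( l_ui32Index )
    {
        case 0:
        case 1:
        case 2:
        case 3:
            return p_ui64Header[l_ui32Index];
        default:
            return p_ui64Header2[l_ui32Index - 4];
    }
}

// Version 2: the default case returns 0 when p_ui64Header2 is NULL.
// Hypothetical wrapper; name and signature are assumptions.
inline uint64_t selectV2(const uint64_t* p_ui64Header, const uint64_t* p_ui64Header2,
                         uint32_t p_nSelector, uint32_t p_nIndex)
{
    assert(p_nIndex != 0); // guard added for the sketch: avoid modulo by zero
    const uint32_t l_ui32Index = p_nSelector % p_nIndex;
    switch( l_ui32Index )
    {
        case 0:
        case 1:
        case 2:
        case 3:
            return p_ui64Header[l_ui32Index];
        default:
            return NULL == p_ui64Header2 ? 0 : p_ui64Header2[l_ui32Index - 4];
    }
}
```

With a non-NULL `p_ui64Header2` both variants return the same value for every index, so any difference between them should come from the generated code and not from the result.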
Version 1 reports this register usage:
ptxas . 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 128 registers, 18432 bytes smem, 376 bytes cmem[0]
This is Version 2:
It uses about 25% fewer registers, 96 instead of 128:
ptxas . 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 96 registers, 18432 bytes smem, 376 bytes cmem[0]
Version 2 performs more than 20% worse than Version 1! I do not understand how this simple if/else (the ONLY modification) can save about 25% of the registers and yet run more than 20% slower. I was under the impression that using fewer registers could have a positive impact on performance.
Please let me know if this is a typical newbie mistake.
CUDA 10.2, Windows 10, GTX 1070 / RTX 2060
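My reasoning about the registers, as a rough sketch: with fewer registers per thread, more threads should fit on one SM. The figure of 65536 32-bit registers per SM is an assumption that I believe holds for both cards (compute capability 6.1 and 7.5). The helper name `registerBoundThreads` is made up for this example, and the calculation ignores register allocation granularity, the 18432 bytes of shared memory, and the block size, all of which can lower the real number.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: 65536 32-bit registers per SM (compute capability 6.1 and 7.5).
const uint32_t kRegistersPerSM = 65536;

// Upper bound on resident threads per SM from register usage alone.
// Hypothetical helper; ignores allocation granularity, shared memory and block size.
inline uint32_t registerBoundThreads(uint32_t regsPerThread)
{
    assert(regsPerThread > 0); // avoid division by zero
    return kRegistersPerSM / regsPerThread;
}

// registerBoundThreads(128) -> 512 threads (Version 1)
// registerBoundThreads(96)  -> 682 threads (Version 2, before rounding down to whole warps)
```

So on paper Version 2 should allow more resident threads per SM, which is why the slowdown surprises me.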
Thanks.