Been porting code from 32-bit to 64-bit recently, 95% of my code packages made the transfer smoothly except for one, which happens to not have any 32-bit depenencies. Some unit testing revealed that “blockIdx.x” refused to stay within the allocated grid dimensions…
So basically my 32-bit and 64-bit code behave differently despite identical code where the 64-bit versions “blockIdx.x” goes mad.
Has anyone seen this behaviour before?
Thanks,
Jimmy
EDIT: When checking which error i get “unknown error”. If I run with the -G0 flag it runs without a hitch and passes the CPU verification code… Of course at the cost of being much slower…
Also is good idea as was mentioned to check shared memory access. Maybe you allocate more data, cause you use longer pointers etc and memory access is spoiled. Do you use shared memory?
Ok, some further digging revealed where the problem was at. I’m still puzzled as to why this occurs however.
It turned out that in a loop the compiler interpreted the addresses as (unsigned int) instead of (int) which for some reason caused the failure. This did not work:
// CASE A
for(int i = 0; i < 4; i++)
{
smem_block[threadIdx.x + blockDim.x * i] = global_ptr[threadIdx.x + blockDim.x*i + nca + blockIdx.y * 4096];
}
While this does:
//CASE B
for(int i = 0; i < 4; i++)
{
int global_address = threadIdx.x + blockDim.x*i + nca + blockIdx.y*4096;
int smem_address = threadIdx.x + blockDim.x * i;
smem_block[ smem_address] = global_ptr[global_address];
}
So to summarize CASE A worked fine for 32-bit but not for 64-bit while CASE B works for the 64-bit version. The question is why?
If you could post a self-contained repro case here I would be happy to take a look at this. I can follow up with our compiler team if necessary. Alternatively, you could send the code in a private message via the forum (PMs support file attachements).
Based on the repro case provided by Jimmy I have tracked this down to a compiler issue, and have filed a compiler bug. The workaround in this case is to cast the indexing expression for the global memory access to “int” (it is “unsigned int” for the code as written due to quantities such as threadIdx.x in the expression, which are of type “unsigned int”). Thanks to Jimmy for alerting us to the issue and working out a repro case.
Interestingly reg_a and rec_const are never == 0.0f so the printf(…) is never called. So it seems when we choose to look the error goes away (Heisenbugg? ) :)
So could it be that the compiler somehow finds it unnecessary to execute this code segment unless a data dependency like “reg_a == 0.0f” is introduced?
The issue stems from an implicit cast from a 32-bit integer to a 64-bit pointer. Ints are sign extended before the cast whereas unsigned ints are zero extended. This will only matter if your address rolls over or if nca is negative.