Problems in 32-bit ---> 64-bit conversion Kernel idx goes mad, known issue?

Hi,

Been porting code from 32-bit to 64-bit recently, 95% of my code packages made the transfer smoothly except for one, which happens to not have any 32-bit depenencies. Some unit testing revealed that “blockIdx.x” refused to stay within the allocated grid dimensions…

So basically my 32-bit and 64-bit code behave differently despite identical code where the 64-bit versions “blockIdx.x” goes mad.

Has anyone seen this behaviour before?

Thanks,
Jimmy

EDIT: When checking which error i get “unknown error”. If I run with the -G0 flag it runs without a hitch and passes the CPU verification code… Of course at the cost of being much slower…

Try to use other version of cuda, it may be compiler issue.

That has got to be some sort of hosing of shared memory. Try cuda-memcheck and see what it reports.

I’m using the latest 260.99 notebook dev driver (Win 7) and the 3.2.16_win_64 toolkit. Are you suggesting i should try older drivers?

No, cuda 3.1 or 3.0, if it supports 64 bit.

Also is good idea as was mentioned to check shared memory access. Maybe you allocate more data, cause you use longer pointers etc and memory access is spoiled. Do you use shared memory?

Ok, some further digging revealed where the problem was at. I’m still puzzled as to why this occurs however.

It turned out that in a loop the compiler interpreted the addresses as (unsigned int) instead of (int) which for some reason caused the failure. This did not work:

// CASE A

	        for(int i = 0; i < 4; i++)

		{

			

			smem_block[threadIdx.x + blockDim.x * i] = global_ptr[threadIdx.x + blockDim.x*i + nca + blockIdx.y * 4096];

		}

While this does:

//CASE B

                for(int i = 0; i < 4; i++)

		{

			int global_address = threadIdx.x + blockDim.x*i + nca + blockIdx.y*4096;

			int smem_address =   threadIdx.x + blockDim.x * i;

			smem_block[  smem_address] = 	global_ptr[global_address];

		}

So to summarize CASE A worked fine for 32-bit but not for 64-bit while CASE B works for the 64-bit version. The question is why?

Can you unroll the loop? It maybe compiler bug. Most likely.

Yep, it’s unrolled.

I had a couple of guys look at it, most think it might be a compiler bug…

If you could post a self-contained repro case here I would be happy to take a look at this. I can follow up with our compiler team if necessary. Alternatively, you could send the code in a private message via the forum (PMs support file attachements).

Thanks for having a look, I will PM you a repro-case.

Based on the repro case provided by Jimmy I have tracked this down to a compiler issue, and have filed a compiler bug. The workaround in this case is to cast the indexing expression for the global memory access to “int” (it is “unsigned int” for the code as written due to quantities such as threadIdx.x in the expression, which are of type “unsigned int”). Thanks to Jimmy for alerting us to the issue and working out a repro case.

Thanks to Norbert for taking care of that issue.

Meanwhile, I discovered the following while working on a 32-bit machine:

This generates a minor numerical error:

__device__ __forceinline__ float computeSomething(float reg_a, float rec_const, int k)

{

	/*if(reg_a == 0.0f || rec_const == 0.0f)

		printf("\n --------\n"); */

	return d_m[0] == 1 ? (reg_a + rec_const*d_h_k[k]): d_p_k[k]; 

						 

}

While this works without an issue:

__device__ __forceinline__ float computeSomething(float reg_a, float rec_const, int k)

{

	if(reg_a == 0.0f || rec_const == 0.0f)

		printf("\n --------\n"); 

	return d_m[0] == 1 ? (reg_a + rec_const*d_h_k[k]): d_p_k[k]; 

						 

}

Interestingly reg_a and rec_const are never == 0.0f so the printf(…) is never called. So it seems when we choose to look the error goes away (Heisenbugg? ) :)

So could it be that the compiler somehow finds it unnecessary to execute this code segment unless a data dependency like “reg_a == 0.0f” is introduced?

Grateful for any help!

//Jimmy

I think that I’ve seen this one before. See this thread: http://groups.google.com/group/thrust-users/browse_thread/thread/f522b675b45174c3

The issue stems from an implicit cast from a 32-bit integer to a 64-bit pointer. Ints are sign extended before the cast whereas unsigned ints are zero extended. This will only matter if your address rolls over or if nca is negative.

Aha, yes Norbert had an analysis on this issue. Please read also above post about possible new issue on 32-bit compiler! :)