Problems in 32-bit ---> 64-bit conversion Kernel idx goes mad, known issue?

Jimmy_Pettersson · January 7, 2011, 3:18pm

Hi,

Been porting code from 32-bit to 64-bit recently, 95% of my code packages made the transfer smoothly except for one, which happens to not have any 32-bit depenencies. Some unit testing revealed that “blockIdx.x” refused to stay within the allocated grid dimensions…

So basically my 32-bit and 64-bit code behave differently despite identical code where the 64-bit versions “blockIdx.x” goes mad.

Has anyone seen this behaviour before?

Thanks,
Jimmy

EDIT: When checking which error i get “unknown error”. If I run with the -G0 flag it runs without a hitch and passes the CPU verification code… Of course at the cost of being much slower…

Lev · January 7, 2011, 3:58pm

Try to use other version of cuda, it may be compiler issue.

avidday · January 7, 2011, 4:13pm

That has got to be some sort of hosing of shared memory. Try cuda-memcheck and see what it reports.

Jimmy_Pettersson · January 8, 2011, 2:33pm

I’m using the latest 260.99 notebook dev driver (Win 7) and the 3.2.16_win_64 toolkit. Are you suggesting i should try older drivers?

Lev · January 8, 2011, 2:56pm

No, cuda 3.1 or 3.0, if it supports 64 bit.

Lev · January 8, 2011, 2:59pm

Also is good idea as was mentioned to check shared memory access. Maybe you allocate more data, cause you use longer pointers etc and memory access is spoiled. Do you use shared memory?

Jimmy_Pettersson · January 9, 2011, 5:49pm

Ok, some further digging revealed where the problem was at. I’m still puzzled as to why this occurs however.

It turned out that in a loop the compiler interpreted the addresses as (unsigned int) instead of (int) which for some reason caused the failure. This did not work:

// CASE A

	        for(int i = 0; i < 4; i++)

		{

			

			smem_block[threadIdx.x + blockDim.x * i] = global_ptr[threadIdx.x + blockDim.x*i + nca + blockIdx.y * 4096];

		}

While this does:

//CASE B

                for(int i = 0; i < 4; i++)

		{

			int global_address = threadIdx.x + blockDim.x*i + nca + blockIdx.y*4096;

			int smem_address =   threadIdx.x + blockDim.x * i;

			smem_block[  smem_address] = 	global_ptr[global_address];

		}

So to summarize CASE A worked fine for 32-bit but not for 64-bit while CASE B works for the 64-bit version. The question is why?

Lev · January 9, 2011, 6:32pm

Can you unroll the loop? It maybe compiler bug. Most likely.

Jimmy_Pettersson · January 9, 2011, 8:40pm

Yep, it’s unrolled.

I had a couple of guys look at it, most think it might be a compiler bug…

njuffa · January 10, 2011, 5:15am

If you could post a self-contained repro case here I would be happy to take a look at this. I can follow up with our compiler team if necessary. Alternatively, you could send the code in a private message via the forum (PMs support file attachements).

Jimmy_Pettersson · January 10, 2011, 10:08am

Thanks for having a look, I will PM you a repro-case.

njuffa · January 11, 2011, 9:01am

Based on the repro case provided by Jimmy I have tracked this down to a compiler issue, and have filed a compiler bug. The workaround in this case is to cast the indexing expression for the global memory access to “int” (it is “unsigned int” for the code as written due to quantities such as threadIdx.x in the expression, which are of type “unsigned int”). Thanks to Jimmy for alerting us to the issue and working out a repro case.

Jimmy_Pettersson · January 24, 2011, 4:35pm

Thanks to Norbert for taking care of that issue.

Meanwhile, I discovered the following while working on a 32-bit machine:

This generates a minor numerical error:

__device__ __forceinline__ float computeSomething(float reg_a, float rec_const, int k)

{

	/*if(reg_a == 0.0f || rec_const == 0.0f)

		printf("\n --------\n"); */

	return d_m[0] == 1 ? (reg_a + rec_const*d_h_k[k]): d_p_k[k]; 

						 

}

While this works without an issue:

__device__ __forceinline__ float computeSomething(float reg_a, float rec_const, int k)

{

	if(reg_a == 0.0f || rec_const == 0.0f)

		printf("\n --------\n"); 

	return d_m[0] == 1 ? (reg_a + rec_const*d_h_k[k]): d_p_k[k]; 

						 

}

Interestingly reg_a and rec_const are never == 0.0f so the printf(…) is never called. So it seems when we choose to look the error goes away (Heisenbugg? ) :)

So could it be that the compiler somehow finds it unnecessary to execute this code segment unless a data dependency like “reg_a == 0.0f” is introduced?

Grateful for any help!

//Jimmy

Gregory_Diamos · January 24, 2011, 4:59pm

Ok, some further digging revealed where the problem was at. I’m still puzzled as to why this occurs however.

It turned out that in a loop the compiler interpreted the addresses as (unsigned int) instead of (int) which for some reason caused the failure. This did not work:
// CASE A

	        for(int i = 0; i < 4; i++)

		{

			

			smem_block[threadIdx.x + blockDim.x * i] = global_ptr[threadIdx.x + blockDim.x*i + nca + blockIdx.y * 4096];

		}
While this does:
//CASE B

                for(int i = 0; i < 4; i++)

		{

			int global_address = threadIdx.x + blockDim.x*i + nca + blockIdx.y*4096;

			int smem_address =   threadIdx.x + blockDim.x * i;

			smem_block[  smem_address] = 	global_ptr[global_address];

		}
So to summarize CASE A worked fine for 32-bit but not for 64-bit while CASE B works for the 64-bit version. The question is why?

I think that I’ve seen this one before. See this thread: http://groups.google.com/group/thrust-users/browse_thread/thread/f522b675b45174c3

The issue stems from an implicit cast from a 32-bit integer to a 64-bit pointer. Ints are sign extended before the cast whereas unsigned ints are zero extended. This will only matter if your address rolls over or if nca is negative.

Jimmy_Pettersson · January 24, 2011, 5:01pm

Aha, yes Norbert had an analysis on this issue. Please read also above post about possible new issue on 32-bit compiler! :)

Topic		Replies	Views
Wrong unsigned to signed conversion Problem with types conversion leads to incorrect kernel's re CUDA Programming and Performance	3	6971	February 16, 2009
Cross Compiling 32bit on a 64bit Machine issues CUDA Programming and Performance	0	1876	November 3, 2011
Cuda Application Works on 32 bits, not in 64 bits CUDA Programming and Performance	0	2298	September 1, 2009
64 bit integer inconsistancy with CUDA 5.0 CUDA Programming and Performance	3	893	January 16, 2013
Code compiled with -m64 fails but with -m32 works just fine Are there any special moments to remembe CUDA Programming and Performance	33	7522	October 10, 2010
BlockIdx.x suddenly jumping to weird, high value CUDA Programming and Performance	5	1470	January 19, 2010
CUDA and 64bit Linux problem CUDA Programming and Performance	5	4160	November 2, 2007
Size_t CUDA Programming and Performance	4	5365	September 30, 2020
Is there a problem with the cuda program when switching a 32bit program to a 64bit program? CUDA Programming and Performance	1	507	April 17, 2019
Strange behavior CUDA Programming and Performance	5	6630	November 7, 2010

Problems in 32-bit ---> 64-bit conversion Kernel idx goes mad, known issue?

Related topics