Problem with grids larger than 32768: CUDA bug?

Hi All,

Could anybody compile this code and try it with your CUDA version and hardware? I have CUDA 2.0 on linux64 and a GTX 280. The problem with this simple code is that it produces garbage results on grids larger than 1x32768 (the first dimension is always 1), although it should return the grid size. Here it is:

#include <stdio.h>
#include <stdlib.h>

__global__ void foo(float *d_out)
{
	int blidY = blockIdx.y;
	int thidX = threadIdx.x;
	int thidY = threadIdx.y;
	int g_thidY = blidY * 8 + thidY;

	if (!thidX)
		d_out[g_thidY] = 1.0f;
}

void test(int grid_size)
{
	int size = grid_size * 8;
	float *out, *d_out;
	out = (float*)malloc(size * sizeof(float));
	cudaMalloc((void**)&d_out, size * sizeof(float));

	foo<<< dim3(1, size / 8), dim3(8, 8) >>>(d_out);

	double result = 0.0;
	cudaMemcpy(out, d_out, size * sizeof(float), cudaMemcpyDeviceToHost);
	for (int i = 0; i < size; i++)
		result += out[i];
	printf("%f\n", result / 8);

	free(out);
	cudaFree(d_out);
}

int main()
{
	for (int i = 32000; i < 33000; i += 1)
		test(i);
	return 0;
}

here is my output:

...
32761.000000
32762.000000
32763.000000
32764.000000
32765.000000
32766.000000
32767.000000
32768.000000
1482937404001209024512.000000
1482937403996897017856.000000
78296474278759370254712832.000000
78296495133133742119321600.000000
78296495133133742119321600.000000
156592962484230985571368960.000000
...

Maybe it's because you are not calling cudaThreadSynchronize() to wait for your kernel to complete (kernel launches are asynchronous).

But I'm not sure whether any of this has changed in newer versions of CUDA.

It’s implicit with the memcpy. As far as I am aware it always has been.

I can't see anything wrong at the moment. I'd guess it's something related to the maximum value of 16-bit ints. I would suggest you see if you can reproduce the error with a simpler example; quite a bit of the code here seems redundant, which makes bug spotting harder.

oops. I guess what I just posted was not correct. edit: deleted

Please add checks for whether the kernel is actually launching in this case.

If the kernel doesn't launch, you're summing random data from d_out, because it is never initialized to 0 with cudaMemset().
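For reference, a minimal sketch of such a check around the launch in test() (using the standard CUDA runtime calls available in the 2.x era):

```cuda
// Clear the output buffer first, so stale device memory can't
// masquerade as kernel results.
cudaMemset(d_out, 0, size * sizeof(float));

foo<<< dim3(1, size / 8), dim3(8, 8) >>>(d_out);

// Launches are asynchronous: cudaGetLastError() catches launch/config
// errors, and cudaThreadSynchronize() returns any execution error.
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess)
	fprintf(stderr, "launch failed: %s\n", cudaGetErrorString(err));

err = cudaThreadSynchronize();
if (err != cudaSuccess)
	fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
```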

From what I remember, the maximum grid size is 32768 in either the x or y direction. Check the programming guide for clarification.
A quick search found this: http://forums.nvidia.com/lofiversion/index.php?t56827.html

But according to deviceQuery and the programming guide, the maximum grid dimensions are 65535x65535x1 !!

Indeed, adding a cudaThreadSynchronize()/cudaGetLastError() check results in no errors found.

Checking each individual array element (and memsetting before the kernel call), the array elements 262144 and greater are never set by the kernel.

PTX gurus, we need your help! This looks like a compiler bug to me, but I can’t say for sure.

Here is the PTX:

	mov.f32	 %f1, 0f3f800000;		// 1
	ld.param.u64	%rd1, [__cudaparm__Z3fooPf_d_out];  // id:13 __cudaparm__Z3fooPf_d_out+0x0
	cvt.s32.u16	 %r1, %tid.y;
	cvt.s32.u16	 %r2, %ctaid.y;
	cvt.u16.u32	 %rh1, %r2;
	mul.wide.s16	%r3, %rh1, 8;
	add.s32	 %r4, %r1, %r3;
	cvt.u64.s32	 %rd2, %r4;
	mul.lo.u64	%rd3, %rd2, 4;
	add.u64	 %rd4, %rd1, %rd3;
	st.global.f32	[%rd4+0], %f1;  // id:16
	.loc	14  12  0
 //  12  }
	exit;

It looks like it is doing a signed 16-bit multiplication (mul.wide.s16), which cannot handle block indices of 32768 and up: as a signed 16-bit value, 32768 wraps around to a negative number, so 32768*8 comes out wrong. I don't know PTX well enough to be sure, though.

Regardless, device, a workaround is to change the hardcoded *8 in your kernel to the more sensible (and less error-prone when changing block sizes) blockDim.y. The kernel works then.
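Concretely, the reworked kernel (same as the original post, with only the index computation changed):

```cuda
__global__ void foo(float *d_out)
{
	int blidY = blockIdx.y;
	int thidX = threadIdx.x;
	int thidY = threadIdx.y;
	// blockDim.y is unsigned, so this multiply compiles to
	// mul.wide.u16 instead of the overflowing mul.wide.s16.
	int g_thidY = blidY * blockDim.y + thidY;

	if (!thidX)
		d_out[g_thidY] = 1.0f;
}
```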

For the PTX gurus:

	mov.f32	 %f1, 0f3f800000;		// 1
	ld.param.u64	%rd1, [__cudaparm__Z3fooPf_d_out];  // id:14 __cudaparm__Z3fooPf_d_out+0x0
	cvt.s32.u16	 %r1, %tid.y;
	cvt.s32.u16	 %r2, %ctaid.y;
	cvt.u16.u32	 %rh1, %r2;
	mov.u16	 %rh2, %ntid.y;
	mul.wide.u16	%r3, %rh1, %rh2;
	add.u32	 %r4, %r1, %r3;
	cvt.u64.s32	 %rd2, %r4;
	mul.lo.u64	%rd3, %rd2, 4;
	add.u64	 %rd4, %rd1, %rd3;
	st.global.f32	[%rd4+0], %f1;  // id:18
	.loc	14  12  0
 //  12  }
	exit;

Note the mul.wide.u16 now.

Although, now that I post this, I have an odd sense of déjà vu. Wasn't there a similar bug found a few months ago? Wasn't it reported fixed? The code I've posted here was compiled with CUDA 2.1.

So would making blidY an unsigned int instead of int help?

Christian

Yes, that also produces a working kernel.

Yep, unsigneds work… Thank you All!