Maximum grid dimension: change from CUDA 1.0 to 2.0?

I am currently running CUDA on an 8800 GTS under SLAMD64 (a 64-bit Slackware Linux distribution) using the SUSE Linux Enterprise Desktop driver. I thought the maximum grid dimension was 65535, but when I moved from CUDA 1.0 to 2.0 I got a seg fault in code that had previously run successfully. I narrowed the problem down to the following code:
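(In case it helps, the actual hardware limits can be queried at runtime with cudaGetDeviceProperties. A minimal sketch, assuming device 0:)

[codebox]#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // query device 0

    printf("Device: %s\n", prop.name);
    printf("Max grid size: %d x %d x %d\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    return 0;
}[/codebox]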

#include <stdio.h>
#include <stdlib.h>
#include <cutil.h>
#include <cuda.h>

#define BLOCK_SIZE_IDX 256

__global__ void applyIndex(float *orderedArray, float *inArray, int *indxArray)
{
    // Block index
    int b_idx = blockIdx.x;

    // Thread index
    int t_idx = threadIdx.x;

    int indx = BLOCK_SIZE_IDX*b_idx + t_idx;

    unsigned int index = indxArray[indx];
    orderedArray[indx] = inArray[index];
}

int main(int argc, char **argv)
{
    int P = 256*256;
    int M = atol(argv[1]);

    int PxM = P * M;

    int szflt = sizeof(float);
    int szint = sizeof(int);
    printf("M: %d P: %d\n", M, P);

    int *idx;
    CUDA_SAFE_CALL(cudaMallocHost((void **) &idx, PxM*szint));
    for (int i = 0; i < PxM; i++) idx[i] = i;

    int *d_p_idx;
    CUDA_SAFE_CALL(cudaMalloc((void **) &d_p_idx, PxM*szint));
    CUDA_SAFE_CALL(cudaMemcpy(d_p_idx, idx, PxM*szint, cudaMemcpyHostToDevice));

    // Free space
    CUDA_SAFE_CALL(cudaFreeHost(idx));

    float *ws1_dev, *ws2_dev;
    CUDA_SAFE_CALL(cudaMalloc((void **) &ws1_dev, PxM*szflt));
    CUDA_SAFE_CALL(cudaMalloc((void **) &ws2_dev, PxM*szflt));

    printf("Threads per block: %d Grid dimension: %d\n",
           BLOCK_SIZE_IDX, PxM/BLOCK_SIZE_IDX);
    applyIndex<<<PxM/BLOCK_SIZE_IDX, BLOCK_SIZE_IDX>>>(ws2_dev, ws1_dev, d_p_idx);
    CUT_CHECK_ERROR("Kernel execution failed");
    CUDA_SAFE_CALL(cudaThreadSynchronize());

    // clean up memory
    CUDA_SAFE_CALL(cudaFree(ws1_dev));
    CUDA_SAFE_CALL(cudaFree(ws2_dev));
    CUDA_SAFE_CALL(cudaFree(d_p_idx));

    return 0;
}

When executed with M=128 I get: Threads per block: 256 Grid dimension: 32768.
With M=129 I get: Threads per block: 256 Grid dimension: 33024, plus an "unspecified launch failure". (Both grid sizes are well under the 65535 limit.)

I receive no errors from the memory allocation or copy commands. The code runs successfully in emulation for both cases, and running under valgrind shows no errors. Any ideas what the problem might be?

Thanks.

It is some sort of integer PTX bug. The value of 'indx' is being computed incorrectly.

Here's my slightly modified code:

[codebox]#include <stdio.h>
#include <stdlib.h>
#include <cutil.h>
#include <cuda.h>

#define BLOCK_SIZE_IDX 256

__global__ void applyIndex(int *orderedArray)
{
    int b_idx = blockIdx.x;
    int t_idx = threadIdx.x;

    int indx = BLOCK_SIZE_IDX*b_idx + t_idx;

    orderedArray[indx] = indx;
}

int main(int argc, char **argv)
{
    int P = 256*256;
    int M = atol(argv[1]);

    int PxM = P * M;

    int szint = sizeof(int);
    printf("M: %d P: %d\n", M, P);

    int *ws2_dev;
    CUDA_SAFE_CALL(cudaMalloc((void **) &ws2_dev, PxM*szint));

    printf("Threads per block: %d Grid dimension: %d\n", BLOCK_SIZE_IDX, PxM/BLOCK_SIZE_IDX);
    applyIndex<<<PxM/BLOCK_SIZE_IDX, BLOCK_SIZE_IDX>>>(ws2_dev);
    CUT_CHECK_ERROR("Kernel execution failed");
    CUDA_SAFE_CALL(cudaThreadSynchronize());

    int *returnArray;
    CUDA_SAFE_CALL(cudaMallocHost((void **) &returnArray, PxM*szint));
    CUDA_SAFE_CALL(cudaMemcpy(returnArray, ws2_dev, PxM*szint, cudaMemcpyDeviceToHost));

    // breakpoint here to examine returnArray: elements 8388608 (i.e. 2^23) and above are all 0
    CUDA_SAFE_CALL(cudaFreeHost(returnArray));

    // clean up memory
    CUDA_SAFE_CALL(cudaFree(ws2_dev));

    return 0;
}[/codebox]

Elements of returnArray should equal their own index, but every element from 8388608 (i.e. 2^23) onward is 0.

And here is the kernel decuda'd:

000000: 10004c05 0023c780 mov.b16 $r0.hi, s[0x000c]

000008: a0000005 04000780 cvt.u32.u16 $r1, $r0.lo

000010: 60000305 00000013 mad24.lo.u32.u16.u16 $r65, $r0.hi, 0x0100 // (No operand 4 in this instruction)

000018: 30020201 c4100780 shl.u32 $r0, $r1, 0x00000002

000020: 2000c801 04200780 add.u32 $r0, s[0x0010], $r0

000028: d00e0005 a0c00781 mov.end.u32 g[$r0], $r1

// segment: const (1:0000)

0000: 00000100

Note how decuda can't parse the mad24 instruction, decoding it as a write to register $r65 instead of $r1. Something must be wrong with that instruction. mul24 should be able to handle this range, but it behaves as if the precision were only 23 bits, not 24.
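To see the two code paths side by side, here is a variant of the repro kernel (my naming; it drops into the test harness above in place of applyIndex, with a second output array allocated the same way):

[codebox]#define BLOCK_SIZE_IDX 256

__global__ void compareIndexing(int *autoIdx, int *mul24Idx)
{
    // Compiler-generated multiply: on this toolchain nvcc emits the broken mad24
    int a = BLOCK_SIZE_IDX*blockIdx.x + threadIdx.x;
    autoIdx[a] = a;

    // Explicit 24-bit multiply: this one comes out correct
    int b = __mul24(BLOCK_SIZE_IDX, blockIdx.x) + threadIdx.x;
    mul24Idx[b] = b;
}[/codebox]

If the bug behaves as above, after copying both arrays back autoIdx should stop matching its index at 2^23 while mul24Idx stays correct.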

P.S. I'm on Vista x64 with a GTX260.

Alex,

Thanks for your input. I had never heard of decuda. When I applied it to a file containing a number of similar kernels, I noticed that the bad instruction:

mad24.lo.u32.u16.u16 $r65, $r0.hi, 0x0100

occurred when I used:

int indx = BLOCK_SIZE_IDX*b_idx + t_idx;

but not when I used:

unsigned int indx = __mul24(BLOCK_SIZE_IDX, b_idx) + t_idx;

After I changed this in applyIndex the code ran successfully.
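For reference, here is the full kernel with that one-line change applied (everything else exactly as in my original post):

[codebox]__global__ void applyIndex(float *orderedArray, float *inArray, int *indxArray)
{
    // Block index
    int b_idx = blockIdx.x;

    // Thread index
    int t_idx = threadIdx.x;

    // Explicit 24-bit multiply avoids the miscompiled mad24 that
    // BLOCK_SIZE_IDX*b_idx generates on this toolchain
    unsigned int indx = __mul24(BLOCK_SIZE_IDX, b_idx) + t_idx;

    unsigned int index = indxArray[indx];
    orderedArray[indx] = inArray[index];
}[/codebox]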

Thanks very much! This "feature" has been puzzling me for a few months.

Joel

Just confirmed this and filed it as a bug, thanks.