CUDA kernel fails to write a hard-coded value to a linear array when the array index > 65535 but


I have uncovered some unexpected behavior in my CUDA application; maybe someone here can explain what is going on. My CUDA kernel fails to write a hard-coded value to a linear array when:

i) the array index > 65535, and

ii) the array element type is a struct such as uchar4, and

iii) the kernel doesn’t contain an if statement

iv) when not in emu mode

The structure of my program is simple:

  1. Allocate a large (256 * 256 * 512 * sizeof( uchar4 ) ) buffer on the host.

  2. Set the buffer from (1) to all 1s using memset( … ).

  3. On the host, iterate through the buffer from (1) and assert that every element is indeed 1.

  4. Allocate an equally sized buffer on the device using CUDA_SAFE_CALL( cudaMalloc( … ) ).

  5. Copy the host buffer from (1) to the device buffer from (4) using CUDA_SAFE_CALL( cudaMemcpy( … ) ).

  6. Execute a device kernel that computes a canonical 1D array index based on gridDim, blockDim, blockIdx, threadIdx, and sets x, y, z, w of each element of the device buffer from (4) to 255.

  7. Copy the device buffer back to the host buffer

  8. On the host, iterate through the buffer from (1) and assert that x, y, z, w of every element is 255.

Expected behavior:

Since the grid and block dimensions are sufficient to cover the entire device buffer, I expect the host buffer to be completely filled with the value 255. It is worth noting that this expected behavior is observed when I run the program in emu mode.

Actual behavior:

The first 65535 ( = 256 * 256 ) elements of the host buffer are correctly set to x = 255, y = 255, z = 255, w = 255. The remaining elements are still set to x = 1, y = 1, z = 1, w = 1. It is as though the kernel never ran over the remaining (> 65535) elements of the array, since these elements still have their old value. This strange behavior is not observed in emu mode.


  • When I change my data element from uchar4 to unsigned int, it works as expected.

  • When I add an if statement (that doesn’t change the overall logic of the kernel), it works as expected. This is particularly strange.

  • When I run in emu mode, it works as expected.



typedef uchar4 CudaArrayElement;

extern “C” void CudaCallUpdateKernel( CudaArrayElement* deviceData )


// set the thread block size to the maximum

dim3 threadBlockDimensions( 8, 8, 8 );

// set the grid dimensions

dim3 gridDimensions( 256, 256, 1 );

// call the kernel

HelloWorldKernel<<< gridDimensions, threadBlockDimensions >>>( deviceData );



Here is the kernel itself:


typedef uchar4 CudaArrayElement;

global void HelloWorldKernel( CudaArrayElement* deviceData )


const dim3&  gridDimensions               = gridDim;

const dim3&  threadBlockDimensions        = blockDim;

const uint3& threadBlockIndexWithinGrid   = blockIdx;

const uint3& threadIndexWithinThreadBlock = threadIdx;

const unsigned long arrayIndex =

	( threadBlockIndexWithinGrid.y   * ( gridDimensions.x * threadBlockDimensions.z * threadBlockDimensions.y * threadBlockDimensions.x ) ) +

	( threadBlockIndexWithinGrid.x   * ( threadBlockDimensions.z * threadBlockDimensions.y * threadBlockDimensions.x ) ) +

	( threadIndexWithinThreadBlock.z * ( threadBlockDimensions.y * threadBlockDimensions.x ) ) +

	( threadIndexWithinThreadBlock.y * ( threadBlockDimensions.x ) ) +

	( threadIndexWithinThreadBlock.x );

// although this if statement seems pointless, removing it will cause this kernel not to work as expected

if ( arrayIndex > 0 )


	deviceData[ arrayIndex ].x = 255;


deviceData[ arrayIndex ].x = 255;

deviceData[ arrayIndex ].y = 255;

deviceData[ arrayIndex ].z = 255;

deviceData[ arrayIndex ].w = 255;



What works:

  • We know that arrayIndex is being computed correctly, since this is being computed correctly in emu mode.

  • We know that the copying of buffers from host to device and back to host works, since our initial values written by the host can still be seen after coming back from the device. It is not the case that we end up with all 0s in our host buffer. Instead, we end up with the initial values written by the host.


  • Funky alignment issue due to the fact that uchar4 aligns differently than unsigned int?

  • Issue with structs in general? I notice that some of the SDK examples have device methods like “convertRGBtoInt” so their kernels can avoid operating on arrays of structs. Is this supposed to be necessary?

System Specs:

Vista Business SP1 32 bit

CUDA toolkit 2.0

CUDA SDK 2.0 beta2

Visual Studio 2005 for compiling host code

Intel Xeon CPU 3.2GHz (2 processors)

2 GB System RAM

GeForce 8800 GT (on 178.02)

I’m not sure what “chipset type” means, but the computer I’m using is a Macbook Pro with Vista installed. I’m running Vista natively though (not through any virtualization layer).

Anyway I hope that helps! Does anyone have any ideas? If you’re in a similar situation, you can just throw a pointless if statement into your kernel to get it working hehe :) I created this post to see if anyone knows the root cause of this issue. I’d prefer to understand what is happening instead of resorting to a random hack.



You should check for kernel launch errors after calling the kernel. My guess is that you have a “too many resources requested for launch” error as you are running 512 threads in a block. Changing the data type will increase the number of registers used and make the launch fail when there are not enough registers to run that large of a block.

Hi MisterAnderson,

Thanks for the suggestion. When you said that I was thinking “Man I’m such an idiot for not checking the error conditions!”

So I added the following right after I call the kernel.


HelloWorldKernel<<< 256, 256 >>>( deviceData );

cudaError_t error = cudaGetLastError();

const char* errorString = cudaGetErrorString( error );

assert( error == cudaSuccess );


The code still demonstrates the funky behavior and doesn’t return an error code. Am I checking for errors properly?

Your theory still leaves some open questions. Why would adding the if statement to the kernel fix the problem? Presumably the if statement wouldn’t reduce the register count in the compiled kernel.

Nonetheless thanks for the suggestion. I should definitely have that error checking code in there anyway so I appreciate the heads up! :)



kernel launches are asyncrhronous, so calling cudaGetLastError right after the call most likely checks for any errors before the launch. You need a cudaThreadSynchronize before the cudaGetLastError call to actually check for errors during the kernel launch.

As for register usage, making simple changes to code often changes the register usage +/-1. It all comes down to the register allocation in ptxas and how independent instructions are reordered. Compile with --ptxax-options -v (IIRC, check the forums) for a printout of the number of registers used in the kernel.

16 registers should be enough for this kernel (and mroberts says he gets partial results).

I think this may be related to this bug:
Index arithmetic is not being done with enough bits. I think this why the array stops having valid data after the peculiar index 64k.

mroberts, download decuda and post the decompiled version of your cubin. (the working and nonworking one)