Hello,
I have uncovered some unexpected behavior in my CUDA application; maybe someone here can explain what is going on. My CUDA kernel fails to write a hard-coded value to a linear array when:
i) the array index > 65535, and
ii) the array element type is a struct such as uchar4, and
iii) the kernel doesn’t contain an if statement
iv) when not in emu mode
The structure of my program is simple:
-
Allocate a large (256 * 256 * 512 * sizeof( uchar4 ) ) buffer on the host.
-
Set the buffer from (1) to all 1s using memset( … ).
-
On the host, iterate through the buffer from (1) and assert that every element is indeed 1.
-
Allocate an equally sized buffer on the device using CUDA_SAFE_CALL( cudaMalloc( … ) ).
-
Copy the host buffer from (1) to the device buffer from (4) using CUDA_SAFE_CALL( cudaMemcpy( … ) ).
-
Execute a device kernel that computes a canonical 1D array index based on gridDim, blockDim, blockIdx, threadIdx, and sets x, y, z, w of each element of the device buffer from (4) to 255.
-
Copy the device buffer back to the host buffer
-
On the host, iterate through the buffer from (1) and assert that x, y, z, w of every element is 255.
Expected behavior:
Since the grid and block dimensions are sufficient to cover the entire device buffer, I expect the host buffer to be completely filled with the value 255. It is worth noting that this expected behavior is observed when I run the program in emu mode.
Actual behavior:
The first 65535 ( = 256 * 256 ) elements of the host buffer are correctly set to x = 255, y = 255, z = 255, w = 255. The remaining elements are still set to x = 1, y = 1, z = 1, w = 1. It is as though the kernel never ran over the remaining (> 65535) elements of the array, since these elements still have their old value. This strange behavior is not observed in emu mode.
Workarounds:
-
When I change my data element from uchar4 to unsigned int, it works as expected.
-
When I add an if statement (that doesn’t change the overall logic of the kernel), it works as expected. This is particularly strange.
-
When I run in emu mode, it works as expected.
Code:
[codebox]
typedef uchar4 CudaArrayElement;
extern “C” void CudaCallUpdateKernel( CudaArrayElement* deviceData )
{
// set the thread block size to the maximum
dim3 threadBlockDimensions( 8, 8, 8 );
// set the grid dimensions
dim3 gridDimensions( 256, 256, 1 );
// call the kernel
HelloWorldKernel<<< gridDimensions, threadBlockDimensions >>>( deviceData );
}
[/codebox]
Here is the kernel itself:
[codebox]
typedef uchar4 CudaArrayElement;
global void HelloWorldKernel( CudaArrayElement* deviceData )
{
const dim3& gridDimensions = gridDim;
const dim3& threadBlockDimensions = blockDim;
const uint3& threadBlockIndexWithinGrid = blockIdx;
const uint3& threadIndexWithinThreadBlock = threadIdx;
const unsigned long arrayIndex =
( threadBlockIndexWithinGrid.y * ( gridDimensions.x * threadBlockDimensions.z * threadBlockDimensions.y * threadBlockDimensions.x ) ) +
( threadBlockIndexWithinGrid.x * ( threadBlockDimensions.z * threadBlockDimensions.y * threadBlockDimensions.x ) ) +
( threadIndexWithinThreadBlock.z * ( threadBlockDimensions.y * threadBlockDimensions.x ) ) +
( threadIndexWithinThreadBlock.y * ( threadBlockDimensions.x ) ) +
( threadIndexWithinThreadBlock.x );
// although this if statement seems pointless, removing it will cause this kernel not to work as expected
if ( arrayIndex > 0 )
{
deviceData[ arrayIndex ].x = 255;
}
deviceData[ arrayIndex ].x = 255;
deviceData[ arrayIndex ].y = 255;
deviceData[ arrayIndex ].z = 255;
deviceData[ arrayIndex ].w = 255;
}
[/codebox]
What works:
-
We know that arrayIndex is being computed correctly, since this is being computed correctly in emu mode.
-
We know that the copying of buffers from host to device and back to host works, since our initial values written by the host can still be seen after coming back from the device. It is not the case that we end up with all 0s in our host buffer. Instead, we end up with the initial values written by the host.
Suspects:
-
Funky alignment issue due to the fact that uchar4 aligns differently than unsigned int?
-
Issue with structs in general? I notice that some of the SDK examples have device methods like “convertRGBtoInt” so their kernels can avoid operating on arrays of structs. Is this supposed to be necessary?
System Specs:
Vista Business SP1 32 bit
CUDA toolkit 2.0
CUDA SDK 2.0 beta2
Visual Studio 2005 for compiling host code
Intel Xeon CPU 3.2GHz (2 processors)
2 GB System RAM
GeForce 8800 GT (on 178.02)
I’m not sure what “chipset type” means, but the computer I’m using is a Macbook Pro with Vista installed. I’m running Vista natively though (not through any virtualization layer).
Anyway I hope that helps! Does anyone have any ideas? If you’re in a similar situation, you can just throw a pointless if statement into your kernel to get it working hehe :) I created this post to see if anyone knows the root cause of this issue. I’d prefer to understand what is happening instead of resorting to a random hack.
Cheers,
Mike