I have found a simple kernel which seems to cause the CUDA environment to die. It is intermittent, requiring somewhere between
0 and 100k iterations. After “death”, subsequent CUDA calls get an “unspecified launch failure” until the calling program exits. I am using the kernel to do a high speed device memory to device memory copy, but other permutations of the kernel which do more useful work also exhibit the same behavior. Some thread block sizes work just fine, others have the problem. For example, xSize of 512 with 128 threads per block works just fine. xSize of 448 with 112 threads per block causes the CUDA environment to die after somewhere between 0 to 100k iterations. The kernel uses shared memory to stored reads from input device memory and then writes out to device memory
from shared memory. When shared memory is not used, the kernel runs much more slowly and the problem doesn’t happen.
Trying out xSizes of 32 to 512 in multiples of 16, failure was found for xSizes of 144,160,192, 272, 304, 400 and 448.
There may be others between 448 and 144. xSize of 128, 256, 384, 464-512 are definitely ok.
The thread block size is xSize/4.
CUDA 2.1 Beta toolkit and SDK
Visual Studio 2005
I have tried it on 2 platforms and with 2 different GPU cards
NVIDIA 181.20 driver
Intel Xeon 5120 1.86 GHz 2 GBytes RAM Windows XP
Intel Xeon E5410 2.33 GHz 8 Gbytes RAM Windows XP64
8800 GT and FX3700
Here is the kernel and wrapper function:
[codebox]global void
dummy_kernel(unsigned char *id, unsigned char *od, unsigned int yPitch, unsigned int zPitch)
{
unsigned int x = threadIdx.x * 4;
unsigned int y = blockIdx.x;
unsigned int z = blockIdx.y;
__shared__ unsigned char shMem[512]; // 1 line of samples
unsigned int i = __mul24(z,zPitch) + __mul24(y,yPitch) + x;
*(int *)&shMem = (int)&id[i];
__syncthreads();
*(int *)&od[i] = *(int *)&shMem[x];
}
extern “C” void
testKernel(unsigned char *d_in, unsigned char *d_out, int xSize, int ySize, int zSize, int yPitch)
{
dim3 gridSz(ySize,zSize);
dim3 blockSz(xSize/4);
int zPitch = ySize * yPitch;
dummy_kernel<<<gridSz, blockSz>>>(d_in, d_out, yPitch, zPitch);
}
[/codebox]
Here is the calling and setup code:
[codebox]#include <stdio.h>
#include <cuda_runtime.h>
#include <cutil.h>
extern “C” void testKernel(unsigned char *d_in, unsigned char *d_out, int xSize, int ySize, int zSize, int yPitch);
//
// make a number of 3D device arrays for use as temporary buffers
//
void
createDeviceArrays(int xSize, int ySize, int zSize, int count, size_t *pitch, unsigned char **bufA)
{
cudaThreadSynchronize();
cudaPitchedPtr pitchedPtrA;
cudaExtent extentDesc = make_cudaExtent(xSize,ySize,zSize);
for (int j=0; j<count; j++)
{
cudaError_t e = (cudaError_t) cudaMalloc3D( &pitchedPtrA, extentDesc );
if (e)
{
printf("ERROR: createDeviceArrays %d %s \n",e,cudaGetErrorString(e));
}
bufA[j] = (unsigned char*) pitchedPtrA.ptr;
printf("createDeviceArray[%d]=%IX size=%d %d %d \n",j,bufA[j],xSize,ySize,zSize);
}
*pitch = pitchedPtrA.pitch;
cudaThreadSynchronize();
}
#define N_3D_DEVICE_BUFFERS 2
unsigned char *d_buf3D[N_3D_DEVICE_BUFFERS];
size_t yPitch;
void
allocate(int xSize, int ySize, int zSize)
{
createDeviceArrays(xSize, ySize, zSize, N_3D_DEVICE_BUFFERS, &yPitch, d_buf3D);
}
void
process(unsigned char *src, unsigned char *dest, int xSize, int ySize, int zSize)
{
cudaError_t e;
// skip copying host buffers up to the device
//cudaThreadSynchronize();
testKernel(d_buf3D[0], d_buf3D[1], xSize, ySize, zSize, (int)yPitch);
//cudaThreadSynchronize();
// skip copying device buffers to host
if (e=cudaGetLastError())
{
printf("testKernel cuda Error: %d %s\n",e,cudaGetErrorString(e));
printf(" %d %d %d\n",xSize, ySize, zSize);
while (1) {} // just wait here
}
}
unsigned char buffer1[512256256], buffer2[512256256];
void
main()
{
// some “good” xSizes 512, 480, 256
// “bad” xSizes 448, 400
int xSize=448;
int ySize=137;
int zSize=72;
allocate(xSize,ySize,zSize);
printf(“pitch=%d\n”,(int)yPitch);
for (int k=0; k<100; k++)
{
for (int j=0; j<1000; j++)
{
process(buffer1, buffer2, xSize, ySize, zSize);
}
printf("%d k iterations\n",k+1);
}
printf(“success\n”);
while (1)
{}
}[/codebox]
I would be happy with any information about this problem. I have a work around for now–just don’t allow those “bad” sizes
to be used. I just increase pad the data to the next highest “good” size and accept the performance penalty.
I would like to understand this problem and make sure my work-around is truly robust or have a fix.
By the way, this kernel which does device memory to device memory copies appears to be a lot faster than memcpy3D–perhaps
by about a factor of 4. Anybody have any idea why? Could it be related to my problem?
I have also attached the complete Visual Studio 2005 project. It should be placed in the SDK projects directory.
Thanks for any help.
cellophane man
cudaVIEbug.zip (3.47 MB)