unspecified launch failure simple volume initialization fails

Hi Everybody,

I have problems with my project code since I updated from CUDA 0.8 to 1.0.
After a while I could specify the problem, which causes a CUDA Error
“unspecified launch failure”.

I’ve attached a simple CUDA SDK Project which just allocates memory for
a specified volume size and initializes the values with 0.0f. I am aware that,
there is a memset function, but I just wanted to use a very simple example.

The error occurs in “cudaThreadSynchronize();” after calling the kernel “myInitVolumeKernel<<<grid, block>>>(pVolumeDevice);”.

If SDK Methods are used “CUT_CHECK_ERROR” this program will work
in release mode, since nothing is checked. Both version are not working
in debug mode, but work fine in the emulation.

The program runs also fine, if the parameters, which are transferred into
constant memory, are hardcoded. See makro option in the kernel.

This Example program was tested on a GeForce 8800 GTX using Linux.
I also tested it on Windows XP. This time it caused sometimes a not clear
memory access error or froze the system.

A additional test on a second linux system with another G80 Card also failed.

Anybody an idea?


// //
// MyVolumeInitializeTest //
// //

#include <stdio.h>

#include <stdlib.h>

#include <string.h>

#include <cutil.h>


// Common host and device functions



// Data configuration


// Volume dimensions
device constant unsigned int gVolSize[3];
// Volume strides dimensions
device constant unsigned int gVolStride[2];

static global void myInitVolumeKernel(float* pVolume);


// Main program


int main(int argc, char *argv){
// Pointer to float volume

// Initialize Device

// Set volume size and volume strides
// this should be changeable during runtime
unsigned int volSize[3];    

volSize[0] = 512; 
volSize[1] = 512; 
volSize[2] = 512;

unsigned int volStride[2];    
volStride[0] = 512; 
// if volstride[1] is 512 instead of 512^2 it works?! 
volStride[1] = 512*512;

// Allocate device memory for volume part
cudaMalloc((void **) &pVolumeDevice, 
		volSize[0] * volSize[1] * 
		volSize[2] * sizeof(float) ));

// Copy volume stride to constant memory    
cudaMemcpyToSymbol(gVolStride, volStride, 2 * sizeof(unsigned int)));    

// Copy volume dimension to constant memory
CUDA_SAFE_CALL(cudaMemcpyToSymbol(gVolSize, volSize, 3 * sizeof(unsigned int)));    

// Set up kernel configuration
dim3 block(64, 4);
dim3 grid(512 / block.x, 512 / block.y);

// Call kernel execution    
myInitVolumeKernel<<<grid, block>>>(pVolumeDevice);

// use definition out of cutil.h instead of using cutil-makros
// both fails in debug mode, but not in emulation.
#if 1
cudaError_t err = cudaThreadSynchronize();
if( cudaSuccess != err) {
fprintf(stderr, “Cuda error: %s.\n”,
cudaGetErrorString( err) );

CUT_CHECK_ERROR(“myInitVolumeKernel() execution failed\n”);
CUDA_SAFE_CALL( cudaThreadSynchronize() );

// Exit program
CUT_EXIT(argc, argv);


static global void myInitVolumeKernel(float* pVolume)
// compute volume coordinates
unsigned int x = __umul24(blockIdx.x, blockDim.x) + threadIdx.x;
unsigned int y;
unsigned int z = __umul24(blockIdx.y, blockDim.y) + threadIdx.y;

#if 1
//volume index
unsigned int idx = __umul24(z, gVolStride[1]) + x;

for (y = 0; y < gVolSize[1]; y++, idx += gVolStride[0])    
    pVolume[idx] = 0.0f;    

// if variables are hardcoded it works also
unsigned int idx = __umul24(z, 512*512) + x;

for (y = 0; y < 512; y++, idx += 512)    
    pVolume[idx] = 0.0f;    


} // myInitVolumeKernel
initialVolumeTest.zip (23.8 KB)

You allocate memories more, than is reserved in the device:

cudaMalloc((void **) &pVolumeDevice,
volSize[0] * volSize[1] *
volSize[2] * sizeof(float) ));

Depends on the device memory. I have a GeForce 8800 GTX with 768MB.

Since the volume size is 512MB, there should be enough memory.

To be sure, I tested the sample code, by replacing 512 with 256 everywhere

and the error still happen!!!

Did you tried my example on your system?



Let’s count:
You allocate memories volSize [0] * volSize [1] * volSize [2] * sizeof (float), where
volSize [0] = 512; volSize [1] = 512512; volSize [2] = 512;
That are equivalent 2^9
2^92^92^94 = 2^38 bytes or 262144MB that is greater than 768MB.
But, if you use arays of the size volSize [0] = 512; volSize [1] = 512; volSize [2] = 512;, than it is anough memory to allocate: 2^9
2^92^94 = 2^29 bytes = 512MB and (by your words) mistakes do not arise.


I don’t get it, b/c volSize[1] = 512 and not 512*512;

Probably you mixed it up with volStride[1]!

And don’t forget, the program works fine in the emulator and with CUDA 0.8.

The allocation states:

// Allocate device memory for volume part


cudaMalloc((void **) &pVolumeDevice,

volSize[0] * volSize[1] *

volSize[2] * sizeof(float) ));

so 512^3 * sizeof(float) = 512MB .

So my problem is still NOT solved.

Perhaps the problem is not the size, but the fact that you are trying to allocate that much contiguous memory?
Instead of 256 (which still leads to a contiguous block of 64MB), try something even smaller, like 16.

I know, I would expect 64 to work, but this way for certain we can exclude this as the problem.

We are looking into this

Hey there,

NVIDIA confirmed the bug in the current driver version. Won’t be there in the next driver. Thanks to everybody.

What is the bug? Just like to know so that we can try not to step into it. Thanks.

I am getting a very similar problem. I have a kernel that was working fine on CUDA 0.8 and now is giving the same problem described here with CUDA 1.0. Could any body from NVIDIA tell us what to avoid to get around this bug until the next version is released!