unspecified launch failure simple volume initialization fails

tronjr · August 8, 2007, 8:40am

Hi Everybody,

I have problems with my project code since I updated from CUDA 0.8 to 1.0.
After a while I could specify the problem, which causes a CUDA Error
“unspecified launch failure”.

I’ve attached a simple CUDA SDK Project which just allocates memory for
a specified volume size and initializes the values with 0.0f. I am aware that,
there is a memset function, but I just wanted to use a very simple example.

The error occurs in “cudaThreadSynchronize();” after calling the kernel “myInitVolumeKernel<<<grid, block>>>(pVolumeDevice);”.

If SDK Methods are used “CUT_CHECK_ERROR” this program will work
in release mode, since nothing is checked. Both version are not working
in debug mode, but work fine in the emulation.

The program runs also fine, if the parameters, which are transferred into
constant memory, are hardcoded. See makro option in the kernel.

This Example program was tested on a GeForce 8800 GTX using Linux.
I also tested it on Windows XP. This time it caused sometimes a not clear
memory access error or froze the system.

A additional test on a second linux system with another G80 Card also failed.

Anybody an idea?

Thanks

////////////////////////////////////////////////////////////////////////////////
// //
// MyVolumeInitializeTest //
// //
////////////////////////////////////////////////////////////////////////////////

#include <stdio.h>

#include <stdlib.h>

#include <string.h>

#include <cutil.h>

////////////////////////////////////////////////////////////////////////////////

// Common host and device functions

////////////////////////////////////////////////////////////////////////////////

// Data configuration

////////////////////////////////////////////////////////////////////////////////

// Volume dimensions
device constant unsigned int gVolSize[3];
// Volume strides dimensions
device constant unsigned int gVolStride[2];

static global void myInitVolumeKernel(float* pVolume);

////////////////////////////////////////////////////////////////////////////////

// Main program

////////////////////////////////////////////////////////////////////////////////

int main(int argc, char *argv){
// Pointer to float volume
float pVolumeDevice=0;

// Initialize Device
CUT_DEVICE_INIT();


// Set volume size and volume strides
// this should be changeable during runtime
unsigned int volSize[3];    

volSize[0] = 512; 
volSize[1] = 512; 
volSize[2] = 512;

unsigned int volStride[2];    
volStride[0] = 512; 
// if volstride[1] is 512 instead of 512^2 it works?! 
volStride[1] = 512*512;


// Allocate device memory for volume part
CUDA_SAFE_CALL(
cudaMalloc((void **) &pVolumeDevice, 
		volSize[0] * volSize[1] * 
		volSize[2] * sizeof(float) ));

// Copy volume stride to constant memory    
CUDA_SAFE_CALL(
cudaMemcpyToSymbol(gVolStride, volStride, 2 * sizeof(unsigned int)));    

// Copy volume dimension to constant memory
CUDA_SAFE_CALL(cudaMemcpyToSymbol(gVolSize, volSize, 3 * sizeof(unsigned int)));    

// Set up kernel configuration
dim3 block(64, 4);
dim3 grid(512 / block.x, 512 / block.y);

// Call kernel execution    
myInitVolumeKernel<<<grid, block>>>(pVolumeDevice);

// use definition out of cutil.h instead of using cutil-makros
// both fails in debug mode, but not in emulation.
#if 1
cudaError_t err = cudaThreadSynchronize();
if( cudaSuccess != err) {
fprintf(stderr, “Cuda error: %s.\n”,
cudaGetErrorString( err) );
exit(EXIT_FAILURE);
}

#else
CUT_CHECK_ERROR(“myInitVolumeKernel() execution failed\n”);
#endif
CUDA_SAFE_CALL( cudaThreadSynchronize() );

// Exit program
CUT_EXIT(argc, argv);

}

static global void myInitVolumeKernel(float* pVolume)
{
// compute volume coordinates
unsigned int x = __umul24(blockIdx.x, blockDim.x) + threadIdx.x;
unsigned int y;
unsigned int z = __umul24(blockIdx.y, blockDim.y) + threadIdx.y;

#if 1
//volume index
unsigned int idx = __umul24(z, gVolStride[1]) + x;

for (y = 0; y < gVolSize[1]; y++, idx += gVolStride[0])    
{
    pVolume[idx] = 0.0f;    
}

#else
// if variables are hardcoded it works also
unsigned int idx = __umul24(z, 512*512) + x;

for (y = 0; y < 512; y++, idx += 512)    
{
    pVolume[idx] = 0.0f;    
}

#endif

} // myInitVolumeKernel
initialVolumeTest.zip (23.8 KB)

MGN · August 10, 2007, 8:29am

You allocate memories more, than is reserved in the device:

cudaMalloc((void **) &pVolumeDevice,
volSize[0] * volSize[1] *
volSize[2] * sizeof(float) ));

tronjr · August 10, 2007, 11:22am

Depends on the device memory. I have a GeForce 8800 GTX with 768MB.

Since the volume size is 512MB, there should be enough memory.

To be sure, I tested the sample code, by replacing 512 with 256 everywhere

and the error still happen!!!

Did you tried my example on your system?

Thanks

Best

MGN · August 12, 2007, 3:52am

Let’s count:
You allocate memories volSize [0] * volSize [1] * volSize [2] * sizeof (float), where
volSize [0] = 512; volSize [1] = 512512; volSize [2] = 512;
That are equivalent 2^92^92^92^94 = 2^38 bytes or 262144MB that is greater than 768MB.
But, if you use arays of the size volSize [0] = 512; volSize [1] = 512; volSize [2] = 512;, than it is anough memory to allocate: 2^92^92^94 = 2^29 bytes = 512MB and (by your words) mistakes do not arise.

tronjr · August 13, 2007, 3:20pm

Hmm,

I don’t get it, b/c volSize[1] = 512 and not 512*512;

Probably you mixed it up with volStride[1]!

And don’t forget, the program works fine in the emulator and with CUDA 0.8.

The allocation states:

// Allocate device memory for volume part

CUDA_SAFE_CALL(

cudaMalloc((void **) &pVolumeDevice,

volSize[0] * volSize[1] *

volSize[2] * sizeof(float) ));

so 512^3 * sizeof(float) = 512MB .

So my problem is still NOT solved.

Stewie · August 13, 2007, 4:08pm

Perhaps the problem is not the size, but the fact that you are trying to allocate that much contiguous memory?
Instead of 256 (which still leads to a contiguous block of 64MB), try something even smaller, like 16.

I know, I would expect 64 to work, but this way for certain we can exclude this as the problem.

mfatica · August 14, 2007, 6:28pm

We are looking into this

tronjr · August 20, 2007, 7:40am

Hey there,

NVIDIA confirmed the bug in the current driver version. Won’t be there in the next driver. Thanks to everybody.

sphyraena · August 20, 2007, 12:52pm

What is the bug? Just like to know so that we can try not to step into it. Thanks.

mehussein · August 26, 2007, 8:00pm

I am getting a very similar problem. I have a kernel that was working fine on CUDA 0.8 and now is giving the same problem described here with CUDA 1.0. Could any body from NVIDIA tell us what to avoid to get around this bug until the next version is released!

MH

Topic		Replies	Views
unspecified launch failure CUDA Programming and Performance	2	9048	January 21, 2009
using cudaMalloc and cudaFree within a loop unspecified launch failure! CUDA Programming and Performance	21	37760	April 23, 2009
cudaMalloc initialization error CUDA Programming and Performance	7	8664	December 17, 2008
CUDA 2.1 Beta Problem/Bugs (Linux) CUDA Programming and Performance	5	1675	January 6, 2009
Memory allocation problem CUDA Programming and Performance	2	4638	April 16, 2009
Memory allocation problem CUDA Programming and Performance	0	905	April 19, 2009
cudaMalloc error in big loop CUDA Programming and Performance	12	15648	May 21, 2008
Unable to Allocate on Memory and Stack Overflow CUDA Programming and Performance	11	2588	September 6, 2011
Unspecified launch failure when launch my kernel CUDA Programming and Performance	2	3441	June 15, 2010
unspecified launch failure: This error is in cudaMemcpy CUDA Programming and Performance	9	23274	February 16, 2010

unspecified launch failure simple volume initialization fails

Related topics