Causes of unspecified launch failure

I am just about able to allocate memory for an algorithm for processing 6 million particles on one C870.

The algorithm transfers chunks of particle data from host to device. When the chunk holds 8k particles there is approximately 1.5 GB of free memory on the device and the algorithm works fine.

When I increase the chunk size to accommodate 64k particles there is just under 1.4 GB of free memory on the device, but I get an unspecified launch failure for one particular kernel.

This kernel uses just 25 registers (reported by nvcc’s --ptxas-options=-v flag). With a block size of 64 threads, a block therefore requires 1600 registers (the limit is 8k per block), and the number of blocks is 1000 (the maximum grid dimension is 65535 blocks per axis).

The kernel is called as kernel<<<numblocks,numthreads>>>(), which with the numbers above is kernel<<<1000,64>>>().

I would expect this to be well within resource limits, yet I get an unspecified launch failure.
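For what it’s worth, the launch configuration can be sanity-checked against the device limits at runtime. A minimal sketch (assuming device 0; the values in the comments are the compute capability 1.0 limits that apply to a C870):

[codebox]
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("max grid x-dim:        %d\n", prop.maxGridSize[0]);      // 65535
    printf("max threads per block: %d\n", prop.maxThreadsPerBlock);  // 512
    printf("registers per block:   %d\n", prop.regsPerBlock);        // 8192
    return 0;
}
[/codebox]

With 64 threads x 25 registers = 1600 registers per block, and 1000 blocks in a one-dimensional grid, both limits are respected, so the failure is unlikely to be a configuration problem.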

Is the number of blocks too many?
Do I need to specify the grid in more detail?

What else could be causing the failure?

If there were a problem with the number of blocks/threads, the error should be something along the lines of “invalid launch parameters”.

“Unspecified launch failure” is usually the GPU equivalent of a segmentation fault.

So somewhere in your kernel you are causing a seg fault: a wrong pointer, threads accessing your arrays out of range, etc…
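For instance, a partially filled last block is a classic way to run out of range. A guard like this avoids it (a sketch; the names process, data, n and idx are hypothetical):

[codebox]
__global__ void process(float* data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)           // without this guard, surplus threads in the
        data[idx] *= 2.f;  // last block would access data out of range
}
[/codebox]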

Maybe you’re using too much shared memory?

N.

I doubt it. The shared memory per thread is tiny, due to the kernel algorithm, and the compiler usually tells you when too much shared memory is being used.

What about texture memory? Does the --ptxas-options=-v flag account for texture memory use?

Yes it does, but not if you’re using dynamically allocated shared memory in the kernel…

N.

My apologies for going off-topic here, I’ll keep it short.

I seem to recall having seen a link to a post that describes how to dynamically allocate shared memory, but I’ve lost it. Could you provide any links/insights, Nico?

Can you get an unspecified launch failure from a cudaMemcpy? I’ve placed cudaGetLastError after every statement that involves the device, and the launch failure always seems to occur at one particular cudaMemcpy statement, but in different iterations of the loop. I have also placed dummy cudaGetLastError statements after the previous kernel call to catch any delay in the kernel reporting an error, but none is reported, so I am assuming that the launch failure comes from the cudaMemcpy (which isn’t a kernel).

The code looks something like this:

[codebox]

__global__ void kernel() {

    extern __shared__ float shared[];

}

int main(int argc, char** argv) {

//This will give you shared memory storage for 32 floats

kernel<<<512,32,32*sizeof(float)>>>();

}

[/codebox]

N.

I would think that cudaMemcpy returns a different kind of error. Are you sure you’re not dereferencing a pointer to host memory inside your kernel function? That’s a typical cause of unspecified launch failures.
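Keep in mind too that kernel launches are asynchronous: a fault inside the kernel can be reported by the next call that synchronizes with the device, which is often a cudaMemcpy. Forcing a synchronization right after the launch pins down where the error really comes from. A sketch (h_data, d_data and size are hypothetical names):

[codebox]
kernel<<<numblocks,numthreads>>>();
cudaError_t err = cudaThreadSynchronize();  // wait for the kernel to finish
if (err != cudaSuccess)
    printf("kernel: %s\n", cudaGetErrorString(err));

err = cudaMemcpy(h_data, d_data, size, cudaMemcpyDeviceToHost);
if (err != cudaSuccess)
    printf("memcpy: %s\n", cudaGetErrorString(err));
[/codebox]

If the error shows up at the cudaThreadSynchronize rather than the cudaMemcpy, the kernel itself is faulting.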

N.