I am just about able to allocate enough memory for an algorithm that processes 6 million particles on a single C870.
The algorithm transfers chunks of particle data from host to device. With a chunk size of 8k particles there is approximately 1.5 GB of free memory on the device and the algorithm works fine.
When I increase the chunk size to 64k particles there is still just under 1.4 GB of free memory on the device, but I get an unspecified launch failure from one particular kernel.
This kernel uses just 25 registers per thread (reported by nvcc with --ptxas-options=-v), so with a block size of 64 threads a block requires 1600 registers (the maximum is 8k per block), and the grid contains 1000 blocks (I'm not sure what the maximum is).
The kernel is launched as kernel<<<numblocks, numthreads>>>(), which with the numbers above is kernel<<<1000, 64>>>().
I would expect this to be well within resource limits, yet I get an unspecified launch failure.
Is the number of blocks too many?
Do I need to spec the grid in more detail?
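To rule out the limits themselves, I could run something like the following minimal sketch (variable names are just placeholders), which prints the device limits relevant to a <<<1000, 64>>> launch via cudaGetDeviceProperties:

[codebox]#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    // Query the limits that apply to a <<<1000, 64>>> launch on device 0.
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    int numthreads    = 64;    // threads per block
    int numblocks     = 1000;  // blocks in the (1-D) grid
    int regsPerThread = 25;    // from nvcc --ptxas-options=-v

    printf("max grid dim x       : %d\n", prop.maxGridSize[0]);
    printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("registers per block  : %d\n", prop.regsPerBlock);

    if (numblocks > prop.maxGridSize[0])
        printf("too many blocks in the grid\n");
    if (regsPerThread * numthreads > prop.regsPerBlock)
        printf("block exceeds the register file\n");

    return 0;
}[/codebox]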
My apologies for going off-topic here; I'll keep it short.
I seem to recall having seen a link to a post that describes how to dynamically allocate shared memory, but I've lost it. Could you provide any links/insights, Nico?
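For reference, the pattern I'm thinking of declares an unsized extern __shared__ array inside the kernel and passes the byte count as the third launch-configuration parameter. Something like this minimal sketch (kernel and variable names are made up):

[codebox]#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float *d_out, const float *d_in, float factor)
{
    // Unsized extern array: its byte size comes from the third <<< >>> parameter.
    extern __shared__ float s_data[];

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    s_data[threadIdx.x] = d_in[i];            // stage the value through shared memory
    __syncthreads();
    d_out[i] = s_data[threadIdx.x] * factor;
}

int main()
{
    const int threads = 64, blocks = 4, n = threads * blocks;
    float *d_in, *d_out;
    cudaMalloc((void **)&d_in,  n * sizeof(float));
    cudaMalloc((void **)&d_out, n * sizeof(float));
    cudaMemset(d_in, 0, n * sizeof(float));

    // Shared memory size is chosen per launch: here one float per thread.
    size_t shmem = threads * sizeof(float);
    scale<<<blocks, threads, shmem>>>(d_out, d_in, 2.0f);

    cudaThreadSynchronize();
    printf("launch: %s\n", cudaGetErrorString(cudaGetLastError()));

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}[/codebox]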
Can you get an unspecified launch failure from a cudaMemcpy? I've placed cudaGetLastError after every statement that involves the device, and the launch failure is always reported at one particular cudaMemcpy statement, though in different iterations of the loop. I have also placed dummy cudaGetLastError calls after the preceding kernel call to catch any delay in the kernel reporting an error, but none is reported, so I am assuming the launch failure comes from the cudaMemcpy (which isn't a kernel).
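For context, this is roughly the checking I have in mind (placeholder kernel and buffer names), with an explicit cudaThreadSynchronize added so that any delayed error from the kernel would surface before the copy rather than at it:

[codebox]#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel standing in for the real one.
__global__ void step(float *d_buf, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d_buf[i] += 1.0f;
}

static void check(const char *where)
{
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        printf("%s: %s\n", where, cudaGetErrorString(err));
}

int main()
{
    const int n = 64 * 1000;
    float *h_buf = (float *)malloc(n * sizeof(float));
    float *d_buf;
    cudaMalloc((void **)&d_buf, n * sizeof(float));

    step<<<1000, 64>>>(d_buf, n);
    check("launch");              // catches configuration errors only

    cudaThreadSynchronize();      // forces any error from the kernel itself
    check("after sync");          // to show up here, not at the next memcpy

    cudaMemcpy(h_buf, d_buf, n * sizeof(float), cudaMemcpyDeviceToHost);
    check("memcpy");              // anything reported here is genuinely the copy

    cudaFree(d_buf);
    free(h_buf);
    return 0;
}[/codebox]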
I would think that cudaMemcpy returns a different kind of error. Are you sure you're not trying to dereference a pointer to host memory in your kernel function? That's a typical cause of unspecified launch failures.
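For example, something along these lines (a deliberately contrived sketch) fails exactly that way when the kernel is handed a host pointer instead of a device pointer:

[codebox]#include <cstdio>
#include <cuda_runtime.h>

__global__ void touch(float *p) { p[threadIdx.x] = 1.0f; }

int main()
{
    float h_data[64];                       // lives in host memory
    float *d_data;
    cudaMalloc((void **)&d_data, 64 * sizeof(float));

    // Wrong: the kernel dereferences a host pointer, so the launch fails
    // with "unspecified launch failure".
    touch<<<1, 64>>>(h_data);
    cudaThreadSynchronize();
    printf("host pointer  : %s\n", cudaGetErrorString(cudaGetLastError()));

    // Right: pass the cudaMalloc'd device pointer instead.
    touch<<<1, 64>>>(d_data);
    cudaThreadSynchronize();
    printf("device pointer: %s\n", cudaGetErrorString(cudaGetLastError()));

    cudaFree(d_data);
    return 0;
}[/codebox]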