I can only just allocate enough memory for an algorithm that processes 6 million particles on one C870.
The algorithm transfers chunks of particle data from host to device. When a chunk holds 8k particles, there is approximately 1.5 GB of free memory on the device and the algorithm works fine.
When I increase the chunk size to hold 64k particles, there is just under 1.4 GB of free memory on the device, but I get an unspecified launch failure for one particular kernel.
This kernel uses just 25 registers per thread (as reported by nvcc with the --ptxas-options=-v flag). With a block size of 64 threads, a block therefore requires 25 x 64 = 1600 registers (the maximum is 8k). The number of blocks is 1000 (maximum = ?).
The kernel is called as kernel<<<numblocks,numthreads>>>(), which with the numbers above is kernel<<<1000,64>>>().
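To put numbers on the limits mentioned above, here is a minimal sketch that asks the runtime for the device limits and compares them with my launch configuration. It assumes the card is device 0. The constants simply restate the figures from this post, and the variable names regsPerThread and fits are my own, for illustration only.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    // Figures taken from the post; adjust to the real launch.
    const int numblocks = 1000;
    const int numthreads = 64;
    const int regsPerThread = 25;  // as reported by ptxas

    // Assumption: the C870 is device 0.
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties: %s\n", cudaGetErrorString(err));
        return 1;
    }

    const int regsNeeded = regsPerThread * numthreads;  // 25 * 64 = 1600

    printf("max grid size in x:       %d (requested %d)\n", prop.maxGridSize[0], numblocks);
    printf("max threads per block:    %d (requested %d)\n", prop.maxThreadsPerBlock, numthreads);
    printf("registers per block:      %d (needed %d)\n", prop.regsPerBlock, regsNeeded);

    // True only if every requested quantity is within the reported limit.
    bool fits = numblocks <= prop.maxGridSize[0]
             && numthreads <= prop.maxThreadsPerBlock
             && regsNeeded <= prop.regsPerBlock;

    printf("launch configuration %s the reported limits\n",
           fits ? "is within" : "exceeds");
    return fits ? 0 : 1;
}
```

This only covers grid size, block size and registers. It does not look at shared memory use or at anything the kernel does while it runs.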
I would expect this to be well within resource limits, yet I get an unspecified launch failure.
Is the number of blocks too many?
Do I need to specify the grid in more detail?
What else could be causing the failure?
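To narrow this down, I could wrap the launch in error checks along the following lines. This is only a sketch: the kernel body, the buffer d_data and the helper check are placeholders and not my real code. As far as I understand, a bad launch configuration is reported by cudaGetLastError straight after the launch, whereas a fault during execution (for example an out-of-bounds access) only shows up at the next synchronization. The sketch assumes a toolkit that provides cudaDeviceSynchronize; older toolkits use cudaThreadSynchronize instead.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel (hypothetical): doubles each element, with a bounds guard.
__global__ void kernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] *= 2.0f;
    }
}

// Hypothetical helper: prints the error and returns false on failure.
static bool check(cudaError_t err, const char *what) {
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        return false;
    }
    return true;
}

int main() {
    const int numblocks = 1000;
    const int numthreads = 64;
    const int n = numblocks * numthreads;  // one element per thread

    float *d_data = NULL;
    if (!check(cudaMalloc((void **)&d_data, n * sizeof(float)), "cudaMalloc")) return 1;
    if (!check(cudaMemset(d_data, 0, n * sizeof(float)), "cudaMemset")) return 1;

    kernel<<<numblocks, numthreads>>>(d_data, n);

    // Errors in the launch itself (invalid grid or block size, too many
    // resources requested) are available immediately.
    if (!check(cudaGetLastError(), "kernel launch")) return 1;

    // Errors raised while the kernel runs are reported once the host
    // waits for the device to finish.
    if (!check(cudaDeviceSynchronize(), "kernel execution")) return 1;

    if (!check(cudaFree(d_data), "cudaFree")) return 1;
    printf("kernel completed without error\n");
    return 0;
}
```

If the first check passes and the second one fails, the launch configuration itself was accepted and the problem lies in what the kernel does at the larger chunk size.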