Hi,
Admittedly, the question in the subject line may be a naive one; I am probably missing something in the CUDA API.
I launch my kernels as follows:
const dim3 grid(numBlocks, 1);
const dim3 threads(numThreads);
const unsigned sharedMemRequested = 123;
myKernel<<<grid, threads, sharedMemRequested>>>(param1, param2, param3);
if (cudaSuccess != cudaThreadSynchronize()) {
    // report error somehow.
}
I am using a GTX 280 card.
To my surprise, the kernel will “kind of launch” even if the number of threads per block (the variable numThreads) is very large, e.g. 10*1024. Moreover, the kernel will also “kind of launch” if the number of threads is reasonable (say, 512) but the number of registers per thread, as reported by nvcc when it is given --ptxas-options=-v, multiplied by the number of threads per block exceeds 16K. As far as I understand, 16K registers per block is the hardware limit for my card.
In the above paragraph, “kind of launch” means that cudaThreadSynchronize() does NOT report an error, but the kernel produces a wrong result, different from the one obtained with a large but smaller number of threads.
Hence my question: is there an easy way to detect at run time that the registers available on the card are not sufficient to launch the requested number of threads?
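To make the condition concrete, here is a minimal host-side sketch of the arithmetic described above. The names LaunchLimits and blockFits are hypothetical and are not part of the CUDA API. In a real program the limits would presumably come from cudaGetDeviceProperties() (the regsPerBlock and maxThreadsPerBlock fields of cudaDeviceProp) and the per-thread register count from cudaFuncGetAttributes() (the numRegs field), assuming a toolkit version that provides them. The plain product is only an approximation, since the hardware may round register allocation up to some granularity.

```cpp
// Hypothetical helper types for illustration; not part of the CUDA API.
struct LaunchLimits {
    int regsPerBlock;        // assumed source: cudaDeviceProp::regsPerBlock
    int maxThreadsPerBlock;  // assumed source: cudaDeviceProp::maxThreadsPerBlock
};

// Returns true if a block of threadsPerBlock threads, each using
// regsPerThread registers, stays within both limits.
// Approximation: ignores any register allocation granularity.
bool blockFits(const LaunchLimits& limits, int regsPerThread, int threadsPerBlock) {
    if (regsPerThread < 0 || threadsPerBlock <= 0) {
        return false;  // nonsensical request
    }
    if (threadsPerBlock > limits.maxThreadsPerBlock) {
        return false;  // too many threads per block, regardless of registers
    }
    // Widen before multiplying so large inputs cannot overflow int.
    const long long regsNeeded =
        static_cast<long long>(regsPerThread) * threadsPerBlock;
    return regsNeeded <= limits.regsPerBlock;
}
```

With the limits assumed for this card (16K registers per block, and 512 threads per block as my assumption for this hardware generation), blockFits({16 * 1024, 512}, 40, 512) would return false because 40 * 512 = 20480 exceeds 16384, while 32 registers per thread at 512 threads would just fit. A request for 10*1024 threads per block would be rejected by the thread-count test alone.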
Thank you in advance for your explanation!