I am using the <<<dimGrid, dimBlock>>> syntax to launch kernels.
I noticed that if my block size is too large (in my case over 384 threads) then the kernels are not executed.
If I understand correctly, resource usage such as the amount of shared memory can reduce the maximum number of threads per block below 512. But my problem is that I don’t get an exception or error code that I can handle to indicate that the kernel launch has failed.
A kernel launched with the <<<>>> syntax returns void, and even calling cudaThreadSynchronize() afterwards does not return an error code. How should I handle this situation to determine whether the kernels executed correctly?
Thanks. With cudaGetLastError() I do indeed get the error code cudaErrorLaunchOutOfResources. My next question: how can I determine, either at compile time or at run time, how many threads I can launch without running out of resources?
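For anyone else hitting this, a minimal version of that check might look roughly like the following (the kernel name and launch sizes are placeholders):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void myKernel() { /* ... */ }

int main() {
    dim3 dimGrid(64);
    dim3 dimBlock(512);  // may be too large for a resource-heavy kernel
    myKernel<<<dimGrid, dimBlock>>>();

    // Catches launch-configuration failures such as
    // cudaErrorLaunchOutOfResources.
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        fprintf(stderr, "launch failed: %s\n", cudaGetErrorString(err));

    // Catches errors raised while the kernel was executing.
    err = cudaThreadSynchronize();
    if (err != cudaSuccess)
        fprintf(stderr, "execution failed: %s\n", cudaGetErrorString(err));
    return 0;
}
```

The call to cudaGetLastError() immediately after the launch is what reports the out-of-resources condition; the error code returned by cudaThreadSynchronize() covers failures that happen during execution.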
cudaGetDeviceProperties() (see the CUDA reference manual for a full description) will allow you to query the device and find the maximum number of threads per block, as well as many other useful things.
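For example, a short query program along these lines (device 0 assumed) prints the device-wide limits:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // query device 0

    printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("registers per block:   %d\n", prop.regsPerBlock);
    printf("shared mem per block:  %zu bytes\n", prop.sharedMemPerBlock);
    printf("warp size:             %d\n", prop.warpSize);
    return 0;
}
```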
This looks like it returns device-specific information, rather than kernel-specific information. My more complex kernels must be launched with smaller block sizes, whereas my simpler kernels can allow larger block sizes. How can I determine, preferably at runtime, the maximum number of threads I can use for a particular kernel?
Oh, sorry, I reversed what you asked in my head. Hopefully someone has a good suggestion here. (All I can think of is manually reading the output of ptxas -v when you compile your kernel to see how many registers per thread are used and doing the math.)
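The ptxas route would look something like this (file name is a placeholder; see the nvcc documentation for the exact flag spelling in your toolkit version):

```shell
# Compile with verbose ptxas output to see per-kernel resource usage.
nvcc --ptxas-options=-v -c mykernel.cu
# ptxas prints a line along the lines of "Used N registers" for each kernel.
# The register-limited block size is then roughly
#   (registers per block on the device) / N,
# rounded down to a multiple of the warp size and capped at the device's
# maxThreadsPerBlock; shared memory imposes an analogous limit.
```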
Use the CUDA occupancy calculator to determine the optimum number of threads per block. You would need to use ptxas (as described above) to get the register and shared-memory usage per block to feed into it.
The CUDA occupancy calculator can be downloaded from the CUDA website; it’s an Excel spreadsheet.
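For what it’s worth, if your toolkit is recent enough to provide cudaFuncGetAttributes(), you can also query the per-kernel limit at run time instead of doing the ptxas math by hand (a sketch; check the reference manual for your CUDA version):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void complexKernel() { /* ... */ }

int main() {
    cudaFuncAttributes attr;
    // Reports limits for this specific kernel, accounting for its
    // register and static shared-memory usage.
    cudaFuncGetAttributes(&attr, complexKernel);

    printf("max threads per block for this kernel: %d\n",
           attr.maxThreadsPerBlock);
    printf("registers per thread: %d\n", attr.numRegs);
    printf("static shared memory: %zu bytes\n", attr.sharedSizeBytes);
    return 0;
}
```

attr.maxThreadsPerBlock here is kernel-specific, which is exactly the number you’d clamp your block size to before launching.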