Hello,
I am using my Tesla P40 with driver 410.72 for a real-time matrix multiplication and I am running into a strange issue. When I launch a kernel, everything exits cleanly but I get the error “too many resources requested”. The strange part is that the number of threads (1024 threads in 4 blocks) and the memory used are identical to another kernel that runs successfully every time without issue. At first the only difference was the number of input arguments, which I see from this forum has the potential to cause this issue, but after changing these to be identical as well, I get the same error. So now I have two kernels with identical thread/block usage, input arguments, and memory usage, but one kernel throws this error and the other works every time. The only real differences at this point are the names of the two kernels and the amount of shared memory used, and the kernel that uses more shared memory is the one that works.
Please let me know if you can help. Thank you,
Alexander Battey
You state that you have looked at shared memory usage. Have you looked at register usage? What is the output from building with the compiler flag -Xptxas -v for each of the two kernels mentioned?
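For example, something along these lines (the file name here is just a placeholder for your own source file):

nvcc -arch=sm_61 -Xptxas -v -c my_kernels.cu

The ptxas lines in the build output then report the register, shared memory, and constant memory usage for each kernel.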
Hello,
Thank you for your help. I have not looked at the registers yet but it looks like you just told me how.
For the working kernel:
ptxas info : Compiling entry function '_Z13LQG_feedback3PfS_S_S_S_S_S_S_S_S_S_S_S_S_S_S_PVjjjjjjjjiS_S_S_S_i' for 'sm_61'
ptxas info : Function properties for _Z13LQG_feedback3PfS_S_S_S_S_S_S_S_S_S_S_S_S_S_S_PVjjjjjjjjiS_S_S_S_i
32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 56 registers, 38292 bytes smem, 524 bytes cmem[0], 52 bytes cmem[2]
For the kernel throwing the error:
ptxas info : Compiling entry function '_Z12LQG_feedbackPfS_S_S_S_S_S_S_S_S_S_S_S_S_S_S_PVjjjjjjjjiS_S_S_S_i' for 'sm_61'
ptxas info : Function properties for _Z12LQG_feedbackPfS_S_S_S_S_S_S_S_S_S_S_S_S_S_S_PVjjjjjjjjiS_S_S_S_i
32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 72 registers, 23636 bytes smem, 524 bytes cmem[0], 60 bytes cmem[2]
It appears that I am using more registers for this function even though I am requesting the same number of blocks and threads. What is the maximum number of registers allowed for my Tesla P40, and what is the best way to limit the number used? As I mentioned, I am using this for real-time applications, so speed is a crucial factor.
If you’re going to use 1024 threads per block, the maximum registers per thread is 64.
You can get the total number of registers available per SM from either deviceQuery or from the programming guide, table 14.
Divide that number by the number of threads per block you desire. The result is the maximum number of registers per thread you can use. You may need to round down to the nearest whole-number multiple of 2 or 4, depending on your GPU.
The number of registers needed per thread is determined at compile time.
The number of thread blocks and the number of threads per block are determined at run time.
The number of registers allocated in hardware may be larger than the number of registers required by the code due to allocation granularity.
The Tesla P40 has compute capability 6.1 (sm_61). From the table in appendix H of the CUDA Programming Guide we see that for compute capability 6.1 there are 64K registers available per thread block. If your thread blocks have 1024 threads as in the original post, the failing kernel needs 1024 * 72 = 73728 registers, which exceeds 65536. In that case the best solution would be to use fewer threads per block.
Instructing the compiler to squeeze the code into fewer registers is possible with the -maxrregcount compiler switch and the __launch_bounds__ attribute (see the CUDA documentation), but that is rarely the best way to go when the goal is to have the best performance.
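For completeness, a minimal sketch of what __launch_bounds__ looks like (the kernel below is a placeholder, not your actual code); the equivalent command-line approach would be the nvcc option -maxrregcount=64:

// Placeholder kernel, not the actual LQG_feedback code: the
// __launch_bounds__(1024) qualifier tells the compiler that this kernel
// will be launched with at most 1024 threads per block, so on sm_61 it
// must fit within 65536 / 1024 = 64 registers per thread.
__global__ void __launch_bounds__(1024)
scale_kernel(const float * __restrict__ in, float * __restrict__ out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = 2.0f * in[i];   // trivial placeholder computation
}

Keep in mind that when the compiler is forced below the register count it would naturally choose, the excess typically spills to local memory, which is usually why capping registers is not the fastest option.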
Hello,
Thank you, this is very helpful! Are there any tricks I can play to reduce the number of registers assigned to each kernel? It is strange to me that the larger kernels (those with more lines of code and variables) are taking fewer registers, and I would like to avoid limiting my number of threads as much as possible. If there are any convenient resources available for learning about tricks like this, I would appreciate that as well.
Maybe. The problem with “tricks” is that they tend to be brittle: the next iteration of the toolchain might wipe out any gains achieved, or even slow the code down. In general, the CUDA toolchain is mature and makes good decisions as to how to utilize registers for best performance.
Generally speaking, a good strategy for the initial design of a CUDA kernel is to start with thread blocks of between 128 and 256 threads (in steps of 32), and a one-to-one correspondence of output elements to threads. So in your case that might mean use of 16x16 tiles, where each thread produces one element of the output tile, with 256 threads per thread block. Then launch a grid with as many blocks as are needed to cover the full matrix. Apply const and __restrict__ to pointer arguments as appropriate, as this maximizes the information available to the compiler.
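For illustration, a rough sketch of that layout (a generic tiled single-precision matrix multiply, not your LQG_feedback kernel; the row-major storage and the dimension parameters M, N, K are assumptions):

#define TILE 16

// C = A * B, with A of size MxK, B of size KxN, C of size MxN, all row-major.
// Each thread computes one element of C; each block computes one 16x16 tile.
__global__ void matmul_tiled(const float * __restrict__ A,
                             const float * __restrict__ B,
                             float * __restrict__ C,
                             int M, int N, int K)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < (K + TILE - 1) / TILE; ++t) {
        int aCol = t * TILE + threadIdx.x;
        int bRow = t * TILE + threadIdx.y;
        // stage one tile of A and one tile of B in shared memory
        As[threadIdx.y][threadIdx.x] = (row < M && aCol < K) ? A[row * K + aCol] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (bRow < K && col < N) ? B[bRow * N + col] : 0.0f;
        __syncthreads();
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    if (row < M && col < N)
        C[row * N + col] = acc;
}

A launch covering the whole output matrix would then use dim3 block(TILE, TILE); dim3 grid((N + TILE - 1) / TILE, (M + TILE - 1) / TILE);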
Once the initial code is working, use the CUDA profiler to point out the bottlenecks in the code, and try to address them. The CUDA Best Practices Guide is recommended reading at that point. One common design pattern is to avail yourself of the APIs provided by the numerous libraries that ship with CUDA: matrix multiplication is a common idiom in numerical programming and is covered there in multiple variants. Re-inventing the wheel is a common anti-pattern.
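For the library route, cuBLAS (which ships with CUDA) covers this case directly; a minimal sketch of a single SGEMM call might look like the following (assuming column-major matrices already resident on the device as d_A, d_B, d_C; error checking omitted, link with -lcublas):

#include <cublas_v2.h>

// Computes C = A * B for column-major MxK A, KxN B, MxN C on the device.
void gemm_example(const float *d_A, const float *d_B, float *d_C,
                  int M, int N, int K)
{
    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.0f, beta = 0.0f;
    // cuBLAS uses column-major storage; leading dimensions are M, K, M here.
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                M, N, K,
                &alpha, d_A, M, d_B, K,
                &beta, d_C, M);
    cublasDestroy(handle);
}

In a real-time loop you would create the handle once up front and reuse it for every call rather than creating and destroying it each time.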
Deviations from the above approach are best left as an advanced topic, once programmers have developed a feel for CUDA in general and performance issues with CUDA code in particular.
I assume you are aware that CUDA is not suitable for hard real-time applications, as there are no guarantees for latencies or execution times for any component of the software stack. CUDA may be useful for soft real-time applications with loose timing constraints, where missing deadlines merely degrades system performance but does not cause catastrophic failure.