I have ported quite a large code to CUDA Fortran, and it uses large local arrays.
Now, I was playing around with the number of threads per block and found an unexpected bug. When I increase the number of threads above 128 my results become incorrect. I can’t seem to find a problem with the actual code so I was wondering if this could be a result of local memory limitations per multiprocessor.
Any information or suggestions would be much appreciated.
Okay, I found the solution to my problem. It might be of interest to some of you.
The problem was that too many resources were being requested for launch. This meant that the kernels in the code weren’t being executed properly.
In order to prevent this problem the number of registers needs to be controlled using the -Mcuda=maxregcount:n flag on compilation. The general rule of thumb is that:
no. of registers per thread * blocksize should not be greater than 8192
So, for a blocksize of 256 the regcount must be no more than 32…
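For anyone who wants to check the arithmetic, here is a rough sketch of that rule of thumb in Python. The function names are made up for illustration, and the 8192 default is simply the figure quoted above; the real register budget depends on your card's compute capability, so treat it as an assumption to adjust.

```python
# Rough sketch of the rule of thumb: regs_per_thread * block_size <= budget.
# The 8192 default is the figure quoted in this thread, not a universal
# constant; pass the value that matches your device.

def max_regs_per_thread(block_size, regs_available=8192):
    """Largest per-thread register count that still satisfies the rule."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return regs_available // block_size


def launch_fits(regs_per_thread, block_size, regs_available=8192):
    """True if one block's register demand stays within the budget."""
    return regs_per_thread * block_size <= regs_available
```

With the default budget, max_regs_per_thread(256) gives 32, which is the value you would then pass as n in -Mcuda=maxregcount:n.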
I’m a beginner with CUDA. Could you tell me why it can’t be larger than 8192?
I have a problem with registers. My GPU is a Tesla M2050 (cc 2.0), which should have 32K registers per SM.
But the compiler seems to prevent the kernel code from using more than 8K registers. I tried to convert some small arrays into registers, but that resulted in lower performance. Thanks in advance.
I think it’s as simple as a hardware limitation. However, I think the maximum number of registers per block depends on the compute capability that you are using. I’m using 1.3. The maximum number of registers is higher for higher compute capabilities.
But the compiler seems to prevent the kernel code from using more than 8K registers
Do you mean that this is the maximum number of registers your kernel is using? If so, to allow the compiler to allocate more registers to your kernel you need to lower the number of threads per block that you’re using. Registers are allocated to whole thread blocks, and all the threads resident on a multiprocessor have to share its limited number of registers. This means that the more threads there are in a thread block, the fewer registers are available to each thread.
However, it’s not always useful to have the maximum number of registers per thread, as this limits the number of warps that can be resident on a multiprocessor at any one time. That can hinder performance, because with fewer warps to switch between, memory latency tends not to be hidden. As you can see, it’s quite a complex issue which you should probably read more about. Here’s a link to some documentation you might find useful… the metric is known as occupancy.
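To make the trade-off a bit more concrete, here is a very simplified occupancy estimate in Python. It is only a sketch: the function name is made up, the defaults (32768 registers, 48 resident warps and 8 resident blocks per multiprocessor) are my assumptions for a cc 2.0 card and should be checked against the documentation for your device, and it ignores shared memory and register allocation granularity, which the real calculation takes into account.

```python
# Simplified occupancy estimate: the fraction of the multiprocessor's warp
# slots that are filled, considering only registers and the block/warp caps.
# All default limits are assumptions for a cc 2.0 device.

def estimate_occupancy(regs_per_thread, block_size,
                       regs_per_sm=32768, max_warps_per_sm=48,
                       max_blocks_per_sm=8, warp_size=32):
    if regs_per_thread <= 0 or block_size <= 0:
        raise ValueError("regs_per_thread and block_size must be positive")

    # Blocks are scheduled in whole warps, so round the block size up.
    warps_per_block = -(-block_size // warp_size)
    regs_per_block = regs_per_thread * warps_per_block * warp_size

    # How many blocks fit under each limit; the tightest one wins.
    blocks_by_regs = regs_per_sm // regs_per_block
    blocks_by_warps = max_warps_per_sm // warps_per_block
    resident_blocks = min(blocks_by_regs, blocks_by_warps, max_blocks_per_sm)

    return resident_blocks * warps_per_block / max_warps_per_sm
```

Under these assumptions, a 256-thread block at 20 registers per thread fills every warp slot, at 32 registers per thread it fills about two thirds of them, and at 63 registers per thread only about a third.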
I tried to convert some small arrays into registers, but that resulted in lower performance.
I think this could be because there aren’t enough registers available to hold your arrays, so they’re spilling over to local memory… which is essentially thread-private global memory, so the poor performance could be down to the latency of fetching the data.