Could you tell me why it can’t be larger than 8192?
I think it’s simply a hardware limitation: the register file on each multiprocessor has a fixed size. The limit does depend on the compute capability of your device, though, and it is higher for higher compute capabilities. I’m using 1.3.
But the compiler seems to prevent the kernel code from using more than 8K registers.
Do you mean that this is the maximum number of registers your kernel is using? If so, to let the compiler allocate more registers to each thread of your kernel, you need to lower the number of threads per block. A multiprocessor allocates registers to a whole thread block at once, and all the threads resident on that multiprocessor have to share its limited number of registers. This means that the more threads there are in a thread block, the fewer registers are available to each thread.
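To make the trade-off concrete, here is a minimal sketch in plain C++ (host-side arithmetic only, no CUDA calls). The function name `max_regs_per_thread` is made up for illustration, and the calculation is a simplification: it ignores the register allocation granularity and any per-thread cap the hardware imposes.

```cpp
#include <cassert>

// Hypothetical helper: upper bound on registers per thread when one block of
// `threads_per_block` threads must fit in a register file holding
// `regs_per_sm` registers. Simplified: ignores allocation granularity and
// the hardware's per-thread register cap.
int max_regs_per_thread(int regs_per_sm, int threads_per_block) {
    assert(regs_per_sm > 0 && threads_per_block > 0);
    return regs_per_sm / threads_per_block;  // integer division rounds down
}
```

With an 8192-register multiprocessor, a 512-thread block leaves at most 16 registers per thread, while a 256-thread block leaves at most 32.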
However, it’s not always useful to have the maximum number of registers per thread, because this limits the number of warps that can be resident on a multiprocessor at any one time. That can hurt performance: with fewer warps to switch to, memory latency is less likely to be hidden. As you can see, it’s quite a complex issue which you should probably read more about. Here’s a link to some documentation you might find useful… The metric is known as occupancy.
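As a rough illustration of how register use affects occupancy, here is a sketch in plain C++. The struct `SmLimits` and the function `estimate_occupancy` are hypothetical names, and the model is deliberately simplified: it considers only registers, the resident-warp limit, and the resident-block limit, and it ignores shared memory and allocation granularity. The limits have to be filled in from the documentation for your compute capability.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical per-multiprocessor limits; take the real values from the
// documentation for your compute capability.
struct SmLimits {
    int regs_per_sm;        // size of the register file
    int max_warps_per_sm;   // maximum resident warps
    int max_blocks_per_sm;  // maximum resident blocks
    int warp_size;          // threads per warp
};

// Simplified occupancy estimate: resident warps / maximum resident warps.
// Ignores shared memory and register allocation granularity.
double estimate_occupancy(const SmLimits& sm, int threads_per_block,
                          int regs_per_thread) {
    assert(threads_per_block > 0 && regs_per_thread > 0);
    int warps_per_block = (threads_per_block + sm.warp_size - 1) / sm.warp_size;
    int regs_per_block = regs_per_thread * threads_per_block;
    int blocks_by_regs = sm.regs_per_sm / regs_per_block;
    int blocks_by_warps = sm.max_warps_per_sm / warps_per_block;
    int blocks = std::min({blocks_by_regs, blocks_by_warps, sm.max_blocks_per_sm});
    return static_cast<double>(blocks * warps_per_block) / sm.max_warps_per_sm;
}
```

Assuming a multiprocessor with 16384 registers, 32 resident warps, and 8 resident blocks, a 256-thread block at 16 registers per thread reaches full occupancy under this model, while the same block at 32 registers per thread drops to one half.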
I tried to move some small arrays into registers, but it resulted in lower performance.
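I think this could be because there aren’t enough registers available to hold your arrays, so they’re spilling over to local memory, which is essentially thread-private global memory. The poor performance could then be down to the latency of fetching the data. Note also that the compiler can only keep an array in registers if every index into it is known at compile time (for example, in a fully unrolled loop); otherwise the array is placed in local memory. You can check by compiling with nvcc --ptxas-options=-v, which reports the register and local memory usage of each kernel.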
Hope that helps,