Speeding up a kernel: unrolling loops and arrays in registers

I am developing a rather simple kernel. My first version (single_kernel) was about 6x slower than an OpenMP/SSE combination.

Then I tried to speed it up (single_kernel2) by implementing a suggestion by Volkov: reducing the number of threads and increasing the output per thread. However, the compiler fails to allocate the short arrays in registers and to unroll the loops. Can anybody please suggest what I could be doing better?
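To illustrate, the restructuring I attempted looks roughly like this (a simplified sketch with placeholder names, not the actual code from single.cu):

```
// Each thread computes OUT_PER_THREAD outputs, accumulating into a
// small local array that I hoped would stay in registers.
template <int OUT_PER_THREAD>
__global__ void coarsened_kernel(float *out, const float *in, int n)
{
    int base = (blockIdx.x * blockDim.x + threadIdx.x) * OUT_PER_THREAD;
    float acc[OUT_PER_THREAD];                 // hoped-for register array

    #pragma unroll
    for (int j = 0; j < OUT_PER_THREAD; ++j)   // constant trip count
        acc[j] = 0.0f;

    #pragma unroll
    for (int j = 0; j < OUT_PER_THREAD; ++j)
        if (base + j < n)
            acc[j] += in[base + j] * 2.0f;     // stand-in for the real work

    #pragma unroll
    for (int j = 0; j < OUT_PER_THREAD; ++j)
        if (base + j < n)
            out[base + j] = acc[j];
}
```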

Hardware: Tesla C1060/Xeon E5530
NVCC: version 4.0

Compilation arguments:
/opt/cuda4.0/cuda/bin/nvcc --ptxas-options=-v -G -g -D_DEBUG -arch=sm_13 -O3 -c -I"/usr/local/NVIDIA_GPU_Computing_SDK/C/common/inc" -I"/opt/cuda4.0/cuda/include" -I"./" -I/usr/include/boost141 -o single.cu_o single.cu

The source for the kernel as well as an excerpt from the PTX file is attached.

Output from compiler:
ptxas info : Compiling entry function '_Z14single_kernel2ILi2ELi2EEvv' for 'sm_13'
ptxas info : Used 32 registers, 148+0 bytes lmem, 256+16 bytes smem, 64 bytes cmem[0], 60 bytes cmem[1], 8 bytes cmem[14]
ptxas info : Compiling entry function '_Z13single_kernelP11KernelInput' for 'sm_13'
ptxas info : Used 16 registers, 112+0 bytes lmem, 16+16 bytes smem, 64 bytes cmem[0], 52 bytes cmem[1], 8 bytes cmem[14]
loop-lines-62-64.txt (1.52 KB)
single.cu (10.7 KB)

I have not had time to look at the code. In general, local arrays with non-constant indexing cannot be placed into registers, as register files are not indexable. If such an array is accessed in a loop, it is sometimes possible to make all indices compile-time constants by completely unrolling the loop. There are many reasons the compiler may not unroll a loop on its own, ranging from unstructured control flow (including the use of continue or break) to the code size after unrolling exceeding internal limits. Programmers can use #pragma unroll directly before the loop to request full unrolling, so I would suggest you give that a try.
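As a toy example (not your actual code): with the pragma and a compile-time trip count, the loop can be fully unrolled, every index into the local array becomes a compile-time constant, and the array can be mapped to registers:

```
__global__ void unroll_demo(float *out, const float *in)
{
    float acc[4];                       // small, fixed-size local array

    #pragma unroll                      // request full unrolling
    for (int i = 0; i < 4; ++i)         // trip count known at compile time
        acc[i] = in[threadIdx.x * 4 + i];

    float sum = 0.0f;
    #pragma unroll
    for (int i = 0; i < 4; ++i)
        sum += acc[i];                  // acc[0..3]: all constant indices

    out[threadIdx.x] = sum;
}
```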

Several loops over array elements are still missing the #pragma unroll in your code…

That's an important point: for a local array to be mapped to registers, any and all accesses must use compile-time constant indices. In addition, there is a limit on the size of such arrays, to prevent excessive register use that would lead to spilling. I do not know what that limit is, and it may be target architecture specific to account for the different number of registers available per thread: 124 for sm_1x, 63 for sm_2x and sm_30, and 255 for sm_35. Note that small local arrays (say, 10 elements or so) may benefit greatly from the L1 cache, so the savings from placing such an array into registers may be less significant than anticipated, especially since increased register use can reduce occupancy.
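For contrast, here is a hypothetical example where a single runtime-indexed access is enough to force the whole array into local memory, precisely because the register file is not indexable:

```
__global__ void lmem_demo(float *out, const int *sel)
{
    float tbl[8];

    #pragma unroll
    for (int i = 0; i < 8; ++i)         // these indices are constant...
        tbl[i] = (float)(i * i);

    // ...but this index is only known at run time, so ptxas must place
    // tbl in local memory (lmem) rather than in registers.
    out[threadIdx.x] = tbl[sel[threadIdx.x] & 7];
}
```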