Better occupancy generally leads to better performance, but that is often not true when the register count is forced down to a desired value.
The first benchmark runs without an effective register limit (128 is well above what the kernel uses).
-maxrregcount=128
Kernel registers:
1>ptxas info : Compiling entry function '_Z24kernel_index_tex_tex3DPfS_PhS0_S0_S0_S0_jjjj'
1>ptxas info : Used 48 registers, 44+16 bytes smem, 12 bytes cmem[1]
Active Threads per Multiprocessor 256
Active Warps per Multiprocessor 8
Active Thread Blocks per Multiprocessor 1
Occupancy of each Multiprocessor 25%
Benchmark: Elapsed time test: 1273.7278452 msec
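(A quick sanity check of the 25%: assuming a GT200-class card with 16384 registers and up to 1024 resident threads per SM, and the 256-thread blocks implied by the figures above, one block needs 256 × 48 = 12288 registers, so a second block does not fit and only 256 of 1024 possible threads are resident.)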
I then started to reduce the number of registers by lowering maxrregcount, and kept going until I reached the limit below which the compiler starts to spill registers to local memory; at 42 registers, local memory is still not used.
-maxrregcount=42
1>ptxas info : Compiling entry function '_Z24kernel_index_tex_tex3DPfS_PhS0_S0_S0_S0_jjjj'
1>ptxas info : Used 42 registers, 44+16 bytes smem, 12 bytes cmem[1]
Active Threads per Multiprocessor 384
Active Warps per Multiprocessor 12
Active Thread Blocks per Multiprocessor 2
Occupancy of each Multiprocessor 38%
Benchmark: Elapsed time test: 1337.0669315 msec
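As a side note: depending on the toolkit version, the register budget can also be hinted per kernel with __launch_bounds__ instead of the global -maxrregcount switch. This is only a syntax sketch with made-up numbers, not what I used for the measurements here:
__global__ void __launch_bounds__(192, 2) // at most 192 threads per block, at least 2 blocks per SM
sketch_kernel(float *dst)
{
unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
dst[i] = 0.0f; // placeholder body
}
The compiler then picks a register count that keeps the requested number of blocks resident, instead of applying one global cap to every kernel.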
With -maxrregcount=32 every thread now uses 64 bytes of uncached local memory (16 spilled 32-bit values), and this is what causes the big performance drop, even though occupancy rises to 50%.
The next observation I made is that the compiler pushed the register count down to 30 even though maxrregcount was set to 32.
-maxrregcount=32
Kernel registers:
1>ptxas info : Compiling entry function '_Z24kernel_index_tex_tex3DPfS_PhS0_S0_S0_S0_jjjj'
1>ptxas info : Used 30 registers, 64+0 bytes lmem, 44+16 bytes smem, 12 bytes cmem[1]
Active Threads per Multiprocessor 512
Active Warps per Multiprocessor 16
Active Thread Blocks per Multiprocessor 2
Occupancy of each Multiprocessor 50%
Benchmark: Elapsed time test: 2006.5070508 msec
It would be very useful to have compiler-generated information about register usage and its limits, e.g. the point at which the compiler starts to spill registers to local memory. I had to collect this information by testing different values of maxrregcount, which is a boring job and could be done by the compiler or a script …
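(For reference, the register/lmem report above is the verbose ptxas output; the compile line looks roughly like this, with the file name being a placeholder:
nvcc --ptxas-options=-v -maxrregcount=42 -c kernel.cu
A script would only need to loop over the -maxrregcount value and watch for the first lmem entry in that output.)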
Now we come to tex3D and its register usage.
It seems that the tex3D call works only on registers: all four operands (the three coordinates plus a fourth padding component) have to sit in registers, even when they are constants.
The use of tex3D therefore causes higher register usage! That's important for kernels that already use many registers.
// tex_image_3D is assumed to be declared roughly like this
// (element type inferred from the cvt.u8 conversion in the PTX below):
texture<unsigned char, 3, cudaReadModeElementType> tex_image_3D;

__global__ void
copy1D_tex3D(float *dst)
{
dst[9] = tex3D(tex_image_3D, 1, 2, 3);
}
Here is the PTX code:
.entry _Z12copy1D_tex3DPf (
.param .u32 __cudaparm__Z12copy1D_tex3DPf_dst)
{
.reg .u32 %r<9>;
.reg .f32 %f<7>;
.loc 14 57 0
$LBB1__Z12copy1D_tex3DPf:
mov.f32 %f1, 0f3f800000; // 1
mov.f32 %f2, 0f40000000; // 2
mov.f32 %f3, 0f40400000; // 3
mov.f32 %f4, 0f00000000; // 0
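// all four coordinate components (the 4th is only padding) must sit in registers for the tex.3d instruction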
tex.3d.v4.u32.f32 {%r1,%r2,%r3,%r4},[tex_image_3D,{%f1,%f2,%f3,%f4}];
.loc 14 59 0
mov.s32 %r5, %r1;
cvt.u8.u32 %r6, %r5;
cvt.rn.f32.u32 %f5, %r6;
ld.param.u32 %r7, [__cudaparm__Z12copy1D_tex3DPf_dst];
st.global.f32 [%r7+36], %f5;
.loc 14 60 0
exit;
$LDWend__Z12copy1D_tex3DPf:
} // _Z12copy1D_tex3DPf
Some thoughts on my optimization problem, which is stated in this thread: http://forums.nvidia.com/index.php?showtopic=152945
Until today, I achieved the best performance of my kernel by using tex1Dfetch for data that is both read and written, and tex3D for read-only data.
Benchmark: Elapsed time test: 1273.7278452 msec (first benchmark from above).
In another implementation (from today) I used neither tex1Dfetch nor tex3D and managed to reduce the register count to 33 by using a 1D indexing scheme in the 3D volume (see the sketch after the numbers below). By forcing maxrregcount to 32 I achieved the best result:
1>ptxas info : Compiling entry function '_Z24kernel_index_tex_tex3DPfS_PhS0_S0_S0_S0_jjjj'
1>ptxas info : Used 32 registers, 8+0 bytes lmem, 44+16 bytes smem, 12 bytes cmem[1]
Active Threads per Multiprocessor 512
Active Warps per Multiprocessor 16
Active Thread Blocks per Multiprocessor 2
Occupancy of each Multiprocessor 50%
Elapsed CPU time test: 1115.6475840 msec
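To illustrate the 1D indexing scheme mentioned above, here is a stripped-down sketch (not my actual kernel; names and the body are placeholders). The idea is to carry a single flat index per thread and to recover the three volume coordinates from it only where needed:
__global__ void
kernel_flat_index(float *dst, unsigned int dimX, unsigned int dimY, unsigned int dimZ)
{
unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x; // flat 1D index into the volume
if (idx >= dimX * dimY * dimZ)
return;
unsigned int z = idx / (dimX * dimY); // recover 3D coordinates on demand
unsigned int y = (idx / dimX) % dimY;
unsigned int x = idx % dimX;
dst[idx] = (float)(x + y + z); // placeholder work
}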
Using 33 registers and optimizing the block size results in the following values:
1>ptxas info : Compiling entry function '_Z24kernel_index_tex_tex3DPfS_PhS0_S0_S0_S0_jjjj'
1>ptxas info : Used 33 registers, 44+16 bytes smem, 8 bytes cmem[1]
Active Threads per Multiprocessor 448
Active Warps per Multiprocessor 14
Active Thread Blocks per Multiprocessor 1
Occupancy of each Multiprocessor 44%
Benchmark: Elapsed CPU time test: 1213.7278803 msec
To conclude these benchmarks:
You have to pay attention when setting maxrregcount, since it can result either in a significant performance drop or in a performance boost.
Higher occupancy should result in better performance as long as you are not forced to reduce the number of registers “drastically”.
It is quite hard to find the optimal settings (maxrregcount, blockSize and the implementation itself), and some more tool support would be quite nice.
I'm surprised that an occupancy increase from 448 threads (44%) to 512 threads (50%) actually led to a performance increase. I often hear that about 192 threads per SM (19% here) are sufficient to hide most/all latency?
Is there a possibility to find the minimum occupancy that results in maximum performance? A code analysis tool that runs through your code, measures the timings and returns the best settings would be nice. Something like a feedback-driven profiling tool.