Performance impact of "-maxrregcount=X" and the register usage of tex3D

Better occupancy generally leads to better performance, but that's often not true when forcing the register count down to a desired value.

The first benchmark ran without an effective register limit (the kernel needs only 48 registers, so -maxrregcount=128 does not constrain it):

-maxrregcount=128 

Kernel registers:

1>ptxas info	: Compiling entry function '_Z24kernel_index_tex_tex3DPfS_PhS0_S0_S0_S0_jjjj'
1>ptxas info	: Used 48 registers, 44+16 bytes smem, 12 bytes cmem[1]

Active Threads per Multiprocessor: 256
Active Warps per Multiprocessor: 8
Active Thread Blocks per Multiprocessor: 1
Occupancy of each Multiprocessor: 25%

Benchmark: Elapsed time test: 1273.7278452  msec
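
For reference, these occupancy figures are consistent with a compute capability 1.3 GPU (16384 registers and a maximum of 32 warps per SM), which I will assume for the arithmetic: 48 registers × 256 threads = 12288 registers per block, so only one 256-thread block fits on an SM, i.e. 8 active warps out of 32 = 25% occupancy.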

I then reduced the number of registers step by step via maxrregcount, until I reached the limit below which the compiler starts to move registers into local memory. At the following setting, local memory is still not used:

-maxrregcount=42

1>ptxas info	: Compiling entry function '_Z24kernel_index_tex_tex3DPfS_PhS0_S0_S0_S0_jjjj'
1>ptxas info	: Used 42 registers, 44+16 bytes smem, 12 bytes cmem[1]

Active Threads per Multiprocessor: 384
Active Warps per Multiprocessor: 12
Active Thread Blocks per Multiprocessor: 2
Occupancy of each Multiprocessor: 38%

Benchmark: Elapsed time test: 1337.0669315  msec

With -maxrregcount=32 every thread now uses 64 bytes of uncached local memory, and this is what causes the big performance drop, even though occupancy rises to 50%.

The next observation I made is that the compiler pushed the register count down to 30, even though maxrregcount was set to 32.

-maxrregcount=32

Kernel registers:

1>ptxas info	: Compiling entry function '_Z24kernel_index_tex_tex3DPfS_PhS0_S0_S0_S0_jjjj'
1>ptxas info	: Used 30 registers, 64+0 bytes lmem, 44+16 bytes smem, 12 bytes cmem[1]

Active Threads per Multiprocessor: 512
Active Warps per Multiprocessor: 16
Active Thread Blocks per Multiprocessor: 2
Occupancy of each Multiprocessor: 50%

Benchmark: Elapsed time test: 2006.5070508  msec
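
To put the spill into numbers: 64 bytes of local memory per thread are 16 spilled 32-bit values, and with 512 resident threads that is 32 KB of per-SM spill data, all of it in uncached local memory on this hardware. That fits the run being roughly 1.6× slower despite the higher occupancy.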

It would be very useful to have some compiler-generated information about the register usage and the limits, e.g. at which value the compiler starts to spill registers to local memory. I had to collect this information by testing different values for maxrregcount, and that's a boring job that could be done by the compiler or a script …
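
In the meantime, such a sweep is easy to automate. Here is a minimal sketch of the idea (the file name kernel.cu and the register range are placeholder assumptions): it shells out to nvcc with --ptxas-options=-v for each limit and filters the resource summary line, in which "lmem" indicates that spilling to local memory has started.

#include <cstdio>	// popen/pclose are POSIX; use _popen/_pclose on Windows
#include <cstring>
#include <string>

int main()
{
	// Try successively smaller register limits and report the ptxas summary.
	for (int regs = 48; regs >= 24; regs -= 2) {
		// ptxas prints its statistics to stderr, hence the 2>&1 redirect.
		std::string cmd = "nvcc -c --ptxas-options=-v -maxrregcount="
			+ std::to_string(regs) + " kernel.cu 2>&1";
		FILE *p = popen(cmd.c_str(), "r");
		if (!p) return 1;
		char line[512];
		while (fgets(line, sizeof line, p)) {
			// The resource summary line looks like:
			// "ptxas info : Used 42 registers, 44+16 bytes smem, ..."
			if (strstr(line, "Used"))
				printf("maxrregcount=%d -> %s", regs, line);
		}
		pclose(p);
	}
	return 0;
}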


Now we come to tex3D and its register usage.

It seems that the tex3D call works only on registers: all of its operands (four of them, since a dummy fourth coordinate is added) have to be in registers, even when the coordinates are constants.

So the use of tex3D causes higher register usage. That's important for kernels that already use many registers.

__global__ void
copy1D_tex3D(float *dst)
{
	dst[9] = tex3D(tex_image_3D, 1, 2, 3);
}

Here is the PTX code. Note how four mov.f32 instructions first materialize the constant coordinates (1, 2, 3, plus an unused fourth component) in registers %f1..%f4 before the tex.3d instruction consumes them:

.entry _Z12copy1D_tex3DPf (
		.param .u32 __cudaparm__Z12copy1D_tex3DPf_dst)
	{
	.reg .u32 %r<9>;
	.reg .f32 %f<7>;
	.loc	14	57	0
$LBB1__Z12copy1D_tex3DPf:
	mov.f32 	%f1, 0f3f800000;	 	// 1
	mov.f32 	%f2, 0f40000000;	 	// 2
	mov.f32 	%f3, 0f40400000;	 	// 3
	mov.f32 	%f4, 0f00000000;	 	// 0
	tex.3d.v4.u32.f32 {%r1,%r2,%r3,%r4},[tex_image_3D,{%f1,%f2,%f3,%f4}];
	.loc	14	59	0
	mov.s32 	%r5, %r1;
	cvt.u8.u32 	%r6, %r5;
	cvt.rn.f32.u32 	%f5, %r6;
	ld.param.u32 	%r7, [__cudaparm__Z12copy1D_tex3DPf_dst];
	st.global.f32 	[%r7+36], %f5;
	.loc	14	60	0
	exit;
$LDWend__Z12copy1D_tex3DPf:
	} // _Z12copy1D_tex3DPf

Some thoughts on my optimization problem, which is described in this thread: http://forums.nvidia.com/index.php?showtopic=152945

Until today, I achieved the best performance of my kernel by using tex1Dfetch for data that is read and written, and tex3D for read-only data.

Benchmark: Elapsed time test: 1273.7278452 msec (first benchmark from above).

In another implementation (from today) I used neither tex1Dfetch nor tex3D and managed to reduce the register count to 33. I used a 1D indexing scheme within the 3D volume to reduce the number of registers (a sketch of the idea follows below the numbers). By forcing maxrregcount to 32 I achieved the best result:

1>ptxas info	: Compiling entry function '_Z24kernel_index_tex_tex3DPfS_PhS0_S0_S0_S0_jjjj'
1>ptxas info	: Used 32 registers, 8+0 bytes lmem, 44+16 bytes smem, 12 bytes cmem[1]

Active Threads per Multiprocessor: 512
Active Warps per Multiprocessor: 16
Active Thread Blocks per Multiprocessor: 2
Occupancy of each Multiprocessor: 50%

Elapsed CPU time test: 1115.6475840  msec
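
For illustration, here is a minimal sketch of the 1D indexing idea (names and the row-major volume layout are hypothetical, not my actual kernel): each thread carries a single flat index and only derives x, y, z where they are really needed, so fewer index variables have to live in registers at the same time.

__global__ void
kernel_1d_index(float *dst, const float *src, unsigned nx, unsigned ny, unsigned nz)
{
	// One flat index per thread instead of keeping x, y and z alive everywhere.
	unsigned i = blockIdx.x * blockDim.x + threadIdx.x;
	if (i >= nx * ny * nz)
		return;

	// Decompose the flat index only at the point where 3D coordinates are needed.
	unsigned z = i / (nx * ny);
	unsigned r = i % (nx * ny);
	unsigned y = r / nx;
	unsigned x = r % nx;

	dst[i] = src[(z * ny + y) * nx + x];	// for this layout identical to src[i]
}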

Using 33 registers and optimizing the block size results in the following values:

1>ptxas info	: Compiling entry function '_Z24kernel_index_tex_tex3DPfS_PhS0_S0_S0_S0_jjjj'
1>ptxas info	: Used 33 registers, 44+16 bytes smem, 8 bytes cmem[1]

Active Threads per Multiprocessor: 448
Active Warps per Multiprocessor: 14
Active Thread Blocks per Multiprocessor: 1
Occupancy of each Multiprocessor: 44%

Benchmark: Elapsed CPU time test: 1213.7278803  msec

To conclude these benchmarks:

You have to pay attention when setting maxrregcount, since it can result either in a significant performance drop or in a performance boost.

Higher occupancy should result in better performance as long as you are not forced to reduce the number of registers “drastically”.

It is quite hard to find optimal settings (maxrregcount, blockSize and the implementation itself) and some more tool support would be quite nice.

I'm surprised that an occupancy increase from 448 threads (44%) to 512 threads (50%) actually led to a performance increase. I often hear that something like 192 threads (19%) per SM is sufficient to hide most/all latency?

Is there a possibility to find out the minimum occupancy that results in maximum performance? Some code analysis tool would be nice that runs through your code, measures times, and returns the best settings. Something like a feedback-profiling tool.
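
Lacking such a tool, a brute-force timing loop over block sizes at least automates part of the search. A minimal sketch (kernel body and problem size are placeholders), timed with CUDA events:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void my_kernel(float *data, int n)
{
	// Placeholder body; the real kernel goes here.
	int i = blockIdx.x * blockDim.x + threadIdx.x;
	if (i < n)
		data[i] += 1.0f;
}

int main()
{
	const int n = 1 << 20;	// placeholder problem size
	float *d_data;
	cudaMalloc(&d_data, n * sizeof(float));

	cudaEvent_t start, stop;
	cudaEventCreate(&start);
	cudaEventCreate(&stop);

	for (int block = 64; block <= 512; block += 64) {
		int grid = (n + block - 1) / block;
		my_kernel<<<grid, block>>>(d_data, n);	// warm-up launch
		cudaEventRecord(start);
		my_kernel<<<grid, block>>>(d_data, n);
		cudaEventRecord(stop);
		cudaEventSynchronize(stop);
		float ms = 0.0f;
		cudaEventElapsedTime(&ms, start, stop);
		printf("blockSize = %3d -> %.3f ms\n", block, ms);
	}

	cudaEventDestroy(start);
	cudaEventDestroy(stop);
	cudaFree(d_data);
	return 0;
}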

Very interesting findings indeed!

How do you get the statistics for your benchmark runs?
Active Threads per Multiprocessor
Active Warps per Multiprocessor
Active Thread Blocks per Multiprocessor
Occupancy of each Multiprocessor

kynan

I haven't looked at it recently, but I noticed before that register usage was a nightmare when making multiple calls to tex2D(). From decuda output I concluded that the problem is even worse than what you have found: not only do the arguments all have to be in registers, they have to be in quite specific registers, because there aren't enough different opcodes for all the possible register combinations.