More threads/block increase kernel execution time. WHY?

Hi insmvb00!!!

I have checked the ptx file and it shows st.local.u32 and ld.local.u32, but what does it really mean?

(I have not looked at a ptx file before, so I don't feel that familiar with it.)

Below is a part of the ptx file:

$Lt_1_25090:
 //<loop> Loop body line 24, nesting depth: 1, iterations: 6
	.loc	19	1411	0
	ld.const.u32 	%r27, [%r21+0];
	mul.lo.u32 	%r28, %r27, %r25;
	add.u32 	%r29, %r28, %r26;
	.loc	19	1412	0
	set.gt.u32.u32 	%r30, %r28, %r29;
	neg.s32 	%r31, %r30;
	mul.hi.u32 	%r32, %r27, %r25;
	add.u32 	%r26, %r31, %r32;
	.loc	19	1413	0
	st.local.u32 	[%r23+0], %r29;
	add.u32 	%r23, %r23, 4;
	add.u32 	%r21, %r21, 4;
	setp.ne.u32 	%p3, %r21, %r22;
	@%p3 bra 	$Lt_1_25090;
	.loc	19	1415	0
	st.local.u32 	[__cuda_result_16+24], %r26;
	.loc	19	1420	0
	mul.lo.u32 	%r33, %r20, 4;
	mov.u32 	%r34, __cuda_result_16;
	add.u32 	%r35, %r33, %r34;
	ld.local.u32 	%r26, [%r35+8];
	.loc	19	1421	0
	ld.local.u32 	%r36, [%r35+4];
	and.b32 	%r37, %r17, 31;
	mov.u32 	%r38, 0;
	setp.eq.u32 	%p4, %r37, %r38;
	@%p4 bra 	$Lt_1_25602;

Thanks!!

Hi again!

The compiler uses local memory when the kernel makes intensive use of registers.

This is also what NVIDIA describes in the CUDA Programming Guide 3.1.

As I told you before, register spilling (moving registers to local memory) is a very expensive operation, and local memory is very slow.

The solution is to reduce the number of registers used in your kernel code, or to decrease the number of active blocks per SM.

Try setting a higher ‘-maxrregcount’ to reduce SM occupancy and see whether those ‘st.local’/‘ld.local’ instructions disappear (there is a small sketch of the relevant flags below).

Remember that low occupancy does not necessarily result in longer execution times than high occupancy.
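
To make that concrete, here is a minimal sketch; the kernel and the array size are invented just to illustrate the flags discussed above, your real kernel obviously looks different:

// toy.cu - made-up kernel, only meant to show the flags discussed above
__global__ void spillDemo(const int *idx, float *out)
{
    // Per-thread scratch array: when it is indexed with a runtime value, the
    // compiler usually places it in local memory, which shows up as
    // st.local/ld.local in the PTX and as "lmem" in the ptxas output.
    float tmp[32];
    for (int i = 0; i < 32; ++i)
        tmp[i] = (float)(i * threadIdx.x);
    out[blockIdx.x * blockDim.x + threadIdx.x] = tmp[idx[threadIdx.x] & 31];
}

// Show registers and local memory per thread:
//   nvcc -c toy.cu --ptxas-options=-v
// Cap the register count (a low cap forces more spilling):
//   nvcc -c toy.cu --ptxas-options=-v -maxrregcount=8
// Keep the intermediate PTX to look for st.local/ld.local:
//   nvcc -ptx toy.cu -o toy.ptx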

Bye, and good luck!

Hi tera!

Here are the kernel execution times for 128 threads:

128x1: ~0.0932 ms

64x2 : ~0.0931 ms

32x4 : ~0.0932 ms

16x8 : ~0.0939 ms

8x16 : ~0.0937 ms

4x32 : ~0.0938 ms

2x64 : ~0.0946 ms

1x128: ~0.160 ms

I guess the difference between 64 threads and 128 threads (8x8 vs 16x8) has to have something to do with register spilling to local memory, as insmvb00 suggested…

For those of you who start off by reading this post:

The difference in time between the different configurations has already been explained in the previous post.
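
In case anyone wants to reproduce this kind of measurement, below is a minimal sketch using CUDA events; myKernel, its indexing and the problem size are placeholders, not the actual kernel from this thread:

// time_config.cu - sketch: time one launch per 128-thread block shape with CUDA events
#include <cstdio>
#include <cuda_runtime.h>

__global__ void myKernel(float *data)   // stand-in kernel, not the one from this thread
{
    int i = blockIdx.x * (blockDim.x * blockDim.y)
          + threadIdx.y * blockDim.x + threadIdx.x;
    data[i] = sinf(data[i]) + cosf(data[i]);
}

int main()
{
    const int N = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, N * sizeof(float));
    cudaMemset(d_data, 0, N * sizeof(float));

    dim3 shapes[] = { dim3(128,1), dim3(64,2), dim3(32,4), dim3(16,8),
                      dim3(8,16),  dim3(4,32), dim3(2,64), dim3(1,128) };

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    for (int s = 0; s < 8; ++s) {
        dim3 block = shapes[s];
        int blocks = N / (block.x * block.y);   // one element per thread

        cudaEventRecord(start);
        myKernel<<<blocks, block>>>(d_data);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("%3ux%-3u : %.4f ms\n", block.x, block.y, ms);
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return 0;
}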

Hi insmvb00!

Thank you so much for your explanation! =)

I have tried to change maxrregcount (it was set to 32, which should be enough) to 64, but I still see the same amount of st.local and ld.local in the ptx output (I'm using Visual Studio). If I set maxrregcount to 8 it should spill more registers to local memory, right? But it still shows the same amount of st.local and ld.local.

Does it still have accesses to local memory?

And how many registers does each thread use now? (The CUDA profiler shows this value.)

Try to improve the register usage, e.g. by removing the ‘freq’ variable and using its immediate value instead.

You can also use the ‘-use_fast_math’ flag, since your kernel is basically a ‘sinf’ and a ‘cosf’ call and these are improved by that flag. Moreover, that flag can also improve the register usage.
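
Since the actual kernel is not posted here, the following is only a made-up sketch of what those two suggestions could look like; ‘freq’, its value and the kernel names are invented:

// Before: 'freq' kept in its own variable.
__global__ void waveBefore(float *out, int n)
{
    const float freq = 2.0f * 3.14159265f * 50.0f;   // hypothetical value
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = sinf(freq * i) + cosf(freq * i);
}

// After: the constant is folded straight into the expression, and the fast
// intrinsics are used explicitly (with -use_fast_math, plain sinf/cosf are
// mapped to these intrinsics anyway).
__global__ void waveAfter(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = __sinf(314.159265f * i) + __cosf(314.159265f * i);
}

// Compile with:  nvcc -use_fast_math --ptxas-options=-v sketch.cu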

Hi!

When I changed maxrregcount to 8 I saw (I missed it the last time) that the lmem value in the ptxas output increased (from 56 to 92), so (as you said) it had to put some of the registers needed by each thread in local memory. I did, however, not see any difference in the ptx file that I generated, which is what made me confused. =)

Do st.local and ld.local always mean that some registers have been put in local memory in a way that could be avoided?

If I compile it with maxrregcount between 14 and 64 it shows a local memory usage of 56 bytes per thread (I don't know if that was your question).

I tried -use_fast_math and, as you said, it did improve the kernel execution time; the ptxas output showed that the kernel only needed 8 registers instead of 14 as before, and no lmem was reported.

So if I got this right: the 56 bytes of local memory are registers that have been spilled to local memory? If so, do you think this is (as you mentioned before) a precaution by the compiler, since I'm so close to the 16K register space?

Once again, thank you for clearing things up for me!
