Dummy operation improves performance

Hello,

I’m testing CUDA in an application that implements a sieve of Eratosthenes to generate prime numbers. The program works correctly.

However, a strange thing is that when I add a dummy “if statement” whose condition always evaluates to false for all threads, I get an unexpected speedup.

The relevant part of the code is attached below. The dummy if statement is marked in bold. Its condition always evaluates to false, because the kernel works just fine and never halts in the while(true) loop.

On my 8400M GS laptop I get about 5.5 seconds without the “if statement” and about 5 seconds with it, so the dummy branch saves about half a second.

Is there any reason why a useless piece of code would improve total performance? Does it have to do with the optimizer?

[codebox]

...

for(; primecount < primesD[PRIME_BUFFER_INTS-1].x; primecount+=32 ){

	unsigned int nextprime = primesD[primecount].x;

	[b]if( nextprime==0 )	// dummy branch, never taken

		while(true);[/b]

	// bit index of the first multiple of nextprime within this block's window

	unsigned int basemultidx = getStartingSievePos(nextprime, offset+blockIdx.x * blockDim.x * INTS_PER_THREAD);

	int widx = basemultidx / 32, wbitidx = basemultidx % 32;

	while( widx < INTS_PER_BLOCK ){	// clear the bit of every multiple of nextprime in the shared sieve buffer

		unsigned int mask = 0xffffffff;

		clearbit(mask, wbitidx);

		buf[widx] &= mask;

		basemultidx += nextprime;

		widx    = basemultidx / 32;

		wbitidx = basemultidx % 32;

	}

}

__syncthreads();

// store prime bitmap

for(int j=0; j<INTS_PER_BLOCK/32; j++)

	bufD[offset+gridbaseintidx+j*32+threadIdx.x].x = buf[j*32+threadIdx.x];

...

[/codebox]
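
In case it helps, here is the pattern in isolation. This is a stripped-down dummy kernel written just for illustration (the names and the body are placeholders, not my real sieve code); the only difference between the fast and the slow timing is the presence of the never-taken branch:

[codebox]// Hypothetical stripped-down kernel, for illustration only.
// The branch below is never taken for valid input, yet merely having it
// in the code is what changes the timing.
__global__ void ksieve_dummy(const unsigned int *primesD, unsigned int *out, int n)
{
	unsigned int acc = 0;

	for (int i = threadIdx.x; i < n; i += blockDim.x) {
		unsigned int nextprime = primesD[i];

		if (nextprime == 0)	// always false for a valid prime buffer
			while (true);	// the dummy branch

		acc += nextprime;
	}

	out[blockIdx.x * blockDim.x + threadIdx.x] = acc;
}[/codebox]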

I’ve seen something similar. The code [codebox]for (;;) {

	//real code here

	break;

}

[/codebox]

runs faster than

[codebox]{

//real code

}[/codebox]

I have no idea why.
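
To make it concrete, the two forms I mean look like this (placeholder kernels; the increment just stands in for the real code):

[codebox]// Variant 1: the body wrapped in a degenerate loop that executes exactly once.
__global__ void kernel_looped(int *data)
{
	for (;;) {
		data[threadIdx.x] += 1;	// real code here
		break;
	}
}

// Variant 2: the same body as a plain block.
__global__ void kernel_plain(int *data)
{
	{
		data[threadIdx.x] += 1;	// real code here
	}
}[/codebox]

With everything else identical, the looped variant is the faster one for me.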

Have a look at whether the number of registers changes between the two versions. In the first case the compiler cannot schedule instructions across the loop boundary, so it may use fewer registers. That can then lead to higher occupancy and a speed-up.
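
To get a feel for why one or two registers can matter: a compute 1.1 part like the 8400M GS has 8192 registers per multiprocessor, so a small change in registers per thread can change how many blocks fit on an SM. Here is a rough back-of-the-envelope estimate (256 threads per block is just an assumed example, and it ignores allocation granularity and the shared-memory limit, so treat it as an upper bound; the occupancy calculator spreadsheet gives the exact numbers):

[codebox]#include <stdio.h>

// Rough occupancy estimate for a compute 1.1 device:
// 8192 registers/SM, at most 768 threads/SM and 8 blocks/SM.
// 256 threads per block is an assumption -- plug in your own value.
int main(void)
{
	const int regs_per_sm = 8192, max_threads = 768, max_blocks = 8;
	const int threads_per_block = 256;	/* assumption */

	for (int regs = 14; regs <= 18; ++regs) {
		int blocks = regs_per_sm / (regs * threads_per_block);
		if (blocks > max_blocks) blocks = max_blocks;
		if (blocks * threads_per_block > max_threads)
			blocks = max_threads / threads_per_block;
		printf("%2d regs/thread -> %d block(s)/SM, occupancy %.0f%%\n",
			regs, blocks, 100.0 * blocks * threads_per_block / max_threads);
	}
	return 0;
}[/codebox]

With these numbers, going from 16 to 17 registers per thread drops the estimate from 2 blocks per SM (67% occupancy) to 1 block (33%), which is why the register count is the first thing to check.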

If I create the cubin files for both versions, it seems that both use the same number of registers (15). Only the bincode section seems to differ, as shown below:

File without dummy branch:

[codebox]code {

name = __globfunc__Z10ksieve_shmP5uint1j

lmem = 0

smem = 4120

reg = 15

bar = 1

bincode  {

	0xa0004211 0x04200780 0x40084c1d 0x00200780 

	0x30030e09 0xc4100780 0xa0000005 0x04000780 

	0x2000ca09 0x04208780 0x30010815 0xc4100780 

	...[/codebox]

File with dummy branch:

[codebox]code {

name = __globfunc__Z10ksieve_shmP5uint1j

lmem = 0

smem = 4120

reg = 15

bar = 1

bincode  {

	0xa0004211 0x04200780 0x40084c1d 0x00200780 

	0x30030e09 0xc4100780 0xa0000005 0x04000780 

	0x2000ca0d 0x04208780 0x30010815 0xc4100780 

	...[/codebox]

However, when I generate the PTX assembly for both versions, I see this:

File without dummy branch:

[codebox] .entry __globfunc__Z10ksieve_shmP5uint1j

{

.reg .u32 %r<168>;

.reg .pred %p<15>;

.param .u32 __cudaparm___globfunc__Z10ksieve_shmP5uint1j_bufD;

.param .u32 __cudaparm___globfunc__Z10ksieve_shmP5uint1j_offset;

.shared .align 4 .b8 __cuda_buf32776[4096];

// b = 0

// mask = 4

.loc	14	139	0

$LBB1___globfunc__Z10ksieve_shmP5uint1j:

mov.u32 	%r1, __cuda_buf32776;	// 

.loc	14	147	0

cvt.u32.u16 	%r2, %ntid.x;    	// 

...[/codebox]

File with dummy branch:

[codebox] .entry __globfunc__Z10ksieve_shmP5uint1j

{

.reg .u32 %r<167>;

.reg .pred %p<16>;

.param .u32 __cudaparm___globfunc__Z10ksieve_shmP5uint1j_bufD;

.param .u32 __cudaparm___globfunc__Z10ksieve_shmP5uint1j_offset;

.shared .align 4 .b8 __cuda_buf32776[4096];

// b = 0

// mask = 4

.loc	14	139	0

...[/codebox]

Does this indicate a change in register usage? (%r<168> and %p<15> versus %r<167> and %p<16>.)

The PTX version with the dummy branch is a few lines longer than the other, as expected.

Nope, same number. The PTX doesn’t reflect the final register allocation; that only happens later, when ptxas compiles the PTX down to machine code. You may have to look through the differences in the PTX to see if anything fishy is going on.