Do warp vote functions cause branching?

I have managed to isolate this piece of code, and it is giving me headaches as to why it is causing branching:

__global__ void foo(args...){
	int flag = 0;
	do{
		flag = 1;
	}while(__all(flag == 0));
}

The profiler reports this as causing some severe branching (544 branches when launched with a grid of 4096 blocks of 128 threads). When combined with other pieces of code it causes even more severe branching and some pretty bad slowdown.

Is this the expected behavior with warp vote functions?

Excellent test case, since it’s clear by inspection that the condition is false, so the warp should test once and exit.

I can’t explain the profiler’s report, but it’s often a little mysterious.

One interesting experiment: what if you change the test to “__all(0)” instead? The compiler might optimize it away, but if not it would be interesting.
Looking at the PTX might also give a clue what’s happening.

Are you compiling for compute capability 1.2 or higher?

1.3, yes. I don’t think it would compile if you didn’t set it to at least 1.2.

Good call, however it doesn’t seem to help; the profiler keeps reporting the same branching D:

Here is the PTX. Other than the changed function name and the bunch of parameters it’s the same as the code above. This is the version with the while(__all(0)) loop.

.entry _Z27updateOptimizedRandomWalkerP4int4PKS_iP5uint4PKfPKiS2_i (
		.param .u32 __cudaparm__Z27updateOptimizedRandomWalkerP4int4PKS_iP5uint4PKfPKiS2_i_randomWalkerPosition,
		.param .u32 __cudaparm__Z27updateOptimizedRandomWalkerP4int4PKS_iP5uint4PKfPKiS2_i_coordinatesLut,
		.param .s32 __cudaparm__Z27updateOptimizedRandomWalkerP4int4PKS_iP5uint4PKfPKiS2_i_numRandomWalkers,
		.param .u32 __cudaparm__Z27updateOptimizedRandomWalkerP4int4PKS_iP5uint4PKfPKiS2_i_randomv,
		.param .u32 __cudaparm__Z27updateOptimizedRandomWalkerP4int4PKS_iP5uint4PKfPKiS2_i_color,
		.param .u32 __cudaparm__Z27updateOptimizedRandomWalkerP4int4PKS_iP5uint4PKfPKiS2_i_volume,
		.param .u32 __cudaparm__Z27updateOptimizedRandomWalkerP4int4PKS_iP5uint4PKfPKiS2_i_centerMass,
		.param .s32 __cudaparm__Z27updateOptimizedRandomWalkerP4int4PKS_iP5uint4PKfPKiS2_i_cellCount)
	{
	.reg .u32 %rv1;
	.reg .u32 %r<7>;
	.reg .pred %p<5>;
	.loc	48	129	0
$LBB1__Z27updateOptimizedRandomWalkerP4int4PKS_iP5uint4PKfPKiS2_i:
$Lt_2_1282:
 //<loop> Loop body line 191
	.loc	48	191	0
	mov.s32 	%r1, 0;
	mov.u32 	%r2, 0;
	setp.ne.u32 	%p1, %r1, %r2;
	vote.all.pred 	%p2, %p1;
	selp.u32 	%r3, 1, 0, %p2;
	mov.s32 	%r4, %r3;
	mov.u32 	%r5, 0;
	setp.ne.s32 	%p3, %r4, %r5;
	@%p3 bra 	$Lt_2_1282;
	.loc	48	199	0
	exit;
$LDWend__Z27updateOptimizedRandomWalkerP4int4PKS_iP5uint4PKfPKiS2_i:
	} // _Z27updateOptimizedRandomWalkerP4int4PKS_iP5uint4PKfPKiS2_i

On further inspection it would seem that even this causes the branching and the consequent slowdown:

__global__ void foo(args...){
	do{
	}while(__all(0));
}

The corresponding PTX is the same as the one above, since the compiler was optimizing flag out anyway.

Here is a proper test case. Just launch it in the profiler and check the branches field…

__global__ void bad(){
	do{
	}while(__all(0));
}

__host__ void method2(){
	bad<<<4096,128>>>();
}

int main( int argc, char** argv )
{
	method2();
	return 0;
}

Excellent test case, I have to agree. I think we need to leave this one to Nvidia staff.

The only idea I have is that maybe the branching somehow happens in the setup code, and this is simply the smallest kernel the compiler is not able to completely optimize away.

It may just be the profiler… it’s more likely that its report is wrong than that the hardware is somehow getting into an inefficient redundant branching loop.

However, I can attest that just adding those lines of code causes a significant performance dip (half the performance, actually) and makes that function the bottleneck of my program.

I can’t reproduce it here though, so I’ll rather go to bed…

Just for the record, I’m using CUDA 3.0, driver version 195.36.15, and profiling on a Tesla C1060. I’d upgrade to 3.1b, but as I reported in a previous thread, 3.1 breaks the latest stable thrust release, which is why I’m waiting for the final releases. I’ll seriously consider updating if this is a 3.0-only thing though. :P

Don’t upgrade too early - I have to admit I’ve tested on ancient CUDA 2.1. For weird reasons I’m stuck with having to run an old Linux kernel, so I can’t easily upgrade to newer CUDA versions. :(

Try “volatile int” and see if that helps.

That sounds like a good idea, however in the second test case I posted (post #5) you can see that you don’t even need a variable for this to happen…

Have you tried moving the vote to inside the loop? It’s a long shot, but you never know.

Yes, I tried organizing my code like that, although I’m starting to think I may have some serious misunderstanding of how warps are organized.

I’m trying to make an algorithm that has warp-level granularity. From my understanding, warps are organized such that in a 1D block threads 0-31 belong to the first warp, 32-63 to the second one, and so forth. That’s what I assumed when organizing my kernel.

However, I don’t think that should matter, given that my kernel has the following structure:

int flag;
{
	(... some calculations in here ...)
	flag = *calculation's final result that returns 0 or 1*
}

if(__any(flag))
	return;

If I run that, everything runs fine and dandy. However, if I add any line of code after it that is not optimized out by the compiler, say a global memory write,

int flag;
{
	(... some calculations in here ...)
	flag = *calculation's final result that returns 0 or 1*
}

if(__any(flag))
	return;

globalMemory[warp] = 0;

the code goes up to 300% slower, the profiler shows thousands of branches and tens of thousands of extra instructions in the instructions field… and I just don’t get it…

I had/have a good deal of difficulty understanding the usefulness of the vote intrinsics, but the following gives me some hope:

I can’t really understand the slowdown, but I think the scheme for using vote intrinsics should be different to (likely) produce a desirable result.

I think it requires a loop where a calculation/iteration is carried out (data-parallel) and tested. Set the flag according to the outcome of the test, call __any(flag), and return if it is true. This helps avoid the “winning” thread sitting idle while the rest of the warp keeps trying fruitlessly. To do the same trick among warps in a thread block, you probably need shared memory: have the winning thread set a shared flag, and have the rest test it after calling __any(flag).
