Diverge-free doesn't win 32x over Diverge-all warp divergence

(cuda1.0, 8800GTX, winxp, vs2005)
Hi, I have ugly kernels full of data-dependent branches like “if…else… for(…)”.
Each thread processes 1 element of a independently. No _syncthreads() in kernel.
At first, the data in a is random, thus I call it “Diverge-all”. time cost is 900ms.
Then, I let a[i] = a[32 * (i /32)], thus all threads in a warp take exactly the same branches. I call it “Diverge-free”. Surprisingly, time cost is 400ms, speedup is only 2x.
I can’t explain this with my current knowledge of “SIMD warps”. It seems the more complicated the kernel, the little effect the duplicated (SIMD) tasks have.
Thanks for any comments!

900ms in one pass? it must be a big kernel…
How divergent things really are depend on the exact way you formulate your code and how random the data is. For one if, you can’t possibly get more than 2-way divergence, and things should converge back after each if. That is likely what you’re experiencing.

BTW: If my math didn’t fail me, you’re only expected to get ~20 different values for 32 integers uniformly distributed in 0~31.
Also, for a true 32-way divergence, one would need 5 levels of nested if-else, and write a lot of code in all 32 clauses to ensure “true if”. I highly doubt whether you’re doing that.

asadafag:
Thanks for the cool comments! I got it and agree.
but my code has “for(i=0;i<x;++i){if…else…}”, in which x is data-dependent and diverges from 0 to 64. Doesn’t that translate to 128-way divergence? sure that actually no worse than 32-way divergence.
Thanks!

Is your divergence really short? The compiler could be producing predicated instructions (check the ptx). This would explain the roughly 1/2 change in performance.

The code you provide is not enough to grasp the cost of divergence. You can’t get more than two divergent branches with an if-else statement. If both branches take about the same amount of time, it makes sense that by getting rid of the branches you get a 2x speedup - due to divergence threads have to execute instructions in both branches.

Since the number of loop iterations varies among threads, the cost of a warp depends on the max number of iterations in that warp.

Also, keep in mind that impact of divergence on the total time depends on the percentage of time the kernel spends in divergent sections.

Paulius

What a coincidence! I happen to have a similar kernel as my bottleneck. That’s just a 2-way divergence, since there’s convergence at the end of for. In my case, the major problem is load balancing. Like Paulius said, a warp takes time proportional to max{x}. A rough sort by x before this kernel may result in slightly improved performance. However, even a rough sort is costly, so you may need some work balancing the two parts.

sigh, I just now found the fact of just 2-way divergence. Compiler won’t expand the for loop! I will try expand the for loop manually.
a rough sort, or to say cluster, will improve it. But as I see, cuda’s bitonic sort is fast, why do you say it’s costly?
For work balancing, i though of a warp-range task queue. each thread apends its task in the shared queue, and other threads fetch from the queue and do the same thing.
I’ll see if https://research.microsoft.com/~jrzhou/pub/…Maintenance.pdf
helps. Thanks!