reduction optimization #1: I don't agree with the performance explanation

Hi,

I reread the well-known parallel reduction example, especially the 7 optimizations given in "High Performance Computing with CUDA".

As a reminder, the original code is:

extern __shared__ int sdata[];

// [...]

unsigned int tid = threadIdx.x;
unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;

sdata[tid] = g_idata[i];
__syncthreads();

// do reduction in shared mem
for (unsigned int s = 1; s < blockDim.x; s *= 2) {
    if (tid % (2 * s) == 0) {
        sdata[tid] += sdata[tid + s];
    }
    __syncthreads();
}

and the first optimization proposes to remove the divergence due to the conditional instruction

if (tid % (2*s) == 0)

The new loop is:

for (unsigned int s = 1; s < blockDim.x; s *= 2) {
    int index = 2 * s * tid;
    if (index < blockDim.x) {
        sdata[index] += sdata[index + s];
    }
    __syncthreads();
}

with a speedup of 2.33x. BUT, in my view, the divergent branching is the same, and the speedup could simply be the consequence of removing the costly % operation.
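
To separate the two effects, one could try a hypothetical variant (my own, not from the slides) that keeps the same interleaved thread selection but drops the modulo: since 2*s is always a power of two in this loop, tid % (2*s) == 0 is equivalent to a bitwise mask test. If the speedup came mostly from the % itself, this version should already be close to the optimized one:

for (unsigned int s = 1; s < blockDim.x; s *= 2) {
    // same interleaved (divergent) pattern as the original loop, but the
    // modulo is replaced by a mask test; valid because 2*s is a power of two
    if ((tid & (2 * s - 1)) == 0) {
        sdata[tid] += sdata[tid + s];
    }
    __syncthreads();
}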

Regarding branch divergence, it should happen only when two different execution paths appear. But without an "else" after the "if", I thought there couldn't be any divergence…

As I am a newbie, I suppose I am wrong :blarg: . So please point me in the right direction.

Once again, I end up answering my own question.

The solution increases performance not by removing divergent branches, but by removing INTERLEAVED divergent branches.
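
A quick way to see the "interleaved" point on today's CUDA (9.0 or later, so this sketch is anachronistic relative to the original slides, and the kernel name is mine) is to ballot both predicates for the first pass and print each warp's active mask. With a 64-thread block, the interleaved test gives both warps a mixed mask (0x55555555), while the reordered test makes warp 0 fully active and warp 1 fully idle:

#include <cstdio>

// Hypothetical demo kernel (not from the slides): for the first pass (s = 1),
// collect each warp's active mask under both predicates with __ballot_sync.
__global__ void showMasks(void)
{
    unsigned int tid = threadIdx.x;
    unsigned int s = 1;

    unsigned int interleaved = __ballot_sync(0xffffffffu, tid % (2 * s) == 0);
    unsigned int packed      = __ballot_sync(0xffffffffu, 2 * s * tid < blockDim.x);

    if (tid % 32 == 0)  // one line of output per warp
        printf("warp %u: interleaved mask 0x%08x, packed mask 0x%08x\n",
               tid / 32, interleaved, packed);
}

int main(void)
{
    showMasks<<<1, 64>>>();
    cudaDeviceSynchronize();
    return 0;
}

A mixed mask (neither all ones nor all zeros) is exactly a warp whose two paths have to be serialized.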

Hope it will be helpful for other newbies out there.

H.

I think “divergence” here means divergence within a warp. Divergence between warps will not decrease performance.

Absolutely

If you understand it, would you be so kind as to explain it to me?

I don’t understand either why the 2nd loop is faster than the first one, apart from the fact that it doesn’t use the expensive modulo operator…

I mean, the for loop is executed the same number of times, and the accumulation is performed the same number of times, so how can it be faster?

I just got back from vacation; my brain hasn’t rebooted completely yet :)

For the 1st loop, on the 1st pass the threads #0, 2, 4, 6, 8, 10, 12 and 14 will accumulate. On the 2nd pass the threads #0, 4, 8 and 12 accumulate. And so on.

For the 2nd loop, on the 1st pass the threads #0 to 7 accumulate. On the 2nd pass, threads #0 to 3 accumulate, and so on.

Why would the 1st loop be that much slower?

http://www.gpgpu.org/sc2007/SC07_CUDA_5_Optimization_Harris.pdf

Because of the conditional “if” instruction, there are divergent branches in the process.

The key point is that, in the second loop, the threads are ordered according to their answer to the “if” instruction. That is, there is no divergence in the blocks (in fact, there can be one block with some divergence, but only one).

H.

I still don’t get it. :wacko:

I understand that in the 2nd loop, all threads for which the if() evaluates to TRUE are grouped together at the beginning (the first thread IDs). Still, all the threads belong to the same warp (in Mark Harris’ example he has 16 threads, which is smaller than the warp size of 32). So there should still be divergence within a warp.

Loop 1, 1st pass:

1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0

Loop 2, 1st pass:

1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0

What do you call a block? Which block would show divergence?

My fault, it is not a question of blocks here… sorry!

In the first case, all warps (sorry for my mistake, here we focus on only one block of course) show some divergence, so their two paths have to be serialized.
By contrast, in the second loop at most one warp can contain both threads that evaluate to TRUE and threads that evaluate to FALSE. Consequently, only this warp has to be serialized.

The goal is to reduce the number of divergent warps as much as possible.
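
To put numbers on that, here is a small host-side sketch (my own, with an assumed block size of 256 and warp size of 32; all names are mine) that counts, for every pass, how many warps contain both active and idle threads under each scheme. It reports 8 divergent warps on the early passes for the first loop, and never more than 1 for the second:

#include <stdio.h>

// Host-side sketch (assumed blockDim.x = 256, warpSize = 32): count, for each
// pass s, how many warps contain a mix of active and idle threads, i.e. how
// many warps actually diverge, under both indexing schemes.
int main(void)
{
    const int blockDimX = 256, warpSize = 32;

    for (int s = 1; s < blockDimX; s *= 2) {
        int divInterleaved = 0, divPacked = 0;
        for (int w = 0; w < blockDimX / warpSize; ++w) {
            int activeInterleaved = 0, activePacked = 0;
            for (int lane = 0; lane < warpSize; ++lane) {
                int tid = w * warpSize + lane;
                if (tid % (2 * s) == 0)      ++activeInterleaved;  // loop 1 predicate
                if (2 * s * tid < blockDimX) ++activePacked;       // loop 2 predicate
            }
            // a warp diverges only if some, but not all, of its threads are active
            if (activeInterleaved > 0 && activeInterleaved < warpSize) ++divInterleaved;
            if (activePacked      > 0 && activePacked      < warpSize) ++divPacked;
        }
        printf("s = %3d: divergent warps: loop 1 = %d, loop 2 = %d\n",
               s, divInterleaved, divPacked);
    }
    return 0;
}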

Hi

I’m trying to understand the paper about parallel reduction. Why, in kernel 2 (after the first optimization), do we have shared memory bank conflicts? Why is sequential addressing conflict-free?

Please, help me.

Thanks.
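
Not an authoritative answer, but roughly as I remember the slides (please check against the PDF linked above): in kernel 2 the index 2*s*tid makes neighbouring threads hit shared memory with a stride of 2*s words, so several threads of the same half-warp can land in the same bank, and the conflicts get worse as s grows. With sequential addressing, neighbouring threads read neighbouring words, one bank each, hence no conflicts. The sequential-addressing loop, reconstructed from memory, looks roughly like this:

// sequential addressing, as I remember kernel 3 from the slides:
// the active threads are always the contiguous range 0 .. s-1, and each
// active thread adds the element s positions away
for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
    if (tid < s) {
        sdata[tid] += sdata[tid + s];
    }
    __syncthreads();
}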