SDK Reduce Example

In the parallel reduction sample the kernel with an unrolled last warp contains the following code:

[codebox]if (tid < 32)
{
    if (blockSize >=  64) { sdata[tid] += sdata[tid + 32]; barrier(CLK_LOCAL_MEM_FENCE); }
    if (blockSize >=  32) { sdata[tid] += sdata[tid + 16]; barrier(CLK_LOCAL_MEM_FENCE); }
    if (blockSize >=  16) { sdata[tid] += sdata[tid +  8]; barrier(CLK_LOCAL_MEM_FENCE); }
    if (blockSize >=   8) { sdata[tid] += sdata[tid +  4]; barrier(CLK_LOCAL_MEM_FENCE); }
    if (blockSize >=   4) { sdata[tid] += sdata[tid +  2]; barrier(CLK_LOCAL_MEM_FENCE); }
    if (blockSize >=   2) { sdata[tid] += sdata[tid +  1]; barrier(CLK_LOCAL_MEM_FENCE); }
}[/codebox]

I’ve been wondering why this works, since more than 32 threads are executing the kernel and only the first 32 execute the barrier() instructions.

I thought that barrier() blocks until all threads in the work-group have reached it.

Edit: I’m sorry, I completely misread your question and even NVIDIA’s example!

That would appear to violate the specification.

Nice catch.

Indeed, it looks like these should be using

write_mem_fence(CLK_LOCAL_MEM_FENCE);

instead. And your question of why the program even works is a good one! Perhaps it’s a compiler optimization: the compiler knows that on this hardware warps are always self-synchronized, so barriers are unnecessary when only one warp is active?
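To make the "self-synchronized warp" guess concrete, here is a minimal sketch in plain C (not OpenCL C) that emulates 32 work-items running the unrolled tail in lock-step for blockSize == 64. It assumes that lock-step means every lane reads its operand before any lane writes. WARP_SIZE, lockstep_step and warp_tail_sum are hypothetical names made up for this illustration, and nothing here says what a given compiler or device actually does.

```c
#include <assert.h>

#define WARP_SIZE 32  /* assumed lock-step width; this is hardware dependent */

/* One unrolled step, sdata[tid] += sdata[tid + offset], for tid in 0..31.
   Every lane reads its operand before any lane writes, which is what
   lock-step execution would give even without a barrier. */
static void lockstep_step(float *sdata, int offset)
{
    float operand[WARP_SIZE];

    assert(offset >= 1 && offset <= WARP_SIZE);
    for (int tid = 0; tid < WARP_SIZE; ++tid)
        operand[tid] = sdata[tid + offset];   /* all reads first */
    for (int tid = 0; tid < WARP_SIZE; ++tid)
        sdata[tid] += operand[tid];           /* then all writes */
}

/* Emulates the unrolled tail for blockSize == 64.
   sdata must hold 64 floats; the total ends up in sdata[0]. */
static float warp_tail_sum(float *sdata)
{
    for (int offset = WARP_SIZE; offset >= 1; offset >>= 1)
        lockstep_step(sdata, offset);
    return sdata[0];
}
```

Under that assumption only sdata[0] is guaranteed to hold the full sum; the upper lanes accumulate values nobody uses. If the 32 work-items were not executed in lock-step, nothing in this scheme would order the reads and writes, which is why the question about the barrier matters.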

I forgot about write_mem_fence(). That sounds like a good replacement to me. And if it’s a compiler optimization then it sounds like it can break when using a different OpenCL implementation later on.

EDIT: I just realized that mem_fence(CLK_LOCAL_MEM_FENCE) is necessary instead of write_mem_fence, because threads are also reading memory that other threads are writing.

EDIT2: after thinking about it some more, you’re right, write_mem_fence is enough.

Oh joy, looks like I’ll be using mem_fence after all:

[codebox]In file included from :5:
./sum64.clh:12: error: cannot codegen this builtin function yet
            data[tid] += data[tid + 32]; write_mem_fence(CLK_LOCAL_MEM_FENCE);
                                         ^~~~~~~~~~~~~~~[/codebox]

This was kind of unexpected, since the spec defines write_mem_fence and this is a conformant release.
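For comparison, one way to avoid both the barrier inside the tid < 32 branch and the fence builtins is to give up the unrolled last warp and keep barrier() outside the tid test, so that every work-item in the group reaches it. Below is a rough sketch of that idea, written as a plain C emulation rather than a real kernel: the inner loop over tid stands in for the work-items, and the end of each pass stands in for the barrier. reduce_group is a hypothetical name, and the group size is assumed to be a power of two.

```c
#include <assert.h>

/* Plain C emulation of one work-group doing a tree reduction in local memory.
   Each pass of the inner loop is one phase that every work-item takes part in;
   in OpenCL C the phase would end with barrier(CLK_LOCAL_MEM_FENCE). */
static float reduce_group(float *sdata, int group_size)
{
    /* assumption: group_size is a power of two */
    assert(group_size > 0 && (group_size & (group_size - 1)) == 0);

    for (int s = group_size / 2; s > 0; s >>= 1) {
        for (int tid = 0; tid < group_size; ++tid) {
            if (tid < s)                      /* only the add is conditional */
                sdata[tid] += sdata[tid + s];
        }
        /* OpenCL C equivalent at this point, outside the if:
           barrier(CLK_LOCAL_MEM_FENCE); */
    }
    return sdata[0];
}
```

The price is one full barrier per step, including the last few steps that the unrolled version was written to make cheaper, so this trades speed for staying within what the spec says about barrier().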