In the parallel reduction sample the kernel with an unrolled last warp contains the following code:
[codebox]if (tid < 32)
{
    if (blockSize >= 64) { sdata[tid] += sdata[tid + 32]; barrier(CLK_LOCAL_MEM_FENCE); }
    if (blockSize >= 32) { sdata[tid] += sdata[tid + 16]; barrier(CLK_LOCAL_MEM_FENCE); }
    if (blockSize >= 16) { sdata[tid] += sdata[tid + 8]; barrier(CLK_LOCAL_MEM_FENCE); }
    if (blockSize >= 8) { sdata[tid] += sdata[tid + 4]; barrier(CLK_LOCAL_MEM_FENCE); }
    if (blockSize >= 4) { sdata[tid] += sdata[tid + 2]; barrier(CLK_LOCAL_MEM_FENCE); }
    if (blockSize >= 2) { sdata[tid] += sdata[tid + 1]; barrier(CLK_LOCAL_MEM_FENCE); }
}[/codebox]
I’ve been wondering why this works, since more than 32 threads are executing the kernel and only the first 32 execute the barrier() instructions.
I thought that barrier() blocks until all threads in the work-group have reached it.
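For reference, this is roughly what I would have expected a spec-conforming version of the tail to look like, with every work-item in the group reaching each barrier (the loop form is just my own sketch, untested):
[codebox]// Sketch only: the barrier sits outside the divergent branch, so all work-items reach it.
for (unsigned int s = 32; s > 0; s >>= 1)
{
    if (blockSize >= 2*s && tid < s)
        sdata[tid] += sdata[tid + s];
    barrier(CLK_LOCAL_MEM_FENCE);
}[/codebox]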
Edit: I’m sorry, I completely misread your question and even NVIDIA’s example!
That would appear to violate the specification.
Nice catch.
Indeed, it looks like these should be using
write_mem_fence(CLK_LOCAL_MEM_FENCE);
instead. And your question of why the program even works is a good one! Perhaps it’s an optimization by the compiler, which knows that on this hardware warps are always self-synchronized, so barriers are unnecessary when only one warp is active?
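Something like this for each step, I suppose (untested sketch):
[codebox]// A fence orders the local-memory write before the next step's read; no full barrier needed.
if (blockSize >= 64) { sdata[tid] += sdata[tid + 32]; write_mem_fence(CLK_LOCAL_MEM_FENCE); }[/codebox]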
I forgot about write_mem_fence(). That sounds like a good replacement to me. And if it’s only a compiler optimization, then it could break when using a different OpenCL implementation later on.
EDIT: I just realized that a mem_fence(CLK_LOCAL_MEM_FENCE) instead of a write_mem_fence is necessary, because threads are also reading memory that other threads are writing.
EDIT2: after thinking about it some more, you’re right, write_mem_fence is enough.
Oh joy, looks like I’ll be using mem_fence after all:
[codebox]In file included from :5:
./sum64.clh:12: error: cannot codegen this builtin function yet
data[tid] += data[tid + 32]; write_mem_fence(CLK_LOCAL_MEM_FENCE);
^~~~~~~~~~~~~~~[/codebox]
This was kind of unexpected, since the spec defines write_mem_fence and this is a conformant release.
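So for now the unrolled tail ends up looking roughly like this (untested, just swapping in mem_fence; it still relies on the first 32 work-items running in lockstep):
[codebox]if (tid < 32)
{
    // mem_fence orders both reads and writes to local memory within each work-item;
    // correctness across the 32 work-items still assumes warp-synchronous execution.
    if (blockSize >= 64) { sdata[tid] += sdata[tid + 32]; mem_fence(CLK_LOCAL_MEM_FENCE); }
    if (blockSize >= 32) { sdata[tid] += sdata[tid + 16]; mem_fence(CLK_LOCAL_MEM_FENCE); }
    if (blockSize >= 16) { sdata[tid] += sdata[tid + 8]; mem_fence(CLK_LOCAL_MEM_FENCE); }
    if (blockSize >= 8) { sdata[tid] += sdata[tid + 4]; mem_fence(CLK_LOCAL_MEM_FENCE); }
    if (blockSize >= 4) { sdata[tid] += sdata[tid + 2]; mem_fence(CLK_LOCAL_MEM_FENCE); }
    if (blockSize >= 2) { sdata[tid] += sdata[tid + 1]; mem_fence(CLK_LOCAL_MEM_FENCE); }
}[/codebox]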