In the parallel reduction sample the kernel with an unrolled last warp contains the following code:
[codebox]if (tid < 32)
{
    if (blockSize >= 64) { sdata[tid] += sdata[tid + 32]; barrier(CLK_LOCAL_MEM_FENCE); }
    if (blockSize >= 32) { sdata[tid] += sdata[tid + 16]; barrier(CLK_LOCAL_MEM_FENCE); }
    if (blockSize >= 16) { sdata[tid] += sdata[tid + 8]; barrier(CLK_LOCAL_MEM_FENCE); }
    if (blockSize >= 8) { sdata[tid] += sdata[tid + 4]; barrier(CLK_LOCAL_MEM_FENCE); }
    if (blockSize >= 4) { sdata[tid] += sdata[tid + 2]; barrier(CLK_LOCAL_MEM_FENCE); }
    if (blockSize >= 2) { sdata[tid] += sdata[tid + 1]; barrier(CLK_LOCAL_MEM_FENCE); }
}[/codebox]
I’ve been wondering why this works, since more than 32 threads are executing the kernel and only the first 32 execute the barrier() instructions.
I thought that barrier() blocks until all threads in the work-group have reached it.
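For reference, this is roughly what I would have expected a spec-conforming version of the tail to look like, with every work-item in the group reaching each barrier (the loop form is just my own sketch, untested):
[codebox]// Sketch only: the barrier sits outside the divergent branch, so all work-items reach it.
for (unsigned int s = 32; s > 0; s >>= 1)
{
    if (blockSize >= 2*s && tid < s)
        sdata[tid] += sdata[tid + s];
    barrier(CLK_LOCAL_MEM_FENCE);
}[/codebox]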
Edit: I’m sorry, I completely misread your question and even NVIDIA’s example!
That would appear to violate the specification.
Nice catch.
Indeed, it looks like these should be using
write_mem_fence(CLK_LOCAL_MEM_FENCE);
instead. And your question of why the program even works is a good one! Perhaps it’s an optimization by the compiler, which knows that on this hardware warps are always self-synchronized, so barriers are unnecessary when only one warp is active?
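Something like this for each step, I suppose (untested sketch):
[codebox]// A fence orders the local-memory write before the next step's read; no full barrier needed.
if (blockSize >= 64) { sdata[tid] += sdata[tid + 32]; write_mem_fence(CLK_LOCAL_MEM_FENCE); }[/codebox]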
I forgot about write_mem_fence(). That sounds like a good replacement to me. And if it’s only a compiler optimization, then it could break when using a different OpenCL implementation later on.
EDIT: I just realized that a mem_fence(CLK_LOCAL_MEM_FENCE) instead of a write_mem_fence is necessary, because threads are also reading memory that other threads are writing.
EDIT2: after thinking about it some more, you’re right, write_mem_fence is enough.
Oh joy, looks like I’ll be using mem_fence after all:
[codebox]In file included from :5:
./sum64.clh:12: error: cannot codegen this builtin function yet
data[tid] += data[tid + 32]; write_mem_fence(CLK_LOCAL_MEM_FENCE);
^~~~~~~~~~~~~~~[/codebox]
This was kind of unexpected, since the spec defines write_mem_fence and this is a conformant release.
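So for now the unrolled tail ends up looking roughly like this (untested, just swapping in mem_fence; it still relies on the first 32 work-items running in lockstep):
[codebox]if (tid < 32)
{
    // mem_fence orders both reads and writes to local memory within each work-item;
    // correctness across the 32 work-items still assumes warp-synchronous execution.
    if (blockSize >= 64) { sdata[tid] += sdata[tid + 32]; mem_fence(CLK_LOCAL_MEM_FENCE); }
    if (blockSize >= 32) { sdata[tid] += sdata[tid + 16]; mem_fence(CLK_LOCAL_MEM_FENCE); }
    if (blockSize >= 16) { sdata[tid] += sdata[tid + 8]; mem_fence(CLK_LOCAL_MEM_FENCE); }
    if (blockSize >= 8) { sdata[tid] += sdata[tid + 4]; mem_fence(CLK_LOCAL_MEM_FENCE); }
    if (blockSize >= 4) { sdata[tid] += sdata[tid + 2]; mem_fence(CLK_LOCAL_MEM_FENCE); }
    if (blockSize >= 2) { sdata[tid] += sdata[tid + 1]; mem_fence(CLK_LOCAL_MEM_FENCE); }
}[/codebox]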