GPU Instruction Scheduling Questions

Let’s say I’m trying to do a simple reduction over an array of size n, say kept within one work group… With enough parallel hardware this would seem to take log n steps, but it’s not as if the first wave of threads all execute concurrently, right? Only one warp fires at a time, correct?

for (int offset = get_local_size(0) / 2; offset > 0; offset >>= 1) {
    if (local_index < offset) {
        float other = scratch[local_index + offset];
        float mine  = scratch[local_index];
        scratch[local_index] = (mine < other) ? mine : other;  // keep the smaller value (a min-reduction)
    }
    barrier(CLK_LOCAL_MEM_FENCE);  // wait for every work-item before starting the next level
}
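
(For context, scratch is a __local buffer; before this loop each work-item has loaded one element and hit a barrier, something like the following — input is just a placeholder name for the global buffer:)

scratch[local_index] = input[get_global_id(0)];  // one element per work-item
barrier(CLK_LOCAL_MEM_FENCE);                    // all loads visible before the tree starts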

So as I understand it, 32 items get processed in parallel, and then those threads wait at the barrier. The scheduler fires another 32 and we wait at the barrier, and another 32, and so on, until all the threads have done the n/2 operations needed for the first level of the tree, and then we go around the loop for the next level.

If indeed only 32 threads can fire simultaneously on a block, why not just spawn a job with 32 threads and go at it, knowing that all of your threads really are running? I understand instruction-level parallelism is a big deal, so why not spawn ONE thread and do something like

float a = 0, b = 0, c = 0, d = 0; /* ... one accumulator per lane, 32 in total */
int i = 0;
while (i < array_size) {
    a += scratch[i];
    b += scratch[i + 1];
    c += scratch[i + 2];
    // ... and so on for the remaining accumulators
    i += 32;
}
// ...
float accum = 0;
accum += a;
accum += b;
accum += c;
accum += d;
// ... fold in the rest of the accumulators

such that all the adds happen within a single warp. Now you have ONE thread going, keeping the block as busy as you like.

Now suppose instruction-level parallelism isn’t really a thing. What about the following, with the work-group size set to 32?

// each of the 32 work-items strides through the array, accumulating into its own slot
for (int i = 32; i < n; i += 32) {
    scratch[get_local_id(0)] += scratch[i + get_local_id(0)];
}

and then add the first 32 items together. I imagine that those 32 threads would keep firing again and again.
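
(Meaning the last step would just be something like this, with work-item 0 folding the 32 partials — total is a placeholder name:)

float total = scratch[0];
for (int j = 1; j < 32; ++j)
    total += scratch[j];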

If you’re not averse to giving up the generality of CUDA/OpenCL, why bother reducing in a tree when you KNOW how many adds will fire per cycle?

Thanks a bunch for the help!

Evan

Very roughly speaking, you can think of 32 threads (1 warp) as executing simultaneously on a Streaming Multiprocessor (SM). However, Fermi SMs, for example, have a dual warp scheduler that can issue instructions from 2 warps at once, hiding latency by overlapping execution with memory transactions. Kepler SMs (SMX) have quad warp schedulers for the same reason.
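
If you launch only 32 threads in total, the schedulers have a single warp to pick from, so every memory stall is fully exposed. In practice your two ideas are usually combined: each work-item first does a serial, strided accumulation (your single-thread loop, but spread across many work-items), and only the final per-group step uses the log-step tree. A rough OpenCL sketch, assuming a min-reduction as in your quoted code and a work-group size of, say, 256 (the kernel and buffer names are just placeholders):

__kernel void reduce_min(__global const float* input,
                         __global float* group_results,
                         __local float* scratch,
                         int n)
{
    int lid    = get_local_id(0);
    int gid    = get_global_id(0);
    int stride = get_global_size(0);

    // Phase 1: each work-item serially folds a strided slice of the input.
    // With 256 work-items per group there are 8 resident warps the SM's
    // schedulers can interleave while loads are in flight.
    float acc = INFINITY;
    for (int i = gid; i < n; i += stride)
        acc = fmin(acc, input[i]);

    scratch[lid] = acc;
    barrier(CLK_LOCAL_MEM_FENCE);

    // Phase 2: log2(group size) tree steps in local memory, as in your loop.
    for (int offset = get_local_size(0) / 2; offset > 0; offset >>= 1) {
        if (lid < offset)
            scratch[lid] = fmin(scratch[lid], scratch[lid + offset]);
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    if (lid == 0)
        group_results[get_group_id(0)] = scratch[0];
}

Each work-group writes one partial result to group_results, which still has to be combined somewhere — see below.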

Which kind of barrier are you thinking about? __syncthreads() synchronizes all the threads in a block, but there is no mechanism to synchronize threads in different blocks. You may wish to have a look at this recent post:

https://devtalk.nvidia.com/default/topic/570147/cuda-programming-and-performance/inter-block-synchronization/
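
When the array spans more than one block / work-group, the usual pattern is therefore to let each block write one partial result and then launch a second kernel (or finish on the host) to combine them; the boundary between the two kernel launches is what acts as the global synchronization point. A trivial second-stage sketch, in OpenCL terms to match the question's code, assuming the first kernel wrote one partial minimum per group into group_results:

// Second pass, enqueued as a separate kernel after the first one completes.
// There is no intra-kernel way to wait for other blocks, so the gap between
// the two launches is the "global barrier".
__kernel void reduce_min_final(__global const float* group_results,
                               __global float* result,
                               int num_groups)
{
    if (get_global_id(0) == 0) {            // num_groups is small, one thread is fine
        float m = group_results[0];
        for (int i = 1; i < num_groups; ++i)
            m = fmin(m, group_results[i]);
        *result = m;
    }
}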