Why is __shfl slower than shared memory

Hi Everyone,

I have an example implemented using shared memory, but in some cases I want to replace the shared memory reads with the __shfl instruction.

The relevant piece of code looks like this:

for (int i=0; i<iteration ; i++)
 {    
  int left = sh_mem_read_array[W];
  int up = sh_mem_read_array[tx];
  int right = sh_mem_read_array[E];
  
  int shortest = MIN(left, up);
  shortest = MIN(shortest, right);

  sh_mem_write_array[tx] = shortest + global_Memory_Array[ some_index ];

  __syncthreads(); 

  //..  <<other very important codes>>
}

I changed it to use __shfl like this:

for (int i=0; i<iteration ; i++)
 {    
  int left = sh_mem_read_array[W];
  int up = __shfl_up(left,1);
  int right = __shfl_up(left,2);
  
  int shortest = MIN(left, up);
  shortest = MIN(shortest, right);

  sh_mem_write_array[tx] = shortest + global_Memory_Array[ some_index ];

  __syncthreads(); 

  //..  <<other very important codes>>
}

When I analyse the program, I can't see any advantage between the two versions. Moreover, the first version runs a little bit faster than the second.

(P.S. I know the __shfl implementation doesn't work properly for the first two lanes in the warp, since I didn't check the lane ID. But nevertheless, it isn't faster.)
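For reference, one way the lane-boundary issue could be handled is to fall back to the original shared memory reads for the lanes where __shfl_up returns the calling thread's own value (lanes 0 and 1 for deltas 1 and 2). This is only a hedged sketch reusing the names from the snippets above (sh_mem_read_array, W, E, tx), not a tested fix:

```cuda
// Sketch: guard the warp-boundary lanes, for which __shfl_up with
// deltas 1 and 2 would just return the lane's own value.
int laneid = threadIdx.x & 31;

int left  = sh_mem_read_array[W];
int up    = __shfl_up(left, 1);
int right = __shfl_up(left, 2);

// Fall back to the shared memory reads from the original version
// where the shuffle cannot reach across the warp boundary.
if (laneid < 1) up    = sh_mem_read_array[tx];
if (laneid < 2) right = sh_mem_read_array[E];
```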

Note that shuffle operations aren’t much faster than a single, non-bank-conflicted shared memory read:

http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#maximize-instruction-throughput

In your case, you have replaced two shared memory reads with two shuffle operations.

Since shuffle operations are moving data between threads, a more valid (functional) comparison is to compare a single shuffle operation with a single (indexed) shared memory write followed by a single (indexed) shared memory read. This functionally performs the same thing (moving data from one thread to another) and in this scenario, a single shuffle is usually noticeably faster than performing 2 shared memory operations.
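To make that comparison concrete, here is a hedged sketch of the two idioms side by side (the kernel name and buffers are illustrative, and it assumes a single-warp block; on CUDA 9+ you would use __shfl_sync with a mask):

```cuda
// Sketch comparing the two ways of moving a value between threads
// of one warp. Launch with 32 threads (one warp).
__global__ void exchange_demo(const int *in, int *out, int src_lane)
{
    __shared__ int buf[32];
    int v = in[threadIdx.x];

    // (a) shared-memory route: one indexed store plus one indexed load
    buf[threadIdx.x] = v;
    __syncthreads();
    int via_shared = buf[src_lane];

    // (b) shuffle route: a single instruction performs the same move
    int via_shfl = __shfl(v, src_lane);   // __shfl_sync(mask, v, src_lane) on CUDA 9+

    out[threadIdx.x] = via_shared + via_shfl;  // keep both results live
}
```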

@txbob
Thanks for the quick answer.

You are absolutely right. So, can we figure out when we should use the __shfl instruction?

My other related question is: how many cycles does __shfl take? (Since it is an instruction, I am thinking only one cycle. Or is it a special instruction?)

Most GPU instructions are pipelined. So the way to think about them from a throughput standpoint is not to ask how long they take, but how many of these types of operations can be issued per clock cycle (or retired per clock cycle - the same).

In that case the throughput info is in the table I already linked. For example, a cc 3.0 SM can issue up to 192 SP floating-point adds (or multiply-adds) in a single cycle. Since instructions are issued per warp, that translates to up to 6 warps of SP FP ops issued in a single cycle.

On the other hand, the corresponding throughput number for shuffle is 32. That means only one warp of shuffle operations can be issued per cycle.

In general, shuffle is pretty fast, when compared to back-to-back shared memory swizzled store then load (to accomplish shuffling). In other cases, I can’t make generalizations about when to use it. You’ll need to compare two different realizations, and the specifics of those realizations will determine which is faster.
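One case where the comparison clearly favors shuffle is a warp-level reduction, where each step would otherwise need a shared store, a barrier, and a shared load. A hedged sketch (my own illustration, not code from this thread; on CUDA 9+ the _sync variants apply):

```cuda
// Sketch: warp-wide minimum via butterfly shuffles. Each of the five
// steps replaces a shared store + __syncthreads() + shared load.
__device__ int warp_min(int v)
{
    for (int offset = 16; offset > 0; offset >>= 1) {
        int other = __shfl_xor(v, offset);  // __shfl_xor_sync on CUDA 9+
        v = min(v, other);
    }
    return v;  // every lane now holds the warp minimum
}
```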

In this particular case, you replaced two shared memory loads with two shuffle operations, and the result was not faster, according to your report.

You can also use the profilers to help.


On Maxwell, both SHFL and LDS have a latency of about 24 clocks. An STS followed by an LDS has a latency of only about 30 clocks (this is warp synchronous, so no BAR.SYNCs). So to see a benefit you'll want to be able to remove a store to shared as well as a load, and even so it looks like a 25% bump is the most you should expect. Then again, depending on how much ILP you have around, you might be hiding that extra latency and not see any speedup at all (these instructions are both synchronized by barriers, and hence you can continue calculating while they fetch in the background).

And if you’re curious, more latency values:

LDS.U.64: 31 clocks

STS.64
LDS.U.64: 39 clocks

LDS.U.128: 44 clocks

STS.128
LDS.U.128: 56 clocks

These numbers aren’t 100% accurate because of time slicing issues, but they should be pretty close.

And txbob brought up the other important performance metric of throughput. On Maxwell, both LDS and SHFL have a throughput of 32 (or a quarter that of arithmetic). This is provided you don’t have any bank conflicts in your shared access. That will cut the throughput down for each conflict.

LDS.64 also has a 32 throughput.
LDS.128 has a 12.5 throughput.

STS numbers are the same as LDS.

LDG throughput by comparison is 12.5.

Thank you for all the quick answers.

@scottgray
Thank you for the metrics. But how did you perform these measurements? I couldn't do instruction-level profiling with nvprof.

By the way, when I checked the nvprof metrics I saw "Inter-Thread Instructions". Does this metric count how many instructions were executed because of __shfl? When I use __shfl I can see some instructions counted by this metric; if I don't use __shfl, I haven't seen anything from it.

For latencies I wrote a little benchmark in SASS assembly, like so:

--:-:-:-:1      CS2R clock1, SR_CLOCKLO;
--:-:1:-:2      SHFL.BFLY PT, result, a, 0x1, 0x1f;
01:-:-:-:6      CS2R clock2, SR_CLOCKLO;
--:-:-:-:1      IADD clock1, clock2, -clock1;

Then launch just one warp to minimize time slicing issues.

https://code.google.com/p/maxas/

For throughput I drop a bunch of the same instructions in a big loop and run that loop a bunch of times. Then I compare the measured throughput to the theoretical 128 instructions issued per clock per SM.
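A rough CUDA-C analogue of that kind of probe can be written with clock64(), though the compiler is free to reorder code around the timestamps, so this only approximates what hand-written SASS measures. A hedged sketch (kernel and names are my own; launch a single warp to minimize time-slicing effects):

```cuda
// Sketch: approximate latency probe for a shuffle instruction.
// Only indicative -- the SASS-level approach above is more precise.
__global__ void shfl_latency(int *result, long long *cycles)
{
    int v = threadIdx.x;
    long long t0 = clock64();
    v = __shfl_xor(v, 1);          // instruction under test
    long long t1 = clock64();
    result[threadIdx.x] = v;       // keep the shuffle live
    if (threadIdx.x == 0) *cycles = t1 - t0;
}
```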

Not sure what “Inter-Thread Instructions” measures. Shfl might fit that description… or syncthreads maybe…