Why is __shfl slower than shared memory

Hi Everyone,

I have an example implemented using shared memory, but in some cases I want to replace the shared memory reads with the __shfl instruction.

The relevant piece of code looks like this:

for (int i=0; i<iteration ; i++)
 {    
  int left = sh_mem_read_array[W];
  int up = sh_mem_read_array[tx];
  int right = sh_mem_read_array[E];
  
  int shortest = MIN(left, up);
  shortest = MIN(shortest, right);

  sh_mem_write_array[tx] = shortest + global_Memory_Array[ some_index ];

  __syncthreads(); 

  //..  <<other very important codes>>
}

I changed it to use __shfl like this:

for (int i=0; i<iteration ; i++)
 {    
  int left = sh_mem_read_array[W];
  int up = __shfl_up(left,1);
  int right = __shfl_up(left,2);
  
  int shortest = MIN(left, up);
  shortest = MIN(shortest, right);

  sh_mem_write_array[tx] = shortest + global_Memory_Array[ some_index ];

  __syncthreads(); 

  //..  <<other very important codes>>
}

When I analyse the program, I can't see any advantage between the two versions. Moreover, the first version runs a little bit faster than the second.

(P.S. I know the __shfl implementation doesn't work properly for the first two lanes in the warp, since I didn't check the lane ID. But nevertheless, it isn't faster.)
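For reference, one way the lane-boundary issue could be handled is to fall back to the original shared memory reads for the lanes where __shfl_up returns the calling thread's own value (lanes 0 and 1 for deltas 1 and 2). This is only a hedged sketch reusing the names from the snippets above (sh_mem_read_array, W, E, tx), not a tested fix:

```cuda
// Sketch: guard the warp-boundary lanes, for which __shfl_up with
// deltas 1 and 2 would just return the lane's own value.
int laneid = threadIdx.x & 31;

int left  = sh_mem_read_array[W];
int up    = __shfl_up(left, 1);
int right = __shfl_up(left, 2);

// Fall back to the shared memory reads from the original version
// where the shuffle cannot reach across the warp boundary.
if (laneid < 1) up    = sh_mem_read_array[tx];
if (laneid < 2) right = sh_mem_read_array[E];
```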

Note that shuffle operations aren’t much faster than a single, non-bank-conflicted shared memory read:

http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#maximize-instruction-throughput

In your case, you have replaced two shared memory reads with two shuffle operations.

Since shuffle operations are moving data between threads, a more valid (functional) comparison is to compare a single shuffle operation with a single (indexed) shared memory write followed by a single (indexed) shared memory read. This functionally performs the same thing (moving data from one thread to another) and in this scenario, a single shuffle is usually noticeably faster than performing 2 shared memory operations.
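To make that comparison concrete, here is a hedged sketch of the two idioms side by side (the kernel name and buffers are illustrative, and it assumes a single-warp block; on CUDA 9+ you would use __shfl_sync with a mask):

```cuda
// Sketch comparing the two ways of moving a value between threads
// of one warp. Launch with 32 threads (one warp).
__global__ void exchange_demo(const int *in, int *out, int src_lane)
{
    __shared__ int buf[32];
    int v = in[threadIdx.x];

    // (a) shared-memory route: one indexed store plus one indexed load
    buf[threadIdx.x] = v;
    __syncthreads();
    int via_shared = buf[src_lane];

    // (b) shuffle route: a single instruction performs the same move
    int via_shfl = __shfl(v, src_lane);   // __shfl_sync(mask, v, src_lane) on CUDA 9+

    out[threadIdx.x] = via_shared + via_shfl;  // keep both results live
}
```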

@txbob
Thanks for the quick answer.

You are absolutely right. So, can we figure out when we should use the __shfl instruction?

My other related question is: how many cycles does __shfl take? (Since it is an instruction, I am thinking only one cycle. Or is it a special instruction?)

Most GPU instructions are pipelined. So the way to think about them from a throughput standpoint is not to ask how long they take, but how many of these types of operations can be issued per clock cycle (or retired per clock cycle - the same).

In that case the throughput info is in the table I already linked. For example, a cc 3.0 SM can issue up to 192 SP floating-point adds (or multiply-adds) in a single cycle. Since instructions are issued per warp, that translates to up to 6 warps of SP FP ops issued in a single cycle.

On the other hand, the corresponding throughput number for shuffle is 32. That means only one warp of shuffle operations can be issued per cycle.

In general, shuffle is pretty fast, when compared to back-to-back shared memory swizzled store then load (to accomplish shuffling). In other cases, I can’t make generalizations about when to use it. You’ll need to compare two different realizations, and the specifics of those realizations will determine which is faster.
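One case where the comparison clearly favors shuffle is a warp-level reduction, where each step would otherwise need a shared store, a barrier, and a shared load. A hedged sketch (my own illustration, not code from this thread; on CUDA 9+ the _sync variants apply):

```cuda
// Sketch: warp-wide minimum via butterfly shuffles. Each of the five
// steps replaces a shared store + __syncthreads() + shared load.
__device__ int warp_min(int v)
{
    for (int offset = 16; offset > 0; offset >>= 1) {
        int other = __shfl_xor(v, offset);  // __shfl_xor_sync on CUDA 9+
        v = min(v, other);
    }
    return v;  // every lane now holds the warp minimum
}
```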

In this particular case, you replaced two shared memory loads with two shuffle operations, and the result was not faster, according to your report.

You can also use the profilers to help.


On Maxwell, both SHFL and LDS have a latency of about 24 clocks. An STS followed by an LDS has a latency of only about 30 clocks (this is warp synchronous, so no BAR.SYNCs). So to see a benefit you'll want to be able to remove a store to shared as well as a load, and even so it looks like a 25% bump is the most you should expect. Then again, depending on how much ILP you have around, you might be hiding that extra latency and not see any speedup at all (these instructions are both synchronized by barriers, and hence you can continue calculating while they fetch in the background).

And if you’re curious, more latency values:

LDS.U.64: 31 clocks

STS.64
LDS.U.64: 39 clocks

LDS.U.128: 44 clocks

STS.128
LDS.U.128: 56 clocks

These numbers aren’t 100% accurate because of time slicing issues, but they should be pretty close.

And txbob brought up the other important performance metric of throughput. On Maxwell, both LDS and SHFL have a throughput of 32 (or a quarter that of arithmetic). This is provided you don’t have any bank conflicts in your shared access. That will cut the throughput down for each conflict.

LDS.64 also has a 32 throughput.
LDS.128 has a 12.5 throughput.

STS numbers are the same as LDS.

LDG throughput by comparison is 12.5.

Thank you for all the quick answers.

@scottgray
Thank you for the metrics. But how did you perform these measurements? I couldn't do instruction-level profiling with nvprof.

By the way, when I checked the nvprof metrics I saw "Inter-Thread Instructions". Does this metric count how many instructions were executed because of __shfl? When I use __shfl I can see some instructions counted by this metric; if I don't use __shfl, I haven't seen anything from it.

For latencies I wrote a little benchmark in SASS assembly, like so:

--:-:-:-:1      CS2R clock1, SR_CLOCKLO;
--:-:1:-:2      SHFL.BFLY PT, result, a, 0x1, 0x1f;
01:-:-:-:6      CS2R clock2, SR_CLOCKLO;
--:-:-:-:1      IADD clock1, clock2, -clock1;

Then launch just one warp to minimize time slicing issues.

https://code.google.com/p/maxas/

For throughput I drop a bunch of the same instructions in a big loop and run that loop a bunch of times. Then I compare the measured throughput to the theoretical 128 instructions issued per clock per SM.
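A rough CUDA-C analogue of that kind of probe can be written with clock64(), though the compiler is free to reorder code around the timestamps, so this only approximates what hand-written SASS measures. A hedged sketch (kernel and names are my own; launch a single warp to minimize time-slicing effects):

```cuda
// Sketch: approximate latency probe for a shuffle instruction.
// Only indicative -- the SASS-level approach above is more precise.
__global__ void shfl_latency(int *result, long long *cycles)
{
    int v = threadIdx.x;
    long long t0 = clock64();
    v = __shfl_xor(v, 1);          // instruction under test
    long long t1 = clock64();
    result[threadIdx.x] = v;       // keep the shuffle live
    if (threadIdx.x == 0) *cycles = t1 - t0;
}
```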

Not sure what “Inter-Thread Instructions” measures. Shfl might fit that description… or syncthreads maybe…