Performance comparison between __shfl() and __shfl_sync()

On V100 (sm_70), __shfl() has been replaced by the new intrinsic __shfl_sync(), and our program's performance got worse. What are the advantages of __shfl_sync() compared with __shfl()?
I used the following demo to test this:
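#define OLD 1   // 1: legacy __shfl(), 0: __shfl_sync()

__global__ void TEST(float* reidx)
{
    const int wChanIdex = blockIdx.x + blockIdx.y * gridDim.x;
    const int tid = threadIdx.x;
    float tmp = 10.0f;
    // Loop 1: accumulate the value broadcast from lane 0 into a register.
    for (int i = 0; i < 9000000; i++)
    {
#if OLD
        tmp = __shfl(tid, 0) + tmp;
#else
        tmp = __shfl_sync(0xffffffff, tid, 0) + tmp;
#endif
    }
    // Loop 2: write the broadcast value plus tmp to global memory.
    for (int i = 0; i < 9000000; i++)
    {
#if OLD
        reidx[wChanIdex] = __shfl(tid, 0) + tmp;
#else
        reidx[wChanIdex] = __shfl_sync(0xffffffff, tid, 0) + tmp;
#endif
    }
}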
When OLD is 1, the kernel takes 82.69 ms.
When OLD is 0, the kernel takes 87.33 ms.

First of all, the old shuffle (__shfl()) is deprecated. Furthermore, when targeting cc 7.0 and above it is no longer available, so you're not supposed to use it on cc 7.0 or later architectures. Second, the __shfl_sync() operation has guaranteed, predictable semantics in the presence of incidental warp divergence; the previous version does not. If you want to know what the _sync version does, please read the description in the programming guide.
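To make the point about divergence concrete, here is a minimal sketch of a warp-wide broadcast with __shfl_sync(). The kernel name broadcastLane0 and the host helper countLanesWithLane0Value are hypothetical, chosen only for illustration, and the sketch assumes a single warp of 32 threads with all lanes named in the mask.

```cuda
#include <cuda_runtime.h>

// Every lane reads lane 0's value. The mask 0xffffffff names all 32 lanes as
// participants in the shuffle.
__global__ void broadcastLane0(int* out)
{
    const int lane = threadIdx.x & 31;
    int value = lane * 10 + 7;          // lane 0 holds 7

    // Incidental divergence before the shuffle: odd lanes take an extra path.
    if (lane & 1) {
        value += 1;
    }

    // All 32 lanes execute this call with the full mask.
    out[threadIdx.x] = __shfl_sync(0xffffffffu, value, 0);
}

// Hypothetical host helper: launches one warp and returns how many lanes
// received lane 0's value (7), or -1 on a CUDA error.
int countLanesWithLane0Value()
{
    int* dOut = nullptr;
    int hOut[32] = {0};
    if (cudaMalloc(&dOut, sizeof(hOut)) != cudaSuccess) return -1;

    broadcastLane0<<<1, 32>>>(dOut);
    cudaError_t err = cudaMemcpy(hOut, dOut, sizeof(hOut), cudaMemcpyDeviceToHost);
    cudaFree(dOut);
    if (err != cudaSuccess) return -1;

    int matches = 0;
    for (int i = 0; i < 32; ++i) matches += (hOut[i] == 7);
    return matches;
}
```

Calling countLanesWithLane0Value() from a host program should report that all 32 lanes saw lane 0's value, since every lane named in the mask reaches the shuffle.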

Thanks very much for your reply.

On the same code, compiling __shfl() for compute_60 performs better than compiling __shfl_sync() for compute_70, and the results are identical.

When our optimized code is ported from Pascal to the new Volta architecture, performance declines if all shuffle instructions are replaced with their _sync() variants, which we don't want to see. We want to know the cause of this problem.

We also read the description in the programming guide, but it does not mention performance.

When the loop iteration count is changed from 9000000 to 1 and the result is inspected with "cuobjdump -sass", the SASS code generated with OLD=1 and OLD=0 is exactly the same, and the performance is the same.

When the loop runs 9000000 times, the SASS code generated with OLD=1 differs from that generated with OLD=0.
With OLD=1, the loop is unrolled 30 times.
With OLD=0, the loop is unrolled 6 times. Everything else is basically the same.
The performance with OLD=1 is better than with OLD=0.
So does a higher unroll factor mean better performance?

Yes, in your concrete example, the higher the unroll factor, the better the performance. This is because the compiler will optimize away most of the shuffle instructions. For example, when the unroll factor is 30, there will be only 1 shuffle per 30 iterations. You can manually set the unroll factor using #pragma unroll 30.
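The #pragma unroll suggestion could be applied to the first loop of the demo roughly as follows. This is only a sketch: the kernel name shuffleLoopUnrolled, the runtime iterations parameter, and the host helper runUnrolledLoop are assumptions added for illustration (the original demo uses a constant trip count of 9000000), and the unroll factor the compiler finally applies should still be checked in the SASS.

```cuda
#include <cuda_runtime.h>

__global__ void shuffleLoopUnrolled(float* result, int iterations)
{
    const int tid = threadIdx.x;
    float tmp = 10.0f;

    // Request the unroll factor that the compiler picked for the old intrinsic.
    #pragma unroll 30
    for (int i = 0; i < iterations; ++i) {
        // Same expression as the demo: broadcast lane 0's tid (which is 0).
        tmp = __shfl_sync(0xffffffffu, tid, 0) + tmp;
    }
    result[blockIdx.x * blockDim.x + tid] = tmp;
}

// Hypothetical host helper: runs one warp and returns thread 0's result,
// or -1.0f on a CUDA error.
float runUnrolledLoop(int iterations)
{
    float* dResult = nullptr;
    float hResult[32] = {0};
    if (cudaMalloc(&dResult, sizeof(hResult)) != cudaSuccess) return -1.0f;

    shuffleLoopUnrolled<<<1, 32>>>(dResult, iterations);
    cudaError_t err = cudaMemcpy(hResult, dResult, sizeof(hResult), cudaMemcpyDeviceToHost);
    cudaFree(dResult);
    return (err == cudaSuccess) ? hResult[0] : -1.0f;
}
```

Because lane 0's tid is 0, the accumulated value stays at 10.0f regardless of the iteration count, which is also why the compiler is free to drop most of the shuffles here.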

You are not measuring shuffle performance when using different unroll factors.
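One way to get closer to measuring the shuffle itself is to make each shuffle depend on the result of the previous one, so the compiler cannot hoist it out of the loop or merge iterations. The sketch below is an assumption about how such a test could look, not something from the original demo; shuffleChain and runShuffleChain are hypothetical names.

```cuda
#include <cuda_runtime.h>

__global__ void shuffleChain(float* result, int iterations)
{
    const int lane = threadIdx.x & 31;
    float tmp = static_cast<float>(lane);

    for (int i = 0; i < iterations; ++i) {
        // The shuffled operand is tmp itself, read from the neighbouring lane,
        // so every iteration needs a fresh shuffle.
        tmp = __shfl_sync(0xffffffffu, tmp, (lane + 1) & 31) + 1.0f;
    }
    result[threadIdx.x] = tmp;
}

// Hypothetical host helper: runs one warp and returns lane 0's result,
// or -1.0f on a CUDA error.
float runShuffleChain(int iterations)
{
    float* dResult = nullptr;
    float hResult[32] = {0};
    if (cudaMalloc(&dResult, sizeof(hResult)) != cudaSuccess) return -1.0f;

    shuffleChain<<<1, 32>>>(dResult, iterations);
    cudaError_t err = cudaMemcpy(hResult, dResult, sizeof(hResult), cudaMemcpyDeviceToHost);
    cudaFree(dResult);
    return (err == cudaSuccess) ? hResult[0] : -1.0f;
}
```

After k iterations, lane l holds ((l + k) mod 32) + k, so the result can be checked on the host while the loop still performs one shuffle per iteration.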


Thanks for your help, it's very useful.