Do I need to write my own warp-wide broadcasting function or will __shfl handle it efficiently?

I recall reading about the __shfl() functions that when all threads in a warp call __shfl(var, i) with the same value of i, the result is a broadcast. But does that just mean it is TECHNICALLY a broadcast, or can I count on the compiler to be smart enough to know that it should do something like the following:

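int tX;

// <assign some value to tX>
// <only the value of tX in thread 0 of the warp matters>

// Doubling steps: the deltas have to increase so that each step only
// copies from lanes that already hold thread 0's value.
tX = __shfl_up(tX, 1);
tX = __shfl_up(tX, 2);
tX = __shfl_up(tX, 4);
tX = __shfl_up(tX, 8);
tX = __shfl_up(tX, 16);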


int tX;

// populate each warp with test data, assuming a 1D thread block
if (threadIdx.x % 32 == 0) tX = 1234; else tX = 2345;

tX = __shfl(tX, 0);

This does the job and requires exactly one shuffle instruction. All threads of the warp will contain the value 1234 afterwards. There is no need for the compiler to recognize or optimize anything. The shuffle instruction does what it does, and it does so efficiently.
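For anyone who wants to try this on a current toolkit, here is a minimal sketch of the same broadcast. It assumes CUDA 9 or later, where the legacy __shfl used above is deprecated in favor of __shfl_sync with an explicit lane mask. The names broadcastKernel and runBroadcast are made up for this example, the block size is assumed to be a multiple of 32, and error checking is left out for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Every warp broadcasts the value held by lane `srcLane` to all of its lanes.
// Assumes a 1D thread block whose size is a multiple of 32.
__global__ void broadcastKernel(int *out, int srcLane)
{
    int lane = threadIdx.x % 32;                 // lane index within the warp
    int warp = threadIdx.x / 32;                 // warp index within the block
    int tX = 1000 * warp + lane;                 // distinct test value per lane

    // One shuffle: every lane reads tX from lane `srcLane` of its own warp.
    tX = __shfl_sync(0xffffffffu, tX, srcLane);

    out[threadIdx.x] = tX;
}

// Hypothetical host helper: runs one block of `threads` threads and
// returns the per-thread results. No error checking, for brevity.
std::vector<int> runBroadcast(int threads, int srcLane)
{
    std::vector<int> host(threads);
    int *dev = nullptr;
    cudaMalloc(&dev, threads * sizeof(int));
    broadcastKernel<<<1, threads>>>(dev, srcLane);
    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaMemcpy(host.data(), dev, threads * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    return host;
}
```

Under these assumptions, runBroadcast(64, 5) should return 5 for every thread of the first warp and 1005 for every thread of the second, since each warp only sees its own lane 5.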

In hardware, there is a mixing engine that copies data into all lanes as requested in a fixed number of cycles. It is equally fast for shuffle, shuffle_up and shuffle_xor. The throughput of this operation is the same as that of ld/st operations, i.e. usually 1/4 of the throughput of simple ALU operations.

AFAIK, shuffle operations just reuse the shared memory addressing engine, i.e. a shuffle is essentially an st+ld pair, except that the data is never stored in shared memory and is instead intercepted on its way there.

BTW, AMD GCN also supports an “inverse shuffle” operation. That is, the NVIDIA shuffle is the same as

shmem[i] = reg.lane[index.lane[i]]
reg.lane[i] = shmem[i]

and AMD also supports

shmem[i] = reg.lane[i]
reg.lane[index.lane[i]] = shmem[i]
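To make the two patterns concrete, here is a rough sketch that emulates both through real shared memory for a single warp and, for comparison, also runs the gather form through __shfl_sync. This only illustrates the semantics of the pseudocode above; it is not a claim about how the hardware implements shuffles. The names gatherVsScatter, PermuteResult and runPermute are invented for the example. It assumes CUDA 9 or later (for __syncwarp and __shfl_sync), a launch with exactly 32 threads, and that index is a permutation of 0..31, since otherwise the scatter form has colliding writes.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Launch with exactly one warp (32 threads).
__global__ void gatherVsScatter(const int *index, int *gathered,
                                int *scattered, int *viaShuffle)
{
    __shared__ int shmem[32];
    int lane = threadIdx.x;
    int reg = 100 + lane;        // per-lane test value
    int idx = index[lane];       // assumed to be a permutation of 0..31

    // Gather ("NVIDIA shuffle"): lane i ends up with reg of lane index[i].
    shmem[lane] = reg;
    __syncwarp();
    gathered[lane] = shmem[idx];
    __syncwarp();                // finish all reads before shmem is reused

    // Scatter ("inverse shuffle"): lane index[i] ends up with reg of lane i.
    shmem[idx] = reg;
    __syncwarp();
    scattered[lane] = shmem[lane];

    // The gather form is what a single shuffle instruction computes.
    viaShuffle[lane] = __shfl_sync(0xffffffffu, reg, idx);
}

struct PermuteResult { std::vector<int> gathered, scattered, viaShuffle; };

// Hypothetical host helper; `index` must hold 32 entries. No error checking.
PermuteResult runPermute(const std::vector<int> &index)
{
    const size_t bytes = 32 * sizeof(int);
    int *dev = nullptr;
    cudaMalloc(&dev, 4 * bytes);                 // index + three output arrays
    cudaMemcpy(dev, index.data(), bytes, cudaMemcpyHostToDevice);
    gatherVsScatter<<<1, 32>>>(dev, dev + 32, dev + 64, dev + 96);
    PermuteResult r{std::vector<int>(32), std::vector<int>(32), std::vector<int>(32)};
    cudaMemcpy(r.gathered.data(),   dev + 32, bytes, cudaMemcpyDeviceToHost);
    cudaMemcpy(r.scattered.data(),  dev + 64, bytes, cudaMemcpyDeviceToHost);
    cudaMemcpy(r.viaShuffle.data(), dev + 96, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dev);
    return r;
}
```

With index[i] = (i + 1) % 32, the gather result in lane 0 should be lane 1's value (101), while the scatter result in lane 0 should be lane 31's value (131). The viaShuffle array should match gathered element for element.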

Great, so I can count on one __shfl to broadcast to all 32 threads, and to do so more efficiently than five separate shuffles that each order every thread to pull from one of its neighbors.
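If you want to convince yourself that the two approaches give the same result, something like the following sketch could be used. It assumes CUDA 9 or later and therefore uses the _sync variants of the shuffle intrinsics. The names compareBroadcasts and broadcastsAgree are hypothetical, the block size is assumed to be a multiple of 32, and error checking is omitted. The manual version uses deltas 1, 2, 4, 8, 16 so that each step copies only from lanes that already hold lane 0's value.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Assumes a 1D thread block whose size is a multiple of 32.
__global__ void compareBroadcasts(int *oneShuffle, int *fiveShuffles)
{
    int lane = threadIdx.x % 32;
    int tX = (lane == 0) ? 1234 : 2345;          // only lane 0's value matters

    // Single-instruction broadcast from lane 0.
    int a = __shfl_sync(0xffffffffu, tX, 0);

    // Manual doubling broadcast: five shuffles with deltas 1, 2, 4, 8, 16.
    // Lanes below `delta` keep their own value in each step.
    int b = tX;
    for (int delta = 1; delta < 32; delta *= 2)
        b = __shfl_up_sync(0xffffffffu, b, delta);

    oneShuffle[threadIdx.x] = a;
    fiveShuffles[threadIdx.x] = b;
}

// Hypothetical host helper: true if every thread ended up with 1234
// in both variants. No error checking, for brevity.
bool broadcastsAgree(int threads)
{
    std::vector<int> host(2 * threads);
    int *dev = nullptr;
    cudaMalloc(&dev, 2 * threads * sizeof(int));
    compareBroadcasts<<<1, threads>>>(dev, dev + threads);
    cudaMemcpy(host.data(), dev, 2 * threads * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    for (int v : host)
        if (v != 1234) return false;
    return true;
}
```

Both variants should leave 1234 in every lane; the difference is one shuffle instruction versus five.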