What kind of function is __shfl_down_sync

I understand how shuffle operations are used. Nevertheless, I am kind of confused about what kind of function shuffle functions are to the compiler. For instance __shfl_down_sync does not use pass-by-value. Instead, it returns value from a different lane using variable name. This, to me, is not an ordinary C/C++ function. Any explanation?

Think of it like doing a permutation on a vector register with 32 lanes, similar to what _mm256_permutevar8x32 does with 8 lanes in AVX.

It is definitely pass-by-value. Each lane provides its source argument via a pass-by-value parameter. Each lane can retrieve its result (provided by another lane) as the return value of the function.

The pass-by-value argument is unchanged, just as you would expect. And if you assign the result to the same argument that you passed by value, that is not any different than C or C++ function behavior, either.

Oh, so if I use, say, __shfl_down_sync(FULL_MASK, __shfl_down_sync(FULL_MASK, val, offset), offset), it also works in the way of pass-by-value?

It’s a bit confusing to say this:

__shfl_down_sync(FULL_MASK, __shfl_down_sync(FULL_MASK, val, offset), offset);

The innermost shuffle function provides a return value. That return value is assigned to a temporary by the compiler, and then used subsequently in the outermost shuffle (the temporary is not otherwise visible to the program).

With respect to the outermost shuffle, the return value is ignored.

Therefore your code as a whole has no program-visible effect that I can quickly see.

How does that make sense? It doesn’t make sense to me. At least, it does not seem useful. But syntactically, there i nothing wrong with what you have shown, and the argument handling is pass-by-value, both for the innermost function and the outermost function.

sorryffor the confusion. I should write SOME_OTHER_VAR = __shfl_down_sync(FULL_MASK, __shfl_down_sync(FULL_MASK, SOME_VAR, offset), offset);

So this statement is to get SOME_VAR of distance 2*offset (within range of lane), right?

Yes, I believe so.

First the innermost shuffle will publish SOME_VAR from each thread. Then the return value for a given thread will be indicated by the offset (a given thread will select the published value from the thread indicated by offset).

That value obtained above will then be published by each thread in the outermost shuffle. Each thread will capture the published value of the thread indicated by offset, into SOME_OTHER_VAR.

If offset is 1, then the innermost shuffle warp lane 1 would retrieve the published value from warp lane 2.

At the outermost shuffle, the warp lane 0 would retrieve the published value from warp lane 1.

So yes, it it appears for that example, it would be like a single shuffle of 2*offset.

The border/wrap situations would be different, they always are. But most threads in the warp would retrieve the SOME_VAR value from the warp lane indicated by 2*offset.