[q] mask in __shfl_sync()

byoungseon.jeon · August 29, 2024, 6:31pm

Hi,
I am studying device functions like __shfl_sync()

For an input array like:
50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81

__global__ void test_shfl_broadcast_32(int * in, int *out)
{
	int x = in[threadIdx.x];
	int y = __shfl_sync(0xffffffff, x, 3, 32);
	out[threadIdx.x] = y;
}

returns
53,53,53,53,53,53,53,53,53,53,53,53,53,53,53,53,53,53,53,53,53,53,53,53,53,53,53,53,53,53,53,53
as broadcasted 3rd lane (or 3rd thread as grid=1 block=32) - as shown in the cuda documents.

Now I apply little different mask, expecting a half of array will not be broadcasted (or might be undefined):

__global__ void test_shfl_broadcast_32_masked (int * in, int *out)
{
    int x = in[threadIdx.x];
    int y = __shfl_sync(0x0f0f0f0f, x, 3, 32);
    out[threadIdx.x] = y;
}

but still yields 53,53,53,53,53,53,53,53,53,53,53,53,53,53,53,53,53,53,53,53,53,53,53,53,53,53,53,53,53,53,53,53
.
Tried different masks like 0x00000000 or 0xabcdffff or 0x0000ffff but still same.
In case of undefined behavior, I initialized x/y as 0 but still no differences.

I expected MASK may control which lane will allow the operation or not, yielding different array value.
Briefly, can I get results of 50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,53,53,53,53,53,53,53,53,53,53,53,53,53,53,53,53 using MASK control?
Any comments are appreciated.
B.

PS. I am using Quadro RTX 5000 + cuda 12.3 on RockyOS8.9

Robert_Crovella · August 29, 2024, 6:40pm

No, it does not do that, i.e. it should not be used for that.

Results from a source lane that is not participating (i.e. not represented in the mask) are UB.
Results depending on a destination lane that is not participating (i.e. not represented in the mask) are UB.

If you desire an accurate result for a particular lane, that lane plus any source lane it depends on must both be represented in the mask.

The results for lanes that don’t meet these criteria are UB. Depending on UB for any specific outcome is incorrect.

There is an exception for divergent code that has sufficient/matching masks, but I assume you’re not asking about this.

striker159 · August 29, 2024, 6:43pm

You do not need to modify the mask. You can use the full mask, and conditionally use the return value of the shuffle operation.

    int x = in[threadIdx.x];
    int y = __shfl_sync(0xFFFFFFFF, x, 3, 32);
    if(threadIdx.x % 32 >= 16){
       y = x;
    }
    out[threadIdx.x] = y;

byoungseon.jeon · August 29, 2024, 6:49pm

Hi,
Thanks for the quick response.
So we may use MASK to define the range of valid lines - we will use 0xFFFFFFFF only then.
Additional thread-operation will suffice for our purpose.
We appreciate your help and the sample code.

B.

Curefab · August 30, 2024, 10:22am

@striker159 's response was good. Just adding another way:

int x = in[threadIdx.x];
int rcvidx = (threadIdx.x & 0x0f0f0f0f) ? 3 : (threadIdx.x % 32); // ternary conditional
int y = __shfl_sync(0xFFFFFFFF, x, rcvidx, 32);
out[threadIdx.x] = y;

The lane you are receiving from does not have to be a constant over all threads.

system · September 13, 2024, 10:22am

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.