Hi,
I am studying warp-level device functions such as __shfl_sync().
For an input array like:
50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81
__global__ void test_shfl_broadcast_32(int *in, int *out)
{
    int x = in[threadIdx.x];
    // Full mask, width 32: every lane reads x from lane 3.
    int y = __shfl_sync(0xffffffff, x, 3, 32);
    out[threadIdx.x] = y;
}
returns
53,53,53,53,53,53,53,53,53,53,53,53,53,53,53,53,53,53,53,53,53,53,53,53,53,53,53,53,53,53,53,53
that is, the value held by lane 3 (thread 3, since I launch with grid=1, block=32) is broadcast to every lane, as shown in the CUDA documentation.
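For anyone who wants to reproduce this, here is a minimal host-side sketch of how the kernel can be launched. The helper name run_broadcast and the error handling are only illustrative assumptions, not part of my original test; the launch configuration (1 block of 32 threads) and the input values 50..81 are the ones described above.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Same kernel as above: every lane reads the value held by lane 3.
__global__ void test_shfl_broadcast_32(int *in, int *out)
{
    int x = in[threadIdx.x];
    int y = __shfl_sync(0xffffffff, x, 3, 32);
    out[threadIdx.x] = y;
}

// Hypothetical helper (name is illustrative): copies 32 ints to the device,
// runs a single warp, and copies the 32 results back to the host.
// Returns the first CUDA error encountered, or cudaSuccess.
cudaError_t run_broadcast(const int *h_in, int *h_out)
{
    const size_t bytes = 32 * sizeof(int);
    int *d_in = nullptr;
    int *d_out = nullptr;

    cudaError_t err = cudaMalloc(&d_in, bytes);
    if (err != cudaSuccess) return err;
    err = cudaMalloc(&d_out, bytes);
    if (err != cudaSuccess) { cudaFree(d_in); return err; }

    err = cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        test_shfl_broadcast_32<<<1, 32>>>(d_in, d_out);  // grid=1, block=32
        err = cudaGetLastError();                        // launch errors
    }
    if (err == cudaSuccess) {
        // Blocking copy; also waits for the kernel to finish.
        err = cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);
    }

    cudaFree(d_in);
    cudaFree(d_out);
    return err;
}

int main()
{
    int h_in[32], h_out[32];
    for (int i = 0; i < 32; ++i) h_in[i] = 50 + i;  // 50,51,...,81

    cudaError_t err = run_broadcast(h_in, h_out);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    for (int i = 0; i < 32; ++i)
        printf("%d%s", h_out[i], i < 31 ? "," : "\n");
    return 0;
}
```

Compiled with nvcc and run on a single warp, this prints the value of lane 3 (53) for all 32 lanes.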
Next I applied a slightly different mask, expecting that half of the array would not receive the broadcast value (or would be undefined):
__global__ void test_shfl_broadcast_32_masked(int *in, int *out)
{
    int x = in[threadIdx.x];
    // Same shuffle as before, but with only some lane bits set in the mask.
    int y = __shfl_sync(0x0f0f0f0f, x, 3, 32);
    out[threadIdx.x] = y;
}
but it still yields 53,53,53,53,53,53,53,53,53,53,53,53,53,53,53,53,53,53,53,53,53,53,53,53,53,53,53,53,53,53,53,53.
I tried other masks such as 0x00000000, 0xabcdffff, and 0x0000ffff, but the result is the same.
In case the behavior is undefined, I also initialized x and y to 0, but it made no difference.
I expected the mask to control which lanes are allowed to take part in the operation, so that the output array would differ from lane to lane.
In short, can I get a result such as 50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,53,53,53,53,53,53,53,53,53,53,53,53,53,53,53,53 by controlling the mask?
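To make the target concrete, here is a rough sketch of a kernel that produces this kind of partial broadcast with an explicit per-lane select instead of the mask. It assumes the intended split is that lanes 0-15 keep their own value and lanes 16-31 receive lane 3's value; the kernel name, the helper run_select, and the keep_below parameter are all hypothetical names introduced only for illustration. What I would like to know is whether the mask argument alone can do the same thing.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Sketch: all 32 lanes take part in the shuffle (full mask), then each lane
// chooses between its own value and the broadcast value.
// keep_below is a hypothetical parameter: lanes with index < keep_below keep x.
__global__ void shfl_broadcast_select(const int *in, int *out, int keep_below)
{
    int lane = threadIdx.x & 31;                 // lane index within the warp
    int x = in[threadIdx.x];
    int b = __shfl_sync(0xffffffff, x, 3, 32);   // value held by lane 3
    out[threadIdx.x] = (lane < keep_below) ? x : b;
}

// Hypothetical helper: runs one warp on 32 host ints and copies results back.
cudaError_t run_select(const int *h_in, int *h_out, int keep_below)
{
    const size_t bytes = 32 * sizeof(int);
    int *d_in = nullptr;
    int *d_out = nullptr;

    cudaError_t err = cudaMalloc(&d_in, bytes);
    if (err != cudaSuccess) return err;
    err = cudaMalloc(&d_out, bytes);
    if (err != cudaSuccess) { cudaFree(d_in); return err; }

    err = cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        shfl_broadcast_select<<<1, 32>>>(d_in, d_out, keep_below);
        err = cudaGetLastError();
    }
    if (err == cudaSuccess) {
        // Blocking copy; also waits for the kernel to finish.
        err = cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);
    }

    cudaFree(d_in);
    cudaFree(d_out);
    return err;
}

int main()
{
    int h_in[32], h_out[32];
    for (int i = 0; i < 32; ++i) h_in[i] = 50 + i;  // 50,51,...,81

    cudaError_t err = run_select(h_in, h_out, 16);  // lanes 0-15 keep x
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    for (int i = 0; i < 32; ++i)
        printf("%d%s", h_out[i], i < 31 ? "," : "\n");
    return 0;
}
```

With keep_below = 16 this gives 50..65 for the first sixteen lanes and 53 for the remaining sixteen, which is the shape of output I am after.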
Any comments are appreciated.
B.
P.S. I am using a Quadro RTX 5000 with CUDA 12.3 on Rocky Linux 8.9.