I ran into an issue where the compiler’s output does not match my expectation. The following is a minimized reproducer (a live repro is at Compiler Explorer link):
// includes omitted
inline __device__ uint32_t swap_endian(uint32_t x)
{
return __byte_perm(x, uint32_t{0}, uint32_t{0x0123});
}
constexpr int block_size = 256;
__global__ void kernel(uint8_t* data)
{
__shared__ ulonglong4 buffer[block_size];
ulonglong4* buf_long = &(buffer[threadIdx.x]);
uint4* buf_int = &(reinterpret_cast<uint4*>(buf_long)[0]);
*buf_int = *reinterpret_cast<uint4*>(data);
buf_int->x = swap_endian(buf_int->x);
buf_int->y = swap_endian(buf_int->y);
buf_int->z = swap_endian(buf_int->z);
buf_int->w = swap_endian(buf_int->w);
if (threadIdx.x == 0 && blockIdx.x == 0) {
for (int i = 0; i < sizeof(uint4) / sizeof(uint32_t); ++i) {
printf("0x%08x\n", reinterpret_cast<uint32_t*>(buffer)[i]);
}
}
}
int main()
{
constexpr size_t num_bytes = block_size * sizeof(ulonglong4);
uint8_t* h_data = reinterpret_cast<uint8_t*>(std::malloc(num_bytes));
for (size_t i = 0; i < num_bytes; ++i) {
h_data[i] = i;
}
uint8_t* d_data = nullptr;
cudaMalloc(&d_data, num_bytes);
cudaMemcpy(d_data, h_data, num_bytes, cudaMemcpyHostToDevice);
kernel<<<1, block_size, 0, 0>>>(d_data);
cudaFree(d_data);
std::free(h_data);
}
The point of this code is to load 128 bytes at a time into shared memory, and perform a byte permutation on the loaded values. Against my expectation, this code prints out
0x00010203
0x04050607
0x04050607
0x0c0d0e0f
Inspecting the generated PTX, the relevant section looks something like
mov.u64 %SPL, __local_depot0;
cvta.local.u64 %SP, %SPL;
ld.param.u64 %rd1, [_Z6kernelPh_param_0];
cvta.to.global.u64 %rd2, %rd1;
mov.u32 %r1, %tid.x;
shl.b32 %r2, %r1, 5;
mov.u32 %r3, _ZZ6kernelPhE6buffer;
add.s32 %r4, %r3, %r2;
ld.global.v4.u32 {%r5, %r6, %r7, %r8}, [%rd2];
mov.u32 %r12, 291;
mov.u32 %r13, 0;
prmt.b32 %r14, %r8, %r13, %r12;
prmt.b32 %r15, %r6, %r13, %r12;
prmt.b32 %r16, %r5, %r13, %r12;
st.shared.v4.u32 [%r4], {%r16, %r15, %r15, %r14};
mov.u32 %r17, %ctaid.x;
or.b32 %r18, %r1, %r17;
setp.ne.s32 %p1, %r18, 0;
@%p1 bra $L__BB0_2;
It seems to me that 128 bytes are read from global memory into registers, and only three byte permutes are performed, and one value is saved twice to shared memory.
From what I understand, the code breaks C++ strict aliasing rules. However, similar instances can also be found in the CUDA Samples, to similarly load multiple bytes at once. I am unsure about aliasing the ulonglong4*
as a uint4*
though. Crucially, it seems that using a shared memory array of uint4
s does not have the same problem (but then again, removing the byte permutes also generates correct PTX).
My question is now: is this a bug in the compiler, or does this code produce undefined behavior (that is not supported by the compiler)? If it is indeed a bug, I’ll file a report.
Edit:
NVCC version: Cuda compilation tools, release 12.6, V12.6.77, Build cuda_12.6.r12.6/compiler.34841621_0
Driver version: 560.35.03
GPU: NVIDIA GeForce RTX 2070
OS: Ubuntu 22.04.1 x86_64