Undefined behavior or compiler bug?

I ran into an issue where the compiler’s output does not match my expectations. The following is a minimized reproducer (a live repro is at Compiler Explorer link):

// includes omitted

inline __device__ uint32_t swap_endian(uint32_t x)
{
    return __byte_perm(x, uint32_t{0}, uint32_t{0x0123});
}

constexpr int block_size  = 256;

__global__ void kernel(uint8_t* data)
{
    __shared__ ulonglong4 buffer[block_size];

    ulonglong4* buf_long = &(buffer[threadIdx.x]);
    uint4* buf_int       = &(reinterpret_cast<uint4*>(buf_long)[0]);
    *buf_int             = *reinterpret_cast<uint4*>(data);

    buf_int->x = swap_endian(buf_int->x);
    buf_int->y = swap_endian(buf_int->y);
    buf_int->z = swap_endian(buf_int->z);
    buf_int->w = swap_endian(buf_int->w);

    if (threadIdx.x == 0 && blockIdx.x == 0) {
        for (int i = 0; i < sizeof(uint4) / sizeof(uint32_t); ++i) {
            printf("0x%08x\n", reinterpret_cast<uint32_t*>(buffer)[i]);
        }
    }
}

int main()
{
    constexpr size_t num_bytes = block_size * sizeof(ulonglong4);

    uint8_t* h_data = reinterpret_cast<uint8_t*>(std::malloc(num_bytes));
    for (size_t i = 0; i < num_bytes; ++i) {
        h_data[i] = i;
    }

    uint8_t* d_data = nullptr;
    cudaMalloc(&d_data, num_bytes);
    cudaMemcpy(d_data, h_data, num_bytes, cudaMemcpyHostToDevice);

    kernel<<<1, block_size, 0, 0>>>(d_data);

    cudaFree(d_data);

    std::free(h_data);
}

The point of this code is to have each thread load 16 bytes (128 bits) at a time into shared memory and perform a byte permutation on the loaded values. Contrary to my expectation, this code prints out

0x00010203
0x04050607
0x04050607
0x0c0d0e0f

Inspecting the generated PTX, the relevant section looks something like

	mov.u64 	%SPL, __local_depot0;
	cvta.local.u64 	%SP, %SPL;
	ld.param.u64 	%rd1, [_Z6kernelPh_param_0];
	cvta.to.global.u64 	%rd2, %rd1;
	mov.u32 	%r1, %tid.x;
	shl.b32 	%r2, %r1, 5;
	mov.u32 	%r3, _ZZ6kernelPhE6buffer;
	add.s32 	%r4, %r3, %r2;
	ld.global.v4.u32 	{%r5, %r6, %r7, %r8}, [%rd2];
	mov.u32 	%r12, 291;
	mov.u32 	%r13, 0;
	prmt.b32 	%r14, %r8, %r13, %r12;
	prmt.b32 	%r15, %r6, %r13, %r12;
	prmt.b32 	%r16, %r5, %r13, %r12;
	st.shared.v4.u32 	[%r4], {%r16, %r15, %r15, %r14};
	mov.u32 	%r17, %ctaid.x;
	or.b32  	%r18, %r1, %r17;
	setp.ne.s32 	%p1, %r18, 0;
	@%p1 bra 	$L__BB0_2;

It seems to me that 16 bytes are read from global memory into registers, but only three byte permutes are performed, and one permuted value (%r15) is stored twice to shared memory.

From what I understand, the code breaks C++ strict aliasing rules. However, similar casts can be found in the CUDA Samples, which use them in the same way to load multiple bytes at once. I am unsure about aliasing the ulonglong4* as a uint4*, though. Crucially, using a shared memory array of uint4 instead does not exhibit the problem (but then again, removing the byte permutes also generates correct PTX).

My question is: is this a compiler bug, or does this code invoke undefined behavior (that the compiler does not happen to support)? If it is indeed a bug, I’ll file a report.

Edit:
NVCC version: Cuda compilation tools, release 12.6, V12.6.77, Build cuda_12.6.r12.6/compiler.34841621_0
Driver version: 560.35.03
GPU: NVIDIA GeForce RTX 2070
OS: Ubuntu 22.04.1 x86_64

Your shared memory variable is not volatile, and you are not doing any synchronization. From the compiler’s point of view, the shared memory does not have to be used at all.

I don’t really know what output you are expecting, but if I add __syncthreads() like so:

__syncthreads();
buf_int->x = swap_endian(buf_int->x);
__syncthreads();
buf_int->y = swap_endian(buf_int->y);
__syncthreads();
buf_int->z = swap_endian(buf_int->z);
__syncthreads();
buf_int->w = swap_endian(buf_int->w);

then I witness:

0x00010203
0x04050607
0x08090a0b
0x0c0d0e0f

I haven’t studied the various thread interactions to see if all of those are needed.

Thanks for your replies. Let me clarify: the shared memory is used as a per-thread cache. Each thread has its own ulonglong4, so no synchronization between threads should be needed. Each thread stores a uint4 into its per-thread buffer and prints from that memory location, so the shared memory must be used. Similar results are observed if, instead of printing, the results are written to a global memory buffer and copied back to the host.

I suggest filing a bug.