I’ve got the following code, which does some computation to obtain a vector y. It is bottlenecked by global memory bandwidth, so I was trying to accelerate it by reading the data using float4. However, just reading the data somehow leads to parts of the result vector y being zero instead of the correct result, although the float4 data is not used anywhere. (I write to an unused vector to prevent the compiler from removing the read, but the #if block should otherwise be a no-op.)
Is this a compiler bug or am I doing something incorrectly?
(Very sorry that I could not reduce the code further, but removing seemingly innocuous parts of the code made the bug go away.)
#define ACTIVATE_BUG 1

#include <cuda.h>
#include <stdio.h>

#define N_CONSECUTIVE 8

__global__ void matvec_gpu(const float *diagonals, const int *offsets, const float *x, float *y, int n, int n_diags, float *unused){
    int i = (threadIdx.x + blockIdx.x * blockDim.x) * N_CONSECUTIVE;
    int max_offset = offsets[n_diags - 1];
    if (i >= max_offset + N_CONSECUTIVE && i < n - max_offset - N_CONSECUTIVE){
        float s[N_CONSECUTIVE] = {0};
        for (int j = 0; j < n_diags; j++){
            int offset = offsets[j];
            const float *diag = diagonals + j * n;
            float *dst = s;
            const float *a = diag + i;
            const float *b = x + i + offset;
            intptr_t a_align = (intptr_t)a & 15;
            intptr_t b_align = (intptr_t)b & 15;
            switch (a_align | (b_align << 8)){
            case 0:
                for (int j = 0; j < N_CONSECUTIVE; j += 4){
                    float4 b4 = *(const float4*)(b + j);
#if ACTIVATE_BUG
                    // Reading 'b4' somehow corrupts 'y'.
                    // The variable 'unused' is used so the compiler does not remove the read.
                    if (b[j + 2] != b4.z){
                        *unused = 0;
                    }
                    if (b[j + 3] != b4.w){
                        *unused = 0;
                    }
#endif
                }
                for (int k = 0; k < N_CONSECUTIVE; k++){
                    dst[k] += a[k] * b[k];
                }
                break;
            default:{
                for (int k = 0; k < N_CONSECUTIVE; k++){
                    dst[k] += a[k] * b[k];
                }
            }
            }
        }
        for (int k = 0; k < N_CONSECUTIVE; k++)
            y[i + k] = s[k];
    }else{
        for (int k = 0; k < N_CONSECUTIVE; k++){
            if (i >= n) return;
            // Safe variant of the above with bounds checking for offset
            float s = 0;
            for (int j = 0; j < n_diags; j++){
                int offset = offsets[j];
                const float *diag = diagonals + j * n;
                if (i + offset >= 0 && i + offset < n){
                    s += diag[i] * x[i + offset];
                }
            }
            y[i++] = s;
        }
    }
}

int cdiv(int a, int b){
    return (a + b - 1) / b;
}

template <typename T>
struct Memory {
    T *ptr;
    int count;
    Memory(int count){
        this->count = count;
        cudaMallocManaged(&ptr, count * sizeof(T));
    }
    ~Memory(){
        cudaFree(ptr);
    }
    T& operator[](int i){
        return ptr[i];
    }
    const T& operator[](int i) const {
        return ptr[i];
    }
};

int main(){
    int w = 800;
    int h = 575;
    int n = w * h;
    int n_diags = 13;
    Memory<int> offsets(n_diags);
    Memory<float> x(n), y(n), y_expected(n), diagonals(n_diags * n), unused(1);
    for (int i = 0; i < n; i++) x[i] = 1;
    for (int i = 0; i < n_diags * n; i++) diagonals[i] = 1;
    for (int i = 0; i < n_diags; i++) offsets[i] = i + 1;
    for (int i = 0; i < n; i++) y_expected[i] = i >= n - n_diags ? n - i - 1 : n_diags;
    int block_dim = 128;
    int grid_dim = cdiv(n, block_dim * N_CONSECUTIVE);
    matvec_gpu<<<grid_dim, block_dim>>>(diagonals.ptr, offsets.ptr, x.ptr, y.ptr, n, n_diags, unused.ptr);
    cudaDeviceSynchronize();
    for (int i = 0; i < n; i++){
        if (y[i] != y_expected[i]){
            printf("ERROR: y[%d] = %.30f, expected %.30f\n", i, y[i], y_expected[i]);
            return 1;
        }
    }
    printf("ok\n");
    return 0;
}
When I run your code with compute-sanitizer I get lots of errors. It may be possible to study those to get more clarity.
Example:
========= Invalid __global__ read of size 8 bytes
========= at 0xf10 in /root/bobc/t295.cu:30:matvec_gpu(const float *, const int *, const float *, float *, int, int, float *)
========= by thread (35,0,0) in block (177,0,0)
========= Address 0x7f766a0b1674 is misaligned
========= and is inside the nearest allocation at 0x7f766a000200 of size 1,840,000 bytes
It may be a clue as to what is going on. The word “misaligned” in the output above is usually indicating that you have violated a natural alignment rule.
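Natural alignment means that an N-byte access must target an address that is a multiple of N, so a 16-byte float4 load requires a 16-byte-aligned address. A minimal check might look like this (a sketch, with a hypothetical helper name):

#include <cstdint>
// Sketch: true if 'p' meets the 16-byte natural alignment a float4 load needs.
static bool aligned16(const void *p){
    return ((uintptr_t)p & 15) == 0;
}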
This is line 30:
float4 b4 = *(const float4*)(b + j);
So it’s curious that presumably a 16-byte read would be triggered, but the compiler has generated code that is only reading 8 bytes. This can happen if the compiler determines that some of the bytes are “unused”/“unneeded”, but it may require some unravelling of SASS code to fully understand. It can also happen if the compiler somehow knows that the structure does not have guaranteed alignment, but I wouldn’t expect that to be the case for use of float4. That structure (provided by the CUDA header files) should have appropriate alignment guarantees.
If I suspected misalignment in a code, another diagnostic approach I might consider is to inspect the numerical values of the pointers actually used in the code that compute-sanitizer is complaining about.
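For example (a sketch of hypothetical instrumentation, placed just before the float4 load in the case 0 branch), a device-side printf of the raw addresses would show directly which pointer ends up misaligned:

// Hypothetical instrumentation before 'float4 b4 = ...' in case 0:
// dump the raw addresses so a misaligned one can be spotted in the output.
if (((intptr_t)(a + j) & 15) || ((intptr_t)(b + j) & 15)){
    printf("i=%d j=%d offset=%d a+j=%p b+j=%p\n",
           i, j, offset, (const void*)(a + j), (const void*)(b + j));
}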
I was able to reproduce the above observation on CUDA 12.2, L4. Another possible diagnostic is to run your code on the latest CUDA version available. Bugs are always possible. Bugs get fixed all the time.
Given that subsequent code only uses the .z and .w components of the float4, an 8-byte load seems appropriate. However, per the output of compute-sanitizer, that 8-byte load is only 4-byte aligned, which means that the guards put into the code to avoid misaligned access are not working properly.
A float4 comprises float components .x, .y, .z, and .w, each of which occupies 4 bytes. Only the two components .z and .w are used here, so a compiler could issue an 8-byte load to access just those, instead of issuing a 16-byte load for the full float4. This would be a perfectly legal optimization by the “as-if” rule (IMHO). Since the address of a float4 must be 16-byte aligned, the derived address of the 8-byte load would automatically be 8-byte aligned, so the optimization would be safe under that aspect.
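Concretely, the effective transformation would be something like this (a sketch, not actual compiler output):

// Sketch: since only .z and .w of the float4 are consumed, the backend can
// legally narrow the 16-byte load
//     float4 b4 = *(const float4*)(b + j);
// to an 8-byte load of just the z/w pair:
float2 zw = *(const float2*)(b + j + 2); // 16-byte-aligned base + 8 bytes => 8-byte aligned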
FWIW, I see only two LDG.E.64 instructions (that is, 8-byte loads) in the generated SASS (machine code), so one might want to fire up a debugger, put breakpoints on those to see which one winds up with the unaligned address, and work backwards from there. There is just too much code to do that casually.
The optimization “generate a partial load for a vector load only some of whose result components are used” appears to occur in the compiler backend, i.e. ptxas, as I see two ld.global.v4.f32 instructions (full float4 loads) at the PTX level.
While compiler bugs are always a possibility, they are rare, so the most likely cause of the misaligned access is an error in the source code. I would suggest reducing to the smallest possible matrix (minimize the number of matrix elements) and the smallest possible launch configuration (minimize the number of threads) that still triggers the issue, then instrumenting the code to dump out the relevant pointer values.
I am not a C++ language lawyer, but I am wondering how “legal” the pointer cast from float* to float4* is. A notion stuck in the back of my head is that in C++ two pointers to different types cannot point to the same data object (the strict aliasing rule), and the compiler may transform code under this assumption. This trivially applies when one is a float* and the other an int32_t*, so classical type punning through pointer casting invokes undefined behavior, which is why memcpy() is the canonical method of bit-wise transfer in classical C++, or std::bit_cast in C++20. How strictly this applies to casting between pointers to “related” types like float4 and float I do not know.
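For this code, the well-defined memcpy() variant would look something like the following sketch (whether it compiles to a single vector load is a separate question, see below):

#include <cstring>   // memcpy (also usable in device code)
// Sketch: bit-wise transfer without the float* -> float4* pointer cast.
float4 b4;
memcpy(&b4, b + j, sizeof b4);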
compute-sanitizer looks like a useful tool. Thanks! Unfortunately, it does not seem to work on Ubuntu 22.04 (“Target application terminated before first instrumented API call”), but I now know what to look for. Maybe this will help me to reduce the test case even more.
My observation may be outdated, but I seem to recall that the CUDA compiler does not always emit the highest performing code for memcpy() in device code. I think CUDA already supports various C++20 features such as std::bit_cast, so you might want to give that a try as well.
It is probably worthwhile to look at the generated machine code (SASS), e.g. by looking at the output of cuobjdump --dump-sass. If the vector type float4 is used for performance reasons, a transfer via memcpy() or std::bit_cast may simply be split into multiple narrower loads under the hood, (partially) negating any positive performance effects one was hoping for from the use of vector types.
Yes, I suspect the use of memcpy will throw out any benefit you were hoping to achieve by cleverly using vector loads.
I have certainly used compute-sanitizer successfully on Ubuntu 22.04. I’m not certain what the issue may be; more info would probably be needed. It’s possible the GPU you are using could be an issue; you might need to use cuda-memcheck instead if you are on a very old GPU.
Oh, you are right, memcpy results in horrible SASS with 16 individual 8-bit loads. Is there some canonical way to get wider loads without resorting to PTX?
Yeah, that is kind of what I remembered. I am not a compiler engineer, but from what I observe from compilers targeting x86-64 and ARM64, it should be possible for the compiler to optimize this by treating memcpy() as a well-known function and observing that it receives a float*, meaning it is safe to copy in 32-bit chunks. Consider filing an enhancement request / performance bug with NVIDIA.
I do not want to sound like a broken record, but have you tried std::bit_cast? It is entirely possible that this simply maps back to memcpy() under the hood but seems worth trying.
If the compiler cannot produce a faster bit transfer, but use of the vector type provides advantages in your code besides potentially producing a vector load instruction, you could always copy component-wise with your own code, as a workaround.
I have not yet figured out how to use std::bit_cast. To me, it looks like the source and destination types must be the same size, but float4 is 16 bytes while a single float is 4 bytes. With memcpy, I could just use the pointers and did not have that issue. Am I supposed to create my own my_float4 struct and cast from it to float4, or is there a more straightforward way?
Sorry, I have not advanced to C++20 yet and have never tried std::bit_cast. It is entirely possible that std::bit_cast is only useful for transfers between same-size data types, i.e. for __int_as_float() and __float_as_int() kind of functionality.
I don’t see what advantage using one’s own struct would have over use of the pre-defined vector type. To copy component-wise in the obvious way, you would simply write something like the following sketch:
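// Sketch: component-wise copy using the pre-defined float4 directly;
// the compiler may or may not fuse these four loads into one vector load.
float4 b4;
b4.x = b[j + 0];
b4.y = b[j + 1];
b4.z = b[j + 2];
b4.w = b[j + 3];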
I was thinking of first casting float* to my_float4* to make it the same size as float4, and then casting that to float4 with std::bit_cast to hint to the compiler not to do evil things with it. Something like this (a sketch with a hypothetical helper; untested, since it needs C++20):
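#include <bit>   // std::bit_cast (C++20)
struct my_float4 { float x, y, z, w; };
// Sketch of the idea (untested): load through the custom 16-byte struct,
// then bit_cast the value to the built-in float4. Note the initial pointer
// cast is still present, so the aliasing question may remain.
__device__ float4 load_via_bitcast(const float *p){
    my_float4 tmp = *(const my_float4*)p;
    return std::bit_cast<float4>(tmp);
}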
But it looks like my CUDA toolkit does not support C++20 yet, and I do not want to break my system by upgrading, so I cannot try std::bit_cast, unfortunately.
I can totally understand that. I also upgrade toolchains reluctantly and would rather work around known problems with the toolchain I am already using than discover unknown new issues with a newer one.
Per the CUDA Programming Guide:
“All C++20 language features are supported in nvcc version 12.0 and later, subject to restrictions.”
So if you are using a CUDA version < 12.0 there is no support for C++20. I am currently on CUDA 12.3, but I have simply not gotten around to trying any C++20 specific functionality. Keeping up with the latest language features is generally low on my list of priorities.