I’ve got the following code, which does some computation to obtain a vector y. It is bottlenecked by global memory bandwidth, so I was trying to accelerate it by reading the data using float4. However, just reading the data somehow leads to parts of the result vector y being zero instead of the correct result, although the float4 data is not used anywhere. (I write to an unused vector to prevent the compiler from removing the read, but the #if block should otherwise be a no-op.)
Is this a compiler bug or am I doing something incorrectly?
(Very sorry that I could not reduce the code further, but removing seemingly innocuous parts of the code made the bug go away.)
#define ACTIVATE_BUG 1

#include <cuda.h>
#include <stdio.h>

#define N_CONSECUTIVE 8

__global__ void matvec_gpu(const float *diagonals, const int *offsets, const float *x, float *y, int n, int n_diags, float *unused){
    int i = (threadIdx.x + blockIdx.x * blockDim.x) * N_CONSECUTIVE;
    int max_offset = offsets[n_diags - 1];
    if (i >= max_offset + N_CONSECUTIVE && i < n - max_offset - N_CONSECUTIVE){
        float s[N_CONSECUTIVE] = {0};
        for (int j = 0; j < n_diags; j++){
            int offset = offsets[j];
            const float *diag = diagonals + j * n;
            float *dst = s;
            const float *a = diag + i;
            const float *b = x + i + offset;
            intptr_t a_align = (intptr_t)a & 15;
            intptr_t b_align = (intptr_t)b & 15;
            switch (a_align | (b_align << 8)){
            case 0:
                for (int j = 0; j < N_CONSECUTIVE; j += 4){
                    float4 b4 = *(const float4*)(b + j);
#if ACTIVATE_BUG
                    // Reading 'b4' somehow corrupts 'y'.
                    // The variable 'unused' is used so the compiler does not remove the read.
                    if (b[j + 2] != b4.z){
                        *unused = 0;
                    }
                    if (b[j + 3] != b4.w){
                        *unused = 0;
                    }
#endif
                }
                for (int k = 0; k < N_CONSECUTIVE; k++){
                    dst[k] += a[k] * b[k];
                }
                break;
            default:{
                for (int k = 0; k < N_CONSECUTIVE; k++){
                    dst[k] += a[k] * b[k];
                }
            }
            }
        }
        for (int k = 0; k < N_CONSECUTIVE; k++)
            y[i + k] = s[k];
    }else{
        for (int k = 0; k < N_CONSECUTIVE; k++){
            if (i >= n) return;
            // Safe variant of the above with bounds checking for offset
            float s = 0;
            for (int j = 0; j < n_diags; j++){
                int offset = offsets[j];
                const float *diag = diagonals + j * n;
                if (i + offset >= 0 && i + offset < n){
                    s += diag[i] * x[i + offset];
                }
            }
            y[i++] = s;
        }
    }
}

int cdiv(int a, int b){
    return (a + b - 1) / b;
}

template <typename T>
struct Memory {
    T *ptr;
    int count;
    Memory(int count){
        this->count = count;
        cudaMallocManaged(&ptr, count * sizeof(T));
    }
    ~Memory(){
        cudaFree(ptr);
    }
    T& operator[](int i){
        return ptr[i];
    }
    const T& operator[](int i) const {
        return ptr[i];
    }
};

int main(){
    int w = 800;
    int h = 575;
    int n = w * h;
    int n_diags = 13;
    Memory<int> offsets(n_diags);
    Memory<float> x(n), y(n), y_expected(n), diagonals(n_diags * n), unused(1);
    for (int i = 0; i < n; i++) x[i] = 1;
    for (int i = 0; i < n_diags * n; i++) diagonals[i] = 1;
    for (int i = 0; i < n_diags; i++) offsets[i] = i + 1;
    for (int i = 0; i < n; i++) y_expected[i] = i >= n - n_diags ? n - i - 1 : n_diags;
    int block_dim = 128;
    int grid_dim = cdiv(n, block_dim * N_CONSECUTIVE);
    matvec_gpu<<<grid_dim, block_dim>>>(diagonals.ptr, offsets.ptr, x.ptr, y.ptr, n, n_diags, unused.ptr);
    cudaDeviceSynchronize();
    for (int i = 0; i < n; i++){
        if (y[i] != y_expected[i]){
            printf("ERROR: y[%d] = %.30f, expected %.30f\n", i, y[i], y_expected[i]);
            return 1;
        }
    }
    printf("ok\n");
    return 0;
}
When I run your code with compute-sanitizer I get lots of errors. It may be possible to study those to get more clarity.
Example:
========= Invalid __global__ read of size 8 bytes
========= at 0xf10 in /root/bobc/t295.cu:30:matvec_gpu(const float *, const int *, const float *, float *, int, int, float *)
========= by thread (35,0,0) in block (177,0,0)
========= Address 0x7f766a0b1674 is misaligned
========= and is inside the nearest allocation at 0x7f766a000200 of size 1,840,000 bytes
It may be a clue as to what is going on. The word “misaligned” in the output above is usually indicating that you have violated a natural alignment rule.
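Natural alignment means that an N-byte access must target an address that is a multiple of N, so a 16-byte float4 load requires a 16-byte-aligned address. A minimal check might look like this (a sketch, with a hypothetical helper name):

#include <cstdint>
// Sketch: true if 'p' meets the 16-byte natural alignment a float4 load needs.
static bool aligned16(const void *p){
    return ((uintptr_t)p & 15) == 0;
}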
This is line 30:
float4 b4 = *(const float4*)(b + j);
So it’s curious that presumably a 16-byte read would be triggered, but the compiler has generated code that is only reading 8 bytes. This can happen if the compiler determines that some of the bytes are “unused”/“unneeded”, but it may require some unravelling of SASS code to fully understand. It can also happen if the compiler somehow knows that the structure does not have guaranteed alignment, but I wouldn’t expect that to be the case for use of float4. That structure (provided by the CUDA header files) should have appropriate alignment guarantees.
If I suspected misalignment in a code, another diagnostic approach I might consider is to inspect the numerical values of the pointers actually used in the code that compute-sanitizer is complaining about.
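For example (a sketch of hypothetical instrumentation, placed just before the float4 load in the case 0 branch), a device-side printf of the raw addresses would show directly which pointer ends up misaligned:

// Hypothetical instrumentation before 'float4 b4 = ...' in case 0:
// dump the raw addresses so a misaligned one can be spotted in the output.
if (((intptr_t)(a + j) & 15) || ((intptr_t)(b + j) & 15)){
    printf("i=%d j=%d offset=%d a+j=%p b+j=%p\n",
           i, j, offset, (const void*)(a + j), (const void*)(b + j));
}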
I was able to reproduce the above observation on CUDA 12.2, L4. Another possible diagnostic is to run your code on the latest CUDA version available. Bugs are always possible. Bugs get fixed all the time.
Given that subsequent code only uses the .z and .w components of the float4, an 8-byte load seems appropriate. However, per the output of compute-sanitizer, that 8-byte load is only 4-byte aligned, which means that the guards put into the code to avoid misaligned access are not working properly.
A float4 comprises float components .x, .y, .z, and .w, each of which occupies 4 bytes. Only the two components .z and .w are used here, so a compiler could issue an 8-byte load to access just those, instead of issuing a 16-byte load for the full float4. This would be a perfectly legal optimization by the “as-if” rule (IMHO). Since the address of a float4 must be 16-byte aligned, the derived address of the 8-byte load would automatically be 8-byte aligned, so the optimization would be safe under that aspect.
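Concretely, the effective transformation would be something like this (a sketch, not actual compiler output):

// Sketch: since only .z and .w of the float4 are consumed, the backend can
// legally narrow the 16-byte load
//     float4 b4 = *(const float4*)(b + j);
// to an 8-byte load of just the z/w pair:
float2 zw = *(const float2*)(b + j + 2); // 16-byte-aligned base + 8 bytes => 8-byte aligned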
FWIW, I see only two LDG.E.64 instructions (that is, 8-byte loads) in the generated SASS (machine code), so one might want to fire up a debugger, put breakpoints on those to see which one winds up with the unaligned address, and work backwards from there. There is just too much code to do that casually.
The optimization “generate a partial load for a vector load only some of whose result components are used” appears to occur in the compiler backend, i.e. ptxas, as I see two ld.global.v4.f32 instructions (full float4 loads) at the PTX level.
While compiler bugs are always a possibility, they are rare, so the most likely cause of the misaligned access is an error in the source code. I would suggest reducing to the smallest possible matrix (minimize the number of matrix elements) and the smallest possible launch configuration (minimize the number of threads) that still triggers the issue, then instrumenting the code to dump out the relevant pointer values.
I am not a C++ language lawyer, but I am wondering how “legal” the pointer cast from float* to float4* is. A notion stuck in the back of my head is that in C++ two pointers to different types cannot point to the same data object (the strict aliasing rule), and the compiler may transform code under this assumption. This trivially applies when one is a float* and the other an int32_t*, so classical type punning through pointer casting invokes undefined behavior, which is why memcpy() is the canonical method of bit-wise transfer in classical C++, or std::bit_cast in C++20. How strictly this applies to casting between pointers to “related” types like float4 and float I do not know.
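For this code, the well-defined memcpy() variant would look something like the following sketch (whether it compiles to a single vector load is a separate question, see below):

#include <cstring>   // memcpy (also usable in device code)
// Sketch: bit-wise transfer without the float* -> float4* pointer cast.
float4 b4;
memcpy(&b4, b + j, sizeof b4);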
compute-sanitizer looks like a useful tool. Thanks! Unfortunately, it does not seem to work on Ubuntu 22.04 (“Target application terminated before first instrumented API call”), but I now know what to look for. Maybe this will help me to reduce the test case even more.
My observation may be outdated, but I seem to recall that the CUDA compiler does not always emit the highest performing code for memcpy() in device code. I think CUDA already supports various C++20 features such as std::bit_cast, so you might want to give that a try as well.
It is probably worthwhile to look at the generated machine code (SASS), e.g. by looking at the output of cuobjdump --dump-sass. If the vector type float4 is used for performance reasons, a transfer via memcpy() or std::bit_cast may simply be split into multiple narrower loads under the hood, (partially) negating any positive performance effects one was hoping for from the use of vector types.
Yes, I suspect the use of memcpy will throw out any benefit you were hoping to achieve by cleverly using vector loads.
I have certainly used compute-sanitizer successfully on Ubuntu 22.04. I’m not certain what the issue may be; more info would probably be needed. It’s possible the GPU you are using could be an issue; you might need to use cuda-memcheck instead if you are on a very old GPU.
Oh, you are right, memcpy results in horrible SASS with 16 individual 8-bit loads. Is there some canonical way to get wider loads without resorting to PTX?
Yeah, that is kind of what I remembered. I am not a compiler engineer, but from what I observe from compilers targeting x86-64 and ARM64, it should be possible for the compiler to optimize this by treating memcpy() as a well-known function and observing that it receives a float*, meaning it is safe to copy in 32-bit chunks. Consider filing an enhancement request / performance bug with NVIDIA.
I do not want to sound like a broken record, but have you tried std::bit_cast? It is entirely possible that this simply maps back to memcpy() under the hood but seems worth trying.
If the compiler cannot produce a faster bit transfer, but use of the vector type provides advantages in your code besides potentially producing a vector load instruction, you could always copy component-wise with your own code, as a workaround.
I have not yet figured out how to use std::bit_cast. To me, it looks like the source and destination types must be the same size, but float4 is 16 bytes while a single float is 4 bytes. With memcpy, I could just use the pointers and did not have that issue. Am I supposed to create my own my_float4 struct and cast from it to float4, or is there a more straightforward way?
Sorry, I have not advanced to C++20 yet and have never tried std::bit_cast. It is entirely possible that std::bit_cast is only useful for transfers between same-size data types, i.e. for __int_as_float() and __float_as_int() kind of functionality.
I don’t see what advantage using one’s own struct would have over use of the pre-defined vector type. To copy component-wise in the obvious way, you would simply write something like the following sketch:
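// Sketch: component-wise copy using the pre-defined float4 directly;
// the compiler may or may not fuse these four loads into one vector load.
float4 b4;
b4.x = b[j + 0];
b4.y = b[j + 1];
b4.z = b[j + 2];
b4.w = b[j + 3];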
I was thinking of first casting float* to my_float4* to make it the same size as float4, and then casting that to float4 with std::bit_cast to hint to the compiler not to do evil things with it. Something like this (a sketch with a hypothetical helper; untested, since it needs C++20):
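#include <bit>   // std::bit_cast (C++20)
struct my_float4 { float x, y, z, w; };
// Sketch of the idea (untested): load through the custom 16-byte struct,
// then bit_cast the value to the built-in float4. Note the initial pointer
// cast is still present, so the aliasing question may remain.
__device__ float4 load_via_bitcast(const float *p){
    my_float4 tmp = *(const my_float4*)p;
    return std::bit_cast<float4>(tmp);
}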
But it looks like my CUDA toolkit does not support C++20 yet, and I do not want to break my system by upgrading, so I cannot try std::bit_cast, unfortunately.
I can totally understand that. I also upgrade toolchains reluctantly and would rather work around known problems with the toolchain I am already using than discover unknown new issues with a newer one.
Per the CUDA Programming Guide:
“All C++20 language features are supported in nvcc version 12.0 and later, subject to restrictions.”
So if you are using a CUDA version < 12.0 there is no support for C++20. I am currently on CUDA 12.3, but I have simply not gotten around to trying any C++20 specific functionality. Keeping up with the latest language features is generally low on my list of priorities.