How to get one load operation to load values of different types?

I’d like to load at most 32 4-byte values from global memory into shared memory in one load operation, but the problem is that some of those values are ints and others are floats. I thought of using uchar4 as a type-agnostic 4-byte carrier and using reinterpret_cast, as in the example code below. The problem is that I believe my example code has undefined behavior (the last part of this example indicates so) unless I replace each reinterpret_cast<..>(..) expression with std::launder(reinterpret_cast<..>(..)), which would require compiling with the experimental flag -expt-relaxed-constexpr. Is there some other way to get one global memory load operation to load up to 32 4-byte values of varying types?

__global__ void foo(const uchar4* common, const float* src, uchar4* dest) {
    // Five mixed int/float parameters per block, stored as raw 4-byte chunks.
    __shared__ uchar4 s_common[5];

    if (threadIdx.x < 5) {
        s_common[threadIdx.x] = common[blockIdx.x * 5 + threadIdx.x];
    }
    __syncthreads();

    // Reinterpret each 4-byte chunk as its actual type.
    int v0 = *reinterpret_cast<int*>(&s_common[0]);
    float v1 = *reinterpret_cast<float*>(&s_common[1]);
    int v2 = *reinterpret_cast<int*>(&s_common[2]);
    float v3 = *reinterpret_cast<float*>(&s_common[3]);
    float v4 = *reinterpret_cast<float*>(&s_common[4]);

    float res = src[v0 + threadIdx.x] * v1 + src[v2 + threadIdx.x] * v3 + v4;
    int dest_idx = (blockIdx.x * blockDim.x + threadIdx.x) * 2;
    *reinterpret_cast<float*>(&dest[dest_idx]) = res;
}
  • There are unions (e.g. for storing in shared memory), but type punning through a union is also UB in C++. In practice it could be quite safe, as CUDA accepts that any other thread or an engine from the outside may modify shared memory, especially if it is declared volatile __shared__, and will not make assumptions, but just read the stored bit values.
  • A modern way in C++ is std::bit_cast (is it already supported by nvcc?)
  • Sometimes memcpy can be used. Compilers can often replace a small fixed-size memcpy with a plain reinterpretation of the value.
  • In my experience, nvcc is quite tolerant with regard to reinterpret_cast. I would use int or unsigned int as the base type instead of uchar4, but probably all types work, as long as they are aligned to 4-byte boundaries. BTW, a slightly shorter variant with fewer special symbols uses references: int v0 = reinterpret_cast<int&>(s_common[0]);
  • A clean way could be the reinterpretation intrinsics, e.g. __int_as_float, if the offered type variants are sufficient: Type Casting Intrinsics — CUDA Math API Reference Manual
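The memcpy approach from the list above can be sketched in plain C++ (the pattern is the same in device code); the function names int_bits_as_float and float_bits_as_int are illustrative, not from any library:

```cpp
#include <cstring>

// Reinterpret the bits of a 4-byte int as a float without UB.
// Compilers typically lower this fixed-size memcpy to a plain
// register move, which is what __int_as_float does on the GPU.
inline float int_bits_as_float(int i) {
    float f;
    static_assert(sizeof(f) == sizeof(i), "both must be 4 bytes");
    std::memcpy(&f, &i, sizeof(f));
    return f;
}

// The inverse direction, matching __float_as_int.
inline int float_bits_as_int(float f) {
    int i;
    static_assert(sizeof(i) == sizeof(f), "both must be 4 bytes");
    std::memcpy(&i, &f, sizeof(i));
    return i;
}
```

The bit pattern survives a round trip unchanged, so values can be stored in an int-typed shared buffer and recovered as floats later.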

Keeping my varying-type data as ints by default and using the __int_as_float and __float_as_int intrinsics as needed seems to be the best solution for my purposes, since it doesn’t require the -expt-relaxed-constexpr flag. Thanks a lot, I wasn’t aware of such intrinsics.
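For reference, the kernel from the question rewritten along those lines might look like the following sketch (untested): the shared buffer is typed int, __int_as_float recovers the float-typed parameters, and dest is assumed to be typed float* directly since only floats are written, keeping the original stride-2 layout:

```cuda
__global__ void foo(const int* common, const float* src, float* dest) {
    // Five 4-byte parameters per block, kept as raw int bit patterns.
    __shared__ int s_common[5];

    if (threadIdx.x < 5) {
        s_common[threadIdx.x] = common[blockIdx.x * 5 + threadIdx.x];
    }
    __syncthreads();

    int   v0 = s_common[0];                 // already an int
    float v1 = __int_as_float(s_common[1]); // reinterpret bits as float
    int   v2 = s_common[2];
    float v3 = __int_as_float(s_common[3]);
    float v4 = __int_as_float(s_common[4]);

    float res = src[v0 + threadIdx.x] * v1 + src[v2 + threadIdx.x] * v3 + v4;
    int dest_idx = (blockIdx.x * blockDim.x + threadIdx.x) * 2;
    dest[dest_idx] = res; // no reinterpret_cast needed on the store
}
```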

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.