I’d like to load at most 32 4-byte values from global memory into shared memory in one load operation, but some of those values are `int`s and others are `float`s. I thought of using `uchar4` to represent a type that can hold either a `float` or an `int`, and using `reinterpret_cast` as in the example code below. The problem is that I believe my example code has undefined behavior (the last part of this example indicates so) unless I replace each `reinterpret_cast<..>(..)` expression with `std::launder(reinterpret_cast<..>(..))`, which would require compiling with the experimental flag `-expt-relaxed-constexpr`. Is there some other way to get one global memory load operation to load up to 32 4-byte values of varying types?

    __global__ void foo(const uchar4* common, const float* src, uchar4* dest) {
        // Per-block parameters, stored as opaque 4-byte slots.
        __shared__ uchar4 s_common[5];
        // The first five threads each copy one slot from global to shared memory.
        if (threadIdx.x < 5) {
            s_common[threadIdx.x] = common[blockIdx.x * 5 + threadIdx.x];
        }
        __syncthreads();

        // Reinterpret each slot as int or float (these are the casts in question).
        int v0 = *reinterpret_cast<int*>(&s_common[0]);
        float v1 = *reinterpret_cast<float*>(&s_common[1]);
        int v2 = *reinterpret_cast<int*>(&s_common[2]);
        float v3 = *reinterpret_cast<float*>(&s_common[3]);
        float v4 = *reinterpret_cast<float*>(&s_common[4]);

        float res = src[v0 + threadIdx.x] * v1 + src[v2 + threadIdx.x] * v3 + v4;
        // Each thread writes its float result into every second uchar4 slot of dest.
        int dest_idx = (blockIdx.x * blockDim.x + threadIdx.x) * 2;
        *reinterpret_cast<float*>(&dest[dest_idx]) = res;
    }
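
For illustration, one direction that might avoid the pointer casts altogether is to load the slots as plain `int` words and reinterpret the bits by value with the CUDA intrinsic `__int_as_float`. The following is only a sketch under assumptions: the names `foo_bits`, `float_bits`, `run_demo`, and `kCommon` are hypothetical, `dest` is simplified to a `float*` with one element per thread, and the host is assumed to pack the parameters with `std::memcpy`. Whether the five-word load ends up as a single memory transaction depends on alignment and the hardware, which this sketch does not verify.

```cuda
#include <cstring>
#include <vector>
#include <cuda_runtime.h>

constexpr int kCommon = 5;  // hypothetical: number of 4-byte parameters per block

// Load raw 32-bit words, then reinterpret the bits by value instead of
// casting pointers between unrelated types.
__global__ void foo_bits(const int* common, const float* src, float* dest) {
    __shared__ int s_common[kCommon];
    if (threadIdx.x < kCommon) {
        s_common[threadIdx.x] = common[blockIdx.x * kCommon + threadIdx.x];
    }
    __syncthreads();

    int   v0 = s_common[0];                  // already an int
    float v1 = __int_as_float(s_common[1]);  // same bits, read as float
    int   v2 = s_common[2];
    float v3 = __int_as_float(s_common[3]);
    float v4 = __int_as_float(s_common[4]);

    dest[blockIdx.x * blockDim.x + threadIdx.x] =
        src[v0 + threadIdx.x] * v1 + src[v2 + threadIdx.x] * v3 + v4;
}

// Host-side helper (hypothetical): copy a float's bit pattern into an int.
static int float_bits(float f) {
    int bits;
    std::memcpy(&bits, &f, sizeof bits);
    return bits;
}

// Launches one block and compares the result against a host computation.
bool run_demo() {
    const int threads = 32;
    std::vector<float> h_src(64);
    for (int i = 0; i < 64; ++i) h_src[i] = static_cast<float>(i);

    // Packed layout: int offset, float scale, int offset, float scale, float bias.
    int h_common[kCommon] = {0, float_bits(2.0f), 32, float_bits(0.5f),
                             float_bits(1.0f)};

    int* d_common = nullptr;
    float* d_src = nullptr;
    float* d_dest = nullptr;
    cudaMalloc(&d_common, sizeof h_common);
    cudaMalloc(&d_src, h_src.size() * sizeof(float));
    cudaMalloc(&d_dest, threads * sizeof(float));
    cudaMemcpy(d_common, h_common, sizeof h_common, cudaMemcpyHostToDevice);
    cudaMemcpy(d_src, h_src.data(), h_src.size() * sizeof(float),
               cudaMemcpyHostToDevice);

    foo_bits<<<1, threads>>>(d_common, d_src, d_dest);

    std::vector<float> h_dest(threads);
    cudaError_t err = cudaMemcpy(h_dest.data(), d_dest, threads * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_common);
    cudaFree(d_src);
    cudaFree(d_dest);

    bool ok = (err == cudaSuccess);
    for (int i = 0; ok && i < threads; ++i) {
        float expected = h_src[i] * 2.0f + h_src[32 + i] * 0.5f + 1.0f;
        ok = (h_dest[i] == expected);
    }
    return ok;
}
```

Calling `run_demo()` from a `main` of your own and building with `nvcc` should be enough to try it. The design choice here is that the bits are converted by value (`__int_as_float` on the device, `std::memcpy` on the host), so no object is ever accessed through a pointer to a different type.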