Misaligned address of union

I have the following kernel which will generate a runtime error of misaligned memory access.

#include <stdlib.h>
#include <stdint.h>
#include "repeat.h"

#define ITER 32
#define BS 160

__global__ void chase_pointers (
    uintptr_t *ptr_array,
    int *output,
    float b)
    b=b-1; // make sure b is 0
    float a;
    int pid = threadIdx.x + blockIdx.x*blockDim.x;
    int warpid = pid / 32;
    int laneid = pid % 32;

        uintptr_t * addr;
        struct {
            int low;
            int high;
        } myInt;
    } myUnion;
    int startIdx = threadIdx.x + blockIdx.x * ITER * BS;
    myUnion.addr = &ptr_array[startIdx];
#pragma unroll 16
    for (int i = 0; i < ITER; i++) {
        myUnion.addr = (uintptr_t *)(*myUnion.addr);
        myUnion.myInt.low = __float_as_int(a);

    if (laneid == 0){
        output[warpid] = myUnion.myInt.low;

int main (int argc, char *argv[])
    int arrayLen = 64*1024*1024;
    size_t arraySize = arrayLen*8 + BS*8;
    uintptr_t *ptr_array = (uintptr_t*)malloc(arraySize);
    uintptr_t *ptr_array_d = 0;
    cudaMalloc ((void **)&ptr_array_d, arraySize);
     * The array is initialized so that
     * array[i] = &array[i+N], where N is blocksize (BS)
    for (int i = 0; i < arrayLen; i++){
        ptr_array[i] = (uintptr_t)&ptr_array_d[i+BS];
    cudaMemcpy (ptr_array_d, ptr_array, arraySize, cudaMemcpyHostToDevice);

    int blocks = arrayLen/BS/ITER;
    int threads = BS;
    int warps = max(1, blocks*threads/32);
     * Initialize output array, one element for each warp
    int *output = (int*)malloc(sizeof(int)*warps);
    int *output_d;
    cudaMalloc((void**)&output_d, sizeof(int)*warps);

    chase_pointers<<<blocks,threads>>>(ptr_array_d, output_d, 1);

    /* clean up */
    cudaFree (ptr_array_d);
    cudaFree (output_d);
    return EXIT_SUCCESS;

When I use cuda-memcheck, sometimes I got the following error

========= Invalid __global__ read of size 8
=========     at 0x00000908 in /u/lxzhang/GPU-Benchmark/GPU-Benchmark-Volkov-Diss/test.cu:34:chase_pointers(unsigned long*, int*, float)
=========     by thread (127,0,0) in block (12906,0,0)
=========     Address 0x7fd27fffffff is misaligned
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so.1 (cuLaunchKernel + 0x34e) [0x2d725e]
=========     Host Frame:test [0x22992]
=========     Host Frame:test [0x22b87]
=========     Host Frame:test [0x56f45]
=========     Host Frame:test [0x6c09]
=========     Host Frame:test [0x6abd]
=========     Host Frame:test [0x6b14]
=========     Host Frame:test [0x68d1]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xe7) [0x21b97]
=========     Host Frame:test [0x661a]

... more errors ... 

The error points to line 34, which is

myUnion.addr = (uintptr_t *)(*myUnion.addr);

It seems that myUnion.addr is not always aligned at 8 bytes. My question is how can I enforce alignment for the union so that myUnion.addr is always aligned at 8 bytes? Or is it caused by something else?
And it happens on GTX 1080 (Pascal CC=6.1) but always works fine on GTX TITAN (Kepler CC=3.5). So why is that?

The source code is compiled with nvcc-10.0 on GTX 1080.
Thanks in advance.

That means “reinterpret this int bit pattern as a float bit pattern”. Don’t you really want:

a = (float)(myUnion.myInt.low);

(I’m not sure that makes sense either, unless you can range limit the int, which you can’t, see below).

Why do you think that reinterpreting an arbitrary int to float will give you anything sensible to work with? What if the conversion happens to create a NaN ? Since the array you are running through is created/indexed off an address returned by cudaMalloc there is no guarantee of what the bit pattern may be, other than that the lowest few bits are zero.

I think its telling that the lowest 8 hex digits of your misaligned address are 0x7FFFFFFF

that corresponds to a float NaN


An earlier question by the same poster provides context for this question. Their goal is to replicate a kernel from Vasily Volkov’s thesis that uses the following instruction sequence for a micro-benchmark:

LD R1, [R1]
FADD R1, R1, R2
FADD R1, R1, R2
FADD R1, R1, R2
LD R1, [R1]
FADD R1, R1, R2

This sequence consisting of just loads and dependent floating-point additions (and importantly, nothing else) requires pointers and floating-point data to be kept in the same registers, therefore requiring data re-interpretation when expressed at the HLL level.

In a 32-bit environment, this is trivially doable as long as the numerical range of the pointers is suitably restricted, since both float and pointers are 4-byte types. In essence this is using float subnormals to operate on 24-bit integers. Subnormals are handled at full speed by the GPU, so this works without issues.

Current CUDA supports only 64-bit platforms and therefore requires 64-bit pointers. If double operations had the same high throughput as float operations, one could just switch to those and the code would map one-to-one from Volkov’s example. Alas, this isn’t the case.


I interpreted one aspect of OP’s question to be asking for a plausible explanation of what may be causing the misaligned value. Perhaps I was mistaken about that.

I think the key is basically this statement of yours:

“as long as the numerical range of the pointers is suitably restricted”

my guess is this is a “numerical range problem” on the lower 32 bits of the 64-bit pointer.

I have no idea how to restrict the numerical value of a pointer returned by cudaMalloc whether we are talking about the historical 32-bit case, or the 64-bit case. (And by extension its not clear to me how you would know such a pointer is not immediately a NaN as soon as you convert, whether we are talking about float/32-bit pointer or double/64-bit pointer arithmetic)

However it occurs to me that if you allocate a large enough allocation (larger than 2GB or 4GB, I guess) then you can guarantee that there is a valid address within that allocation that has a lower 32-bit field that is all zeroes. (I think 31-bits would also work, I’m not sure the sign bit one way or the other will upset the apple cart.) That happens to be zero in floating point also, so that may be an idea.

I tested the NaN behavior with the following kernel

__global__ void testNaN(float b)
    int inta=2147483647; // 0x7fffffff
    float floata=__int_as_float(inta); // should be NaN
    floata += b;
    int intb = __float_as_int(floata);
    printf("intb = %d, inta = %d\n", intb, inta);
    printf("floata = %f\n", floata);

int main ()
    testNaN <<<1,1>>> (0.0);

And when I run the kernel I get the following result:

intb = 2147483647, inta = 2147483647
floata = nan

This means even though floata is NaN, it can still be added by 0 (so that its bits do not change) and interpreted back as an integer, right?

The intention is to get the address, interpret the low (or high) 32 bit as float, perform some FADD instructions on it but do not change its value (this is done by adding zeros), and convert the float back to an integer. In this way, the obtained address should not change.

But cuda-memcheck gives the error “Address 0x7fd27fffffff is misaligned” for a lot of threads (but not all threads). Since the set up of the kernel guarantees that different threads will never access the same address, this address (0x7fd27ffffff) must be some local variable to those threads. That is why I think the problem is that the alignment of the Union is not 8 byte, so that addr within the union cannot be dereferenced as uintptr_t *.

Am I misunderstand something?

To me, this is a more interesting test case:

$ cat t53.cu
#include <stdio.h>
__global__ void testNaN(float b)
    int inta=2147483632; // 0x7ffffff0
    float floata=__int_as_float(inta); // should be NaN
    floata += b;
    int intb = __float_as_int(floata);
    printf("intb = %d, inta = %d\n", intb, inta);
    printf("floata = %f\n", floata);

int main ()
    testNaN <<<1,1>>> (0.0);
$ nvcc -arch=sm_35 -o t53 t53.cu -O3 -lineinfo -Wno-deprecated-gpu-targets -std=c++14
$ ./t53
intb = 2147483647, inta = 2147483632
floata = nan

GTX960, CUDA 11.1, driver 455.23.05

I really don’t know anything about NaN arithmetic. Therefore I would try to avoid it. That’s the way I see it.

GPUs adhere to the IEEE-754 standard, so NaNs are pass-through for most floating-point operations: NaN + x = NaN, NaN * x = NaN, etc. Furthermore, for single precision, i.e. float, only a single canonical NaN with bit pattern 0x7fffffff is output by GPUs. So-called NaN payloads are optional per the standard.

So once the integer / pointer bits correspond to a NaN representation (__float_as_int (fabsf(x) > 0x7F800000) no meaningful operation can be performed on their floating-point representation. Generally speaking, trying to perform integer arithmetic via float operations only works correctly in full generality for unsigned integers in [0,224]. This hack might work for shared memory addresses, but is not generally suitable when dealing with global memory addresses.

It would be well worth investigating why there needs to be a dependency through floating-point adds here, and whether the same goal cannot be achieved through integer adds. I haven’t read Volkov’s thesis, but I would be surprised if the use of single-precision adds is essential to whatever this piece of code is trying to achieve.

Thank you for the explanation. It helps a lot.

I tried to check the address returned by cudaMalloc and found that

  1. On my desktop, which has GTX TITAN, the returned address is always the same.
  2. On a server, which has GTX 1080, the returned address is changing every time I run the application.

So what causes the above difference?

Both the server and my desktop have the following specs:
OS: Ubuntu 18.04
CUDA Driver Version: 11.0
NVIDIA Driver Version: 450.51.06
The only difference is that the server is using nvcc-10.0 while my desktop is using mvcc-10.2

In general, it is futile and risky to make assumptions about the addresses returned by memory allocation beyond any documented alignment guarantees. We could speculate that more than one application using the GPU runs on the server so GPU memory usage is more dynamic.

Memory allocation is typically a layered process with a different allocator at each level. Except for an occasional user-accessible switch, details of memory allocation are usually not documented. In this case, I am not aware of any user-level switch and the observed behavior is an artifact of CUDA runtime, NVIDIA driver, and operating system heuristics which may well differ by GPU architecture and phase of moon.

If you make your allocation large enough (meaning the GPU has enough available memory) you can guarantee that there exists a location within the allocation where the lower 32 bits of the address/pointer value are all zero. Again, if you make it large enough, and with a bit of careful calculation, you can find and guarantee that you have an available chunk of memory starting at that point, of your desired size (appears to be 512MB). With such an approach which should only impact host code, you should be able to make your kernel “safe” from conversion to NaN. Note that this statement does not hold for an arbitrarily large desired allocation, but for your particular size (512MB) it should be possible. 1080 has 8GB, it should be workable there.

As I mentioned before, indepedent of the issue of encountering NaN encodings, integer (and thus pointer) arithmetic via re-interpretation as a float will only work in a straightfoward[*] way for unsigned integers in [0,224] ([0, 0x1000000]), so keep that in mind when you select a usable sub-space in a larger allocation suggested by @Robert_Crovella.

[*] straightfoward: a+b = __float_as_int (__int_as_float(a) + __int_as_float(b))

At least for the case here, where one of the add operands is 0.0f, and over the range of interest, it seems to be OK:

$ cat t55.cu
#include <cassert>
__device__ int f(const int a, const int b){
         return __float_as_int (__int_as_float(a) + __int_as_float(b));

__global__ void k(const int b){
        for (int i = threadIdx.x+blockDim.x*blockIdx.x; i < 0x7F800000; i+=gridDim.x*blockDim.x)
                assert(i == f(i, b));

int main(){

        cudaError_t err = cudaDeviceSynchronize();
        assert(err == cudaSuccess);
$ nvcc -o t55 t55.cu -arch=sm_35 -Wno-deprecated-gpu-targets
$ cuda-memcheck ./t55
========= ERROR SUMMARY: 0 errors

If you change the constant above from 0x7F800000 to 0x7F800002, the assert will be hit. Apparently the NaN range starts at 0x7F800001, not 0x7F800000.

Oh! I didn’t notice that the increment is 0. In that case any integer outside the NaN pattern range should be OK, with the exception of the integer corresponding to negative zero (0x80000000), because under IEEE-754, with default rounding mode “to nearest or even”, -0.0f + 0.0f = 0.0f. Inside the NaN pattern range, any source operand would be converted to the canonical NaN pattern (0x7fffffff).

Apparently the NaN range starts at 0x7F800001, not 0x7F800000.

That is correct for NaNs with the sign bit cleared. As I stated above, isNaN(x) = __float_as_int (fabsf (x)) > 0x7f800000. The encoding 0x7f800000 corresponds to positive infinity.