Vector load "int4 veca1 = reinterpret_cast<int4*>(&a[2])[0];" valid?

The following gives segfault. Can I load vector in this way? My goal is to load a[2,3,4,5]?

__global__ void kernel_vector(int *a) {
    int4 veca1 = reinterpret_cast<int4*>(&a[2])[0];
    printf("%d %d %d %d\n", veca1.x, veca1.y, veca1.z, veca1.w);    
}

Hi,

did you check the return code of your kernel launch? Maybe an alinement issue? make sure ‘a’ is proper aligned for the int4 access.

Here is my full code

#include <stdio.h>
__global__ void kernel_vector(int *a) {
    //This prints well
    int4 veca = reinterpret_cast<int4*>(a)[0];
    printf("%d %d %d %d\n", veca.x,veca.y,veca.z,veca.w);

    //This segfaults 
    veca = reinterpret_cast<int4*>(&a[1])[0];
    printf("%d %d %d %d\n", veca.x,veca.y,veca.z,veca.w);

    //This also segfaults 
    int *a2 = &a[2];
    veca = reinterpret_cast<int4*>(a2)[0];
    printf("%d %d %d %d\n", veca.x,veca.y,veca.z,veca.w);
}


int main(){
    int N = 100000;
    int *a1, *da1;
    a1 = (int*)malloc(N*sizeof(int));
    cudaMalloc(&da1, N*sizeof(int));
    for (int i = 0; i < N; ++i)
        a1[i]  = i;
    cudaMemcpy(da1, a1, N*sizeof(int), cudaMemcpyHostToDevice);
    kernel_vector<<<1, 1>>>(da1); 
    cudaDeviceSynchronize();
    cudaMemcpy(a1, da1, N*sizeof(int), cudaMemcpyDeviceToHost);
    return 0;
}

This line is illegal (even if a is int4 aligned):

veca = reinterpret_cast<int4*>(&a[1])[0];

Suggestion: run your code with cuda-memcheck. I usually recommend that for everyone, before asking for help. It may help you to figure out the problem yourself, and even if not, the error output will be useful for others trying to help you.

In short, the GPU has natural alignment requirements, which are spelled out in the programming guide:

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#device-memory-accesses

that line of code violates those requirements. cuda-memcheck will say something like this:

=========     Address 0x7ff445c00004 is misaligned

one possible alternative for this case. Instead of this:

veca = reinterpret_cast<int4*>(&a[1])[0];
    printf("%d %d %d %d\n", veca.x,veca.y,veca.z,veca.w);

do this:

int *pa1 = &a[1];
    printf("%d %d %d %d\n", pa1[0],pa1[1],pa1[2],pa1[3]);

Thank you very much for answers. I thought cudaMalloc guarantees alignment. Because “int4 veca = reinterpret_cast<int4*>(a)[0];” just works.
How can I do alignment?

cudaMalloc does guarantee alignment (to a 512 byte boundary, or something like that). a[0] is aligned to a 4 byte, 8-byte, 16-byte etc. boundary.

But if a[0] is aligned to a 16-byte boundary, by definition a[1] is not (for a of type int*, float*, double*, etc.) You cannot offset an arbitrary amount of bytes from that boundary and expect that you still have alignment.

I’ve edited my previous response to suggest an alternate approach that doesn’t trip over alignment.