global memory alignment issue [pycuda]

Hi all,

I asked the same question on pycuda mailing list but haven’t got any solution yet, so I post my question here hoping to get some answers.

The guy who wrote PyCuda suspects that CUDA does not automatically align the pointer, but he did not tell me how to fix the problem.

Please note that I do NOT have the issue if write the host program in C. Here is my question:

I modified the example Tutorial - pycuda 2022.1 documentation by removing the ‘__padding’ from the structure definition and got incorrect result. The kernel is launched with 2 blocks and one thread in each block.

Each thread prints the ‘len’ field in structure, which should be 3 for block 0 and 2 for block 1. However, the result I got is:

block 1: 2097664

block 0: 3

#include <stdio.h>

struct Vec {

        int len;

        float* data;

};

__global__ void test(Vec *a) {

        Vec v = a[blockIdx.x];

        printf("block %d: %d\n", blockIdx.x, v.len);

}

-------------------------------------------------- end of kernel ------------------------------------------------------

import numpy

import pycuda.autoinit

import pycuda.driver as cuda

from pycuda.compiler import SourceModule

class VecStruct:

    mem_size = 4 + numpy.intp(0).nbytes

    def __init__(self, array, struct_arr_ptr):

        data = cuda.to_device(array)

        cuda.memcpy_htod(int(struct_arr_ptr), numpy.int32(array.size))

        cuda.memcpy_htod(int(struct_arr_ptr) + 4, numpy.intp(int(data)))

# allocate memory to hold structure array

struct_arr = cuda.mem_alloc(2 * VecStruct.mem_size)

# populate data into structures

array1 = VecStruct(numpy.array([1, 2, 3], dtype=numpy.float32), struct_arr)

array2 = DoubleOpStruct(numpy.array([0, 4], dtype=numpy.float32), int(struct_arr) + VecStruct.mem_size)

with open('test.cu', 'r') as f:

        src  = f.read()

mod = SourceModule(src)

func = mod.get_function("test")

func(struct_arr, block = (1, 1, 1), grid=(2, 1))

Does anyone have an answer?

I found this from Chapter 5 of CUDA 4.0 programming guide, which may be relevant.

“Reading non-naturally aligned 8-byte or 16-byte words produces incorrect results”

But still, how do I fix the above problem?

Accesses on the device must be naturally aligned, e.g. 4-byte quantities must be 4-byte aligned, 8-byte quantities must be 8-byte aligned. An old technique that avoids padding problems in structures (and predates GPUs by decades) is to sort structure elements by descreasing element size which automatically aligns every structure member correctly as long as the structure as a whole is aligned suitably for the largest element type:

(1) double, long long   // 8 bytes

(2) pointers            // 4 bytes or 8 bytes

(3) float, int          // 4 bytes

I would suggest giving that a try. I don’t know what’s going on in your specific example, it may be an issue of the host compiler having different ideas about the required padding than the CUDA compiler, especially since x86 supports mis-aligned accesses at just a minor cost in performance. So if you are on a 64-bit platform, the struct may wind up packed (i.e. with a misaligned 8-byte pointer “data”) on the host side but automatically padded on the device side.

Yes, I’m on 64-bit Ubuntu.

Now the output changes to

block 1: 0

block 0: 3

I tried your suggestion by changing the structure definition to

struct Vec{

float* data;  /*8 byte pointer*/

int datalen;

};

And the host program to:

class VecStruct:

    mem_size = 4 + numpy.intp(0).nbytes

    def __init__(self, array, struct_arr_ptr):

        data = cuda.to_device(array)

        cuda.memcpy_htod(int(struct_arr_ptr) , numpy.intp(int(data)))

        cuda.memcpy_htod(int(struct_arr_ptr)+8, numpy.int32(array.size))

What should I do now?

Sorry, I am not familiar with PyCUDA and have no way of reproducing your observations (and it has been 10 years since I last used Python at all). Given the results from the latest experiment it is not clear to me that there is a problem on the CUDA side here. Maybe another CUDA user with PyCUDA experience will see this thread and be able to suggest additional lines of investigation to get to the bottom of this problem.

Since you are making an array of structs, you also have to worry about the alignment of the start of the second struct. Since your struct starts with a type that needs 8 byte alignment, the entire struct needs 8 byte alignment, so the sizeof() the struct is not 12, but 16.

PyCUDA provides a function that can calculate this for you:

import pycuda.characterize

vec_size = pycuda.characterize.sizeof('Vec', src)

Use that for the size of Vec in bytes, and I think this will all work as you expect.