Accesses on the device must be naturally aligned, e.g. 4-byte quantities must be 4-byte aligned and 8-byte quantities must be 8-byte aligned. An old technique that avoids padding problems in structures (and predates GPUs by decades) is to sort structure elements by decreasing element size, which automatically aligns every structure member correctly as long as the structure as a whole is aligned suitably for the largest element type:
(1) double, long long // 8 bytes
(2) pointers // 4 bytes or 8 bytes
(3) float, int // 4 bytes
I would suggest giving that a try. I don’t know what’s going on in your specific example; it may be an issue of the host compiler having different ideas about the required padding than the CUDA compiler, especially since x86 supports misaligned accesses at just a minor cost in performance. So if you are on a 64-bit platform, the struct may wind up packed (i.e. with a misaligned 8-byte pointer “data”) on the host side but automatically padded on the device side.
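To make that concrete, here is a minimal sketch using a NumPy structured dtype as a stand-in for the host-side struct (field names are made up for illustration; np.intp stands in for an 8-byte pointer, assuming a 64-bit platform). NumPy’s default layout is packed, while align=True pads the same way a C/CUDA compiler would:

```
import numpy as np

# Members declared "smallest first": packed and padded layouts disagree.
fields = [("x", np.float32),   # 4 bytes
          ("data", np.intp),   # 8 bytes on a 64-bit platform
          ("n", np.int32)]     # 4 bytes
packed = np.dtype(fields)              # offsets 0, 4, 12 -> itemsize 16
padded = np.dtype(fields, align=True)  # offsets 0, 8, 16 -> itemsize 24
print(packed.itemsize, padded.itemsize)        # 16 24

# Same members sorted by decreasing size: both layouts coincide.
fields_sorted = [("data", np.intp), ("x", np.float32), ("n", np.int32)]
print(np.dtype(fields_sorted).itemsize,
      np.dtype(fields_sorted, align=True).itemsize)   # 16 16
```

With the members sorted by decreasing size it no longer matters whether the host side packs or pads, so the buffer built on the host matches what the device-side struct expects.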
Sorry, I am not familiar with PyCUDA and have no way of reproducing your observations (and it has been 10 years since I last used Python at all). Given the results from the latest experiment, it is not clear to me that there is a problem on the CUDA side here. Maybe another CUDA user with PyCUDA experience will see this thread and be able to suggest additional lines of investigation to get to the bottom of this problem.
Since you are making an array of structs, you also have to worry about the alignment of the start of the second (and every subsequent) struct. Since your struct starts with a type that needs 8-byte alignment, the entire struct needs 8-byte alignment, so sizeof() for the struct is not 12, but 16.
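On the host side the same thing is easy to see with NumPy; a sketch, assuming the struct is an 8-byte member followed by a 4-byte int (hypothetical field names):

```
import numpy as np

fields = [("data", np.intp),   # 8 bytes -> forces 8-byte alignment of the whole struct
          ("n", np.int32)]     # 4 bytes
packed = np.dtype(fields)                # itemsize 12: next array element would start misaligned
padded = np.dtype(fields, align=True)    # itemsize 16: 4 bytes of tail padding added
print(packed.itemsize, padded.itemsize)  # 12 16
# In an array of the padded dtype, element i starts at offset i * 16, so the
# 8-byte member of every element stays 8-byte aligned, as the device requires.
```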
PyCUDA provides a function that can calculate this for you:
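If I remember the PyCUDA API correctly, the helper is pycuda.characterize.sizeof(), which compiles a tiny kernel and reports the size the device-side compiler assigns to a type; take the exact name and signature as my assumption and double-check the PyCUDA documentation. A sketch:

```
import pycuda.autoinit   # creates a CUDA context so the helper can compile and run
from pycuda.characterize import sizeof

# Hypothetical struct mirroring the discussion: an 8-byte member plus an int.
preamble = "struct Item { double *data; int n; };"
print(sizeof("Item", preamble))   # expected: 16 on a 64-bit platform, not 12
```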