I asked the same question on pycuda mailing list but haven’t got any solution yet, so I post my question here hoping to get some answers.
The guy who wrote PyCuda suspects that CUDA does not automatically align the pointer, but he did not tell me how to fix the problem.
Please note that I do NOT have the issue if write the host program in C. Here is my question:
I modified the example Tutorial - pycuda 2022.1 documentation by removing the ‘__padding’ from the structure definition and got incorrect result. The kernel is launched with 2 blocks and one thread in each block.
Each thread prints the ‘len’ field in structure, which should be 3 for block 0 and 2 for block 1. However, the result I got is:
block 1: 2097664
block 0: 3
#include <stdio.h>
struct Vec {
int len;
float* data;
};
__global__ void test(Vec *a) {
Vec v = a[blockIdx.x];
printf("block %d: %d\n", blockIdx.x, v.len);
}
-------------------------------------------------- end of kernel ------------------------------------------------------
import numpy
import pycuda.autoinit
import pycuda.driver as cuda
from pycuda.compiler import SourceModule
class VecStruct:
mem_size = 4 + numpy.intp(0).nbytes
def __init__(self, array, struct_arr_ptr):
data = cuda.to_device(array)
cuda.memcpy_htod(int(struct_arr_ptr), numpy.int32(array.size))
cuda.memcpy_htod(int(struct_arr_ptr) + 4, numpy.intp(int(data)))
# allocate memory to hold structure array
struct_arr = cuda.mem_alloc(2 * VecStruct.mem_size)
# populate data into structures
array1 = VecStruct(numpy.array([1, 2, 3], dtype=numpy.float32), struct_arr)
array2 = DoubleOpStruct(numpy.array([0, 4], dtype=numpy.float32), int(struct_arr) + VecStruct.mem_size)
with open('test.cu', 'r') as f:
src = f.read()
mod = SourceModule(src)
func = mod.get_function("test")
func(struct_arr, block = (1, 1, 1), grid=(2, 1))
Accesses on the device must be naturally aligned, e.g. 4-byte quantities must be 4-byte aligned, 8-byte quantities must be 8-byte aligned. An old technique that avoids padding problems in structures (and predates GPUs by decades) is to sort structure elements by descreasing element size which automatically aligns every structure member correctly as long as the structure as a whole is aligned suitably for the largest element type:
(1) double, long long // 8 bytes
(2) pointers // 4 bytes or 8 bytes
(3) float, int // 4 bytes
I would suggest giving that a try. I don’t know what’s going on in your specific example, it may be an issue of the host compiler having different ideas about the required padding than the CUDA compiler, especially since x86 supports mis-aligned accesses at just a minor cost in performance. So if you are on a 64-bit platform, the struct may wind up packed (i.e. with a misaligned 8-byte pointer “data”) on the host side but automatically padded on the device side.
Sorry, I am not familiar with PyCUDA and have no way of reproducing your observations (and it has been 10 years since I last used Python at all). Given the results from the latest experiment it is not clear to me that there is a problem on the CUDA side here. Maybe another CUDA user with PyCUDA experience will see this thread and be able to suggest additional lines of investigation to get to the bottom of this problem.
Since you are making an array of structs, you also have to worry about the alignment of the start of the second struct. Since your struct starts with a type that needs 8 byte alignment, the entire struct needs 8 byte alignment, so the sizeof() the struct is not 12, but 16.
PyCUDA provides a function that can calculate this for you: