Accesses on the device must be naturally aligned, e.g. 4-byte quantities must be 4-byte aligned and 8-byte quantities must be 8-byte aligned. An old technique that avoids padding problems in structures (and predates GPUs by decades) is to sort structure elements by decreasing element size, which automatically aligns every structure member correctly as long as the structure as a whole is aligned suitably for the largest element type:
(1) double, long long // 8 bytes
(2) pointers // 4 bytes or 8 bytes
(3) float, int // 4 bytes
I would suggest giving that a try. I don’t know what’s going on in your specific example; it may be an issue of the host compiler having different ideas about the required padding than the CUDA compiler, especially since x86 supports misaligned accesses at just a minor cost in performance. So if you are on a 64-bit platform, the struct may wind up packed (i.e. with a misaligned 8-byte pointer “data”) on the host side but automatically padded on the device side.
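To make that concrete, here is a minimal sketch using a NumPy structured dtype as a stand-in for the host-side struct (field names are made up for illustration; np.intp stands in for an 8-byte pointer, assuming a 64-bit platform). NumPy’s default layout is packed, while align=True pads the same way a C/CUDA compiler would:

```
import numpy as np

# Members declared "smallest first": packed and padded layouts disagree.
fields = [("x", np.float32),   # 4 bytes
          ("data", np.intp),   # 8 bytes on a 64-bit platform
          ("n", np.int32)]     # 4 bytes
packed = np.dtype(fields)              # offsets 0, 4, 12 -> itemsize 16
padded = np.dtype(fields, align=True)  # offsets 0, 8, 16 -> itemsize 24
print(packed.itemsize, padded.itemsize)        # 16 24

# Same members sorted by decreasing size: both layouts coincide.
fields_sorted = [("data", np.intp), ("x", np.float32), ("n", np.int32)]
print(np.dtype(fields_sorted).itemsize,
      np.dtype(fields_sorted, align=True).itemsize)   # 16 16
```

With the members sorted by decreasing size it no longer matters whether the host side packs or pads, so the buffer built on the host matches what the device-side struct expects.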
Sorry, I am not familiar with PyCUDA and have no way of reproducing your observations (and it has been 10 years since I last used Python at all). Given the results from the latest experiment, it is not clear to me that there is a problem on the CUDA side here. Maybe another CUDA user with PyCUDA experience will see this thread and be able to suggest additional lines of investigation to get to the bottom of this problem.
Since you are making an array of structs, you also have to worry about the alignment of the start of the second (and every subsequent) struct. Since your struct starts with a type that needs 8-byte alignment, the entire struct needs 8-byte alignment, so sizeof() for the struct is not 12, but 16.
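On the host side the same thing is easy to see with NumPy; a sketch, assuming the struct is an 8-byte member followed by a 4-byte int (hypothetical field names):

```
import numpy as np

fields = [("data", np.intp),   # 8 bytes -> forces 8-byte alignment of the whole struct
          ("n", np.int32)]     # 4 bytes
packed = np.dtype(fields)                # itemsize 12: next array element would start misaligned
padded = np.dtype(fields, align=True)    # itemsize 16: 4 bytes of tail padding added
print(packed.itemsize, padded.itemsize)  # 12 16
# In an array of the padded dtype, element i starts at offset i * 16, so the
# 8-byte member of every element stays 8-byte aligned, as the device requires.
```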
PyCUDA provides a function that can calculate this for you:
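If I remember the PyCUDA API correctly, the helper is pycuda.characterize.sizeof(), which compiles a tiny kernel and reports the size the device-side compiler assigns to a type; take the exact name and signature as my assumption and double-check the PyCUDA documentation. A sketch:

```
import pycuda.autoinit   # creates a CUDA context so the helper can compile and run
from pycuda.characterize import sizeof

# Hypothetical struct mirroring the discussion: an 8-byte member plus an int.
preamble = "struct Item { double *data; int n; };"
print(sizeof("Item", preamble))   # expected: 16 on a 64-bit platform, not 12
```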