I’ve come across a strange problem and I think I’ve narrowed it down to this behaviour.
The built-in float4 type, which should be aligned to 16 bytes, is aligned correctly inside a struct on the device, and also on the host as long as that struct is not a template. However, when I make the struct a template, the host compiler misaligns the member. Manual padding works around the problem, but that is not something we should have to do.
Here’s a simple example that recreates this problem.
Use either:
#define xyz float4
or
typedef struct __align__(16) xyz { float x, y, z, w; } xyz;
template <class P>
struct test1 {
int i;
xyz v;
};
struct test2 {
int i;
xyz v;
};
int main() {
printf("sizeof test1: %d\n", (int)sizeof(test1<test2>));
printf("sizeof test2: %d\n", (int)sizeof(test2));
}
Using float4:
sizeof test1: 20
sizeof test2: 32
Using my alternate structure, which uses the same CUDA-provided __align__(16) that float4 uses in vector_types.h:
sizeof test1: 32
sizeof test2: 32
To be clear, on the device the alignment is like test2 (no template). The real problem is passing a template struct from the host to a kernel. In this case the struct has a different layout when it arrives on the device and kernels cannot read from it correctly.
A secondary problem is that when I use my alternate structure as a workaround, and then try to pass this structure into a kernel as a parameter, I receive this compile error:
error: cannot pass a parameter with a too large explicit alignment to a global routine on win32 platforms
I am using Windows with MSVC compiler version 15, on CUDA 5.0.