struct member alignment inconsistent using templates

I’ve come across a strange problem and I think I’ve narrowed it down to this behaviour.

The built-in float4 type, which should be aligned to 16 bytes, aligns properly inside a structure on the device, and on the host when that structure is not a template. However, when I make the struct a template, its members are misaligned on the host. Manual padding is required to work around this problem, which is not something we should have to do.

Here’s a simple example that recreates this problem.

Use either:

#define xyz float4

or

typedef struct __align__(16) xyz { float x, y, z, w; } xyz;

template <class P>
struct test1 {
	int i;
	xyz v;
};

struct test2 {
	int i;
	xyz v;
};

int main() {
	// sizeof yields size_t, so cast it for the %d format (needs #include <stdio.h>)
	printf("sizeof test1: %d\n", (int)sizeof(test1<test2>));
	printf("sizeof test2: %d\n", (int)sizeof(test2));
}

Using float4:

sizeof test1: 20
sizeof test2: 32

Using my alternate structure, which uses the same CUDA-provided __align__(16) that float4 uses in vector_types.h:

sizeof test1: 32
sizeof test2: 32

To be clear, on the device the alignment is like test2 (no template). The real problem is passing a template struct from the host to a kernel. In this case the struct has a different layout when it arrives on the device and kernels cannot read from it correctly.
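Comparing member offsets as well as sizes can make a layout mismatch like this easier to pin down. The following is a minimal host-only sketch, not a reproduction of the bug: `vec4` is a stand-in for float4 so that it builds without the CUDA headers, `layout` and `describe` are hypothetical helper names, and `alignas`/`alignof` assume a C++11 compiler (an older MSVC would need `__declspec(align(16))` and `__alignof` instead).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>

// Stand-in for float4 (assumption): 16-byte aligned, four floats.
struct alignas(16) vec4 { float x, y, z, w; };

template <class P>
struct test1 {
	int i;
	vec4 v;
};

struct test2 {
	int i;
	vec4 v;
};

// Hypothetical helper type: size, alignment, and offset of member v.
struct layout {
	std::size_t size;
	std::size_t align;
	std::size_t offset_v;
};

// Collects the layout of any struct that has a member named v.
template <class T>
layout describe() {
	layout l = { sizeof(T), alignof(T), offsetof(T, v) };
	return l;
}

// Prints one line per struct so the two layouts can be compared by eye.
void print_layout(const char* name, const layout& l) {
	std::printf("%s: size %d, align %d, offset of v %d\n",
	            name, (int)l.size, (int)l.align, (int)l.offset_v);
}

// Aborts if the template and non-template structs are laid out differently.
void check_same_layout() {
	layout a = describe<test1<test2> >();
	layout b = describe<test2>();
	assert(a.size == b.size);
	assert(a.offset_v == b.offset_v);
}
```

Calling `print_layout("test1", describe<test1<test2> >())` and the same for test2 from host code compiled by nvcc, with `vec4` swapped for float4, should show whether `v` moves from offset 16 to offset 4 in the template case.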

A secondary problem is that when I use my alternate structure as a workaround, and then try to pass this structure into a kernel as a parameter, I receive this compile error:

error: cannot pass a parameter with a too large explicit alignment to a global routine on win32 platforms

I am using Windows with MSVC compiler version 15, on CUDA 5.0.

Are you building for 32-bit Windows (Win32)? As far as I am aware, the Win32 ABI cannot currently support 16-byte alignment in all cases. In particular, I seem to recall that it cannot guarantee 16-byte alignment for objects stored on the stack, and this may extend to other scenarios. The alignment directives provided by CUDA simply map to the equivalent alignment directives of the native toolchain for the host portion of the code; on Windows platforms, __align__(n) appears to map to __declspec(align(n)). You may want to check the relevant documentation with regard to the Win32 ABI.
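To illustrate the kind of mapping described above, here is a rough sketch of an alignment macro that expands to the native directive of each host compiler. `MY_ALIGN`, `holder`, and the helper functions are hypothetical names used only for this example; this is not CUDA's actual header code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical macro (assumption): pick the host compiler's own directive.
#if defined(_MSC_VER)
#define MY_ALIGN(n) __declspec(align(n))
#else
#define MY_ALIGN(n) __attribute__((aligned(n)))
#endif

struct MY_ALIGN(16) xyz { float x, y, z, w; };

// A plain struct holding the aligned member after an int.
struct holder {
	int i;
	xyz v;
};

// Returns true when p sits on a 16-byte boundary.
bool on_16_byte_boundary(const void* p) {
	return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// Aborts if the host compiler did not pad holder so that v starts at byte 16.
void check_holder_layout() {
	assert(offsetof(holder, v) == 16);
	assert(sizeof(holder) == 32);
}
```

Because the directive belongs to the host compiler, whatever restrictions that compiler and its ABI place on aligned types, such as passing them by value, carry over to the host side of CUDA code.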

If I create a struct without using a template, it is aligned correctly, and can be passed into a kernel without problem. With a template, I can use my own 16-byte-aligned members and the struct is created properly. It’s only when using the built-in CUDA vector types in combination with a template struct that the alignment is incorrect. This leads me to think this is a problem with CUDA.
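For reference, the manual padding workaround mentioned earlier could look roughly like the sketch below. It is host-only and makes an assumption: `vec4_unaligned` simulates a float4 whose 16-byte alignment has been lost, and `test1_padded` and `check_padded_layout` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>

// Simulates a float4 that is only 4-byte aligned (assumption for this sketch).
struct vec4_unaligned { float x, y, z, w; };

template <class P>
struct test1_padded {
	int i;
	int pad[3];        // explicit padding so that v starts at byte 16
	vec4_unaligned v;
};

// Aborts if the padded struct does not match the 32-byte layout
// that the non-template struct was reported to have.
void check_padded_layout() {
	assert(offsetof(test1_padded<int>, v) == 16);
	assert(sizeof(test1_padded<int>) == 32);
}
```

The padding only fixes the member offsets and the total size. The alignment of the struct itself is still 4 in this simulation, so it is worth keeping a check like `check_padded_layout` next to the definition in case the compiler behaviour changes.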

I am not enough of an expert to say that what you are observing is not a bug. If you have a repro case in hand, you may want to consider filing a bug report using the form linked from the registered developer website.