The very first element of your structure is a float4, and that clearly must be aligned to 16 bytes (so that it can be fetched with a single load instruction). So your whole structure must be aligned to 16 bytes.
The question is why nvcc doesn’t work this out for itself. Adding __align__(16) to the structure does fix the situation.
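To make the alignment argument concrete, here is a rough host-side sketch in plain C++. Float4, Particle, ParticleAligned and is_aligned16 are hypothetical names made up for illustration (Float4 stands in for CUDA's built-in float4, and alignas(16) plays the role that __align__(16) plays in CUDA code); your actual structure will differ. It only shows what a compiler is expected to do with a 16-byte-aligned first member, not what nvcc did in your case.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for CUDA's float4: four floats, 16-byte aligned.
struct alignas(16) Float4 { float x, y, z, w; };

// Hypothetical structure whose first member is the 16-byte-aligned vector.
// No explicit alignment is requested on the structure itself.
struct Particle {
    Float4 position;
    float  mass;
};

// The same structure with the alignment spelled out explicitly,
// analogous to putting __align__(16) on the struct in CUDA.
struct alignas(16) ParticleAligned {
    Float4 position;
    float  mass;
};

// A structure must be at least as strictly aligned as its strictest member,
// so the explicit request should not change anything here.
static_assert(alignof(Float4) == 16, "vector member is 16-byte aligned");
static_assert(alignof(Particle) == 16, "alignment propagates to the struct");
static_assert(alignof(ParticleAligned) == alignof(Particle),
              "explicit alignment matches the implied one");
static_assert(sizeof(Particle) % 16 == 0,
              "size is padded so array elements stay aligned");
static_assert(offsetof(Particle, position) == 0, "vector is the first member");

// Runtime check that an address is a multiple of 16.
inline bool is_aligned16(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}
```

If all of these hold for the host compiler but the device code still sees a misaligned structure, that points at the compiler rather than at your declaration.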
Bug, bug bug bug.