Strange memory performance

I'm running various benchmarks to see how I can optimize a CUDA renderer further, and the latest test confuses me.

The following code is roughly 50% faster:

[codebox]struct A
{
    int4 v;
    float f;
};

int4 vec = a[index].v;[/codebox]

than reading only an int2 vector:

[codebox]struct B
{
    int2 v;
    float f;
};

int2 vec = b[index].v;[/codebox]

Why is int4 faster than int2? Both reads are misaligned. This is on compute capability 1.1 hardware.
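For reference, here is a small host-side sketch of how the two structs would be laid out, which may be worth checking before reasoning about alignment. It assumes a 4-byte int and uses `int4_like` and `int2_like` as hypothetical stand-ins for CUDA's `int4` and `int2` (declared with 16-byte and 8-byte alignment respectively), so it compiles without nvcc. `v_offset_A`, `v_offset_B` and `print_layout` are illustrative helpers only. The sketch shows the padding and per-element stride; it does not by itself explain the timing difference.

```cpp
#include <cstddef>
#include <cstdio>

// Hypothetical stand-ins for CUDA's built-in vector types.
// Assumption: int is 4 bytes; int4 is 16-byte aligned, int2 is 8-byte aligned.
struct alignas(16) int4_like { int x, y, z, w; };
struct alignas(8)  int2_like { int x, y; };

struct A { int4_like v; float f; };  // 16 + 4 bytes, padded up to a multiple of 16
struct B { int2_like v; float f; };  //  8 + 4 bytes, padded up to a multiple of 8

// The compiler pads each struct to a multiple of its strictest member alignment.
static_assert(sizeof(A) == 32, "A is padded from 20 to 32 bytes");
static_assert(sizeof(B) == 16, "B is padded from 12 to 16 bytes");
static_assert(offsetof(A, f) == 16, "f follows the 16-byte vector in A");
static_assert(offsetof(B, f) == 8,  "f follows the 8-byte vector in B");

// Byte offset of element `index`'s member v from the start of the array
// (v sits at offset 0 inside each struct).
constexpr std::size_t v_offset_A(std::size_t index) { return index * sizeof(A); }
constexpr std::size_t v_offset_B(std::size_t index) { return index * sizeof(B); }

// Prints the stride between the addresses read by consecutive indices.
void print_layout() {
    std::printf("sizeof(A) = %zu, sizeof(B) = %zu\n", sizeof(A), sizeof(B));
    for (std::size_t i = 0; i < 4; ++i) {
        std::printf("index %zu: a[i].v at byte %zu, b[i].v at byte %zu\n",
                    i, v_offset_A(i), v_offset_B(i));
    }
}
```

Under these assumptions, consecutive indices read `v` at a 32-byte stride in the first case and a 16-byte stride in the second, and every `v` starts on a multiple of its own size.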

Please Ignore…