I have an array of structs of the following type:
typedef struct __align__(16)
{
int values[4];
} intarr4;
This is basically like the int4 type, except that I use an array of length 4 instead of four int variables. I expected each struct to be loaded with a single 128-bit load instruction. However, the PTX code shows that four 32-bit global reads are performed:
ld.global.s32 %r13, [%rd11+0]; // id:129
st.local.s32 [__cuda___cuda_nbrStruct161344+0], %r13; // id:130 __cuda___cuda_nbrStruct161344+0x0
ld.global.s32 %r14, [%rd11+4]; // id:131
st.local.s32 [__cuda___cuda_nbrStruct161344+4], %r14; // id:132 __cuda___cuda_nbrStruct161344+0x4
ld.global.s32 %r15, [%rd11+8]; // id:133
st.local.s32 [__cuda___cuda_nbrStruct161344+8], %r15; // id:134 __cuda___cuda_nbrStruct161344+0x8
ld.global.s32 %r16, [%rd11+12]; // id:135
st.local.s32 [__cuda___cuda_nbrStruct161344+12], %r16; // id:136 __cuda___cuda_nbrStruct161344+0xc
Why does this happen, and is there anything I can do to get each struct loaded in a single 128-bit global memory read?
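For comparison, here is a minimal sketch of the kind of access I would expect to compile to a single 128-bit load: the same struct read through the built-in int4 type. The kernel name copyViaInt4, the host helper roundTripOk, and the use of reinterpret_cast are assumptions made up for illustration, not taken from my actual code, and I have not confirmed the PTX this produces.

```cuda
#include <cuda_runtime.h>
#include <cstddef>
#include <vector>

typedef struct __align__(16)
{
    int values[4];
} intarr4;

// Hypothetical kernel: copies each struct by viewing it as the built-in int4.
// Both types are 16 bytes in size and 16-byte aligned, so the cast keeps the
// alignment that a 128-bit vector access requires.
__global__ void copyViaInt4(const intarr4 *in, intarr4 *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
    {
        int4 v = reinterpret_cast<const int4 *>(in)[i]; // one 16-byte read
        reinterpret_cast<int4 *>(out)[i] = v;           // one 16-byte write
    }
}

// Hypothetical host helper: fills n structs, runs the kernel, and checks
// that the copy matches the input. Returns false on any CUDA error.
bool roundTripOk(int n)
{
    std::vector<intarr4> h_in(n), h_out(n);
    for (int i = 0; i < n; ++i)
        for (int k = 0; k < 4; ++k)
            h_in[i].values[k] = 4 * i + k;

    const size_t bytes = static_cast<size_t>(n) * sizeof(intarr4);
    intarr4 *d_in = nullptr;
    intarr4 *d_out = nullptr;
    if (cudaMalloc(&d_in, bytes) != cudaSuccess)
        return false;
    if (cudaMalloc(&d_out, bytes) != cudaSuccess)
    {
        cudaFree(d_in);
        return false;
    }

    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);
    copyViaInt4<<<(n + 127) / 128, 128>>>(d_in, d_out, n);
    // This blocking copy also waits for the kernel to finish.
    cudaError_t err = cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess)
        return false;

    for (int i = 0; i < n; ++i)
        for (int k = 0; k < 4; ++k)
            if (h_out[i].values[k] != h_in[i].values[k])
                return false;
    return true;
}
```

Calling roundTripOk(256) from the host only checks that the data survives the copy. Whether the read really becomes a single vector load would still have to be checked in the generated PTX.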
Thanks.