Global load of struct with int[4]: the struct is loaded in 4 global reads

I have an array of structs of the following type:

typedef struct __align__(16)
{
  int values[4];
} intarr4;

This is basically like the int4 type, except that I use an array of length 4 instead of 4 int variables. I expected each struct to be loaded in one 128-bit load instruction. However, the PTX code shows that four 32-bit global reads are performed:

  ld.global.s32   %r13, [%rd11+0];  // id:129
  st.local.s32  [__cuda___cuda_nbrStruct161344+0], %r13;  // id:130 __cuda___cuda_nbrStruct161344+0x0
  ld.global.s32   %r14, [%rd11+4];  // id:131
  st.local.s32  [__cuda___cuda_nbrStruct161344+4], %r14;  // id:132 __cuda___cuda_nbrStruct161344+0x4
  ld.global.s32   %r15, [%rd11+8];  // id:133
  st.local.s32  [__cuda___cuda_nbrStruct161344+8], %r15;  // id:134 __cuda___cuda_nbrStruct161344+0x8
  ld.global.s32   %r16, [%rd11+12]; // id:135
  st.local.s32  [__cuda___cuda_nbrStruct161344+12], %r16; // id:136 __cuda___cuda_nbrStruct161344+0xc

Why does this happen, and is there anything I can do to get the struct loaded in one 128-bit global memory read?

Thanks.

I'm not sure, but this may be related to the fact that an array declared in thread scope is more likely to be placed in local memory than in registers, regardless of its size (your PTX stores each value with st.local). So maybe the compiler doesn't realize that the whole struct would fit in one 128-bit read.

I came across a similar issue once and started using the int4 and float4 types, which solved the problem. Why do you need your own struct anyway? If you need the indexing, a nasty hack might be to declare those structs as int4 and cast them to your type internally.
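A minimal sketch of that cast, going the other way round (keep your struct, but read it through an int4 pointer). The names load_intarr4, sum_values, sum_kernel and the HOST_DEVICE macro are made up for illustration, and alignas(16) is used here as the standard C++11 spelling of __align__(16) so the file also builds without nvcc. Whether the compiler really emits a single vector load is an assumption you should check in the PTX; the cast also sidesteps strict aliasing rules, so treat it as the hack it is.

```cpp
#include <cassert>
#include <cstdint>

#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE
// Stand-in so the sketch also builds without nvcc; CUDA provides the real int4.
struct alignas(16) int4 { int x, y, z, w; };
#endif

// Same layout as in the question: four ints, 16-byte aligned.
struct alignas(16) intarr4 { int values[4]; };

static_assert(sizeof(intarr4) == sizeof(int4), "layouts must match for the cast");

// Read the struct through an int4 pointer, then unpack it.
// Assumption: in device code the int4 read is compiled to one 128-bit load.
HOST_DEVICE inline intarr4 load_intarr4(const intarr4* p) {
    // A 128-bit access needs a 16-byte aligned address.
    assert(reinterpret_cast<std::uintptr_t>(p) % 16 == 0);
    const int4 v = *reinterpret_cast<const int4*>(p);
    intarr4 r;
    r.values[0] = v.x;
    r.values[1] = v.y;
    r.values[2] = v.z;
    r.values[3] = v.w;
    return r;
}

// Constant trip count: once the loop is unrolled every index is a
// compile-time constant, which gives the compiler a chance to keep
// the array in registers instead of local memory.
HOST_DEVICE inline int sum_values(const intarr4& a) {
    int s = 0;
    for (int i = 0; i < 4; ++i) s += a.values[i];
    return s;
}

#ifdef __CUDACC__
// Hypothetical kernel: one struct per thread.
__global__ void sum_kernel(const intarr4* in, int* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = sum_values(load_intarr4(in + i));
}
#endif
```

If the index into values is only known at run time, the array will most likely still end up in local memory, so this only helps for loops the compiler can fully unroll.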

Yes, I'm using the int4 type now. Iterating through an array instead of having to access the x, y, z, w members would simply have been the nicer solution.

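// Named struct instead of a typedef'd anonymous one, because it now has a member function.
struct __align__(16) intarr4
{
  int4 values;

  // operator[] has to be a member function in C++.
  __host__ __device__ int operator[](int index) const {
    switch (index) {
      case 0:  return values.x;
      case 1:  return values.y;
      case 2:  return values.z;
      default: return values.w;  // index 3 (also taken for out-of-range indices)
    }
  }
};

Something along these lines could work, I think.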