I ran into a data alignment issue when trying to read a 32-bit integer from global memory. I am not sure whether this is a compiler issue, a GPU architecture issue, or just my stupidity…
Basically, I have declared a union like this to consolidate the memory accesses, since the code does a lot of byte-level reads from global memory…
typedef union {
uint32_t i32;
uint16_t i16[2];
uint8_t plane[4];
} u_plane;
Now if I read a 32-bit int from global memory, depending on the alignment, I may not get the 32 bits I am expecting:
uint32_t *ip_src0 = (uint32_t *) &src0[xx];
data0.i32 = *ip_src0;
This code only seems to work when (&src0[xx] MOD 4)==0. If (&src0[xx] MOD 4)==2, I get something different. (I don’t know what I got, but it is wrong on the GPU, and it is too much work to figure out what was actually read.)
The annoying thing is that this works fine in emulation mode…
So I end up doing this instead, which runs slower but at least works correctly as long as &src0[xx] is 2-byte aligned:
uint16_t *i16p_src0 = (uint16_t *) &src0[xx];
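/* cast to 32 bits before shifting so the upper half is not shifted as a signed int */
data0.i32 = (uint32_t) i16p_src0[0] | ((uint32_t) i16p_src0[1] << 16);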
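For comparison, here is a minimal sketch of a different technique from the 16-bit workaround above: read the two aligned 32-bit words that straddle the wanted bytes and combine them with shifts. It is written as plain C so it can be checked on the host; in a kernel the function would presumably be marked `__device__`. The name `load_u32_unaligned` and its parameters are hypothetical, not from the original code. The sketch assumes that the buffer base is 4-byte aligned, that byte order is little-endian, and that the word following the one containing the first byte is readable whenever the offset is not a multiple of 4.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: return the 32-bit little-endian value that starts
   at byte offset byte_off inside a 4-byte-aligned array of words.
   Only aligned 32-bit loads are issued, whatever byte_off is. */
static uint32_t load_u32_unaligned(const uint32_t *words, size_t byte_off)
{
    size_t   idx   = byte_off >> 2;                  /* word holding the first byte */
    unsigned shift = (unsigned)(byte_off & 3u) * 8u; /* misalignment in bits */
    uint32_t lo    = words[idx];

    if (shift == 0)
        return lo;                                   /* already aligned: one load */

    /* Misaligned: the value spans words[idx] and words[idx + 1]. */
    uint32_t hi = words[idx + 1];
    return (lo >> shift) | (hi << (32u - shift));
}
```

With the words 0x44332211 and 0x88776655 stored back to back, a byte offset of 2 gives 0x66554433, which is the (MOD 4)==2 case described above. The cost is at most two aligned loads plus a few shifts; whether that beats two 16-bit loads on a given GPU is something to measure rather than assume.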
I tried to look at the PTX file but without an assembler manual, I can’t tell what the instructions are doing in detail.
Has anyone else seen this?
Spencer