Data alignment for 32-bit reads?

I ran into a data alignment issue when trying to read a 32-bit integer from global memory. I am not sure whether this is a compiler issue, a GPU architecture issue, or just my stupidity…

Basically, I have declared a union like this to consolidate the memory accesses, since the code does a lot of byte-level reads from global memory…

typedef union {

  uint32_t i32;

  uint16_t i16[2];

  uint8_t plane[4];

} u_plane;

Now if I read a 32-bit int from global memory, depending on the alignment, I may not get the 32 bits I am expecting:

uint32_t *ip_src0 = (uint32_t *) &src0[xx];

data0.i32 = *ip_src0;

This code only seems to work when (&src0[xx] MOD 4)==0. If (&src0[xx] MOD 4)==2, I get something different (I don’t know exactly what was read, but it is wrong on the GPU, and it is too much work to figure out what the hardware actually fetched).

The annoying thing is that this works fine in emulation mode…

So I end up doing this instead, which runs slower but at least works correctly:

uint16_t *i16p_src0 = (uint16_t *) &src0[xx];

data0.i32 = i16p_src0[0] + (i16p_src0[1]<<16);

I tried to look at the PTX file but without an assembler manual, I can’t tell what the instructions are doing in detail.

Has anyone else seen this?


What type is src0, and what are the values of xx?

The compiler is free to align access to uchar to 1 byte, so I hope your xx is a multiple of 4 in case src0 is uchar, right?


Hi Peter,

src0 is a uint8_t * which points to somewhere inside a memory block allocated using cudaMalloc(). xx is basically 2*thread_id.

Did I miss something in the programming guide which said 32-bit reads on 16-bit boundaries are illegal?

If the H/W does not support 32-bit reads on 16-bit alignments, I would have expected the compiler to do something like “i32 = i16[0] + (i16[1]<<16)” on its own (or generate a compile error). It does appear to support 8-bit reads on byte boundaries properly, based on an earlier but slower version of my kernel that worked (and from reading the ptx file).

Video codecs do a lot of ops at the byte level, so I can’t just rewrite the code to get rid of them.


You can read 32 bits from any alignment, but in the code above the increment of src0 is resolved by the compiler as incrementing a uint8_t pointer, which is xx * sizeof(uint8_t) bytes. So you will end up with a pointer that does not point to the beginning of a u_plane if xx isn’t a multiple of 4. The [] operator has higher precedence than the & (address-of) operator. This is true for C as it is for CUDA.


That is correct and it is what I need. ip_src is a pointer into src0 that will be either 32-bit-aligned or only 16-bit-aligned depending on xx (which is thread_id*2). u_plane is intended to be a register copy of what is at src0[xx].

I wanted to read all 32 bits in one read if possible, rather than the two 16-bit reads I am doing now (or four byte-sized reads). After this read, I operate byte by byte on the (register) copy of the data stored in u_plane.

Originally, when I ran into this problem, my code looked like this…

typedef union {

  uint32_t i32;

  uint16_t i16[2];

  uint8_t plane[4];

} u_plane;

uint8_t *src0 = a computed index (16 byte aligned) into a malloc'ed buffer;

u_plane data0;

uint32_t *ip_src0 = (uint32_t *) &src0[xx];

data0.i32 = *ip_src0;

unsigned int t0 = data0.plane[0] + data0.plane[1];

[... and so on, as the code operates on each byte...]

The ip_src pointer is pointing at the right place. I was just not getting the right 32 bits on the GPU, even though it works on the CPU.


Very strange. I did a small test.

typedef unsigned int   uint32_t;

typedef unsigned short uint16_t;

typedef unsigned char  uint8_t;

typedef union {

 uint32_t i32;

 uint16_t i16[2];

 uint8_t plane[4];

} u_plane;

__global__ void t(uint8_t *src0, uint32_t* out)

{

	int xx = threadIdx.x*2;

	u_plane data0;

	uint32_t *ip_src0 = (uint32_t *) &src0[xx];

	data0.i32 = *ip_src0;

	unsigned int t0 = data0.plane[0] + data0.plane[1];

	out[xx] = t0;

}


Which compiles to the following assembler code:

.entry t

	{

	.reg .u32 $r1,$r2,$r3,$r4,$r5,$r6,$r7,$r8,$r9,$r10,$r11,$r12,$r13;

	.param .u32 %parm_src0;

	.param .u32 %parm_out;

 #  .loc	2	11	0

 #   7   uint16_t i16[2];

 #   8   uint8_t plane[4];

 #   9  } u_plane;

 #  10  

 #  11  __global__ void t(uint8_t *src0, uint32_t* out)


 #  .loc	2	17	0

 #  13  	int xx = threadIdx.x*2;

 #  14  

 #  15  	u_plane data0;

 #  16  	uint32_t *ip_src0 = (uint32_t *) &src0[xx];

 #  17  	data0.i32 = *ip_src0;

	cvt.u32.u16  $r1, %tid.x;      #  

	ld.param.u32  $r2, %parm_src0;	#  id:19 %parm_src0+0x0

	mul.lo.u32  $r3, $r1, 2;      	#  

	add.u32  $r4, $r2, $r3;        #  

	ld.global.u32  $r5, [$r4+0];  	#  id:20

 #  .loc	2	20	0

 #  18  	unsigned int t0 = data0.plane[0] + data0.plane[1];

 #  19  

 #  20  	out[xx] = t0;

	and.u32  $r6, $r5, 255;        #  

	mov.u32  $r7, $r6;            	#  

	and.u32  $r8, $r5, 65280;      #  

	shr.u32  $r9, $r8, 8;          #  

	add.s32  $r10, $r7, $r9;      	#  

	ld.param.u32  $r11, %parm_out;	#  id:21 %parm_out+0x0

	mul.lo.u32  $r12, $r1, 8;      #  

	add.u32  $r13, $r11, $r12;    	#  

	st.global.u32  [$r13+0], $r10;	#  id:22

	exit;                          #  

	} # t

	.version 1.1

IMHO this is totally correct. The data is read into a u32 register, and then the output is assembled by masking with 0xff and adding the value masked with 0xff00 shifted right by 8. This is what I expect. The 8800 is little-endian, as is your PC processor, so this should give the same result. :blink:



I think the bug/error is not where you think it is. It is further up, at lines 16 & 17, because when I replace the “data0.i32 = *ip_src0” with two 16-bit reads, the code works perfectly.

Here is a chunk of the ptx output using both the 32-bit and the 16-bit memory fetches:

#  53      data1.i32 = *ip_src1;

	add.u32  $r30, $r27, $r23;    	#  

	add.u32  $r31, $r25, $r30;    	#  

	ld.global.u32  $r32, [$r31+0];	#  id:179

	st.local.u32  data1$4[0], $r32;	#  id:180 data1$4+0x0

#  62      data1.i32 = i16p_src1[0] + (i16p_src1[1]<<16);

	ld.global.u16  $r40, [$r31+0];	#  id:185

	ld.global.u16  $r41, [$r31+2];	#  id:186

	shl.s32  $r42, $r41, 16;      	#  

	add.s32  $r43, $r40, $r42;    	#  

They both look like they should work, because the 32-bit version loads the required 32 bits into $r32, while the 16-bit version loads two 16-bit values into $r40 & $r41 and adds them together after a 16-bit shift of one of them.

But when I run the code, the 32-bit version gives me the wrong result compared to the reference solution.

I suppose I should really sit down and write a specific test case to try to isolate this problem…


Can you please file a bug on the registered developer site if you determine it is not a problem in your own code?


Yeah. Both versions do what they are supposed to, at least at this assembler level. Looks indeed like a low-level bug to me.