Possible Alignment Issue

Hello,

I have an array of bytes in shared memory which appears to be correct looking at the values using NSIGHT parallel.

unsigned char b[16];

bytes 0-1 represent a short value
byte 2 represent a char value
byt3 3-4 represent a short value

By casting the values I am able to get the interpreted values, usually…

unsigned short val1= *((unsigned short *)&b[0]);
unsigned char val2 = b[2];

The value of val2 ends up being b[0]. I do not know how to use CUDA to interpret the bytes of a buffer.

Hello,

I have an array of bytes in shared memory which appears to be correct looking at the values using NSIGHT parallel.

unsigned char b[16];

bytes 0-1 represent a short value
byte 2 represent a char value
byt3 3-4 represent a short value

By casting the values I am able to get the interpreted values, usually…

unsigned short val1= *((unsigned short *)&b[0]);
unsigned char val2 = b[2];

The value of val2 ends up being b[0]. I do not know how to use CUDA to interpret the bytes of a buffer.

i don’t think you can actually access memory on any thing less than a 32-bit boundary.

so i’m guessing that your pointer is getting truncated down to the 32-bit boundary (the lowest 2 bits are getting ignored), and hence you’re getting b[0].

i’d rearrange the data, such that chars are arranged in memory in groups of 4, and short ints in group of 2. then you can use the vector types like char4, etc. and access the data that way.

(fyi the gpus don’t support vector arithmetic operations natively (save the usual “warp” style multiprocessing), so you won’t benefit that way, the only benefit is reducing memory pressure).

alternatively you could use bit masks and shifts, but that would probably be slower. in any case i believe memory access granularity is 32-bit (4 byte) .

i don’t think you can actually access memory on any thing less than a 32-bit boundary.

so i’m guessing that your pointer is getting truncated down to the 32-bit boundary (the lowest 2 bits are getting ignored), and hence you’re getting b[0].

i’d rearrange the data, such that chars are arranged in memory in groups of 4, and short ints in group of 2. then you can use the vector types like char4, etc. and access the data that way.

(fyi the gpus don’t support vector arithmetic operations natively (save the usual “warp” style multiprocessing), so you won’t benefit that way, the only benefit is reducing memory pressure).

alternatively you could use bit masks and shifts, but that would probably be slower. in any case i believe memory access granularity is 32-bit (4 byte) .

I tried to compile the following code:

__shared__ unsigned char b[16];

	if (threadIdx.x<16)

		b[threadIdx.x]=unsigned char(threadIdx.x);

	__syncthreads();

	unsigned short val1= *((unsigned short *)&b[0]);

	unsigned char val2=b[2];

	__syncthreads();

However it failed with a message:

“Error: Unaligned memory accesses not supported”

The thing is, that when you declare a shared char[b]; there is no guarantee that it is going to be aligned to anything more than the __alignof(char). It could happen, that the address of your array b is for example 0x00000101. If that happens, your assignment to val1 will be invalid.

Try the following:

__shared__ __align__(2) unsigned char b[16];

	if (threadIdx.x<16)

		b[threadIdx.x]=unsigned char(threadIdx.x+1);

	__syncthreads();

	unsigned short val1= *((unsigned short *)&b[0]);

	unsigned char val2=b[2];

	__syncthreads();

When I ordered the GPU to then print the values val1 and val2, all threads reported:

val1=513, val2=3

which is the correct result.

Note, I am using 1.3 device, but I believe it should work for 2.0 as well.

I am not using NSIGHT.

I tried to compile the following code:

__shared__ unsigned char b[16];

	if (threadIdx.x<16)

		b[threadIdx.x]=unsigned char(threadIdx.x);

	__syncthreads();

	unsigned short val1= *((unsigned short *)&b[0]);

	unsigned char val2=b[2];

	__syncthreads();

However it failed with a message:

“Error: Unaligned memory accesses not supported”

The thing is, that when you declare a shared char[b]; there is no guarantee that it is going to be aligned to anything more than the __alignof(char). It could happen, that the address of your array b is for example 0x00000101. If that happens, your assignment to val1 will be invalid.

Try the following:

__shared__ __align__(2) unsigned char b[16];

	if (threadIdx.x<16)

		b[threadIdx.x]=unsigned char(threadIdx.x+1);

	__syncthreads();

	unsigned short val1= *((unsigned short *)&b[0]);

	unsigned char val2=b[2];

	__syncthreads();

When I ordered the GPU to then print the values val1 and val2, all threads reported:

val1=513, val2=3

which is the correct result.

Note, I am using 1.3 device, but I believe it should work for 2.0 as well.

I am not using NSIGHT.