Shared Memory (Unaligned Memory access)

hamada989 · December 21, 2009, 1:27pm

Hi friends,

If any one could help me out here it would be appreciated:

I have a code that looks like this:

 __shared__ BYTE sh[16];
 sh[0] =  PT[idx * 16 + 0];    //PT is of type BYTE ///idx is the id of the thread // trying to copy data from global memory to shared memory
 sh[1] = PT[idx * 16 + 1];
 sh[2] = PT[idx * 16 + 2];
 sh[3] = PT[idx * 16 + 3];
 sh[4] = PT[idx * 16 + 4];
 sh[5] = PT[idx * 16 + 5];
 sh[6] = PT[idx * 16 + 6];
 sh[7] = PT[idx * 16 + 7];
 sh[8] = PT[idx * 16 + 8];
 sh[9] = PT[idx * 16 + 9];
 sh[10] = PT[idx * 16 + 10];
 sh[11] = PT[idx * 16 + 11];
 sh[12] = PT[idx * 16 + 12];
 sh[13] = PT[idx * 16 + 13];
 sh[14] = PT[idx * 16 + 14];
 sh[15] = PT[idx * 16 + 15];

 R3 = K[3] ^ BSWAP(((u32*)sh)[3]);
 R2 = K[2] ^ BSWAP(((u32*)sh)[2]);
 R1 = K[1] ^ BSWAP(((u32*)sh)[1]);
 R0 = K[0] ^ BSWAP(((u32*)sh)[0]);

 ........(some processing)........

 ((u32*)sh)[3] = BSWAP(R1 ^ K[7]);
 ((u32*)sh)[2] = BSWAP(R0 ^ K[6]);
 ((u32*)sh)[1] = BSWAP(R3 ^ K[5]);
 ((u32*)sh)[0] = BSWAP(R2 ^ K[4]);

 PT[idx * 16 + 0] = sh[0];
 PT[idx * 16 + 1] = sh[1];
 PT[idx * 16 + 2] = sh[2];
 PT[idx * 16 + 3] = sh[3];
 PT[idx * 16 + 4] = sh[4];
 PT[idx * 16 + 5] = sh[5];
 PT[idx * 16 + 6] = sh[6];
 PT[idx * 16 + 7] = sh[7];
 PT[idx * 16 + 8] = sh[8];
 PT[idx * 16 + 9] = sh[9];
 PT[idx * 16 + 10] = sh[10];
 PT[idx * 16 + 11] = sh[11];
 PT[idx * 16 + 12] = sh[12];
 PT[idx * 16 + 13] = sh[13];
 PT[idx * 16 + 14] = sh[14];
 PT[idx * 16 + 15] = sh[15];

When I compile, I get “Unaligned Memory access”

I know it’s because of casting shared memory to u32, and I tried declaring shared memory as u32 from the beginning but I just get wrong results. Any work around this problem??? More specifically can I cast shared memory in any other way such that I get the desired output???

LSChien · December 21, 2009, 2:49pm

I have some question about your kernel, you have race condition on following code

__shared__ BYTE sh[16];

sh[0] = PT[idx * 16 + 0]; //PT is of type BYTE ///idx is the id of the thread // trying to copy data from global memory to shared memory

sh[1] = PT[idx * 16 + 1];

sh[2] = PT[idx * 16 + 2];

sh[3] = PT[idx * 16 + 3];

sh[4] = PT[idx * 16 + 4];

sh[5] = PT[idx * 16 + 5];

sh[6] = PT[idx * 16 + 6];

sh[7] = PT[idx * 16 + 7];

sh[8] = PT[idx * 16 + 8];

sh[9] = PT[idx * 16 + 9];

sh[10] = PT[idx * 16 + 10];

sh[11] = PT[idx * 16 + 11];

sh[12] = PT[idx * 16 + 12];

sh[13] = PT[idx * 16 + 13];

sh[14] = PT[idx * 16 + 14];

sh[15] = PT[idx * 16 + 15];

because every thread writes to the same shared memory.

hamada989 · December 22, 2009, 7:45am

I have some question about your kernel, you have race condition on following code

__shared__ BYTE sh[16];

sh[0] = PT[idx * 16 + 0]; //PT is of type BYTE ///idx is the id of the thread // trying to copy data from global memory to shared memory

sh[1] = PT[idx * 16 + 1];

sh[2] = PT[idx * 16 + 2];

sh[3] = PT[idx * 16 + 3];

sh[4] = PT[idx * 16 + 4];

sh[5] = PT[idx * 16 + 5];

sh[6] = PT[idx * 16 + 6];

sh[7] = PT[idx * 16 + 7];

sh[8] = PT[idx * 16 + 8];

sh[9] = PT[idx * 16 + 9];

sh[10] = PT[idx * 16 + 10];

sh[11] = PT[idx * 16 + 11];

sh[12] = PT[idx * 16 + 12];

sh[13] = PT[idx * 16 + 13];

sh[14] = PT[idx * 16 + 14];

sh[15] = PT[idx * 16 + 15];

because every thread writes to the same shared memory.

thanks for replying,

But I declare the shared memory inside the kernal, so every thread has it’s own 16 byte array shared memory.

Sarnath · December 22, 2009, 8:26am

Shared memory is shared by all threads…inside the block…

If you remove the shared prefix, then it becomes local memory to each thread… Since u ue constant indices to acces, they will be translated to registers…

hamada989 · December 22, 2009, 9:17am

So what you are saying is … If I remove the shared prefix from BYTE sh[16] it will reside on the registers. Doesn’t that contradict with :

If you declare a variable inside the kernel without an identifier it resides in registers except arrays. (Documentation)

Will it only reside on registers just because “constant indices to access” ??

Thanks again

LSChien · December 22, 2009, 9:42am

if you declare “BYTE sh[16]”

BYTE sh[16];

sh[0] = PT[idx * 16 + 0]; //PT is of type BYTE ///idx is the id of the thread // trying to copy data from global memory to shared memory

sh[1] = PT[idx * 16 + 1];

sh[2] = PT[idx * 16 + 2];

sh[3] = PT[idx * 16 + 3];

sh[4] = PT[idx * 16 + 4];

sh[5] = PT[idx * 16 + 5];

sh[6] = PT[idx * 16 + 6];

sh[7] = PT[idx * 16 + 7];

sh[8] = PT[idx * 16 + 8];

sh[9] = PT[idx * 16 + 9];

sh[10] = PT[idx * 16 + 10];

sh[11] = PT[idx * 16 + 11];

sh[12] = PT[idx * 16 + 12];

sh[13] = PT[idx * 16 + 13];

sh[14] = PT[idx * 16 + 14];

sh[15] = PT[idx * 16 + 15];

then compiler can map sh[i] to i-th register, then transalte the code to

mov r0  PT[idx * 16 + 0]

mov r1  PT[idx * 16 + 1]

....

However if your code is

BYTE sh[16];

sh[ idx ] = PT[idx * 16 + 1];

then compiler does not know what is idx, then register binding is impossible,

hence compiler would put “BYTE sh[16]” into local memory and translate the code to

mov  r0  PT[idx * 16 + 1]; 

mov  sh[ idx ] r0;

hamada989 · December 22, 2009, 9:51am

if you declare “BYTE sh[16]”

BYTE sh[16];

sh[0] = PT[idx * 16 + 0]; //PT is of type BYTE ///idx is the id of the thread // trying to copy data from global memory to shared memory

sh[1] = PT[idx * 16 + 1];

sh[2] = PT[idx * 16 + 2];

sh[3] = PT[idx * 16 + 3];

sh[4] = PT[idx * 16 + 4];

sh[5] = PT[idx * 16 + 5];

sh[6] = PT[idx * 16 + 6];

sh[7] = PT[idx * 16 + 7];

sh[8] = PT[idx * 16 + 8];

sh[9] = PT[idx * 16 + 9];

sh[10] = PT[idx * 16 + 10];

sh[11] = PT[idx * 16 + 11];

sh[12] = PT[idx * 16 + 12];

sh[13] = PT[idx * 16 + 13];

sh[14] = PT[idx * 16 + 14];

sh[15] = PT[idx * 16 + 15];

then compiler can map sh[i] to i-th register, then transalte the code to

mov r0  PT[idx * 16 + 0]

mov r1  PT[idx * 16 + 1]

....

However if your code is

BYTE sh[16];

sh[ idx ] = PT[idx * 16 + 1];

then compiler does not know what is idx, then register binding is impossible,

hence compiler would put “BYTE sh[16]” into local memory and translate the code to

mov  r0  PT[idx * 16 + 1]; 

mov  sh[ idx ] r0;

Thank you so much Mr LSChien. Your reply has been so informative :)

Thank you too Mr Sarnath

Topic		Replies	Views
Error: "unaligned memory accesses not supported"? CUDA Programming and Performance	6	11204	April 2, 2009
Memory Access CUDA Programming and Performance	2	1147	June 22, 2009
Unaligned memory access not supported CUDA Programming and Performance	12	3227	September 8, 2010
Shared memory coherance problem within a thread CUDA Programming and Performance	3	816	November 17, 2010
Using Shared Memory in CUDA C/C++ Technical Blog	36	2299	October 8, 2020
pointer to shared memory compiler problems CUDA Programming and Performance	19	14843	June 7, 2008
Shared memory question CUDA Programming and Performance	27	7675	June 23, 2008
COMPUTE-SANITIZER " Invalid __shared__ write of size 16 bytes " CUDA Programming and Performance	3	1278	August 17, 2022
shared memory wrong allocation? CUDA Programming and Performance	2	880	July 29, 2009
Shared Memory Again What is happening Here CUDA Programming and Performance	11	2271	June 5, 2009

Shared Memory (Unaligned Memory access)

Related topics