Shared Memory (Unaligned Memory access)

Hi friends,

If any one could help me out here it would be appreciated:

I have a code that looks like this:

 __shared__ BYTE sh[16];
 sh[0] =  PT[idx * 16 + 0];    //PT is of type BYTE ///idx is the id of the thread // trying to copy data from global memory to shared memory
 sh[1] = PT[idx * 16 + 1];
 sh[2] = PT[idx * 16 + 2];
 sh[3] = PT[idx * 16 + 3];
 sh[4] = PT[idx * 16 + 4];
 sh[5] = PT[idx * 16 + 5];
 sh[6] = PT[idx * 16 + 6];
 sh[7] = PT[idx * 16 + 7];
 sh[8] = PT[idx * 16 + 8];
 sh[9] = PT[idx * 16 + 9];
 sh[10] = PT[idx * 16 + 10];
 sh[11] = PT[idx * 16 + 11];
 sh[12] = PT[idx * 16 + 12];
 sh[13] = PT[idx * 16 + 13];
 sh[14] = PT[idx * 16 + 14];
 sh[15] = PT[idx * 16 + 15];

 R3 = K[3] ^ BSWAP(((u32*)sh)[3]);
 R2 = K[2] ^ BSWAP(((u32*)sh)[2]);
 R1 = K[1] ^ BSWAP(((u32*)sh)[1]);
 R0 = K[0] ^ BSWAP(((u32*)sh)[0]);

 ........(some processing)........

 ((u32*)sh)[3] = BSWAP(R1 ^ K[7]);
 ((u32*)sh)[2] = BSWAP(R0 ^ K[6]);
 ((u32*)sh)[1] = BSWAP(R3 ^ K[5]);
 ((u32*)sh)[0] = BSWAP(R2 ^ K[4]);

 PT[idx * 16 + 0] = sh[0];
 PT[idx * 16 + 1] = sh[1];
 PT[idx * 16 + 2] = sh[2];
 PT[idx * 16 + 3] = sh[3];
 PT[idx * 16 + 4] = sh[4];
 PT[idx * 16 + 5] = sh[5];
 PT[idx * 16 + 6] = sh[6];
 PT[idx * 16 + 7] = sh[7];
 PT[idx * 16 + 8] = sh[8];
 PT[idx * 16 + 9] = sh[9];
 PT[idx * 16 + 10] = sh[10];
 PT[idx * 16 + 11] = sh[11];
 PT[idx * 16 + 12] = sh[12];
 PT[idx * 16 + 13] = sh[13];
 PT[idx * 16 + 14] = sh[14];
 PT[idx * 16 + 15] = sh[15];

When I compile, I get “Unaligned Memory access”

I know it’s because of casting shared memory to u32, and I tried declaring shared memory as u32 from the beginning but I just get wrong results. Any work around this problem??? More specifically can I cast shared memory in any other way such that I get the desired output???

I have some question about your kernel, you have race condition on following code

__shared__ BYTE sh[16];

sh[0] = PT[idx * 16 + 0]; //PT is of type BYTE ///idx is the id of the thread // trying to copy data from global memory to shared memory

sh[1] = PT[idx * 16 + 1];

sh[2] = PT[idx * 16 + 2];

sh[3] = PT[idx * 16 + 3];

sh[4] = PT[idx * 16 + 4];

sh[5] = PT[idx * 16 + 5];

sh[6] = PT[idx * 16 + 6];

sh[7] = PT[idx * 16 + 7];

sh[8] = PT[idx * 16 + 8];

sh[9] = PT[idx * 16 + 9];

sh[10] = PT[idx * 16 + 10];

sh[11] = PT[idx * 16 + 11];

sh[12] = PT[idx * 16 + 12];

sh[13] = PT[idx * 16 + 13];

sh[14] = PT[idx * 16 + 14];

sh[15] = PT[idx * 16 + 15];

because every thread writes to the same shared memory.

thanks for replying,

But I declare the shared memory inside the kernal, so every thread has it’s own 16 byte array shared memory.

Shared memory is shared by all threads…inside the block…

If you remove the shared prefix, then it becomes local memory to each thread… Since u ue constant indices to acces, they will be translated to registers…

So what you are saying is … If I remove the shared prefix from BYTE sh[16] it will reside on the registers. Doesn’t that contradict with :

If you declare a variable inside the kernel without an identifier it resides in registers except arrays. (Documentation)

Will it only reside on registers just because “constant indices to access” ??

Thanks again

if you declare “BYTE sh[16]”

BYTE sh[16];

sh[0] = PT[idx * 16 + 0]; //PT is of type BYTE ///idx is the id of the thread // trying to copy data from global memory to shared memory

sh[1] = PT[idx * 16 + 1];

sh[2] = PT[idx * 16 + 2];

sh[3] = PT[idx * 16 + 3];

sh[4] = PT[idx * 16 + 4];

sh[5] = PT[idx * 16 + 5];

sh[6] = PT[idx * 16 + 6];

sh[7] = PT[idx * 16 + 7];

sh[8] = PT[idx * 16 + 8];

sh[9] = PT[idx * 16 + 9];

sh[10] = PT[idx * 16 + 10];

sh[11] = PT[idx * 16 + 11];

sh[12] = PT[idx * 16 + 12];

sh[13] = PT[idx * 16 + 13];

sh[14] = PT[idx * 16 + 14];

sh[15] = PT[idx * 16 + 15];

then compiler can map sh[i] to i-th register, then transalte the code to

mov r0  PT[idx * 16 + 0]

mov r1  PT[idx * 16 + 1]

....

However if your code is

BYTE sh[16];

sh[ idx ] = PT[idx * 16 + 1];

then compiler does not know what is idx, then register binding is impossible,

hence compiler would put “BYTE sh[16]” into local memory and translate the code to

mov  r0  PT[idx * 16 + 1]; 

mov  sh[ idx ] r0;

Thank you so much Mr LSChien. Your reply has been so informative :)

Thank you too Mr Sarnath