Trouble with memory coalescing in an AES implementation Also, register usage is higher than expected

I’m currently attempting a naive implementation of AES so as to learn more about CUDA and cryptography in general, but I keep hitting a brick wall when it comes to a particular section of the main kernel, listed below.

__global__ void encrypt128( char *file, int file_size, uchar4 *round_keys, uint8_t *sbox )

{

	__shared__ uchar4  sh_round_keys[44 * sizeof(uchar4)];

	__shared__ uint8_t	 sh_sbox[256 * sizeof(uint8_t)];

	

	int id = NUM_THREADS * blockIdx.x + threadIdx.x;

	

	for (int i = 0; i < 44; ++i) {

		sh_round_keys[i].x = round_keys[i].x;

		sh_round_keys[i].y = round_keys[i].y;

		sh_round_keys[i].z = round_keys[i].z;

		sh_round_keys[i].w = round_keys[i].w;

	}

	// This forces the number of threads per block to be 256

	sh_sbox[threadIdx.x] = sbox[threadIdx.x];

	

	__syncthreads();

	

	if (id < file_size / 16) {

		uchar4 state[4];

		

		state[0].x = file[16 * id +  0];

		state[0].y = file[16 * id +  1];

		state[0].z = file[16 * id +  2];

		state[0].w = file[16 * id +  3];

		

		state[1].x = file[16 * id +  4];

		state[1].y = file[16 * id +  5];

		state[1].z = file[16 * id +  6];

		state[1].w = file[16 * id +  7];

		

		state[2].x = file[16 * id +  8];

		state[2].y = file[16 * id +  9];

		state[2].z = file[16 * id + 10];

		state[2].w = file[16 * id + 11];

		

		state[3].x = file[16 * id + 12];

		state[3].y = file[16 * id + 13];

		state[3].z = file[16 * id + 14];

		state[3].w = file[16 * id + 15];

		

		encryptBlock128(state, sh_round_keys, sh_sbox);

		

		file[16 * id +  0] = state[0].x;

		file[16 * id +  1] = state[0].y;

		file[16 * id +  2] = state[0].z;

		file[16 * id +  3] = state[0].w;

		

		file[16 * id +  4] = state[1].x;

		file[16 * id +  5] = state[1].y;

		file[16 * id +  6] = state[1].z;

		file[16 * id +  7] = state[1].w;

		

		file[16 * id +  8] = state[2].x;

		file[16 * id +  9] = state[2].y;

		file[16 * id + 10] = state[2].z;

		file[16 * id + 11] = state[2].w;

		

		file[16 * id + 12] = state[3].x;

		file[16 * id + 13] = state[3].y;

		file[16 * id + 14] = state[3].z;

		file[16 * id + 15] = state[3].w;

	}

}

At first, I was almost certain that this kernel would read from the global memory (particularly file) in a coalesced fashion, since it is a big chunk of memory that is read sequentially, without skipping or crossing anything (as depicted in the programming guide). Thus, I was taken aback when I saw that there was not a single coalesced memory operation - either read or write - which most likely is significantly slowing down the kernel.

I’m using a 9800M GS, which has compute capability 1.1, and CUDA 2.3. According to the programming guide I would need to read/write 4-byte words, adding up to 64 bytes per half-warp, in order to enjoy the benefits of memory coalescing. In AES 128-bits each state has 16 bytes, so I thought that wouldn’t be a problem, and tweaked the kernel to read 4 ints instead of 16 chars, but to no avail. How do I go about fixing this?

Also, I checked the visual profiler, and it’s giving me a lousy occupancy of 33%, with 24 registers being used. Apparently, 19-20 of those are used solely in the updating of file. Why is that so? I’m quite new to programming, and this makes no sense at all to me.

Finally, compilation of the code (~800 lines) takes hideously long (> 10 minutes), but if I comment out the 16 lines below encryptBlock128 and decryptBlock128 (precisely the same thing), it takes at most 10 seconds. Again, I can’t help but wonder what sort of deep optimization the compiler is doing that takes so much time. Could anyone shed some light on this?

In addition to that, I have two basic unrelated questions, and would be very grateful if anyone would be so kind as to answer them:

In that for loop, is the memory read once per block or per thread? The round keys weight in at 176 bytes, so I don’t believe I can load them into the shared memory in the same fashion as I load the sbox (256 elements - exactly the number of threads per block). I tried padding them, since nothing beyond 176 bytes is ever read, but I’m not sure if that’s a good way to do it. How should I deal with this?

I’m also worried about bank conflicts, specially because I don’t understand them well enough. Are there any on this kernel?

Thank you very much for your time!

I modify your code (see following) such that I can compile it

#include <cuda_runtime_api.h>

	#include <cutil.h>

	#include <cutil_inline.h>

	 

#define NUM_THREADS  32	 

	 

__device__ void  encryptBlock128(uchar4 *state, uchar4 *sh_round_keys )

{

#pragma unroll

	for( int i = 0; i < 4; i++){

		state[i].x += sh_round_keys[i].x;

	  state[i].x += sh_round_keys[i].y;

	  state[i].x += sh_round_keys[i].z;

	  state[i].x += sh_round_keys[i].w;

  }

}

__global__ void encrypt128( char *file, int file_size, uchar4 *round_keys)

{

	__shared__ uchar4  sh_round_keys[44 * sizeof(uchar4)];

  //  __shared__ uint8_t	 sh_sbox[256 * sizeof(uint8_t)];

	

	int id = NUM_THREADS * blockIdx.x + threadIdx.x;

	

	for (int i = 0; i < 44; ++i) {

		sh_round_keys[i].x = round_keys[i].x;

		sh_round_keys[i].y = round_keys[i].y;

		sh_round_keys[i].z = round_keys[i].z;

		sh_round_keys[i].w = round_keys[i].w;

	}

	// This forces the number of threads per block to be 256

//	sh_sbox[threadIdx.x] = sbox[threadIdx.x];

	

	__syncthreads();

	

 //   if (id < file_size / 16) {

		uchar4 state[4];

		

		state[0].x = file[16 * id +  0];

		state[0].y = file[16 * id +  1];

		state[0].z = file[16 * id +  2];

		state[0].w = file[16 * id +  3];

		

		state[1].x = file[16 * id +  4];

		state[1].y = file[16 * id +  5];

		state[1].z = file[16 * id +  6];

		state[1].w = file[16 * id +  7];

		

		state[2].x = file[16 * id +  8];

		state[2].y = file[16 * id +  9];

		state[2].z = file[16 * id + 10];

		state[2].w = file[16 * id + 11];

		

		state[3].x = file[16 * id + 12];

		state[3].y = file[16 * id + 13];

		state[3].z = file[16 * id + 14];

		state[3].w = file[16 * id + 15];

		

		__syncthreads();

	//	encryptBlock128(state, sh_round_keys, sh_sbox);

		encryptBlock128(state, sh_round_keys);

		__syncthreads();

		file[16 * id +  0] = state[0].x;

		file[16 * id +  1] = state[0].y;

		file[16 * id +  2] = state[0].z;

		file[16 * id +  3] = state[0].w;

		

		file[16 * id +  4] = state[1].x;

		file[16 * id +  5] = state[1].y;

		file[16 * id +  6] = state[1].z;

		file[16 * id +  7] = state[1].w;

		

		file[16 * id +  8] = state[2].x;

		file[16 * id +  9] = state[2].y;

		file[16 * id + 10] = state[2].z;

		file[16 * id + 11] = state[2].w;

		

		file[16 * id + 12] = state[3].x;

		file[16 * id + 13] = state[3].y;

		file[16 * id + 14] = state[3].z;

		file[16 * id + 15] = state[3].w;

 //   }

}

If you use decuda to deassembly the .cubin file, then

int id = NUM_THREADS * blockIdx.x + threadIdx.x;

	...

		uchar4 state[4];

		

		state[0].x = file[16 * id +  0];

		state[0].y = file[16 * id +  1];

		state[0].z = file[16 * id +  2];

		state[0].w = file[16 * id +  3];

		

		state[1].x = file[16 * id +  4];

		state[1].y = file[16 * id +  5];

		state[1].z = file[16 * id +  6];

		state[1].w = file[16 * id +  7];

		

		state[2].x = file[16 * id +  8];

		state[2].y = file[16 * id +  9];

		state[2].z = file[16 * id + 10];

		state[2].w = file[16 * id + 11];

		

		state[3].x = file[16 * id + 12];

		state[3].y = file[16 * id + 13];

		state[3].z = file[16 * id + 14];

		state[3].w = file[16 * id + 15];

would be translated into

cvt.u32.u16 $r0, $r0.lo		// r0 <-- threadIdx.x

mad24.lo.u32.u16.u16 $r0, s[0x000c], 0x0020  // r0 <-- id = NUM_THREADS * blockIdx.x + threadIdx.x

shl.u32 $r0, $r0, 0x00000004   // r0 <-- 16 * id

add.u32 $r16, s[0x0010], $r0   // r16 <-- &file[16*id] 

mov.s8 $r15, g[$r16]		   // r15 = state[0].x  <-- file[16 * id +  0];

add.b32 $r0, $r16, 0x00000001  // r0 <--  &file[16*id+1] 

mov.s8 $r14, g[$r0]			// r14 = state[0].y  <-- file[16 * id +  0];

add.b32 $r0, $r16, 0x00000002  // r0 <--  &file[16*id+2] 

mov.s8 $r13, g[$r0]			// r13 = state[0].z  <-- file[16 * id +  0];

add.b32 $r0, $r16, 0x00000003  // r0 <--  &file[16*id+3] 

mov.s8 $r12, g[$r0]			// r12 = state[0].w  <-- file[16 * id +  0];

add.b32 $r0, $r16, 0x00000004

mov.s8 $r0, g[$r0]

add.b32 $r1, $r16, 0x00000005

mov.s8 $r1, g[$r1]

add.b32 $r2, $r16, 0x00000006

mov.s8 $r2, g[$r2]

add.b32 $r3, $r16, 0x00000007

mov.s8 $r3, g[$r3]

add.b32 $r4, $r16, 0x00000008

mov.s8 $r4, g[$r4]

add.b32 $r5, $r16, 0x00000009

mov.s8 $r5, g[$r5]

add.b32 $r6, $r16, 0x0000000a

mov.s8 $r6, g[$r6]

add.b32 $r7, $r16, 0x0000000b

mov.s8 $r7, g[$r7]

add.b32 $r8, $r16, 0x0000000c

mov.s8 $r8, g[$r8]

add.b32 $r9, $r16, 0x0000000d

mov.s8 $r9, g[$r9]

add.b32 $r10, $r16, 0x0000000e

mov.s8 $r10, g[$r10]

add.b32 $r11, $r16, 0x0000000f

mov.s8 $r11, g[$r11]

Hence your code does not have coalesced property.

it is good to hear

because I have no idea about algorithm of AES and I have no information about

device function “encryptBlock128”, hence I don’t know if your algorithm is computational-bound or not.

Sorry, I should have given a quick overview of the algorithm. encryptBlock128 calls 4 kernels about 10 times each:

    addRoundKey128 XORs the state (uchar4[4]) with one of the round_keys (uchar4[4], 11 of them in total) - 16 sequential reads from the shared memory.

    subBytes128 is a table lookup that substitutes each byte from the state with another from the sbox (uint8_t[256]) - 16 potentially non-sequential reads from the shared memory.

    shiftRows128 shifts the rows from the state 1, 2, and 3 elements to the left - doesn’t read from the shared memory.

    mixColumns128 can be seen as a matrix multiplication in the Galois field - doesn’t read from the shared memory.

Given the number of operations in each of these kernels, I thought the algorithm would be computationally-bound, but removing the global memory access greatly speeds up its execution (I believe it’s at least 6x faster). Therefore, I’m worried about the effects of non-coalesced memory access.

I shall post the entire code once I get it to decrypt correctly!

I have two comments

  1. you say “I thought the algorithm would be computationally-bound”, however

if we define

R = total time of read/write of global memory and

C = total time of computation, then

from

we have C/(R+C) = 1/6, it means that R = 5*C, this number is large, I suspect that your algorithm is computational-bound.

  1. in device function “encryptBlock128”, you must take care read/write of shared memory, avoid bank-conflict.

As promised, here is a first draft of the implementation. It should compile and work relatively well.

Sorry if I don’t follow you, but isn’t that why the algorithm is memory-bound, namely that accessing global memory takes so much more time than doing the calculations?

I’m worried about the sbox, since I can’t guarantee that there won’t be any bank conflicts as access to it is pretty much random. Should I spread out the elements (i.e. one per bank)?

One thing just to check (looking at the generated PTX): nvcc aggressively optimizes away code blocks which have no effect. If you remove a global memory write, it is possible for the compiler to remove all the code that write depended on.

To test the effect of writes on performance, I instead do the write twice in a row and time the kernel again.