I’m currently attempting a naive implementation of AES to learn more about CUDA and cryptography in general, but I keep hitting a brick wall with one particular section of the main kernel, listed below.
__global__ void encrypt128( char *file, int file_size, uchar4 *round_keys, uint8_t *sbox )
{
    // Sizes are element counts, not byte counts: 44 round-key words
    // and a 256-entry sbox.
    __shared__ uchar4 sh_round_keys[44];
    __shared__ uint8_t sh_sbox[256];

    int id = NUM_THREADS * blockIdx.x + threadIdx.x;

    for (int i = 0; i < 44; ++i)
        sh_round_keys[i] = round_keys[i];

    // This forces the number of threads per block to be 256
    sh_sbox[threadIdx.x] = sbox[threadIdx.x];
    __syncthreads();

    if (id < file_size / 16) {
        uchar4 state[4];
        state[0].x = file[16 * id + 0];
        state[0].y = file[16 * id + 1];
        state[0].z = file[16 * id + 2];
        state[0].w = file[16 * id + 3];
        state[1].x = file[16 * id + 4];
        state[1].y = file[16 * id + 5];
        state[1].z = file[16 * id + 6];
        state[1].w = file[16 * id + 7];
        state[2].x = file[16 * id + 8];
        state[2].y = file[16 * id + 9];
        state[2].z = file[16 * id + 10];
        state[2].w = file[16 * id + 11];
        state[3].x = file[16 * id + 12];
        state[3].y = file[16 * id + 13];
        state[3].z = file[16 * id + 14];
        state[3].w = file[16 * id + 15];

        encryptBlock128(state, sh_round_keys, sh_sbox);

        file[16 * id + 0] = state[0].x;
        file[16 * id + 1] = state[0].y;
        file[16 * id + 2] = state[0].z;
        file[16 * id + 3] = state[0].w;
        file[16 * id + 4] = state[1].x;
        file[16 * id + 5] = state[1].y;
        file[16 * id + 6] = state[1].z;
        file[16 * id + 7] = state[1].w;
        file[16 * id + 8] = state[2].x;
        file[16 * id + 9] = state[2].y;
        file[16 * id + 10] = state[2].z;
        file[16 * id + 11] = state[2].w;
        file[16 * id + 12] = state[3].x;
        file[16 * id + 13] = state[3].y;
        file[16 * id + 14] = state[3].z;
        file[16 * id + 15] = state[3].w;
    }
}
At first, I was almost certain that this kernel would read global memory (particularly file) in a coalesced fashion, since it is one big chunk of memory read sequentially, without skipping or straddling anything (as depicted in the programming guide). So I was taken aback when I saw that there was not a single coalesced memory operation, read or write, which is most likely slowing the kernel down significantly.
I’m using a 9800M GS, which has compute capability 1.1, and CUDA 2.3. According to the programming guide, I would need to read/write 4-byte words, adding up to 64 bytes per half-warp, in order to enjoy the benefits of memory coalescing. In AES-128 each state is 16 bytes, so I thought that wouldn’t be a problem, and tweaked the kernel to read 4 ints instead of 16 chars, but to no avail. How do I go about fixing this?
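For reference, this is the 16-byte-word variant I have in mind; if I’m reading the guide’s 1.x coalescing rules right (4-, 8-, or 16-byte words, with thread k accessing word k of an aligned segment), one aligned uint4 per thread should qualify, unlike my 4-ints-per-thread attempt, whose accesses are strided 16 bytes apart:

```cuda
// Sketch: one 16-byte read per thread (assumes file is 16-byte aligned
// and file_size is a multiple of 16).
uint4 in = ((uint4 *)file)[id];   // thread k reads the k-th 16-byte word

uchar4 state[4];
state[0] = *(uchar4 *)&in.x;      // bytes 0-3, assuming little-endian order
state[1] = *(uchar4 *)&in.y;      // bytes 4-7
state[2] = *(uchar4 *)&in.z;      // bytes 8-11
state[3] = *(uchar4 *)&in.w;      // bytes 12-15
```

Is that the right reading of the guide, or do 16-byte accesses not coalesce the way 4-byte ones do?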
Also, I checked the Visual Profiler, and it’s giving me a lousy occupancy of 33%, with 24 registers being used. Apparently, 19-20 of those are used solely for updating file. Why is that so? I’m quite new to programming, and this makes no sense at all to me.
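In case it matters, one thing I’m considering trying is packing the state back into a single 16-byte store, so the compiler only has one address to keep live instead of sixteen separate byte addresses. A sketch, assuming file is 16-byte aligned and relying on the GPU’s little-endian byte order:

```cuda
// Sketch: pack the four state columns into one uint4 and write it back
// with a single 16-byte store instead of sixteen 1-byte ones.
uint4 out;
out.x = state[0].x | (state[0].y << 8) | (state[0].z << 16) | ((unsigned)state[0].w << 24);
out.y = state[1].x | (state[1].y << 8) | (state[1].z << 16) | ((unsigned)state[1].w << 24);
out.z = state[2].x | (state[2].y << 8) | (state[2].z << 16) | ((unsigned)state[2].w << 24);
out.w = state[3].x | (state[3].y << 8) | (state[3].z << 16) | ((unsigned)state[3].w << 24);
((uint4 *)file)[id] = out;   // one aligned 16-byte transaction per thread
```

Would that actually bring the register count down, or does the compiler see through it?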
Finally, compilation of the code (~800 lines) takes hideously long (> 10 minutes), but if I comment out the 16 lines below encryptBlock128 and decryptBlock128 (precisely the same thing), it takes at most 10 seconds. Again, I can’t help but wonder what sort of deep optimization the compiler is doing that takes so much time. Could anyone shed some light on this?
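My working guess is that the long build comes from nvcc fully inlining and unrolling the round loops at every call site. If that’s the case, keeping the loop rolled might cap it; something along these lines (hypothetical, the helper names are just placeholders for my actual round steps):

```cuda
// Sketch: ask the compiler not to unroll the main round loop, trading a
// little runtime indexing for much less generated code for ptxas to chew on.
__device__ void encryptBlock128(uchar4 *state, uchar4 *rk, uint8_t *sbox)
{
    addRoundKey(state, &rk[0]);           // hypothetical helper names
    #pragma unroll 1                      // keep this loop rolled
    for (int round = 1; round <= 10; ++round) {
        subBytes(state, sbox);
        shiftRows(state);
        if (round < 10) mixColumns(state); // final round skips MixColumns
        addRoundKey(state, &rk[4 * round]);
    }
}
```

Is #pragma unroll 1 honored on CUDA 2.3, or is there a better way to rein this in?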
In addition to that, I have two basic unrelated questions, and would be very grateful if anyone would be so kind as to answer them:
In that for loop, is the memory read once per block or once per thread? The round keys weigh in at 176 bytes, so I don’t believe I can load them into shared memory the same way I load the sbox (256 elements, exactly the number of threads per block). I tried padding them, since nothing beyond 176 bytes is ever read, but I’m not sure that’s a good way to do it. How should I deal with this?
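The alternative I’ve been toying with is letting only the first 44 threads copy one entry each, instead of every thread copying all 44, but I’d appreciate confirmation that this is the usual pattern:

```cuda
// Sketch: cooperative copy; 44 threads each move one round-key entry and
// the remaining 212 threads skip it. No padding needed.
if (threadIdx.x < 44)
    sh_round_keys[threadIdx.x] = round_keys[threadIdx.x];
sh_sbox[threadIdx.x] = sbox[threadIdx.x];   // unchanged: one entry per thread
__syncthreads();                            // both copies visible block-wide
```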
I’m also worried about bank conflicts, especially because I don’t understand them well enough. Are there any in this kernel?
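For my own bookkeeping, my reading of the guide is that 1.x shared memory has 16 banks, each 32 bits wide, so a byte access lands in bank (offset / 4) % 16; here is the tiny host-side helper (just arithmetic, my own invention) I use to eyeball which banks a half-warp’s sbox lookups would hit:

```cuda
// Host-side helper: which bank a byte offset into shared memory hits on
// compute 1.x hardware (16 banks, 4 bytes wide each).
__host__ int bankOfByte(int byteOffset)
{
    return (byteOffset / 4) % 16;
}
```

Since the sbox indices are data-dependent, I assume the conflict count varies per half-warp; is that the right way to think about it?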
Thank you very much for your time!