GPU cryptography speedup

i have implemented a simple RC4 function on the GPU but it runs much slower than the CPU implementation.

would anyone be able to provide advice as to why this is and how i could speed it up? (i believe the slow part is rc4_init)


attached code:

in main:

dim3 dimBlock( 256, 1);

dim3 dimGrid( 256, 1);

rc4_benchmark<<< dimGrid, dimBlock >>>(d_odata);

#define u_char unsigned char

struct rc4_state {

	u_char	perm[256];

	u_char	index1;

	u_char	index2;


__device__ void

swap_bytes(u_char *a, u_char *b)


	u_char temp;

	temp = *a;

	*a = *b;

	*b = temp;


__device__ void 

rc4_init(struct rc4_state *const state, const u_char *key, int keylen)


	u_char j;

	int i;

	/* Initialize state with identity permutation */

	for (i = 0; i < 256; i++)

  state->perm[i] = (u_char)i; 

	state->index1 = 0;

	state->index2 = 0;


	/* Randomize the permutation using key data */

	for (j = i = 0; i < 256; i++) {

  j += state->perm[i] + key[i % keylen]; 

  swap_bytes(&state->perm[i], &state->perm[j]);



__device__ void

rc4_crypt(struct rc4_state *const state,

	const u_char *inbuf, u_char *outbuf, int buflen)


	int i;

	u_char j;

	for (i = 0; i < buflen; i++) {

 /* Update modification indicies */


  state->index2 += state->perm[state->index1];

 /* Modify permutation */



 /* Encrypt/decrypt next byte */

  j = state->perm[state->index1] + state->perm[state->index2];

  outbuf[i] = inbuf[i] ^ state->perm[j];



__device__ int dmemcmp(void *a, void *b, int size) {

	unsigned char *x = (unsigned char *) a;

	unsigned char *y = (unsigned char *) b;

	int i;

	for (i=0;i<size;i++){

  if (x[i]!=y[i])

  	return 1;


	return 0;


__device__ void dmemcpy(void *dst, void *src, int size) {

	unsigned char *x = (unsigned char *) dst;

	unsigned char *y = (unsigned char *) src;

	int i;

	for (i=0;i<size;i++){




__constant__ u_char d_rcryptbuf[8];

#define keysize 2 

#define size 8

__global__ void rc4_benchmark(char *odata) {

	unsigned int xIndex = blockDim.x * blockIdx.x + threadIdx.x;

	unsigned int yIndex = blockDim.y * blockIdx.y + threadIdx.y;

	u_char crack_key[16];

	u_char outbuf;

	int i = xIndex;

	struct rc4_state state;

	crack_key[0] = i & 255;

	crack_key[1] = (i>>8) & 255;

	crack_key[2] = (i>>16) & 255;

	crack_key[3] = (i>>24) & 255;





Have you run it through the Visual Profiler yet? That should give you some idea of why it’s not running quickly (uncoalesced memory reads, too many divergent branches, etc.)

RC4 is not very good choice for GPU due to its heavy memory usage. I’ve been able to tweak my implementation to be 3x-5x faster than CPU but that’s far from GPU potential. I’ll check my source code when I get back to office.

I tried the visual profiler, but I’m not sure if I was using it correctly, this is the only results I got:

rc4_benchmark 1 101582 495.12 58.66

memcopy 1 71570 41.33

Which isn’t very helpful. Apparently that is all the information I can get with Windows Vista.

Ah, I see thanks for your comments. I would be interested to hear how you managed to have a better-than CPU performance with your code.

I would have expected that the GPU would still perform better considering that the GPU memory is faster than CPU memory.

In general, you want to avoid loops in your kernels, if possible.

I see that your “dmemcmp” and “dmemcpy” kernels use loops.

Instead, you may want to rethink the algorithm, having 1 thread copying/comparing 1 element; like in:

__global__ void  MemCopy (int *DestBuffer, const int *SourceBuffer, unsigned const int iSize)


  unsigned int	index = __umul24(blockIdx.x, blockDim.x) + threadIdx.x;

 if (index < iSize)

   DestBuffer[index] = SourceBuffer[index];


In the caller code:

#define THREADSxBLOCK       256



dim3  	dimBlock (THREADSxBLOCK,1,1);

dim3  	dimGrid;



dimGrid  	= (dim3) make_uint3(iArraySize/THREADSxBLOCK,1,1);



MemCopy<<<dimGrid, dimBlock>>>(devDestBuffer, devSourceBuffer, iArraySize);

cudaResult = cudaGetLastError();

if (cudaResult != cudaSuccess)


// handle error




This way, you also enjoy coalesced memory accesses.


Thanks Andrei, I’ve now made some small modifications to the code and it is now also 4.2x faster than the CPU version.

Thanks for your comments XFer, but I think making those changes throughout the code would require a very major rewrite.

For future reference, the only major change I made was “shared struct rc4_state” and reworking the code to accomodate the shared memory structures.