GPU cryptography speedup

I have implemented a simple RC4 function on the GPU, but it runs much slower than the CPU implementation.

Would anyone be able to advise why this is the case and how I could speed it up? (I believe the slow part is rc4_init.)

Thanks.

Attached code:

In main:

dim3 dimBlock(256, 1);
dim3 dimGrid(256, 1);
rc4_benchmark<<< dimGrid, dimBlock >>>(d_odata);

#define u_char unsigned char

struct rc4_state {
    u_char perm[256];
    u_char index1;
    u_char index2;
};

__device__ void
swap_bytes(u_char *a, u_char *b)
{
    u_char temp;

    temp = *a;
    *a = *b;
    *b = temp;
}

__device__ void
rc4_init(struct rc4_state *const state, const u_char *key, int keylen)
{
    u_char j;
    int i;

    /* Initialize state with identity permutation */
    for (i = 0; i < 256; i++)
        state->perm[i] = (u_char)i;
    state->index1 = 0;
    state->index2 = 0;

    /* Randomize the permutation using key data */
    for (j = i = 0; i < 256; i++) {
        j += state->perm[i] + key[i % keylen];
        swap_bytes(&state->perm[i], &state->perm[j]);
    }
}

__device__ void
rc4_crypt(struct rc4_state *const state,
    const u_char *inbuf, u_char *outbuf, int buflen)
{
    int i;
    u_char j;

    for (i = 0; i < buflen; i++) {
        /* Update modification indices */
        state->index1++;
        state->index2 += state->perm[state->index1];

        /* Modify permutation */
        swap_bytes(&state->perm[state->index1],
            &state->perm[state->index2]);

        /* Encrypt/decrypt next byte */
        j = state->perm[state->index1] + state->perm[state->index2];
        outbuf[i] = inbuf[i] ^ state->perm[j];
    }
}

__device__ int dmemcmp(void *a, void *b, int size) {
    unsigned char *x = (unsigned char *) a;
    unsigned char *y = (unsigned char *) b;
    int i;

    for (i = 0; i < size; i++) {
        if (x[i] != y[i])
            return 1;
    }
    return 0;
}

__device__ void dmemcpy(void *dst, void *src, int size) {
    unsigned char *x = (unsigned char *) dst;
    unsigned char *y = (unsigned char *) src;
    int i;

    for (i = 0; i < size; i++) {
        x[i] = y[i];
    }
}

__constant__ u_char d_rcryptbuf[8];

#define keysize 2
#define size 8

__global__ void rc4_benchmark(char *odata) {
    unsigned int xIndex = blockDim.x * blockIdx.x + threadIdx.x;
    u_char crack_key[16];
    u_char outbuf[size];          /* per-thread output buffer */
    int i = xIndex;
    struct rc4_state state;

    /* Derive a candidate key from the thread index
       (keysize is 2, so only the first two bytes are actually used) */
    crack_key[0] = i & 255;
    crack_key[1] = (i >> 8) & 255;
    crack_key[2] = (i >> 16) & 255;
    crack_key[3] = (i >> 24) & 255;

    rc4_init(&state, crack_key, keysize);
    rc4_crypt(&state, d_rcryptbuf, outbuf, size);
    dmemcpy(&odata[i * 8], outbuf, size);
}
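For completeness, here is a minimal sketch of the host-side setup such a kernel needs, since the post only shows the launch itself. The host buffer names (h_cryptbuf, h_odata) and the omission of error checking are assumptions for illustration, not taken from the original code:

/* Hypothetical host-side setup (buffer names assumed, error checking omitted). */
#include <cuda_runtime.h>
#include <stdlib.h>

int main(void)
{
    const int num_keys  = 256 * 256;     /* one thread per candidate key */
    const int out_bytes = num_keys * 8;  /* 8 keystream bytes per key    */

    u_char h_cryptbuf[8] = { 0 };        /* fill with the ciphertext to attack */
    char  *h_odata = (char *)malloc(out_bytes);
    char  *d_odata = NULL;

    cudaMalloc((void **)&d_odata, out_bytes);

    /* Copy the ciphertext into the __constant__ buffer used by the kernel */
    cudaMemcpyToSymbol(d_rcryptbuf, h_cryptbuf, sizeof(h_cryptbuf));

    dim3 dimBlock(256, 1);
    dim3 dimGrid(256, 1);
    rc4_benchmark<<< dimGrid, dimBlock >>>(d_odata);

    /* Copy the results back and scan them on the host */
    cudaMemcpy(h_odata, d_odata, out_bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_odata);
    free(h_odata);
    return 0;
}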

Have you run it through the Visual Profiler yet? That should give you some idea of why it’s not running quickly (uncoalesced memory reads, too many divergent branches, etc.)

RC4 is not a very good choice for the GPU due to its heavy memory usage. I’ve been able to tweak my implementation to be 3x-5x faster than the CPU, but that’s far from the GPU’s potential. I’ll check my source code when I get back to the office.

I tried the Visual Profiler, but I’m not sure if I was using it correctly; these are the only results I got:

rc4_benchmark 1 101582 495.12 58.66

memcopy 1 71570 41.33

That isn’t very helpful. Apparently that is all the information I can get with Windows Vista.
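If that is all the profiler will show on Vista, one workaround is to time the kernel directly with CUDA events and compare runs with parts of the kernel commented out. A minimal sketch, assuming the same d_odata, dimGrid, and dimBlock as in the launch above:

/* Rough kernel timing with CUDA events (host-side sketch). */
cudaEvent_t start, stop;
float ms = 0.0f;

cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
rc4_benchmark<<< dimGrid, dimBlock >>>(d_odata);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);               /* wait for the kernel to finish */

cudaEventElapsedTime(&ms, start, stop);
printf("rc4_benchmark: %.3f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);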

Ah, I see; thanks for your comments. I would be interested to hear how you managed to get better-than-CPU performance with your code.

I would have expected the GPU to still perform better, considering that GPU memory is faster than CPU memory.

In general, you want to avoid loops in your kernels, if possible.

I see that your “dmemcmp” and “dmemcpy” device functions use loops.

Instead, you may want to rethink the algorithm so that one thread copies/compares one element, as in:

__global__ void MemCopy(int *DestBuffer, const int *SourceBuffer, unsigned const int iSize)
{
    unsigned int index = __umul24(blockIdx.x, blockDim.x) + threadIdx.x;

    if (index < iSize)
        DestBuffer[index] = SourceBuffer[index];
}

In the caller code:

#define THREADSxBLOCK       256
:
:
dim3 dimBlock(THREADSxBLOCK, 1, 1);
dim3 dimGrid;
:
:
dimGrid = (dim3) make_uint3(iArraySize / THREADSxBLOCK, 1, 1);
:
:
MemCopy<<<dimGrid, dimBlock>>>(devDestBuffer, devSourceBuffer, iArraySize);
cudaResult = cudaGetLastError();
if (cudaResult != cudaSuccess)
{
    // handle error
}
:
:

This way, you also enjoy coalesced memory accesses.

Fernando

Thanks Andrei, I’ve now made some small modifications to the code, and it is now also 4.2x faster than the CPU version.

Thanks for your comments, XFer, but I think making those changes throughout the code would require a very major rewrite.

For future reference, the only major change I made was declaring the RC4 state as “__shared__ struct rc4_state” and reworking the code to accommodate the shared memory structures.
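For readers wondering what that change might look like, below is a minimal sketch of keeping each thread's RC4 state in shared memory. This is a guess at the approach, not the poster's actual code; since each state is 258 bytes, the block size has to shrink (32 threads per block is roughly 8 KB of shared memory, which fits in the 16 KB available on GPUs of that era) and the grid has to grow to cover the same number of keys.

/* Hypothetical variant: per-thread RC4 state kept in shared memory
   instead of per-thread local memory. Not the original poster's code. */
#define THREADS_PER_BLOCK 32

__global__ void rc4_benchmark_shared(char *odata)
{
    __shared__ struct rc4_state states[THREADS_PER_BLOCK];

    unsigned int i = blockDim.x * blockIdx.x + threadIdx.x;
    struct rc4_state *state = &states[threadIdx.x];
    u_char crack_key[16];
    u_char outbuf[size];

    /* keysize is 2, so only the first two key bytes matter */
    crack_key[0] = i & 255;
    crack_key[1] = (i >> 8) & 255;

    rc4_init(state, crack_key, keysize);
    rc4_crypt(state, d_rcryptbuf, outbuf, size);
    dmemcpy(&odata[i * 8], outbuf, size);
}

To cover the same 65536 candidate keys it could then be launched as, for example, rc4_benchmark_shared<<< 65536 / THREADS_PER_BLOCK, THREADS_PER_BLOCK >>>(d_odata);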