How to parallelize DES encryption algorithm in CUDA?

Hello, can someone suggest me the logic/idea about how to optimize DES encryption algo in CUDA. I have already tried the bit-slice implementation in CUDA but i am not getting sufficient improvement in number of DES encryptions per second.

Actually i am getting 2^28 DES encryptions per second on 1 Nvidia Tesla M60 GPU but i want to achieve about minimum 2^30 encryptions per second on single GPU.Please help me.

As a general advice: Hash function code will be significantly faster if you manage to get rid of local memory and stack usage.

Arrays are often placed in local memory by the compiler where you wouldn’t want them to be. To ensure that a locally declared array variable is placed in registers, all indexing of said array must be known at compile time. Meaning that you typically have to fully unroll your loops to meet this condition (provided that the loop index variable is somehow used to index that array).

To reduce stack use, functions should be inlined (e.g. use forceinline directive)

To understand what the compiler made from your code, inspect the PTX code as well as the resulting SASS assembly code. e.g. if you find lots of ST.local and LD.local directives in the PTX you know your kernel is still using local memory (which makes you inherently bandwidth limited in the case where the L1 cache can no longer cover this).

For the purpose of crypto currency mining, specialized versions of the hash functions are implemented that operate on a message of predetermined length where typically only a small portion (e.g. a nonce field) is modified in the CUDA kernel. The resulting hashes are evaluated against a threshold condition to ensure that a proof-of-work difficulty target is met.

I do not know your use case yet and if you are using cipher block chaining mode or not. Are you processing messages of arbitrary length? If so, many thousands in parallel? Or are you processing a single large message in ECB mode?

Are you using both CUDA devices on your Tesla M60 card already? did you turn off the ECC to speed up memory operations?

There are various ways of implementing a bitslice DES. It’s hard to give advice without knowing how your code works. You may find some inspiration in this previous thread:

https://devtalk.nvidia.com/default/topic/860120/bitslice-des-optimization/

In particular you might want to check whether you can use the LOP3 instruction aggressively. I am curious to learn why anybody would be looking into DES at this point in time, other than for purposes of intellectual calisthenics. DES was a hot topic when I was in school and I am retired now. For bulk encryption, it was superseded by AES almost two decades ago.

Unless you are working with a very low-end GPU, you should be able to achieve higher throughput because these authors already achieved 373.58 million keys / second on a GTX 260 with a bit-sliced DES back in 2010:

Giovanni Agosta, Alessandro Barenghi,Fabrizio De Santis, and Gerardo Pelosi “Record Setting Software Implementation of DES Using CUDA”. In Seventh International Conference on Information Technology: New Generations (ITNG), April 2010, pp. 748-755. (online: https://home.deib.polimi.it/barenghi/files/ITNG2010.pdf)

Hey thanks for the reply.I want to generate the rainbow table that’s why i am trying to optimize the DES algo in order to reduce the final GPU cost.

My code is a simple bit-slice implementation for doing DES encryption of 64-bit plain-text message in ECB mode. I have comprised whole DES encryption algorithm in a global device function and doing a kernel launch with n blocks and m threads where each thread is performing encryption of each 64-bit block of plain-text (there is a large plain-text and each thread working on a particular 64-bit block of that plain-text).

I am seeing that the 16 rounds of DES encryption performed can’t be parallelized as these are dependent. Then the only remaining way is to look over to optimize the inner “for” loops doing operations like XOR,permutation,s-boxes. Also since s-boxes are implemented using logic gates, so there is huge register utilization per CUDA thread.I want to know which section of the code to parallelize in order to achieve maximum optimization.
@cbuchner1, i will look into your advice on GPU.
@njuffa, so you are saying about optimizing s-boxes.Besides, i have already read that paper by Agosta. There is nothing much detailed information given about that.

Correct, the paper by Agosta et.al. doesn’t provide any implementation details, but it does provide an idea of how much performance you might be able to extract from modern GPUs (e.g. estimate scaling by number of SMs and their frequency; considering changes in memory throughput).

I do not have specific advice about optimizing your code (which I have not seen) except to examine the utility of the LOP3 instruction (see the PTXAS manual). Various performance aspects of bitslice DES were discussed in the longish forum thread I pointed to, I would suggest going through all messages in that thread. You may learn something new, or it may be old hat to you, as I don’t know your level of expertise.

I last did active work related to DES around 1992, when I implemented it in an FPGA (Xilinx XC 4000, as I recall), and any software optimization work in this regard predates that by several years so you would want to rely on advice from people who have worked on this a bit more recently than I have :-)