Bitslice-DES optimization

(Warning: Maxwell only!)

All of the code is now open source here, as part of my personal fork of Meriken’s Tripcode Engine, a third-party tripcode finder for 10-character (DES crypt(3)) and 12-character (SHA-1) tripcodes.

Grab the following files if you are interested in:

UNIX DES crypt(3) with salts:

One round of DES:

Note: the initial permutation and final permutation are not included in the function; you need to apply them yourself. Normally they are part of the pre-processing, and both have minimal performance impact.
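To illustrate why these permutations are cheap in a bitsliced layout, here is a minimal host-side sketch. It is not taken from the repository: the names `Slices`, `to_bitslices`, `from_bitslices`, `initial_permutation` and `final_permutation` are hypothetical, and the layout (one 32-bit word per DES bit position, one lane per block) is an assumption. The only fixed input is the standard DES IP table. With that layout, both permutations reduce to reordering whole words.

```cpp
#include <array>
#include <cstdint>

// Standard DES initial permutation table (1-based; bit 1 is the most significant bit).
constexpr std::array<int, 64> kIP = {
    58, 50, 42, 34, 26, 18, 10, 2, 60, 52, 44, 36, 28, 20, 12, 4,
    62, 54, 46, 38, 30, 22, 14, 6, 64, 56, 48, 40, 32, 24, 16, 8,
    57, 49, 41, 33, 25, 17,  9, 1, 59, 51, 43, 35, 27, 19, 11, 3,
    61, 53, 45, 37, 29, 21, 13, 5, 63, 55, 47, 39, 31, 23, 15, 7};

// Hypothetical layout: slices[i] holds DES bit (i + 1) of 32 independent blocks,
// one block per bit lane of the 32-bit word.
using Slices = std::array<std::uint32_t, 64>;
using Blocks = std::array<std::uint64_t, 32>;

// Transpose 32 ordinary 64-bit blocks into bitsliced form.
Slices to_bitslices(const Blocks& blocks) {
    Slices s{};
    for (int lane = 0; lane < 32; ++lane)
        for (int bit = 0; bit < 64; ++bit)
            if ((blocks[lane] >> (63 - bit)) & 1u)
                s[bit] |= std::uint32_t{1} << lane;
    return s;
}

// Inverse transpose: recover the 32 blocks from the slices.
Blocks from_bitslices(const Slices& s) {
    Blocks blocks{};
    for (int lane = 0; lane < 32; ++lane)
        for (int bit = 0; bit < 64; ++bit)
            if ((s[bit] >> lane) & 1u)
                blocks[lane] |= std::uint64_t{1} << (63 - bit);
    return blocks;
}

// IP: output bit i is input bit kIP[i]. In bitsliced form this only moves words.
Slices initial_permutation(const Slices& in) {
    Slices out{};
    for (int i = 0; i < 64; ++i) out[i] = in[kIP[i] - 1];
    return out;
}

// FP is the inverse of IP: input bit i goes back to position kIP[i].
Slices final_permutation(const Slices& in) {
    Slices out{};
    for (int i = 0; i < 64; ++i) out[kIP[i] - 1] = in[i];
    return out;
}
```

In a real kernel the reordering would typically be folded into how the slices are loaded and stored rather than done as a separate copy, which is consistent with the permutations costing almost nothing.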

The suggested register usage or launch bound is 168 registers, or __launch_bounds__(128, 3), even when you wrap other code around it.
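The two forms of that suggestion appear to describe the same budget, and the arithmetic can be sketched as follows. The hardware figures are assumptions taken from NVIDIA's published tables for Maxwell (sm_5x), not from this post: 65536 32-bit registers per SM, at most 255 registers per thread, and allocation rounded to 8 registers per thread. The helper `reg_cap` and the kernel name `des_kernel_sketch` are hypothetical. The kernel declaration is guarded so that the file also builds with a plain host compiler (C++14 or later).

```cpp
#include <cstdint>

// Assumed Maxwell (sm_5x) limits; these are not stated in the text above.
constexpr int kRegsPerSM = 65536;        // 32-bit registers per multiprocessor
constexpr int kMaxRegsPerThread = 255;   // architectural per-thread ceiling
constexpr int kRegAllocGranularity = 8;  // per-thread rounding (256 registers per warp)

// Largest per-thread register count that still lets `min_blocks_per_sm`
// blocks of `threads_per_block` threads be resident on one SM.
constexpr int reg_cap(int threads_per_block, int min_blocks_per_sm) {
    int raw = kRegsPerSM / (threads_per_block * min_blocks_per_sm);
    int rounded = raw / kRegAllocGranularity * kRegAllocGranularity;
    return rounded < kMaxRegsPerThread ? rounded : kMaxRegsPerThread;
}

// 65536 / (128 * 3) = 170, rounded down to a multiple of 8 gives 168.
static_assert(reg_cap(128, 3) == 168, "128 threads x 3 blocks should allow 168 registers");

#ifdef __CUDACC__
// Placement of the attribute on a kernel; the body is a placeholder only.
__global__ void __launch_bounds__(128, 3) des_kernel_sketch(std::uint32_t* out) {
    out[blockIdx.x * blockDim.x + threadIdx.x] = 0u;
}
#endif
```

Under these assumptions, 3 blocks of 128 threads at 168 registers each use 64512 of the 65536 registers, so the launch bound and an explicit 168-register cap (for example via nvcc's --maxrregcount option) should lead the compiler to the same limit.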

With this version, I get 950 MH/s for UNIX DES crypt(3) (or equivalently 23750 MH/s for 1 round of DES, i.e. one of the 25 DES iterations that crypt(3) performs) on my reference Gigabyte GTX 980 Ti (+270 MHz). Considering that hashcat’s implementation gets 165.5 MH/s on a GTX Titan X (+225 MHz), that is a great improvement, roughly 5.7 times faster. Even my naive implementation, which is bounded by shared memory/synchronization and uses the old S-boxes from JtR, is faster than hashcat’s (300 MH/s on a 980 Ti at +300 MHz).

This further dispels the myth that Nvidia cards are still bad at DES crypt(3). While it may have held some truth before Maxwell (or even before sm_32, the first version that lifted the per-thread register limit from 63 to 255) when compared against GCN, it is no longer the case; it all depends on the implementation.