Making bit slice DES

Has anybody had any success in implementing bitslice DES?

This is a well-known method to speed up the DES encryption routine:

I did almost the same as in the original source code.
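For readers new to the technique, here is a minimal sketch of the bitslice idea in plain C. It is not the original source code and it is not DES: `toy_scalar`, `toy_bitslice`, `to_slices`, `from_slices` and `check_all_lanes` are hypothetical names, and the 8-bit "round" is an invented stand-in for a real S-box gate network. It only shows the data layout: bit b of 32 independent blocks is packed into one 32-bit word, so every AND/XOR/NOT processes 32 blocks at once.

```c
#include <assert.h>
#include <stdint.h>

#define LANES 32  /* one lane per bit of a 32-bit word */
#define BITS  8   /* width of the toy block; real DES uses 64 */

/* Ordinary version, one block at a time (toy function, NOT DES):
   output bit b = (x_b ^ k_b) ^ (x_{b+1} & ~x_{b+2}), indices mod BITS. */
static uint8_t toy_scalar(uint8_t x, uint8_t k)
{
    uint8_t r = 0;
    for (int b = 0; b < BITS; b++) {
        unsigned xb = (x >> b) & 1u;
        unsigned kb = (k >> b) & 1u;
        unsigned x1 = (x >> ((b + 1) % BITS)) & 1u;
        unsigned x2 = (x >> ((b + 2) % BITS)) & 1u;
        r |= (uint8_t)(((xb ^ kb) ^ (x1 & (x2 ^ 1u))) << b);
    }
    return r;
}

/* Transpose: slice[b] collects bit b of all 32 inputs; lane i is input i. */
static void to_slices(const uint8_t in[LANES], uint32_t slice[BITS])
{
    for (int b = 0; b < BITS; b++) {
        slice[b] = 0;
        for (int i = 0; i < LANES; i++)
            slice[b] |= (uint32_t)((in[i] >> b) & 1u) << i;
    }
}

/* Inverse transpose: rebuild the 32 bytes from the 8 slices. */
static void from_slices(const uint32_t slice[BITS], uint8_t out[LANES])
{
    for (int i = 0; i < LANES; i++) {
        out[i] = 0;
        for (int b = 0; b < BITS; b++)
            out[i] |= (uint8_t)(((slice[b] >> i) & 1u) << b);
    }
}

/* Bitslice version: same gates, but each operation handles 32 lanes. */
static void toy_bitslice(const uint32_t x[BITS], const uint32_t k[BITS],
                         uint32_t out[BITS])
{
    for (int b = 0; b < BITS; b++)
        out[b] = (x[b] ^ k[b]) ^ (x[(b + 1) % BITS] & ~x[(b + 2) % BITS]);
}

/* Runs 32 blocks through both versions; returns the number of lanes checked. */
static int check_all_lanes(void)
{
    uint8_t in[LANES], key[LANES], out[LANES];
    uint32_t xs[BITS], ks[BITS], os[BITS];

    for (int i = 0; i < LANES; i++) {
        in[i]  = (uint8_t)(i * 37 + 11);   /* arbitrary test data */
        key[i] = (uint8_t)(i * 101 + 3);
    }
    to_slices(in, xs);
    to_slices(key, ks);
    toy_bitslice(xs, ks, os);
    from_slices(os, out);

    for (int i = 0; i < LANES; i++)
        assert(out[i] == toy_scalar(in[i], key[i]));
    return LANES;
}
```

Calling `check_all_lanes()` from a `main` confirms that the sliced and scalar versions agree. Note that every bit of state occupies a whole word in this layout, so a bitslice DES carries 64 words for the block plus the key words per thread.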

But the register count is huge:

[codebox]>ptxas info : Used 60 registers, 2160+0 bytes lmem, 28+16 bytes smem[/codebox]

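So about 2160 bytes per thread (roughly 540 32-bit values) are spilled to local memory, and it (a 9500 GT board) runs roughly as fast as an Intel Core Duo. That is probably because of the constant traffic to local memory, which resides in off-chip device memory just like global memory.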

So what can be done?

I know this is an old post but:

[codebox]ptxas info : Used 64 registers, 1056+0 bytes lmem, 64+16 bytes smem, 256 bytes cmem[0], 8 bytes cmem[1][/codebox]

The idea is to put key parts in shared/constant memory when appropriate. For the purpose of pure encryption all key(s) would be put into constant memory.

But I am building a brute-force cracker, so keys vary between threads and blocks; parts of them have to reside in registers, parts in smem, and parts in cmem.
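To make the split concrete, here is a rough sketch in plain C of one way the 56 key bits of a bitsliced search could be divided. This is an assumption for illustration, not the code described above: the names `build_key_slices`, `lane_key`, `check_key_split`, `LANE_PATTERN` and `THREAD_BITS`, the choice of 10 thread bits, and the mapping of each group to a memory space are all hypothetical. `thread_id` is an ordinary parameter standing in for a CUDA thread index.

```c
#include <assert.h>
#include <stdint.h>

#define KEY_BITS    56
#define LANES       32
#define LANE_BITS   5    /* log2(LANES) */
#define THREAD_BITS 10   /* assumed: up to 1024 threads per launch */

/* Key bits 0..4 differ between the 32 lanes of a word. Lane j carries the
   value j, so these five words never change (a candidate for cmem). */
static const uint32_t LANE_PATTERN[LANE_BITS] = {
    0xAAAAAAAAu, 0xCCCCCCCCu, 0xF0F0F0F0u, 0xFF00FF00u, 0xFFFF0000u
};

/* Fill k[0..55] with the bitsliced key words for one thread.
   Lane j of the result holds the candidate key
   (base with its low 15 bits cleared) | thread_id << 5 | j. */
static void build_key_slices(uint64_t base, uint32_t thread_id,
                             uint32_t k[KEY_BITS])
{
    int b = 0;
    /* Per-lane bits: fixed patterns, identical for every thread. */
    for (; b < LANE_BITS; b++)
        k[b] = LANE_PATTERN[b];
    /* Per-thread bits: all-ones or all-zeros (candidates for registers). */
    for (; b < LANE_BITS + THREAD_BITS; b++)
        k[b] = 0u - ((thread_id >> (b - LANE_BITS)) & 1u);
    /* Remaining bits: the same for the whole launch (smem or cmem). */
    for (; b < KEY_BITS; b++)
        k[b] = 0u - (uint32_t)((base >> b) & 1u);
}

/* Rebuild the 56-bit key that a given lane is testing. */
static uint64_t lane_key(const uint32_t k[KEY_BITS], int lane)
{
    uint64_t key = 0;
    for (int b = 0; b < KEY_BITS; b++)
        key |= (uint64_t)((k[b] >> lane) & 1u) << b;
    return key;
}

/* Checks a few thread ids; returns the number of (thread, lane) pairs checked. */
static int check_key_split(void)
{
    const uint64_t base = 0x00A1B2C3D4E5F6ull;           /* arbitrary */
    const uint64_t low_mask =
        ((uint64_t)1 << (LANE_BITS + THREAD_BITS)) - 1;  /* low 15 bits */
    const uint64_t key_mask = ((uint64_t)1 << KEY_BITS) - 1;
    const uint32_t tids[3] = { 0u, 1u, 1023u };
    uint32_t k[KEY_BITS];
    int checked = 0;

    for (int t = 0; t < 3; t++) {
        build_key_slices(base, tids[t], k);
        for (int lane = 0; lane < LANES; lane++) {
            uint64_t expected = (base & key_mask & ~low_mask)
                              | ((uint64_t)tids[t] << LANE_BITS)
                              | (uint64_t)lane;
            assert(lane_key(k, lane) == expected);
            checked++;
        }
    }
    return checked;
}
```

Under this layout only the `THREAD_BITS` words are unique to a thread, so most of the key words do not have to occupy per-thread registers. Whether this matches the register and lmem figures quoted above is not something the sketch can confirm.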

And my code on a GeForce 9650M GT is almost 3 times faster than on a Core 2 Duo at 2.26 GHz (without SSE/XMM).

Can anyone share code for this implementation?