[codebox]ptxas info : Used 60 registers, 2160+0 bytes lmem, 28+16 bytes smem[/codebox]
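So roughly 2 KB of register data (2160 bytes) spilled to local memory, and it (a 9500 GT board) runs about as fast as an Intel Core Duo. That is probably because local memory resides in off-chip device memory, so the kernel is constantly going out to global memory.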
[codebox]ptxas info : Used 64 registers, 1056+0 bytes lmem, 64+16 bytes smem, 256 bytes cmem[0], 8 bytes cmem[1][/codebox]
The idea is to put parts of the key in shared or constant memory where appropriate. For pure encryption, all keys would go into constant memory.
But I am building a brute-force cracker, so the keys vary between threads and blocks: parts of them have to reside in registers, parts in smem, and parts in cmem.
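To make the split concrete, here is a minimal sketch of the layout, not my actual kernel. The names `c_key_fixed`, `s_key_block`, `r_key_thread`, `toy_mix` and `toy_hash` are all made up for illustration, and `toy_hash` is only a placeholder standing in for the real cipher. It assumes the key bits shared by the whole launch go to cmem, the bits that differ per block go to smem, and the bits that differ per thread stay in a register.

```cuda
#include <cstdint>
#include <cstdio>
#include <cuda_runtime.h>

#define NOT_FOUND 0xFFFFFFFFu

// Key words identical for every candidate in this launch (hypothetical layout).
__constant__ uint32_t c_key_fixed[4];

// Placeholder mixing step; NOT a real cipher, just enough to have something to test.
__host__ __device__ inline uint32_t toy_mix(uint32_t acc, uint32_t w) {
    acc ^= w;
    acc *= 0x9E3779B1u;
    return (acc << 13) | (acc >> 19);   // rotate left by 13
}

// Combines the three key parts: launch-wide, per-block, per-thread.
__host__ __device__ inline uint32_t toy_hash(const uint32_t *fixed,
                                             uint32_t block_part,
                                             uint32_t thread_part) {
    uint32_t acc = 0;
    for (int i = 0; i < 4; ++i) acc = toy_mix(acc, fixed[i]);
    acc = toy_mix(acc, block_part);
    return toy_mix(acc, thread_part);
}

__global__ void try_keys(uint32_t target, uint32_t *found) {
    // Per-block key part: written once by thread 0, read by the whole block.
    __shared__ uint32_t s_key_block;
    if (threadIdx.x == 0) s_key_block = blockIdx.x;
    __syncthreads();

    // Per-thread key part: a plain local, which the compiler keeps in a register.
    uint32_t r_key_thread = threadIdx.x;

    if (toy_hash(c_key_fixed, s_key_block, r_key_thread) == target)
        atomicMin(found, blockIdx.x * blockDim.x + threadIdx.x);
}

// Host wrapper: uploads the fixed part to cmem, runs the search, returns the hit.
uint32_t search(const uint32_t *h_fixed, uint32_t target, int blocks, int threads) {
    uint32_t *d_found = nullptr;
    uint32_t h_found = NOT_FOUND;
    cudaMemcpyToSymbol(c_key_fixed, h_fixed, 4 * sizeof(uint32_t));
    cudaMalloc(&d_found, sizeof(uint32_t));
    cudaMemcpy(d_found, &h_found, sizeof(uint32_t), cudaMemcpyHostToDevice);
    try_keys<<<blocks, threads>>>(target, d_found);
    cudaMemcpy(&h_found, d_found, sizeof(uint32_t), cudaMemcpyDeviceToHost);
    cudaFree(d_found);
    return h_found;
}

int main() {
    const uint32_t fixed[4] = {0x01234567u, 0x89ABCDEFu, 0xFEDCBA98u, 0x76543210u};
    // Build a target from a known candidate so the search has something to find.
    uint32_t target = toy_hash(fixed, 3u, 17u);
    uint32_t found = search(fixed, target, 8, 64);
    if (found == NOT_FOUND) printf("no match\n");
    else printf("match at block %u, thread %u\n", found / 64, found % 64);
    return 0;
}
```

Compiling with `nvcc --ptxas-options=-v` prints the same kind of `ptxas info` line as above, so you can check that the cmem and smem figures move as expected and that lmem stays small.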
And my code on a GeForce 9650M GT is almost 3 times faster than on a 2.26 GHz Core 2 Duo (without SSE/XMM).