I am writing an MD5/SHA1 brute-forcer, and currently I've got the following results:
1260 MIPS per 1 stream processor (SP, not MP).
It turns out that 33% of all GPU time is spent in the "cyclic rotate left" operation.
Without that operation I get 1620 MIPS (which is close to the theoretical performance limit of 1625 MIPS, since my 9600GT runs at 1625 MHz). I count logical instructions, not assembler ones (so a rotate counts as 1).
Is it possible to optimize the rotate operation? It is implemented like this:
Also, here is the rest of the important code:
#define F(x, y, z) (((x) & (y)) | ((~x) & (z)))
#define FF(a, b, c, d, x, s, ac) { \
  (a) += (x) + F((b), (c), (d)) + (unsigned long int)(ac); \
  (a) = ROTATE_LEFT ((a), (s)); \
  (a) += (b); \
}
FF (a, b, c, d, data[0], 7, md5_const[0]); // this block repeats 64 times with small changes
md5_const is in constant memory (speed looks the same if I leave the values as literal constants in the code); data, a, b, c, d are in registers (i.e. local), and each thread processes 10,000 keys.
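For completeness, the constant-memory setup I mean looks roughly like this (names are mine; the table values are the standard MD5 T[i] constants):

```cuda
// Sketch of the constant-table setup (my naming; standard CUDA pattern).
__constant__ unsigned int md5_const[64];   // per-step additive constants T[i]

// Host side, once at startup:
//   unsigned int host_const[64] = { 0xd76aa478, 0xe8c7b756, /* ... */ };
//   cudaMemcpyToSymbol(md5_const, host_const, sizeof(host_const));
```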
dim3 threads(128); // larger values do not increase speed at all
dim3 grid(128);
LOW-LEVEL QUESTIONS:

Where can I find a list of all the low-level instructions the GeForce supports? I want to be sure that nvcc uses every optimization possibility.

Does anyone know how aggressive nvcc is at optimization? I guess it optimizes things like x = y + 0. What else can we expect from nvcc?

Can we expect nvcc to get something similar to the SSE2 intrinsics that C++ has for x86?
BTW, my x86 version does 18,300 MIPS per core (C2D @ 3 GHz), which is ~5 32-bit operations per clock. Not bad for x86 SIMD :)