I am writing an MD5/SHA-1 brute-forcer, and currently I've got the following results:
1260 MIPS per stream processor (SP, not multiprocessor).
It turns out that 33% of all GPU time is spent in the "cyclic rotate left" operation.
Without that operation I get 1620 MIPS (which is close to the theoretical performance limit of 1625 MIPS, as my 9600GT runs at 1625 MHz). I count logical instructions, not assembler ones (so a rotate counts as 1).
Is it possible to optimize the rotate operation? It is implemented like this:
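(The ROTATE_LEFT definition itself is not shown here. For reference, a minimal sketch of the usual shift-and-OR form follows. It is an assumption, not necessarily the exact macro in use, and rotl32 is a hypothetical helper added only so the macro can be checked on the host.)

```c
#include <stdint.h>

/* Assumed form: the usual shift-and-OR rotate for 32-bit words.
   Valid for 1 <= n <= 31, which covers every MD5 shift amount. */
#define ROTATE_LEFT(x, n) (((x) << (n)) | ((x) >> (32 - (n))))

/* Hypothetical helper: wraps the macro so it can be tested on the host. */
static uint32_t rotl32(uint32_t x, unsigned n)
{
    return ROTATE_LEFT(x, n);
}
```

If the hardware has no single rotate instruction, this form expands to two shifts and an OR (with a constant n, the 32 - n is folded at compile time), so one "logical" rotate would cost about three machine instructions.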
Also, here is the rest of the important code:
#define F(x, y, z) (((x) & (y)) | (~(x) & (z)))
// one MD5 round-1 step: a = b + rotl(a + F(b,c,d) + x + ac, s)
#define FF(a, b, c, d, x, s, ac) { \
(a) += (x) + F((b), (c), (d)) + (unsigned long int)(ac); \
(a) = ROTATE_LEFT ((a), (s)); \
(a) += (b); \
}
FF (a, b, c, d, data[0], 7, md5_const[0]); // this repeats 64 times with small changes
md5_const is in constant memory (the speed looks the same if I leave the values as literal constants in the code); data, a, b, c, d are in registers (i.e. local variables), and each thread processes 10,000 keys.
dim3 threads(128); // larger values do not increase speed at all
dim3 grid(128);
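Since logical operations are what is being counted, one thing that could be tried on the F function is the equivalent "select" form, which needs three operations instead of four. This is only a sketch: F_ALT, f_ref and f_alt are hypothetical names, and whether it helps depends on what nvcc already does with the original expression.

```c
#include <stdint.h>

/* Form used in the post: 4 logical operations (AND, NOT, AND, OR). */
#define F(x, y, z)     (((x) & (y)) | (~(x) & (z)))

/* Equivalent select form: 3 logical operations (XOR, AND, XOR).
   Where a bit of x is 1 the result bit is y, otherwise it is z. */
#define F_ALT(x, y, z) ((z) ^ ((x) & ((y) ^ (z))))

/* Hypothetical wrappers so both forms can be compared on the host. */
static uint32_t f_ref(uint32_t x, uint32_t y, uint32_t z) { return F(x, y, z); }
static uint32_t f_alt(uint32_t x, uint32_t y, uint32_t z) { return F_ALT(x, y, z); }
```

The inputs 0xF0F0F0F0, 0xCCCCCCCC and 0xAAAAAAAA cover all eight bit combinations of (x, y, z), so comparing the two forms on that one triple is enough to check that they agree.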
LOW LEVEL QUESTIONS:
- Where can I find a list of all the low-level instructions that GeForce GPUs support? I want to be sure that nvcc uses every available instruction when optimizing.
- Does anyone know how aggressive nvcc is at optimization? I guess it optimizes away things like x=y+0. What else can we expect from nvcc?
- Can we expect nvcc to get something similar to the SSE2 intrinsics in C++ for x86?
BTW, my x86 version does 18,300 MIPS per core (C2D at 3 GHz), which is ~5 32-bit operations per clock. Not so bad for x86 SIMD :-)