Need help optimizing int32 code (low-level stuff)

I am writing an MD5/SHA-1 bruteforcer, and currently I've got the following results:

1260 MIPS per 1 stream processor (SP, not MP).

It turns out that 33% of all GPU time is spent in the "cyclic rotate left" operation.

Without that operation I get 1620 MIPS, which is close to the theoretical performance limit of 1625 MIPS, since my 9600GT runs at 1625 MHz. I count logical instructions, not assembler ones (so a rotate counts as 1).

Is it possible to optimize the rotate operation? It is implemented like this:

Also, here is the rest of the important code:

#define F(x, y, z) (((x) & (y)) | ((~x) & (z)))

#define FF(a, b, c, d, x, s, ac) { \
    (a) += (x) + F((b), (c), (d)) + (unsigned long int)(ac); \
    (a) = ROTATE_LEFT ((a), (s)); \
    (a) += (b); \
}

FF (a, b, c, d, data[0], 7, md5_const[0]); // this repeats 64 times with minor variations

md5_const is in constant memory (speed seems the same if I leave the values as literal constants in the code); data, a, b, c, d are in registers (i.e. local); each thread processes 10,000 keys.

dim3 threads(128); // larger values do not increase speed at all

dim3 grid(128);

LOW LEVEL QUESTIONS:

  1. Where can I find a list of all the low-level GeForce instructions supported? I want to be sure that nvcc exploits all of them when optimizing.

  2. Does anyone know how aggressive nvcc is at optimization? I guess it optimizes things like x = y + 0. What else can we expect from nvcc?

  3. Can we expect nvcc to gain something similar to the SSE2 intrinsics available in C++ for x86?

BTW, my x86 version does 18,300 MIPS per core (C2D 3 GHz), which is ~5 32-bit operations per clock. Not bad for x86 SIMD :-)

You might want to take a look at decuda.

As for #2:

Nvidia picked Open64 as the base for their nvopencc, which is responsible for high-level compilation and global optimization of C code into Parallel Thread Execution (PTX) format.

To quote a document from Nvidia about the compiler chain:

(Source: http://www.capsl.udel.edu/conferences/open…/Papers/101.doc )

The PTX code is then passed to a low-level compiler, ptxas (referred to in the document as OCG), which handles things such as register allocation, scheduling, and peephole optimization.

In my experience, nvopencc is pretty aggressive; I have seen it suddenly remove important parts of the code (see my thread http://forums.nvidia.com/index.php?showtopic=67086&hl= ).

For the sort of optimization you ask about, it's definitely the responsibility of the global optimizer, and I would recommend compiling with "--opencc-options -LIST:source=on" to see what's going on in the PTX source.

Nice, with --opencc-options I see that rotate left is done without a cyclic rotate instruction (I guess it is not implemented in hardware, right?):

	shl.b32  %r153, %r152, 17;
	shr.u32  %r154, %r152, 15;
	or.b32   %r155, %r153, %r154;

I'm not sure why it uses two instructions (a load plus a mov) per value from the global array (data is a local array):

//  68   for(int i=0;i<4;i++)data[i]=data_d->sample_data[i];
	ld.global.s32  %r1, [%rd1+4];	// id:927
	mov.s32  %r2, %r1;
	ld.global.s32  %r3, [%rd1+8];	// id:928
	mov.s32  %r4, %r3;
	ld.global.s32  %r5, [%rd1+12];	// id:929
	mov.s32  %r6, %r5;
	ld.global.s32  %r7, [%rd1+16];	// id:930
	mov.s32  %r8, %r7;

%r8 is not used anywhere in the code, only %r7.

I'm also not sure why it loads everything from the constants array into registers at the beginning of the thread function. I guess that causes register overuse: +64 registers per thread, where only one set per MP would be needed, or even zero registers if constants/immediate values were used directly. This was autogenerated (i.e. I did not copy the constants to registers myself):

	ld.const.u32  %r15, [md5_const+248];	// id:857 md5_const+0xf8
	ld.const.u32  %r16, [md5_const+244];	// id:858 md5_const+0xf4
	ld.const.u32  %r17, [md5_const+240];	// id:859 md5_const+0xf0
	ld.const.u32  %r18, [md5_const+236];	// id:860 md5_const+0xec

... 64 constants

I thought constant memory was fast enough to be used directly in computations without reloading into registers. An immediate value would probably be even better when registers are scarce.

Everything else looks perfect.

Any ideas what causes these three pieces of code?

Nice :-)
Code with immediate values instead of the constants array runs 1.2% faster, plus a lot of registers are saved. Not sure why we would need constant memory in this case.

Funny thing: the compiler loves to do subtraction instead of addition for immediate values :-)

I.e. instead of add r, 123 it does
sub r, (-123)

Negative values have a range larger by one for a fixed number of bits; e.g., if the instruction allows an 8-bit signed constant,

add r, a

works for a up to 127, whereas

sub r, -a

works for numbers up to 128.

Obviously, for negative constants the compiler should prefer add.

That is crazy.

When I do for(int i=0;i<4;i++)data[i]=sample_data[i];

where sample_data is in constant memory, I get 167M keys/sec.

for(int i=0;i<4;i++)data[i]=123; gives 168M keys/sec.

When I do for(int i=0;i<4;i++)data[i]=data_d->sample_data[i];

where data_d is in device (global) memory, I get 188M keys/sec.

The variant with constants produces MORE "garbage" registers, so occupancy decreases; that looks like the reason for the slowdown. When I lower the number of threads per grid, the speed with constants climbs back up to 187M keys/sec (the same happens for the 123 constant).

I guess this shows that nvcc builds on the good Open64 optimization base, but there is no optimization after code generation. That leads to registers that are written but never read. I was expecting a little more :-)

Check it out:

	ld.const.s32  %r1, [sample_data+0];	// id:805 sample_data
	mov.s32  %r2, %r1;
	ld.const.s32  %r3, [sample_data+4];	// id:806 sample_data
	mov.s32  %r4, %r3;
	ld.const.s32  %r5, [sample_data+8];	// id:807 sample_data
	mov.s32  %r6, %r5;
	ld.const.s32  %r7, [sample_data+12];	// id:801 sample_data
	mov.s32  %r8, %r7;

10 instructions later, let's make some more copies:

	mov.s32  %r16, %r6;            // 

	mov.s32  %r17, %r4;            // 

	mov.s32  %r18, %r2;            // 

50 instructions later:

	ld.const.s32  %r55, [sample_data+12];	// id:801 sample_data
	add.s32  %r56, %r54, %r55;

%r7 and %r8 are never used; %r6 is used only once (its value is copied to %r16), etc.

Probably I am missing some optimization settings or something…

In the NVCC manual (.doc) I've found this:
"provide compilation/optimization options to Cg compiler or ptx optimizing assembler". But I have no idea where to find these PTX optimizer options.

The only optimization options I know of are -maxrregcount (not very helpful, since it spills values out of registers into memory) and -opt-level (default is level 4; what is the range, and is there any info on what happens at each level?)