Maximum optimization settings

Can you suggest which options to turn on for maximum optimization?

As I already mentioned, I set PTXAS_FLAGS += --opt-level=1009 (yes, I tried different numbers), and I still see very poor .ptx code (too many redundant registers, and code with global memory reads is faster than code with constant memory reads).

There must be a mistake on my part somewhere; I can’t believe nvcc is that bad :-)

ptx does not reuse registers yet; it uses a new register every time. The register optimization is performed during the conversion from ptx to cubin.

Thanks a lot. In the decuda-ed cubin file I can see that the redundant registers are removed.

It is still not clear how code with one global read can be faster than code with one constant read or one immediate value (the speed is the same when I reduce the threads/grid).

Which code exactly are you talking about? If you can show the two versions next to each other, maybe someone can shed some light?

#define F(x, y, z) (((x) & (y)) | ((~x) & (z)))

#define FF2(a, b, c, d, x, s, ac, ic, id) { DO(F, a, b, c, d, x, s, ac /* md5_const[ic] */); }

    const int ix = blockDim.x * blockIdx.x + threadIdx.x;
    int data[4];
    for (int i = 0; i < 4; i++) data[i] = sample_data[i];
    int a, b, c, d;
    const int len = code_len[0];

    for (int loop = 0; loop < 20000; loop++) {
        FF2(a, b, c, d, data[0], S11, 0xd76aa478, 0, 0); /* round 1; 63 more lines like this */
    }

In this case the code uses sample_data and code_len[0], which are in constant memory, and it can run at full speed only at grid=64, threads=128.

Code with data_d->sample_data[i] can run at full speed at grid=128, threads=128 (data_d is in device/global memory).

Code with data[i] = 123; (an immediate value) also runs as slowly as the constant-memory version.

The execution results are the same; the only difference is speed.

Here are the benchmark results; the numbers are millions of hashes checked per second:

grid=128, threads=128

global: 188.317368

constant: 167.413727

immediate value: 168.691376

grid=64, threads=128

global: 188.188232

constant: 188.084717

immediate value: 189.572815

Moving “const int len = code_len[0];” to constant memory does not affect performance.

Well, apart from the fact that I cannot quite work out what you mean from this code snippet, a few things stand out:

len is defined, but I do not see it used. But if you read len from global memory and afterwards use it in a for loop of 20,000 iterations, the difference might be:

  • len read from global memory will be in a register afterwards, so the loop might be fast enough to hide the extra time spent fetching it from global memory.

  • you have a different grid size, and that may change things. A grid of 64 blocks is not that big; it depends on how many blocks can run at the same time for your code. With 128 threads per block, 16 MPs × 6 blocks is the maximum number of resident blocks, so if you are not using too many registers you need 96 blocks just to fill all the MPs once.

As a side note, check whether your kernel is using local memory. If so, try putting #pragma unroll 4 before the for loop. If that does not help, do something like:

int data0 = sample_data[0];

int data1 = sample_data[1];

int data2 = sample_data[2];

int data3 = sample_data[3];

Actually I don’t see why you don’t use sample_data directly.

Thanks for the reply.

I am going to post a bigger piece of code to make it clearer.

Yes, in the ptx code len is stored in a register, as is data.

The sample_data -> data loop is unrolled automatically; in the ptx code there are 4 assignments in all 3 variants.

Right now it would be possible to use sample_data directly, but this data is going to be different for each thread soon.

So it looks like the code with immediate values and constants consumes more registers, which is why I have to reduce the number of threads/grid. It is unclear how that is possible.

There is something strange in the cuda compiler…

I am using shared memory instead of registers, and a simple kernel uses 38 registers…
Changing the shared memory variables to plain float/uint registers reduces the kernel to 24 registers…

On the other hand, using const float4& and const float seems to increase the register count vs plain float4/float (no const, no reference)… very strange.

And btw… if you specify -Xptxas -v,-keep,-O2, you can see in the .ptx that the compiler uses the O3 optimization… not the O2 that I set… I think it currently just ignores the optimization level and always assumes O3…