Multiprocessor occupancy for a flop-intensive kernel… how bad is 25%?

I have a double-precision kernel with approximately 100 flops per thread…

Given the high register pressure (63 registers per thread :no: ), roughly 7 KB of shared memory per block (I could use more, but that reduces my occupancy even further), and 128 threads per block, I get 25% occupancy on the multiprocessor (calculated using the NVIDIA occupancy spreadsheet).
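(For reference, a rough version of that spreadsheet calculation, assuming the compute 1.3 limits of the C1060 mentioned later in the thread: 16 K registers and 16 KB shared memory per SM, 1,024 threads max.)

registers:  63 regs/thread × 128 threads ≈ 8 K per block → at most 2 blocks per SM
shared mem: ~7 KB per block                              → at most 2 blocks per SM
occupancy:  2 blocks × 128 threads = 256 active threads  → 256 / 1,024 = 25%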

I can reduce the register usage by forcing the compiler with the --maxrregcount flag, but that would increase my lmem usage by a large amount… which is a big no-no :thumbsdown: as I am already using quite a chunk of it.
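(For reference, the flag is spelled --maxrregcount and is passed straight to nvcc; the file name and the cap of 40 below are only placeholders, and -Xptxas -v prints the resulting register/lmem usage.)

nvcc -arch=sm_13 --maxrregcount=40 -Xptxas -v -c mykernel.cu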

I understand that we don’t need 100% occupancy to get full performance, but is 25% below normal for such a flop-intensive kernel?

My previous experience with 20% occupancy for a double-precision N-body kernel achieved very good performance, around 65-70% of theoretical peak… but that was a relatively simple kernel… hence I am not sure I can rely on that for this kernel.

Also, as a general rule of thumb, how much occupancy is good?

Thanks all…

N I T I N,

Make sure you have at least 192 active threads running… 256 is a better number.

The reason for this is to avoid register dependencies in the pipeline… But the DP pipeline is heavily contended by all cores in an MP, and I am not sure how much of the 192-thread benefit would really be reaped by DP-intensive kernels.

But 192/256 active threads would help you hide gmem latencies… hopefully.

128 threads per block would not be bad. I suppose that you are using 1.1 hardware. Simply upgrading to 1.2 or higher will help you reach 50% occupancy (16K registers per MP), but that will not make it twice as fast.

Note that 25% occupancy does not mean 25% of max speed.

There are some tips for squeezing the number of registers in use without degrading performance:

  • Reuse temporary variables

  • Don’t unroll loops (#pragma unroll 1)
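(A minimal sketch of the second tip; the kernel and loop here are made up just to show where the pragma goes.)

__global__ void scale_kernel(float *data, float s, int n)
{
	int i = blockIdx.x * blockDim.x + threadIdx.x;
	if (i >= n) return;

	// Without the pragma nvcc may fully unroll this small fixed-trip loop,
	// which can raise register pressure; "#pragma unroll 1" keeps it rolled.
	#pragma unroll 1
	for (int k = 0; k < 4; ++k)
		data[i] = data[i] * s + (float)k;
}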

Thanks guys for the fast replies…

@CVNGUYEN: I am using a C1060, hence 1.3 compute capability. Double-precision kernels use twice as many registers, hence my low occupancy. I tried #pragma unroll 1 to see if something could be done. No cake :no: .

@SARNATH: I have 256 ACTIVE threads per multiprocessor, as I can run 2 thread blocks per multiprocessor, so that’s good to see.

I also tried __noinline__ but that didn’t help either.

I am also worried about the DP register pressure between the 8 cores… let’s hope the “256” active threads :unsure: do the trick.

Looks like I will have to use the --maxrregcount flag and tweak the register count manually :thumbsdown:

Declaring some local variables “volatile” can sometimes reduce the effective register count of the PTX and the resulting cubin.

I’ve been able to increase occupancy for certain kernels just with this simple trick.
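(A minimal sketch of what that looks like in practice; the function is made up, the only point is the qualifier on the temporaries.)

__device__ double example_distance2(double3 a, double3 b)
{
	// Marking the temporaries volatile asks the compiler to evaluate each one
	// once and keep it in a register instead of re-expanding the expression
	// at every use, which sometimes lowers the register count.
	volatile double dx = a.x - b.x;
	volatile double dy = a.y - b.y;
	volatile double dz = a.z - b.z;
	return dx * dx + dy * dy + dz * dz;
}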

Sometimes it’s even beneficial to break a register-bound kernel into several smaller kernels. That way you can, for example, write KernelA with fewer registers, write temporary data to gmem, launch KernelB (again with fewer registers), read the temporaries back and finalize the computation.

That way you might be able to balance registers/shared mem and bandwidth.

For me it helped a lot, even if the first kernel wrote a lot of data to gmem and the second one read it.
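(A minimal sketch of that split, with hypothetical kernel names and a hypothetical temporary buffer d_tmp allocated by the host.)

// KernelA: first half of the work, kept light on registers; stores partials to gmem
__global__ void KernelA(const double *in, double *d_tmp, int n)
{
	int i = blockIdx.x * blockDim.x + threadIdx.x;
	if (i < n)
		d_tmp[i] = in[i] * in[i];      // placeholder for the real partial result
}

// KernelB: reads the partials back and finalizes the computation
__global__ void KernelB(const double *d_tmp, double *out, int n)
{
	int i = blockIdx.x * blockDim.x + threadIdx.x;
	if (i < n)
		out[i] = d_tmp[i] + 1.0;       // placeholder for the finalization step
}

// host side: KernelA<<<grid, block>>>(d_in, d_tmp, n);  KernelB<<<grid, block>>>(d_tmp, d_out, n);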

eyal

Wow thanks :thumbup: … that helped.

My register usage went down from 63 to… 49. I made a local array

double fx[9];

volatile…

But… sadly the multiprocessor occupancy is still the same, as it is now bounded by shared memory, which is 6192 bytes for 128 threads per block. :(

Now I have 2 questions here

  1. How did the number of registers drop from 63 to 49…? (Is nvcc doing some fancy register reuse with volatile “ON”?)

  2. Is using volatile even beneficial here? Since my occupancy is the same, should I just use 63 registers in that case? (Registers are faster, I guess.)

Thanks for the idea… I will surely try that along with this kernel to see which one is the best approach.

Your approach might also help me reduce my local memory pressure… interesting, I will think more on this today…

I guess it will mainly depend on my kernel execution time vs. kernel launch overhead and kernel resource usage.

I can only implement it in a couple of days’ time, as I will have access to the C1060 machine tomorrow evening. Will keep this in the loop until then :) .

Thanks…

NA

Using volatile on local variables that are placed in registers by the compiler has no negative effect on speed. However, if the variable ends up in local memory, it will slow you down.

The array double fx[9]; would go into local memory I think, unless you’re always accessing each element using indices KNOWN AT COMPILE TIME. Using volatile here would have a negative effect, resulting in more memory accesses than really needed.
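(An illustration of the “known at compile time” condition, with a made-up kernel: the unrolled loops use only compile-time indices, so fx[] can stay in registers; a single runtime-indexed access would push the whole array to lmem.)

__global__ void index_example(const double *in, double *out)
{
	double fx[9];              // stays in registers only if every access uses
	                           // an index known at compile time

	#pragma unroll             // fully unrolled: indices 0..8 become constants
	for (int j = 0; j < 9; ++j)
		fx[j] = in[threadIdx.x * 9 + j];

	double sum = 0.0;
	#pragma unroll
	for (int j = 0; j < 9; ++j)
		sum += fx[j];
	// something like fx[someRuntimeIndex] here would force fx[] into local memory

	out[threadIdx.x] = sum;
}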

Why does the volatile trick work? NVCC often implicitly inlines expressions:

int x = threadIdx.x + 7;

for (...some loop...) {

	out[x] = (in[x] + in[x+1]) / 2;

}

In the above fictional example, the int x may be inlined into the loop below, resulting in:

for (...some loop...) {

	out[threadIdx.x+7] = (in[threadIdx.x+7] + in[threadIdx.x+7+1]) / 2;

}

So if you’re unlucky, in the PTX it will compute threadIdx.x + 7 three times, burning several registers to re-load the constant “7” and perform the addition. That adds 6 registers to the PTX.

The volatile qualifier prevents this and makes the compiler compute threadIdx.x + 7 immediately and place it in a register, where it is re-used three times in the loop.
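(For completeness, a sketch of the volatile version of the same fictional example; the only change is the qualifier on x.)

volatile int x = threadIdx.x + 7;	// computed once and kept in a register

for (...some loop...) {

	out[x] = (in[x] + in[x+1]) / 2;

}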

NOTE: this also works for constants, like

volatile float const_one = 1.0f;

You will find that using const_one instead of spelling out 1.0f will reduce the register count in the PTX if const_one is used twice or more. One example:

cos(x+1.0f) - sin(x-1.0f)  // uses more registers in PTX than

cos(x+const_one) - sin(x-const_one)

Yes, there will be an optimization phase later on where PTXAS tries to reduce the register count when generating the final cubin, but that works sub-optimally (it may be some kind of peep-hole optimizer, i.e. not looking at the full code). Hence the manual optimization with “volatile” often yields better results.
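(A minimal, self-contained sketch of the constant trick; the kernel name and data layout are made up, the only interesting line is the volatile const_one.)

__global__ void wave_kernel(const float *x, float *y, int n)
{
	int i = blockIdx.x * blockDim.x + threadIdx.x;
	if (i >= n) return;

	// Loading the literal once through a volatile local keeps it in a single
	// register; spelling out 1.0f at both uses may cost extra registers in the PTX.
	volatile float const_one = 1.0f;
	y[i] = cosf(x[i] + const_one) - sinf(x[i] - const_one);
}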

Hmm, thanks… interesting information… it actually helped. I was able to reduce the register usage by a further 4 registers.

I have a question:

If my multiprocessor occupancy is bounded by shared memory, and shared memory usage in turn scales with the number of threads, then decreasing register usage won’t buy me anything, right? Because if I use more threads per block, I use more shared memory, hence the occupancy remains the same. (I checked this with the NVIDIA spreadsheet.)

…or is there something else that improves performance (other advantages?) due to lower register usage? Will it reduce load on the DP register pipeline in the multiprocessor?
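(A quick sanity check of that, using the numbers from this thread and the compute 1.3 limits: shared memory already caps the SM at 2 blocks, so shaving registers below 64 per thread cannot raise occupancy.)

shared mem: floor(16,384 B / 6,192 B per block) = 2 blocks → 2 × 128 = 256 threads → 25%
registers:  anything up to 64 regs/thread still fits those 2 blocks (2 × 128 × 64 = 16,384)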

Thanks

NA

Yes, you can gain: the reduced instruction count in some tight loop can buy you performance, unless that loop’s performance is mostly bounded by memory latency and/or throughput (in which case the reduced number of instructions hardly matters).

Yeah, makes sense… I do have a couple of tight loops nested 3 levels deep… so it should buy me some performance.

Thanks for answering my questions… :)