I have a double precision kernel with approx 100 flops/thread…
Given the high register pressure (63 registers per thread) and ~7 KB of shared memory (I can use more, but that reduces my occupancy drastically further) and 128 threads per block, I get 25% occupancy on the multiprocessor (calculated using the NVIDIA occupancy spreadsheet).
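For reference, assuming the compute 1.3 limits of the C1060 mentioned further down (16384 registers and 16 KB shared memory per SM, 1024 resident threads), the 25% falls out roughly like this:
registers: 63 x 128 = 8064, rounded up to 8192 per block → 16384 / 8192 = 2 blocks
shared memory: 16384 / ~7168 bytes per block = 2 blocks
occupancy: 2 blocks x 128 threads = 256 of 1024 resident threads = 25%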
I can reduce the register usage by forcing the compiler with the --maxrregcount flag, but that would increase my lmem usage by a large amount… which is a big no-no, as I am already using quite a chunk of it.
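Side note: adding --ptxas-options=-v to the compile line prints the per-kernel register and lmem usage, so the trade-off is visible. A sketch along these lines (file name made up):
nvcc -arch=sm_13 --maxrregcount=48 --ptxas-options=-v -o mykernel mykernel.cu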
I understand that we don't require 100% occupancy to get full performance, but is 25% below normal for such a flop-intensive kernel?
My previous experience with 20% occupancy for a double precision N-body kernel achieved very good performance, around 65-70% of theoretical peak… but that was a relatively simple kernel… hence I am not sure whether I can rely on that fact for this kernel.
Also, as a general rule of thumb, how much occupancy is good?
Make sure you have at least 192 active threads running… 256 is a better number.
The reason for this is to avoid register dependencies in the pipeline… But the DP pipeline is heavily contended by all cores in an MP, and I am not sure how the 192-thread benefit would really be reaped by DP-intensive kernels.
But 192/256 active threads would help you hide gmem latencies… hopefully.
128 threads per block is not bad. I suppose that you are using 1.1 hardware. Simply upgrading to 1.2 or higher will help you reach 50% occupancy (16K registers per MP), but that will not make it twice as fast.
Note that 25% occupancy does not mean 25% of max speed.
There are some tips for squeezing the number of registers in use without degrading performance:
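For example, a rough, made-up sketch of the kind of tweaks mentioned in the replies below (#pragma unroll 1, __noinline__, -maxrregcount and the volatile trick); none of them is guaranteed to help, so check the register count after each change:
__device__ __noinline__ double body(double a, double b)   // keep the helper out of line
{
    return a * b + 1.0;
}

__global__ void kernel(const double *in, double *out)
{
    volatile double one = 1.0;   // constant materialized once in a register and re-used
    double acc = 0.0;

    #pragma unroll 1             // keep the loop rolled so temporaries are not replicated
    for (int i = 0; i < 8; ++i)
        acc += body(in[threadIdx.x + i], one);

    out[threadIdx.x] = acc;
}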
@CVNGUYEN: I am using a C1060, hence compute capability 1.3. Double precision kernels use twice as many registers, hence my low occupancy. I tried that #pragma unroll 1 to see if something could be done. No cake.
@SARNATH: I have 256 active threads per multiprocessor, as I can run 2 thread blocks per multiprocessor, so that's good to see.
I also tried noinline but that didn’t help either.
I am also worried about the DP register pressure between the 8 cores… let's hope 256 active threads do the trick.
Looks like I will have to use the -maxrregcount flag after all and tweak the register count manually…
My register usage went down from 63 to 49. I changed a local array from double fx[9] to volatile double fx[9]…
But… sadly, the multiprocessor occupancy is still the same, as it is now bounded by shared memory, which is 6192 bytes for 128 threads per block. :(
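For what it's worth, that limit works out to the same two blocks (again assuming 16 KB of shared memory per SM on compute 1.3):
shared memory: 16384 / 6192 bytes per block = 2 blocks → 2 x 128 = 256 of 1024 threads = 25%, regardless of how few registers the kernel uses.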
Now I have 2 questions here:
1. How did the number of registers drop from 63 to 49? (Is nvcc doing some fancy register reuse with volatile on?)
2. Is using volatile even beneficial here? Since my occupancy stays the same, should I just use the 63 registers in that case? (Registers are faster, I guess.)
Thanks for the idea… I will surely try that along with this kernel to see which one is the best approach.
Your approach might help me reduce my local memory pressure as well… interesting, I will think more on this today…
I guess it will mainly depend on my kernel execution time vs. kernel launch overhead and kernel resource usage.
I can only implement it in a couple of days' time, as I will have access to the C1060 machine tomorrow evening. Will keep this in the loop until then :).
Using volatile on local variables that the compiler places in registers has no negative effect on speed. However, if the variable would be stored in local memory, it will slow you down.
The array double fx[9]; would go into local memory, I think, unless you are always accessing each element with indices KNOWN AT COMPILE TIME. Using volatile here would have a negative effect, resulting in more memory accesses than really needed.
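A minimal, made-up illustration of the compile-time-index point:
__global__ void regs_ok(const double *in, double *out)
{
    double fx[9];
    // fully unrolled, every index is a compile-time constant: fx[] can stay in registers
    #pragma unroll
    for (int i = 0; i < 9; ++i)
        fx[i] = in[threadIdx.x * 9 + i];

    double sum = 0.0;
    #pragma unroll
    for (int i = 0; i < 9; ++i)
        sum += fx[i];
    out[threadIdx.x] = sum;
}

__global__ void lmem_spill(const double *in, double *out, int j)
{
    double fx[9];
    #pragma unroll
    for (int i = 0; i < 9; ++i)
        fx[i] = in[threadIdx.x * 9 + i];

    // j is only known at runtime, so the whole array is spilled to local memory
    out[threadIdx.x] = fx[j];
}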
Why does the volatile trick work? NVCC often implicitly inlines expressions:
int x = threadIdx.x + 7;
for (...some loop...) {
    out[x] = (in[x] + in[x+1]) / 2;
}
In the above fictional example, the int x may be inlined into the loop body, resulting in something like
for (...some loop...) {
    out[threadIdx.x + 7] = (in[threadIdx.x + 7] + in[threadIdx.x + 7 + 1]) / 2;
}
So if you are unlucky, in the PTX it will compute threadIdx.x + 7 three times, burning several registers to re-load the constant “7” and to perform the addition. That adds 6 registers to the PTX.
The volatile statement prevents this and makes the compiler compute threadIdx.x + 7 immediately and place it in a register, where it is re-used three times in the loop.
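A self-contained version of the fictional example above with the volatile added (names and the loop body are made up):
__global__ void avg_shift(const float *in, float *out, int iters)
{
    // volatile makes the compiler evaluate threadIdx.x + 7 once and keep it in a register,
    // instead of re-expanding the expression at every use of x
    volatile int x = threadIdx.x + 7;

    for (int i = 0; i < iters; ++i)
        out[x] = (in[x] + in[x + 1]) * 0.5f;
}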
NOTE: this also works for constants, like
volatile float const_one = 1.0f;
You will find that using const_one instead of writing the literal 1.0f will reduce the register count in the PTX if const_one is used twice or more. One example:
cos(x+1.0f) - sin(x-1.0f) // uses more registers in PTX than
cos(x+const_one) - sin(x-const_one)
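Put into a complete kernel (made-up names, single precision for brevity), the same trick looks like:
__global__ void trig_demo(const float *in, float *out)
{
    volatile float const_one = 1.0f;   // 1.0f is materialized once in a register
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = cosf(in[i] + const_one) - sinf(in[i] - const_one);
}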
Yes, there will be an optimization phase later on where PTXAS tries to reduce the register count when generating the final cubin, but that works sub-optimally (it may be some kind of peephole optimizer, i.e. not looking at the full code). Hence the manual optimization with “volatile” often yields better results.
Hmm thanks… interesting information… it actually helped. I was able to reduce the register usage further by 4.
I have a question:
If my multiprocessor occupancy is bounded by shared memory, and shared memory usage in turn grows with the number of threads, then decreasing register usage won't buy me anything, right? Because if I use more threads per block, I use more shared memory, hence the occupancy remains the same. (I checked this with the NVIDIA spreadsheet.)
…or is there something else that improves performance (other advantages?) due to lower register usage? Will it reduce the load on the DP register pipeline in the multiprocessor?
Yes, you can gain: the reduced instruction count in some tight loop can buy you performance, unless that loop's performance is mostly bounded by memory latency and/or throughput (in which case the reduced number of instructions hardly matters).