First of all, thanks to the NVIDIA guys for CUDA 2.1.

Now the issue:

I had a kernel that used 22-23 registers without loop unrolling and around 24 registers with loop unrolling (unroll factor >= 32), which gave a significant increase in performance. This was the situation with CUDA 2.0.

I installed CUDA 2.1. Without loop unrolling my register usage was the same, but with loop unrolling my register use shot up to 29. As a result, the block size I was using previously (320, a multiple of 32) no longer allows the kernel to run, since the total number of registers available is too small (8192 on the 8600GT).

And if I do a smaller loop unrolling (unroll factor < 32), the kernel runs but the performance is not that impressive.
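For context, this is the `#pragma unroll` style of unrolling; a minimal sketch (the kernel, array names, and unroll factor here are hypothetical, not the original code):

```cuda
// Hypothetical kernel illustrating #pragma unroll.
// Unrolling removes loop overhead and exposes independent
// instructions to the scheduler, but each additional in-flight
// value needs its own register, so register use tends to rise.
__global__ void accumulate(const float *in, float *out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float sum = 0.0f;

    #pragma unroll 32   // unroll factor >= 32, as in the post
    for (int i = 0; i < 32; ++i)
        sum += in[tid * 32 + i];

    out[tid] = sum;
}
```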

Is this some kind of new register-usage behavior in CUDA 2.1 that we should know about?


Nitin Arora

Try the -maxrregcount option to nvcc to reduce register count to what you need. When doing that, make sure to keep track of lmem usage (if it goes up, you’re spilling registers to global memory).
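For reference, the flag goes on the nvcc command line; a hypothetical invocation (the file name and the register cap of 24 are made up for illustration):

```shell
# Cap register use at 24 per thread; ptxas spills the rest to lmem.
# --ptxas-options=-v prints per-kernel register and lmem usage,
# so you can watch for spills after capping.
nvcc -O3 -maxrregcount=24 --ptxas-options=-v -o kernel kernel.cu
```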


Great… it works… but why does the new CUDA 2.1 allocate more registers?

Any reasons?


How about your lmem usage, Nitin? Has it increased? We are also using unrolling to max out performance…

btw, will this be fixed in the next official release?

Also, my lmem usage did go up. I was not using any previously; now it shows 8 bytes, which I guess is not that much. I also got a small performance increase, since I can now run a full block size of 512 threads.

but “how much” lmem is bad?



“How much” does NOT matter as much as how frequently you access it during program execution.

The more frequently you use it (inside for loops, etc.), the more penalty you pay.
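As a hypothetical illustration of why frequency matters: if a spilled variable is touched on every loop iteration, each iteration pays the lmem latency (lmem lives in uncached device memory on these parts):

```cuda
// Suppose the compiler (or a -maxrregcount cap) spilled 'acc' to lmem.
// Then every iteration does a read-modify-write through lmem:
// roughly 2 * N slow accesses, versus almost none if 'acc'
// had stayed in a register for the whole loop.
float acc = 0.0f;
for (int i = 0; i < N; ++i)
    acc += data[i];   // lmem traffic on every iteration if spilled
out[tid] = acc;       // only here does the value need to leave the chip
```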

I would assume that the compiler automatically chooses to move the least-used local variables to lmem…

Yes… unfortunately, it increased. I was using 0 lmem previously; now with max unrolling it uses 8-16 bytes (depending on the block size). But I still see an increase in performance, as I can execute 512 threads per block.

I think they (NVIDIA) are just trying to maximize the use of registers, hence the compiler is deploying more registers.



Sometimes you may want to manually change the ptx code to reduce register pressure, because the register allocation & instruction scheduling in both nvopencc and ptxas suck. For example, when your program accesses many different arrays, they tend to group the address calculation together instead of interleaving address calculation and load/store, which increases register pressure.

I am sorry, I should not have jumped to that conclusion so quickly. Grouping address calculations together has its own advantage, in that the instructions become independent of one another, while interleaving address calculation and load/store introduces dependences between instructions. In the former case, the calculation of the next array's address can be issued before the previous calculation finishes; in the latter case, each load/store has to wait for its address calculation to complete.

Am I right? Anyone wants to comment on this?
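To make the two orderings concrete, a sketch at the C-for-CUDA level (array and helper names here are hypothetical, and the actual reordering happens at the PTX/SASS level, so this is only illustrative):

```cuda
// Grouped: all addresses computed first, then all loads issued.
// The address calculations are mutually independent, but
// a, b, and c must all be held in registers at the same time.
const float *pa = A + idx, *pb = B + idx, *pc = C + idx;
float a = *pa, b = *pb, c = *pc;
consume3(a, b, c);            // hypothetical consumer

// Interleaved: each load immediately follows its own address
// calculation. Fewer values are live at once (lower register
// pressure), but each load depends on the instruction before it.
float a2 = *(A + idx); consume1(a2);
float b2 = *(B + idx); consume1(b2);
float c2 = *(C + idx); consume1(c2);
```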

Latency due to register-register dependencies is covered if you run 192 or more threads per SM (this is somewhere in the Programming Guide, perhaps search for 192).

Grouping or batching several gmem or tex accesses can have a performance benefit due to another reason - multiple memory accesses can be in flight at the same time (the classic pipeline benefit), because a thread doesn’t block on a memory access, it will block on an instruction that depends on an outstanding memory access. One consequence is increased register use, since the registers of the pipelined accesses cannot be reused until they complete.
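A common way to exploit this, sketched with hypothetical names: issue several independent loads before any instruction consumes their results, so the memory latencies overlap instead of serializing:

```cuda
// Batched loads: all three loads are outstanding at once, since
// a thread only stalls at the first instruction that *uses* a
// loaded value, not at the load itself. The cost is that a, b,
// and c each occupy a register until the loads complete.
float a = in[i];
float b = in[i + stride];      // issued while load of 'a' is in flight
float c = in[i + 2 * stride];  // issued while both are in flight
float s = a + b + c;           // first use: stall here until all arrive
```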


I found it on page 67 of the Programming Guide: “The delays introduced by read-after-write dependencies can be ignored as soon as there are at least 192 active threads per multiprocessor to hide them.” (I did search for 192. Thanks; otherwise it would not have been easy to find.) So it’s actually 192/8 = 24 cycles, i.e. 6 warps, or 6 independent instructions if there is only one warp.

1. I think your pipeline is very, very long, because the latency indicates the number of stages between RF and EXE, am I right? So if I have an instruction sequence of “add r1, r2, r3; add r4, r2, r1” and there is only one warp, does that mean I lose ~20 cycles between the two instructions?

2. Multiple memory accesses in flight is an important point, thanks. Are instructions executed out-of-order or in-order? If out-of-order, and the address calculation does not take many instructions (one or two, for example), then interleaving can be more or less the same in latency, right? But that might require there to be more physical registers than logical registers?

You can search for pipeline, there has been a long topic on this. I believe it was around 20-24 cycles deep. Cannot remember exactly.

I did not find that post.

Yeah, the forum search is not the best; I always search with Google ;)

Thanks a lot!