CUDA BETA 2.1 REGISTER PROBLEM Help Please !!

First of all thanks to NVIDIA guys for CUDA 2.1

Now the issue:

I had a kernel that was using 22-23 registers without loop unrolling and around 24 registers with loop unrolling (unroll factor >= 32), which gave a significant performance increase. That was the situation with CUDA 2.0.

I installed CUDA 2.1. Without loop unrolling my register usage was the same, but with loop unrolling it shot up to 29 :mellow: , so the block size I was using previously (320, a multiple of 32) no longer lets the kernel run, because the total number of registers available is too small (8192 on the 8600GT).

And if I do a smaller unroll (unroll factor < 32), the kernel runs, but the performance is not that impressive.
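For reference, the unrolling I am talking about is structurally like the placeholder sketch below (this is not my real kernel, just the shape of the loop):

// Placeholder sketch: a reduction-style loop with an explicit unroll hint.
// The real kernel does more work per iteration, but the structure is the same.
__global__ void accumulate(const float *in, float *out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    float sum = 0.0f;

    // unroll factor >= 32 is where the register count jumps under 2.1
    #pragma unroll 32
    for (int i = 0; i < n; ++i)
        sum += in[tid + i * stride];

    out[tid] = sum;
}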

Is this some kind of new register allocation behaviour in CUDA 2.1 that we should know about?

Thanks.

Nittin arora

Try the -maxrregcount option to nvcc to reduce register count to what you need. When doing that, make sure to keep track of lmem usage (if it goes up, you’re spilling registers to global memory).
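For example (the file name below is a placeholder), a compile line along these lines caps the kernel at 24 registers and makes ptxas report per-kernel register, smem and lmem usage so you can watch for spilling; adjust the 24 to whatever lets your block size fit:

nvcc -O3 --maxrregcount=24 --ptxas-options=-v -c mykernel.cu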

Paulius

Great… it works… but why does the new CUDA 2.1 allocate more registers?

Any reasons?

Thanks

How about your lmem usage, Nitin? Has it increased? We are also using unrolling to max out performance…

btw, Will this be fixed in the next official release?

My lmem usage did go up too… I was not using any previously; now it shows 8 bytes, which I guess is not that much. I also got a small performance increase, since I can now run the full block size of 512 threads.

but “how much” lmem is bad?

Thanks…

Nittin

“How much” does NOT matter. How frequently you access it during program execution is what matters…

The more frequently you use it (inside for loops, etc.), the bigger the penalty you pay.

I would assume that the compiler would automatically choose to move the least-used local variable to lmem…
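To make the frequency point concrete, here is a toy sketch (names are made up, and I am assuming the marked value is the one the compiler decides to spill):

// Toy illustration: if the spilled value is read inside the loop, you pay the
// lmem round-trip on every iteration; if it is only touched outside the loop,
// you pay it once and a few bytes of lmem hardly matter.
__global__ void example(const float *in, float *out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float maybe_spilled = in[tid];            // touched once: cheap even in lmem
    float acc = 0.0f;

    for (int i = 0; i < n; ++i)
        acc += in[tid + i] * maybe_spilled;   // touched every iteration: costly if spilled

    out[tid] = acc;
}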

Yes… unfortunately, it increased. I was using 0 lmem previously; now, with maximum unrolling, it uses 8-16 bytes (depending on the block size). But I still see a performance increase, since I can execute 512 threads per block.

I think they (NVIDIA) are just trying to maximize register usage, hence the compiler is allocating more registers.

Thanks,

Nittin

Sometimes you may want to manually edit the PTX code to reduce register pressure, because the register allocation and instruction scheduling in both nvopencc and ptxas suck. For example, when your program accesses many different arrays, the compiler tends to group the address calculations together instead of interleaving address calculation with the loads/stores, which increases register pressure.
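A sketch of the situation I mean (array and kernel names are made up). The source just reads three arrays, and the backend is free either to compute all three addresses up front (grouped: three address registers live at the same time) or to compute each address right before its load (interleaved: lower register pressure, but each load waits on the address calculation immediately before it):

// Source-level sketch; the grouped vs. interleaved choice is made by the backend.
__global__ void gather3(const float *a, const float *b, const float *c,
                        float *out, int stride)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int idx = tid * stride;
    float x = a[idx];   // grouped schedule: addresses of a, b, c computed first,
    float y = b[idx];   // then the three loads issued back to back;
    float z = c[idx];   // interleaved schedule: address, load, address, load, ...
    out[tid] = x + y + z;
}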

I am sorry, I should not have jumped to that conclusion so quickly. Grouping the address calculations together has its own advantage: the instructions are independent of one another, whereas interleaving address calculation with loads/stores introduces dependences between instructions. In the former case, the address calculation for the next array can be issued before the previous one finishes; in the latter case, each load/store has to wait for its address calculation to complete.

Am I right? Anyone wants to comment on this?

Latency due to register-register dependencies is covered if you run 192 or more threads per SM (this is somewhere in the Programming Guide, perhaps search for 192).

Grouping or batching several gmem or tex accesses can have a performance benefit for another reason: multiple memory accesses can be in flight at the same time (the classic pipelining benefit), because a thread doesn’t block on the memory access itself, it blocks on the first instruction that depends on an outstanding memory access. One consequence is increased register use, since the registers holding the results of the pipelined accesses cannot be reused until those accesses complete.
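A sketch of the batching idea (names below are placeholders): all four loads are issued back to back, so the four memory accesses are in flight simultaneously; the price is four live registers (x0..x3) that cannot be reused until the corresponding loads return.

__global__ void sum4(const float *in, float *out, int stride)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    float x0 = in[tid];                // four independent loads issued
    float x1 = in[tid + stride];       // before any arithmetic that
    float x2 = in[tid + 2 * stride];   // depends on them
    float x3 = in[tid + 3 * stride];

    out[tid] = (x0 + x1) + (x2 + x3);  // first instruction that depends on the loads
}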

Paulius

I found it on page 67 of the Programming Guide: “The delays introduced by read-after-write dependencies can be ignored as soon as there are at least 192 active threads per multiprocessor to hide them.” (I did search for 192, thanks; otherwise it would not have been easy to find.) So the latency is actually 192/8 = 24 cycles (192 threads over the 8 SPs of a multiprocessor), i.e. 192/32 = 6 warps, or 6 independent instructions if there is only one warp.

1. I think your pipeline is very long then, because the latency indicates the number of stages between RF and EXE, am I right? So if I have the instruction sequence “add r1, r2, r3; add r4, r2, r1” and there is only one warp, does that mean I lose about 20 cycles between the two instructions?

2. Good point about multiple memory accesses in flight, thanks. Are instructions executed out of order or in order? If out of order, and the address calculation does not take many instructions (one or two, say), then interleaving should be more or less the same in latency, right? But that may require more physical registers than logical registers?

You can search for “pipeline”; there has been a long topic on this. I believe it was around 20-24 cycles deep, but I cannot remember exactly.

I did not find that post :">

Yeah, the forum search is not the best; I always search with Google ;)

http://www.anandtech.com/video/showdoc.aspx?i=3336

http://forums.nvidia.com/index.php?showtop…8&hl=6+warp

http://forums.nvidia.com/index.php?showtop…warps&st=40

Thanks a lot!