too large kernel solutions

I use compute capability 1.0. My kernel uses 29 registers. I cannot optimize it more, it’s impossible (just assume it for now). With 29 registers, the best occupancy I can get is 64 threads per block (only 33%, with 4 simultaneous blocks).

How can I get the GPU to be more efficient then? Some possible solutions:

  1. Launching several async kernels (see the stream sketch after this list)… but I think the kernels are currently executed sequentially and blocking.

  2. Create 2 or 3 host threads for each GPU instead of one. Sometimes it goes much faster… but other times the threads end up waiting/locking and the performance is worse…

  3. Force a -maxrregcount=XXX, where XXX is less than 29 registers… for example, 16… but then the performance suffers a lot.

  4. Try to split the kernel into several parts. Unfortunately, in my case I cannot.
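
For option 1, a minimal sketch of what I mean (myKernel and the data pointers are placeholders, not my real ray-tracing kernel):

// Hypothetical stand-in for my ray-tracing kernel (placeholder work only).
__global__ void myKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

void launchTwoAsync(float *d_a, float *d_b, int n)
{
    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    dim3 block(64);                          // 64 threads per block, as discussed
    dim3 grid((n + block.x - 1) / block.x);

    // Both launches return to the CPU immediately...
    myKernel<<<grid, block, 0, s0>>>(d_a, n);
    myKernel<<<grid, block, 0, s1>>>(d_b, n);

    // ...but a compute 1.0 device still seems to run them one after the other.
    cudaStreamSynchronize(s0);
    cudaStreamSynchronize(s1);

    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
}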

Any other solutions, pls?
thx

I’m a bit confused. With compute capability 1.0, each multiprocessor has 8192 registers. If your kernel uses 29, you should be able to launch 256 threads in a block.
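
The arithmetic behind that: a 256-thread block needs 256 × 29 = 7424 registers, which fits in the 8192-register file (8192 / 29 ≈ 282 threads’ worth of registers).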

More occupancy doesn’t always mean more performance.

4 blocks * 64 threads = 256 threads per MP ;)

256 threads per MP should be enough to hide latency as far as I remember, so getting things faster has to be done by thinking of another algorithm / coalescing your reads & writes. With 256 threads per MP, you should be able to reach memory bandwidth.
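
To illustrate what coalesced reads & writes look like (a generic sketch, not the ray-tracing kernel):

// Coalesced: thread i touches element i, so each half-warp reads one
// contiguous, aligned segment (a single transaction on compute 1.0).
__global__ void copyCoalesced(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Uncoalesced: the stride scatters each half-warp across memory, so the
// hardware splits the access into many separate transactions.
__global__ void copyStrided(const float *in, float *out, int n, int stride)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n) out[i] = in[i];
}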

Yep, but with 64 threads per block, 4 blocks can be active at the same time… so the kernel can run faster if you use a lot of __syncthreads()… and it also depends on the shared memory restrictions. With 256 threads, the shared memory used can be too much, depending on the kernel.

But the real problem is the poor 33% occupancy… and I really don’t know how to improve it without reducing the kernel complexity. The kernel performs ray tracing and it’s quite complex (ray-triangle intersection, tree traversal, the next iteration depends on the hit, tons of branches, a stack… it’s very complex and cannot be reduced much more… 29 registers is very good for all the things it does).

I can imagine the thing about __syncthreads(), but then it is probably faster to switch to blocks of 32 threads. That should give the same occupancy (an MP can hold up to 8 blocks) and remove the need for __syncthreads().

Also, I cannot see how you would need less shared memory per MP. Let’s say you need 100 bytes for 64 threads. Then you would need 400 bytes for 256 threads normally. And your 4 blocks would also need 400 bytes.

Actually you use less shared memory with one block of 256 threads, since the kernel arguments are also placed in shared memory, as are blockIdx, blockDim and gridDim, and as far as I know (hmm, is there a way to check this?), each block gets its own copy (which is a bit of overkill really). But at least blockIdx is different for each block.
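
To make the shared-memory point concrete (a generic sketch with a hypothetical per-thread scratch buffer, not the ray tracer):

// Per-thread scratch in dynamic shared memory: the amount requested per block
// scales with the block size, so 4 blocks of 64 threads and 1 block of 256
// threads need the same total, apart from the small per-block overhead for
// kernel arguments, blockIdx, etc. mentioned above.
__global__ void kernelWithScratch(float *data, int n)
{
    extern __shared__ float scratch[];            // blockDim.x floats, sized at launch
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    scratch[threadIdx.x] = (i < n) ? data[i] : 0.0f;
    __syncthreads();
    if (i < n) data[i] = scratch[threadIdx.x];    // placeholder use of the scratch buffer
}

// kernelWithScratch<<<grid,  64,  64 * sizeof(float)>>>(d_data, n);
// kernelWithScratch<<<grid, 256, 256 * sizeof(float)>>>(d_data, n);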

My ray-tracing kernel uses something like 70 registers, so you can be very happy (it’s not like normal ray tracing though, and I use a kd-tree, but you can still be proud). But what you need to realize is: 33% occupancy is OK, really; it is often enough to hide the latency.

Occupancy tells you nothing about performance by itself; it only tells you how much chance the hardware has to hide memory latency for you. And if you have enough calculations per memory access (sorry, I don’t have numbers), 33% should be sufficient.

Did you run your kernel through the profiler? How many divergent branches do you get? It might be that that is what is actually killing your performance.

Did you calculate how many GB/s your kernel is doing? It also might be that you are just memory-bandwidth bound.
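
A back-of-the-envelope way to check (sketch only; the kernel and its byte count per element are placeholders for your real one):

// Placeholder kernel: one read + one write per element.
__global__ void bwKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 0.5f + 1.0f;
}

// Effective bandwidth of one launch, in GB/s.
double effectiveBandwidth(float *d_data, int n)
{
    dim3 block(64);
    dim3 grid((n + block.x - 1) / block.x);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    bwKernel<<<grid, block>>>(d_data, n);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);

    double bytes = 2.0 * n * sizeof(float);       // one read + one write per element
    return bytes / (ms / 1000.0) / 1e9;
}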

I wish… but I cannot. I get a -96 profiler error. I’m packing coherent rays, so divergence should not be a problem (there aren’t many incoherent rays; everything is pretty coherent at the moment).

Btw… if I create two threads for each CUDA device, I sometimes get a 100% speed increase… but other times the speed drops… it’s very strange. I also tried to fire two async kernels… but I think they are executed sequentially and blocking.

The GPU can only execute one kernel at a time, no matter whether you launch it async or from different host threads. Your unpredictable 2x increase is probably from a mistake in timing.
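
The usual pitfall, sketched (hypothetical kernel; the point is only the missing synchronize):

__global__ void work(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;                   // placeholder work
}

#include <cstdio>
#include <ctime>

void timeOneLaunch(float *d_data, int n)
{
    dim3 block(64);
    dim3 grid((n + block.x - 1) / block.x);

    clock_t t0 = clock();
    work<<<grid, block>>>(d_data, n);
    // Reading the clock here would only measure launch overhead, because the
    // launch returns to the CPU before the kernel has finished.

    cudaThreadSynchronize();                      // wait for the kernel to complete
    double s = double(clock() - t0) / CLOCKS_PER_SEC;
    printf("kernel time: %.4f s\n", s);
}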

256 simultaneous threads is fine. (256 is the magic threshold, and it’s only relevant when you’re doing many global mem reads.) If you get a GTX 260 you can run 512 threads if that’s what worries you.

In your list of solutions you must add: “optimize the kernel’s memory accesses by making them coalesced or moving them into shared/constant/texture memory.” That’s always the ticket with CUDA.
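
As a generic illustration of that last point (not tied to the ray tracer): read-only values that every thread needs are a natural fit for constant memory, which is cached and broadcast on compute 1.0 hardware.

// Read-only parameters shared by all threads, kept in constant memory.
__constant__ float c_params[16];

__global__ void useParams(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = c_params[0] * i + c_params[1];
}

// Host side, once before the launches:
//   float h_params[16] = { /* ... */ };
//   cudaMemcpyToSymbol(c_params, h_params, sizeof(h_params));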

Yep, yep, I agree… but I was focusing more on this thought:

“my kernel cannot be optimized more… what can I do then”.

To whoever is thinking about 32-thread blocks:

If you check your CUDA occupancy calculator, the number of registers per block is calculated as 64 * (registers per thread), even if the block only has 32 threads.

So, it is as good as having 64 threads per block – as far as register usage is concerned!
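
In numbers, taking that 64-thread granularity at face value: a 32-thread block at 29 registers is charged roughly 64 × 29 = 1856 registers, so at most 4 such blocks fit in the 8192-register file, i.e. 128 resident threads instead of the 256 you get from 4 blocks of 64.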

But 32-thread blocks do have some advantages – no __syncthreads(), no double buffering, and so on. If you use them with care and they suit your computation, you will get very good performance.

Do you get the same occupancy reported in the profiler, btw? I still have no clue whether this is a bug in the occupancy calculator (although it does not look like one, since computing it this way is actually more work, so it seems deliberate).