too large kernel solutions

I use compute capability 1.0. My kernel uses 29 registers and I cannot optimize it further; it’s impossible (just assume that for now). With 29 registers the best occupancy I can get is with 64 threads per block (only 33%, 4 simultaneous blocks).

How can I get the GPU to be more efficient then? Some possible solutions:

  1. Launching several async kernels… but I think kernels are currently executed sequentially, and the launches block.

  2. Creating 2 or 3 host threads per GPU instead of one. Sometimes it goes much faster… but other times the threads end up waiting/locking and performance is worse…

  3. Forcing -maxrregcount=XXX, where XXX is less than 29 registers… for example, 16… but then performance suffers a lot (the extra registers spill to local memory).

  4. Splitting the kernel into several parts. Well… in my case, unfortunately, I cannot.

Any other solutions, please?

I’m a bit confused. With compute capability 1.0, each multiprocessor has 8192 registers. If your kernel uses 29, you should be able to launch 256 threads in a block.

More occupancy doesn’t always mean more performance.

4 blocks * 64 threads = 256 threads per MP ;)

256 threads per MP should be enough to hide latency as far as I remember, so making things faster has to come from thinking of another algorithm / coalescing your reads & writes. With 256 threads per MP, you should be able to reach full memory bandwidth.

Yep, but with 64 threads per block, 4 blocks can be active at the same time… so the kernel can run faster if you use a lot of __syncthreads()… and it also depends on the shared memory restrictions. With 256 threads the shared memory used can be too much, depending on the kernel.

But the real problem is the poor 33% occupancy… and I really don’t know how to improve it without reducing the kernel’s complexity :( The kernel performs ray tracing and it’s quite complex (ray-triangle intersection, tree traversal, the next iteration depends on the hit, tons of branches, a stack… it’s very complex and cannot be reduced much more… 29 registers is very good for all the things it does).

I can imagine the thing about __syncthreads(), but then it is probably faster to switch to blocks of 32 threads. That should give the same occupancy (an MP can hold up to 8 blocks) and remove the need for __syncthreads(), since a 32-thread block is a single warp.

Also, I cannot see how you would need less shared memory per MP. Let’s say you need 100 bytes for 64 threads. Then you would need 400 bytes for 256 threads normally, and your 4 blocks would also need 400 bytes in total.

Actually you use less shared memory with a block of 256 threads, since the kernel arguments are also stored in shared memory, as are blockIdx, blockDim and gridDim, and as far as I know (hmm, is there a way to check this?) each block gets its own copy (which is a bit of overkill really). But at least blockIdx is different for each block.

My ray-tracing kernel uses about 70 registers, so you can be very happy (it’s not like normal ray tracing though, and I use a kd-tree, but still you can be proud). But what you need to realize is: 33% occupancy is OK; really, it is often enough to hide the latency.

Occupancy by itself tells you nothing about performance; it only tells you how much chance the hardware has to hide memory latency for you. And if you have enough calculations per memory access (sorry, I don’t have numbers), 33% should be sufficient.

Did you run your kernel through the profiler? How many divergent branches do you get? It might be that that is what is actually killing your performance.

Did you calculate how many GB/s your kernel is doing? It also might be that you are just memory-bandwidth bound.

I wish… but I cannot. I get a -96 profiler error. I’m packing coherent rays, so divergence should not be a problem (there aren’t many incoherent rays; everything is pretty coherent at the moment).

Btw… if I create two host threads for each CUDA device then I get a 100% speed increase sometimes… but other times the speed drops… it’s very strange. I also tried firing two async kernels… but I think they are executed sequentially and blocking.

The GPU can only execute one kernel at a time, no matter whether you launch it async or from different threads. Your unpredictable 2x increase probably comes from a mistake in timing.

256 simultaneous threads is fine. (256 is the magic threshold, and it’s only relevant when you’re doing many global mem reads.) If you get a GTX 260 you can run 512 threads if that’s what worries you.

In your list of solutions you must add: “optimize the kernel’s memory accesses by making them coalesced or moving them into shared/constant/texture memory.” That’s always the ticket with CUDA.

Yep, yep, I agree… but I was focusing more on this thought:

“my kernel cannot be optimized more… what can I do then”.

To whoever is thinking about 32-thread blocks:

If you check your CUDA occupancy calculator, the number of registers per block is calculated with the thread count rounded up to a multiple of 64.

So, as far as register usage is concerned, a 32-thread block is as good as having 64 threads per block!

But 32-thread blocks do have some advantages – no __syncthreads(), no double buffering, and so on. If you use them with care and they suit your computation, you will get very good performance.

Do you get the same occupancy reported in the profiler, btw? I still have no clue whether this is a bug in the occupancy calculator (although it does not look like a bug, because the computation is more involved this way).