I am targeting compute capability 1.0. My kernel uses 29 registers per thread, and I cannot reduce that any further (just assume so for now). With 29 registers, the best occupancy I can get is with 64 threads per block: only 33%, with 4 blocks resident at the same time.
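For reference, this is roughly how I arrive at the 33% figure. It is only a host-side C++ sketch: the names `SmLimits`, `residentBlocks` and `occupancyPercent` are made up for illustration, and I am assuming the compute capability 1.0 limits of 8192 registers, 768 resident threads and 8 resident blocks per multiprocessor. It ignores shared memory and register allocation granularity, so real numbers can differ slightly.

```cpp
#include <algorithm>
#include <cassert>

// Per-multiprocessor limits, assumed here for compute capability 1.0.
struct SmLimits {
    int registers = 8192;  // 32-bit registers per multiprocessor
    int maxThreads = 768;  // resident threads per multiprocessor
    int maxBlocks = 8;     // resident blocks per multiprocessor
};

// Number of blocks that fit on one multiprocessor at the same time,
// limited by registers, by resident threads, and by block slots.
// Simplification: shared memory and allocation granularity are ignored.
int residentBlocks(int regsPerThread, int threadsPerBlock, SmLimits lim = SmLimits{}) {
    assert(regsPerThread > 0 && threadsPerBlock > 0);
    int byRegisters = lim.registers / (regsPerThread * threadsPerBlock);
    int byThreads = lim.maxThreads / threadsPerBlock;
    return std::min({byRegisters, byThreads, lim.maxBlocks});
}

// Occupancy as a whole-number percentage of the maximum resident threads
// (integer division, so the result is truncated).
int occupancyPercent(int regsPerThread, int threadsPerBlock, SmLimits lim = SmLimits{}) {
    int blocks = residentBlocks(regsPerThread, threadsPerBlock, lim);
    return 100 * blocks * threadsPerBlock / lim.maxThreads;
}
```

With these assumptions, `residentBlocks(29, 64)` gives 4 blocks, i.e. 256 of 768 threads, and `occupancyPercent(29, 64)` gives 33.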
How can I make the GPU more efficient, then? Some possible solutions I have considered:
Launch several asynchronous kernels… but I think kernels are currently executed sequentially, each one blocking the next.
Create 2 or 3 CPU threads per GPU instead of one. Sometimes this is much faster… but other times the threads end up waiting or locking, and performance is worse…
Force -maxrregcount=XXX, where XXX is less than 29, for example 16… but performance suffers a lot.
Split the kernel into several parts. Well… in my case, unfortunately, I cannot.
Any other solutions, please?