Where's my bottleneck

I’ve been trying to optimise a kernel, but I’m not sure which direction to take because I don’t know what’s bottlenecking it. I wonder if anybody here can help.

The kernel currently runs only 875 active threads, launched as 14 blocks of 64 threads each. I’m aware that this is shockingly low, but the data must be on the GPU for other operations, and copying the memory takes much longer than executing the kernel… so it’s unavoidable. Occupancy is at 0.5.

Everything is double precision, running on a GTX 260.

Some observations:

*Increasing/decreasing the amount of data to be processed causes very little change in runtime.
This suggests to me that the kernel isn’t bandwidth bound, nor is it actually filling the GPU. Both observations are backed up by calculation: bandwidth should be ~1 GB/s, and 14 blocks of 64 threads at occupancy 0.5 isn’t going to fill the GPU.

*The kernel only does a few arithmetic operations (one MAD per coalesced load, one subtraction per uncoalesced store).
From what I’ve read, this means it is unlikely to be compute bound.

*There are ~50,000 instructions per call, and ~50,000,000 per second (according to the Visual Profiler).
I have kernels with higher instruction issue rates, but this is up there near the top (slightly below the SDK reduction kernel). I have tried reducing it with unrolling, with no luck so far. This is my most likely suspect. Does anybody know anything about this? How would I go about reducing it?

*High level of branching.
I’ve written it quite generically, though templating it has reduced the branching somewhat. From what I understand, this is not a problem.

*Very low divergent branch rate. It’s non-zero, but tiny.
I’m almost certain it’s not a problem.

One other thought is that perhaps it’s memory latency that’s killing me. Section 5.2 recommends at least twice as many blocks as multiprocessors (I have 14 blocks and 24 MPs) to hide device memory latency (and thread sync, but I don’t have any of that). If this is the problem, would there be any way around it? It also recommends at least 192 threads per MP, something which I’m also woefully lacking.

Any thoughts?

You’ve already identified your problem correctly: the number of threads. With 14 blocks, you’re not coming close to filling the card, and with 64 threads per block you have only two warps per SM. As a result, when one warp stalls on a memory load, the other will switch in, but it isn’t going to have a few hundred cycles’ worth of work to hide the memory latency every time.

Using double precision makes this easier (each double-precision instruction keeps the SM busy longer, so fewer warps are needed to cover a given latency), but I still don’t think you have nearly enough threads to hide it.