Where's my bottleneck

Tigga · August 29, 2008, 4:39pm

I’ve been trying to optimise a kernel - but I’m not sure which direction to take at the moment as I’m not sure what’s bottlenecking it. I wonder if anybody here can help.

The kernel currently runs only 875 threads, split into 14 blocks containing 64 threads each. I am aware that this is rather shockingly low, but the data must be on the GPU for other operations, and copying the memory takes much longer than executing the kernel… so it’s unavoidable. Occupancy is at 0.5.
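The launch arithmetic above can be checked with a small Python sketch. It assumes that 875 is the number of work items and that the block count is rounded up to whole blocks of 64; the helper name `launch_shape` is made up for illustration.

```python
import math

WARP_SIZE = 32  # threads per warp on this generation of NVIDIA GPUs


def launch_shape(work_items, threads_per_block):
    """Return (blocks, launched_threads, idle_threads, warps_per_block)."""
    # Round up so every work item gets a thread.
    blocks = math.ceil(work_items / threads_per_block)
    launched = blocks * threads_per_block
    # Threads in the last block that have no work item (assumed to exit early).
    idle = launched - work_items
    warps_per_block = math.ceil(threads_per_block / WARP_SIZE)
    return blocks, launched, idle, warps_per_block


blocks, launched, idle, warps = launch_shape(875, 64)
print(blocks, launched, idle, warps)  # 14 896 21 2
```

Under those assumptions, 14 blocks of 64 launch 896 threads, of which 21 do no work, and each block holds just two warps.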

Everything is double precision, running on a GTX 260.

Some observations:

*Increasing/decreasing the amount of data to be processed causes very little change in runtime.
This suggests to me that the kernel isn’t bandwidth bound, nor is it actually filling the GPU. Both these observations are backed up by calculation: bandwidth should be ~1 GB/s, and 14 blocks of 64 threads with occupancy 0.5 aren’t going to fill the GPU.

*The kernel only does a few arithmetic operations (one MAD per coalesced load, one subtraction per uncoalesced store).
From what I’ve read this means it is unlikely to be compute bound.

*There are ~50,000 instructions per call, or 50,000,000 per second (according to the visual profiler).
I have kernels with higher instruction issue rates, but this is up there near the top (slightly below the SDK reduction kernel). I have tried reducing this with unrolling, with no luck so far. This is my most likely suspect. Does anybody know anything about this? How would I go about reducing it?

*High level of branching.
I’ve written it quite generically, though templating it has reduced the branching somewhat. From what I understand this is not a problem.

*Very low divergent branch rate. It’s non-zero, but tiny.
Almost certain it’s not a problem.
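Taken at face value, the two instruction figures in the observations above imply a time per call. The sketch below just does that division; it assumes both numbers come from the same profiler counter for the same kernel, which the post does not confirm, and the function name `seconds_per_call` is hypothetical.

```python
def seconds_per_call(instructions_per_call, instructions_per_second):
    """Time per kernel call implied by an instruction count and an issue rate."""
    return instructions_per_call / instructions_per_second


# Figures quoted from the visual profiler in the post.
t = seconds_per_call(50_000, 50_000_000)
print(f"{t * 1e3:.1f} ms per call")  # 1.0 ms per call
```

If the measured kernel time is far from this, the two figures are probably not counting the same thing.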

One other thought is that perhaps it’s memory latency that’s killing me. Section 5.2 recommends twice as many blocks as multiprocessors (I have 14 blocks, and 24 MPs) to hide device memory latency (and thread sync, but I don’t have any of that). If this is the problem, would there be any way around it? 5.1.2.6 also recommends at least 192 threads per MP, something which I’m also woefully lacking in.
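To put numbers on the two guidelines quoted above, here is a rough Python sketch. The 24 MPs, 14 blocks, 64 threads per block, the "twice as many blocks" rule and the 192-thread figure all come from the post; the assumption that the hardware places one block on each idle MP before doubling up is mine, and `fill_report` is a made-up name.

```python
import math


def fill_report(blocks, threads_per_block, multiprocessors,
                blocks_per_mp_target=2, threads_per_mp_target=192):
    """Compare a launch against the blocks-per-MP and threads-per-MP guidelines."""
    # Assumption: blocks are spread one per MP until every MP has one.
    busy_mps = min(blocks, multiprocessors)
    return {
        "busy_mps": busy_mps,
        "idle_mps": multiprocessors - busy_mps,
        "target_blocks": blocks_per_mp_target * multiprocessors,
        "threads_per_busy_mp": math.ceil(blocks / busy_mps) * threads_per_block,
        "target_threads_per_mp": threads_per_mp_target,
    }


report = fill_report(blocks=14, threads_per_block=64, multiprocessors=24)
for key, value in report.items():
    print(key, value)
```

With these inputs, 10 of the 24 MPs sit idle, the block count is 14 against a target of 48, and each busy MP holds 64 threads against a recommended 192.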

Any thoughts?

tmurray · August 29, 2008, 5:33pm

You’ve already identified your problem correctly: the number of threads. With 14 blocks, you’re not coming close to filling the card, and with 64 threads per block, you have only two warps per SM. As a result, when one warp stalls on a memory load, the other will switch in, but it is not going to have a few hundred cycles’ worth of work to hide the memory latency effectively every time.

Using double precision makes the latency easier to hide, but I still don’t think you will have nearly enough threads to hide it.
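The latency argument can be turned into a back-of-envelope model. This is only a sketch under stated assumptions: a memory latency of 400 cycles (standing in for "a few hundred"), two arithmetic instructions per warp between loads, and issue costs of 4 cycles per warp instruction in single precision and 32 in double precision on this GPU generation. None of these numbers come from the thread, and both function names are hypothetical.

```python
import math


def cover_cycles(resident_warps, instrs_between_loads, cycles_per_warp_instr):
    """Cycles of work the other warps can issue while one warp waits on memory."""
    return (resident_warps - 1) * instrs_between_loads * cycles_per_warp_instr


def warps_needed(latency_cycles, instrs_between_loads, cycles_per_warp_instr):
    """Resident warps needed so the others can cover one warp's memory stall."""
    per_warp = instrs_between_loads * cycles_per_warp_instr
    return 1 + math.ceil(latency_cycles / per_warp)


LATENCY = 400        # assumed memory latency in cycles
INSTRS = 2           # assumed arithmetic instructions between loads
SP_COST, DP_COST = 4, 32  # assumed issue cycles per warp instruction

print(cover_cycles(2, INSTRS, DP_COST))        # 64 cycles covered with two warps
print(warps_needed(LATENCY, INSTRS, DP_COST))  # 8 warps in double precision
print(warps_needed(LATENCY, INSTRS, SP_COST))  # 51 warps in single precision
```

Under these assumptions, two warps cover only about 64 of the 400 cycles. Double precision lowers the number of warps needed, but it is still well above two.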
