Gap between measured perf. and peak

I’m working on motion estimation, which consists of very regular computation on blocks of data, like a typical image processing algorithm.

The measured performance on GeForce 8800GTX is around 40 GFLOPS, 1/10 of the peak. I checked the assembly code and found that a large part of the instructions compute memory indices, branch, or do other things I don’t fully understand. Only 1/3 of the code does the actual floating-point computation I want.

So I’m not sure whether those index and other integer operations can run in parallel with the floating-point calculations. Intel CPUs seem to have a superscalar architecture that runs integer and floating-point computation in parallel. What’s the case for the GPU? If it’s not superscalar, is it correct to say that the non-floating-point operations take up roughly 2/3 of the total time, if they account for 2/3 of the total number of instructions?


The stream processors in the GPU appear to be very simple, with an ALU that cannot multiplex float and integer instructions. This is how NVIDIA was able to pack 128 of them onto 1 chip.

If you have to do a lot of index calculation with integer multiplication, take a look at the __umul24() function, which does unsigned, 24-bit multiplication. This function is 4 times faster than standard 32-bit multiplication, and is sufficient for computing indices if your arrays have fewer than 16 million elements (both operands have to fit in 24 bits).
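To show what the 24-bit restriction means in practice, here is a host-side C++ model of the value __umul24() returns: the low 32 bits of the product of the low 24 bits of each operand. The names `umul24_model` and `pixel_index` are hypothetical helpers for illustration. In a real kernel you would call the intrinsic directly, for example `__umul24(row, width) + col`.

```cpp
#include <cstdint>

// Host-side model of the CUDA intrinsic __umul24(x, y):
// multiply the low 24 bits of x and y, keep the low 32 bits of the product.
std::uint32_t umul24_model(std::uint32_t x, std::uint32_t y) {
    const std::uint64_t a = x & 0xFFFFFFu;  // bits above bit 23 are ignored
    const std::uint64_t b = y & 0xFFFFFFu;
    return static_cast<std::uint32_t>(a * b);
}

// Hypothetical row-major index helper: correct as long as both
// row and width are below 2^24 and the result fits in 32 bits.
std::uint32_t pixel_index(std::uint32_t row, std::uint32_t width,
                          std::uint32_t col) {
    return umul24_model(row, width) + col;
}
```

For a 640-pixel-wide image, `pixel_index(3, 640, 5)` gives 1925, the same as `3 * 640 + 5`. An operand of 2^24 or more silently loses its upper bits, which is why the result is only safe for operands that fit in 24 bits.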

But before you fuss with that too much, you should use the CUDA profiler to make sure your memory accesses are coalesced. Counting instructions won’t tell you how much time your kernel spends sitting around, waiting for memory reads to finish. A sub-optimal memory access pattern is the main cause of poor CUDA performance, so you should rule that out first.

You could also already be hitting the peak memory throughput of 70 GiB/s. If you are memory limited, it doesn’t matter one bit if your kernel “wastes” ALU operations, since they are all hidden within the memory latency.

Thanks for the reply.

So what I’m working on is motion estimation, basically typical image processing, for which I expect good performance on the GPU :)

I barely use global memory; most data is stored in texture and shared memory. The few global memory accesses there are are coalesced, which I checked with the visual profiler.

So it’s still far from peak, and I guess the ‘indexing’ and ‘polling’ kind of instructions are taking too many cycles. I also use short for index calculation.

Any further ideas on how to improve this? I also cannot find any reported performance figures for block motion estimation on G80. Such information is welcome too.


It’s not memory limited, because most intermediate results are manipulated in shared memory. Only two small images are read from texture memory, once or twice per pixel. Global memory is almost unused.

I think any further advice would require seeing C and the corresponding PTX output.

Shared memory bank conflicts could be one thing to look into. Another thing that could reduce performance is suboptimal occupancy caused by a high register count or high shared memory usage. If you haven’t already, you could consult the occupancy calculator about that. The profiler will give you information about occupancy too.
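To make the bank-conflict point concrete: on G80-class hardware (compute capability 1.x), shared memory is split into 16 banks of 32-bit words, and accesses are serviced per half-warp of 16 threads. The host-side C++ sketch below models how many threads of a half-warp land in the same bank when thread t reads word t * stride. The function name and the strided access pattern are assumptions chosen for illustration, not taken from the kernel being discussed.

```cpp
#include <algorithm>
#include <array>

// Worst-case number of threads in a half-warp that hit the same shared
// memory bank when thread t accesses 32-bit word (t * stride_words).
// 1 means conflict-free; n means the access is serialized n ways.
// Assumes 16 banks and 16 threads per half-warp (compute capability 1.x).
int bank_conflict_degree(unsigned stride_words) {
    constexpr unsigned kBanks = 16;
    constexpr unsigned kHalfWarp = 16;
    if (stride_words == 0) {
        return 1;  // all threads read the same word: served as a broadcast
    }
    std::array<int, kBanks> hits{};  // per-bank access counts, zeroed
    int worst = 0;
    for (unsigned t = 0; t < kHalfWarp; ++t) {
        const unsigned bank = (t * stride_words) % kBanks;
        worst = std::max(worst, ++hits[bank]);
    }
    return worst;
}
```

In this model a stride of 1 word is conflict-free, a stride of 2 gives a 2-way conflict, and a stride of 16 sends every thread to bank 0 for a 16-way conflict. Padding a 16-word row to 17 words brings the degree back to 1.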

@seibert: thanks for the advice about __umul24() - I was not aware of that and I think I can exploit this in my kernels.

Thanks for the suggestion.

As far as I know, occupancy only matters when there is memory latency to hide, right?

The occupancy is not high, though (0.125), but I assumed memory latency was not a big issue, since I use a lot of shared memory and texture memory. Maybe you are right; I’ll increase the occupancy and see what happens…

News flash! Texture memory is device memory and is subject to the same 70 GiB/s throughput limit.

The only way you are going to know if you are reaching the limit is by counting the number of memory accesses and calculating an effective GiB/s.