Gap between measured perf. and peak

Hella_Yu · March 18, 2008, 12:07am

I’m working on motion estimation, which consists of very regular computation on blocks of data, like a typical image processing algorithm.

The measured performance on a GeForce 8800 GTX is around 40 GFLOPS, 1/10 of the peak. I checked the assembly code and found that a large part of the instructions compute memory indices, branch, or do other things I don’t fully understand. Only 1/3 of the code does the actual floating-point computation I want.

So I’m not sure whether those index calculations and other integer operations can run in parallel with the floating-point calculations. Intel CPUs seem to have a superscalar architecture that parallelizes integer and floating-point computation. What is the case for the GPU? If it’s not superscalar, is it correct to say that non-floating-point operations take up roughly 2/3 of the total time if they account for 2/3 of the total number of instructions?

Thanks.

seibert · March 18, 2008, 1:50am

The stream processors in the GPU appear to be very simple, with an ALU that cannot multiplex float and integer instructions. This is how NVIDIA was able to pack 128 of them onto 1 chip.
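If float and integer instructions share a single issue slot, the instruction mix gives a rough ceiling on achievable FLOPS. The sketch below is plain arithmetic, not a measurement. The 128 processors and the 1/3 floating-point fraction come from the posts above; the 1.35 GHz shader clock, the 2 flops per multiply-add, and the idea that every instruction costs the same issue slot are assumptions added here.

```python
def peak_gflops(num_processors, clock_ghz, flops_per_instruction):
    """Peak rate if every issued instruction is a floating-point one."""
    return num_processors * clock_ghz * flops_per_instruction


def mix_bound_gflops(peak, fp_fraction):
    """Ceiling when only fp_fraction of issued instructions do float work."""
    return peak * fp_fraction


# Assumed figures: 128 processors, 1.35 GHz, one multiply-add = 2 flops.
PEAK = peak_gflops(128, 1.35, 2.0)
# One third of the instructions are floating point (from the first post).
BOUND = mix_bound_gflops(PEAK, 1.0 / 3.0)

print(round(PEAK, 1), round(BOUND, 1))
```

Under these assumptions the instruction mix alone would cap the kernel at about a third of peak. That is still well above the measured 40 GFLOPS, so something besides integer overhead is probably costing time as well.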

If you have to do a lot of index calculation with integer multiplication, take a look at the __umul24() function, which does unsigned, 24-bit multiplication. This function is 4 times faster than standard 32-bit multiplication on this hardware, and it is sufficient for computing indices as long as both operands fit in 24 bits, which holds if your arrays have fewer than 16 million elements.
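To see where the 24-bit limit bites, here is a small host-side emulation in Python. It assumes the documented behaviour of __umul24(): multiply the low 24 bits of each operand and keep the low 32 bits of the product. The helper names (umul24, mul32, flat_index) are made up for this illustration and are not CUDA APIs.

```python
MASK24 = (1 << 24) - 1
MASK32 = (1 << 32) - 1


def umul24(x, y):
    """Emulate a 24-bit unsigned multiply: low 24 bits in, low 32 bits out."""
    return ((x & MASK24) * (y & MASK24)) & MASK32


def mul32(x, y):
    """Emulate an ordinary 32-bit unsigned multiply (wraps modulo 2**32)."""
    return (x * y) & MASK32


def flat_index(row, width, col):
    """Row-major index using the 24-bit multiply for row * width."""
    return umul24(row, width) + col


# Typical image-sized operands: both multiplies agree.
assert flat_index(1079, 1920, 1919) == 1079 * 1920 + 1919

# An operand that needs 25 bits: the 24-bit multiply silently drops the top bit.
print(umul24(1 << 24, 3), mul32(1 << 24, 3))
```

For image-sized rows and widths the two multiplies agree, so swapping in the 24-bit version is safe there. It only goes wrong once an operand no longer fits in 24 bits.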

But before you fuss with that too much, you should use the CUDA profiler to make sure your memory accesses are coalesced. Counting instructions won’t tell you how much time your kernel spends sitting around, waiting for memory reads to finish. A sub-optimal memory access pattern is the main cause of poor CUDA performance, so you should rule that out first.

MisterAnderson42 · March 18, 2008, 3:42am

You could also already be hitting the peak memory throughput of 70 GiB/s. If you are memory limited, it doesn’t matter one bit if your kernel “wastes” ALU operations, since they are all hidden within the memory latency.

Hella_Yu · March 19, 2008, 11:25pm

Thanks for the reply.

What I’m working on is motion estimation, basically typical image processing, for which I expected good performance on the GPU :)

I use almost no global memory; most data is stored in texture and shared memory. The few global memory accesses there are coalesced, which I checked with the Visual Profiler.

So it’s still far from peak. I guess the ‘indexing’ and ‘polling’ instructions are taking too many cycles. I also use short for index calculation.

Any further ideas on how to improve this? I also cannot find any reported performance figures for block motion estimation on G80. Such information is also welcome.

Thanks

Hella_Yu · March 19, 2008, 11:27pm

It’s not memory limited, because most intermediate results are manipulated in shared memory. Only two small images are read from texture memory, once or twice per pixel. Global memory is almost unused.

seibert · March 19, 2008, 11:37pm

I think any further advice would require seeing C and the corresponding PTX output.

seb · March 19, 2008, 11:40pm

Shared memory bank conflicts could be one thing to look into. Another thing that could reduce performance is non-optimal occupancy caused by a high register count or high shared memory usage. If you haven’t already, you could consult the occupancy calculator on that. The profiler will give you information about occupancy too.
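A quick way to reason about bank conflicts is to count how many different words in one half-warp land in the same bank. The sketch below assumes the G80-style layout of 16 banks of 32-bit words, serviced per half-warp of 16 threads, with each thread reading one word at a fixed stride. It is a simplified model (the function name is hypothetical, and the hardware broadcast rules are only approximated by counting distinct addresses).

```python
from collections import defaultdict

NUM_BANKS = 16   # assumed: 16 shared memory banks, each 32 bits wide
HALF_WARP = 16   # assumed: accesses are serviced per half-warp


def conflict_degree(stride_words):
    """Worst-case number of distinct words that hit a single bank."""
    words_per_bank = defaultdict(set)
    for thread in range(HALF_WARP):
        address = thread * stride_words        # word index read by this thread
        words_per_bank[address % NUM_BANKS].add(address)
    return max(len(words) for words in words_per_bank.values())


for stride in (1, 2, 3, 16, 17):
    print(stride, conflict_degree(stride))
```

A degree of 1 means no conflict. A degree of n means the access is serialized into roughly n passes, which is why padding a 16-word row to 17 words is a common fix for column-wise access.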

@seibert: thanks for the advice about __umul24() - I was not aware of that and I think I can exploit this in my kernels.

Hella_Yu · March 20, 2008, 12:58am

Thanks for the suggestion.

As far as I know, occupancy only matters when there is memory latency to hide, right?

The occupancy is not high (0.125), but I assumed memory latency is not a big issue, since I use a lot of shared memory and texture memory. Maybe you are right; I’ll increase the occupancy and see what happens…

MisterAnderson42 · March 20, 2008, 1:01am

News flash! Texture memory is device memory and is subject to the same 70 GiB/s throughput limit.

The only way you are going to know if you are reaching the limit is by counting the number of memory accesses and calculating an effective GiB/s.
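The calculation described above amounts to dividing the bytes the kernel requests by its run time. A minimal sketch, with invented example numbers (image size, bytes per pixel, reads per pixel, and kernel time are all hypothetical):

```python
GIB = 1024 ** 3


def effective_gib_per_s(bytes_read, bytes_written, seconds):
    """Requested memory traffic divided by kernel time, in GiB/s."""
    return (bytes_read + bytes_written) / seconds / GIB


# Hypothetical case: two 1024x1024 images of 4-byte pixels,
# each pixel read twice, kernel time of 2 ms, negligible writes.
bytes_read = 2 * 1024 * 1024 * 4 * 2
EXAMPLE = effective_gib_per_s(bytes_read, 0, 0.002)
print(round(EXAMPLE, 2))
```

If the result comes out close to the 70 GiB/s figure, the kernel is bandwidth bound. If it is far below, as in this made-up case, memory throughput is unlikely to be the limit. Texture cache hits mean the real device traffic can be lower than the requested bytes counted here.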
