A maximum performance of 823 GFlops measured on a GTX 295 with mad+mul

Hi all, I’m interested in finding the maximum reachable performance of GT200 GPUs, so I wrote a small program to measure it (the source code is listed below) and used ‘decuda’ to guard against the compiler’s tricks.

The GPU used in the test is a GTX 295, and I obtain a maximum of 823 GFlops, while the well-known theoretical peak is 1.242*240*3 = 894 GFlops. So the test reaches about 92% of the peak.

I would like to know whether this program can be revised to get higher performance. Any suggestion is appreciated.

Update: a better result is reported in post #10 (843.2 GFlops on the GTX 295, 94.3% of the peak).

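#include "cutil_inline.h"

#define M	240
#define N	64
#define K	2000000
#define II	10
#define NF	3
#define TYPE float

TYPE res[M*N], *dres;

__global__ void test1(TYPE *res) {
	int inx=blockDim.x*blockIdx.x+threadIdx.x;
	TYPE d=inx*0.1f, s=0.f;
#pragma unroll 1000
	for(int i=0; i<K; i++) {
		/* the loop body (NF flops of mad+mul per iteration) and the rest of
		   the kernel are missing from this copy of the post */
	}
}

int main() {
	unsigned int t1;
	/* timer setup is missing from this copy; reconstructed from the calls below */
	cutCreateTimer(&t1);
	cudaMalloc((void**)&dres, M*N*sizeof(TYPE));
	cutStartTimer(t1);
	for(int ii=0; ii<II; ii++) test1<<<M,N>>>(dres);
	/* reconstructed: wait for the kernels to finish before stopping the timer */
	cudaThreadSynchronize();
	cutStopTimer(t1);
	float dt=cutGetTimerValue(t1)/1000.0f;	/* ms -> s */
	cudaMemcpy(res, dres, M*N*sizeof(TYPE), cudaMemcpyDeviceToHost);
	printf("dt=%f, %fGflops\n", dt, (1E-9*M*N*K*II*NF)/dt);
	return 0;
}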


I would question the validity of your timing. Those cutil library timers are only host-side timers that use the standard OS clock, and I doubt their precision is sufficient to say whether the overall efficiency is 92% or 98%. You should probably use device-side event timers for this sort of thing.

AFAIK G80 could never reach 98% of the peak flops computed as 3 ops per clock (MAD + MUL); dual issue wasn’t that effective. It could easily reach 95-98% of peak MAD flops using a series of MADs, which is 2/3 of the advertised peak.

With your code I’m topping at about 85% of 3-op peak on my 8800 GTS. Setting unroll to 250 helped to squeeze some performance out, 1000 is not necessarily the best number there. How much one should unroll depends on the device’s instruction cache size. When you unroll too much you can actually lose performance.

Thanks for your comments. I measured the precision of the timing function cutGetTimerValue() to be about 0.2us on my machine, and the total execution time of the test program above is on the order of a second. So I don’t think this is an issue.

Yes, the ‘98%’ I referred to is relative to a peak of “Clock freq x Num of cores x 2” on G80.

I varied the unroll factor from 100 to 2000 in my tests, and 1000 seems to be a good value for my machine. Maybe G80 is different.

I want to measure the GFlops of my graphics card, so I used almost the same program as yours. But I get very strange results: dt=12.607983, 73.096547Gflops.

My graphics card is an NVIDIA Quadro FX 1700… Can you please explain what is wrong?


Your card has about 10 times lower peak global memory bandwidth and 7 times fewer processors than the GT200-class GPU the original poster is benchmarking. Because it is based on the previous generation of GPU, it also has a smaller register file and scheduler, and cannot schedule as many blocks and threads simultaneously. The performance you are seeing is probably about right for the class of card you are using.

Hi all,
I used almost the same code on a GTX 275. The following are the results from my desktop:


However, I only reach about 60% of the peak.

Any tricks?


This result is strange. What is your CUDA version? My result was obtained under CUDA 2.3. You can also modify the value of M to see whether the GFlops figure goes up.

Recently I found another mad+mul instruction combination that reaches 843.2 GFlops on my GTX 295 (94.3% of the peak).

Here it is:

mad.rn.f32 $r4, $r2, $r2, $r4
mul.rn.f32 $r2, $r3, 0x3f7d70a4
mad.rn.f32 $r4, $r2, $r2, $r4
mul.rn.f32 $r3, $r2, 0x3f7d70a4

The performance doesn’t change if the two float constants are different.

That was it. When I change to M=256, N=1024, the performance is 832.5 GFlops.

Does this mean that more data is needed to keep the pipeline busy?


For anyone who loves to make things run well: I got an 8- to 16-fold speedup parsing text data files with CUDA compared to using fscanf, and I think someone more familiar with C than I am could double that.
Useful if you have an application that needs to read in 100k+ real numbers from a text file.


PS My post title says AsciiGrid but it is really for any file containing numbers in text form that you want to parse into a numeric type.