A maximum performance of 823 GFlops measured for GTX 295 with mad+muls

cuda2010 · February 17, 2010, 12:45pm

Hi all, I’m interested in finding the maximum reachable performance of GT200 GPUs, so I wrote a small program to measure it (the source code is listed below). I used ‘decuda’ to inspect the generated code and rule out compiler tricks.

The GPU used in the test is a GTX 295, and a maximum performance of 823 GFlops was obtained, while the well-known theoretical peak performance is 1.242 × 240 × 3 = 894 GFlops. So the absolute efficiency of the test is about 92%.
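For reference, the arithmetic behind these figures can be written out as follows. It assumes the 1.242 GHz shader clock and the 240 stream processors of one GPU on the GTX 295, and counts 3 flops per clock per processor (a MAD as 2 plus a MUL as 1):

```latex
\begin{align*}
P_{\text{peak}} &= 1.242\ \text{GHz} \times 240 \times 3 = 894.24\ \text{GFlops} \\
\eta &= \frac{823}{894.24} \approx 0.920
\end{align*}
```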

I want to know whether this program can be revised to get higher performance. Any suggestions are appreciated.

Update: a better result is reported in post #10 (843.2 GFlops on GTX 295, 94.3% of peak).

#include <stdio.h>
#include "cutil_inline.h"

#define M	240		// number of thread blocks
#define N	64		// threads per block
#define K	2000000		// loop iterations per thread
#define II	10		// number of timed kernel launches
#define NF	3		// flops per iteration: mad (2) + mul (1)
#define TYPE float

TYPE res[M*N], *dres;

__global__ void test1(TYPE *res) {
	int inx=blockDim.x*blockIdx.x+threadIdx.x;
	TYPE d=inx*0.1f, s=0.f;
#pragma unroll 1000
	for(int i=0; i<K; i++) {
		s+=d*d;			// mad
		d*=(TYPE)0.99f;		// mul
	}
	res[inx]=s+d;			// store the result so the loop is not optimized away
}

int main() {
	unsigned int t1;
	cutCreateTimer(&t1);
	cudaMalloc((void**)&dres, M*N*sizeof(TYPE));
	test1<<<M,N>>>(dres);		// warm-up launch, not timed
	cudaThreadSynchronize();
	cutStartTimer(t1);
	for(int ii=0; ii<II; ii++) test1<<<M,N>>>(dres);
	cudaThreadSynchronize();	// wait for all launches before stopping the timer
	cutStopTimer(t1);
	float dt=cutGetTimerValue(t1)/1000.0f;	// milliseconds -> seconds
	cudaMemcpy(res, dres, M*N*sizeof(TYPE), cudaMemcpyDeviceToHost);
	printf("dt=%f, %fGflops\n", dt, (1E-9*M*N*K*II*NF)/dt);
	return 0;
}

avidday · February 17, 2010, 1:06pm

I would question the validity of your timing: those cutil library timers are only host-side timers that use the standard OS clock. I doubt the timer precision is sufficient to say whether the overall efficiency is 92% or 98%. You should probably use device-side event timers for this sort of thing.
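The device-side event timing suggested here could look roughly like the sketch below. It is an illustration, not code from the thread: `busy_kernel` and `gflops` are hypothetical names, the kernel is a stand-in with the same mad+mul loop body, and the launch parameters simply mirror the original program. The events are recorded in the GPU's command stream, so the measured interval does not depend on the host OS clock.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Stand-in kernel (hypothetical name) with the same mad+mul loop body.
__global__ void busy_kernel(float *out, int iters) {
    int idx = blockDim.x * blockIdx.x + threadIdx.x;
    float d = idx * 0.1f, s = 0.0f;
    for (int i = 0; i < iters; i++) {
        s += d * d;      // mad: 2 flops
        d *= 0.99f;      // mul: 1 flop
    }
    out[idx] = s + d;    // keep the result live
}

// Convert a flop count and an elapsed time in milliseconds to GFlops.
double gflops(double flops, float elapsed_ms) {
    return flops / (elapsed_ms * 1.0e-3) * 1.0e-9;
}

int main() {
    const int blocks = 240, threads = 64, iters = 2000000, launches = 10;
    float *d_out = NULL;
    if (cudaMalloc((void **)&d_out, blocks * threads * sizeof(float)) != cudaSuccess) {
        printf("cudaMalloc failed\n");
        return 1;
    }

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    busy_kernel<<<blocks, threads>>>(d_out, iters);   // warm-up, not timed

    cudaEventRecord(start, 0);                        // timestamp taken on the GPU
    for (int i = 0; i < launches; i++)
        busy_kernel<<<blocks, threads>>>(d_out, iters);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);                       // wait until 'stop' has been reached

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);           // elapsed time in milliseconds

    double flops = 3.0 * blocks * threads * (double)iters * launches;
    printf("dt=%f s, %f GFlops\n", ms * 1.0e-3, gflops(flops, ms));

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_out);
    return 0;
}
```

Comparing this figure with the host-timer figure on the same run is one way to see whether the timer choice matters for a run of this length.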

_Big_Mac · February 17, 2010, 1:30pm

AFAIK G80 could never reach 98% of the peak flops computed as 3 ops per clock (MAD + MUL), dual issue wasn’t this effective. They could easily reach 95-98% of peak MAD flops using a series of MADs, which is 2/3 of the advertised peak.

With your code I’m topping out at about 85% of the 3-op peak on my 8800 GTS. Setting the unroll factor to 250 helped squeeze out some more performance; 1000 is not necessarily the best number. How much one should unroll depends on the device’s instruction cache size. When you unroll too much you can actually lose performance.
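Both points in this post (a MAD-only chain as the baseline, and the unroll factor as a tuning knob) can be tried with a variant of the kernel like the one below. This is a minimal sketch, not code from the thread: `mad_chain`, `time_mad_chain`, `mad_chain_flops` and the constants are made up for illustration, and the compiler is free to emit something other than one MAD per loop iteration, so the generated code should still be inspected (for example with decuda, as in the first post).

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// MAD-only dependent chain: each iteration is one multiply-add (2 flops).
// UNROLL is a compile-time unroll factor (hypothetical name for this sketch).
template <int UNROLL>
__global__ void mad_chain(float *out, int iters) {
    int idx = blockDim.x * blockIdx.x + threadIdx.x;
    float a = 0.999f, b = idx * 1.0e-6f, s = idx * 0.1f;
#pragma unroll (UNROLL)
    for (int i = 0; i < iters; i++)
        s = s * a + b;               // mad
    out[idx] = s;                    // keep the result live
}

// Flops executed by one launch of mad_chain.
double mad_chain_flops(int blocks, int threads, int iters) {
    return 2.0 * blocks * threads * (double)iters;
}

// Time one launch with CUDA events and return the elapsed milliseconds.
template <int UNROLL>
float time_mad_chain(float *d_out, int blocks, int threads, int iters) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);
    mad_chain<UNROLL><<<blocks, threads>>>(d_out, iters);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main() {
    const int blocks = 240, threads = 64, iters = 2000000;
    float *d_out = NULL;
    if (cudaMalloc((void **)&d_out, blocks * threads * sizeof(float)) != cudaSuccess)
        return 1;
    double flops = mad_chain_flops(blocks, threads, iters);

    // Warm-up launches, not reported.
    time_mad_chain<250>(d_out, blocks, threads, iters);
    time_mad_chain<1000>(d_out, blocks, threads, iters);

    float ms250 = time_mad_chain<250>(d_out, blocks, threads, iters);
    float ms1000 = time_mad_chain<1000>(d_out, blocks, threads, iters);
    printf("unroll 250:  %f GFlops\n", flops / (ms250 * 1.0e-3) * 1.0e-9);
    printf("unroll 1000: %f GFlops\n", flops / (ms1000 * 1.0e-3) * 1.0e-9);

    cudaFree(d_out);
    return 0;
}
```

The numbers from this kernel should be compared against the 2-flops-per-clock (MAD-only) peak, not the 3-flops-per-clock figure.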

cuda2010 · February 17, 2010, 1:33pm

Thanks for your comments. I measured the precision of the timing function cutGetTimerValue() to be about 0.2 us on my machine, and the total execution time of my test program is on the order of a second. So I don’t think this is an issue.

cuda2010 · February 17, 2010, 1:45pm

Yes, the ‘98%’ I referred to is relative to the peak value of “Clock freq x Num of cores x 2” for G80.

I varied the unroll factor from 100 to 2000 in my tests, and 1000 seems to be a good value for my machine; maybe G80 is different.

spulvera · February 17, 2010, 2:10pm

I want to measure the GFlops of my graphics card, so I used almost the same program as yours. But I get very strange results: dt=12.607983, 73.096547Gflops.

My graphics card is an NVIDIA Quadro FX 1700… Can you please explain what is going wrong?

Thks

avidday · February 17, 2010, 2:14pm

Your card has about 10 times lower peak global memory bandwidth and 7 times fewer processors than the GT200 class GPU the original poster is benchmarking. Also, because it is based on the previous generation of GPU, it has a smaller register file and scheduler, and cannot schedule as many blocks and threads simultaneously. The performance you are seeing is probably about right for the class of card you are using.
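To put a measured figure in context on a different card, the peak can be estimated from what the runtime reports for that device. The sketch below is an illustration under stated assumptions, not code from the thread: `peak_gflops` is a hypothetical helper, 8 cores per multiprocessor only holds for the G80/G9x/GT200 generations discussed here, and 3 flops per clock is the MAD+MUL counting used in this thread.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Peak GFlops = multiprocessors x cores per multiprocessor
//             x shader clock (GHz) x flops per clock per core.
double peak_gflops(int sm_count, int cores_per_sm, int clock_khz, int flops_per_clock) {
    return (double)sm_count * cores_per_sm * (clock_khz * 1.0e-6) * flops_per_clock;
}

int main() {
    int dev = 0, sm_count = 0, clock_khz = 0;
    if (cudaDeviceGetAttribute(&sm_count, cudaDevAttrMultiProcessorCount, dev) != cudaSuccess ||
        cudaDeviceGetAttribute(&clock_khz, cudaDevAttrClockRate, dev) != cudaSuccess) {
        printf("could not query device %d\n", dev);
        return 1;
    }

    const int cores_per_sm = 8;      // ASSUMPTION: G80/G9x/GT200 only
    const int flops_per_clock = 3;   // MAD (2) + MUL (1), as counted in this thread

    printf("%d multiprocessors at %.3f GHz: estimated peak %.2f GFlops\n",
           sm_count, clock_khz * 1.0e-6,
           peak_gflops(sm_count, cores_per_sm, clock_khz, flops_per_clock));
    return 0;
}
```

With the original poster's figures (30 multiprocessors, 1.242 GHz) the helper gives 894.24 GFlops; a card with far fewer multiprocessors gets a proportionally lower ceiling, so a measured value should be judged against that number.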

fishbupt · March 5, 2010, 5:55am

Hi all,
I ran almost the same code on a GTX 275 on my desktop.

However, I can only get about 60% of the peak performance.

Any tricks?

thanks

cuda2010 · March 5, 2010, 1:47pm

This result is strange. What is your CUDA version? My result was obtained with CUDA 2.3. You could also change the value of M to see whether the GFlops number goes up.

cuda2010 · March 5, 2010, 1:53pm

Recently I found another mad+mul instruction combination that reaches 843.2 GFlops on my GTX 295 (94.3% of peak).

Here it is:

mad.rn.f32 $r4, $r2, $r2, $r4
mul.rn.f32 $r2, $r3, 0x3f7d70a4
mad.rn.f32 $r4, $r2, $r2, $r4
mul.rn.f32 $r3, $r2, 0x3f7d70a4

The performance doesn’t change if the two float constants are different.
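At source level, a loop body with the same data flow as the four instructions above might look like the sketch below. This is an assumption, not the poster's source: the post shows only the decuda listing, the names `mad_mul_pair`, `float_bits` and `flops_per_thread` are hypothetical, and the compiler is not guaranteed to emit exactly this sequence. The immediate 0x3f7d70a4 is the bit pattern of 0.99f, which the small host helper confirms.

```cuda
#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>

// Two mad+mul pairs per iteration; the muls alternate between two registers.
__global__ void mad_mul_pair(float *out, int iters) {
    int idx = blockDim.x * blockIdx.x + threadIdx.x;
    float a = idx * 0.1f;     // plays the role of $r2
    float b = a * 0.5f;       // plays the role of $r3
    float s = 0.0f;           // plays the role of $r4
    const float c = 0.99f;    // 0x3f7d70a4
    for (int i = 0; i < iters; i++) {
        s += a * a;           // mad.rn.f32 $r4, $r2, $r2, $r4
        a = b * c;            // mul.rn.f32 $r2, $r3, 0x3f7d70a4
        s += a * a;           // mad.rn.f32 $r4, $r2, $r2, $r4
        b = a * c;            // mul.rn.f32 $r3, $r2, 0x3f7d70a4
    }
    out[idx] = s + a + b;     // keep all three values live
}

// Raw bit pattern of a float (host helper).
unsigned int float_bits(float x) {
    unsigned int u;
    memcpy(&u, &x, sizeof(u));
    return u;
}

// Flops per thread: 2 mads (2 flops each) + 2 muls (1 flop each) per iteration.
double flops_per_thread(int iters) {
    return 6.0 * iters;
}

int main() {
    const int blocks = 240, threads = 64, iters = 1000000;
    printf("0.99f = 0x%08x\n", float_bits(0.99f));

    float *d_out = NULL;
    if (cudaMalloc((void **)&d_out, blocks * threads * sizeof(float)) != cudaSuccess)
        return 1;
    mad_mul_pair<<<blocks, threads>>>(d_out, iters);
    cudaError_t err = cudaDeviceSynchronize();
    printf("kernel status: %s, %.3e flops per thread\n",
           cudaGetErrorString(err), flops_per_thread(iters));
    cudaFree(d_out);
    return err == cudaSuccess ? 0 : 1;
}
```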

fishbupt · March 8, 2010, 5:20am

That’s it. When I change to M=256, N=1024, the performance is 832.5 GFlops.

Does that mean more data is needed to keep the pipeline busy?

thx!
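When changing M and N, it may be worth confirming that the kernel actually launched, since the program in the first post does not check for errors. The sketch below is an illustration and not code from the thread; `noop` and `block_size_ok` are hypothetical names. It compares the requested block size with the limit the device reports and then checks the launch status. Devices of compute capability 1.x accept at most 512 threads per block, while later architectures accept 1024.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Trivial kernel used only to exercise the launch configuration.
__global__ void noop(float *out) {
    out[blockDim.x * blockIdx.x + threadIdx.x] = 0.0f;
}

// True if the requested block size fits the device's per-block thread limit.
bool block_size_ok(int threads_per_block, int max_threads_per_block) {
    return threads_per_block > 0 && threads_per_block <= max_threads_per_block;
}

int main() {
    const int blocks = 256, threads = 1024;   // the configuration from the post above
    int dev = 0, max_threads = 0;
    if (cudaDeviceGetAttribute(&max_threads, cudaDevAttrMaxThreadsPerBlock, dev) != cudaSuccess) {
        printf("could not query device %d\n", dev);
        return 1;
    }
    printf("requested %d threads per block, device limit %d: %s\n",
           threads, max_threads,
           block_size_ok(threads, max_threads) ? "ok" : "too large");

    float *d_out = NULL;
    if (cudaMalloc((void **)&d_out, blocks * threads * sizeof(float)) != cudaSuccess)
        return 1;

    noop<<<blocks, threads>>>(d_out);
    cudaError_t launch_err = cudaGetLastError();      // reports an invalid launch configuration
    cudaError_t sync_err = cudaDeviceSynchronize();   // reports errors raised during execution
    printf("launch: %s, execution: %s\n",
           cudaGetErrorString(launch_err), cudaGetErrorString(sync_err));

    cudaFree(d_out);
    return (launch_err == cudaSuccess && sync_err == cudaSuccess) ? 0 : 1;
}
```

If the launch is valid, a larger number of resident threads per multiprocessor generally gives the scheduler more independent work to hide arithmetic latency, which would be consistent with the improvement reported above.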

kbam · March 8, 2010, 11:57pm

For anyone who loves making things run fast: I got an 8-16x speedup parsing text data files with CUDA compared to using fscanf, and I think someone more familiar with C than I am could potentially double that.
Useful if you have an application that needs to read in 100k+ real numbers from a text file.

http://forums.nvidia.com/index.php?showtopic=105782&hl=

PS My post title says AsciiGrid but it is really for any file containing numbers in text form that you want to parse into a numeric type.
