Hi all, I’m interested in finding out the maximum reachable performance of GT200 GPUs, so I wrote a small program to measure it (the source code is listed below), and the tool ‘decuda’ was used to inspect the generated code and rule out compiler tricks.
The GPU used in the test is a GTX 295, and a maximum performance of 823 GFlops is obtained, while the well-known theoretical peak performance is 1.242 GHz × 240 SPs × 3 ≈ 894 GFlops. So the efficiency of the test is about 92%.
I want to know whether it is possible to revise this program to get higher performance. Any suggestion is appreciated.
Update: a better result was obtained in post #10 (843.2 GFlops on GTX 295, 94.3% of the peak).
#include "cutil_inline.h"
#define M 240
#define N 64
#define K 2000000
#define II 10
#define NF 3
#define TYPE float
TYPE res[M*N], *dres;
__global__ void test1(TYPE *res) {
int inx=blockDim.x*blockIdx.x+threadIdx.x;
TYPE d=inx*0.1f, s=0.f;
#pragma unroll 1000
for(int i=0; i<K; i++) {
s+=d*d;
d*=(TYPE)0.99f;
}
res[inx]=s+d;
}
int main() {
unsigned int t1;
cutCreateTimer(&t1);
cudaMalloc((void**)&dres, M*N*sizeof(TYPE));
test1<<<M,N>>>(dres);
cudaThreadSynchronize();
cutStartTimer(t1);
for(int ii=0; ii<II; ii++) test1<<<M,N>>>(dres);
cudaThreadSynchronize();
cutStopTimer(t1);
float dt=cutGetTimerValue(t1)/1000.0f;
cudaMemcpy(res, dres, M*N*sizeof(float), cudaMemcpyDeviceToHost);
printf("dt=%f, %fGflops\n", dt, (1E-9*M*N*K*II*NF)/dt);
}
I would question the validity of your timing - those cutil library timers are only host-side timers that use the standard OS clock. I doubt the timer precision is sufficient to say whether the overall efficiency is 92% or 98%. You should probably use device-side event timers (cudaEvent) for this sort of thing.
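For reference, a minimal sketch of timing the same launches with CUDA events instead of the cutil timer; it reuses test1, dres, M, N, II and the GFlops formula from the listing above, and is only meant as an illustration:

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

test1<<<M,N>>>(dres);                 // warm-up launch, as in the original
cudaThreadSynchronize();

cudaEventRecord(start, 0);
for (int ii = 0; ii < II; ii++) test1<<<M,N>>>(dres);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);           // wait until the last launch has finished

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);   // elapsed time in milliseconds
float dt = ms/1000.0f;
printf("dt=%f, %fGflops\n", dt, (1E-9*M*N*K*II*NF)/dt);

cudaEventDestroy(start);
cudaEventDestroy(stop);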
AFAIK G80 could never reach 98% of the peak flops computed as 3 ops per clock (MAD + MUL); dual issue wasn’t that effective. It could easily reach 95-98% of the peak MAD flops using a series of MADs, which is 2/3 of the advertised peak.
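To put numbers on the two peaks, using the GTX 295 figures from the first post (240 SPs at 1.242 GHz):

MAD+MUL peak: 1.242 GHz × 240 SPs × 3 flops/clock ≈ 894 GFlops (the advertised peak)
MAD-only peak: 1.242 GHz × 240 SPs × 2 flops/clock ≈ 596 GFlops (2/3 of the above)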
With your code I’m topping out at about 85% of the 3-op peak on my 8800 GTS. Setting the unroll factor to 250 helped squeeze some extra performance out; 1000 is not necessarily the best number there. How much one should unroll depends on the device’s instruction cache size, and when you unroll too much you can actually lose performance.
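For anyone who wants to try this, the only change needed is the unroll pragma on the inner loop of the kernel from the listing above; 250 is just the value that worked for me, and the best factor is device-dependent, so it is worth sweeping:

#pragma unroll 250            // was 1000; try several values for your device
    for (int i = 0; i < K; i++) {
        s += d*d;             // MAD
        d *= (TYPE)0.99f;     // MUL
    }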
Thanks for your comments. The precision of the timing function cutGetTimerValue() is measured to be about 0.2 us on my machine, and the total execution time of the test program above is on the order of seconds, so I don’t think this is an issue.
Your card has about 10 times lower peak global memory bandwidth and 7 times fewer processors than the GT200-class GPU the original poster is benchmarking. Because it is based on the previous GPU generation, it also has a smaller register file and scheduler, and cannot schedule as many blocks and threads simultaneously. The performance you are seeing is probably about right for the class of card you are using.
This result is strange. What is your CUDA version? My result was obtained under CUDA 2.3. You can also modify the value of M to see whether the GFlops number goes up.
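If it helps, here is a hypothetical host-side sweep of the block count (the role M plays above); it is not part of my original program, just a sketch reusing the timer t1 and the constants from the listing:

// Sweep the number of blocks and print GFlops for each; dres (M*N elements)
// is large enough as long as blocks <= M.
for (int blocks = 30; blocks <= M; blocks += 30) {
    test1<<<blocks,N>>>(dres);                    // warm-up
    cudaThreadSynchronize();
    cutResetTimer(t1);
    cutStartTimer(t1);
    for (int ii = 0; ii < II; ii++) test1<<<blocks,N>>>(dres);
    cudaThreadSynchronize();
    cutStopTimer(t1);
    float dt = cutGetTimerValue(t1)/1000.0f;
    printf("blocks=%d: %f GFlops\n", blocks, (1E-9*blocks*N*K*II*NF)/dt);
}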
For anyone who likes to make things run fast: I got an 8-16x speedup on parsing text data files by using CUDA, compared to using fscanf. I think there is potential to double that for someone who is more familiar with C than I am.
Useful if you have an application that needs to read in 100k+ real numbers from a text file.