Low performance on SGEMV

Hi all,

I was running SGEMV from CUBLAS, and the peak performance I got, for n = 4096 and n = 8192, is only 13 GFlops, which is quite low compared to the board's peak of over 300 GFlops. The corresponding bandwidth is about 25 GB/s, 3.6 times lower than the board's peak of 90 GB/s. Using the profiler, I could see that the occupancy is only 0.333, regardless of n. Does anyone know why? In another thread I read that the new version of the toolkit will come with an improved SGEMM. Will it come with an improved SGEMV as well?

Thanks,
Serban

Hi gserban, I see the same problem. I'm testing Sgemm with a very simple test:
I just added time measurement to the simpleCUBLAS project.

#define N (275*8)
#define M (100)

gettimeofday(&tv1, NULL);
for (q = 0; q < M; q++) {
    cublasSgemm('n', 'n', N, N, N, alpha, d_A, N, d_B, N, beta, d_C, N);
}
gettimeofday(&tv2, NULL);
MILLISEC_DIFF(tv2, tv1, delta);
delta = delta / 1000;
printf("geforce execu1tion time = %e secs.\n", delta);
sz = (3 + 2*N) * n2 * M;
printf("geforce performance =   %e Flops.\n", sz / delta);

The result is:
geforce execu1tion time = 3.355400e+01 secs.
geforce performance = 2.229902e+07 Flops.

What's the problem? Maybe a wrong sz parameter?

New results under windows:

geforce time 33563
geforce performance 1.835684e-002 gflops
Test PASSED

code:

#define N (275*8)
#define M (100)

double sz = N*N*(N*2+3)*M;

T1 = GetTickCount();
/* Performs operation using cublas */
for (j = 0; j < M; j++) {
    cublasSgemm('n', 'n', N, N, N, alpha, d_A, N, d_B, N, beta, d_C, N);
}
T2 = GetTickCount();
printf("geforce time %d\n", T2 - T1);
printf("geforce performance %e gflops\n", (sz / ((double)(T2 - T1) / 1000)) / 1E9);

Any comments? Were the previous absurd results a Linux problem?
When will NVIDIA developers pay attention to Linux?