Low performance on SGEMV

Hi all,

I was running SGEMV from CUBLAS and the peak performance I got, for n = 4096 and n = 8192, is only 13 GFlops, quite low compared to the board's peak of over 300 GFlops. The corresponding bandwidth is about 25 GB/s, 3.6 times lower than the board's peak of 90 GB/s. Using the profiler, I can see the occupancy is only 0.333, regardless of n. Does anyone know why? In another thread I read that the new version of the toolkit will come with an improved SGEMM. Will it come with an improved SGEMV as well?
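For reference, the kind of measurement I mean looks roughly like this (a minimal sketch with the legacy CUBLAS API, not my exact benchmark; the matrix is just zero-filled and error checking is omitted):

/* Rough SGEMV timing sketch: ~2*n*n flops and ~(n*n + 2*n)*4 bytes per call. */
#include <stdio.h>
#include <sys/time.h>
#include <cuda_runtime.h>
#include <cublas.h>

int main(void)
{
    const int n = 4096;
    const int iters = 100;
    float *d_A, *d_x, *d_y;
    struct timeval tv1, tv2;
    double secs, flops, bytes;
    int i;

    cublasInit();
    cublasAlloc(n * n, sizeof(float), (void **)&d_A);
    cublasAlloc(n, sizeof(float), (void **)&d_x);
    cublasAlloc(n, sizeof(float), (void **)&d_y);
    cudaMemset(d_A, 0, (size_t)n * n * sizeof(float));   /* contents don't matter for timing */
    cudaMemset(d_x, 0, n * sizeof(float));

    cublasSgemv('n', n, n, 1.0f, d_A, n, d_x, 1, 0.0f, d_y, 1);   /* warm-up call */
    cudaThreadSynchronize();

    gettimeofday(&tv1, NULL);
    for (i = 0; i < iters; i++)
        cublasSgemv('n', n, n, 1.0f, d_A, n, d_x, 1, 0.0f, d_y, 1);
    cudaThreadSynchronize();   /* CUBLAS calls are asynchronous */
    gettimeofday(&tv2, NULL);

    secs  = (tv2.tv_sec - tv1.tv_sec) + (tv2.tv_usec - tv1.tv_usec) * 1e-6;
    flops = 2.0 * n * n * iters;
    bytes = ((double)n * n + 2.0 * n) * sizeof(float) * iters;
    printf("SGEMV: %.2f GFlops, %.2f GB/s effective\n",
           flops / secs / 1e9, bytes / secs / 1e9);

    cublasFree(d_A); cublasFree(d_x); cublasFree(d_y);
    cublasShutdown();
    return 0;
}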

Thanks,
Serban


Hi gserban, I have the same problem. I'm testing Sgemm with a very simple test:
I just added time measurement to the simpleCUBLAS project.

#define N (275*8)
#define M (100)

gettimeofday(&tv1, NULL);
for (q = 0; q < M; q++) {
    cublasSgemm('n', 'n', N, N, N, alpha, d_A, N, d_B, N, beta, d_C, N);
}
gettimeofday(&tv2, NULL);
MILLISEC_DIFF(tv2, tv1, delta);
delta = delta / 1000;
printf("geforce execution time = %e secs.\n", delta);
sz = (3 + 2*N) * n2 * M;
printf("geforce performance =   %e Flops.\n", sz / delta);

The result is:
geforce execution time = 3.355400e+01 secs.
geforce performance = 2.229902e+07 Flops.

What's the problem here? Maybe the sz computation is wrong?
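
For reference, here is the same loop with the operation count accumulated in double precision and an explicit synchronization before stopping the clock, since the CUBLAS calls return before the GEMMs actually finish. This is only a sketch: it assumes n2 is N*N as in simpleCUBLAS, that d_A, d_B, d_C, alpha and beta are set up as in that sample, and that <cuda_runtime.h> is included for cudaThreadSynchronize():

#define N (275*8)
#define M (100)

struct timeval tv1, tv2;
double secs, flops;
int q;

gettimeofday(&tv1, NULL);
for (q = 0; q < M; q++) {
    cublasSgemm('n', 'n', N, N, N, alpha, d_A, N, d_B, N, beta, d_C, N);
}
cudaThreadSynchronize();   /* wait for the queued GEMMs to finish */
gettimeofday(&tv2, NULL);

secs  = (tv2.tv_sec - tv1.tv_sec) + (tv2.tv_usec - tv1.tv_usec) * 1e-6;
flops = ((double)N * N) * (2.0 * N + 3.0) * M;   /* ~2*N^3 + 3*N^2 per call, kept in double to avoid int overflow */
printf("geforce execution time = %e secs.\n", secs);
printf("geforce performance    = %e GFlops.\n", flops / secs / 1e9);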

New results under Windows:

geforce time 33563
geforce performance 1.835684e-002 gflops
Test PASSED

code:

#define N (275*8)
#define M (100)

double sz = N*N*(N*2+3)*M;

T1 = GetTickCount();
/* Performs operation using cublas */
for (j = 0; j < M; j++) {
    cublasSgemm('n', 'n', N, N, N, alpha, d_A, N, d_B, N, beta, d_C, N);
}
T2 = GetTickCount();
printf("geforce time %d\n", T2 - T1);
printf("geforce performance %e gflops\n", (sz / ((double)(T2 - T1) / 1000)) / 1E9);

Any comments? Were the previous absurd results a Linux problem?
When will the NVIDIA developers pay attention to Linux?