I have been running SGEMV from CUBLAS, and the best performance I get, for n = 4096 and n = 8192, is only 13 GFlops, which is quite low compared to the board's peak of over 300 GFlops. The corresponding bandwidth is about 25 GB/s, 3.6 times lower than the board's peak of 90 GB/s. In the profiler, the occupancy is only 0.333, regardless of n. Does anyone know why the performance and occupancy are this low? In another thread I read that the new version of the toolkit will come with an improved SGEMM. Will it come with an improved SGEMV as well?
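As a rough sketch of how the flop rate and the bandwidth figure above relate: this assumes a square n x n single-precision matrix, 2n² flops per SGEMV, and memory traffic of one read of the matrix and of x plus one write of y (beta = 0). The function names are made up for illustration and are not part of CUBLAS.

```python
# Back-of-the-envelope model for SGEMV (y = A*x) on an n x n float matrix.
# Assumptions (not from CUBLAS itself): 2*n*n flops, A and x read once,
# y written once, 4 bytes per float.

def sgemv_flops(n):
    """Flops for one n x n SGEMV: one multiply and one add per matrix element."""
    return 2 * n * n

def sgemv_bytes(n, bytes_per_float=4):
    """Bytes moved: n*n matrix elements, n elements of x, n elements of y."""
    return bytes_per_float * (n * n + 2 * n)

def effective_bandwidth_gbs(gflops, n):
    """Effective bandwidth in GB/s implied by a measured GFlops rate."""
    seconds = sgemv_flops(n) / (gflops * 1e9)
    return sgemv_bytes(n) / seconds / 1e9

def bandwidth_bound_gflops(peak_gbs, n):
    """Upper bound on GFlops if the kernel ran at the given memory bandwidth."""
    seconds = sgemv_bytes(n) / (peak_gbs * 1e9)
    return sgemv_flops(n) / seconds / 1e9

for n in (4096, 8192):
    print(n,
          round(effective_bandwidth_gbs(13.0, n), 1),   # GB/s at 13 GFlops
          round(bandwidth_bound_gflops(90.0, n), 1))    # GFlops at 90 GB/s
```

Under these assumptions, 13 GFlops corresponds to roughly 26 GB/s, in line with the figure quoted above. The same model suggests that even at the full 90 GB/s, SGEMV would top out near 45 GFlops, since it performs only about one flop per two bytes of memory traffic. The bandwidth peak is therefore a more meaningful target here than the 300 GFlops compute peak.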