CUBLAS terrible timings: sgemm timing is very bad

My problem is not really a bug; it is more that something does not look right.
I took the CUBLAS matrix multiplication sgemm and wrapped it in some code to handle my data.

Something like this:
call function mamul{ // I pass the values in two arrays allocated with cudaMalloc and store the result in a cudaMalloc array
-start timer
-call sgemm

My vector1 is something like 2262144 floats and vector2 is something like 2621441 floats (for the first iteration). The measured time is 56.9 ms, which is pretty high compared with Matlab at 6 ms. So where is the error or the time leak? Should I convert the data to matrix format, or do something else?

I am not really sure how you compute a matrix-matrix multiplication with SGEMM when you only have vectors.

Can I see your Matlab code?

Okay, I had not read your code properly; the allocation is done correctly, I think. But how big are your matrices? A = (1024, 512) and B = (512, 512)?

Mister Anderson is right, you should run sgemm many times. You can also leave out the first call and consider only the time taken by the remaining calls.

greetz, cem

Are you timing the very first call to sgemm in your entire program? Then you are also timing initialization overhead. Proper benchmarking should look like this:

prepare data

call sgemm once

start timer

call sgemm hundreds of times (a total run time of 5 s or more is best)

stop timer
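
The steps above can be sketched in C++ roughly as follows. This is only an illustration, not the poster's code: `average_ms` and `naive_sgemm` are hypothetical names, and `naive_sgemm` is a plain CPU stand-in so that the sketch runs without a GPU. With CUBLAS you would call sgemm inside the lambda instead, and because GPU work is launched asynchronously you would also need to wait for the device (for example with `cudaDeviceSynchronize()`) before reading the stop time.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical helper: one untimed warm-up call, then `iterations` timed
// calls. Returns the average wall-clock time per call in milliseconds.
template <typename F>
double average_ms(F&& work, int iterations) {
    assert(iterations > 0);
    work();  // warm-up: absorbs one-time initialization overhead
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) {
        work();
    }
    // Assumption for GPU use: synchronize with the device here, before
    // taking the stop time, otherwise only the launches are measured.
    auto stop = std::chrono::steady_clock::now();
    std::chrono::duration<double, std::milli> elapsed = stop - start;
    return elapsed.count() / iterations;
}

// CPU stand-in for sgemm (hypothetical, row-major): C = A * B,
// where A is m x k, B is k x n and C is m x n.
void naive_sgemm(int m, int n, int k,
                 const std::vector<float>& a,
                 const std::vector<float>& b,
                 std::vector<float>& c) {
    assert(a.size() == static_cast<std::size_t>(m) * k);
    assert(b.size() == static_cast<std::size_t>(k) * n);
    assert(c.size() == static_cast<std::size_t>(m) * n);
    for (int i = 0; i < m; ++i) {
        for (int j = 0; j < n; ++j) {
            float sum = 0.0f;
            for (int p = 0; p < k; ++p) {
                sum += a[i * k + p] * b[p * n + j];
            }
            c[i * n + j] = sum;
        }
    }
}
```

A call such as `average_ms([&] { naive_sgemm(m, n, k, a, b, c); }, 100)` then gives the per-call time with the first call excluded, which is the number to compare against Matlab.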