Here is a short excerpt from the cudaprof output
(function name, calls, GPU time, CPU time, %)

matrix_vector_multiply_generated   400   206162    4566   54.44
matrix_vector_multiply             400   54548.8   4495   14.4

What I find strange is that the CPU time is smaller than the GPU time. How can this happen?
(This is on the machine at work: GTX280, 32-bit, CUDA 2.0b environment.
At home, with a 64-bit CUDA 1.1 environment on a GTX8800 512, the output is as expected: the CPU time is approx. the GPU time + 20 us overhead per call.)
Does anybody know what is happening here?