How to get the kernel computation time?

Hi,
I want to get the kernel computation time.
unsigned int timer1;
cutCreateTimer(&timer1);
cutStartTimer(timer1);
Muld<<<dimGrid, dimBlock>>>(Ad, Bd, wA, wB, Cd);
cudaMemcpy(C, Cd, size, cudaMemcpyDeviceToHost);
cutStopTimer(timer1);

Since the kernel launch is asynchronous, I have to put cutStopTimer after cudaMemcpy, so the measurement also includes the copy back to the host.
Is there another way to get the kernel time?

Best,
Yixun

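Call cudaThreadSynchronize() before every timer call, i.e. before starting and before stopping the timer, so the host waits for outstanding GPU work to finish. (In current CUDA releases cudaThreadSynchronize() is deprecated in favor of cudaDeviceSynchronize().)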
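
A minimal sketch of the synchronize-then-time approach, in case it helps. It makes a few assumptions that are not from the original post: scaleKernel is a hypothetical stand-in for Muld, std::chrono replaces the cutil timer, and cudaDeviceSynchronize() is used as the current replacement for cudaThreadSynchronize().

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel; substitute Muld or your own kernel here.
__global__ void scaleKernel(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Returns the host-measured wall-clock time of one launch, in milliseconds.
double timeKernelHostMs(float *d_data, int n)
{
    dim3 block(256);
    dim3 grid((n + block.x - 1) / block.x);

    cudaDeviceSynchronize();                  // drain any earlier GPU work
    auto start = std::chrono::steady_clock::now();

    scaleKernel<<<grid, block>>>(d_data, n, 2.0f);
    cudaDeviceSynchronize();                  // block until the kernel is done

    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

int main()
{
    const int n = 1 << 20;
    float *d_data = nullptr;
    if (cudaMalloc(&d_data, n * sizeof(float)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(d_data, 0, n * sizeof(float));

    timeKernelHostMs(d_data, n);              // warm-up launch, result ignored
    printf("kernel time (host timer): %.3f ms\n", timeKernelHostMs(d_data, n));

    cudaFree(d_data);
    return 0;
}
```

Note that a host timer used this way still includes launch and driver overhead, so it is most meaningful for kernels that run for a while.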

Better yet, use CUDA events for timing; see, for example, the simpleStreams sample in the SDK. Events won’t include driver overhead, since they are recorded on the GPU. The resolution is the period of a GPU clock tick, so events work much better for short kernels.
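
A rough sketch of event-based timing follows. This is not the simpleStreams code itself; scaleKernel is a hypothetical stand-in for Muld, and everything runs on the default stream (stream 0).

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel; substitute Muld or your own kernel here.
__global__ void scaleKernel(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Returns the GPU-measured time between two events, in milliseconds.
float timeKernelEventMs(float *d_data, int n)
{
    dim3 block(256);
    dim3 grid((n + block.x - 1) / block.x);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);                // enqueue start marker on stream 0
    scaleKernel<<<grid, block>>>(d_data, n, 2.0f);
    cudaEventRecord(stop, 0);                 // enqueue stop marker after the kernel
    cudaEventSynchronize(stop);               // wait until the stop event is reached

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);   // elapsed time in milliseconds

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main()
{
    const int n = 1 << 20;
    float *d_data = nullptr;
    if (cudaMalloc(&d_data, n * sizeof(float)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(d_data, 0, n * sizeof(float));

    timeKernelEventMs(d_data, n);             // warm-up launch, result ignored
    printf("kernel time (events): %.3f ms\n", timeKernelEventMs(d_data, n));

    cudaFree(d_data);
    return 0;
}
```

Because the cudaMemcpy back to the host is issued after the stop event, it is not part of the measured interval.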

Paulius