CUDA timing

I have a question about timing the kernel. Here is the deal.

I have a process that does several passes using texture memory and CUDA arrays. The code looks like this:

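for (int i = 0; i < loops; ++i)
{
   // Copy the previous output into the CUDA array that is bound to the texture
   CUDA_SAFE_CALL(cudaMemcpyToArray(d_array_idata, 0, 0, d_odata, sizeof(float4)*width*height, cudaMemcpyDeviceToDevice));

   // Run the kernel, which writes its result to d_odata
   cuda_kernel<<< grid, threads >>>( d_odata, width, height);
}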

So basically, I copy data to my CUDA array (which is bound to a texture), run the kernel, copy the output data back to the array (updating the texture), and run the kernel again. Over and over again.

Here is the thing:

  • If I time the whole loop I get one time value, say 10 ms (X)

  • If I time only the cuda_kernel call I get another value, say 5 ms (Y)

  • If I time only the cudaMemcpyToArray call I get a third value, say 7 ms (Z)

These do not add up: X != Y + Z.

So my question is then:

Is there any other way to time this? (Maybe some general hints for timing CUDA performance?)

Or should I maybe do something other than my texture-copy method?

Kernel calls are asynchronous: the launch returns to the host before the kernel has finished. To get accurate wall clock timings, you must precede any call to a clock function with cudaThreadSynchronize().
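As a rough sketch of the advice above, the program below times the copy and the kernel separately with CUDA events and synchronizes before reading any timer. Several things here are assumptions rather than the original poster's code: scale_kernel is a hypothetical stand-in for cuda_kernel, a plain device-to-device cudaMemcpy stands in for cudaMemcpyToArray so the example needs no texture setup, and the sizes are made up. It also uses cudaDeviceSynchronize(), which replaced cudaThreadSynchronize() in later CUDA releases.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the thread's cuda_kernel: halves every element.
__global__ void scale_kernel(const float4 *in, float4 *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float4 v = in[i];
        out[i] = make_float4(0.5f * v.x, 0.5f * v.y, 0.5f * v.z, 0.5f * v.w);
    }
}

int main()
{
    // Assumed problem size; the thread does not give the real one.
    const int width = 1024, height = 1024, loops = 10;
    const int n = width * height;
    const size_t bytes = sizeof(float4) * n;

    float4 *d_idata = 0, *d_odata = 0;
    cudaMalloc(&d_idata, bytes);
    cudaMalloc(&d_odata, bytes);
    cudaMemset(d_odata, 0, bytes);

    cudaEvent_t start, afterCopy, afterKernel;
    cudaEventCreate(&start);
    cudaEventCreate(&afterCopy);
    cudaEventCreate(&afterKernel);

    dim3 threads(256);
    dim3 grid((n + threads.x - 1) / threads.x);
    float copyMs = 0.0f, kernelMs = 0.0f;

    // Make sure earlier work (the memset) is finished before timing starts.
    cudaDeviceSynchronize();

    for (int i = 0; i < loops; ++i) {
        cudaEventRecord(start, 0);
        // Stand-in for cudaMemcpyToArray(..., cudaMemcpyDeviceToDevice)
        cudaMemcpy(d_idata, d_odata, bytes, cudaMemcpyDeviceToDevice);
        cudaEventRecord(afterCopy, 0);
        scale_kernel<<<grid, threads>>>(d_idata, d_odata, n);
        cudaEventRecord(afterKernel, 0);

        // Block the host until the last event has actually completed on the GPU.
        cudaEventSynchronize(afterKernel);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, afterCopy);
        copyMs += ms;
        cudaEventElapsedTime(&ms, afterCopy, afterKernel);
        kernelMs += ms;
    }

    printf("copy:   %.3f ms\n", copyMs);
    printf("kernel: %.3f ms\n", kernelMs);
    printf("sum:    %.3f ms\n", copyMs + kernelMs);

    cudaEventDestroy(start);
    cudaEventDestroy(afterCopy);
    cudaEventDestroy(afterKernel);
    cudaFree(d_idata);
    cudaFree(d_odata);
    return 0;
}
```

Because the events are recorded in the same stream as the work they bracket, the two intervals are measured on the GPU's own timeline and should add up to roughly the total, regardless of when the host calls return.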

Okay, but I am already doing that. What about cudaMemcpyToArray(), though?

I think device-to-device memcpys are asynchronous too… I’ve never actually checked.

And something unrelated to timing: Are you using the CUDA 2.0 beta? Versions prior to that have a bug that results in extremely poor performance for cudaMemcpyToArray.

Yes, I am using the CUDA 2.0 beta, so the performance shouldn’t be that bad then, right?

The question is: is there any better way to do this than to copy the output to an array and bind that to a texture? It seems like cudaMemcpyToArray is taking up quite a lot of the time.