So basically what I am doing is: I copy data to my cudaArray (which is bound to a texture), run the kernel, copy the kernel's output back to the array (updating the texture), and run the kernel again. Over and over.
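For reference, a minimal sketch of that loop might look like this — the names (`tex`, `kernel`, `d_out`), the 2D float layout, and the block size are my assumptions, not necessarily the actual code:

```cuda
texture<float, 2, cudaReadModeElementType> tex;

__global__ void kernel(float *out, int w, int h)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < w && y < h)
        out[y * w + x] = tex2D(tex, x + 0.5f, y + 0.5f); // read previous pass
}

void iterate(cudaArray *arr, float *d_out, int w, int h, int steps)
{
    dim3 block(16, 16);
    dim3 grid((w + block.x - 1) / block.x, (h + block.y - 1) / block.y);
    cudaBindTextureToArray(tex, arr);
    for (int i = 0; i < steps; ++i) {
        kernel<<<grid, block>>>(d_out, w, h);
        // Feed the output back into the texture for the next pass.
        cudaMemcpyToArray(arr, 0, 0, d_out, w * h * sizeof(float),
                          cudaMemcpyDeviceToDevice);
    }
}
```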
Here is the thing:
If I time the whole deal I get one time value, say 10 ms (X)
If I time only the cuda_kernel call I get another value, say 5ms (Y)
If I time only the cudaMemcpyToArray call I get a third value, say 7ms (Z)
These do not add up: X != Y + Z.
So my question is then:
Is there any other way to time this? (Any general hints for timing CUDA performance?)
Or should I maybe do something other than my texture-copy method?
I think device to device memcpys are asynchronous too… I’ve never actually checked.
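That asynchrony is most likely why the numbers don't add up: kernel launches (and possibly device-to-device copies) return control to the CPU before the GPU has finished, so a host-side timer around a single call mostly measures launch overhead. CUDA events are timestamped on the GPU itself and avoid this. A sketch, assuming your kernel/launch names:

```cuda
cudaEvent_t start, stop;
float ms;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
kernel<<<grid, block>>>(d_out, w, h);   // the call being timed
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);             // block until the GPU reaches 'stop'
cudaEventElapsedTime(&ms, start, stop); // elapsed GPU time in milliseconds

cudaEventDestroy(start);
cudaEventDestroy(stop);
```

Alternatively, call cudaThreadSynchronize() before stopping any host-side timer so you measure completed GPU work rather than just the launch.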
And something unrelated to timing: Are you using the CUDA 2.0 beta? Versions prior to that have a bug that results in extremely poor performance for cudaMemcpyToArray.
Yes, I am using the CUDA 2.0 beta, so the performance shouldn't be that bad then, right?
The question is: is there a better way to do this than copying the output to an array and binding that to a texture? It seems like the cudaMemcpyToArray is taking up quite a lot of time.
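One option that avoids the cudaMemcpyToArray entirely is to bind the texture directly to linear memory and ping-pong between two device buffers. A sketch under assumptions (1D data, unfiltered fetches; all names are mine): note that linear-memory textures read via tex1Dfetch give you the texture cache but no filtering or 2D addressing, so whether this wins depends on your access pattern.

```cuda
texture<float, 1, cudaReadModeElementType> linTex;

__global__ void kernel(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = tex1Dfetch(linTex, i); // read previous iteration's output
}

void iterate(float *d_a, float *d_b, int n, int steps)
{
    for (int i = 0; i < steps; ++i) {
        // Bind the texture to last iteration's output: no copy needed.
        cudaBindTexture(0, linTex, d_a, n * sizeof(float));
        kernel<<<(n + 255) / 256, 256>>>(d_b, n);
        float *tmp = d_a; d_a = d_b; d_b = tmp; // swap buffers
    }
}
```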