I was going through some of the SDK examples, and they only seem to time the kernel execution, not the memory copy or memory allocation. Why is that?
I would be thankful if someone could give me the reason for this.
Also, can we expect a new Tesla card in the near future that is much faster for double precision?
Which kind of timing is relevant depends on your application - sometimes it is not necessary to transfer the results of all intermediate steps back to the CPU, and sometimes the results are displayed directly with no read-back.
I agree that perhaps we should be more consistent in the SDK as to whether we include the memory copies in the timing.
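For anyone who wants both numbers from one run, here is a minimal sketch that uses CUDA events to report the kernel-only time and the time including the host/device copies. The `saxpy` kernel, the `time_saxpy` helper, the `CHECK` macro, and the problem size are hypothetical stand-ins, not code from the SDK. Allocation with `cudaMalloc` is deliberately left outside the timed region here, which is an assumption you may want to change for your own measurements.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                              \
    do {                                                         \
        cudaError_t err_ = (call);                               \
        if (err_ != cudaSuccess) {                               \
            fprintf(stderr, "CUDA error: %s (line %d)\n",        \
                    cudaGetErrorString(err_), __LINE__);         \
            exit(EXIT_FAILURE);                                  \
        }                                                        \
    } while (0)

// Hypothetical stand-in kernel: y[i] = a * x[i] + y[i].
__global__ void saxpy(int n, float a, const float* x, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

// Uploads the inputs once, launches the kernel `launches` times, downloads
// the result once. Reports kernel-only and copy-inclusive times in ms.
void time_saxpy(int n, int launches, float* kernel_ms, float* total_ms) {
    size_t bytes = static_cast<size_t>(n) * sizeof(float);
    std::vector<float> hx(n, 1.0f), hy(n, 2.0f);

    // Allocation is not timed in this sketch (assumption, see lead-in).
    float *dx = nullptr, *dy = nullptr;
    CHECK(cudaMalloc((void**)&dx, bytes));
    CHECK(cudaMalloc((void**)&dy, bytes));

    cudaEvent_t start, kBegin, kEnd, stop;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&kBegin));
    CHECK(cudaEventCreate(&kEnd));
    CHECK(cudaEventCreate(&stop));

    CHECK(cudaEventRecord(start));   // start of copy-inclusive region
    CHECK(cudaMemcpy(dx, hx.data(), bytes, cudaMemcpyHostToDevice));
    CHECK(cudaMemcpy(dy, hy.data(), bytes, cudaMemcpyHostToDevice));

    CHECK(cudaEventRecord(kBegin));  // start of kernel-only region
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    for (int i = 0; i < launches; ++i) {
        saxpy<<<blocks, threads>>>(n, 2.0f, dx, dy);
    }
    CHECK(cudaGetLastError());
    CHECK(cudaEventRecord(kEnd));    // end of kernel-only region

    CHECK(cudaMemcpy(hy.data(), dy, bytes, cudaMemcpyDeviceToHost));
    CHECK(cudaEventRecord(stop));    // end of copy-inclusive region
    CHECK(cudaEventSynchronize(stop));

    CHECK(cudaEventElapsedTime(kernel_ms, kBegin, kEnd));
    CHECK(cudaEventElapsedTime(total_ms, start, stop));

    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(kBegin));
    CHECK(cudaEventDestroy(kEnd));
    CHECK(cudaEventDestroy(stop));
    CHECK(cudaFree(dx));
    CHECK(cudaFree(dy));
}

int main() {
    float kernel_ms = 0.0f, total_ms = 0.0f;
    time_saxpy(1 << 20, 100, &kernel_ms, &total_ms);
    printf("kernels only:     %.3f ms\n", kernel_ms);
    printf("including copies: %.3f ms\n", total_ms);
    return 0;
}
```

Events are used instead of a host timer because kernel launches are asynchronous: a CPU clock read right after the launch would not include the kernel's execution unless you synchronize first. Reporting both figures lets the reader pick the one that matches their application.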
We can’t talk about future products, but double precision performance will certainly increase in the long term.
I am still a little confused about whether I should include memcpy operations in the timing or not. Take, for instance, a kernel that is launched 100 times while the memcpy is done just once; in that case it would be fine to leave the memcpy out. But if the kernel is launched just once, then the memcpy might become significant. It also depends on the amount of data being transferred to the GPU.
Keeping all of the above in mind, I think memcpy should be timed for consistency and fairness in speedup computations.
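To make the 100-launches-versus-one argument concrete, here is a rough worked example. The symbols and the numbers are made up for illustration, not measurements: $T_{\text{copy}}$ is the one-time transfer cost, $T_{\text{kernel}}$ is the time of a single launch, $N$ is the number of launches, and $T_{\text{CPU}}$ is the time of the CPU version being compared against.

```latex
% Total GPU time and the fraction of it spent on copies
T_{\text{total}} = T_{\text{copy}} + N \, T_{\text{kernel}},
\qquad
f_{\text{copy}} = \frac{T_{\text{copy}}}{T_{\text{copy}} + N \, T_{\text{kernel}}}

% Hypothetical numbers: T_copy = 2 ms, T_kernel = 1 ms
N = 1:   \quad f_{\text{copy}} = \frac{2}{2 + 1}   \approx 0.67
N = 100: \quad f_{\text{copy}} = \frac{2}{2 + 100} \approx 0.02

% Speedup with and without the copies included
S_{\text{kernel}} = \frac{T_{\text{CPU}}}{N \, T_{\text{kernel}}},
\qquad
S_{\text{total}} = \frac{T_{\text{CPU}}}{T_{\text{copy}} + N \, T_{\text{kernel}}}
```

With these assumed numbers the copy accounts for about two thirds of the total time for a single launch, but only about 2% for 100 launches, so $S_{\text{kernel}}$ and $S_{\text{total}}$ nearly agree in the second case and differ by a factor of 3 in the first.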