In my CUDA application, I measure the execution time of different parts of the code, and in particular of the kernels, using the time.h library.
When I run the application on a GTX 480 and on a C2050, the C2050 appears slightly faster than the GTX 480 in total execution time for the whole application. BUT when I compare the execution times of the kernels alone, the GTX 480 appears to be faster!
The kernel execution times do not include memory transfers from device to host or from host to device.
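For context, here is a minimal sketch of the kind of kernel-only measurement described above. It is not the actual application code: the kernel `scale`, the helper `time_kernel_seconds`, and the problem size are placeholders, and it assumes a POSIX system where `clock_gettime` is available through time.h. It also assumes the host waits with `cudaDeviceSynchronize()` before stopping the timer, since kernel launches return to the host before the kernel has finished.

```cuda
#include <stdio.h>
#include <time.h>
#include <cuda_runtime.h>

// Placeholder kernel (hypothetical): multiplies each element by a constant.
__global__ void scale(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Returns the wall-clock seconds spent in the kernel only.
// Allocation and initialization happen outside the timed region.
// Returns -1.0 if any CUDA call fails (for example, no device present).
double time_kernel_seconds(int n) {
    float *d_data = NULL;
    if (cudaMalloc((void **)&d_data, n * sizeof(float)) != cudaSuccess) return -1.0;
    if (cudaMemset(d_data, 0, n * sizeof(float)) != cudaSuccess ||
        cudaDeviceSynchronize() != cudaSuccess) {  // finish setup before timing
        cudaFree(d_data);
        return -1.0;
    }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    scale<<<(n + 255) / 256, 256>>>(d_data, n, 2.0f);
    // The launch is asynchronous: wait for the kernel to finish
    // before reading the clock again.
    cudaError_t err = cudaDeviceSynchronize();
    clock_gettime(CLOCK_MONOTONIC, &t1);

    cudaFree(d_data);
    if (err != cudaSuccess) return -1.0;
    return (double)(t1.tv_sec - t0.tv_sec) + (double)(t1.tv_nsec - t0.tv_nsec) * 1e-9;
}

int main(void) {
    double s = time_kernel_seconds(1 << 20);
    if (s < 0.0) {
        printf("CUDA error (is a device available?)\n");
        return 1;
    }
    printf("kernel time: %.6f s\n", s);
    return 0;
}
```

Note that the first kernel launch in a process may also include one-time startup cost, so a warm-up launch before the timed one may be worth adding in a real measurement.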
One might therefore think that the host-device transfers make the difference. But how can that be? The C2050 has lower bandwidth than the GTX 480.
To summarize: the total application is faster on the C2050, but all the kernels are slower on the C2050 than on the GTX 480 ... ???
Can anybody suggest any possible explanation?
Thank you in advance.