I am new to CUDA and have some basic questions:
a) I have seen some threads comparing GPU performance with the CPU, along with some performance analysis. Is there a tool or technique for analyzing or displaying how time is distributed inside a running kernel? For example, can I measure the time a thread takes to copy data from global memory to shared memory?
b) I am focused on performance optimization. Can anyone point me to source code for image processing, especially code that compares CPU and GPU implementations? Any help or suggestions would be appreciated.
Thanks a lot