Timing analysis inside a kernel

I am new to CUDA and have some basic questions:

a) I have seen some topics comparing GPU performance with the CPU, along with some performance analysis. Is there a tool or method I can use to analyze or display how the execution time is distributed inside a running kernel? For example, can I see how long a thread takes to copy data from global memory to shared memory?

b) I am focused on performance optimization. Can anyone point me to source code for image processing, especially code that compares CPU and GPU implementations? Any help or suggestions would be appreciated.

Thanks a lot

a) Have you tried the NVIDIA Compute Visual Profiler?
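As far as I know, the profiler mostly reports timings and counters per kernel launch, so for a rough look at a single phase inside a kernel (such as the global-to-shared copy) you can bracket it with the device cycle counter. The sketch below is only an illustration: `timedKernel`, `TILE`, and the buffer names are made up for this example, and `clock64()` assumes compute capability 2.0 or later (older devices have `clock()`). The numbers are cycle counts seen by one thread per block, and the compiler or hardware may overlap instructions, so treat them as approximate.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define TILE 256  // threads per block and shared tile size (assumed for this sketch)

// Hypothetical kernel: stage one tile into shared memory, then do a trivial
// computation. clock64() is read before and after each phase.
__global__ void timedKernel(const float *in, float *out, int n,
                            long long *copyCycles, long long *computeCycles)
{
    __shared__ float tile[TILE];
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    long long t0 = clock64();
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;  // global -> shared copy
    __syncthreads();                             // wait until the whole tile is loaded
    long long t1 = clock64();

    float v = tile[threadIdx.x] * 2.0f;          // placeholder "compute" phase
    if (i < n) out[i] = v;
    long long t2 = clock64();

    // One thread per block records its own view of the two phases.
    if (threadIdx.x == 0) {
        copyCycles[blockIdx.x]    = t1 - t0;
        computeCycles[blockIdx.x] = t2 - t1;
    }
}

int main()
{
    const int n = 1 << 20;
    const int blocks = (n + TILE - 1) / TILE;

    std::vector<float> hIn(n, 1.0f);
    float *dIn, *dOut;
    long long *dCopy, *dCompute;
    cudaMalloc(&dIn, n * sizeof(float));
    cudaMalloc(&dOut, n * sizeof(float));
    cudaMalloc(&dCopy, blocks * sizeof(long long));
    cudaMalloc(&dCompute, blocks * sizeof(long long));
    cudaMemcpy(dIn, hIn.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    timedKernel<<<blocks, TILE>>>(dIn, dOut, n, dCopy, dCompute);
    if (cudaDeviceSynchronize() != cudaSuccess) {
        printf("kernel failed\n");
        return 1;
    }

    std::vector<long long> hCopy(blocks), hCompute(blocks);
    cudaMemcpy(hCopy.data(), dCopy, blocks * sizeof(long long), cudaMemcpyDeviceToHost);
    cudaMemcpy(hCompute.data(), dCompute, blocks * sizeof(long long), cudaMemcpyDeviceToHost);

    // Average the per-block cycle counts on the host.
    double copySum = 0.0, computeSum = 0.0;
    for (int b = 0; b < blocks; ++b) {
        copySum += (double)hCopy[b];
        computeSum += (double)hCompute[b];
    }
    printf("avg copy cycles: %.1f, avg compute cycles: %.1f\n",
           copySum / blocks, computeSum / blocks);

    cudaFree(dIn); cudaFree(dOut); cudaFree(dCopy); cudaFree(dCompute);
    return 0;
}
```

Compile with something like `nvcc -arch=sm_20 timed.cu` (the file name is arbitrary). The extra stores and clock reads perturb the kernel a little, so this is for diagnosis rather than for production code.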

Thanks for the reply. We are now using the profiler and can analyze our kernels with it.
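On b), here is a minimal sketch of a CPU versus GPU comparison for a very simple image operation: inverting an 8-bit grayscale image. It is not taken from any particular sample; the image is synthetic, and the names `invertCpu` and `invertGpu` are invented for this example. The CPU side is timed with `std::chrono` and the kernel with CUDA events. Note that the GPU figure here covers the kernel only and leaves out the host-device transfers, which can dominate for an operation this cheap.

```cuda
#include <chrono>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// GPU version: one thread per pixel.
__global__ void invertGpu(const unsigned char *in, unsigned char *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = (unsigned char)(255 - in[i]);
}

// CPU reference: plain loop over all pixels.
void invertCpu(const unsigned char *in, unsigned char *out, int n)
{
    for (int i = 0; i < n; ++i) out[i] = (unsigned char)(255 - in[i]);
}

int main()
{
    const int width = 2048, height = 2048, n = width * height;

    // Synthetic grayscale image (assumption: a real test would load a file).
    std::vector<unsigned char> src(n), cpuOut(n), gpuOut(n);
    for (int i = 0; i < n; ++i) src[i] = (unsigned char)(i % 256);

    // Time the CPU version.
    auto c0 = std::chrono::high_resolution_clock::now();
    invertCpu(src.data(), cpuOut.data(), n);
    auto c1 = std::chrono::high_resolution_clock::now();
    double cpuMs = std::chrono::duration<double, std::milli>(c1 - c0).count();

    unsigned char *dIn, *dOut;
    cudaMalloc(&dIn, n);
    cudaMalloc(&dOut, n);
    cudaMemcpy(dIn, src.data(), n, cudaMemcpyHostToDevice);

    // Time the kernel with CUDA events (transfers are excluded).
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    cudaEventRecord(start);
    invertGpu<<<blocks, threads>>>(dIn, dOut, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float gpuMs = 0.0f;
    cudaEventElapsedTime(&gpuMs, start, stop);

    cudaMemcpy(gpuOut.data(), dOut, n, cudaMemcpyDeviceToHost);

    // Check that both versions produce the same image before comparing times.
    bool same = (cpuOut == gpuOut);
    printf("CPU: %.3f ms, GPU kernel: %.3f ms, results %s\n",
           cpuMs, gpuMs, same ? "match" : "differ");

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(dIn);
    cudaFree(dOut);
    return same ? 0 : 1;
}
```

For a fair comparison you would normally run each version several times and discard the first GPU launch, since it can include one-time setup cost. The image-processing samples that ship with the CUDA SDK are also worth a look for more realistic filters.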