I'm using nvprof to identify bottlenecks in my code, mainly focusing on a few CUDA API calls: cudaMemcpy (DeviceToHost, DeviceToDevice) and cudaEventSynchronize. Here are my questions.
- Is there an easy way to identify which cudaMemcpy entry in the nvprof output corresponds to which call in the code?
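One approach (a sketch, assuming the NVTX library is available and the binary is linked with `-lnvToolsExt`) is to wrap each suspect call in a named NVTX range so the profiler's timeline shows the label next to the API call. The function and range names here are illustrative, not from my actual code:

```cuda
#include <cuda_runtime.h>
#include <nvToolsExt.h>

// Hypothetical wrapper: the range name "copy_results_D2H" appears in the
// nvprof/Visual Profiler timeline, making this memcpy identifiable.
void copy_results(int *h_dst, const int *d_src, size_t n) {
    nvtxRangePushA("copy_results_D2H");   // open a named range
    cudaMemcpy(h_dst, d_src, n * sizeof(int), cudaMemcpyDeviceToHost);
    nvtxRangePop();                       // close it
}
```

Alternatively, `nvprof --print-api-trace ./app` emits timestamped API calls in program order, which can be matched against the source by call sequence.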
- nvprof seems to indicate that a single-int memcpy from device to host is taking milliseconds! But when I comment that cudaMemcpy out, the total execution time does not change, and the millisecond latency moves to cudaFree instead. Why?
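My current guess (which I'd like confirmed): kernel launches are asynchronous, and a DeviceToHost cudaMemcpy blocks until all prior work on the stream completes, so the profiler bills the kernel's runtime to the memcpy. Removing the memcpy just shifts the wait to the next synchronizing call, which here is cudaFree. A minimal sketch of how I'd test that, assuming `my_kernel` stands in for whatever runs before the copy:

```cuda
#include <cuda_runtime.h>

// Stand-in for the real work preceding the copy (illustrative only).
__global__ void my_kernel(int *d) { /* ... */ }

void measure(int *h_val, int *d_val) {
    my_kernel<<<1, 1>>>(d_val);   // returns immediately (async launch)

    // Without this explicit sync, the cudaMemcpy below absorbs the
    // kernel's runtime in the profile, since it must wait for the kernel.
    cudaDeviceSynchronize();

    cudaMemcpy(h_val, d_val, sizeof(int), cudaMemcpyDeviceToHost);
}
```

With the explicit cudaDeviceSynchronize in place, the memcpy's reported time should shrink to the actual transfer cost, and the kernel wait shows up on the sync call instead.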
- Is a single cudaMemcpy executed as one serial transfer, or is it parallelized internally?
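My understanding, which I'd appreciate a check on: a single cudaMemcpy is one DMA transfer driven by the GPU's copy engine, not something split across host threads. Overlap has to come from the application, e.g. cudaMemcpyAsync on separate streams with pinned host memory. A sketch of that pattern (names and sizes are illustrative):

```cuda
#include <cuda_runtime.h>

void overlapped_copies(const float *d_a, const float *d_b, size_t n) {
    float *h_a, *h_b;
    cudaMallocHost(&h_a, n * sizeof(float));   // pinned host memory is
    cudaMallocHost(&h_b, n * sizeof(float));   // required for async copies

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Each call returns immediately; the transfers can overlap with kernels
    // (and with each other on GPUs that have two copy engines).
    cudaMemcpyAsync(h_a, d_a, n * sizeof(float), cudaMemcpyDeviceToHost, s1);
    cudaMemcpyAsync(h_b, d_b, n * sizeof(float), cudaMemcpyDeviceToHost, s2);

    cudaStreamSynchronize(s1);
    cudaStreamSynchronize(s2);

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFreeHost(h_a);
    cudaFreeHost(h_b);
}
```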