nvprof --print-api-trace - puzzling outputs.

I am using nvprof to identify bottlenecks in my code, focusing on 3 CUDA API calls: cudaMemcpy (DeviceToHost, DeviceToDevice) and cudaEventSynchronize. Here are my questions.

  1. Is there an easy way to identify which cudaMemcpy in the nvprof output corresponds to which call in the code?
  2. nvprof seems to indicate that a cudaMemcpy of an int from device to host is taking msec! But when I comment that cudaMemcpy out, there is no change in execution time and the msec latency moves to cudaFree!
  3. Is cudaMemcpy single-threaded or parallelized?