I'm using nvprof to identify bottlenecks in my code, mainly focusing on a few CUDA API calls: cudaMemcpy (DeviceToHost, DeviceToDevice) and cudaEventSynchronize. Here are my questions.
- Is there an easy way to identify which cudaMemcpy entry in the nvprof output corresponds to which call in the code?
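One approach (a sketch, assuming the NVTX library is available and the binary is linked with `-lnvToolsExt`) is to wrap each suspect call in a named NVTX range so the profiler's timeline shows the label next to the API call. The function and range names here are illustrative, not from my actual code:

```cuda
#include <cuda_runtime.h>
#include <nvToolsExt.h>

// Hypothetical wrapper: the range name "copy_results_D2H" appears in the
// nvprof/Visual Profiler timeline, making this memcpy identifiable.
void copy_results(int *h_dst, const int *d_src, size_t n) {
    nvtxRangePushA("copy_results_D2H");   // open a named range
    cudaMemcpy(h_dst, d_src, n * sizeof(int), cudaMemcpyDeviceToHost);
    nvtxRangePop();                       // close it
}
```

Alternatively, `nvprof --print-api-trace ./app` emits timestamped API calls in program order, which can be matched against the source by call sequence.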
- nvprof seems to indicate that a single-int memcpy from device to host is taking milliseconds! But when I comment that cudaMemcpy out, the total execution time does not change, and the millisecond latency moves to cudaFree instead. Why?
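My current guess (which I'd like confirmed): kernel launches are asynchronous, and a DeviceToHost cudaMemcpy blocks until all prior work on the stream completes, so the profiler bills the kernel's runtime to the memcpy. Removing the memcpy just shifts the wait to the next synchronizing call, which here is cudaFree. A minimal sketch of how I'd test that, assuming `my_kernel` stands in for whatever runs before the copy:

```cuda
#include <cuda_runtime.h>

// Stand-in for the real work preceding the copy (illustrative only).
__global__ void my_kernel(int *d) { /* ... */ }

void measure(int *h_val, int *d_val) {
    my_kernel<<<1, 1>>>(d_val);   // returns immediately (async launch)

    // Without this explicit sync, the cudaMemcpy below absorbs the
    // kernel's runtime in the profile, since it must wait for the kernel.
    cudaDeviceSynchronize();

    cudaMemcpy(h_val, d_val, sizeof(int), cudaMemcpyDeviceToHost);
}
```

With the explicit cudaDeviceSynchronize in place, the memcpy's reported time should shrink to the actual transfer cost, and the kernel wait shows up on the sync call instead.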
- Is a single cudaMemcpy executed as one serial transfer, or is it parallelized internally?
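My understanding, which I'd appreciate a check on: a single cudaMemcpy is one DMA transfer driven by the GPU's copy engine, not something split across host threads. Overlap has to come from the application, e.g. cudaMemcpyAsync on separate streams with pinned host memory. A sketch of that pattern (names and sizes are illustrative):

```cuda
#include <cuda_runtime.h>

void overlapped_copies(const float *d_a, const float *d_b, size_t n) {
    float *h_a, *h_b;
    cudaMallocHost(&h_a, n * sizeof(float));   // pinned host memory is
    cudaMallocHost(&h_b, n * sizeof(float));   // required for async copies

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Each call returns immediately; the transfers can overlap with kernels
    // (and with each other on GPUs that have two copy engines).
    cudaMemcpyAsync(h_a, d_a, n * sizeof(float), cudaMemcpyDeviceToHost, s1);
    cudaMemcpyAsync(h_b, d_b, n * sizeof(float), cudaMemcpyDeviceToHost, s2);

    cudaStreamSynchronize(s1);
    cudaStreamSynchronize(s2);

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFreeHost(h_a);
    cudaFreeHost(h_b);
}
```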