I need to profile the cost of host-to-device and device-to-host memory copies, but the result I get contains only a single, simple timeline instead of multiple threads. The application is a complicated TensorFlow model developed by Google for image captioning. nvvp indicates that no GPU is used, but when I check with the “nvidia-smi” command while the model is running, it shows that the GPU is in use.
A simpler matrix multiplication application written in TensorFlow gives detailed information. I wonder why this happens.
Is it possible that your GPU does not support running this complicated model?
Have you tried running the model separately, without the profiler? Which GPU is used while it runs?
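One way to answer the "which GPU is used" question is to ask nvidia-smi for the compute processes it currently sees while the model is running. Below is a minimal Python sketch, assuming nvidia-smi is on the PATH and supports the `--query-compute-apps` CSV query; the helper names `parse_compute_apps` and `gpu_compute_apps` are made up for illustration.

```python
import subprocess

# Hypothetical helper: parses the CSV text printed by
#   nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv,noheader
def parse_compute_apps(csv_text):
    apps = []
    for line in csv_text.strip().splitlines():
        parts = [p.strip() for p in line.split(",")]
        # Skip lines that do not look like "pid, name, memory".
        if len(parts) != 3 or not parts[0].isdigit():
            continue
        pid, name, used_memory = parts
        apps.append({"pid": int(pid), "name": name, "used_memory": used_memory})
    return apps

# Hypothetical helper: runs nvidia-smi and returns the parsed process list.
# Assumes nvidia-smi is installed and on the PATH.
def gpu_compute_apps():
    result = subprocess.run(
        ["nvidia-smi",
         "--query-compute-apps=pid,process_name,used_memory",
         "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    return parse_compute_apps(result.stdout)
```

Calling `gpu_compute_apps()` in a second terminal while the model runs should list the Python process if it really holds a GPU context. An empty list would suggest the model is running on the CPU.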
Thanks for your reply.
It seems that when running this complicated model, only the API calls can be profiled. The model does run on the GPU when it is not being profiled, so I guess the likely reason is that the detailed information is not accessible, because the profile result shows only the timeline of a call to the so-called ‘cuDevicePrimaryCtxRetain’, and I can’t understand what exactly that is.
Is it possible that the run does not actually trigger any kernel launches?
You can use nvprof --kernels XXX --analysis-metrics -o analysis.nvprof ./application to generate a result file, and then import it into nvvp to check whether it contains any results.
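If the model is launched from a Python script, it can help to assemble the nvprof command line in one place so the flags are easy to vary. The sketch below only builds the argument list; it does not run anything. The function names are hypothetical. The `child_processes` option maps to nvprof's `--profile-child-processes` flag, and whether child processes are actually the reason for the empty timeline here is only a guess, not something this thread establishes.

```python
import shlex

# Hypothetical helper: builds an nvprof command line as a list of arguments.
# `app_argv` is the command that starts your application, e.g. ["./application"].
def build_nvprof_command(app_argv, output="analysis.nvprof", kernels=None,
                         analysis_metrics=True, child_processes=False):
    cmd = ["nvprof"]
    if child_processes:
        cmd.append("--profile-child-processes")
        # A %p placeholder (process ID) lets each profiled process
        # write its own output file.
        if "%p" not in output:
            output = "%p." + output
    if kernels:
        cmd += ["--kernels", kernels]
    if analysis_metrics:
        cmd.append("--analysis-metrics")
    cmd += ["-o", output]
    cmd += list(app_argv)
    return cmd

# Renders the argument list as a shell-safe string for printing or logging.
def format_command(cmd):
    return " ".join(shlex.quote(arg) for arg in cmd)

# Same command as suggested above; XXX stays a placeholder for a kernel filter.
print(format_command(build_nvprof_command(["./application"], kernels="XXX")))
```

The resulting list can be passed to `subprocess.run` once nvprof is confirmed to be installed, and the output file can then be imported into nvvp as described above.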
Thanks again, I will try it as you suggest.