nvidia-smi -lms 1 and runtime

I use nvidia-smi -lms 1 to see if I’m getting concurrency in my CUDA app. I have timing statements in the code. I’ll start the app and then start nvidia-smi -lms 1. I’m consistently getting a slower run time when nvidia-smi is also running, and I appear to have lost the concurrency that I was getting. Is this unexpected? Has this always been the case and I’m just noticing it?
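For context, the kind of measurement I’m doing looks roughly like this (a stripped-down sketch, not my actual code; I’m just using CUDA events here as one way to bracket work issued to two streams):

```cpp
// Sketch: CUDA events bracket work issued on two streams so the host
// can report elapsed GPU time while nvidia-smi samples in parallel.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void busy(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        for (int k = 0; k < 1000; ++k) x[i] = x[i] * 1.0001f + 0.0001f;
}

int main() {
    const int n = 1 << 20;
    float *a, *b;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    // Two kernels issued to different streams; they may overlap,
    // depending on the device and resource usage.
    busy<<<(n + 255) / 256, 256, 0, s1>>>(a, n);
    busy<<<(n + 255) / 256, 256, 0, s2>>>(b, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("elapsed: %.3f ms\n", ms);

    cudaFree(a); cudaFree(b);
    return 0;
}
```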

Thanks, Roger

I’ve never seen nvidia-smi perturb an app. I wouldn’t be able to explain it. I don’t think I’ve ever run it that fast, however. You might want to use a profiler for detailed inspection of your application behavior.

I have run nvprof on the app. How can I determine the level of concurrency from the nvprof output?

Thanks, Roger

It will be much easier to use Nsight Systems - the GUI can give you a visual timeline of the operations in your app. For nvprof by itself, you would probably use the --print-gpu-trace option and then inspect the timestamped output to determine which operations were running concurrently.

(It’s puzzling to me, because I definitely don’t know how to ascertain concurrency from nvidia-smi)

I’m working on an AWS instance that does not do display or remote desktop, etc. Also, when I run the Nsight CLI it says I do not have permission to access the GPU registers. So I’m kinda stuck with nvprof.

In the nvidia-smi output, if I see two or more GPUs at 100% during the same 1 ms time interval, and that occurs for multiple 1 ms samples, then I say the app is processing concurrently. Is there a flaw in that analysis?

Well, if you don’t wish to work on getting your environment improved, it is still possible to reconstruct a timeline using nvprof. You can do it manually (even with graph paper) by using the --print-gpu-trace option and parsing the output yourself. Each item in the report has a starting point (timestamp) and a duration, so you can sketch out as much of the timeline as you like.
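If it helps, the bookkeeping is just an interval-overlap check. Something along these lines (the entries below are made-up placeholders, not real gpu-trace output; you would transcribe your own start/duration values in consistent units) will flag which operations overlapped:

```cpp
// Sketch: given (start, duration) pairs transcribed by hand from
// `nvprof --print-gpu-trace` output (units must be consistent, e.g.
// all microseconds), report which pairs of entries overlap in time.
#include <cstdio>
#include <vector>

struct Op {
    const char *name;
    double start;     // timestamp
    double duration;  // same units as start
};

int main() {
    // Hypothetical entries, used only to illustrate the check.
    std::vector<Op> ops = {
        {"kernelA",      0.0, 500.0},
        {"kernelB",    100.0, 300.0},
        {"HtoD memcpy", 700.0, 200.0},
    };

    for (size_t i = 0; i < ops.size(); ++i) {
        for (size_t j = i + 1; j < ops.size(); ++j) {
            double endI = ops[i].start + ops[i].duration;
            double endJ = ops[j].start + ops[j].duration;
            // Two intervals overlap if each starts before the other ends.
            if (ops[i].start < endJ && ops[j].start < endI)
                printf("%s overlaps %s\n", ops[i].name, ops[j].name);
        }
    }
    return 0;
}
```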

You might also be able to export a file from nvprof that can be imported into the visual profiler. You would need an installation of the visual profiler on another machine (Windows or Linux) that has a display. That machine does not necessarily need a GPU for the import function; the CUDA toolkit can be installed on a machine for development purposes even if it has no GPU.

No, not really. I was thinking of concurrency on an individual device, such as copy/compute overlap. If multi-GPU activity is what you are after, then your method should be valid for showing that two devices are both “at work” in a given time interval.
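For reference, single-device copy/compute overlap is usually arranged along these lines (a minimal toy sketch of mine, not anything from your app; whether the copy and the kernel actually overlap depends on the device and on using pinned host memory):

```cpp
// Sketch of single-device copy/compute overlap: an async H2D copy in
// one stream while a kernel runs in another stream. Pinned host memory
// is required for the copy to be truly asynchronous.
#include <cuda_runtime.h>

__global__ void work(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}

int main() {
    const int n = 1 << 20;
    float *h_buf, *d_copy, *d_work;
    cudaHostAlloc((void **)&h_buf, n * sizeof(float), cudaHostAllocDefault);
    cudaMalloc(&d_copy, n * sizeof(float));
    cudaMalloc(&d_work, n * sizeof(float));

    cudaStream_t copyStream, computeStream;
    cudaStreamCreate(&copyStream);
    cudaStreamCreate(&computeStream);

    // Copy and kernel are issued to different streams, so a profiler
    // timeline (or gpu-trace timestamps) can show them overlapping.
    cudaMemcpyAsync(d_copy, h_buf, n * sizeof(float),
                    cudaMemcpyHostToDevice, copyStream);
    work<<<(n + 255) / 256, 256, 0, computeStream>>>(d_work, n);

    cudaDeviceSynchronize();

    cudaFreeHost(h_buf);
    cudaFree(d_copy);
    cudaFree(d_work);
    return 0;
}
```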

Well, I would also like to know about concurrency on a single device.

When I try the --print-gpu-trace option I get an error from nvprof.

I don’t have control over the environment. So I can wish all I want, but it ain’t going to improve anything.

Yes, I’m not surprised that if you have trouble using the Nsight Systems CLI, you will also have trouble using nvprof.

I wasn’t suggesting that wishing was going to improve anything. Let me give you an example. In an academic setting, there is someone (e.g. the sysadmins) who controls the computing resource. Students who run into difficulty may be able to get the sysadmins to make changes to the computing resource, if the case is properly presented. This sort of thing happens all the time.

I don’t know the details of your AWS usage, but if you don’t control the environment then I imagine someone else does. If I were working in that environment, I would try to open discussions with the people in control of the environment to see if I could get the development situation improved.

That’s what I would do. I don’t know what options you have available to you. In this situation, I don’t know how to get the profilers running in a fully useful way without some control over the environment.

Good luck!