Profiling production server while it serves live requests

Hi everyone. We’re running servers that execute code on GPUs via CUDA, and we’d like to be able to profile them while they’re running in production and handling live requests. Actual server load is difficult for us to replicate in staging environments.

We’d like to do this in a way that doesn’t increase our downtime, and that seems to be the tricky part. With nsys profile, we can specify --delay and --duration to schedule a five-minute time slice. That would be good enough for our needs, but the profiler always exits once --duration elapses, and we need the application to continue running. Setting --kill=none keeps the application going, but then it dies the moment it calls into the CUDA code without the profiler still running. I also couldn’t find a way to do this with the interactive commands (nsys start, launch, stop), but maybe I missed something?
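For context, the interactive workflow I had in mind looks roughly like the sketch below — the session name, output path, and server command are all placeholders, and I'm assuming the documented nsys launch/start/stop session flags:

```shell
#!/bin/sh
# Sketch of the interactive nsys workflow (names and paths are placeholders).
capture_slice() {
    # Skip gracefully on machines without nsys installed.
    command -v nsys >/dev/null 2>&1 || { echo "nsys not installed; skipping"; return 0; }

    # Launch the server under an nsys session; no data is collected yet.
    nsys launch --session-new=prod-capture ./my_server &

    # Later, collect a five-minute slice from the already-running session.
    nsys start --session=prod-capture --output=profile/slice1
    sleep 300
    nsys stop --session=prod-capture
}

capture_slice
```

The appeal of this over nsys profile --delay/--duration is that start/stop could in principle be issued repeatedly against the same long-lived session, but I couldn't find a way to make it leave the application alive either.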

Does anyone have any suggestions for how we could extract a little profiling data from a live application whose uptime we need to maximize? Thanks in advance!

The crash on re-entering CUDA code should not be happening. I seem to recall that we had an issue with this before. Could you tell me which version of Nsight Systems you are using?

Likewise, is Nsys the only profiling tool you are using, or have you also put in your own CUPTI hooks?

12.0.0 is the version. We haven’t done anything with CUPTI hooks yet.

Just knowing that it shouldn’t crash on re-entry is super helpful. I notice that we’re compiling with toolkit version 12.1.1 even though our driver & nsys are both 12.0.0 - I’ll try compiling with 12.0.0 tomorrow and see what happens then.

@liuyis do you remember the details of the issue?

@anitet.wheeler while Nsys is shipped with the CTK, it isn’t directly dependent on a specific CTK version. I would recommend installing the latest Nsight Systems (2023.4.1) from the Nsight Systems page on NVIDIA Developer. I think that the bug I am remembering was fixed there.

I don’t remember much about this issue, and I didn’t find much by searching our internal tickets…

I couldn’t reproduce the issue locally using simple CUDA samples like vectorAdd modified to loop forever, so this might be application-specific. I tried the Nsys release from CTK 12.0 as well as our internal tip-of-tree (ToT) build.

@anitet.wheeler As Holly suggested, it would be helpful to try our latest website release and see if it works. If the issue persists, we need to look into it.

I tried with nsys 2023.4.1, same result. Here’s the command I’m using:
nsys profile -f true -o profile/test -t cuda --stats=true --duration=1 --kill=none <application>

Let’s try two things.

Try running with a bare-bones profile instead of the default feature set. This command disables CPU sampling as well as the NVTX and OS runtime (OSRT) traces that are part of the default.

nsys profile --sample=none -f true -o profile/test -t cuda --stats=true --duration=1 --kill=none <application>

I’ll be honest, I don’t expect this to work, but it narrows things down (and OSRT tracing is sometimes the culprit).

The second thing would be if you could get us a reproducer.
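While a real reproducer is put together, a harness along these lines could at least demonstrate the behavior. This is only a sketch: loop_vectoradd.cu stands in for vectorAdd modified to loop forever, and the paths and timings are placeholders.

```shell
#!/bin/sh
# Sketch of a reproducer harness: profile a looping CUDA sample for one
# second with --kill=none, then check whether the app survives the profiler.
# (loop_vectoradd.cu is a placeholder: vectorAdd modified to loop forever.)
run_repro() {
    # Skip gracefully on machines without nsys or the CUDA toolchain.
    if ! command -v nsys >/dev/null 2>&1 || ! command -v nvcc >/dev/null 2>&1; then
        echo "nsys/nvcc not installed; skipping"
        return 0
    fi

    nvcc -o loop_vectoradd loop_vectoradd.cu

    # Same shape as the failing command, minus --stats for brevity.
    nsys profile -f true -o repro -t cuda --duration=1 --kill=none ./loop_vectoradd &
    wait $!       # wait for nsys itself to exit after --duration

    sleep 5       # give the app time to hit its next CUDA call
    if pgrep -f loop_vectoradd >/dev/null; then
        echo "application survived profiler exit"
    else
        echo "application died after profiler exit"
    fi
}

run_repro
```

If this prints that the application died, the harness plus the looping sample source would be the reproducer; if it survives, the problem is likely specific to the production application.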