Hi,
I am trying to profile some CUDA code on an A6000 in CoreWeave and am having trouble. Nsight Systems isn’t starting up, so I tried to run a very simple case instead:
/home/fsuser/nsight-systems-2023.2.1/bin/nsys profile python3 ./inference/temp.py
where ./inference/temp.py is simply the code print(1).
nsys hangs for about 1 min, and then I get the following error message:
/dvs/p4/build/sw/devtools/Agora/Rel/QuadD_Main/QuadD/Common/AgentAPI/Src/SessionImpl.cpp(18): rpc Start(.Agent.EmptyMessage) returns (.Agent.EmptyMessage);
is canceled because the timeout period is expired
I created a log file as suggested in the thread "Linux: Cannot Start Profile, Cannot Start Daemon" (I also tried restarting, reinstalling CUDA and then restarting, etc., and nothing solved it).
I got the following log file: nsight-sys.log (271.6 KB).
Would appreciate your assistance!
I think your test may actually be too short. print(1) takes almost no time to run, and I am guessing that by the time nsys starts getting CUDA data from the underlying CUPTI, you are already done.
Can you put in a big sleep and try it?
Alternatively, what happens when it doesn’t start up? What command line are you running?
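For the sleep test, a minimal sketch of what temp.py could look like is below. The function name sleep_then_print and the idea of passing the duration as a command-line argument are illustrative assumptions, not anything nsys requires; a hard-coded time.sleep(30) before the print would do the same job.

```python
import sys
import time


def sleep_then_print(seconds: float) -> int:
    """Keep the process alive for `seconds`, then run the original test body."""
    # The sleep only gives the profiler time to attach; it does no CUDA work.
    time.sleep(seconds)
    # flush=True so the output is not left sitting in a buffer.
    print(1, flush=True)
    return 1


if __name__ == "__main__":
    # Hypothetical convention: first argument is the sleep duration in seconds.
    # With no argument the script skips the sleep and behaves like the original.
    seconds = float(sys.argv[1]) if len(sys.argv) > 1 else 0.0
    sleep_then_print(seconds)
```

With this version, the same command as before plus a duration, for example /home/fsuser/nsight-systems-2023.2.1/bin/nsys profile python3 ./inference/temp.py 30, should keep the process alive for about 30 seconds. If the timeout error still appears after about 1 min, the short runtime is probably not the cause.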
Hey, thanks for the prompt reply.
I tried running a much larger program (an LLM server), but it also just hangs for 1 min and then gives the same output.
Also, the print(1) never prints (does nsys not pipe the output to the shell?).
Edit 1: Another interesting side effect is that when I run the above, nvidia-smi takes a really long time to run during the 1 min that it hangs.
Edit 2: Sure, what sleep would you want me to try and run? Just a sleep inside the Python script before the print?
Edit 3: It just doesn’t start. For full context, this is what I see:
/home/fsuser/nsight-systems-2023.2.1/bin/nsys profile python3 ./inference/temp.py
/dvs/p4/build/sw/devtools/Agora/Rel/QuadD_Main/QuadD/Common/AgentAPI/Src/SessionImpl.cpp(18): rpc Start(.Agent.EmptyMessage) returns (.Agent.EmptyMessage);
is canceled because the timeout period is expired
It takes roughly 1 min for the error to be shown (before that, the output is empty).
If you use “profile”, then the console is held by nsys and not returned (fire and forget). If you need the console returned, you’ll want to use the interactive mode; see the User Guide in the Nsight Systems documentation.