Thanks for the experiment. It shows that the longer-than-expected delay in cuInit
is indeed the root cause.
Per internal discussion, we’ve only hit this previously when there was a bad GPU device. Do you know if that could be the case on your system?
Can you run strace -T -o /tmp/strace.txt matrixMul (matrixMul can be any simple CUDA program) and send us strace.txt? That can help us find which GPU node causes the long delay.
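If you want a quick look at the trace yourself before sending it, something like the following could help. It is only a minimal sketch: it assumes the usual strace -T layout, where each completed syscall line ends with the elapsed time in angle brackets (for example <0.000123>). The names slowest_syscalls and report are hypothetical helpers for illustration, not part of Nsys or strace.

```python
import re

# strace -T appends the time spent in each syscall as "<seconds>" at the end of the line.
DURATION_RE = re.compile(r"<(\d+\.\d+)>\s*$")


def slowest_syscalls(lines, top=10):
    """Return up to `top` (seconds, line) pairs, slowest syscall first."""
    timed = []
    for line in lines:
        match = DURATION_RE.search(line)
        if match:  # lines without a duration (signals, exit status) are skipped
            timed.append((float(match.group(1)), line.rstrip()))
    timed.sort(key=lambda item: item[0], reverse=True)
    return timed[:top]


def report(path="/tmp/strace.txt", top=10):
    """Print the slowest syscalls found in the strace output at `path`."""
    with open(path) as trace:
        for seconds, text in slowest_syscalls(trace, top):
            print(f"{seconds:10.6f}s  {text}")
```

Calling report() after the strace run should list the calls that took longest. If one of them references a specific /dev/nvidia* node, that node is a likely candidate, though the full strace.txt is still the most useful thing to send us.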
If we can identify a problematic GPU node, our suggestion would be to disable that node before running Nsys. If that’s not possible, we will need to add a way in Nsys to extend the timeout as a workaround. That won’t be available to you until the next public release, which is a few months away, unless you or your company has an NDA with NVIDIA, in which case we can share an internal build with you.