As the title mentions, when I use nsys to profile two PyTorch processes simultaneously on a single A100, I find that the graph executions of the two processes are completely interleaved, as in the following picture. Can you tell me the reason? Thanks a lot.
@liuyis can you respond to this?
@lvcunchi Could you share your report file?
I’d also suggest trying a few more profiling options to get more insight:
- Try adding the option --cuda-graph-trace=node to see the node-level details of each graph execution.
- Try the --gpu-metrics-devices=all option to enable the GPU metrics sampling feature and observe the GPU metrics during these graph executions.
- Try the --gpuctxsw=true option to enable the GPU context switch trace. There was probably GPU context switching happening between the two processes.
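
For reference, the three options can be combined in a single invocation. Below is a minimal sketch, assuming the workload is started with `python train.py` and the report is named `report_proc1`; both names are placeholders and not taken from the original post. The snippet only prints the command so it can be reviewed first.

```shell
# Options suggested in this thread, collected in one place.
NSYS_OPTS="--cuda-graph-trace=node --gpu-metrics-devices=all --gpuctxsw=true"

# Placeholder report name and workload command (assumptions, adjust as needed).
# $NSYS_OPTS is intentionally unquoted so it splits into separate arguments.
# Remove the leading `echo` to actually start profiling.
echo nsys profile $NSYS_OPTS -o report_proc1 python train.py
```

Depending on the system configuration, GPU metrics sampling and GPU context switch tracing may need elevated privileges, so the command may have to be run with the appropriate permissions.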