How can I get the total bytes read/written for memory accesses between GPUs?

Hello everyone,
I want to profile LLM models running on multiple GPUs, but when I use ncu, it hangs here:

==PROF== Profiling "ncclKernel_Broadcast_RING_LL_..." - 27 (28/200): 0%....50%
==WARNING== Launching the workload is taking more time than expected. If this continues to hang, 
terminate the profile and re-try by profiling the range of all related launches 
using '--replay-mode range'.

Then I tried range replay like this:
sudo ncu -o report-nccl/test_nvtx --replay-mode app-range --target-processes all -c 20 -s 64 -f python3 -m torch.distributed.run --nproc_per_node 2 7b-2mp-example_text_completion.py

and I marked the range in the code with cudaProfilerStart/Stop:

        torch.cuda.cudart().cudaProfilerStart()
        scores = torch.matmul(xq, keys.transpose(2, 3)) / math.sqrt(self.head_dim)
        torch.cuda.cudart().cudaProfilerStop()

But it ran for a while, then reported errors and exited:

==PROF== Profiling "range" - 0: 0%....50%....100% - 39 passes
==PROF== Profiling "range" - 1: 0%....50%....100% - 39 passes

==ERROR== An error was reported by the driver

==ERROR== Cannot capture API call for CUDA event recorded outside the range (cuEventQuery)
==ERROR== Skipping invalid capture range (RecordStatusUnsupportedApi).

==ERROR== An error was reported by the driver

==ERROR== Cannot capture API call for CUDA event recorded outside the range (cuEventQuery)
==ERROR== Skipping invalid capture range (RecordStatusUnsupportedApi).
==PROF== Trying to shutdown target application
==ERROR== The application returned an error code (9).
==ERROR== An error occurred while trying to profile.

I can use Nsight Systems to profile some metrics, but I can't find the total bytes read/written for memory accesses between GPUs. So how can I get it?

My devices are 2× RTX 3090,
ncu version is 2023.3.0.0,
Ubuntu version is Ubuntu 22.04 LTS.

Hi, @ttrbuaa

Sorry for the issue you've hit.
Range replay supports only a subset of the CUDA API for capture and replay. The documentation lists the supported functions as well as any further, API-specific limitations that may apply. If an unsupported API call is detected in the captured range, an error is reported and the range cannot be profiled. For details, see the Kernel Profiling Guide :: Nsight Compute Documentation
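For reference, a range-replay invocation for a region bracketed by cudaProfilerStart/Stop might look like the sketch below (the script name is a placeholder; the `--nvtx`/`--nvtx-include` variant applies only if your code pushes an NVTX range around the region instead):

```shell
# Sketch: range replay over the region between cudaProfilerStart/Stop.
# Alternatively, if the code wraps the region in an NVTX range, pass
# --nvtx --nvtx-include "<range name>/" to select it.
ncu --replay-mode range \
    --target-processes all \
    -o report_range \
    python3 -m torch.distributed.run --nproc_per_node 2 my_script.py
```

Note that the range must not contain unsupported API calls (such as the cuEventQuery reported in your log), or capture will fail as shown above.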

Thanks. It seems the NCCL kernels can't be profiled by Nsight Compute.
Does my case fall under the Stream Memory Operations limitation? If so, I can't get this with ncu, right?
So is there any method to get the total bytes read/written for memory accesses between GPUs? For example, with an LLM running on two GPUs there must be data exchange between them; I want to measure the bytes flowing between the two GPUs. How can I do this? Thank you so much!
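To make concrete what I'm after: the traffic can at least be estimated analytically from tensor sizes. A minimal sketch, assuming the textbook ring-algorithm cost models (the helper function is hypothetical, not part of any profiler; real NCCL may pick other algorithms or protocols):

```python
def collective_bytes_per_gpu(num_elements, dtype_bytes, world_size, op):
    """Rough analytical estimate of bytes sent per GPU for common NCCL
    collectives, assuming the textbook ring algorithms. These are cost
    models, not measured values."""
    n = num_elements * dtype_bytes  # payload size in bytes
    if op == "broadcast":
        # Ring broadcast: each GPU forwards the full buffer once.
        return n
    if op == "allreduce":
        # Ring all-reduce: 2 * (world_size - 1) / world_size * n per GPU.
        return 2 * (world_size - 1) / world_size * n
    raise ValueError(f"unknown op: {op}")

# Example: broadcasting a 4096x4096 fp16 tensor between 2 GPUs.
elems = 4096 * 4096
print(collective_bytes_per_gpu(elems, 2, 2, "broadcast"))  # 33554432 (32 MiB)
```

This only gives an expected lower bound per collective call; what I want from the profiler is the actual measured byte counts.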

Perhaps it’s not what you’re looking for, but isn’t the “Peer Memory” traffic on the Memory Chart what you’re after?
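One way to check which peer-traffic counters your ncu build actually exposes on your GPUs (the exact metric names vary by architecture, so this queries rather than assumes them):

```shell
# List all metrics supported by this ncu/GPU combination and filter
# for peer-memory-related counters; any that appear can then be
# collected explicitly via the --metrics option.
ncu --query-metrics | grep -i peer
```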

I have profiled that metric, but it is always 0, no matter whether the replay mode is application or range. I don't know if I'm doing something wrong or if it just doesn't work.
Please help me figure out how to get the Peer Memory traffic, thank you so much!

Also, because the NCCL kernels can't be profiled, I have profiled the other kernels, and Peer Memory is always 0 for all of them. So I think all the data exchange between the GPUs may happen inside the NCCL kernels? But I can't run with an NVTX range, which makes me sad.

Hi, @ttrbuaa

Please check whether Nsight Systems can meet your requirement: User Guide — nsight-systems 2024.1 documentation

Note that Nsight Systems GPU Metrics is only available for Linux targets on x86-64 and aarch64, and for Windows targets. It requires NVIDIA Turing architecture or newer (your RTX 3090s are Ampere, so they qualify).
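A GPU Metrics capture might look like the following sketch. The sampled counters include NVLink and PCIe throughput; if your two 3090s are not connected by an NVLink bridge, the inter-GPU traffic travels over PCIe, so check the PCIe throughput rows in the timeline:

```shell
# Sketch: sample GPU metrics (incl. NVLink/PCIe throughput) on all GPUs
# while the distributed workload runs.
nsys profile --gpu-metrics-device=all -o report_metrics \
    python3 -m torch.distributed.run --nproc_per_node 2 7b-2mp-example_text_completion.py
```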