How can I get the total bytes read/written for memory accesses between GPUs?

Hello everyone,
I want to profile LLM models running on multiple GPUs, but when I use ncu, it hangs here:

==PROF== Profiling "ncclKernel_Broadcast_RING_LL_..." - 27 (28/200): 0%....50%
==WARNING== Launching the workload is taking more time than expected. If this continues to hang, 
terminate the profile and re-try by profiling the range of all related launches 
using '--replay-mode range'.

Then I tried range replay like this:
sudo ncu -o report-nccl/test_nvtx --replay-mode app-range --target-processes all -c 20 -s 64 -f python3 -m torch.distributed.run --nproc_per_node 2 7b-2mp-example_text_completion.py

and I marked the range in the code with cudaProfilerStart/Stop:

        torch.cuda.cudart().cudaProfilerStart()
        scores = torch.matmul(xq, keys.transpose(2, 3)) / math.sqrt(self.head_dim)
        torch.cuda.cudart().cudaProfilerStop()

But it ran for a while, then reported errors and exited:

==PROF== Profiling "range" - 0: 0%....50%....100% - 39 passes
==PROF== Profiling "range" - 1: 0%....50%....100% - 39 passes

==ERROR== An error was reported by the driver

==ERROR== Cannot capture API call for CUDA event recorded outside the range (cuEventQuery)
==ERROR== Skipping invalid capture range (RecordStatusUnsupportedApi).

==ERROR== An error was reported by the driver

==ERROR== Cannot capture API call for CUDA event recorded outside the range (cuEventQuery)
==ERROR== Skipping invalid capture range (RecordStatusUnsupportedApi).
==PROF== Trying to shutdown target application
==ERROR== The application returned an error code (9).
==ERROR== An error occurred while trying to profile.

I can use Nsight Systems to profile some metrics, but I can't find the total bytes read/written for memory accesses between GPUs. So how can I get it?

My devices are 2× RTX 3090,
ncu version is 2023.3.0.0,
Ubuntu version is Ubuntu 22.04 LTS.

Hi, @ttrbuaa

Sorry for the issue you've hit.
Range replay supports only a subset of the CUDA API for capture and replay. The documentation lists the supported functions as well as any further, API-specific limitations that may apply. If an unsupported API call is detected in the captured range, an error is reported and the range cannot be profiled. For details, see the Kernel Profiling Guide :: Nsight Compute Documentation
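For reference, a range-replay invocation for a region bracketed by cudaProfilerStart/Stop might look like the sketch below (the script name is a placeholder; the `--nvtx`/`--nvtx-include` variant applies only if your code pushes an NVTX range around the region instead):

```shell
# Sketch: range replay over the region between cudaProfilerStart/Stop.
# Alternatively, if the code wraps the region in an NVTX range, pass
# --nvtx --nvtx-include "<range name>/" to select it.
ncu --replay-mode range \
    --target-processes all \
    -o report_range \
    python3 -m torch.distributed.run --nproc_per_node 2 my_script.py
```

Note that the range must not contain unsupported API calls (such as the cuEventQuery reported in your log), or capture will fail as shown above.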

Thanks. It seems the NCCL kernels can't be profiled by Nsight Compute.
Does my case fall under the Stream Memory Operations limitation? If so, I can't get this with ncu, right?
So is there any method to get the total bytes read/written for memory accesses between GPUs? For example, with an LLM running on two GPUs there must be data exchange between them; I want to measure the bytes flowing between the two GPUs. How can I do this? Thank you so much!
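To make concrete what I'm after: the traffic can at least be estimated analytically from tensor sizes. A minimal sketch, assuming the textbook ring-algorithm cost models (the helper function is hypothetical, not part of any profiler; real NCCL may pick other algorithms or protocols):

```python
def collective_bytes_per_gpu(num_elements, dtype_bytes, world_size, op):
    """Rough analytical estimate of bytes sent per GPU for common NCCL
    collectives, assuming the textbook ring algorithms. These are cost
    models, not measured values."""
    n = num_elements * dtype_bytes  # payload size in bytes
    if op == "broadcast":
        # Ring broadcast: each GPU forwards the full buffer once.
        return n
    if op == "allreduce":
        # Ring all-reduce: 2 * (world_size - 1) / world_size * n per GPU.
        return 2 * (world_size - 1) / world_size * n
    raise ValueError(f"unknown op: {op}")

# Example: broadcasting a 4096x4096 fp16 tensor between 2 GPUs.
elems = 4096 * 4096
print(collective_bytes_per_gpu(elems, 2, 2, "broadcast"))  # 33554432 (32 MiB)
```

This only gives an expected lower bound per collective call; what I want from the profiler is the actual measured byte counts.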

Perhaps it’s not what you’re looking for, but isn’t the “Peer Memory” traffic on the Memory Chart what you’re after?
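One way to check which peer-traffic counters your ncu build actually exposes on your GPUs (the exact metric names vary by architecture, so this queries rather than assumes them):

```shell
# List all metrics supported by this ncu/GPU combination and filter
# for peer-memory-related counters; any that appear can then be
# collected explicitly via the --metrics option.
ncu --query-metrics | grep -i peer
```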

I have profiled that metric, but it is always 0, no matter whether the replay mode is application or range. I don't know if I'm doing something wrong or if it just doesn't work.
Please help me figure out how to get the Peer Memory traffic, thank you so much!

Also, because the NCCL kernels can't be profiled, I have profiled the other kernels, and Peer Memory is always 0 for all of them. So I think all the data exchange between the GPUs may happen inside the NCCL kernels? But I can't run with an NVTX range, which makes me sad.

Hi, @ttrbuaa

Please check whether Nsight Systems can meet your requirement: User Guide — nsight-systems 2024.1 documentation

Note that Nsight Systems GPU Metrics is only available for Linux targets on x86-64 and aarch64, and for Windows targets. It requires NVIDIA Turing architecture or newer (your RTX 3090s are Ampere, so they qualify).
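A GPU Metrics capture might look like the following sketch. The sampled counters include NVLink and PCIe throughput; if your two 3090s are not connected by an NVLink bridge, the inter-GPU traffic travels over PCIe, so check the PCIe throughput rows in the timeline:

```shell
# Sketch: sample GPU metrics (incl. NVLink/PCIe throughput) on all GPUs
# while the distributed workload runs.
nsys profile --gpu-metrics-device=all -o report_metrics \
    python3 -m torch.distributed.run --nproc_per_node 2 7b-2mp-example_text_completion.py
```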