Hi,
I can’t get the Nsight Compute profiler to capture ranges, because calls to cudaProfilerStart/cudaProfilerStop seem to be ignored. Regular profiling with ncu works otherwise, and nsys intercepts the profiler calls correctly. What am I doing wrong? Is there a workaround?
I want to profile two kernels that must run concurrently, so the default per-kernel profiling, which serializes kernels, is not an option.
Thanks a lot
Minimal example:
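// hello_cuda.cu
#include <cstdio>               // printf (host and device)
#include "cuda_profiler_api.h"  // cudaProfilerStart / cudaProfilerStop

__global__ void cuda_hello() {
    printf("Hello World from GPU!\n");
}

int main() {
    // The range to be profiled is delimited by the profiler start/stop calls.
    cudaProfilerStart();
    cuda_hello<<<1, 1>>>();
    cudaDeviceSynchronize();
    cudaProfilerStop();

    printf("Hello from CPU\n");
    return 0;
}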
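For context, the real use case looks roughly like the sketch below: two kernels launched on separate streams inside a single cudaProfilerStart/cudaProfilerStop range. The file name, kernel names, and sizes are placeholders for illustration only, not my actual code, and whether the two kernels really overlap depends on the resources available on the GPU.

```cuda
// concurrent_range.cu (hypothetical file name, illustration only)
#include <cstdio>
#include <cuda_runtime.h>
#include <cuda_profiler_api.h>

// Placeholder kernels standing in for the two real kernels.
__global__ void kernel_a(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * 2.0f + 1.0f;
}

__global__ void kernel_b(float *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = y[i] * 0.5f - 1.0f;
}

int main() {
    const int n = 1 << 14;                 // small grids leave room for overlap
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    float *x = nullptr, *y = nullptr;
    cudaMalloc(&x, n * sizeof(float));
    cudaMalloc(&y, n * sizeof(float));
    cudaMemset(x, 0, n * sizeof(float));
    cudaMemset(y, 0, n * sizeof(float));

    // One stream per kernel so the launches are allowed to run concurrently.
    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    // Both launches sit inside one profiler range.
    cudaProfilerStart();
    kernel_a<<<blocks, threads, 0, s0>>>(x, n);
    kernel_b<<<blocks, threads, 0, s1>>>(y, n);
    cudaDeviceSynchronize();
    cudaProfilerStop();

    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
    cudaFree(x);
    cudaFree(y);
    printf("Done\n");
    return 0;
}
```

This would be built and run the same way as hello_cuda below; only the contents of the range differ.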
Output:
$ nvcc -arch=sm_80 hello_cuda.cu -o hello_cuda
$ CUDA_VISIBLE_DEVICES=0 TMPDIR=. /scratch/XXXX/NVIDIA-Nsight-Compute-2023.1/ncu --export report.ncu-rep --force-overwrite --replay-mode range ./hello_cuda
==WARNING== Please consult the documentation for current range-based replay mode limitations and requirements.
==PROF== Connected to process 322593 (/scratch/XXXX/xformers/scripts/hello_cuda)
Hello World from GPU!
Hello from CPU
==PROF== Disconnected from process 322593
==WARNING== No ranges were profiled.
==WARNING== Profiling ranges launched by child processes requires the --target-processes all option.
Range profiling with nsys works:
$ nsys profile --capture-range=cudaProfilerApi ./hello_cuda
Warning: LBR backtrace method is not supported on this platform. DWARF backtrace method will be used.
Capture range started in the application.
Hello World from GPU!
Generating '/tmp/nsys-report-4070.qdstrm'
Capture range ended in the application.
[1/1] [========================100%] report1.nsys-rep
Generated:
/scratch/XXXX/xformers/scripts/report1.nsys-rep
Setup:
$ nvidia-smi -i 0
Thu Mar 2 15:55:29 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  On   | 00000000:10:1C.0 Off |                    0 |
| N/A   27C    P0    52W / 400W |      0MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0