Range profiling: "No ranges were profiled."

Hi,

I can’t get the Nsight Compute profiler to capture ranges, because calls to cudaProfilerStart/cudaProfilerStop seem to be ignored. Regular profiling with ncu works otherwise, and nsys can intercept the profiler calls correctly. What am I doing wrong? Is there a workaround?
For context: I want to profile two kernels that must run concurrently.

Thanks a lot

Minimal example:

// hello_cuda.cu
#include <cstdio>
#include "cuda_profiler_api.h"

__global__ void cuda_hello(){
    printf("Hello World from GPU!\n");
}

int main() {
    cudaProfilerStart();
    cuda_hello<<<1,1>>>();
    cudaDeviceSynchronize();
    cudaProfilerStop();
    printf("Hello from CPU\n");
    return 0;
}

Output:

$ nvcc -arch=sm_80 hello_cuda.cu -o hello_cuda
$ CUDA_VISIBLE_DEVICES=0 TMPDIR=. /scratch/XXXX/NVIDIA-Nsight-Compute-2023.1/ncu --export report.ncu-rep --force-overwrite --replay-mode range ./hello_cuda
==WARNING== Please consult the documentation for current range-based replay mode limitations and requirements.
==PROF== Connected to process 322593 (/scratch/XXXX/xformers/scripts/hello_cuda)
Hello World from GPU!
Hello from CPU
==PROF== Disconnected from process 322593
==WARNING== No ranges were profiled.
==WARNING== Profiling ranges launched by child processes requires the --target-processes all option.

Range profiling with nsys works:

$ nsys profile --capture-range=cudaProfilerApi ./hello_cuda
Warning: LBR backtrace method is not supported on this platform. DWARF backtrace method will be used.
Capture range started in the application.
Hello World from GPU!
Generating '/tmp/nsys-report-4070.qdstrm'
Capture range ended in the application.
[1/1] [========================100%] report1.nsys-rep
Generated:
    /scratch/XXXX/xformers/scripts/report1.nsys-rep

Setup:

$ nvidia-smi -i 0
Thu Mar  2 15:55:29 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  On   | 00000000:10:1C.0 Off |                    0 |
| N/A   27C    P0    52W / 400W |      0MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0

The issue is likely that no CUDA context is active on this thread at the point where cudaProfilerStart is called. As suggested in the range replay documentation, you can try using driver API calls cuProfilerStart/Stop instead.
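
If the missing context is indeed the cause, one thing to try is forcing the runtime to create its context before the range starts. The sketch below is a variant of the example above, not a confirmed fix: it assumes that a no-op runtime call such as cudaFree(nullptr) is enough to make the primary context current on the calling thread, so that cudaProfilerStart is issued with a context already active. The file name and the helper profile_hello_range are made up for illustration.

```cuda
// hello_cuda_ctx.cu (hypothetical file name)
#include <cstdio>
#include <cuda_runtime.h>
#include <cuda_profiler_api.h>

__global__ void cuda_hello() {
    printf("Hello World from GPU!\n");
}

// Hypothetical helper: runs one kernel inside a profiler range and
// returns the first CUDA error encountered (cudaSuccess otherwise).
cudaError_t profile_hello_range() {
    // Assumption: this no-op call forces the runtime to initialize its
    // context on this thread before the range is opened.
    cudaError_t err = cudaFree(nullptr);
    if (err != cudaSuccess) return err;

    err = cudaProfilerStart();
    if (err != cudaSuccess) return err;

    cuda_hello<<<1, 1>>>();
    err = cudaGetLastError();          // catches launch failures
    if (err == cudaSuccess) {
        err = cudaDeviceSynchronize(); // wait for the kernel inside the range
    }

    // Close the range on the same thread that opened it.
    cudaError_t stop_err = cudaProfilerStop();
    return (err != cudaSuccess) ? err : stop_err;
}

int main() {
    cudaError_t err = profile_hello_range();
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("Hello from CPU\n");
    return 0;
}
```

Build and profile it with the same nvcc and ncu command lines as above. Checking the return values of cudaProfilerStart and cudaProfilerStop also shows whether the calls themselves fail, as opposed to being ignored by the profiler.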

It does not work when I use the CUDA driver API either. Could you please suggest something else?