I wish I could simply ignore the return value, but here’s an interesting bit: when I run the code under Nsight Systems, the error code changes to -1, yet Nsight Systems still seems to capture the regions correctly!
wp11@ufront:~/work/testcodes/cpp-cu-ddot$ nsys profile -o cuda-ddot --stats=true --trace=nvtx ./ddot.x 1000000
Using device NVIDIA A100-PCIE-40GB
NVTX Error: -1
h_A[999999] = 999999
h_B[999999] = 2e+06
NVTX Error: -1
NVTX Error: -1
NVTX Error: -1
NVTX Error: -1
Kernel 1 (ddot), workspace size = 4096
Using grid: 3907, 1, 1
Using block: 256, 1, 1
Kernel 2 (reduceblocks), workspace size = 4096, filled = 3907
Using grid: 16, 1, 1
Using block: 256, 1, 1
Kernel 2 (reduceblocks), workspace size = 16, filled = 16
Using grid: 1, 1, 1
Using block: 256, 1, 1
NVTX Error: -1
NVTX Error: -1
NVTX Error: -1
Success! Result = 6.66666e+17
Generating '/tmp/nsys-report-2ed1.qdstrm'
[1/3] [========================100%] cuda-ddot.nsys-rep
[2/3] [========================100%] cuda-ddot.sqlite
[3/3] Executing 'nvtx_sum' stats report
Time (%)  Total Time (ns)  Instances     Avg (ns)     Med (ns)   Min (ns)   Max (ns)  StdDev (ns)  Style    Range
--------  ---------------  ---------  -----------  -----------  ---------  ---------  -----------  -------  -----
    76.0        4,806,275          1  4,806,275.0  4,806,275.0  4,806,275  4,806,275          0.0  PushPop  init
    22.1        1,396,473          1  1,396,473.0  1,396,473.0  1,396,473  1,396,473          0.0  PushPop  h2d
     1.6          100,161          1    100,161.0    100,161.0    100,161    100,161          0.0  PushPop  calc
     0.4           24,500          1     24,500.0     24,500.0     24,500     24,500          0.0  PushPop  d2h
Generated:
/home/wp11/work/testcodes/cpp-cu-ddot/cuda-ddot.nsys-rep
/home/wp11/work/testcodes/cpp-cu-ddot/cuda-ddot.sqlite
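
For context, the NVTX documentation describes the return value of nvtxRangePushA and nvtxRangePop as the zero-based nesting level of the range, with a negative value on error. Since all four ranges show up in the nvtx_sum report above despite the -1, one option might be to treat a negative value as "no depth reported" instead of a hard failure. Below is a minimal sketch of that idea. The helper names nvtx_depth_available and report_nvtx are hypothetical (they are not part of NVTX or of my code above), and the interpretation of negative values is an assumption based on the output shown here, not confirmed tool behavior.

```cpp
#include <cstdio>

// Hypothetical helper: decide whether a push/pop return value carries a
// usable nesting depth. Assumption: values >= 0 are depths; negative values
// mean no depth was reported (no tool attached, or the tool does not track it).
inline bool nvtx_depth_available(int rc) {
    return rc >= 0;
}

// Hypothetical helper: log a note instead of treating the value as fatal.
// 'what' is just a label for the call site, e.g. "push(init)".
inline void report_nvtx(const char* what, int rc) {
    if (!nvtx_depth_available(rc)) {
        std::fprintf(stderr, "NVTX note: %s returned %d (no depth reported)\n",
                     what, rc);
    }
}
```

A call site would then look something like report_nvtx("push(init)", nvtxRangePushA("init")); so the program keeps running and the profiler output remains the source of truth for whether the range was recorded.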