Unable to capture CUDA API calls when invoking CUDA code from Go


I am unable to capture CUDA API calls when invoking CUDA code from Go. The timeline in Nsight Systems (image1) does not show a GPU row, but I am sure the CUDA code was invoked, since my program returns the correct output (confirmed via the stdout file captured by Nsight Systems).

I am trying to run a simple Go + cgo + CUDA example from here. For reference, here are the .cu file and the .go file (image2):

I compile these as follows: (image3)

Then I run the compiled binary (image4) from Nsight Systems, which gives me the timeline shown above.

The diagnostics summary says that “No CUDA events collected” (image5):

The stdout log file shows the expected output and the stderr log file doesn’t contain any errors (image6):

I am able to capture API traces in Nsight Systems when I run the cuda-samples (GitHub - NVIDIA/cuda-samples).
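To compare the two cases, a small Go sketch like the one below could list which shared libraries each binary imports and flag the CUDA ones. This is only a diagnostic idea, not a known cause: the helper name `cudaLibs` and the name prefixes it matches (`libcudart`, `libcuda.so`) are my own assumptions, and a binary that links the CUDA runtime statically would show no `libcudart` entry at all.

```go
package main

import (
	"debug/elf"
	"fmt"
	"os"
	"strings"
)

// cudaLibs keeps only the library names that look like the CUDA runtime
// (libcudart*) or the CUDA driver library (libcuda.so*).
// The prefixes are an assumption made for this sketch.
func cudaLibs(libs []string) []string {
	var out []string
	for _, l := range libs {
		if strings.HasPrefix(l, "libcudart") || strings.HasPrefix(l, "libcuda.so") {
			out = append(out, l)
		}
	}
	return out
}

func main() {
	if len(os.Args) != 2 {
		fmt.Fprintln(os.Stderr, "usage: checklibs <path-to-elf-binary>")
		os.Exit(2)
	}

	// Open the binary as an ELF file (Linux executables and shared objects).
	f, err := elf.Open(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, "open:", err)
		os.Exit(1)
	}
	defer f.Close()

	// ImportedLibraries returns the DT_NEEDED entries of the dynamic section.
	libs, err := f.ImportedLibraries()
	if err != nil {
		fmt.Fprintln(os.Stderr, "read imports:", err)
		os.Exit(1)
	}

	fmt.Println("all imported libraries:", libs)
	fmt.Println("CUDA-looking libraries:", cudaLibs(libs))
}
```

Running it once on the Go binary and once on a cuda-samples binary that traces correctly would show whether the two differ in how they pull in CUDA. It only reports linkage and does not by itself explain the missing trace.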

Any idea why CUDA API traces are not being captured when the CUDA code is invoked from Go?