==ERROR== Unsupported API during capture (cuInit)

I’m currently facing the issue below.

When I run this code,
mymain.txt (20.9 KB)

the problem occurs only when I use the NVTX range "MHA_addmm_qkv".
ncu reports the error in the title and shuts down.

What puzzles me is that I use the same Conv1D code right below it (named conv1d_attn_last),
and that one works fine.

This is my command line:
/tmp/var/target/linux-desktop-glibc_2_11_3-x64/ncu --config-file off --export /tmp/var/128GPT2_MHA --force-overwrite --replay-mode range --section-folder /tmp/var/sections --set full --call-stack --nvtx --nvtx-include "MHA_transpose_cat/" --nvtx-include "MHA_transpose_stack_for_next/" --nvtx-include "MHA_merge/" --nvtx-include "MHA_last_matmul/" --nvtx-include "MHA_qk/" --nvtx-include "MHA_invsqrt/" --nvtx-include "MHA_masking/" --nvtx-include "MHA_softmax/" --nvtx-include "MHA_av/" --nvtx-include "MHA_addmm_qkv/" --nvtx-include "MHA_addmm_last/" /home/oem/anaconda3/envs/ghlee/bin/python /home/oem/ghlee/gpt-2-Pytorch/main.py --text "s"

I also ran it directly in a terminal and got the same result.

I would really like to know what is wrong and how to fix it.

Thank you!

As the error message tells you, cuInit is not supported inside a range defined for range replay; see the Kernel Profiling Guide in the Nsight Compute 12.4 documentation.
You can have the application initialize CUDA before your range, using whatever mechanism your application or framework provides for this, or you can use app-range replay if your application can be restarted deterministically.
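For a PyTorch script, initializing CUDA before the range might look like the sketch below. It is only an illustration under stated assumptions: the helper names `cuda_ready`, `warm_up_cuda` and `nvtx_range` are hypothetical and not taken from mymain.txt, and the idea that the first `addmm` call lazily sets up additional CUDA libraries (which could explain why only the first profiled range fails) is a guess, not something confirmed in this thread.

```python
import contextlib

try:
    import torch
except ImportError:  # lets the sketch load on a machine without PyTorch
    torch = None


def cuda_ready() -> bool:
    """True only when PyTorch is installed and a CUDA device is usable."""
    return torch is not None and torch.cuda.is_available()


def warm_up_cuda() -> bool:
    """Create the CUDA context and run one small addmm outside any NVTX range."""
    if not cuda_ready():
        return False
    torch.cuda.init()  # explicit CUDA initialization, before any profiled range
    bias = torch.zeros(8, device="cuda")
    x = torch.ones(4, 16, device="cuda")
    w = torch.ones(16, 8, device="cuda")
    # Assumption: the first addmm may trigger lazy library setup, so run it here.
    torch.addmm(bias, x, w)
    torch.cuda.synchronize()  # make sure the warm-up work has finished
    return True


@contextlib.contextmanager
def nvtx_range(name: str):
    """Wrap a block in an NVTX push/pop range; a no-op when CUDA is unavailable."""
    if not cuda_ready():
        yield
        return
    torch.cuda.nvtx.range_push(name)
    try:
        yield
    finally:
        torch.cuda.nvtx.range_pop()
```

The intended use would be to call `warm_up_cuda()` once at the top of main.py, before the model runs, and then wrap only the op of interest, for example `with nvtx_range("MHA_addmm_qkv"): ...`. Whether this removes the cuInit call from the range depends on what actually triggers it in your script.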

@felix_dt

As you can see in mymain.txt,
I already call cudaProfilerStart in main,
so I thought CUDA had already been initialized by that point.

App-range replay takes a very long time to profile, and even after profiling finishes, the GUI lags so badly that it is hard to use.