nsys launch of Julia causes conflicts with certain .dll files on Windows 11

When I run the following command to launch Julia through nsys,
nsys launch [path\to\julia]
the Julia REPL starts successfully. But when I then run
using CUDA
I get the following warning:

┌ Warning: CUDA runtime library cupti64_120.dll was loaded from a system path. This may cause errors.
│ Ensure that you have not set the LD_LIBRARY_PATH environment variable, or that it does not contain paths to CUDA libraries.
└ @ CUDA C:\Users\hugo1\.julia\packages\CUDA\tVtYo\src\initialization.jl:173
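
To see which copy of CUPTI the warning refers to, the libraries loaded into the current process can be listed from the same REPL. This is only a minimal sketch using the Libdl standard library; the helper name `loaded_libs_matching` is hypothetical and not part of CUDA.jl or nsys.

```julia
using Libdl

# Hypothetical helper: return the paths of all libraries currently loaded
# into this Julia process whose path contains `needle` (case-insensitive).
function loaded_libs_matching(needle::AbstractString)
    return filter(p -> occursin(lowercase(needle), lowercase(p)), Libdl.dllist())
end

# Run this after `using CUDA`; it prints one path per matching library
# (nothing is printed if no matching library is loaded).
foreach(println, loaded_libs_matching("cupti"))
```

A path outside the .julia directory would be consistent with the warning that cupti64_120.dll came from a system location.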

If I continue by creating a CUDA array and doing a basic computation,

a = CUDA.rand(5);
sin.(a)

it returns a long error message:

ERROR: Failed to compile PTX code (ptxas exited with code 3221225477)
Invocation arguments: --generate-line-info --verbose --gpu-name sm_86 --output-file C:\Users\hugo1\AppData\Local\Temp\jl_1adCNWXW9k.cubin C:\Users\hugo1\AppData\Local\Temp\jl_ZydkBwTu7O.ptx
ptxas info    : 24 bytes gmem
ptxas info    : Compiling entry function '_Z16broadcast_kernel15CuKernelContext13CuDeviceArrayI7Float32Li1ELi1EE11BroadcastedI12CuArrayStyleILi1EE5TupleI5OneToI5Int64EE3sinS4_I8ExtrudedIS0_IS1_Li1ELi1EES4_I4BoolES4_IS6_EEEES6_' for 'sm_86'
ptxas info    : Function properties for _Z16broadcast_kernel15CuKernelContext13CuDeviceArrayI7Float32Li1ELi1EE11BroadcastedI12CuArrayStyleILi1EE5TupleI5OneToI5Int64EE3sinS4_I8ExtrudedIS0_IS1_Li1ELi1EES4_I4BoolES4_IS6_EEEES6_
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 29 registers, 456 bytes cmem[0], 8 bytes cmem[2]
If you think this is a bug, please file an issue and attach C:\Users\hugo1\AppData\Local\Temp\jl_ZydkBwTu7O.ptx
Stacktrace:
  [1] error(s::String)
    @ Base .\error.jl:35
  [2] compile(job::GPUCompiler.CompilerJob)
    @ CUDA C:\Users\hugo1\.julia\packages\CUDA\tVtYo\src\compiler\compilation.jl:188
  [3] actual_compilation(cache::Dict{Any, CuFunction}, src::Core.MethodInstance, world::UInt64, cfg::GPUCompiler.CompilerConfig{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}, compiler::typeof(CUDA.compile), linker::typeof(CUDA.link))
    @ GPUCompiler C:\Users\hugo1\.julia\packages\GPUCompiler\YO8Uj\src\execution.jl:125
  [4] cached_compilation(cache::Dict{Any, CuFunction}, src::Core.MethodInstance, cfg::GPUCompiler.CompilerConfig{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}, compiler::Function, linker::Function)
    @ GPUCompiler C:\Users\hugo1\.julia\packages\GPUCompiler\YO8Uj\src\execution.jl:103
  [5] macro expansion
    @ C:\Users\hugo1\.julia\packages\CUDA\tVtYo\src\compiler\execution.jl:318 [inlined]
  [6] macro expansion
    @ .\lock.jl:267 [inlined]
  [7] cufunction(f::GPUArrays.var"#broadcast_kernel#26", tt::Type{Tuple{CUDA.CuKernelContext, CuDeviceVector{Float32, 1}, Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{1}, Tuple{Base.OneTo{Int64}}, typeof(sin), Tuple{Base.Broadcast.Extruded{CuDeviceVector{Float32, 1}, Tuple{Bool}, Tuple{Int64}}}}, Int64}}; kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ CUDA C:\Users\hugo1\.julia\packages\CUDA\tVtYo\src\compiler\execution.jl:313
  [8] cufunction
    @ C:\Users\hugo1\.julia\packages\CUDA\tVtYo\src\compiler\execution.jl:310 [inlined]
  [9] macro expansion
    @ C:\Users\hugo1\.julia\packages\CUDA\tVtYo\src\compiler\execution.jl:104 [inlined]
 [10] #launch_heuristic#1080
    @ C:\Users\hugo1\.julia\packages\CUDA\tVtYo\src\gpuarrays.jl:17 [inlined]
 [11] launch_heuristic
    @ C:\Users\hugo1\.julia\packages\CUDA\tVtYo\src\gpuarrays.jl:15 [inlined]
 [12] _copyto!
    @ C:\Users\hugo1\.julia\packages\GPUArrays\5XhED\src\host\broadcast.jl:65 [inlined]
 [13] copyto!
    @ C:\Users\hugo1\.julia\packages\GPUArrays\5XhED\src\host\broadcast.jl:46 [inlined]
 [14] copy
    @ C:\Users\hugo1\.julia\packages\GPUArrays\5XhED\src\host\broadcast.jl:37 [inlined]
 [15] materialize(bc::Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{1}, Nothing, typeof(sin), Tuple{CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}})
    @ Base.Broadcast .\broadcast.jl:873
 [16] top-level scope
    @ REPL[2]:1
 [17] top-level scope
    @ C:\Users\hugo1\.julia\packages\CUDA\tVtYo\src\initialization.jl:185
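
The ptxas exit code in the error above is easier to interpret in hexadecimal. A quick conversion in the REPL, as a sketch that assumes the code is the raw Windows process exit status:

```julia
# ptxas exit code copied from the error message above
exit_code = 3221225477

# Convert to an upper-case hexadecimal string
hex_code = "0x" * uppercase(string(exit_code, base = 16))
println(hex_code)  # prints 0xC0000005
```

0xC0000005 is the Windows status code STATUS_ACCESS_VIOLATION, which suggests that ptxas crashed rather than rejected the PTX input.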

Finally, I am using an RTX 3050 with the following software:

C:\Windows\System32>nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Fri_Jan__6_19:04:39_Pacific_Standard_Time_2023
Cuda compilation tools, release 12.0, V12.0.140
Build cuda_12.0.r12.0/compiler.32267302_0

C:\Windows\System32>nsys --version
NVIDIA Nsight Systems version 2022.4.2.50-32196742v0

BTW, I also sometimes get “command ignored” when I try to launch Julia, and the workarounds mentioned in
Nsys launch julia hangs on on Windows 11
don’t always work.

Thanks for the help.

@dofek, can you take a look at this?

I used the recently released Nsight Systems 2023.3.1 and managed to successfully trace Julia with the following settings:

  1. I used the std streams workaround by setting “EnableStdOutErrCapture=false” in C:\Program Files\NVIDIA Corporation\Nsight Systems 2023.3.1\target-windows-x64\config.ini.
  2. I started an admin terminal console and used the following command line: “c:\Program Files\NVIDIA Corporation\Nsight Systems 2023.3.1\target-windows-x64\nsys.exe” profile -t cuda C:\Users\dofek\AppData\Local\Programs\Julia-1.9.2\bin\julia.exe

The absolute path in step #2 is necessary because nsys does not search for target paths in the Path environment variable. This is a known issue and will be fixed in a future release.
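
Because nsys needs the absolute path of the target, it can be looked up from Julia itself instead of being typed by hand. The following is a rough sketch only: `nsys_profile_command` is a hypothetical helper that builds the command from step 2 without running it, the nsys.exe location is the one quoted in step 2, and the executable name assumes a regular (non-debug) Julia build.

```julia
# Absolute path of the Julia executable that is currently running.
# Assumes a regular (non-debug) build, so the file is julia.exe on Windows.
julia_exe = joinpath(Sys.BINDIR, Sys.iswindows() ? "julia.exe" : "julia")

# nsys.exe location as quoted in step 2; adjust to the installed version.
nsys_exe = raw"C:\Program Files\NVIDIA Corporation\Nsight Systems 2023.3.1\target-windows-x64\nsys.exe"

# Hypothetical helper: build (but do not run) the profiling command.
# Interpolating into a Cmd keeps paths with spaces as single arguments.
function nsys_profile_command(nsys::AbstractString, target::AbstractString)
    return `$nsys profile -t cuda $target`
end

cmd = nsys_profile_command(nsys_exe, julia_exe)
println(cmd)
```

Running `julia_exe |> println` in a normal Julia session and pasting the result into the admin console is enough; building a `Cmd` is just a convenient way to avoid quoting mistakes.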