When I run the command nsys launch julia, the process is forced to quit unexpectedly.
I also tried to profile a CUDA kernel written in Julia using the following command:
nsys profile --trace=cuda,nvtx julia test.jl
But the profiling report did not capture any CUDA events. The analysis report always showed the following messages:
| Analysis | 1408 | 00:04.805 | NVTX profiling might not have started correctly.
| Analysis | 1408 | 00:04.805 | No NVTX events collected. Does the process use NVTX?
| Analysis | | 00:04.805 | CUDA profiling might not have started correctly.
| Analysis | | 00:04.805 | No CUDA events collected. Does the process use CUDA?
I’m currently using this version of Nsight Systems:
(base) PS C:\Users\huiyu> nsys --version
NVIDIA Nsight Systems version 2024.4.2.133-244234382004v0
I also tested with a newer version of nsys, but the problem remains unchanged.
test.jl is just my simple CUDA kernel example for trying out Nsight Systems with Julia. You can write any kernel you like to test the profiling, or feel free to use mine if you don’t mind.
using CUDA

# Define the kernel to add elements of two arrays
function vector_add_kernel(a, b, c, n)
    i = threadIdx().x + (blockIdx().x - 1) * blockDim().x
    if i <= n
        @inbounds c[i] = a[i] + b[i]
    end
    return
end

# Host code to set up and launch the kernel
function vector_add_test()
    n = 1024
    a = CUDA.fill(1.0f0, n)     # Initialize array `a` with 1.0 (float32)
    b = CUDA.fill(2.0f0, n)     # Initialize array `b` with 2.0 (float32)
    c = CUDA.zeros(Float32, n)  # Output array `c`

    # Define grid and block dimensions
    threads = 256
    blocks = cld(n, threads)

    # Launch the kernel
    @cuda threads = threads blocks = blocks vector_add_kernel(a, b, c, n)

    # Transfer result back to host
    result = Array(c)

    # Verify the result
    is_correct = all(result .== 3.0f0)  # Each element should be 1.0 + 2.0 = 3.0
    if is_correct
        println("Test PASSED")
    else
        println("Test FAILED")
    end
end

# Run the test
vector_add_test()
Until this is fixed, you can work around the issue by removing nvtx from the --trace= option. NVTX annotations will no longer be traced, but CUDA tracing will still work. I have verified that with the --trace=cuda option, the CUDA activities from the test.jl script that you shared are captured successfully.
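For reference, the workaround amounts to the original profile command with nvtx dropped from the trace list. The sketch below assumes a POSIX shell with nsys and julia on PATH and test.jl in the current directory. The TRACE_OPTS variable and the PATH check are illustrative additions, not part of the original commands.

```shell
# Workaround sketch: trace CUDA only, with nvtx removed from --trace.
# TRACE_OPTS is a hypothetical helper variable used for illustration.
TRACE_OPTS="cuda"

if command -v nsys >/dev/null 2>&1; then
    # Same invocation as in the question, minus the nvtx trace.
    nsys profile --trace="$TRACE_OPTS" julia test.jl
else
    # Keeps the sketch harmless on machines without Nsight Systems.
    echo "nsys not found on PATH; skipping profile run" >&2
fi
```

In PowerShell, the nsys line can be run on its own as nsys profile --trace=cuda julia test.jl.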