I’m trying to use Nsight Systems with the Julia programming language, but it looks like spawning processes does not work when running under nsys:
$ julia -e 'run(`echo 1`); println(2);'
$ nsys launch julia -e 'run(`echo 1`); println(2);'
**** launch command configuration ****
inherit-environment = true
show-output = true
trace-fork-before-exec = false
sample_cpu = true
backtrace_method = LBR
wait = all
trace_cublas = false
trace_cuda = true
trace_cudnn = false
trace_nvtx = true
trace_mpi = false
trace_openacc = false
trace_vulkan = false
trace_opengl = true
trace_osrt = true
osrt-threshold = 1000 nanoseconds
cudabacktrace = false
cudabacktrace-threshold = 0 nanoseconds
profile_processes = tree
application command = julia
application arguments = -e run(`echo 1`); println(2);
application working directory = /home/tim
NVTX profiler range trigger =
NVTX profiler domain trigger =
The target application returned non-zero exit code 11
Code 11 indicates segfault on Linux. I’m running on 64-bit Linux, using Nsight Systems 2019.5.1 on driver 435.21, testing here with Julia 1.2.
It’s possible that this is specific to how Julia launches processes, since the same thing works with e.g. Python, but it’s hard to debug like this.
Attaching gdb to the julia target process right before the segfault:
$ nsys launch julia -e 'run(`sleep 10`)'
$ gdb --pid=$(pidof julia)
0x00007f6a0732f436 in epoll_pwait () from /usr/lib/libc.so.6
Thread 1 "julia" received signal SIGSEGV, Segmentation fault.
0x0000000000000000 in ?? ()
#0 0x0000000000000000 in ?? ()
#1 0x00007f6a08164bd6 in ?? () from nsight-systems-2019.5.1/target-linux-x64/libToolsInjection64.so
#2 <signal handler called>
#3 0x00007f6a0732f436 in epoll_pwait () from /usr/lib/libc.so.6
#4 0x00007f6a081e1b89 in ?? () from nsight-systems-2019.5.1/target-linux-x64/libToolsInjection64.so
Looks like a null pointer dereference in the injection libraries?
Thank you for filing this bug with detailed information. The issue is related to our Operating System Runtime (OSRT) tracing feature. It is available through the “--trace osrt” option, which is enabled by default, and it makes Nsight Systems trace some of the interesting calls into the Glibc libraries. To work around the issue, just make sure OSRT tracing is not enabled when using the tool. I filed an internal bug so we can try to fix the issue, and I will post progress updates here. The gory technical details are below, if you are interested.
One tricky part is making sure that we do not trace any Glibc calls while we are running inside a signal handler. As specified by POSIX, a multi-threaded application can only perform async-signal-safe calls from within a signal handler. To be safe, we have to wrap the signal handlers set by the application so that tracing is disabled for the duration of the application’s signal handler call.
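As a rough illustration of that wrapping idea, here is a sketch in Python. It is not the injection library's code, which is not shown in this thread. The tool registers its own wrapper with the OS and keeps the application's handler in a table. The names tracing_enabled, real_handlers and install_wrapped_handler are hypothetical, and SIGUSR1 is an arbitrary choice.

```python
import signal

tracing_enabled = True   # hypothetical flag that a tracer would consult
real_handlers = {}       # signal number -> the application's own handler


def _wrapper(signum, frame):
    """Registered with the OS in place of the application's handler."""
    global tracing_enabled
    previous = tracing_enabled
    tracing_enabled = False          # no tracing while the app handler runs
    try:
        real_handlers[signum](signum, frame)
    finally:
        tracing_enabled = previous   # restore tracing afterwards


def install_wrapped_handler(signum, handler):
    """Remember the real handler, then install the wrapper for that signal."""
    real_handlers[signum] = handler
    signal.signal(signum, _wrapper)


observed = []


def app_handler(signum, frame):
    # Record whether tracing was on while the application handler ran.
    observed.append(tracing_enabled)


install_wrapped_handler(signal.SIGUSR1, app_handler)
signal.raise_signal(signal.SIGUSR1)  # deliver the signal to this process
```

The application handler sees tracing switched off, and the flag is restored once it returns. Note that the wrapper calls whatever is stored in the table without checking it, which matters for the crash discussed further down.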
Julia relies on “libuv” to handle some platform-specific details, such as the creation of processes. It uses a modified version of “libuv” that relies on “vfork” instead of “fork”. The rationale is to both reduce the memory consumption of the child process and speed up process creation.
The main differences between “fork” and “vfork” are the following.
- A process created through "vfork" shares its parent's memory.
- When using "vfork", the parent process is blocked until a call to "_exit" or "execve" is performed in the child process.
- You are not allowed to call anything other than "_exit" or "execve" in a process created by "vfork".
- The "pthread_atfork" callbacks are not called on "vfork".
The issue here is that Julia does not respect the specification: a lot of prohibited calls are performed in the child process between “vfork” and “execve”.
One of those things is resetting all signal handlers to the default, which is a big issue for us. In order to wrap the application signals, we have to maintain a table of function pointers corresponding to the real signal handlers set by the application. When the child process created through “vfork” resets the signal handlers to the default, we modify that table accordingly. The problem is that we are also modifying the table that we maintain for the parent process, since the child and parent memory are shared at this point. So the next signal received by the parent results in a crash, because the callback registered in the table is “SIG_DFL”, which is “0x0” on Linux.
So for the backtrace you see in the crash:
Frame #1 is our signal handler wrapper.
Frame #0 is the call to what we consider the application signal handler which was reset to "0" by the child process created through "vfork".
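The failure mode can be modelled in a few lines. This is a simplified Python model, not real "vfork" and not the actual Nsight Systems code. Shared memory is imitated by letting the "child" use the very same dict object as the parent, while "fork" is imitated with a copy. All names are hypothetical, SIGUSR1 is an arbitrary choice, and the TypeError raised when calling signal.SIG_DFL stands in for the jump to address 0x0.

```python
import signal


def app_handler(signum, frame):
    return "handled"


def wrapper(table, signum):
    # Mirrors frame #1: look up the "real" handler and call it unconditionally.
    return table[signum](signum, None)


def child_resets_handlers(table):
    # Mirrors the child resetting every handler to the default before exec.
    for signum in list(table):
        table[signum] = signal.SIG_DFL


parent_table = {signal.SIGUSR1: app_handler}

# "fork" model: the child works on its own copy of the table.
fork_child_table = dict(parent_table)
child_resets_handlers(fork_child_table)
after_fork = wrapper(parent_table, signal.SIGUSR1)   # parent is unaffected

# "vfork" model: the child shares the parent's memory (same object).
vfork_child_table = parent_table
child_resets_handlers(vfork_child_table)
try:
    wrapper(parent_table, signal.SIGUSR1)            # mirrors frame #0
    crashed = False
except TypeError:
    # SIG_DFL is not a callable, just like 0x0 is not a valid function.
    crashed = True
```

In the copy case the parent's wrapper still reaches app_handler. In the shared case the parent's own table entry has been replaced by SIG_DFL, so the next dispatch fails.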
We will see what we can do to work around the issue.
Thanks for the detailed response! It seems like the use of vfork is a conscious decision, and has been in production for a while without apparent issues. For now though, I’ll stick to running without OSRT.
nsys launch + start has revolutionized how I profile my Julia + GPU code, making it possible to profile from within an interactive programming session: https://github.com/JuliaGPU/CUDAdrv.jl/pull/166
Are there any library endpoints to work with the injected profiler? For now, I’m calling nsys start from within the nsys launched process, which seems to work well but is a little ugly.
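A minimal sketch of that shell-out pattern, written in Python rather than Julia, and not the code from the PR above. It only issues the nsys start and nsys stop commands mentioned in this thread. The helper name nsys_capture and its injectable run parameter are made up for illustration. For real use it assumes nsys is on the PATH and the process was started under nsys launch.

```python
import subprocess
from contextlib import contextmanager


@contextmanager
def nsys_capture(run=subprocess.run):
    """Collect a profile only while the body of the with-block runs."""
    run(["nsys", "start"], check=True)       # begin collection
    try:
        yield
    finally:
        run(["nsys", "stop"], check=True)    # end collection, even on error


# Dry run: a stand-in runner records the commands instead of executing them.
issued = []


def fake_run(cmd, check):
    issued.append(cmd)


with nsys_capture(run=fake_run):
    issued.append("workload")                # placeholder for the GPU code
```

With the default runner, the same with-block would wrap the code of interest between a real nsys start and nsys stop.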
You could add cudaProfilerStart/Stop API calls or NVTX ranges around the code you want to profile. The nsys CLI supports the --capture-range= switch, which lets you launch the application so that profiling starts when cudaProfilerStart is invoked and stops when cudaProfilerStop is invoked in the application. The same goes for NVTX ranges. Hope this helps. See the section called “Example Interactive CLI Command Sequences” in our docs https://docs.nvidia.com/nsight-systems/#nsight_systems/2019.5.1-x86/06-cli-profiling.htm%3FTocPath%3D_____6
Sure, but then I’d still need to nsys start -c cudaProfilerApi some time before calling the profiler API, right? I also noticed ‘sharper’ profiles (including less time before and after the start/end of profiling) if I use nsys stop instead of relying on the profiler API hooks.
If you want to use capture ranges, the easiest way is certainly to use “nsys profile” directly.
$ nsys profile --capture-range cudaProfilerApi --trace cuda,nvtx <Application>