Segfaults during Deep Learning Profiling [BUG?]

Software Version
Nsight Systems 2023.4 for Linux Target
Nsight GUI for macOS host

I am a very frequent user of Nsight Systems and I have to mention that it is the best tool for in-depth profiling of kernels.

Currently, my use case is training GPT-3 350M in these setups:

  1. Multi-node on the Perlmutter supercomputer: 4x A100 80 GB per node
  2. Single-node on Azure NDv2: 8x V100 32 GB

So far, I always get memory-allocation errors or segmentation faults, even when the trace contains only around 200k CUDA events. Without profiling, my workload runs fine.

Some of the errors have been:

double free or corruption (fasttop)
malloc_consolidate(): invalid chunk size
segmentation fault (core dumped)
malloc(): smallbin double linked list corrupted

After much experimenting, the command below is the one that collects the most data before nsys terminates with the segfault:

nsys profile --kill none --delay 120 --duration 20 -t cuda,nvtx <app>

Here is what I have tried, without success:

  1. Disabling CPU sampling and context-switch tracing with -s none --cpuctxsw none
  2. Reducing --duration n to the bare minimum that still gives a meaningful trace
  3. Sweeping --delay n from n = 0 to about 300
  4. Running with and without --kill none
  5. Using nsys launch and nsys start instead of nsys profile
  6. Setting LD_PRELOAD, as suggested here:
    export LD_PRELOAD=/usr/lib/x86_64-linux-gnu/
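Approaches 1 to 3 can be combined into a single sweep. This is a minimal bash sketch, not a recommended recipe: the training entry point train_gpt3.py and the gpt3_delay_* report names are hypothetical placeholders, and by default (DRY_RUN=1, a hypothetical variable of this script) it only prints the nsys commands instead of running them. All nsys flags used are the ones listed above, plus -o for the report name.

```shell
#!/usr/bin/env bash
# Hypothetical sweep over nsys --delay values (approaches 1-3 above).
# APP is a placeholder for the real training command; replace before use.
APP=(python train_gpt3.py)
# DRY_RUN=1 (default) only prints the commands; DRY_RUN=0 runs them.
DRY_RUN="${DRY_RUN:-1}"

for delay in 0 60 120 180 240 300; do
    # CUDA + NVTX tracing only, CPU sampling and context-switch tracing off,
    # fixed 20 s window, one report file per delay value.
    cmd=(nsys profile -t cuda,nvtx -s none --cpuctxsw none
         --delay "$delay" --duration 20 -o "gpt3_delay_${delay}"
         "${APP[@]}")
    if [ "$DRY_RUN" = 1 ]; then
        printf '%s\n' "${cmd[*]}"
    else
        "${cmd[@]}" || echo "delay=${delay}: nsys exited with status $?" >&2
    fi
done
```

Because each delay value writes its own report, a crash at one delay does not overwrite traces collected at earlier ones.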

To reproduce, feel free to run the GPT-3 script shared above.


  1. Is my workload too large for the default memory limits of nsys? If so, what can be done?
  2. I have not experimented with --cuda-flush-interval yet. Would it help?

@liuyis, could you please help out here?

Hi @Osayamen, there’s a recent bug that might be related to this. Could you try the following:

  1. Run nsys -z and you will see a path to a config file
  2. Open or create that config file
  3. Add a new line at the end: CuptiUsePerThreadBuffer=false
  4. Repeat your profiling command and see if the issue persists
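Steps 1 to 3 can be scripted. This is a minimal bash sketch, assuming GNU coreutils and that nsys -z prints only the config-file path on stdout; the function name add_nsys_option is hypothetical. Every expansion is quoted because the default path contains a space, the parent directory is created if it is missing, and the option is appended only if it is not already there.

```shell
#!/usr/bin/env bash
# Hypothetical helper: add a KEY=VALUE line to an nsys config file once.
# $1: full path to nsys-config.ini (e.g. the output of `nsys -z`)
# $2: the option line, e.g. CuptiUsePerThreadBuffer=false
add_nsys_option() {
    local cfg="$1" opt="$2"

    # Create the parent directory (e.g. ~/.config/NVIDIA Corporation) if missing.
    # Quoting keeps the space in "NVIDIA Corporation" inside one argument.
    mkdir -p -- "$(dirname -- "$cfg")" || return 1

    # Already present as an exact line? Then nothing to do.
    if grep -qxF -- "$opt" "$cfg" 2>/dev/null; then
        return 0
    fi

    # If the file exists and does not end with a newline, add one first so
    # the option starts on its own line.
    if [ -s "$cfg" ] && [ -n "$(tail -c 1 -- "$cfg")" ]; then
        printf '\n' >> "$cfg"
    fi
    printf '%s\n' "$opt" >> "$cfg"
}
```

Usage: add_nsys_option "$(nsys -z)" CuptiUsePerThreadBuffer=false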

Let me know the result or if you have any questions, thanks!

Thank you for your quick response, @liuyis

Running nsys -z returns ~/.config/NVIDIA Corporation/nsys-config.ini

Notice there is a space between NVIDIA and Corporation. Ubuntu (and Linux in general?) does not handle spaces in folder names very well.

I tried creating the directory as suggested here, but ls shows 'NVIDIA Corporation' with literal single quotes around it.

So can I try any of the following?

  1. NVIDIA_Corporation
  2. NVIDIA-Corporation
  3. NVIDIACorporation

The name has to match exactly what nsys -z prints for Nsys to load the file correctly. Could you try wrapping the path in quotes? Here are the commands that worked on my Ubuntu workstation:

liuyis@liuyis-ws-cn:~$ rm "/home/liuyis/.config/NVIDIA Corporation/nsys-config.ini"
liuyis@liuyis-ws-cn:~$ ls "/home/liuyis/.config/NVIDIA Corporation/nsys-config.ini"
ls: cannot access '/home/liuyis/.config/NVIDIA Corporation/nsys-config.ini': No such file or directory
liuyis@liuyis-ws-cn:~$ touch "/home/liuyis/.config/NVIDIA Corporation/nsys-config.ini"
liuyis@liuyis-ws-cn:~$ ls "/home/liuyis/.config/NVIDIA Corporation/nsys-config.ini"
'/home/liuyis/.config/NVIDIA Corporation/nsys-config.ini'
liuyis@liuyis-ws-cn:~$ vim "/home/liuyis/.config/NVIDIA Corporation/nsys-config.ini"
liuyis@liuyis-ws-cn:~$ cat "/home/liuyis/.config/NVIDIA Corporation/nsys-config.ini"

On my Azure VM, your touch command actually fails with "No such file or directory", because touch requires the NVIDIA Corporation directory to already exist.

Nevertheless, from this it seems that ls showing the directory name in quotes is purely cosmetic.

I created the directory with mkdir "NVIDIA Corporation" and added CuptiUsePerThreadBuffer=false to nsys-config.ini.
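The quoting confusion above comes from two separate things: the shell splits an unquoted path at the space, while GNU ls (coreutils 8.25 and later) merely displays names containing spaces in quotes when writing to a terminal. A minimal sketch, assuming bash and GNU coreutils, run in a throwaway directory rather than the real ~/.config:

```shell
#!/usr/bin/env bash
# Throwaway directory so the real ~/.config is not touched.
base="$(mktemp -d)"
dir="$base/NVIDIA Corporation"

# Quoted: a single argument, so one directory whose name contains a space.
mkdir -p -- "$dir"

# Unquoted, touch $dir/nsys-config.ini would be split at the space into
# "$base/NVIDIA" and "Corporation/nsys-config.ini". Always quote:
touch -- "$dir/nsys-config.ini"

# ls only adds quotes for terminal display; -N (--literal) turns that off.
ls -N -- "$base"

# Globbing returns the real name, with no quote characters in it.
for entry in "$base"/*; do
    printf '[%s]\n' "${entry##*/}"   # prints [NVIDIA Corporation]
done
```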

I am currently running the profiling job and will report back once I have results.

So far, it seems to be working with no seg faults!

I am currently stress testing to be sure, but the current state is a huge improvement from before.

Yes, it indeed works fine now! Thanks, @liuyis, I marked your reply as the solution.

Multi-node profiling ✅
Single-node profiling with a 1000 s delay ✅
Single-node profiling with and without CPU profiling ✅

Do you know whether Nsight Compute needs a similar config option, like CuptiUsePerThreadBuffer=false?

Thanks for confirming. This is a bug we've been looking into internally. You can use this config option as a workaround (WAR) until we fix it.
This config option is specific to Nsys and does not apply to Ncu.
