Docker image: Segmentation fault (core dumped)

I was trying to run Nvidia Modulus the way Nvidia recommends, i.e. using Docker containers. However, I realised I can’t run even a single example: every one of them fails with Segmentation fault (core dumped).

I am using the JupyterLab terminal of a remote machine from paperspace.com (similar to Google Colab) to run the example files (.py files).

The remote computer has 30 GB of RAM (non-ECC) and a Quadro RTX 5000 with 16 GB of memory. Here is the output of nvidia-smi.

I have attached the full error message.

error.txt (13.7 KB)

Looking at your logs, it seems to be a failure in NCCL that got caught by the UCX segfault handler. This is all running on a single node with one GPU, correct? Can you confirm which version of the container you are using?

Also, can you re-run with export NCCL_DEBUG=INFO so we get more verbose logs for the NCCL failures?

Hi. Thanks for the reply. I am using version 21.06. Yes, this is on a single node with 1 GPU. Sorry, how do I use export NCCL_DEBUG=INFO?

As I mentioned in the first post, I am inside an interactive session, so I can’t use docker run. Also, python ldc_2d.py export NCCL_DEBUG=INFO doesn’t work.

Fortunately, the simulation is now running without any error, for some reason. Still, a solution would be helpful in case I encounter the problem again.

Glad it’s working for you now. For future reference, if you encounter this problem again, you can turn on the NCCL debug logs by running:
NCCL_DEBUG=INFO python ldc_2d.py
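
For context, writing the variable in front of the command sets it only for that one run, whereas export NCCL_DEBUG=INFO on its own line sets it for the rest of the terminal session. Anything placed after the script name (as in python ldc_2d.py export NCCL_DEBUG=INFO) is passed to the script as ordinary arguments and never reaches the environment. If launching from Python is more convenient, for example from a notebook cell, the same thing can be sketched roughly as below. The helper name run_with_nccl_debug is made up for illustration and is not part of Modulus or NCCL; it only assumes the script is started as a separate process.

```python
import os
import subprocess
import sys


def run_with_nccl_debug(command, level="INFO", **run_kwargs):
    """Run `command` in a child process with NCCL_DEBUG set to `level`.

    Hypothetical helper for illustration only. Extra keyword arguments
    are forwarded unchanged to subprocess.run.
    """
    # Copy the current environment so the parent process is left untouched.
    env = os.environ.copy()
    env["NCCL_DEBUG"] = level
    # Without extra arguments the child's output goes straight to the terminal.
    return subprocess.run(command, env=env, **run_kwargs)
```

Calling run_with_nccl_debug([sys.executable, "ldc_2d.py"]) should then behave like the one-line command above, with the extra NCCL log lines appearing alongside the script's normal output.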
