Coredump - Graphics SM Warp Exception on (GPC 0, TPC 0): Out Of Range Address - Quadro P6000

Hello,

Is the Quadro P6000 compatible with the NGC U-Net Industrial model described here: Building Image Segmentation Faster Using Jupyter Notebooks from NGC?

I first got the error below when launching UNet_1GPU.sh, which I solved by adding --gpus=all to the docker run command:

2021-02-19 00:21:02.643635: F tensorflow/stream_executor/lib/statusor.cc:34] Attempting to fetch value instead of handling error Not found: no CUDA devices found
[94deb4fc991d:00583] *** Process received signal ***
[94deb4fc991d:00583] Signal: Aborted (6)
[94deb4fc991d:00583] Signal code:  (-6)
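
For reference, here is a minimal sketch of a check that could be run inside the container before launching training, to confirm that the GPU is visible at all (the "no CUDA devices found" case above). It only relies on `nvidia-smi -L`, which lists the GPUs the driver exposes. The helper names `parse_gpu_list` and `visible_gpus` are hypothetical, and the sample UUID in the comment is a placeholder, not taken from this machine.

```python
import shutil
import subprocess


def parse_gpu_list(text):
    """Extract GPU names from `nvidia-smi -L` style output."""
    names = []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("GPU ") and ":" in line:
            # "GPU 0: Quadro P6000 (UUID: GPU-...)" -> "Quadro P6000"
            rest = line.split(":", 1)[1].strip()
            names.append(rest.split(" (UUID", 1)[0].strip())
    return names


def visible_gpus():
    """Return the GPU names visible in this environment, or [] if none."""
    if shutil.which("nvidia-smi") is None:
        # nvidia-smi is not on PATH, e.g. container started without --gpus=all
        return []
    result = subprocess.run(
        ["nvidia-smi", "-L"],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,  # works on the container's Python 3.6
    )
    if result.returncode != 0:
        return []
    return parse_gpu_list(result.stdout)


if __name__ == "__main__":
    gpus = visible_gpus()
    if gpus:
        print("Visible GPUs:", ", ".join(gpus))
    else:
        print("No GPU visible; check that the container was started with --gpus=all")
```

An empty result here would match the first error; a non-empty one only shows that the device is exposed, not that the model supports it.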

Here’s some information from the container:

root@94deb4fc991d:/workspace/unet_industrial/scripts# nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Wed_May__6_19:09:25_PDT_2020
Cuda compilation tools, release 11.0, V11.0.167
Build cuda_11.0_bu.TC445_37.28358933_0

The container's Ubuntu version:

cat /etc/issue
Ubuntu 18.04.4 LTS \n \l

Here’s the driver/CUDA info:

nvidia-smi
Fri Feb 19 01:15:37 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro P6000        On   | 00000000:41:00.0 Off |                  Off |
| 26%   29C    P8     8W / 250W |     73MiB / 24446MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                           
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

Here’s some info about the GPU card:

41:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP102GL [Quadro P6000] [10de:1b30] (rev a1) (prog-if 00 [VGA controller])
Subsystem: NVIDIA Corporation GP102GL [Quadro P6000] [10de:11a0]
Flags: bus master, fast devsel, latency 0, IRQ 90
Memory at 9e000000 (32-bit, non-prefetchable) [size=16M]
Memory at 80000000 (64-bit, prefetchable) [size=256M]
Memory at 90000000 (64-bit, prefetchable) [size=32M]
I/O ports at 2000 [size=128]
Expansion ROM at 000c0000 [virtual] [disabled] [size=128K]
Capabilities: <access denied>
Kernel driver in use: nvidia
Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia

Here’s the error message displayed when running the script:

./UNet_1GPU.sh /results /data 1

Output:

DLL 2021-02-19 00:58:50.835391 - PARAMETER # Total Trainable Parameters : 1850305 
2021-02-19 00:59:33.620691: F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:273] Unexpected Event status: 1
[42539eb4c18b:01116] *** Process received signal ***
[42539eb4c18b:01116] Signal: Aborted (6)
[42539eb4c18b:01116] Signal code:  (-6)
[42539eb4c18b:01116] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x3ef20)[0x7fd3c0283f20]
[42539eb4c18b:01116] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xc7)[0x7fd3c0283e97]
[42539eb4c18b:01116] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x141)[0x7fd3c0285801]
[42539eb4c18b:01116] [ 3] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so(+0xb67e674)[0x7fd2f6592674]
[42539eb4c18b:01116] [ 4] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow8EventMgr10PollEventsEbPN4absl13InlinedVectorINS0_5InUseELm4ESaIS3_EEE+0x207)[0x7fd2f4469127]
[42539eb4c18b:01116] [ 5] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow8EventMgr8PollLoopEv+0x9f)[0x7fd2f44699af]
[42539eb4c18b:01116] [ 6] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/../libtensorflow_framework.so.1(_ZN5Eigen15ThreadPoolTemplIN10tensorflow6thread16EigenEnvironmentEE10WorkerLoopEi+0x281)[0x7fd2ea248641]
[42539eb4c18b:01116] [ 7] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/../libtensorflow_framework.so.1(_ZNSt17_Function_handlerIFvvEZN10tensorflow6thread16EigenEnvironment12CreateThreadESt8functionIS0_EEUlvE_E9_M_invokeERKSt9_Any_data+0x48)[0x7fd2ea245d38]
[42539eb4c18b:01116] [ 8] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xbd6df)[0x7fd2e17da6df]
[42539eb4c18b:01116] [ 9] /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db)[0x7fd3c002d6db]
[42539eb4c18b:01116] [10] /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f)[0x7fd3c036688f]
[42539eb4c18b:01116] *** End of error message ***
./UNet_1GPU.sh: line 50:  1116 Aborted                 (core dumped) python "${BASEDIR}/../main.py" --unet_variant='tinyUNet' --activation_fn='relu' --exec_mode='train_and_evaluate' --iter_unit='batch' --num_iter=2500 --batch_size=16 --warmup_step=10 --results_dir="${1}" --data_dir="${2}" --dataset_name='DAGM2007' --dataset_classID="${3}" --data_format='NCHW' --use_auto_loss_scaling --noamp --xla --learning_rate=1e-4 --learning_rate_decay_factor=0.8 --learning_rate_decay_steps=500 --rmsprop_decay=0.9 --rmsprop_momentum=0.8 --loss_fn_name='adaptive_loss' --weight_decay=1e-5 --weight_init_method='he_uniform' --augment_data --display_every=250 --debug_verbosity=0

Here’s what kern.log shows on the host machine (Ubuntu 20.04):

Feb 19 02:05:06 franklin kernel: [553014.767409] NVRM: Xid (PCI:0000:41:00): 13, pid=349037, Graphics SM Warp Exception on (GPC 0, TPC 0): Out Of Range Address
Feb 19 02:05:06 franklin kernel: [553014.767430] NVRM: Xid (PCI:0000:41:00): 13, pid=349037, Graphics Exception: ESR 0x504648=0x104000e 0x504650=0x0 0x504644=0xd3eff2 0x50464c=0x17f
Feb 19 02:05:06 franklin kernel: [553014.767501] NVRM: Xid (PCI:0000:41:00): 13, pid=349037, Graphics SM Warp Exception on (GPC 0, TPC 1): Out Of Range Address
Feb 19 02:05:06 franklin kernel: [553014.769726] NVRM: Xid (PCI:0000:41:00): 13, pid=349037, Graphics Exception: ESR 0x52e648=0x129000e 0x52e650=0x20 0x52e644=0xd3eff2 0x52e64c=0x17f
Feb 19 02:05:06 franklin kernel: [553014.770198] NVRM: Xid (PCI:0000:41:00): 43, pid=349325, Ch 00000018
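
In case it helps anyone comparing logs, here is a rough sketch for pulling the Xid events out of kern.log lines like the ones above. The regular expression assumes the exact line layout shown in this excerpt (PCI address, Xid number, pid, message); other Xid lines may be formatted differently. The two descriptions in `XID_NAMES` follow NVIDIA's Xid documentation as far as I know, and the function name `parse_xid_lines` is hypothetical.

```python
import re

# Layout assumed from the kern.log excerpt above:
# "NVRM: Xid (PCI:<addr>): <xid>, pid=<pid>, <message>"
XID_RE = re.compile(
    r"NVRM: Xid \(PCI:(?P<pci>[0-9a-fA-F:.]+)\): "
    r"(?P<xid>\d+), pid=(?P<pid>\d+), (?P<msg>.*)"
)

# Short descriptions for the two Xid codes seen in this thread.
XID_NAMES = {
    13: "Graphics Engine Exception",
    43: "GPU stopped processing",
}


def parse_xid_lines(lines):
    """Return one dict (pci, xid, pid, msg) per Xid line found."""
    events = []
    for line in lines:
        match = XID_RE.search(line)
        if match:
            events.append({
                "pci": match.group("pci"),
                "xid": int(match.group("xid")),
                "pid": int(match.group("pid")),
                "msg": match.group("msg").strip(),
            })
    return events


# Two of the lines from the excerpt above, used as sample input.
SAMPLE = [
    "Feb 19 02:05:06 franklin kernel: [553014.767409] NVRM: Xid (PCI:0000:41:00): 13, pid=349037, Graphics SM Warp Exception on (GPC 0, TPC 0): Out Of Range Address",
    "Feb 19 02:05:06 franklin kernel: [553014.770198] NVRM: Xid (PCI:0000:41:00): 43, pid=349325, Ch 00000018",
]

if __name__ == "__main__":
    for event in parse_xid_lines(SAMPLE):
        name = XID_NAMES.get(event["xid"], "unknown")
        print(event["xid"], name, "pid", event["pid"], "-", event["msg"])
```

Grouping the events by pid or timestamp makes it easier to see that the Xid 13 exceptions come first and the Xid 43 follows them.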

I found one other person with the same issue, but there is no solution there apart from a brief remark about a possible driver issue with older cards (Error: Graphics SM Warp Exception on (GPC 1, TPC 0): Out Of Range Address (Xid 13/Xid 43)).

Thanks!

Added bug report: nvidia-bug-report.log.gz (27.7 MB)

From “setup”:

Your Pascal gpu is not supported.
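
To make that concrete, here is a small sketch that maps a CUDA compute capability to an architecture name and checks it against a supported set. The Quadro P6000 (GP102) has compute capability 6.1, which is Pascal. The reply above only says that Pascal is not supported; the `SUPPORTED` set below (Volta, Turing, Ampere) is my assumption and should be checked against the model's NGC page. The function names are hypothetical.

```python
def architecture(major, minor):
    """Map a CUDA compute capability (major, minor) to an architecture name."""
    if major == 6:
        return "Pascal"
    if major == 7:
        # 7.5 is Turing; 7.0 and 7.2 are Volta
        return "Turing" if minor == 5 else "Volta"
    if major == 8:
        # 8.9 is Ada; 8.0, 8.6 and 8.7 are Ampere
        return "Ada" if minor == 9 else "Ampere"
    return "unknown"


# Assumption: architectures the container targets. Not stated in this thread.
SUPPORTED = {"Volta", "Turing", "Ampere"}


def is_supported(major, minor):
    """Return True if the architecture is in the assumed supported set."""
    return architecture(major, minor) in SUPPORTED


if __name__ == "__main__":
    # Quadro P6000: compute capability 6.1
    print(architecture(6, 1), is_supported(6, 1))
```

Under that assumption the P6000 falls outside the supported set, which is consistent with the reply.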