Hello,
Is the Quadro P6000 compatible with the NGC U-Net Industrial model described here: Building Image Segmentation Faster Using Jupyter Notebooks from NGC?
I got an error when launching UNet_1GPU.sh, which I solved by adding --gpus=all to the docker run command:
2021-02-19 00:21:02.643635: F tensorflow/stream_executor/lib/statusor.cc:34] Attempting to fetch value instead of handling error Not found: no CUDA devices found
[94deb4fc991d:00583] *** Process received signal ***
[94deb4fc991d:00583] Signal: Aborted (6)
[94deb4fc991d:00583] Signal code: (-6)
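For reference, this is roughly the invocation that made the "no CUDA devices found" error go away. The image tag and the host-side mount paths below are placeholders, not the exact ones I used; the key part is the --gpus=all flag, which exposes the host GPUs to the container via the NVIDIA Container Toolkit:

```shell
# Placeholder image tag and host paths -- substitute the NGC image and
# directories you actually use. The fix is the --gpus=all flag.
docker run -it --rm --gpus=all \
    -v /path/to/dagm2007/data:/data \
    -v /path/to/results:/results \
    nvcr.io/nvidia/<unet-industrial-image>:<tag>
```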
Here’s some information from the container:
root@94deb4fc991d:/workspace/unet_industrial/scripts# nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Wed_May__6_19:09:25_PDT_2020
Cuda compilation tools, release 11.0, V11.0.167
Build cuda_11.0_bu.TC445_37.28358933_0
And the container's Ubuntu version:
cat /etc/issue
Ubuntu 18.04.4 LTS \n \l
Here’s the driver/CUDA info:
nvidia-smi
Fri Feb 19 01:15:37 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03 Driver Version: 460.32.03 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Quadro P6000 On | 00000000:41:00.0 Off | Off |
| 26% 29C P8 8W / 250W | 73MiB / 24446MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
Here’s some info about the GPU card:
41:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP102GL [Quadro P6000] [10de:1b30] (rev a1) (prog-if 00 [VGA controller])
Subsystem: NVIDIA Corporation GP102GL [Quadro P6000] [10de:11a0]
Flags: bus master, fast devsel, latency 0, IRQ 90
Memory at 9e000000 (32-bit, non-prefetchable) [size=16M]
Memory at 80000000 (64-bit, prefetchable) [size=256M]
Memory at 90000000 (64-bit, prefetchable) [size=32M]
I/O ports at 2000 [size=128]
Expansion ROM at 000c0000 [virtual] [disabled] [size=128K]
Capabilities: <access denied>
Kernel driver in use: nvidia
Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
Here’s the error message displayed when running the script:
./UNet_1GPU.sh /results /data 1
Output:
DLL 2021-02-19 00:58:50.835391 - PARAMETER # Total Trainable Parameters : 1850305
2021-02-19 00:59:33.620691: F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:273] Unexpected Event status: 1
[42539eb4c18b:01116] *** Process received signal ***
[42539eb4c18b:01116] Signal: Aborted (6)
[42539eb4c18b:01116] Signal code: (-6)
[42539eb4c18b:01116] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x3ef20)[0x7fd3c0283f20]
[42539eb4c18b:01116] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xc7)[0x7fd3c0283e97]
[42539eb4c18b:01116] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x141)[0x7fd3c0285801]
[42539eb4c18b:01116] [ 3] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so(+0xb67e674)[0x7fd2f6592674]
[42539eb4c18b:01116] [ 4] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow8EventMgr10PollEventsEbPN4absl13InlinedVectorINS0_5InUseELm4ESaIS3_EEE+0x207)[0x7fd2f4469127]
[42539eb4c18b:01116] [ 5] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow8EventMgr8PollLoopEv+0x9f)[0x7fd2f44699af]
[42539eb4c18b:01116] [ 6] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/../libtensorflow_framework.so.1(_ZN5Eigen15ThreadPoolTemplIN10tensorflow6thread16EigenEnvironmentEE10WorkerLoopEi+0x281)[0x7fd2ea248641]
[42539eb4c18b:01116] [ 7] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/../libtensorflow_framework.so.1(_ZNSt17_Function_handlerIFvvEZN10tensorflow6thread16EigenEnvironment12CreateThreadESt8functionIS0_EEUlvE_E9_M_invokeERKSt9_Any_data+0x48)[0x7fd2ea245d38]
[42539eb4c18b:01116] [ 8] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xbd6df)[0x7fd2e17da6df]
[42539eb4c18b:01116] [ 9] /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db)[0x7fd3c002d6db]
[42539eb4c18b:01116] [10] /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f)[0x7fd3c036688f]
[42539eb4c18b:01116] *** End of error message ***
./UNet_1GPU.sh: line 50: 1116 Aborted (core dumped) python "${BASEDIR}/../main.py" --unet_variant='tinyUNet' --activation_fn='relu' --exec_mode='train_and_evaluate' --iter_unit='batch' --num_iter=2500 --batch_size=16 --warmup_step=10 --results_dir="${1}" --data_dir="${2}" --dataset_name='DAGM2007' --dataset_classID="${3}" --data_format='NCHW' --use_auto_loss_scaling --noamp --xla --learning_rate=1e-4 --learning_rate_decay_factor=0.8 --learning_rate_decay_steps=500 --rmsprop_decay=0.9 --rmsprop_momentum=0.8 --loss_fn_name='adaptive_loss' --weight_decay=1e-5 --weight_init_method='he_uniform' --augment_data --display_every=250 --debug_verbosity=0
Here’s what kern.log shows on the host machine (Ubuntu 20.04):
Feb 19 02:05:06 franklin kernel: [553014.767409] NVRM: Xid (PCI:0000:41:00): 13, pid=349037, Graphics SM Warp Exception on (GPC 0, TPC 0): Out Of Range Address
Feb 19 02:05:06 franklin kernel: [553014.767430] NVRM: Xid (PCI:0000:41:00): 13, pid=349037, Graphics Exception: ESR 0x504648=0x104000e 0x504650=0x0 0x504644=0xd3eff2 0x50464c=0x17f
Feb 19 02:05:06 franklin kernel: [553014.767501] NVRM: Xid (PCI:0000:41:00): 13, pid=349037, Graphics SM Warp Exception on (GPC 0, TPC 1): Out Of Range Address
Feb 19 02:05:06 franklin kernel: [553014.769726] NVRM: Xid (PCI:0000:41:00): 13, pid=349037, Graphics Exception: ESR 0x52e648=0x129000e 0x52e650=0x20 0x52e644=0xd3eff2 0x52e64c=0x17f
Feb 19 02:05:06 franklin kernel: [553014.770198] NVRM: Xid (PCI:0000:41:00): 43, pid=349325, Ch 00000018
I found one person with the same issue, but there’s no solution there besides a brief mention of a possible driver issue with older cards (Error: Graphics SM Warp Exception on (GPC 1, TPC 0): Out Of Range Address (Xid 13/Xid 43))
Thanks!
Added bug report: nvidia-bug-report.log.gz (27.7 MB)