Installed using these directions:
I've tried all of the listed examples, with the exception of those in the Jupyter notebook.
sudo docker run --gpus all -it --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 nvcr.io/nvidia/tensorflow:20.03-tf2-py3 python nvidia-examples/cnn/resnet.py

================
== TensorFlow ==
================

NVIDIA Release 20.03-tf2 (build 11026100)
TensorFlow Version 2.1.0

Container image Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
Copyright 2017-2019 The TensorFlow Authors. All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION. All rights reserved.
NVIDIA modifications are covered by the license terms that apply to the underlying project or file.

WARNING: The NVIDIA Driver was not detected. GPU functionality will not be available.
   Use 'nvidia-docker run' to start this container; see the NVIDIA/nvidia-docker wiki on GitHub.

NOTE: MOFED driver for multi-node communication was not detected.
      Multi-node communication performance may be reduced.

2020-10-31 20:46:12.836512: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.2
2020-10-31 20:46:13.475607: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libnvinfer.so.7
2020-10-31 20:46:13.476390: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libnvinfer_plugin.so.7
PY 3.6.9 (default, Nov 7 2019, 10:44:02) [GCC 8.3.0]
TF 2.1.0
Script arguments:
  --image_width=224
  --image_height=224
  --distort_color=False
  --momentum=0.9
  --loss_scale=128.0
  --image_format=channels_last
  --data_dir=None
  --data_idx_dir=None
  --batch_size=256
  --num_iter=300
  --iter_unit=batch
  --log_dir=None
  --export_dir=None
  --tensorboard_dir=None
  --display_every=10
  --precision=fp16
  --dali_mode=None
  --use_xla=False
  --predict=False
2020-10-31 20:46:14.116304: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-10-31 20:46:14.393616: E tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:967] could not open file to read NUMA node: /sys/bus/pci/devices/0000:2d:00.0/numa_node Your kernel may have been built without NUMA support.
2020-10-31 20:46:14.393911: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties:
pciBusID: 0000:2d:00.0 name: GeForce GTX 1080 Ti computeCapability: 6.1
coreClock: 1.582GHz coreCount: 28 deviceMemorySize: 11.00GiB deviceMemoryBandwidth: 451.17GiB/s
2020-10-31 20:46:14.394016: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.2
2020-10-31 20:46:14.394108: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-10-31 20:46:14.395555: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-10-31 20:46:14.395859: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-10-31 20:46:14.397711: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-10-31 20:46:14.398678: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-10-31 20:46:14.398748: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-10-31 20:46:14.399312: E tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:967] could not open file to read NUMA node: /sys/bus/pci/devices/0000:2d:00.0/numa_node Your kernel may have been built without NUMA support.
2020-10-31 20:46:14.400081: E tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:967] could not open file to read NUMA node: /sys/bus/pci/devices/0000:2d:00.0/numa_node Your kernel may have been built without NUMA support.
2020-10-31 20:46:14.400389: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0
2020-10-31 20:46:14.427371: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3400295000 Hz
2020-10-31 20:46:14.433648: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x48d2b10 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-10-31 20:46:14.433708: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-10-31 20:46:14.681780: E tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:967] could not open file to read NUMA node: /sys/bus/pci/devices/0000:2d:00.0/numa_node Your kernel may have been built without NUMA support.
2020-10-31 20:46:14.682296: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x48b6f50 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-10-31 20:46:14.682388: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): GeForce GTX 1080 Ti, Compute Capability 6.1
2020-10-31 20:46:14.682997: E tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:967] could not open file to read NUMA node: /sys/bus/pci/devices/0000:2d:00.0/numa_node Your kernel may have been built without NUMA support.
2020-10-31 20:46:14.683266: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties:
pciBusID: 0000:2d:00.0 name: GeForce GTX 1080 Ti computeCapability: 6.1
coreClock: 1.582GHz coreCount: 28 deviceMemorySize: 11.00GiB deviceMemoryBandwidth: 451.17GiB/s
2020-10-31 20:46:14.683328: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.2
2020-10-31 20:46:14.683350: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-10-31 20:46:14.683416: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-10-31 20:46:14.683437: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-10-31 20:46:14.683451: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-10-31 20:46:14.683483: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-10-31 20:46:14.683503: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-10-31 20:46:14.683833: E tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:967] could not open file to read NUMA node: /sys/bus/pci/devices/0000:2d:00.0/numa_node Your kernel may have been built without NUMA support.
2020-10-31 20:46:14.684384: E tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:967] could not open file to read NUMA node: /sys/bus/pci/devices/0000:2d:00.0/numa_node Your kernel may have been built without NUMA support.
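(A side note on the repeated "could not open file to read NUMA node" errors above: on a single-socket desktop the kernel usually exposes no NUMA topology for the PCI device, so the sysfs file is missing or holds -1, and TensorFlow just defaults the GPU to node 0. They appear to be cosmetic. A minimal sketch of the check TensorFlow is performing; the helper name is mine, not part of TensorFlow:)

```python
from pathlib import Path

def gpu_numa_node(pci_addr: str) -> int:
    """Return the NUMA node sysfs reports for a PCI device.

    -1 means the kernel exposes no NUMA information for this device,
    which is exactly the case the TensorFlow log complains about;
    TensorFlow then falls back to node 0.
    """
    path = Path(f"/sys/bus/pci/devices/{pci_addr}/numa_node")
    try:
        return int(path.read_text().strip())
    except (FileNotFoundError, ValueError):
        return -1  # file absent or unreadable: no NUMA info

# PCI address taken from the log above
print(gpu_numa_node("0000:2d:00.0"))
```

(Whether writing a node number into that sysfs file silences the message depends on the kernel; on a single-node machine it should make no performance difference either way.)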
2020-10-31 20:46:14.684701: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0
2020-10-31 20:46:14.684812: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.2
2020-10-31 20:46:15.989702: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1096] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-10-31 20:46:15.989774: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102]      0
2020-10-31 20:46:15.989789: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] 0:   N
2020-10-31 20:46:15.990811: E tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:967] could not open file to read NUMA node: /sys/bus/pci/devices/0000:2d:00.0/numa_node Your kernel may have been built without NUMA support.
2020-10-31 20:46:15.991115: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1324] Could not identify NUMA node of platform GPU id 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2020-10-31 20:46:15.991702: E tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:967] could not open file to read NUMA node: /sys/bus/pci/devices/0000:2d:00.0/numa_node Your kernel may have been built without NUMA support.
2020-10-31 20:46:15.991988: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1241] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 9465 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:2d:00.0, compute capability: 6.1)
WARNING:tensorflow:Expected a shuffled dataset but input dataset `x` is not shuffled. Please invoke `shuffle()` on input dataset.
2020-10-31 20:46:30.582299: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-10-31 20:46:31.000289: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-10-31 20:46:33.327820: W tensorflow/core/common_runtime/bfc_allocator.cc:243] Allocator (GPU_0_bfc) ran out of memory trying to allocate 3.11GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2020-10-31 20:46:33.586535: W tensorflow/core/common_runtime/bfc_allocator.cc:243] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.61GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2020-10-31 20:46:33.757268: W tensorflow/core/common_runtime/bfc_allocator.cc:243] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.61GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2020-10-31 20:46:33.941441: W tensorflow/core/common_runtime/bfc_allocator.cc:243] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.14GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2020-10-31 20:46:33.941521: W tensorflow/core/common_runtime/bfc_allocator.cc:243] Allocator (GPU_0_bfc) ran out of memory trying to allocate 747.00MiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2020-10-31 20:46:33.973338: W tensorflow/core/common_runtime/bfc_allocator.cc:243] Allocator (GPU_0_bfc) ran out of memory trying to allocate 236.50MiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2020-10-31 20:46:33.973433: W tensorflow/core/common_runtime/bfc_allocator.cc:243] Allocator (GPU_0_bfc) ran out of memory trying to allocate 832.02MiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2020-10-31 20:46:33.976730: W tensorflow/core/common_runtime/bfc_allocator.cc:243] Allocator (GPU_0_bfc) ran out of memory trying to allocate 16.00MiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2020-10-31 20:46:33.990272: W tensorflow/core/common_runtime/bfc_allocator.cc:243] Allocator (GPU_0_bfc) ran out of memory trying to allocate 40.50MiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2020-10-31 20:46:33.990363: W tensorflow/core/common_runtime/bfc_allocator.cc:243] Allocator (GPU_0_bfc) ran out of memory trying to allocate 737.00MiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
…
2020-10-31 20:46:44.070604: I tensorflow/core/common_runtime/bfc_allocator.cc:962] Sum Total of in-use chunks: 9.07GiB
2020-10-31 20:46:44.070611: I tensorflow/core/common_runtime/bfc_allocator.cc:964] total_region_allocated_bytes_: 9925364224 memory_limit_: 9925364286 available bytes: 62 curr_region_allocation_bytes_: 17179869184
2020-10-31 20:46:44.070621: I tensorflow/core/common_runtime/bfc_allocator.cc:970] Stats:
Limit:                  9925364286
InUse:                  9735095040
MaxInUse:               9817396736
NumAllocs:                    2730
MaxAllocSize:           2785116160

2020-10-31 20:46:44.070654: W tensorflow/core/common_runtime/bfc_allocator.cc:429] ***************************************************************************************************x
2020-10-31 20:46:44.070721: W tensorflow/core/framework/op_kernel.cc:1655] OP_REQUIRES failed at conv_ops.cc:539 : Resource exhausted: OOM when allocating tensor with shape[256,1024,14,14] and type half on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
2020-10-31 20:46:44.070800: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Resource exhausted: OOM when allocating tensor with shape[256,1024,14,14] and type half on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
	 [[{{node res4b_branch2c/Conv2D}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
Traceback (most recent call last):
  File "nvidia-examples/cnn/resnet.py", line 50, in <module>
    nvutils.train(resnet50, args)
  File "/workspace/nvidia-examples/cnn/nvutils/runner.py", line 216, in train
    initial_epoch=initial_epoch, **valid_params)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training.py", line 819, in fit
    use_multiprocessing=use_multiprocessing)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training_generator.py", line 694, in fit
    steps_name='steps_per_epoch')
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training_generator.py", line 265, in model_iteration
    batch_outs = batch_function(*batch_data)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training.py", line 1123, in train_on_batch
    outputs = self.train_function(ins)  # pylint: disable=not-callable
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/backend.py", line 3727, in __call__
    outputs = self._graph_fn(*converted_inputs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/function.py", line 1551, in __call__
    return self._call_impl(args, kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/function.py", line 1591, in _call_impl
    return self._call_flat(args, self.captured_inputs, cancellation_manager)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/function.py", line 1692, in _call_flat
    ctx, args, cancellation_manager=cancellation_manager))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/function.py", line 545, in call
    ctx=ctx)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/execute.py", line 67, in quick_execute
    six.raise_from(core._status_to_exception(e.code, message), None)
  File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[256,1024,14,14] and type half on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
	 [[node res4b_branch2c/Conv2D (defined at /workspace/nvidia-examples/cnn/nvutils/runner.py:216) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. [Op:__inference_keras_scratch_graph_20352]

Function call stack:
keras_scratch_graph
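The failure itself is a plain out-of-memory at the default batch size: the script ran with --batch_size=256 (see the argument dump above) while the BFC allocator was limited to 9925364286 bytes (about 9.24 GiB of the 1080 Ti's 11 GiB). Some back-of-envelope arithmetic on the tensor named in the error, as a sketch of why shrinking the batch should help:

```python
# Shape and dtype taken from the OOM message above:
# shape[256,1024,14,14], type half (fp16 = 2 bytes per element).
batch, channels, h, w = 256, 1024, 14, 14
bytes_per_elem = 2

tensor_bytes = batch * channels * h * w * bytes_per_elem
print(f"one such activation tensor: {tensor_bytes / 2**20:.1f} MiB")   # 98.0 MiB

# A ResNet-50 training step keeps many activations of this order alive
# at once for backprop, so activation memory scales roughly linearly
# with batch size: halving the batch roughly halves it.
print(f"same tensor at batch 128:  {tensor_bytes // 2 / 2**20:.1f} MiB")  # 49.0 MiB

allocator_limit = 9925364286  # memory_limit_ from the log, in bytes
print(f"allocator limit: {allocator_limit / 2**30:.2f} GiB")
```

Rerunning with a smaller batch, e.g. --batch_size=128 or --batch_size=64 (a flag the script already accepts, per the argument dump), should let this example fit on an 11 GiB card.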