Out of memory trying to run the ResNet deep learning code example under WSL2

Installed using these directions:

I've tried all of the examples listed, with the exception of those in the Jupyter notebook.

sudo docker run --gpus all -it --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 nvcr.io/nvidia/tensorflow:20.03-tf2-py3 python nvidia-examples/cnn/resnet.py

================
== TensorFlow ==
================

NVIDIA Release 20.03-tf2 (build 11026100)
TensorFlow Version 2.1.0

Container image Copyright © 2019, NVIDIA CORPORATION. All rights reserved.
Copyright 2017-2019 The TensorFlow Authors. All rights reserved.

Various files include modifications © NVIDIA CORPORATION. All rights reserved.
NVIDIA modifications are covered by the license terms that apply to the underlying project or file.

WARNING: The NVIDIA Driver was not detected. GPU functionality will not be available.
Use 'nvidia-docker run' to start this container; see https://github.com/NVIDIA/nvidia-docker/wiki/nvidia-docker .

NOTE: MOFED driver for multi-node communication was not detected. Multi-node communication performance may be reduced.

2020-10-31 20:46:12.836512: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.2
2020-10-31 20:46:13.475607: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libnvinfer.so.7
2020-10-31 20:46:13.476390: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libnvinfer_plugin.so.7
PY 3.6.9 (default, Nov 7 2019, 10:44:02) [GCC 8.3.0]
TF 2.1.0
Script arguments: --image_width=224 --image_height=224 --distort_color=False --momentum=0.9 --loss_scale=128.0 --image_format=channels_last --data_dir=None --data_idx_dir=None --batch_size=256 --num_iter=300 --iter_unit=batch --log_dir=None --export_dir=None --tensorboard_dir=None --display_every=10 --precision=fp16 --dali_mode=None --use_xla=False --predict=False
2020-10-31 20:46:14.116304: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-10-31 20:46:14.393616: E tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:967] could not open file to read NUMA node: /sys/bus/pci/devices/0000:2d:00.0/numa_node Your kernel may have been built without NUMA support.
2020-10-31 20:46:14.393911: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties: pciBusID: 0000:2d:00.0 name: GeForce GTX 1080 Ti computeCapability: 6.1 coreClock: 1.582GHz coreCount: 28 deviceMemorySize: 11.00GiB deviceMemoryBandwidth: 451.17GiB/s
2020-10-31 20:46:14.394016: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.2
2020-10-31 20:46:14.394108: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-10-31 20:46:14.395555: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-10-31 20:46:14.395859: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-10-31 20:46:14.397711: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-10-31 20:46:14.398678: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-10-31 20:46:14.398748: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-10-31 20:46:14.399312: E tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:967] could not open file to read NUMA node: /sys/bus/pci/devices/0000:2d:00.0/numa_node Your kernel may have been built without NUMA support.
2020-10-31 20:46:14.400081: E tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:967] could not open file to read NUMA node: /sys/bus/pci/devices/0000:2d:00.0/numa_node Your kernel may have been built without NUMA support.
2020-10-31 20:46:14.400389: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0
2020-10-31 20:46:14.427371: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3400295000 Hz
2020-10-31 20:46:14.433648: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x48d2b10 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-10-31 20:46:14.433708: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2020-10-31 20:46:14.681780: E tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:967] could not open file to read NUMA node: /sys/bus/pci/devices/0000:2d:00.0/numa_node Your kernel may have been built without NUMA support.
2020-10-31 20:46:14.682296: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x48b6f50 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-10-31 20:46:14.682388: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): GeForce GTX 1080 Ti, Compute Capability 6.1
2020-10-31 20:46:14.682997: E tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:967] could not open file to read NUMA node: /sys/bus/pci/devices/0000:2d:00.0/numa_node Your kernel may have been built without NUMA support.
2020-10-31 20:46:14.683266: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties: pciBusID: 0000:2d:00.0 name: GeForce GTX 1080 Ti computeCapability: 6.1 coreClock: 1.582GHz coreCount: 28 deviceMemorySize: 11.00GiB deviceMemoryBandwidth: 451.17GiB/s
2020-10-31 20:46:14.683328: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.2
2020-10-31 20:46:14.683350: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-10-31 20:46:14.683416: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-10-31 20:46:14.683437: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-10-31 20:46:14.683451: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-10-31 20:46:14.683483: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-10-31 20:46:14.683503: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-10-31 20:46:14.683833: E tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:967] could not open file to read NUMA node: /sys/bus/pci/devices/0000:2d:00.0/numa_node Your kernel may have been built without NUMA support.
2020-10-31 20:46:14.684384: E tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:967] could not open file to read NUMA node: /sys/bus/pci/devices/0000:2d:00.0/numa_node Your kernel may have been built without NUMA support.
2020-10-31 20:46:14.684701: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0
2020-10-31 20:46:14.684812: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.2
2020-10-31 20:46:15.989702: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1096] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-10-31 20:46:15.989774: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102] 0
2020-10-31 20:46:15.989789: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] 0: N
2020-10-31 20:46:15.990811: E tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:967] could not open file to read NUMA node: /sys/bus/pci/devices/0000:2d:00.0/numa_node Your kernel may have been built without NUMA support.
2020-10-31 20:46:15.991115: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1324] Could not identify NUMA node of platform GPU id 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2020-10-31 20:46:15.991702: E tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:967] could not open file to read NUMA node: /sys/bus/pci/devices/0000:2d:00.0/numa_node Your kernel may have been built without NUMA support.
2020-10-31 20:46:15.991988: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1241] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 9465 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:2d:00.0, compute capability: 6.1)
WARNING:tensorflow:Expected a shuffled dataset but input dataset x is not shuffled. Please invoke shuffle() on input dataset.
2020-10-31 20:46:30.582299: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-10-31 20:46:31.000289: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-10-31 20:46:33.327820: W tensorflow/core/common_runtime/bfc_allocator.cc:243] Allocator (GPU_0_bfc) ran out of memory trying to allocate 3.11GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2020-10-31 20:46:33.586535: W tensorflow/core/common_runtime/bfc_allocator.cc:243] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.61GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2020-10-31 20:46:33.757268: W tensorflow/core/common_runtime/bfc_allocator.cc:243] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.61GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2020-10-31 20:46:33.941441: W tensorflow/core/common_runtime/bfc_allocator.cc:243] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.14GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2020-10-31 20:46:33.941521: W tensorflow/core/common_runtime/bfc_allocator.cc:243] Allocator (GPU_0_bfc) ran out of memory trying to allocate 747.00MiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2020-10-31 20:46:33.973338: W tensorflow/core/common_runtime/bfc_allocator.cc:243] Allocator (GPU_0_bfc) ran out of memory trying to allocate 236.50MiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2020-10-31 20:46:33.973433: W tensorflow/core/common_runtime/bfc_allocator.cc:243] Allocator (GPU_0_bfc) ran out of memory trying to allocate 832.02MiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2020-10-31 20:46:33.976730: W tensorflow/core/common_runtime/bfc_allocator.cc:243] Allocator (GPU_0_bfc) ran out of memory trying to allocate 16.00MiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2020-10-31 20:46:33.990272: W tensorflow/core/common_runtime/bfc_allocator.cc:243] Allocator (GPU_0_bfc) ran out of memory trying to allocate 40.50MiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2020-10-31 20:46:33.990363: W tensorflow/core/common_runtime/bfc_allocator.cc:243] Allocator (GPU_0_bfc) ran out of memory trying to allocate 737.00MiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.

2020-10-31 20:46:44.070604: I tensorflow/core/common_runtime/bfc_allocator.cc:962] Sum Total of in-use chunks: 9.07GiB
2020-10-31 20:46:44.070611: I tensorflow/core/common_runtime/bfc_allocator.cc:964] total_region_allocated_bytes_: 9925364224 memory_limit_: 9925364286 available bytes: 62 curr_region_allocation_bytes_: 17179869184
2020-10-31 20:46:44.070621: I tensorflow/core/common_runtime/bfc_allocator.cc:970] Stats:
Limit:        9925364286
InUse:        9735095040
MaxInUse:     9817396736
NumAllocs:          2730
MaxAllocSize: 2785116160

2020-10-31 20:46:44.070654: W tensorflow/core/common_runtime/bfc_allocator.cc:429] ***************************************************************************************************x
2020-10-31 20:46:44.070721: W tensorflow/core/framework/op_kernel.cc:1655] OP_REQUIRES failed at conv_ops.cc:539 : Resource exhausted: OOM when allocating tensor with shape[256,1024,14,14] and type half on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
2020-10-31 20:46:44.070800: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Resource exhausted: OOM when allocating tensor with shape[256,1024,14,14] and type half on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
     [[{{node res4b_branch2c/Conv2D}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

Traceback (most recent call last):
  File "nvidia-examples/cnn/resnet.py", line 50, in <module>
    nvutils.train(resnet50, args)
  File "/workspace/nvidia-examples/cnn/nvutils/runner.py", line 216, in train
    initial_epoch=initial_epoch, **valid_params)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training.py", line 819, in fit
    use_multiprocessing=use_multiprocessing)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training_generator.py", line 694, in fit
    steps_name='steps_per_epoch')
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training_generator.py", line 265, in model_iteration
    batch_outs = batch_function(*batch_data)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training.py", line 1123, in train_on_batch
    outputs = self.train_function(ins)  # pylint: disable=not-callable
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/backend.py", line 3727, in __call__
    outputs = self._graph_fn(*converted_inputs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/function.py", line 1551, in __call__
    return self._call_impl(args, kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/function.py", line 1591, in _call_impl
    return self._call_flat(args, self.captured_inputs, cancellation_manager)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/function.py", line 1692, in _call_flat
    ctx, args, cancellation_manager=cancellation_manager))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/function.py", line 545, in call
    ctx=ctx)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/execute.py", line 67, in quick_execute
    six.raise_from(core._status_to_exception(e.code, message), None)
  File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[256,1024,14,14] and type half on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
     [[node res4b_branch2c/Conv2D (defined at /workspace/nvidia-examples/cnn/nvutils/runner.py:216) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
 [Op:__inference_keras_scratch_graph_20352]

Function call stack:
keras_scratch_graph
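
The failing allocation is an fp16 activation of shape [256, 1024, 14, 14], and the echoed script arguments show the example defaults to --batch_size=256, so the next thing I plan to try is the same command with a smaller batch size. This is only a sketch, assuming resnet.py accepts the --batch_size flag it echoes above; I haven't yet confirmed that a smaller value fits in the memory the allocator reports on this 11 GB card:

# Same container and example as above, but with the batch size halved to 128.
# Assumption: resnet.py accepts --batch_size on the command line, as suggested
# by the "Script arguments" line it prints at startup.
sudo docker run --gpus all -it --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 \
    nvcr.io/nvidia/tensorflow:20.03-tf2-py3 \
    python nvidia-examples/cnn/resnet.py --batch_size=128

If 128 still runs out of memory I'll halve it again, since the first dimension of the failing tensor shape is the batch size, so activation memory should scale roughly linearly with it.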