Out of memory trying to run WSL2 resnet deep learning code example

Installed using these directions:

I've tried all of the examples listed, with the exception of those in the Jupyter notebook.

sudo docker run --gpus all -it --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 nvcr.io/nvidia/tensorflow:20.03-tf2-py3 python nvidia-examples/cnn/resnet.py

================
== TensorFlow ==
================

NVIDIA Release 20.03-tf2 (build 11026100)
TensorFlow Version 2.1.0

Container image Copyright © 2019, NVIDIA CORPORATION. All rights reserved.
Copyright 2017-2019 The TensorFlow Authors. All rights reserved.
Various files include modifications © NVIDIA CORPORATION. All rights reserved.
NVIDIA modifications are covered by the license terms that apply to the underlying project or file.

WARNING: The NVIDIA Driver was not detected. GPU functionality will not be available.
Use 'nvidia-docker run' to start this container; see https://github.com/NVIDIA/nvidia-docker/wiki/nvidia-docker .

NOTE: MOFED driver for multi-node communication was not detected. Multi-node communication performance may be reduced.

2020-10-31 20:46:12.836512: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.2
2020-10-31 20:46:13.475607: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libnvinfer.so.7
2020-10-31 20:46:13.476390: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libnvinfer_plugin.so.7
PY 3.6.9 (default, Nov 7 2019, 10:44:02) [GCC 8.3.0]
TF 2.1.0
Script arguments: --image_width=224 --image_height=224 --distort_color=False --momentum=0.9 --loss_scale=128.0 --image_format=channels_last --data_dir=None --data_idx_dir=None --batch_size=256 --num_iter=300 --iter_unit=batch --log_dir=None --export_dir=None --tensorboard_dir=None --display_every=10 --precision=fp16 --dali_mode=None --use_xla=False --predict=False
2020-10-31 20:46:14.116304: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-10-31 20:46:14.393616: E tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:967] could not open file to read NUMA node: /sys/bus/pci/devices/0000:2d:00.0/numa_node Your kernel may have been built without NUMA support.
2020-10-31 20:46:14.393911: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties: pciBusID: 0000:2d:00.0 name: GeForce GTX 1080 Ti computeCapability: 6.1 coreClock: 1.582GHz coreCount: 28 deviceMemorySize: 11.00GiB deviceMemoryBandwidth: 451.17GiB/s
2020-10-31 20:46:14.394016: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.2
2020-10-31 20:46:14.394108: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-10-31 20:46:14.395555: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-10-31 20:46:14.395859: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-10-31 20:46:14.397711: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-10-31 20:46:14.398678: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-10-31 20:46:14.398748: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-10-31 20:46:14.399312: E tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:967] could not open file to read NUMA node: /sys/bus/pci/devices/0000:2d:00.0/numa_node Your kernel may have been built without NUMA support.
2020-10-31 20:46:14.400081: E tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:967] could not open file to read NUMA node: /sys/bus/pci/devices/0000:2d:00.0/numa_node Your kernel may have been built without NUMA support.
2020-10-31 20:46:14.400389: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0
2020-10-31 20:46:14.427371: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3400295000 Hz
2020-10-31 20:46:14.433648: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x48d2b10 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-10-31 20:46:14.433708: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2020-10-31 20:46:14.681780: E tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:967] could not open file to read NUMA node: /sys/bus/pci/devices/0000:2d:00.0/numa_node Your kernel may have been built without NUMA support.
2020-10-31 20:46:14.682296: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x48b6f50 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-10-31 20:46:14.682388: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): GeForce GTX 1080 Ti, Compute Capability 6.1
2020-10-31 20:46:14.682997: E tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:967] could not open file to read NUMA node: /sys/bus/pci/devices/0000:2d:00.0/numa_node Your kernel may have been built without NUMA support.
2020-10-31 20:46:14.683266: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties: pciBusID: 0000:2d:00.0 name: GeForce GTX 1080 Ti computeCapability: 6.1 coreClock: 1.582GHz coreCount: 28 deviceMemorySize: 11.00GiB deviceMemoryBandwidth: 451.17GiB/s
2020-10-31 20:46:14.683328: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.2
2020-10-31 20:46:14.683350: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-10-31 20:46:14.683416: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-10-31 20:46:14.683437: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-10-31 20:46:14.683451: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-10-31 20:46:14.683483: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-10-31 20:46:14.683503: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-10-31 20:46:14.683833: E tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:967] could not open file to read NUMA node: /sys/bus/pci/devices/0000:2d:00.0/numa_node Your kernel may have been built without NUMA support.
2020-10-31 20:46:14.684384: E tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:967] could not open file to read NUMA node: /sys/bus/pci/devices/0000:2d:00.0/numa_node Your kernel may have been built without NUMA support.
2020-10-31 20:46:14.684701: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0
2020-10-31 20:46:14.684812: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.2
2020-10-31 20:46:15.989702: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1096] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-10-31 20:46:15.989774: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102] 0
2020-10-31 20:46:15.989789: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] 0: N
2020-10-31 20:46:15.990811: E tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:967] could not open file to read NUMA node: /sys/bus/pci/devices/0000:2d:00.0/numa_node Your kernel may have been built without NUMA support.
2020-10-31 20:46:15.991115: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1324] Could not identify NUMA node of platform GPU id 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2020-10-31 20:46:15.991702: E tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:967] could not open file to read NUMA node: /sys/bus/pci/devices/0000:2d:00.0/numa_node Your kernel may have been built without NUMA support.
2020-10-31 20:46:15.991988: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1241] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 9465 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:2d:00.0, compute capability: 6.1)
WARNING:tensorflow:Expected a shuffled dataset but input dataset x is not shuffled. Please invoke shuffle() on input dataset.
2020-10-31 20:46:30.582299: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-10-31 20:46:31.000289: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-10-31 20:46:33.327820: W tensorflow/core/common_runtime/bfc_allocator.cc:243] Allocator (GPU_0_bfc) ran out of memory trying to allocate 3.11GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2020-10-31 20:46:33.586535: W tensorflow/core/common_runtime/bfc_allocator.cc:243] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.61GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2020-10-31 20:46:33.757268: W tensorflow/core/common_runtime/bfc_allocator.cc:243] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.61GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2020-10-31 20:46:33.941441: W tensorflow/core/common_runtime/bfc_allocator.cc:243] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.14GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2020-10-31 20:46:33.941521: W tensorflow/core/common_runtime/bfc_allocator.cc:243] Allocator (GPU_0_bfc) ran out of memory trying to allocate 747.00MiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2020-10-31 20:46:33.973338: W tensorflow/core/common_runtime/bfc_allocator.cc:243] Allocator (GPU_0_bfc) ran out of memory trying to allocate 236.50MiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2020-10-31 20:46:33.973433: W tensorflow/core/common_runtime/bfc_allocator.cc:243] Allocator (GPU_0_bfc) ran out of memory trying to allocate 832.02MiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2020-10-31 20:46:33.976730: W tensorflow/core/common_runtime/bfc_allocator.cc:243] Allocator (GPU_0_bfc) ran out of memory trying to allocate 16.00MiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2020-10-31 20:46:33.990272: W tensorflow/core/common_runtime/bfc_allocator.cc:243] Allocator (GPU_0_bfc) ran out of memory trying to allocate 40.50MiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2020-10-31 20:46:33.990363: W tensorflow/core/common_runtime/bfc_allocator.cc:243] Allocator (GPU_0_bfc) ran out of memory trying to allocate 737.00MiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.

2020-10-31 20:46:44.070604: I tensorflow/core/common_runtime/bfc_allocator.cc:962] Sum Total of in-use chunks: 9.07GiB
2020-10-31 20:46:44.070611: I tensorflow/core/common_runtime/bfc_allocator.cc:964] total_region_allocated_bytes_: 9925364224 memory_limit_: 9925364286 available bytes: 62 curr_region_allocation_bytes_: 17179869184
2020-10-31 20:46:44.070621: I tensorflow/core/common_runtime/bfc_allocator.cc:970] Stats: Limit: 9925364286 InUse: 9735095040 MaxInUse: 9817396736 NumAllocs: 2730 MaxAllocSize: 2785116160
2020-10-31 20:46:44.070654: W tensorflow/core/common_runtime/bfc_allocator.cc:429] ***************************************************************************************************x
2020-10-31 20:46:44.070721: W tensorflow/core/framework/op_kernel.cc:1655] OP_REQUIRES failed at conv_ops.cc:539 : Resource exhausted: OOM when allocating tensor with shape[256,1024,14,14] and type half on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
2020-10-31 20:46:44.070800: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Resource exhausted: OOM when allocating tensor with shape[256,1024,14,14] and type half on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[{{node res4b_branch2c/Conv2D}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
Traceback (most recent call last):
  File "nvidia-examples/cnn/resnet.py", line 50, in <module>
    nvutils.train(resnet50, args)
  File "/workspace/nvidia-examples/cnn/nvutils/runner.py", line 216, in train
    initial_epoch=initial_epoch, **valid_params)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training.py", line 819, in fit
    use_multiprocessing=use_multiprocessing)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training_generator.py", line 694, in fit
    steps_name='steps_per_epoch')
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training_generator.py", line 265, in model_iteration
    batch_outs = batch_function(*batch_data)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training.py", line 1123, in train_on_batch
    outputs = self.train_function(ins)  # pylint: disable=not-callable
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/backend.py", line 3727, in __call__
    outputs = self._graph_fn(*converted_inputs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/function.py", line 1551, in __call__
    return self._call_impl(args, kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/function.py", line 1591, in _call_impl
    return self._call_flat(args, self.captured_inputs, cancellation_manager)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/function.py", line 1692, in _call_flat
    ctx, args, cancellation_manager=cancellation_manager))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/function.py", line 545, in call
    ctx=ctx)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/execute.py", line 67, in quick_execute
    six.raise_from(core._status_to_exception(e.code, message), None)
  File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[256,1024,14,14] and type half on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[node res4b_branch2c/Conv2D (defined at /workspace/nvidia-examples/cnn/nvutils/runner.py:216) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. [Op:__inference_keras_scratch_graph_20352]
Function call stack:
keras_scratch_graph

Hello,

I ran into the same problem, and I just found out that the batch size in resnet.py in the NVIDIA samples is 256.

I think this is the cause of the out-of-memory problem.

After I changed the batch size to 1, it works.
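
For what it's worth, the "Script arguments" line in the log above already lists --batch_size=256, so it may not be necessary to edit resnet.py at all; passing a smaller value on the command line should have the same effect, assuming the example forwards the flag exactly as it prints it. A rough sketch (the value 64 is just an example, not a recommendation):

sudo docker run --gpus all -it --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 \
  nvcr.io/nvidia/tensorflow:20.03-tf2-py3 \
  python nvidia-examples/cnn/resnet.py --batch_size=64   # drop further (32, 16, 8, ...) if it still OOMs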

I was getting the same issue, and your solution worked for me. Thank you.
However, the performance is very slow. Do you have any idea why?
It takes about 20 minutes to finish…

2021-01-08 06:47:23.404475: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2021-01-08 06:48:55.484976: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2021-01-08 07:03:55.900491: W tensorflow/stream_executor/gpu/redzone_allocator.cc:312] Internal: ptxas exited with non-zero error code 65280, output: ptxas fatal : Value 'sm_86' is not defined for option 'gpu-name'

Relying on driver to perform ptx compilation. This message will be only logged once.
global_step: 10 images_per_sec: 0.0
global_step: 20 images_per_sec: 3.5
global_step: 30 images_per_sec: 3.6
global_step: 40 images_per_sec: 3.2
global_step: 50 images_per_sec: 3.2
global_step: 60 images_per_sec: 3.2
global_step: 70 images_per_sec: 3.2
global_step: 80 images_per_sec: 3.3
global_step: 90 images_per_sec: 3.2
global_step: 100 images_per_sec: 3.2
global_step: 110 images_per_sec: 3.1
global_step: 120 images_per_sec: 3.1
global_step: 130 images_per_sec: 3.2
global_step: 140 images_per_sec: 3.2
global_step: 150 images_per_sec: 3.1
global_step: 160 images_per_sec: 3.1
global_step: 170 images_per_sec: 3.2
global_step: 180 images_per_sec: 3.2
global_step: 190 images_per_sec: 3.2
global_step: 200 images_per_sec: 3.2
global_step: 210 images_per_sec: 3.2
global_step: 220 images_per_sec: 3.4
global_step: 230 images_per_sec: 3.5
global_step: 240 images_per_sec: 4.0
global_step: 250 images_per_sec: 3.8
global_step: 260 images_per_sec: 3.9
global_step: 270 images_per_sec: 3.5
global_step: 280 images_per_sec: 4.0
global_step: 290 images_per_sec: 3.9
global_step: 300 images_per_sec: 3.8
epoch: 0 time_taken: 1144.7
300/300 - 1145s - loss: nan - top1: 0.9967 - top5: 0.9967

Hello!

Which GPU do you use?

In my case, I use a GTX 1070.

I use an RTX 3080.

Thanks, chlwlgh1027, for the solution!!

I have an RTX 2060; I changed batch_size to 8 and it completed in 83 seconds. One can play around with the batch_size… it's not necessary to set it to 1; just reduce it from 256 to 8/16/32, etc.
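
A small follow-up sketch in the same spirit, in case anyone wants to find a workable value without editing the script each run: since an OOM run ends in an uncaught Python exception (so the container exits with a non-zero status, as in the traceback above), you can retry with progressively smaller batch sizes until one completes. The candidate values below are just examples; adjust them for your card.

# Hypothetical helper loop: retry with smaller batch sizes until a run succeeds.
# -it is dropped because no interactive shell is needed inside the loop.
for bs in 64 32 16 8; do
  sudo docker run --gpus all --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 \
    nvcr.io/nvidia/tensorflow:20.03-tf2-py3 \
    python nvidia-examples/cnn/resnet.py --batch_size=$bs && break
done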