automatic mixed precision failure

evanfyk8i · March 22, 2019, 10:47am

I was keen to test the new automatic mixed precision (AMP) support in your TensorFlow container nvcr.io/nvidia/tensorflow:19.03-py3. My code works fine normally, but it fails when I enable automatic mixed precision.
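The switch can also be set from inside the Python script instead of the shell. This is a minimal sketch: the variable name comes from the session in this post, the helper name `enable_auto_mixed_precision` is made up for illustration, and it is an assumption that the variable has to be in place before TensorFlow is imported and the graph is built.

```python
import os

# Variable name taken from the shell session in this thread.
AMP_ENV_VAR = "TF_ENABLE_AUTO_MIXED_PRECISION"


def enable_auto_mixed_precision(environ=os.environ):
    """Set the AMP switch in the given environment mapping and return its value."""
    environ[AMP_ENV_VAR] = "1"
    return environ[AMP_ENV_VAR]


# Assumption: call this before importing TensorFlow and building the graph,
# so the setting is visible when the graph optimizer runs.
enable_auto_mixed_precision()
```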

root@7ea1fde48ca8:/workspace# export TF_ENABLE_AUTO_MIXED_PRECISION=1
root@7ea1fde48ca8:/workspace# python neural_style.py --content examples/1-content.jpg --styles examples/1-style.jpg --network imagenet-vgg-verydeep-19.mat --output blah16.jpg
2019-03-22 05:36:33.023660: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2300040000 Hz
2019-03-22 05:36:33.024432: I tensorflow/compiler/xla/service/service.cc:161] XLA service 0x54547f0 executing computations on platform Host. Devices:
2019-03-22 05:36:33.024471: I tensorflow/compiler/xla/service/service.cc:168] StreamExecutor device (0): ,
2019-03-22 05:36:33.154671: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-03-22 05:36:33.155276: I tensorflow/compiler/xla/service/service.cc:161] XLA service 0x5524160 executing computations on platform CUDA. Devices:
2019-03-22 05:36:33.155316: I tensorflow/compiler/xla/service/service.cc:168] StreamExecutor device (0): Tesla V100-SXM2-16GB, Compute Capability 7.0
2019-03-22 05:36:33.155783: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties:
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:00:1e.0
totalMemory: 15.75GiB freeMemory: 15.44GiB
2019-03-22 05:36:33.155816: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2019-03-22 05:36:33.649698: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-03-22 05:36:33.649768: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0
2019-03-22 05:36:33.649790: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0: N
2019-03-22 05:36:33.650171: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14938 MB memory) → physical GPU (device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:1e.0, compute capability: 7.0)
2019-03-22 05:36:34.072402: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1704] Running auto_mixed_precision graph optimizer
2019-03-22 05:36:34.072972: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1230] No whitelist ops found, nothing to do
2019-03-22 05:36:35.132201: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1704] Running auto_mixed_precision graph optimizer
2019-03-22 05:36:35.132875: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1230] No whitelist ops found, nothing to do
2019-03-22 05:36:36.585227: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2019-03-22 05:36:36.585311: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-03-22 05:36:36.585330: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0
2019-03-22 05:36:36.585338: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0: N
2019-03-22 05:36:36.585711: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14938 MB memory) → physical GPU (device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:1e.0, compute capability: 7.0)
2019-03-22 05:36:37.034372: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1704] Running auto_mixed_precision graph optimizer
2019-03-22 05:36:37.034714: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1230] No whitelist ops found, nothing to do
2019-03-22 05:36:37.082609: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1704] Running auto_mixed_precision graph optimizer
2019-03-22 05:36:37.083845: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1230] No whitelist ops found, nothing to do
2019-03-22 05:36:37.304684: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1704] Running auto_mixed_precision graph optimizer
2019-03-22 05:36:37.305175: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1230] No whitelist ops found, nothing to do
2019-03-22 05:36:37.616460: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1704] Running auto_mixed_precision graph optimizer
2019-03-22 05:36:37.617087: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1230] No whitelist ops found, nothing to do
2019-03-22 05:36:38.177550: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1704] Running auto_mixed_precision graph optimizer
2019-03-22 05:36:38.178386: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1230] No whitelist ops found, nothing to do
WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
2019-03-22 05:36:40.165986: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2019-03-22 05:36:40.166069: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-03-22 05:36:40.166089: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0
2019-03-22 05:36:40.166098: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0: N
2019-03-22 05:36:40.166442: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14938 MB memory) → physical GPU (device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:1e.0, compute capability: 7.0)
2019-03-22 05:36:40.190878: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1704] Running auto_mixed_precision graph optimizer
2019-03-22 05:36:40.191325: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1230] No whitelist ops found, nothing to do
Optimization started…

2019-03-22 05:36:40.511646: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1704] Running auto_mixed_precision graph optimizer
2019-03-22 05:36:40.513016: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1660] Converted 44/97 nodes to float16 precision using 0 cast(s) to float16 (excluding Const and Variable casts)
content loss: 2.14634e+06
2019-03-22 05:36:42.648689: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1704] Running auto_mixed_precision graph optimizer
2019-03-22 05:36:42.650502: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1660] Converted 58/149 nodes to float16 precision using 0 cast(s) to float16 (excluding Const and Variable casts)
2019-03-22 05:36:42.893699: I tensorflow/stream_executor/dso_loader.cc:153] successfully opened CUDA library libcublas.so.10 locally
style loss: inf
2019-03-22 05:36:43.451683: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1704] Running auto_mixed_precision graph optimizer
2019-03-22 05:36:43.452169: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1230] No whitelist ops found, nothing to do
tv loss: 0
2019-03-22 05:36:43.470820: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1704] Running auto_mixed_precision graph optimizer
2019-03-22 05:36:43.472961: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1660] Converted 59/204 nodes to float16 precision using 0 cast(s) to float16 (excluding Const and Variable casts)
total loss: inf
Traceback (most recent call last):
  File "neural_style.py", line 224, in <module>
    main()
  File "neural_style.py", line 184, in main
    checkpoint_iterations=options.checkpoint_iterations
  File "/workspace/stylize.py", line 145, in stylize
    train_step.run()
AttributeError: 'Tensor' object has no attribute 'run'

You can reproduce by cloning: GitHub - evanthomas/neural-style: Neural style in TensorFlow!
You will need to get the model here: FloydHub Blog
(The one linked on the GitHub page has changed for some reason.)

Evan.

carlc · April 11, 2019, 8:41pm

Hi Evan,

Thanks for reporting this!

The underlying issue is that, by default, apply_gradients returns an Operation object that you can call .run() on. But the loss scaling optimizer code changed that behavior to return a Tensor object that doesn’t have a run() method. We’ve fixed this behavior, and the next container release (19.04) will work as you’d expect.

Until then, here are two options to get things working:

1. Replace train_step.run() with sess.run(train_step), which should work for either operations or tensors.
2. When using TF-AMP, call train_step.op.run().
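The second option can be wrapped in a small helper that works whichever kind of object comes back. This is only a sketch of the dispatch logic: `FakeOperation` and `FakeTensor` are stand-in classes invented here so the example runs without TensorFlow, and `run_train_step` is a hypothetical name, not part of any library.

```python
class FakeOperation:
    """Stand-in for an object that exposes run(), like the Operation described above."""

    def __init__(self):
        self.run_count = 0

    def run(self):
        self.run_count += 1


class FakeTensor:
    """Stand-in for an object that has no run() but carries an .op attribute."""

    def __init__(self):
        self.op = FakeOperation()


def run_train_step(train_step):
    """Run train_step directly if it has run(), otherwise run its underlying op."""
    if hasattr(train_step, "run"):
        train_step.run()
    else:
        train_step.op.run()


op_step = FakeOperation()
tensor_step = FakeTensor()
run_train_step(op_step)      # takes the train_step.run() path
run_train_step(tensor_step)  # takes the train_step.op.run() path
```

In a real script, the first option (sess.run(train_step)) is simpler because it needs no such check.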

Let me know if either of those works,
Carl

evanfyk8i · April 11, 2019, 11:36pm

Hi Carl,

Thanks for looking at it!

I think I’ll have time to test your suggestions over the weekend. I’m quite excited about AMP because I currently don’t have the skills to manually convert to MP.

Evan.

evanfyk8i · April 13, 2019, 5:29am

Hi Carl,

Replacing train_step.run() with sess.run(train_step) allows the optimisation to run, but the loss function overflows. (The exception at the end is a result of the loss overflowing.)
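To illustrate why an overflowing loss can end in a 'NoneType' error, here is a minimal sketch. It assumes the script keeps its best result with a strict "less than" comparison against an initial value of infinity; the actual stylize.py code is not quoted in this thread, and `track_best` is a made-up name.

```python
def track_best(losses):
    """Return (best_loss, best_index) using a strict 'less than' comparison."""
    best_loss = float("inf")
    best_index = None  # stays None if no loss ever beats the initial value
    for i, loss in enumerate(losses):
        if loss < best_loss:
            best_loss = loss
            best_index = i
    return best_loss, best_index


# With finite losses, a best entry is recorded.
finite = track_best([3.0, 1.5, 2.0])

# With every loss equal to inf, 'inf < inf' is False, so nothing is recorded.
overflowed = track_best([float("inf"), float("inf")])
```

Under that assumption, the best image is never assigned when every total loss is inf, and a later call on it fails.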

python neural_style.py --content examples/1-content.jpg --styles examples/1-style.jpg --network imagenet-vgg-verydeep-19.mat --output blah16.jpg
2019-04-13 05:15:14.991501: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3650000000 Hz
2019-04-13 05:15:14.995200: I tensorflow/compiler/xla/service/service.cc:161] XLA service 0x6204f60 executing computations on platform Host. Devices:
2019-04-13 05:15:14.995251: I tensorflow/compiler/xla/service/service.cc:168] StreamExecutor device (0): ,
2019-04-13 05:15:15.976972: I tensorflow/compiler/xla/service/service.cc:161] XLA service 0x62e7750 executing computations on platform CUDA. Devices:
2019-04-13 05:15:15.977030: I tensorflow/compiler/xla/service/service.cc:168] StreamExecutor device (0): GeForce RTX 2080 Ti, Compute Capability 7.5
2019-04-13 05:15:15.979485: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties:
name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.545
pciBusID: 0000:17:00.0
totalMemory: 10.73GiB freeMemory: 10.57GiB
2019-04-13 05:15:15.979518: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2019-04-13 05:15:27.899626: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-04-13 05:15:27.899701: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0
2019-04-13 05:15:27.899714: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0: N
2019-04-13 05:15:27.900443: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10198 MB memory) → physical GPU (device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:17:00.0, compute capability: 7.5)
2019-04-13 05:15:28.641117: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1704] Running auto_mixed_precision graph optimizer
2019-04-13 05:15:28.641875: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1230] No whitelist ops found, nothing to do
2019-04-13 05:15:29.013258: W tensorflow/core/framework/allocator.cc:124] Allocation of 54579200 exceeds 10% of system memory.
2019-04-13 05:15:29.075482: W tensorflow/core/framework/allocator.cc:124] Allocation of 54579200 exceeds 10% of system memory.
2019-04-13 05:15:29.208972: W tensorflow/core/framework/allocator.cc:124] Allocation of 27340800 exceeds 10% of system memory.
2019-04-13 05:15:29.252181: W tensorflow/core/framework/allocator.cc:124] Allocation of 27340800 exceeds 10% of system memory.
2019-04-13 05:15:29.625993: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1704] Running auto_mixed_precision graph optimizer
2019-04-13 05:15:29.626359: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1230] No whitelist ops found, nothing to do
2019-04-13 05:15:29.863165: W tensorflow/core/framework/allocator.cc:124] Allocation of 54579200 exceeds 10% of system memory.
2019-04-13 05:15:30.560056: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2019-04-13 05:15:30.560107: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-04-13 05:15:30.560113: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0
2019-04-13 05:15:30.560118: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0: N
2019-04-13 05:15:30.560426: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10198 MB memory) → physical GPU (device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:17:00.0, compute capability: 7.5)
2019-04-13 05:15:31.014642: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1704] Running auto_mixed_precision graph optimizer
2019-04-13 05:15:31.014846: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1230] No whitelist ops found, nothing to do
2019-04-13 05:15:31.065438: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1704] Running auto_mixed_precision graph optimizer
2019-04-13 05:15:31.065743: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1230] No whitelist ops found, nothing to do
2019-04-13 05:15:31.209730: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1704] Running auto_mixed_precision graph optimizer
2019-04-13 05:15:31.209990: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1230] No whitelist ops found, nothing to do
2019-04-13 05:15:31.398695: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1704] Running auto_mixed_precision graph optimizer
2019-04-13 05:15:31.399003: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1230] No whitelist ops found, nothing to do
2019-04-13 05:15:31.749090: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1704] Running auto_mixed_precision graph optimizer
2019-04-13 05:15:31.749597: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1230] No whitelist ops found, nothing to do
WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
2019-04-13 05:15:33.339512: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2019-04-13 05:15:33.339571: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-04-13 05:15:33.339578: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0
2019-04-13 05:15:33.339583: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0: N
2019-04-13 05:15:33.339914: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10198 MB memory) → physical GPU (device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:17:00.0, compute capability: 7.5)
2019-04-13 05:15:33.362323: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1704] Running auto_mixed_precision graph optimizer
2019-04-13 05:15:33.362644: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1230] No whitelist ops found, nothing to do
Optimization started…

2019-04-13 05:15:33.540019: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1704] Running auto_mixed_precision graph optimizer
2019-04-13 05:15:33.541490: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1660] Converted 44/97 nodes to float16 precision using 0 cast(s) to float16 (excluding Const and Variable casts)
content loss: 2.14634e+06
2019-04-13 05:15:40.035576: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1704] Running auto_mixed_precision graph optimizer
2019-04-13 05:15:40.037413: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1660] Converted 58/149 nodes to float16 precision using 0 cast(s) to float16 (excluding Const and Variable casts)
2019-04-13 05:15:40.250387: I tensorflow/stream_executor/dso_loader.cc:153] successfully opened CUDA library libcublas.so.10 locally
style loss: inf
2019-04-13 05:15:42.357122: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1704] Running auto_mixed_precision graph optimizer
2019-04-13 05:15:42.357777: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1230] No whitelist ops found, nothing to do
tv loss: 0
2019-04-13 05:15:42.382481: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1704] Running auto_mixed_precision graph optimizer
2019-04-13 05:15:42.385084: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1660] Converted 59/204 nodes to float16 precision using 0 cast(s) to float16 (excluding Const and Variable casts)
total loss: inf
2019-04-13 05:15:42.878215: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1704] Running auto_mixed_precision graph optimizer
2019-04-13 05:15:42.885650: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1660] Converted 199/814 nodes to float16 precision using 7 cast(s) to float16 (excluding Const and Variable casts)

content loss: 2.14634e+06
style loss: inf
tv loss: 0
total loss: inf

Elapsed time: 35.07869744300842
Traceback (most recent call last):
  File "neural_style.py", line 225, in <module>
    main(sys.argv[1:])
  File "neural_style.py", line 184, in main
    checkpoint_iterations=options.checkpoint_iterations
  File "/data/stylize.py", line 158, in stylize
    img_out = vgg.unprocess(best.reshape(shape[1:]), vgg_mean_pixel)
AttributeError: 'NoneType' object has no attribute 'reshape'
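One plausible reading of the inf style loss, not confirmed anywhere in this thread, is that some intermediate value exceeded the float16 range once nodes were converted. The sketch below only shows how small that range is, using the standard-library half-precision format; `fits_in_float16` is a hypothetical helper, and where the overflow actually happens in the graph is an assumption.

```python
import struct

FLOAT16_MAX = 65504.0  # largest finite IEEE 754 half-precision value


def fits_in_float16(value):
    """Return True if value can be packed as a finite half-precision float."""
    try:
        struct.pack("<e", value)  # "e" is the struct code for IEEE 754 binary16
        return True
    except OverflowError:
        return False


content_loss = 2.14634e6  # value printed in the log above

small_ok = fits_in_float16(1.0)
max_ok = fits_in_float16(FLOAT16_MAX)
content_loss_ok = fits_in_float16(content_loss)  # False: well beyond 65504
```

Even the content loss reported in the log is far above the float16 maximum, so any reduction of that magnitude has to be carried out in float32 to stay finite.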

evanfyk8i · April 13, 2019, 5:35am

As another data point I tested with this code:

I only let it run for about half an hour, out of a full training time of weeks. It actually ran marginally slower with AMP.

nha.tuan84 · October 4, 2021, 5:01am

Hi @carlc

I ran AMP using both Python and C++.
I found that the number of converted nodes differs between C++ and Python.
(More nodes are converted in Python than in C++, so AMP runs faster in Python than in C++.)
Why does this happen?

Thanks.
