Check failed: attr_def

sgambient · April 7, 2019, 8:06pm

I'm getting the following on the nvcr.io/nvidia/tensorflow:19.03-py2 container. How do I get more information about which variable is causing the problem?

2019-04-07 19:59:20.799730: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1704] Running auto_mixed_precision graph optimizer
2019-04-07 19:59:20.807122: F tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:820] Check failed: attr_def
Aborted (core dumped)

sgambient · April 10, 2019, 4:01am

Here is the stack trace if it helps in figuring out the problem.

Program terminated with signal SIGABRT, Aborted.
#0 0x00007f7a85ba2428 in __GI_raise (sig=sig@entry=6)
at …/sysdeps/unix/sysv/linux/raise.c:54
54 …/sysdeps/unix/sysv/linux/raise.c: No such file or directory.
[Current thread is 1 (Thread 0x7f7a8635e700 (LWP 29489))]
(gdb) bt
#0 0x00007f7a85ba2428 in __GI_raise (sig=sig@entry=6) at …/sysdeps/unix/sysv/linux/raise.c:54
#1 0x00007f7a85ba402a in __GI_abort () at abort.c:89
#2 0x00007f7a490748a4 in tensorflow::internal::LogMessageFatal::~LogMessageFatal() () from /usr/local/lib/python2.7/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#3 0x00007f7a48f4063b in tensorflow::grappler::(anonymous namespace)::AutoMixedPrecisionImpl::SupportsFloat16(tensorflow::grappler::(anonymous namespace)::NodeTypeId const&) const ()
from /usr/local/lib/python2.7/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#4 0x00007f7a48f4094f in std::_Function_handler<bool (int), tensorflow::grappler::(anonymous namespace)::AutoMixedPrecisionImpl::PropagateWhiteThroughClear(absl::flat_hash_set<int, absl::hash_interna
l::Hash, std::equal_to, std::allocator > const&, absl::flat_hash_set<int, absl::hash_internal::Hash, std::equal_to, std::allocator >) const::{lambda(int)#1}>::_M_invoke(
std::_Any_data const&, int&&) () from /usr/local/lib/python2.7/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#5 0x00007f7a48f2d779 in tensorflow::grappler::(anonymous namespace)::DfsTypeTraversal(tensorflow::grappler::(anonymous namespace)::GraphTypeTopologyView const&, absl::Span<tensorflow::grappler::(ano
nymous namespace)::NodeTypeId const const>, tensorflow::grappler::(anonymous namespace)::TypeTraversalDirection, tensorflow::grappler::(anonymous namespace)::DfsTypePredicates const&, tensorflow::gra
ppler::(anonymous namespace)::DfsTypeCallbacks const&) [clone .constprop.1539] () from /usr/local/lib/python2.7/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#6 0x00007f7a48f42e6b in tensorflow::grappler::(anonymous namespace)::AutoMixedPrecisionImpl::Optimize() [clone .constprop.1475] ()
from /usr/local/lib/python2.7/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#7 0x00007f7a48f457ff in tensorflow::grappler::AutoMixedPrecision::Optimize(tensorflow::grappler::Cluster*, tensorflow::grappler::GrapplerItem const&, tensorflow::GraphDef*) ()
from /usr/local/lib/python2.7/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#8 0x00007f7a48f1f195 in tensorflow::grappler::MetaOptimizer::RunOptimizer(tensorflow::grappler::GraphOptimizer*, tensorflow::grappler::Cluster*, tensorflow::grappler::GrapplerItem*, tensorflow::Gra$
hDef*, tensorflow::grappler::MetaOptimizer::GraphOptimizationResult*) () from /usr/local/lib/python2.7/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#9 0x00007f7a48f200b1 in tensorflow::grappler::MetaOptimizer::OptimizeGraph(tensorflow::grappler::Cluster*, tensorflow::grappler::GrapplerItem const&, tensorflow::GraphDef*) ()
from /usr/local/lib/python2.7/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#10 0x00007f7a48f2170a in tensorflow::grappler::MetaOptimizer::Optimize(tensorflow::grappler::Cluster*, tensorflow::grappler::GrapplerItem const&, tensorflow::GraphDef*) ()
from /usr/local/lib/python2.7/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#11 0x00007f7a48f23af2 in tensorflow::grappler::RunMetaOptimizer(tensorflow::grappler::GrapplerItem const&, tensorflow::ConfigProto const&, tensorflow::DeviceBase*, tensorflow::grappler::Cluster*, ten
sorflow::GraphDef*) () from /usr/local/lib/python2.7/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#12 0x00007f7a48f14f8e in tensorflow::GraphExecutionState::OptimizeGraph(tensorflow::BuildGraphOptions const&, std::unique_ptr<tensorflow::Graph, std::default_deletetensorflow::Graph >, std::unique
_ptr<tensorflow::FunctionLibraryDefinition, std::default_deletetensorflow::FunctionLibraryDefinition >) ()
from /usr/local/lib/python2.7/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#13 0x00007f7a48f16b0c in tensorflow::GraphExecutionState::BuildGraph(tensorflow::BuildGraphOptions const&, std::unique_ptr<tensorflow::ClientGraph, std::default_deletetensorflow::ClientGraph >)
() from /usr/local/lib/python2.7/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#14 0x00007f7a461803d2 in tensorflow::DirectSession::CreateGraphs(tensorflow::BuildGraphOptions const&, std::unordered_map<std::string, std::unique_ptr<tensorflow::Graph, std::default_delete >, std::hashstd::string, std::equal_tostd::string, std::allocator<std::pair<std::string const, std::unique_ptr<tensorflow::Graph, std::default_deletetensorflow::Graph > > > >, std::un
ique_ptr<tensorflow::FunctionLibraryDefinition, std::default_deletetensorflow::FunctionLibraryDefinition >, tensorflow::DirectSession::RunStateArgs, absl::InlinedVector<tensorflow::DataType, 4ul,
std::allocatortensorflow::DataType >, absl::InlinedVector<tensorflow::DataType, 4ul, std::allocatortensorflow::DataType >, long long*) ()
from /usr/local/lib/python2.7/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#15 0x00007f7a46181b7c in tensorflow::DirectSession::CreateExecutors(tensorflow::CallableOptions const&, std::unique_ptr<tensorflow::DirectSession::ExecutorsAndKeys, std::default_delete<tensorflow::Di
rectSession::ExecutorsAndKeys> >, std::unique_ptr<tensorflow::DirectSession::FunctionInfo, std::default_deletetensorflow::DirectSession::FunctionInfo >, tensorflow::DirectSession::RunStateArgs*) (
) from /usr/local/lib/python2.7/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#16 0x00007f7a46183e48 in tensorflow::DirectSession::GetOrCreateExecutors(absl::Span<std::string const>, absl::Span<std::string const>, absl::Span<std::string const>, tensorflow::DirectSession::Execut
orsAndKeys**, tensorflow::DirectSession::RunStateArgs*) () from /usr/local/lib/python2.7/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#17 0x00007f7a461854a4 in tensorflow::DirectSession::Run(tensorflow::RunOptions const&, std::vector<std::pair<std::string, tensorflow::Tensor>, std::allocator<std::pair<std::string, tensorflow::Tensor

const&, std::vector<std::string, std::allocatorstd::string > const&, std::vector<std::string, std::allocatorstd::string > const&, std::vector<tensorflow::Tensor, std::allocator<tensorflow::T
ensor> >, tensorflow::RunMetadata) () from /usr/local/lib/python2.7/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#18 0x00007f7a439c7e15 in tensorflow::SessionRef::Run(tensorflow::RunOptions const&, std::vector<std::pair<std::string, tensorflow::Tensor>, std::allocator<std::pair<std::string, tensorflow::Tensor> >
const&, std::vector<std::string, std::allocatorstd::string > const&, std::vector<std::string, std::allocatorstd::string > const&, std::vector<tensorflow::Tensor, std::allocator<tensorflow::Tens
or> >, tensorflow::RunMetadata) () from /usr/local/lib/python2.7/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#19 0x00007f7a43bd5daf in TF_Run_Helper(tensorflow::Session*, char const*, TF_Buffer const*, std::vector<std::pair<std::string, tensorflow::Tensor>, std::allocator<std::pair<std::string, tensorflow::T
ensor> > > const&, std::vector<std::string, std::allocatorstd::string > const&, TF_Tensor**, std::vector<std::string, std::allocatorstd::string > const&, TF_Buffer*, TF_Status*) [clone .constprop.
654] () from /usr/local/lib/python2.7/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#20 0x00007f7a43bd6629 in TF_SessionRun () from /usr/local/lib/python2.7/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#21 0x00007f7a439c3251 in tensorflow::TF_SessionRun_wrapper_helper(TF_Session*, char const*, TF_Buffer const*, std::vector<TF_Output, std::allocator<TF_Output> > const&, std::vector<_object*, std::all
ocator<_object*> > const&, std::vector<TF_Output, std::allocator<TF_Output> > const&, std::vector<TF_Operation*, std::allocator<TF_Operation*> > const&, TF_Buffer*, TF_Status*, std::vector<_object*, s
td::allocator<_object*> >) () from /usr/local/lib/python2.7/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#22 0x00007f7a439c3332 in tensorflow::TF_SessionRun_wrapper(TF_Session, TF_Buffer const*, std::vector<TF_Output, std::allocator<TF_Output> > const&, std::vector<_object*, std::allocator<_object*> > c
onst&, std::vector<TF_Output, std::allocator<TF_Output> > const&, std::vector<TF_Operation*, std::allocator<TF_Operation*> > const&, TF_Buffer*, TF_Status*, std::vector<_object*, std::allocator<_objec
t*> >*) () from /usr/local/lib/python2.7/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#23 0x00007f7a4397788c in _wrap_TF_SessionRun_wrapper () from /usr/local/lib/python2.7/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#24 0x00000000004bc4aa in PyEval_EvalFrameEx ()
#25 0x00000000004b9b66 in PyEval_EvalCodeEx ()
#26 0x00000000004c1f56 in PyEval_EvalFrameEx ()
#27 0x00000000004b9b66 in PyEval_EvalCodeEx ()
#28 0x00000000004d5669 in ?? ()
#29 0x00000000004a587e in PyObject_Call ()
#30 0x00000000004be51e in PyEval_EvalFrameEx ()
#31 0x00000000004b9b66 in PyEval_EvalCodeEx ()
#32 0x00000000004c1f56 in PyEval_EvalFrameEx ()
#33 0x00000000004b9b66 in PyEval_EvalCodeEx ()
#34 0x00000000004c1f56 in PyEval_EvalFrameEx ()
#35 0x00000000004b9b66 in PyEval_EvalCodeEx ()
#36 0x00000000004c1f56 in PyEval_EvalFrameEx ()
#37 0x00000000004b9b66 in PyEval_EvalCodeEx ()
#38 0x00000000004c17c6 in PyEval_EvalFrameEx ()
#39 0x00000000004b9b66 in PyEval_EvalCodeEx ()
#40 0x00000000004c17c6 in PyEval_EvalFrameEx ()
#41 0x00000000004b9b66 in PyEval_EvalCodeEx ()
#42 0x00000000004c17c6 in PyEval_EvalFrameEx ()
#43 0x00000000004b9b66 in PyEval_EvalCodeEx ()
#44 0x00000000004c17c6 in PyEval_EvalFrameEx ()
#45 0x00000000004b9b66 in PyEval_EvalCodeEx ()
#46 0x00000000004eb69f in ?? ()
#47 0x00000000004e58f2 in PyRun_FileExFlags ()
#48 0x00000000004e41a6 in PyRun_SimpleFileExFlags ()
#49 0x00000000004938ce in Py_Main ()
#50 0x00007f7a85b8d830 in __libc_start_main (main=0x493370 , argc=12, argv=0x7ffd6b701f68, init=, fini=, rtld_fini=, stack_end=0x7ffd6b701f58)
at …/csu/libc-start.c:291
#51 0x0000000000493299 in _start ()

carlc · April 10, 2019, 7:03pm

Hi sgambient,

This is an interesting failure – I don’t think we’ve seen it before. Are you able to provide the Python TensorFlow model script to reproduce it? Also, do you know if you are using any custom TF ops? (That is, ones not built into TensorFlow.)

sgambient · April 10, 2019, 8:11pm

It is happening in part of a larger network. Yes, there are custom ops. I would like to isolate the part that is causing this problem and provide you that code. Is there a way to selectively disable AMP for some op or layer?

carlc · April 10, 2019, 10:54pm

If it were isolated to a small amount of code that could reproduce the issue, that would be ideal! My guess is that there is a custom op that is triggering the issue, so if you have the op declarations for the custom ops, that might give us some ideas.

There are some knobs to manipulate the type handling of AMP, but this is a more basic issue – it’s looking at an OpDef, and an expected invariant isn’t being followed. (You can see the logic in the pull request to TensorFlow here: https://github.com/tensorflow/tensorflow/blob/cdb7db55568626e215700bb365065d53d1d7fff8/tensorflow/core/grappler/optimizers/auto_mixed_precision.cc#L833).

michael4e2ca · April 11, 2019, 3:33pm

Hi,
I have the same issue now and don’t have custom ops.
I opened another issue though…
Please see
https://devtalk.nvidia.com/default/topic/1049896/container-tensorflow/amp-error-from-tensorflow/post/5328684/#5328684

BR

sgambient · April 11, 2019, 4:11pm

My current suspicion is that a tf.map_fn() is causing this.

carlc · April 11, 2019, 4:53pm

Hi sgambient – could you say a little more? In particular, what happens in the relevant map_fn? Afaik, the map_fn construct is lowered by TensorFlow into a while_loop structure, which shouldn’t have any issues on its own with TF-AMP (I just ran a couple quick tests), but that’s not saying that much.

carlc · April 11, 2019, 8:35pm

One idea that would be helpful for debugging: if you set the environment variable TF_AUTO_MIXED_PRECISION_GRAPH_REWRITE_LOG_PATH=/some/directory, then the graph rewrite pass will save out the graph structure both pre- and post-AMP optimization for every graph that it processes. If you are able to provide the pre-optimization graph on which it crashes, we can look more closely at how this particular CHECK assertion gets hit.

In detail, you’d want to:

1. export TF_AUTO_MIXED_PRECISION_GRAPH_REWRITE_LOG_PATH=<some directory path that is writeable>
2. Run your TF script.
3. After it exits, there should be a number of files with names like “graphdef_preop_tf_graph_TIMESTAMP.pb.txt”.
   - For graphs that don’t crash, there will be both “graphdef_preop…” and “graphdef_AutoMixedPrecision…”, corresponding to the pre- and post-optimization graphs.
   - For the graph that does crash, there should be only the graphdef_preop version.

If you are willing and able to share the graphdef_preop…pb.txt graph (this will contain a string-serialized GraphDef protobuf), we should be able to easily reproduce on our side.

Thanks!
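
The steps above can be sketched as a small helper that looks through the dump directory for a pre-optimization graph with no later post-optimization dump. This is only a sketch under stated assumptions: the function name `find_unmatched_preop` is hypothetical, it relies on file modification times rather than on any documented pairing of file names, and it treats any file whose name contains "AutoMixedPrecision" as a post-optimization dump.

```python
import os


def find_unmatched_preop(dump_dir):
    """Return the newest 'graphdef_preop' file with no newer post-AMP dump, or None.

    Heuristic only (an assumption, not documented behavior): the process
    aborts right after writing the pre-op dump of the failing graph, so that
    file should be the newest pre-op dump and nothing post-AMP should follow it.
    """
    entries = []
    for name in os.listdir(dump_dir):
        path = os.path.join(dump_dir, name)
        if os.path.isfile(path):
            entries.append((os.path.getmtime(path), name))
    entries.sort()

    # File-name prefix taken from the thread; post-AMP dumps are matched loosely.
    preops = [(t, n) for t, n in entries if n.startswith("graphdef_preop")]
    postops = [(t, n) for t, n in entries if "AutoMixedPrecision" in n]

    if not preops:
        return None

    last_pre_time, last_pre_name = preops[-1]
    # A post-AMP dump at or after the newest pre-op dump suggests that graph
    # made it through the optimizer.
    if any(t >= last_pre_time for t, _ in postops):
        return None
    return last_pre_name
```

To use it, set `TF_AUTO_MIXED_PRECISION_GRAPH_REWRITE_LOG_PATH` to a writeable directory before running the script, let the script run (and crash), then call `find_unmatched_preop` on that directory. If no post-optimization files appear at all, the very first graph is the likely candidate.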

sgambient · April 11, 2019, 8:55pm

Hi, sharing the graph is not going to be possible in this public forum for me. If you can let me know what to look for, I will be glad to convey the output.

michael4e2ca · April 14, 2019, 7:47am

Hi,
I did what you said. I get several “graphdef_preop…” files but no “graphdef_AutoMixedPrecision…” files at all.
BR,

sgambient · April 15, 2019, 8:31am

I did some more experimentation and identified the network. However, when I run the exact same network standalone, there is no problem. The only differences I see in the graphs (e.g. with import_pb_to_tensorboard) are that the standalone one has 130 nodes (vs. 128 in the complex one), and the name suffixes are lower by 1 (e.g. Shape vs. Shape_1, etc.).

Is there a way to print out the name of the op, etc., that is causing the error?
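
One partial approach, when the graph itself cannot be shared, is to summarize which op types appear in a dumped graphdef_preop…pb.txt file. The sketch below does not identify the node that trips the CHECK; it only counts op types so that custom or unusual ops can be spotted and their names passed along. It assumes the dump is in protobuf text format, with each node carrying a line of the form `op: "SomeType"`. The helper names `count_op_types` and `summarize_ops` are hypothetical.

```python
import re
from collections import Counter

# Matches lines such as:   op: "MatMul"
_OP_LINE = re.compile(r'^\s*op:\s*"([^"]+)"', re.MULTILINE)


def count_op_types(pbtxt_text):
    """Count how often each op type appears in a text-format GraphDef string."""
    return Counter(_OP_LINE.findall(pbtxt_text))


def summarize_ops(path):
    """Read a .pb.txt dump from disk and return its op-type counts."""
    with open(path) as f:
        return count_op_types(f.read())
```

Printing `sorted(summarize_ops(path).items())` for the pre-optimization dump that has no post-optimization counterpart gives a list of op types that can be posted without revealing the network structure.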

sgambient · April 19, 2019, 8:34pm

I am not able to reproduce this in a smaller standalone network.

carlc · April 19, 2019, 9:13pm

Hey all,

We believe that we’ve figured out the issue internally – it’s a bug that was caught during code review while moving auto mixed precision into upstream TensorFlow: “Mixed precision Grappler optimizer” by MattConley, Pull Request #26342 in tensorflow/tensorflow on GitHub.

That fix didn’t make it into the 19.03 container release, but it will be in 19.04 – which should be publicly available in ~a week (can’t quite give an exact date, depends on testing.)

We’ll update this thread as soon as the updated container is available with the bug fix and then check if it’s, well, actually fixed for you guys.

-Carl

carlc · April 24, 2019, 3:22pm

Hi all,

The 19.04 container went up yesterday – could you try to run with nvcr.io/nvidia/tensorflow:19.04-py2 (or py3, if you prefer) and let us know if the attr_def failure is still triggered?

Thanks,
-Carl

sgambient · April 24, 2019, 7:46pm

Seems to be gone. Thanks. A simple follow-on question, though: I noticed that it takes a few iterations before the end-to-end execution latency settles down to a lower value. Is that expected? If so, what is the cause?

carlc · April 24, 2019, 7:56pm

Great, glad to hear it!

That the first few iterations would behave differently (speed-wise) is pretty normal. Does your model use convolutions of any kind? If so, my best guess is that you’re seeing the effect of convolution algorithm auto-tuning. Especially for FP16, there are quite a few possible kernel algorithms to choose from to implement the same logical convolution. TensorFlow auto-tunes across those algorithms, and it likes to spread out the overhead from doing so across the first few iterations of training (as opposed to just trying them all up front).
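
To see this warm-up effect in numbers, one option is to time each iteration and report the first few separately from the rest. The sketch below is framework-agnostic and makes some assumptions: `time_iterations` is a hypothetical helper, `step_fn` stands for whatever runs one training or inference step (for example a wrapper around a session run call), and the default of five warm-up iterations is arbitrary.

```python
from timeit import default_timer


def time_iterations(step_fn, num_iters, warmup=5):
    """Time each call to step_fn; return (warm-up durations, steady-state mean).

    The first `warmup` durations are returned separately so that one-time
    costs such as algorithm auto-tuning do not skew the steady-state average.
    The mean is None when there are no iterations past the warm-up phase.
    """
    durations = []
    for _ in range(num_iters):
        start = default_timer()
        step_fn()
        durations.append(default_timer() - start)

    steady = durations[warmup:]
    steady_mean = sum(steady) / len(steady) if steady else None
    return durations[:warmup], steady_mean
```

If the warm-up durations are noticeably larger than the steady-state mean and then drop off, that is consistent with the auto-tuning explanation, though other one-time costs (graph construction, memory allocation) can contribute as well.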

michael4e2ca · April 29, 2019, 11:25am

Hi,
I tried with the new container and it works fine.
Thanks