TensorFlow Issue - 'NonMaxSuppressionV3' in binary

Hi guys,
I’m facing an issue concerning the use of TensorFlow on the Jetson TX2.

After re-flashing the Jetson, I’m now using virtualenv and virtualenvwrapper to isolate all the libraries and avoid conflicts.

I managed to install all the necessary libraries, but when I run a script that loads a .pb file generated with TensorFlow 1.12, I get two distinct errors depending on whether I use TF 1.8 or TF 1.9/1.10:

With TF 1.8:

GPU is available!
[_DeviceAttributes(/job:localhost/replica:0/task:0/device:CPU:0, CPU, 268435456), _DeviceAttributes(/job:localhost/replica:0/task:0/device:GPU:0, GPU, 112689152)]
Model path: trained_models/mask_rcnn_plantule_V0_epoch5.pb
<BEGIN Loading Graph>
Traceback (most recent call last):
  File "detect_instances.py", line 596, in <module>
    main(sys.argv)
  File "detect_instances.py", line 440, in main
    tf.import_graph_def(graph_def, name="")
  File "/home/nvidia/PythonEnv/InstallationEnv/lib/python3.5/site-packages/tensorflow/python/util/deprecation.py", line 432, in new_func
    return func(*args, **kwargs)
  File "/home/nvidia/PythonEnv/InstallationEnv/lib/python3.5/site-packages/tensorflow/python/framework/importer.py", line 489, in import_graph_def
    graph._c_graph, serialized, options)  # pylint: disable=protected-access
tensorflow.python.framework.errors_impl.NotFoundError: Op type not registered 'NonMaxSuppressionV3' in binary running on tegra-ubuntu. Make sure the Op and Kernel are registered in the binary running in this process.

With TF 1.10 (and TF 1.9):

GPU is available!
[_DeviceAttributes(/job:localhost/replica:0/task:0/device:CPU:0, CPU, 268435456), _DeviceAttributes(/job:localhost/replica:0/task:0/device:GPU:0, GPU, 1205403648)]
Model path: trained_models/mask_rcnn_plantule_V0_epoch5.pb
<BEGIN Loading Graph>
Traceback (most recent call last):
  File "/home/nvidia/PythonEnv/InstallationEnv/lib/python3.5/site-packages/tensorflow/python/framework/importer.py", line 418, in import_graph_def
    graph._c_graph, serialized, options)  # pylint: disable=protected-access
tensorflow.python.framework.errors_impl.InvalidArgumentError: NodeDef mentions attr 'T' not in Op<name=NonMaxSuppressionV3; signature=boxes:float, scores:float, max_output_size:int32, iou_threshold:float, score_threshold:float -> selected_indices:int32>; NodeDef: ROI_1/rpn_non_max_suppression/NonMaxSuppressionV3 = NonMaxSuppressionV3[T=DT_FLOAT](ROI_1/strided_slice_21, ROI_1/strided_slice_22, ROI_1/rpn_non_max_suppression/NonMaxSuppressionV3/max_output_size, ROI_1/rpn_non_max_suppression/iou_threshold, ROI_1/rpn_non_max_suppression/score_threshold). (Check whether your GraphDef-interpreting binary is up to date with your GraphDef-generating binary.).

I’ve been searching the forums for a while and can’t find a solution. My only lead is that the .pb was generated with a higher version of TF than the one I’m using for inference… what do you think?
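
One way to test that hypothesis is to compare the ops stored in the .pb with what the installed TensorFlow binary knows. The following is a minimal sketch, assuming TF 1.x. `find_incompatibilities` and `inspect_frozen_graph` are hypothetical helper names, and `op_def_registry.get_registered_ops()` is an internal TF 1.x call that may differ between releases. It reports both kinds of failure shown above: an op type missing from the binary, and a node attribute (such as 'T') that the binary's op definition does not list.

```python
def find_incompatibilities(nodes, registered):
    """Compare graph nodes against the ops known to the running binary.

    nodes:      iterable of (node_name, op_type, attr_names)
    registered: dict mapping op_type -> set of attr names the binary knows
    """
    problems = []
    for node_name, op_type, attr_names in nodes:
        if op_type not in registered:
            problems.append((node_name, op_type, "op not registered"))
            continue
        # Assumption: attrs starting with "_" are internal annotations
        # that an importer tolerates, so they are skipped here.
        unknown = sorted(
            a for a in attr_names
            if not a.startswith("_") and a not in registered[op_type]
        )
        if unknown:
            problems.append(
                (node_name, op_type, "unknown attrs: " + ", ".join(unknown))
            )
    return problems


def inspect_frozen_graph(pb_path):
    """Load a frozen .pb and list nodes the installed TF cannot import."""
    # Imported lazily so the pure helper above works without TensorFlow.
    import tensorflow as tf
    from tensorflow.python.framework import op_def_registry  # TF 1.x internal

    graph_def = tf.GraphDef()
    with open(pb_path, "rb") as f:
        graph_def.ParseFromString(f.read())

    print("TensorFlow version:", tf.__version__)
    print("GraphDef producer version:", graph_def.versions.producer)

    registered = {
        name: {attr.name for attr in op_def.attr}
        for name, op_def in op_def_registry.get_registered_ops().items()
    }
    nodes = [(n.name, n.op, list(n.attr.keys())) for n in graph_def.node]
    return find_incompatibilities(nodes, registered)
```

Running `inspect_frozen_graph` on the model file inside the Jetson virtualenv would list the offending nodes, if any, before calling `tf.import_graph_def`.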

For information, I installed the different versions of TF from this thread:

https://devtalk.nvidia.com/default/topic/1031300/jetson-tx2/tensorflow-1-8-wheel-with-jetpack-3-2-/

Thank you in advance

Hi,

To find out where the error comes from, could you first run your script in pure CPU mode?

import tensorflow as tf

# Hide the GPU from TensorFlow so that every op runs on the CPU.
config = tf.ConfigProto(
    device_count={'GPU': 0}
)
sess = tf.Session(config=config)

Thanks.

Hi AastaLLL,

I tried your modification in my script and I still get the same error… (the one with TF 1.10)

Hi guys,

We finally identified the problem, and apparently it comes from the TensorFlow version. My model was trained with TF 1.12, and its graph can’t be loaded for inference by a lower version…

We quickly re-trained a small model with TF 1.9 as a test and it worked, so we decided to re-train the full model with TF 1.9 on the whole database and we’ll see.

Quick questions:

  • Is there a way to convert a graph from one version of TF to another (to avoid retraining)?
  • How can I update cuDNN (libcudnn7-dev)? On the first try the Jetson TX2 told me I was using version 7.0.5 and that 7.1.5 was required… although when I retried, the error disappeared.

Thanks in advance

Hi,

I suppose you can convert the model into a .pb file to solve this kind of problem.

Check this tutorial for more information:
https://blog.metaflow.fr/tensorflow-how-to-freeze-a-model-and-serve-it-with-a-python-api-d4f3596b3adc

Thanks.

Hi,

My question was more about whether it’s possible to convert a .pb file built with one version of TF into another .pb file that suits a different version of TF.

Another question: I have the .h5 file (built with TF 1.10). Can I convert it to .pb using my version of TF (1.9), and will it then work with TF 1.9?
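
For reference, re-exporting from the .h5 under the installed TF version could look roughly like the sketch below. It is not confirmed anywhere in this thread; it assumes TF 1.x with standalone Keras, and assumes the model has already been rebuilt in Python and its weights loaded from the .h5 file. `node_name` and `freeze_keras_model` are hypothetical helper names. The idea is that the graph ops are then written by the same TF version that will read them.

```python
def node_name(tensor_name):
    """Turn a tensor name such as 'scope/op:0' into its node name 'scope/op'."""
    return tensor_name.rsplit(":", 1)[0]


def freeze_keras_model(model, pb_path):
    """Freeze an already-loaded Keras model into a .pb file (TF 1.x sketch)."""
    # Imported lazily so the helper above works without TensorFlow/Keras.
    import tensorflow as tf
    from keras import backend as K

    sess = K.get_session()
    output_names = [node_name(t.name) for t in model.outputs]

    # Replace variables with constants holding their current values.
    frozen = tf.graph_util.convert_variables_to_constants(
        sess, sess.graph.as_graph_def(), output_names
    )
    with tf.gfile.GFile(pb_path, "wb") as f:
        f.write(frozen.SerializeToString())
    return output_names
```

Whether the .h5 loads cleanly under the older TF/Keras pair still has to be tried, since the weights file itself may rely on layers or options that the older version lacks.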

I also detected another issue: the cuDNN library isn’t at the right version, so I’m not even sure the conversion would work…

I’ll probably just retrain the whole thing, as 5 epochs is a short training run.

Thanks for your help

Hi guys,

I’ve tried for several days to launch the training, but I get several errors that I can’t find a solution for.

The first one is a warning:

/home/nvidia/.virtualenvs/cv/lib/python3.5/site-packages/tensorflow/python/ops/gradients_impl.py:108: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
/home/nvidia/.virtualenvs/cv/lib/python3.5/site-packages/keras/engine/training_generator.py:47: UserWarning: Using a generator with `use_multiprocessing=True` and multiple workers may duplicate your data. Please consider using the`keras.utils.Sequence class.
  UserWarning('Using a generator with `use_multiprocessing=True`'

I vaguely understand that some code will use a large amount of memory, and I suppose the second part about “use_multiprocessing=True” concerns a Keras class, so nothing to do with the Jetson itself really.

The second error is more concerning:

Exception in thread Thread-4:
Traceback (most recent call last):
  File "/usr/lib/python3.5/threading.py", line 914, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.5/threading.py", line 862, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.5/multiprocessing/pool.py", line 366, in _handle_workers
    pool._maintain_pool()
  File "/usr/lib/python3.5/multiprocessing/pool.py", line 240, in _maintain_pool
    self._repopulate_pool()
  File "/usr/lib/python3.5/multiprocessing/pool.py", line 233, in _repopulate_pool
    w.start()
  File "/usr/lib/python3.5/multiprocessing/process.py", line 105, in start
    self._popen = self._Popen(self)
  File "/usr/lib/python3.5/multiprocessing/context.py", line 267, in _Popen
    return Popen(process_obj)
  File "/usr/lib/python3.5/multiprocessing/popen_fork.py", line 20, in __init__
    self._launch(process_obj)
  File "/usr/lib/python3.5/multiprocessing/popen_fork.py", line 67, in _launch
    self.pid = os.fork()
OSError: [Errno 12] Cannot allocate memory

Exception in thread Thread-6:
Traceback (most recent call last):
  File "/usr/lib/python3.5/threading.py", line 914, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.5/threading.py", line 862, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.5/multiprocessing/pool.py", line 366, in _handle_workers
    pool._maintain_pool()
  File "/usr/lib/python3.5/multiprocessing/pool.py", line 240, in _maintain_pool
    self._repopulate_pool()
  File "/usr/lib/python3.5/multiprocessing/pool.py", line 233, in _repopulate_pool
    w.start()
  File "/usr/lib/python3.5/multiprocessing/process.py", line 105, in start
    self._popen = self._Popen(self)
  File "/usr/lib/python3.5/multiprocessing/context.py", line 267, in _Popen
    return Popen(process_obj)
  File "/usr/lib/python3.5/multiprocessing/popen_fork.py", line 20, in __init__
    self._launch(process_obj)
  File "/usr/lib/python3.5/multiprocessing/popen_fork.py", line 67, in _launch
    self.pid = os.fork()
OSError: [Errno 12] Cannot allocate memory

From what I can see, it points to a memory issue, but my TX2 has 8 GB of RAM and usually has between 500 MB and 1.5 GB already in use before I launch the training.
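
The `os.fork()` calls in those tracebacks come from a `multiprocessing` pool, most likely the one Keras starts when `use_multiprocessing=True` is set, as the earlier warning suggests. A fork can fail with Errno 12 when the kernel cannot commit memory for another copy of a large process, even if some RAM looks free. The sketch below, a rough idea rather than a tested fix, reads `/proc/meminfo` and falls back to a single worker without multiprocessing when little memory is available. `parse_meminfo` and `generator_kwargs` are hypothetical helpers, and the 1 GiB-per-worker threshold is an arbitrary assumption.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of {field: kilobytes}."""
    info = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            info[key.strip()] = int(parts[0])
    return info


def generator_kwargs(mem_available_kb,
                     min_kb_per_worker=1024 * 1024,  # assumed budget: 1 GiB
                     max_workers=4):
    """Choose `workers` / `use_multiprocessing` from the available memory."""
    workers = max(1, min(max_workers, mem_available_kb // min_kb_per_worker))
    return {"workers": workers, "use_multiprocessing": workers > 1}


def current_generator_kwargs(path="/proc/meminfo"):
    """Read the live memory figures (Linux only) and pick the arguments."""
    with open(path) as f:
        info = parse_meminfo(f.read())
    return generator_kwargs(info.get("MemAvailable", 0))
```

The returned dict can be passed as keyword arguments to Keras' `fit_generator`, which accepts both `workers` and `use_multiprocessing`.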

I get a final message:

2019-01-29 09:58:46.817619: E tensorflow/stream_executor/cuda/cuda_dnn.cc:342] Loaded runtime CuDNN library: 7.0.5 but source was compiled with: 7.1.5.  CuDNN library major and minor version needs to match or have higher minor version in case of CuDNN 7.0 or later version. If using a binary install, upgrade your CuDNN library.  If building from sources, make sure the library loaded at runtime is compatible with the version specified during compile configuration.
Segmentation fault (core dumped)

It mentions cuDNN, but since I’m using JetPack it’s not possible to use a version other than 7.0.5. Is it a TensorFlow version issue again?
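
The rule quoted in that log message can be written out as a small check. This is only a sketch of the rule as the message words it, not TensorFlow's or cuDNN's actual implementation, and `cudnn_compatible` is a hypothetical name.

```python
def cudnn_compatible(loaded, compiled):
    """Apply the version rule from the TensorFlow log message.

    loaded, compiled: (major, minor, patch) tuples.
    Major versions must match. For cuDNN 7.0 or later the loaded minor
    version may be equal or higher; before 7.0 it must be equal.
    """
    loaded_major, loaded_minor = loaded[0], loaded[1]
    compiled_major, compiled_minor = compiled[0], compiled[1]
    if loaded_major != compiled_major:
        return False
    if compiled_major >= 7:
        return loaded_minor >= compiled_minor
    return loaded_minor == compiled_minor


# The situation from the log: runtime 7.0.5, wheel compiled against 7.1.5.
print(cudnn_compatible((7, 0, 5), (7, 1, 5)))  # False
```

So a wheel compiled against cuDNN 7.1 cannot run on the 7.0.5 runtime, whereas a wheel compiled against 7.0.x would satisfy the rule.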

Thank you for your help

Some news !

I finally managed to re-train my network, but using TensorFlow 1.9 for CPU (with TensorBoard 1.8) and not the tensorflow-gpu library, which caused the errors.

I have a couple of questions though:

  • Is tensorflow-gpu alone enough to train on the TX2, or should I have tensorflow (CPU) too?
  • Is there a more recent tensorflow-gpu version available for the TX2 on JetPack 3.2? So far I’ve tried 1.9 but I can’t find a more recent one (I can go up to 1.10.1, but that’s the CPU version).

My guess is that the cuDNN library is the real issue. If I’m not wrong, I can’t update cuDNN while staying on JetPack 3.2, right?

I’ll ask my colleague to change the library dependencies to fit my version.

If anyone has ideas about the errors above, I’d still welcome any thoughts.

Thanks

Hi,

Sorry for the late reply and thanks for keeping us updated.

1) It’s not recommended to use Jetson for training.
The hardware is designed for fast inference and is not well suited to training because of the large data bandwidth that training requires.

2) We have released only ONE official TensorFlow package for JetPack 3.3:
https://devtalk.nvidia.com/default/topic/1038957/jetson-tx2/tensorflow-for-jetson-tx2-/
There is currently no plan to update the TX2 TensorFlow package.

As an alternative, you can build TensorFlow on your own.
But please remember that all the CUDA-related packages (e.g. CUDA, cuDNN, TensorRT) depend on the GPU driver.
So you need to use the packages and the OS from the same JetPack installer.

Thanks.

Ok,
thank you AastaLLL for all the details.

A small question about the post-training process: once I have my .pb file, what’s the procedure for using TensorRT and the fast inference capabilities of the Jetson?

Thanks again

Hi,

It’s recommended to convert it into a TensorRT PLAN for better performance.
You can check this tutorial for the conversion steps:
https://github.com/NVIDIA-AI-IOT/tf_to_trt_image_classification

Thanks.

Hi,
thank you for the link.

I have a general question about the tensorflow-gpu library.

As you told me the Jetson isn’t really suited to training neural networks, I quickly trained a new network on my host machine using TF 1.9.
My issue is that when I tried to use this network with TF-gpu (same version, 1.9), it doesn’t work…

So I really wonder what the point of TF-gpu is, if it’s usable only when you train a network with it but the Jetson isn’t made for training…

Anyway, my real question is: can I use TF 1.9 or higher on the Jetson (the non-GPU version, I mean) with TensorRT without losing performance after optimization by TensorRT? (As long as I train the network with the same version of TF that I’ll use on the Jetson for inference.)
In other words: what’s the benefit of TF-gpu compared with the classic TF?

Thanks again

Hi,

Could you share the error log you get with us?

TF-gpu and TF have almost the same interface and functionality; they differ only in the low-level hardware implementation.
In general, the user doesn’t need to do any special handling.

But if you want to use TF-TRT, please check whether your package was built with TensorRT support.
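
A quick way to run that check is simply to try importing the TF-TRT module. The sketch below assumes a TF 1.x wheel, where TF-TRT lives in `tensorflow.contrib.tensorrt`; `can_import` is a hypothetical helper. A failed import is only a hint that the wheel lacks TensorRT support, since a missing library path can cause the same failure.

```python
import importlib


def can_import(module_name):
    """Return True if the named module imports without raising."""
    try:
        importlib.import_module(module_name)
    except Exception:
        # ImportError, or any error raised while the module's native
        # libraries are being loaded at import time.
        return False
    return True


if __name__ == "__main__":
    # Assumption: TF 1.x layout for the TF-TRT integration.
    print("TF-TRT importable:", can_import("tensorflow.contrib.tensorrt"))
```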
Thanks.

Hi,
here is the error log I get when I run a script with TF-gpu to detect targets in several images:

GPU is available!
_DeviceAttributes(/job:localhost/replica:0/task:0/device:CPU:0, CPU, 268435456), _DeviceAttributes(/job:localhost/replica:0/task:0/device:GPU:0, GPU, 4281491456)
Model path: trained_models/plantule_epoch1.pb
<BEGIN Loading Graph>
<END Loading Graph>
<BEGIN DETECTION IN IMAGES>
Device mapping:
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2
FILENAME:
---------
plantule1.jpg
zero_padding2d_3/Pad: (Pad): /job:localhost/replica:0/task:0/device:GPU:0
conv1_2/kernel/read: (Identity): /job:localhost/replica:0/task:0/device:GPU:0
conv1_2/bias/read: (Identity): /job:localhost/replica:0/task:0/device:GPU:0
conv1_2/convolution: (Conv2D): /job:localhost/replica:0/task:0/device:GPU:0
conv1_2/BiasAdd: (BiasAdd): /job:localhost/replica:0/task:0/device:GPU:0
bn_conv1_2/gamma/read: (Identity): /job:localhost/replica:0/task:0/device:GPU:0
bn_conv1_2/beta/read: (Identity): /job:localhost/replica:0/task:0/device:GPU:0
bn_conv1_2/moving_mean/read: (Identity): /job:localhost/replica:0/task:0/device:GPU:0
bn_conv1_2/moving_variance/read: (Identity): /job:localhost/replica:0/task:0/device:GPU:0
bn_conv1_2/FusedBatchNorm: (FusedBatchNorm): /job:localhost/replica:0/task:0/device:GPU:0
activation_81/Relu: (Relu): /job:localhost/replica:0/task:0/device:GPU:0
max_pooling2d_3/MaxPool: (MaxPool): /job:localhost/replica:0/task:0/device:GPU:0
res2a_branch2a_2/kernel/read: (Identity): /job:localhost/replica:0/task:0/device:GPU:0
res2a_branch2a_2/bias/read: (Identity): /job:localhost/replica:0/task:0/device:GPU:0
res2a_branch2a_2/convolution: (Conv2D): /job:localhost/replica:0/task:0/device:GPU:0
res2a_branch2a_2/BiasAdd: (BiasAdd): /job:localhost/replica:0/task:0/device:GPU:0
bn2a_branch2a_2/gamma/read: (Identity): /job:localhost/replica:0/task:0/device:GPU:0
bn2a_branch2a_2/beta/read: (Identity): /job:localhost/replica:0/task:0/device:GPU:0
bn2a_branch2a_2/moving_mean/read: (Identity): /job:localhost/replica:0/task:0/device:GPU:0
bn2a_branch2a_2/moving_variance/read: (Identity): /job:localhost/replica:0/task:0/device:GPU:0
bn2a_branch2a_2/FusedBatchNorm: (FusedBatchNorm): /job:localhost/replica:0/task:0/device:GPU:0
activation_82/Relu: (Relu): /job:localhost/replica:0/task:0/device:GPU:0
...
mrcnn_mask_deconv_2/strided_slice/stack: (Const): /job:localhost/replica:0/task:0/device:GPU:0
mrcnn_mask_deconv_2/strided_slice/stack_1: (Const): /job:localhost/replica:0/task:0/device:GPU:0
mrcnn_mask_deconv_2/strided_slice/stack_2: (Const): /job:localhost/replica:0/task:0/device:GPU:0
mrcnn_mask_deconv_2/strided_slice_1/stack: (Const): /job:localhost/replica:0/task:0/device:GPU:0
mrcnn_mask_deconv_2/strided_slice_1/stack_1: (Const): /job:localhost/replica:0/task:0/device:GPU:0
mrcnn_mask_deconv_2/strided_slice_1/stack_2: (Const): /job:localhost/replica:0/task:0/device:GPU:0
mrcnn_mask_deconv_2/strided_slice_2/stack: (Const): /job:localhost/replica:0/task:0/device:GPU:0
mrcnn_mask_deconv_2/strided_slice_2/stack_1: (Const): /job:localhost/replica:0/task:0/device:GPU:0
mrcnn_mask_deconv_2/strided_slice_2/stack_2: (Const): /job:localhost/replica:0/task:0/device:GPU:0
mrcnn_mask_deconv_2/mul/y: (Const): /job:localhost/replica:0/task:0/device:GPU:0
mrcnn_mask_deconv_2/add/y: (Const): /job:localhost/replica:0/task:0/device:GPU:0
mrcnn_mask_deconv_2/mul_1/y: (Const): /job:localhost/replica:0/task:0/device:GPU:0
mrcnn_mask_deconv_2/add_1/y: (Const): /job:localhost/replica:0/task:0/device:GPU:0
mrcnn_mask_deconv_2/stack/3: (Const): /job:localhost/replica:0/task:0/device:GPU:0
mrcnn_mask_deconv_2/Reshape_1/shape: (Const): /job:localhost/replica:0/task:0/device:GPU:0
mrcnn_mask_2/kernel: (Const): /job:localhost/replica:0/task:0/device:GPU:0
mrcnn_mask_2/bias: (Const): /job:localhost/replica:0/task:0/device:GPU:0
mrcnn_mask_2/Reshape/shape: (Const): /job:localhost/replica:0/task:0/device:GPU:0
mrcnn_mask_2/Reshape_1/shape: (Const): /job:localhost/replica:0/task:0/device:GPU:0
Killed

Basically, I get the list of operations, and at some point I just get a “Killed” output and the script stops, with no further explanation.

When I use TF-cpu I don’t get this error and the detection goes well.

Thank you

PS: I added the “…” in the middle of the error log myself, just to skip all the operations that appear.

Hi,

I guess that you are running out of memory.
Could you also monitor the system status with this command:

sudo ./tegrastats

TensorFlow takes twice the memory, since it creates one copy for the CPU and another for the GPU.
If the issue is OOM, it’s recommended to use our TensorRT API directly.
https://github.com/NVIDIA-AI-IOT/tf_to_trt_image_classification
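
If staying with TensorFlow, one thing to try is capping how much memory its GPU allocator may take. On the TX2 the CPU and GPU share the same physical RAM, and TF 1.x by default reserves most of the GPU memory it can see. The following is a minimal sketch assuming TF 1.x; `gpu_fraction` and `make_session` are hypothetical helpers, and the budget values are illustrative only.

```python
def gpu_fraction(budget_mb, total_mb):
    """Convert a memory budget in MB into a fraction clamped to [0, 1]."""
    if total_mb <= 0:
        raise ValueError("total_mb must be positive")
    return max(0.0, min(1.0, budget_mb / total_mb))


def make_session(budget_mb, total_mb):
    """Create a TF 1.x session whose GPU allocator is capped."""
    # Imported lazily so the helper above works without TensorFlow.
    import tensorflow as tf

    config = tf.ConfigProto()
    # Allocate GPU memory on demand instead of reserving it all up front.
    config.gpu_options.allow_growth = True
    # Upper bound on the share of GPU memory this process may use.
    config.gpu_options.per_process_gpu_memory_fraction = gpu_fraction(
        budget_mb, total_mb
    )
    return tf.Session(config=config)
```

For example, `make_session(2048, 8192)` would cap the allocator at a quarter of the memory; whether that is enough for a given model has to be found by experiment.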

Thanks.

Hi everyone,

To close the subject: it was indeed an OOM issue.

I saw on another forum that creating a swap file could be a good idea, and it worked for me.
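
To confirm that the swap file is actually active before launching the script, the kernel's `/proc/swaps` table can be read. A minimal sketch, assuming a Linux system; `parse_swaps` and `total_swap_kib` are hypothetical helper names.

```python
def parse_swaps(text):
    """Parse /proc/swaps-style text into a list of (path, size_kib) tuples."""
    entries = []
    for line in text.splitlines()[1:]:  # the first line is the column header
        fields = line.split()
        # Columns: Filename, Type, Size, Used, Priority
        if len(fields) >= 3 and fields[2].isdigit():
            entries.append((fields[0], int(fields[2])))
    return entries


def total_swap_kib(path="/proc/swaps"):
    """Sum the size of every active swap area (Linux only)."""
    with open(path) as f:
        return sum(size for _, size in parse_swaps(f.read()))


if __name__ == "__main__":
    print("Active swap (KiB):", total_swap_kib())
```

A result of 0 means no swap area is enabled, for instance because the swap file was not re-activated after a reboot.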

Thank you for your help.