Problem with the BYOC example from the Intro to Clara Train SDK notebook

Hi

I ran a code example from the BYOC Jupyter notebook (https://ngc.nvidia.com/catalog/resources/nvidia:med:getting_started). I encountered exceptions in the section “BYO Network Architecture and Loss.” The errors are as follows:

Requested train epochs: 2; iterations: 8
2020-07-01 10:30:21.766050: E tensorflow/core/grappler/optimizers/dependency_optimizer.cc:696] Iteration = 0, topological sort failed with message: The graph couldn't be sorted in topological order.
2020-07-01 10:30:21.789263: E tensorflow/core/grappler/optimizers/dependency_optimizer.cc:696] Iteration = 1, topological sort failed with message: The graph couldn't be sorted in topological order.
2020-07-01 10:30:21.825699: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:538] remapper failed: Invalid argument: The graph couldn't be sorted in topological order.
2020-07-01 10:30:21.829155: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:538] arithmetic_optimizer failed: Invalid argument: The graph couldn't be sorted in topological order.
2020-07-01 10:30:21.832190: E tensorflow/core/grappler/optimizers/dependency_optimizer.cc:696] Iteration = 0, topological sort failed with message: The graph couldn't be sorted in topological order.
2020-07-01 10:30:21.834987: E tensorflow/core/grappler/optimizers/dependency_optimizer.cc:696] Iteration = 1, topological sort failed with message: The graph couldn't be sorted in topological order.
2020-07-01 10:30:21.952151: E tensorflow/core/common_runtime/executor.cc:648] Executor failed to create kernel. Invalid argument: Conv3DBackpropInputOpV2 only supports NDHWC on the CPU.
	 [[{{node gradients/CustomNetwork/conv3d_6/Conv3D_grad/Conv3DBackpropInputV2}}]]
Exception: <class 'tensorflow.python.framework.errors_impl.InvalidArgumentError'>: Conv3DBackpropInputOpV2 only supports NDHWC on the CPU.
	 [[node gradients/CustomNetwork/conv3d_6/Conv3D_grad/Conv3DBackpropInputV2 (defined at components/optimizers/optimizer.py:51) ]]

Errors may have originated from an input operation.
Input Source operations connected to node gradients/CustomNetwork/conv3d_6/Conv3D_grad/Conv3DBackpropInputV2:
 CustomNetwork/conv3d_6/Conv3D/ReadVariableOp (defined at /claraDevDay/MMARs/GettingStarted/BYOC/myNetworkArch.py:40)

Original stack trace for 'gradients/CustomNetwork/conv3d_6/Conv3D_grad/Conv3DBackpropInputV2':
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "apps/train.py", line 47, in <module>
  File "apps/train.py", line 30, in main
  File "utils/train_conf.py", line 44, in train_mmar
  File "workflows/trainers/supervised_trainer.py", line 271, in train
  File "workflows/builders/tf_builder.py", line 152, in build
  File "components/optimizers/optimizer.py", line 51, in build
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/optimizer.py", line 419, in minimize
    grad_loss=grad_loss)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/optimizer.py", line 537, in compute_gradients
    colocate_gradients_with_ops=colocate_gradients_with_ops)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/gradients_impl.py", line 158, in gradients
    unconnected_gradients)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/gradients_util.py", line 755, in _GradientsHelper
    lambda: grad_fn(op, *out_grads))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/gradients_util.py", line 415, in _MaybeCompile
    return grad_fn()  # Exit early
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/gradients_util.py", line 755, in <lambda>
    lambda: grad_fn(op, *out_grads))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/nn_grad.py", line 159, in _Conv3DGrad
    data_format=data_format),
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/gen_nn_ops.py", line 2070, in conv3d_backprop_input_v2
    dilations=dilations, name=name)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 3616, in create_op
    op_def=op_def)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 2005, in __init__
    self._traceback = tf_stack.extract_stack()

...which was originally created as op 'CustomNetwork/conv3d_6/Conv3D', defined at:
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
[elided 4 identical lines from previous traceback]
  File "workflows/trainers/supervised_trainer.py", line 271, in train
  File "workflows/builders/tf_builder.py", line 132, in build
  File "components/models/model.py", line 76, in build
  File "/claraDevDay/MMARs/GettingStarted/BYOC/myNetworkArch.py", line 62, in get_predictions
    ,factor=self.factor,data_format=self.data_format,channel_axis=self.channel_axis)
  File "/claraDevDay/MMARs/GettingStarted/BYOC/myNetworkArch.py", line 40, in network
    output = tf.keras.layers.Conv3D(num_classes, 1, padding='same', data_format=data_format)(conv7_2)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/base_layer.py", line 634, in __call__
    outputs = call_fn(inputs, *args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/layers/convolutional.py", line 200, in call
    outputs = self._convolution_op(inputs, self.kernel)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/nn_ops.py", line 1122, in __call__
    return self.conv_op(inp, filter)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/nn_ops.py", line 658, in __call__
    return self.call(inp, filter)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/nn_ops.py", line 248, in __call__
    name=self.name)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/gen_nn_ops.py", line 1553, in conv3d
    dilations=dilations, name=name)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
    op_def=op_def)

  File "workflows/fitters/supervised_fitter.py", line 224, in fit
  File "workflows/fitters/supervised_fitter.py", line 548, in _do_fit
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 950, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1173, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1350, in _do_run
    run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1370, in _do_call
    raise type(e)(node_def, op, message)

I am not sure whether these errors occurred because I am training the network on a CPU, but the other BYOC examples worked fine when I ran them. I have also attached the notebook here.

Thank you.

Hi
Thanks for your interest in the Clara Train SDK.
I think the CPU could be the main issue. Could you clarify which examples ran and which gave errors? Different examples in the notebook demonstrate different capabilities, and some layers may not be supported on the CPU.
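
To illustrate the layout point: the error in the log says that Conv3DBackpropInputOpV2 only supports NDHWC on the CPU, and the traceback shows the notebook's network passing data_format and channel_axis into tf.keras.layers.Conv3D. In Keras, "channels_last" corresponds to NDHWC and "channels_first" to NCDHW. Below is a minimal sketch of picking the layout from the available device. The helper names choose_conv3d_layout and gpu_is_available are hypothetical, and how the notebook's config actually sets these values is an assumption, not something shown in this thread.

```python
def choose_conv3d_layout(gpu_available):
    """Return a (data_format, channel_axis) pair for a 5-D Conv3D input.

    Hypothetical helper: the log above only tells us that the CPU gradient
    kernel requires NDHWC, which Keras calls "channels_last".
    """
    if gpu_available:
        # NCDHW: channels sit on axis 1 of (batch, channels, depth, height, width)
        return "channels_first", 1
    # NDHWC: channels sit on axis 4 of (batch, depth, height, width, channels)
    return "channels_last", 4


def gpu_is_available():
    """Best-effort GPU check; returns False if TensorFlow is missing."""
    try:
        import tensorflow as tf
    except ImportError:
        return False
    try:
        # Present in TF 1.x (the version family seen in the traceback);
        # deprecated in TF 2.x.
        return bool(tf.test.is_gpu_available())
    except Exception:
        return False


# Usage (not executed here):
#   data_format, channel_axis = choose_conv3d_layout(gpu_is_available())
```

The resulting pair would then be passed wherever the custom network receives data_format and channel_axis. Note that the input tensors and any transforms must use the same layout, so switching the layer argument alone may not be enough.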

In general, we don’t recommend training on a CPU, and we also require a GPU that is Pascal or newer. That is, if you try running on a Maxwell GPU, you will run into errors.
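
As a rough way to check the GPU generation: Pascal GPUs report CUDA compute capability 6.x, while Maxwell GPUs report 5.x. The sketch below parses the description string that TensorFlow 1.x prints for each GPU device. The function names are hypothetical, and the sample strings in the comments are illustrative rather than taken from this thread.

```python
import re

PASCAL_MAJOR = 6  # Pascal is compute capability 6.x; Maxwell is 5.x


def compute_capability(device_desc):
    """Extract (major, minor) from a TF device description, or None."""
    match = re.search(r"compute capability:\s*(\d+)\.(\d+)", device_desc)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def is_pascal_or_newer(device_desc):
    """True if the description reports compute capability >= 6.0."""
    cc = compute_capability(device_desc)
    return cc is not None and cc[0] >= PASCAL_MAJOR


# Illustrative description strings (not from the poster's machine):
#   "device: 0, name: GeForce GTX 980, pci bus id: 0000:01:00.0, compute capability: 5.2"
#   "device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:1e.0, compute capability: 7.0"
```

In TF 1.x the description strings can be read from the physical_device_desc field of the entries returned by tensorflow.python.client.device_lib.list_local_devices(). CPU entries carry no compute capability, so the helper returns False for them.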

Hope that helps