When using data_format=channels_last -> ValueError: Cannot feed value of shape (1, 1, 8, 8, 8) for Tensor 'NV_MODEL_INPUT:0', which has shape '(?, ?, ?, ?, 1)'

Hello,

I am having problems implementing a custom neural network with Nvidia Clara Train SDK Transfer Learning Toolkit.

As a first step I am trying to implement the example provided in section 4 of the Transfer Learning Toolkit Getting Started Guide (see https://docs.nvidia.com/clara/tlt-mi/tlt-mi-getting-started/index.html#byom_tlt_topic), but without success.

I downloaded the segmentation_ct_liver_and_tumor_v1 model from NGC, made a copy of it (with `cp -r`), and edited the config files to use the shallow Unet example provided (see the link above and/or the .py file below).

Issue #1: At first I got error messages caused by a bug in TensorFlow v1.13 when using the Conv3D() function with the "channels_first" data format (see https://github.com/tensorflow/tensorflow/pull/23004). I believe this would be fixed if the Docker image were updated to a more recent version of TensorFlow.

Issue #2: To circumvent issue #1, I changed all the "data_format" arguments to "channels_last" in the config_train.json file. Now I am getting the error message: ValueError: Cannot feed value of shape (1, 1, 8, 8, 8) for Tensor 'NV_MODEL_INPUT:0', which has shape '(?, ?, ?, ?, 1)'.

As I understand it, the shape '(?, ?, ?, ?, 1)' is correct: it tells me the model was properly built with a single channel in the last position. The shape (1, 1, 8, 8, 8) tells me the data is being loaded channels-first, which is incorrect. It should be (1, 8, 8, 8, 1), i.e. (batch, depth, width, height, channel).
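For illustration (this is plain NumPy, not SDK code), moving the channel axis from position 1 to the end turns the shape being fed into the shape the model expects:

```python
import numpy as np

# Batch as the pipeline apparently feeds it: NCDHW (channels first)
batch = np.zeros((1, 1, 8, 8, 8))

# Layout the channels_last model expects: NDHWC
expected = np.moveaxis(batch, 1, -1)

print(batch.shape)     # (1, 1, 8, 8, 8)
print(expected.shape)  # (1, 8, 8, 8, 1)
```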

It seems that somewhere the code fails to pick up "channels_last" from the config files and still uses "channels_first" (perhaps as the default?), causing the shape error. I am only speculating, but my hunch is that it happens where the data actually gets loaded, perhaps in the "VolumeTo4DArray" component?
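If the "VolumeTo4DArray" transform simply prepends a channel axis (an assumption on my part, I have not checked its source), the difference between the two layouts would look like this in NumPy:

```python
import numpy as np

volume = np.zeros((8, 8, 8))  # a single 3D scan, no channel axis yet

# Channels first: prepend the channel axis -> (1, 8, 8, 8)
chan_first = np.expand_dims(volume, axis=0)

# Channels last: append the channel axis -> (8, 8, 8, 1)
chan_last = np.expand_dims(volume, axis=-1)

# After batching (batch_size=1), the channels-first version becomes
# (1, 1, 8, 8, 8), exactly the shape reported in the error.
print(chan_first[np.newaxis].shape)  # (1, 1, 8, 8, 8)
print(chan_last[np.newaxis].shape)   # (1, 8, 8, 8, 1)
```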

I don’t know how to debug this. Can someone please help me find a solution or point me in the right direction? I’d really like to be able to implement a custom model using the Transfer Learning Toolkit. I’ve been trying to debug this for a while without finding anything that worked. Any help is greatly appreciated!

Cheers,

Olivier

Below are the full command output log, the config_train.json file, and the Python class "unet_shallow.py".

command output log:

/segmentation_ct_liver_and_tumor_shallowUnet/commands# ./train.sh
2019-09-09 17:00:49.251226: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2808000000 Hz
2019-09-09 17:00:49.251752: I tensorflow/compiler/xla/service/service.cc:161] XLA service 0x62501f0 executing computations on platform Host. Devices:
2019-09-09 17:00:49.251777: I tensorflow/compiler/xla/service/service.cc:168]   StreamExecutor device (0): <undefined>, <undefined>
2019-09-09 17:00:49.317442: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-09-09 17:00:49.318202: I tensorflow/compiler/xla/service/service.cc:161] XLA service 0x6320110 executing computations on platform CUDA. Devices:
2019-09-09 17:00:49.318250: I tensorflow/compiler/xla/service/service.cc:168]   StreamExecutor device (0): GeForce GTX 1060 with Max-Q Design, Compute Capability 6.1
2019-09-09 17:00:49.318394: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties: 
name: GeForce GTX 1060 with Max-Q Design major: 6 minor: 1 memoryClockRate(GHz): 1.3415
pciBusID: 0000:01:00.0
totalMemory: 2.95GiB freeMemory: 2.35GiB
2019-09-09 17:00:49.318417: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2019-09-09 17:00:49.728798: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-09-09 17:00:49.728852: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]      0 
2019-09-09 17:00:49.728861: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0:   N 
2019-09-09 17:00:49.729028: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 2033 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1060 with Max-Q Design, pci bus id: 0000:01:00.0, compute capability: 6.1)
/usr/local/lib/python3.5/dist-packages/skimage/transform/_warps.py:105: UserWarning: The default mode, 'constant', will be changed to 'reflect' in skimage 0.15.
  warn("The default mode, 'constant', will be changed to 'reflect' in "
/usr/local/lib/python3.5/dist-packages/skimage/transform/_warps.py:110: UserWarning: Anti-aliasing will be enabled by default in skimage 0.15 to avoid aliasing artifacts when down-sampling images.
  warn("Anti-aliasing will be enabled by default in skimage 0.15 to "
Traceback (most recent call last):
  File "/usr/lib/python3.5/runpy.py", line 184, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.5/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "tlt2/src/apps/train.py", line 71, in <module>
  File "tlt2/src/apps/train.py", line 62, in main
  File "tlt2/src/workflows/trainers/simple_trainer.py", line 149, in train
  File "tlt2/src/workflows/fitters/legacy_fitter.py", line 368, in fit
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 929, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1128, in _run
    str(subfeed_t.get_shape())))
ValueError: Cannot feed value of shape (1, 1, 8, 8, 8) for Tensor 'NV_MODEL_INPUT:0', which has shape '(?, ?, ?, ?, 1)'

config_train.json:

{
    "epochs": 1250,
    "num_training_epoch_per_valid": 20,
    "learning_rate": 1e-4,
    "multi_gpu": false,

    "train":
    {
        "loss":
        {
            "name": "Dice",
            "args": {
                "data_format": "channels_last"
            }
        },

        "optimizer":
        {
            "name": "Adam"
        },

        "lr_policy":
        {
            "name": "DecayLRonStep",
            "args": {
                "decay_ratio": 0.1,
                "decay_freq": 50000
            }
        },
	
        "model":
        {
            "path": "unet_shallow.UnetShallow",
            "args": {
                "num_classes": 3,
                "factor": 8,
                "final_activation": "softmax",
                "data_format": "channels_last"
            }
        },

        "pre_transforms":
        [
            {
                "name": "LoadNifty",
                "args": {
                    "fields": [
                        "image",
                        "label"
                    ]
                }
            },
            {
                "name": "VolumeTo4DArray",
                "args": {
                    "fields": [
                        "image",
                        "label"
                    ]
                }
            },
            {
                "name": "ScaleIntensityRange",
                "args": {
                    "field": "image",
                    "a_min": -21,
                    "a_max": 189,
                    "b_min": 0.0,
                    "b_max": 1.0,
                    "clip": true
                }
            },
            
            {
                "name": "NPResize3D",
                "args": {
                    "applied_keys": ["image", "label"],
                    "output_shape": [8, 8, 8]
                }
            },
            {
                "name": "NPRandomFlip3D",
                "args": {
                    "applied_keys": ["image", "label"],
                    "probability": 0.1
                }
            },
            {
                "name": "NPRandomRot90XY",
                "args": {
                    "applied_keys": ["image", "label"],
                    "probability": 0.1
                }
            },
            {
                "name": "ScaleIntensityOscillation",
                "args": {
                    "field": "image",
                    "magnitude": 0.20
                }
            }
        ],

        "image_pipeline": {
            "name": "ImagePipeline",
            "args": {
                "task": "segmentation",
                "data_list_file_path": "{DATASET_JSON}",
                "data_file_base_dir": "{DATA_ROOT}",
                "data_list_key": "training",
                "crop_size": [-1, -1, -1],
                "data_format": "channels_last",
                "batch_size": 1,
                "num_channels": 1,
                "num_workers": 8,
                "prefetch_size": 10
            }
        }
    },

    "validate":
    {
        "metrics":
        [
            {
                "name": "MetricAverageFromArrayDice",
                "args": {
                    "name": "mean_dice",
                    "stopping_metric": true,
                    "applied_key": "model",
                    "label_key": "label"
                }
            }
        ],

        "pre_transforms":
        [
            {
                "name": "LoadNifty",
                "args": {
                    "fields": [
                        "image",
                        "label"
                    ]
                }
            },
            {
                "name": "VolumeTo4DArray",
                "args": {
                    "fields": [
                        "image",
                        "label"
                    ]
                }
            },
            {
                "name": "ScaleIntensityRange",
                "args": {
                    "field": "image",
                    "a_min": -21,
                    "a_max": 189,
                    "b_min": 0.0,
                    "b_max": 1.0,
                    "clip": true
                }
            }
        ],

        "image_pipeline": {
            "name": "ImagePipeline",
            "args": {
                "task": "segmentation",
                "data_list_file_path": "{DATASET_JSON}",
                "data_file_base_dir": "{DATA_ROOT}",
                "data_list_key": "validation",
                "crop_size": [-1, -1, -1],
                "data_format": "channels_last",
                "batch_size": 1,
                "num_channels": 1,
                "num_workers": 4,
                "prefetch_size": 1
            }
        },

        "inferer":
        {
            "name": "ScanWindowInferer",
            "args": {
                "is_channels_first": false,
                "roi_size": [8, 8, 8]
            }
        }
    }
}

unet_shallow.py:

import tensorflow as tf
from medical.tlt2.src.components.models.model import Model


class UnetShallow(Model):
    def __init__(self, num_classes,
                 factor=32,
                 training=False,
                 data_format='channels_first',
                 final_activation='softmax'):
        Model.__init__(self)
        self.model = None
        self.num_classes = num_classes
        self.factor = factor
        self.training = training
        self.data_format = data_format
        self.final_activation = final_activation
        if data_format == 'channels_first':
            self.channel_axis = 1
            # force channels-first ordering
            tf.keras.backend.set_image_data_format('channels_first')
            print(tf.keras.backend.image_data_format())
        elif data_format == 'channels_last':
            self.channel_axis = -1
            tf.keras.backend.set_image_data_format('channels_last')
            print(tf.keras.backend.image_data_format())
    def network(self, inputs, training, num_classes, factor, data_format, channel_axis):
        # very shallow Unet Network
        with tf.variable_scope('UnetShallow'):
            # print(inputs)
            # print(inputs.shape)
            # print(type(inputs))
            # print(data_format)
            conv1_1 = tf.keras.layers.Conv3D(factor, 3, padding='same', data_format=data_format, activation='relu')(inputs)
            conv1_2 = tf.keras.layers.Conv3D(factor * 2, 3, padding='same', data_format=data_format, activation='relu')(conv1_1)
            pool1 = tf.keras.layers.MaxPool3D(pool_size=(2, 2, 2), strides=2, data_format=data_format)(conv1_2)
            conv2_1 = tf.keras.layers.Conv3D(factor * 2, 3, padding='same', data_format=data_format, activation='relu')(pool1)
            conv2_2 = tf.keras.layers.Conv3D(factor * 4, 3, padding='same', data_format=data_format, activation='relu')(conv2_1)
            unpool1 = tf.keras.layers.UpSampling3D(size=(2, 2, 2), data_format=data_format)(conv2_2)
            unpool1 = tf.keras.layers.Concatenate(axis=channel_axis)([unpool1, conv1_2])
            conv7_1 = tf.keras.layers.Conv3D(factor * 2, 3, padding='same', data_format=data_format, activation='relu')(unpool1)
            conv7_2 = tf.keras.layers.Conv3D(factor * 2, 3, padding='same', data_format=data_format, activation='relu')(conv7_1)
            output = tf.keras.layers.Conv3D(num_classes, 1, padding='same', data_format=data_format)(conv7_2)
            if str.lower(self.final_activation) == 'softmax':
                output = tf.nn.softmax(output, axis=channel_axis, name='softmax')
            elif str.lower(self.final_activation) == 'sigmoid':
                output = tf.nn.sigmoid(output, name='sigmoid')
            elif str.lower(self.final_activation) == 'linear':
                pass
            else:
                raise ValueError(
                    'Unsupported final_activation, it must be one of (softmax, sigmoid or linear), but provided: ' + self.final_activation)
        return output

    # additional custom loss
    def loss(self):
        return 0

    def get_predictions(self, inputs, training, build_ctx=None):
        # if self.data_format == 'channels_first':
        #     inputs = tf.transpose(inputs, perm=[1, 0, 2, 3, 4])
        # print(build_ctx)
        self.model = self.network(
            inputs=inputs,
            training=training,
            num_classes=self.num_classes,
            factor=self.factor,
            data_format=self.data_format,
            channel_axis=self.channel_axis
        )
        return self.model

    def get_loss(self):
        return self.loss()

Hi
Thanks for your interest in the Clara Train SDK.
It seems our documentation has some typos, which is what is causing these errors. We are working on updating it soon. We generally have the channel first, which is why that is the default. In the meantime, could you try the example below?

Hope this solves your problem

import tensorflow as tf
from medical.tlt2.src.components.models.model import Model


class CustomNetwork(Model):

    def __init__(self, num_classes,
                 factor=32,
                 training=False,
                 data_format='channels_first',
                 final_activation='linear'):
        Model.__init__(self)
        self.model = None
        self.num_classes = num_classes
        self.factor = factor
        self.training = training
        self.data_format = data_format
        self.final_activation = final_activation

        if data_format == 'channels_first':
            self.channel_axis = 1
        elif data_format == 'channels_last':
            self.channel_axis = -1

    def network(self, inputs, training, num_classes, factor, data_format, channel_axis):
        # very shallow Unet Network
        with tf.variable_scope('CustomNetwork'):

            conv1_1 = tf.keras.layers.Conv3D(factor, 3, padding='same', data_format=data_format, activation='relu')(inputs)
            conv1_2 = tf.keras.layers.Conv3D(factor * 2, 3, padding='same', data_format=data_format, activation='relu')(conv1_1)
            pool1 = tf.keras.layers.MaxPool3D(pool_size=(2, 2, 2), strides=2, data_format=data_format)(conv1_2)

            conv2_1 = tf.keras.layers.Conv3D(factor * 2, 3, padding='same', data_format=data_format, activation='relu')(pool1)
            conv2_2 = tf.keras.layers.Conv3D(factor * 4, 3, padding='same', data_format=data_format, activation='relu')(conv2_1)

            unpool1 = tf.keras.layers.UpSampling3D(size=(2, 2, 2), data_format=data_format)(conv2_2)
            unpool1 = tf.keras.layers.Concatenate(axis=channel_axis)([unpool1, conv1_2])

            conv7_1 = tf.keras.layers.Conv3D(factor * 2, 3, padding='same', data_format=data_format, activation='relu')(unpool1)
            conv7_2 = tf.keras.layers.Conv3D(factor * 2, 3, padding='same', data_format=data_format, activation='relu')(conv7_1)

            output = tf.keras.layers.Conv3D(num_classes, 1, padding='same', data_format=data_format)(conv7_2)

            if str.lower(self.final_activation) == 'softmax':
                output = tf.nn.softmax(output, axis=channel_axis, name='softmax')
            elif str.lower(self.final_activation) == 'sigmoid':
                output = tf.nn.sigmoid(output, name='sigmoid')
            elif str.lower(self.final_activation) == 'linear':
                pass
            else:
                raise ValueError(
                    'Unsupported final_activation, it must be one of (softmax, sigmoid or linear), but provided: ' + self.final_activation)

        return output

    # additional custom loss
    def loss(self):
        return 0

    def get_predictions(self, inputs, training, build_ctx=None):
        if self.data_format == "channels_first":
            inputs = tf.transpose(inputs, perm=[0, 2, 3, 4, 1])
        self.model = self.network(
            inputs=inputs,
            training=training,
            num_classes=self.num_classes,
            factor=self.factor,
            data_format="channels_last",
            channel_axis=self.channel_axis
        )
        if self.data_format == "channels_first":
            self.model = tf.transpose(self.model, perm=[0, 4, 1, 2, 3])
        return self.model

    def get_loss(self):
        return self.loss()
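For what it's worth, the two permutations in get_predictions above are exact inverses, so the channels-first wrapper is lossless. A quick NumPy check (illustration only, not SDK code):

```python
import numpy as np

x = np.random.rand(2, 1, 8, 8, 8)              # NCDHW input
to_last = np.transpose(x, (0, 2, 3, 4, 1))     # NCDHW -> NDHWC
back = np.transpose(to_last, (0, 4, 1, 2, 3))  # NDHWC -> NCDHW

assert to_last.shape == (2, 8, 8, 8, 1)
assert np.array_equal(back, x)  # the round trip is lossless
```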

I tried the code you provided (renamed to unet_shallow.py) with ‘channels_last’ and it gave me the same error as issue #2.

In config_train.json I changed data_format to ‘channels_first’ and now I get the error below. I think it is caused by the bug in TensorFlow v1.13 when using Conv3D together with ‘channels_first’: the code seems to revert to “channels_last” somewhere.

/segmentation_ct_liver_and_tumor_shallowUnet/commands# ./train.sh
MMAR_ROOT set to /workspace/segmentation_ct_liver_and_tumor_shallowUnet/commands/..
PYTHONPATH set to :/opt/nvidia:/opt/nvidia:/workspace/custom_models_class
2019-09-11 17:59:06,989 - ImagePipeline - INFO - Data Property: {'crop_size': [None, None, None], 'task': 'segmentation', 'data_format': 'channels_first', 'label_format': None, 'num_label_channels': 1, 'num_data_dims': 3, 'num_channels': 1}
2019-09-11 17:59:07,018 - ImagePipeline - INFO - Data Property: {'crop_size': [None, None, None], 'task': 'segmentation', 'data_format': 'channels_first', 'label_format': None, 'num_label_channels': 1, 'num_data_dims': 3, 'num_channels': 1}
Traceback (most recent call last):
  File "/usr/lib/python3.5/runpy.py", line 184, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.5/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "tlt2/src/apps/train.py", line 71, in <module>
  File "tlt2/src/apps/train.py", line 62, in main
  File "tlt2/src/workflows/trainers/simple_trainer.py", line 117, in train
  File "tlt2/src/workflows/builders/tf_builder.py", line 111, in build
  File "/workspace/custom_models_class/unet_shallow.py", line 75, in get_predictions
    channel_axis=self.channel_axis
  File "/workspace/custom_models_class/unet_shallow.py", line 41, in network
    unpool1 = tf.keras.layers.Concatenate(axis=channel_axis)([unpool1, conv1_2])
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/keras/engine/base_layer.py", line 538, in __call__
    self._maybe_build(inputs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/keras/engine/base_layer.py", line 1603, in _maybe_build
    self.build(input_shapes)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/keras/utils/tf_utils.py", line 151, in wrapper
    output_shape = fn(instance, input_shape)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/keras/layers/merge.py", line 392, in build
    'Got inputs shapes: %s' % (input_shape))
ValueError: A `Concatenate` layer requires inputs with matching shapes except for the concat axis. Got inputs shapes: [(None, None, None, None, 32), (None, None, None, None, 16)]
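A note on the shapes in this error: with channels_first the channel axis should be axis 1, yet the differing dimension (32 vs. 16) sits in the last position of both tensors, which suggests the layers were effectively built channels_last despite the setting, consistent with the TF v1.13 Conv3D bug mentioned above. A NumPy sketch of the same constraint (illustration only):

```python
import numpy as np

# With factor=8: conv1_2 has 16 channels, the upsampled branch has 32,
# and both ended up with channels in the LAST axis.
a = np.zeros((1, 4, 4, 4, 32))
b = np.zeros((1, 4, 4, 4, 16))

# Concatenating on axis=1 (the channels_first axis) fails because all
# non-concat axes must match, and axis 4 differs (32 vs. 16).
try:
    np.concatenate([a, b], axis=1)
except ValueError as err:
    print("mismatch:", err)

# Concatenating on the axis where the channels actually are works fine.
merged = np.concatenate([a, b], axis=-1)
print(merged.shape)  # (1, 4, 4, 4, 48)
```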

Hi

Let’s step back for a second. I assume you can already train with one of our built-in networks, correct?
The example I provided should work with channels_first. Please change all other transformations/pipelines etc. to also be channels first (you could also remove the parameter, as the default is channels first).

Also, we just updated all the documentation: https://docs.nvidia.com/clara/
You can now see how to bring your own loss, reader, transformation, metric, and model: https://docs.nvidia.com/clara/tlt-mi/clara-train-sdk-v1.1/byom.html