Image Segmentation Using DIGITS 5

Originally published at: https://developer.nvidia.com/blog/image-segmentation-using-digits-5/

Today we’re excited to announce NVIDIA DIGITS 5. DIGITS 5 comes with a number of new features, two of which are of particular interest for this post: a fully-integrated segmentation workflow, allowing you to create image segmentation datasets and visualize the output of a segmentation network, and the DIGITS model store, a public online repository…

Any plans to support Theano or TensorFlow?

Can we use this for SAR image segmentation?

Hi! I have also worked with FCN-8s for image segmentation with good results. The only problem is the long time required for both training and inference. Any idea how to attack this without losing accuracy?

Hello Nerea, technology is moving fast, so chances are that newer generations of GPUs will meet your processing time requirements. Did you try any of the GPUs from the Pascal family? Also, Fully Convolutional Networks scale extremely well to multi-GPU training, so you might want to consider adding GPUs to reduce training time. For inference you can use TensorRT (https://developer.nvidia.co...) to reduce test time.

Another option would be to base your network on a different CNN: FCN-8s is based on VGG-16, but you might want to stick to e.g. AlexNet and add skip connections if what you're interested in is a finer grain in the predictions.

Alternatively, you could increase the stride of the first few convolutional layers to increase their receptive field and reduce the number of activations in the network, thus reducing the computational complexity.
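As a rough illustration of the stride idea, here is a minimal sketch that counts activations per channel for one convolutional layer at two strides. The layer parameters (224-pixel input, kernel 11, padding 100) and the alternative stride of 8 are values picked for illustration only, and `conv_output_size` is a hypothetical helper, not part of DIGITS or Caffe.

```python
def conv_output_size(w, k, s, p):
    """Spatial output size of a convolution: floor((W + 2P - K) / S) + 1."""
    return (w + 2 * p - k) // s + 1


# Illustrative layer: 224-pixel input, kernel 11, padding 100.
for stride in (4, 8):
    size = conv_output_size(224, k=11, s=stride, p=100)
    # Activations per channel for a square feature map.
    print(f"stride={stride}: {size}x{size} map, {size * size} activations per channel")
```

Doubling the stride here halves each spatial dimension, so every later layer processes roughly a quarter of the activations. The price is coarser feature maps, which may hurt fine detail in the predictions.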

Yet another option is to reduce the image size if you feel this won't destroy too much content.

These are some thoughts but there are probably a million ways to tackle this problem.

Hello, I have found FCN models to work very well for very different datasets (natural images, synthetic images, medical images). I suggest you try it, and please let us know the result!

We would love to support more frameworks in the future. DIGITS is open-source, feel free to contribute!

I'm having trouble understanding what exactly is meant by offset in Table 2 and why it is calculated as (P - (K - 1)/2) / S.

I understand that the new feature map size could be calculated as (W + 2P - K) / S + 1 where W is the input size. So for example, conv 1 with W = 224, P = 100, K = 11 and S = 4 would result in size 104. How does that relate to offset as defined above?
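For reference, the size formula quoted above can be checked with a few lines of Python. This is only a sketch: `conv_output_size` is a hypothetical helper name, and the integer (floor) division is an assumption about how the fractional result is rounded.

```python
def conv_output_size(w, k, s, p):
    """Feature map size W' = floor((W + 2P - K) / S) + 1."""
    return (w + 2 * p - k) // s + 1


# conv1 from the example: W = 224, P = 100, K = 11, S = 4
print(conv_output_size(224, k=11, s=4, p=100))  # 104
```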

Consider conv1 for example: because this layer has 100-pixel padding on each side, its output is larger than if there was no padding at all. Therefore, if you were to upscale the output of conv1 to reconstitute an image of the original input, you would have to crop the upscaled output. In a lot of cases a center crop will do, but if you want to calculate the offset incurred by each conv or pooling layer you can use the (P-(K-1)/2)/S formula. In practice you need to do this because the Caffe "crop" layer requires an offset and a shape; it won't do the center crop automatically.

Interestingly, your question made me realize I had made a mistake in the offset calculations. The number is correct for each layer; however, the offsets do not add up the way I showed. Amazingly, the final answer is correct. I will fix this in the article, but in the meantime I suggest you have a look at https://github.com/BVLC/caf...

I figured out the general intention, but I couldn't get the numbers to match and it is still not clear to me. In this example, 224x224 image with padding 100 on each side would result in 104x104 feature maps after conv1 (as given by W' = (W + 2P - K) / S + 1). If we upscale with deconvolution using stride 4, I am expecting a 416x416 image. With offset 23 in Table 2 and cropping 23 pixels at each side, this gives 370x370. What am I missing here?

When I think about it some more, the actual size after deconvolution should also depend on the kernel size used, but I'm not sure if I could just use the above formula for size with known W' (104) and solve for W.

Let's assume your input is 224*224. The intrinsic offset of conv1 is (P-(K-1)/2)/S = (100-(11-1)/2)/4 = 23.75. The output of conv1 has size 104*104.

In the article I omitted to say that a deconvolution layer yields an intrinsic offset of (K-1)/2-P so in your example if you want to upscale conv1 using stride S=4 and kernel size K=7 this would be 3. The size of the output of the upscale layer would be (W-1)*S-2P+K=(104-1)*4-0+7=419 (for each spatial dimension).
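The two deconvolution formulas above can be sketched in Python as follows. The helper names `deconv_output_size` and `deconv_offset` are made up for this illustration, and padding P=0 is the value assumed in the example.

```python
def deconv_output_size(w, k, s, p=0):
    """Output size of a deconvolution (upscale) layer: (W - 1) * S - 2P + K."""
    return (w - 1) * s - 2 * p + k


def deconv_offset(k, p=0):
    """Intrinsic offset of a deconvolution layer: (K - 1) / 2 - P."""
    return (k - 1) / 2 - p


# Upscaling the 104-pixel conv1 output with S = 4, K = 7, P = 0
print(deconv_output_size(104, k=7, s=4))  # 419
print(deconv_offset(k=7))                 # 3.0
```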

The update I need to make in the article is to fix the recipe for composing those offsets across layers. We can't simply add them up. We need to "back propagate" the offset, from the top of the graph to the bottom of the graph. The composition of a layer L1 with a layer L2 (i.e. L2 is a bottom of L1 in Caffe terminology), with offsets O1 and O2 respectively, yields an offset of O2/F+O1, where F is the cumulative scaling factor of L2.

So now in our example we have:
- offset of upscale layer: 3
- offset of conv1 layer: 23.75
- scaling factor of conv1: 1/4

- total offset of composition of upscale with conv1: 23.75/(1/4)+3=98

This means that you need to take 98 pixels off each border of the output of the upscale layer => you end up with 419-98*2=223 pixels. Adjusting for rounding errors due to integer kernel sizes this is exactly what you need.
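Putting the whole example together, here is a small sketch that reproduces the numbers above. The function names are hypothetical; the formulas are the ones given in this thread, with the upscale layer as L1 and conv1 as L2.

```python
def conv_offset(p, k, s):
    """Intrinsic offset of a conv or pooling layer: (P - (K - 1) / 2) / S."""
    return (p - (k - 1) / 2) / s


def compose_offsets(o_top, o_bottom, f_bottom):
    """Offset of a top layer L1 composed with its bottom L2: O2 / F + O1,
    where F is the cumulative scaling factor of the bottom layer L2."""
    return o_bottom / f_bottom + o_top


o_conv1 = conv_offset(p=100, k=11, s=4)   # 23.75
o_upscale = 3                             # (K - 1) / 2 - P with K = 7, P = 0
total = compose_offsets(o_upscale, o_conv1, f_bottom=1 / 4)
print(total)                              # 98.0

upscaled_size = 419                       # output size of the upscale layer
print(upscaled_size - 2 * total)          # 223.0, one pixel short of the 224 input
```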

Thanks for a detailed explanation! :)

Is the public DIGITS Model Store available now?

Sorry, the public DIGITS Model Store is not available yet.

Interesting. There is a formatting mistake near "For example, consider conv1: since"

Hello Greg, as demonstrated in the paper you mentioned above (Ros_The_SYNTHIA_Dataset_CVPR_2016_paper.pdf), the best results for tackling domain shift are obtained using balanced gradient contribution (BGC), which consists of creating batches with images from both the SYNTHIA (synthetic) and real image datasets. What are the practical implementation pros and cons with respect to transfer learning? Sorry for the general question, I am very new to these topics... still learning :-)

Hello Filippo. In this paper, section 4.3 mentions that feature extractors (also called "contraction blocks") are initialized from the corresponding "base" CNN, pre-trained on ILSVRC. This is what I did too.

BGC comes into play when studying the benefits of using the synthetic dataset during training before deploying the network on real images. Admittedly I don't have experience with this. Unless I am mistaken the paper does not give quantitative proof that BGC performs better than fine-tuning as only the BGC results are given.
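For readers wondering what "batches with images from both datasets" could look like in practice, here is a minimal sketch of the batching idea only. It is not the paper's implementation: the function name, the file names and the 3-real/5-synthetic split are all made up for illustration.

```python
import random


def mixed_batches(real, synthetic, n_real, n_synthetic, num_batches, seed=0):
    """Yield batches that always contain a fixed number of real and
    synthetic samples, so both domains contribute to every update."""
    rng = random.Random(seed)
    for _ in range(num_batches):
        batch = rng.sample(real, n_real) + rng.sample(synthetic, n_synthetic)
        rng.shuffle(batch)  # avoid a fixed real-then-synthetic ordering
        yield batch


# Hypothetical file lists standing in for the two datasets.
real_images = [f"real_{i}.png" for i in range(20)]
synthetic_images = [f"synthia_{i}.png" for i in range(50)]

for batch in mixed_batches(real_images, synthetic_images,
                           n_real=3, n_synthetic=5, num_batches=2):
    print(batch)
```

Compared with fine-tuning (train on one dataset, then continue on the other), this needs a data loader that can draw from two sources at once, but it is a single training run rather than two stages.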

Thanks for letting us know, this is fixed now.

Thank you for this great article! Is there any chance to get your pretrained model with FCN-8s on SYNTHIA? Or is there a way to get at least some of the models from the model store? Thanks and greetings

Does anyone know when DIGITS 5 will be available on Amazon as an AMI?

Currently only 4 is available: https://aws.amazon.com/mark...