Semantic segmentation networks

Semantic segmentation networks DEFAULT

Semantic Segmentation — Popular Architectures

Semantic segmentation is the task of classifying each and very pixel in an image into a class as shown in the image below. Here you can see that all persons are red, the road is purple, the vehicles are blue, street signs are yellow etc.

Semantic segmentation is different from instance segmentation which is that different objects of the same class will have different labels as in person1, person2 and hence different colours. The picture below very crisply illustrates the difference between instance and semantic segmentation.

One important question can be why do we need this granularity of understanding pixel by pixel location?

Some examples that come to mind are:

i) Self Driving Cars — May need to know exactly where another car is on the road or the location of a human crossing the road

ii) Robotic systems — Robots that say join two parts together will perform better if they know the exact locations of the two parts

iii) Damage Detection — It maybe important in this case to know the exact extent of damage

Lets now talk about 3 model architectures that do semantic segmentation.

1. Fully Convolutional Network (FCN)

FCN is a popular algorithm for doing semantic segmentation. This model uses various blocks of convolution and max pool layers to first decompress an image to 1/32th of its original size. It then makes a class prediction at this level of granularity. Finally it uses up sampling and deconvolution layers to resize the image to its original dimensions.

These models typically don’t have any fully connected layers. The goal of down sampling steps is to capture semantic/contextual information while the goal of up sampling is to recover spatial information. Also there are no limitations on image size. The final image is the same size as the original image. To fully recover the fine grained spatial information lost in down sampling, skip connections are used. A skip connection is a connection that bypasses at least one layer. Here it is used to pass information from the down sampling step to the up sampling step. Merging features from various resolution levels helps combining context information with spatial information.

I trained a FCN to perform semantic segmentation or road vs non road pixels for a self driving car.

2. U-Net

The U-Net architecture is built upon the Fully Convolutional Network (FCN) and modified in a way that it yields better segmentation in medical imaging.

Compared to FCN-8, the two main differences are:

(1) U-net is symmetric and

(2) the skip connections between the downsampling path and the upsampling path apply a concatenation operator instead of a sum.

These skip connections intend to provide local information to the global information while upsampling. Because of its symmetry, the network has a large number of feature maps in the upsampling path, which allows to transfer information. B

The U-Net owes its name to its symmetric shape, which is different from other FCN variants.

U-Net architecture is separated in 3 parts:

1 : The contracting/downsampling path
2 : Bottleneck
3 : The expanding/upsampling path

I have implemented U-net for smoke segmentation. A major advantage of U-net is that it is much faster to run than FCN or Mask RCNN.

3. Mask RCNN

Lets start with a gentle introduction toMask RCNN.

Faster RCNN is a very good algorithm that is used for object detection. Faster R-CNN consists of two stages. The first stage, called a Region Proposal Network (RPN), proposes candidate object bounding boxes. The second stage, which is in essence Fast R-CNN, extracts features using RoIPool from each candidate box and performs classification and bounding-box regression. The features used by both stages can be shared for faster inference.

Mask R-CNN is conceptually simple: Faster R-CNN has two outputs for each candidate object, a class label and a bounding-box offset; to this we add a third branch that outputs the object mask — which is a binary mask that indicates the pixels where the object is in the bounding box. But the additional mask output is distinct from the class and box outputs, requiring extraction of much finer spatial layout of an object. To do this Mask RCNN uses the Fully Convolution Network (FCN).

So in short we can say that Mask RCNN combines the two networks — Faster RCNN and FCN in one mega architecture. The loss function for the model is the total loss in doing classification, generating bounding box and generating the mask.

Mask RCNN has a couple of additional improvements that make it much more accurate than FCN. You can read more about them in their paper.

I have trained custom Mask RCNN models using Keras Matterport github and Tensorflow object detection. To learn how to build a Mask RCNN yourself, please follow the tutorial at Car Damage Detection Blog.

I have my own deep learning consultancy and love to work on interesting problems. I have helped many startups deploy innovative AI based solutions. Check us out at —

You can also see my other writings at:

If you have a project that we can collaborate on, then please contact me through my website or at [email protected]

More about FCN

More about U-Net


Semantic segmentation of microscopic neuroanatomical data by combining topological priors with encoder–decoder deep networks


  1. 1.

    Bohland, J. W. et al. A proposal for a coordinated effort for the determination of brainwide neuroanatomical connectivity in model organisms at a mesoscopic scale. PLoS Comput. Biol.5, e1000334 (2009).

    Article Google Scholar

  2. 2.

    Dodt, H.-U. et al. Ultramicroscopy: three-dimensional visualization of neuronal networks in the whole mouse brain. Nat. Methods4, 331–336 (2007).

    Article Google Scholar

  3. 3.

    Ragan, T. et al. Serial two-photon tomography for automated ex vivo mouse brain imaging. Nat. Methods9, 255 (2012).

    Article Google Scholar

  4. 4.

    Oh, S. W. et al. A mesoscale connectome of the mouse brain. Nature508, 207–214 (2014).

    Article Google Scholar

  5. 5.

    Pinskiy, V. et al. High-throughput method of whole-brain sectioning, using the tape-transfer technique. PLoS ONE10, e0102363 (2015).

    Article Google Scholar

  6. 6.

    Lin, M. K. et al. A high-throughput neurohistological pipeline for brain-wide mesoscale connectivity mapping of the common marmoset. eLife8, e40042 (2019).

    Article Google Scholar

  7. 7.

    Halavi, M., Hamilton, K. A., Parekh, R. & Ascoli, G. Digital reconstructions of neuronal morphology: three decades of research trends. Front. Neurosci.6, 49 (2012).

    Article Google Scholar

  8. 8.

    Helmstaedter, M. & Mitra, P. P. Computational methods and challenges for large-scale circuit mapping. Curr. Opin. Neurobiol.22, 162–169 (2012).

    Article Google Scholar

  9. 9.

    Peng, H., Meijering, E. & Ascoli, G. A. From DIADEM to BigNeuron. Neuroinform.13, 259–260 (2015).

    Article Google Scholar

  10. 10.

    Rey-Villamizar, N. et al. Large-scale automated image analysis for computational profiling of brain tissue surrounding implanted neuroprosthetic devices using Python. Front. Neuroinform.8, 39 (2014).

    Article Google Scholar

  11. 11.

    Peng, H. et al. BigNeuron: large-scale 3D neuron reconstruction from optical microscopy images. Neuron87, 252–256 (2015).

    Article Google Scholar

  12. 12.

    Lawrie, S. M. & Abukmeil, S. S. Brain abnormality in schizophrenia: a systematic and quantitative review of volumetric magnetic resonance imaging studies. Br. J. Psychiatry172, 110–120 (1998).

    Article Google Scholar

  13. 13.

    Taylor, R. H., Lavealle, S., Burdea, G. C. & Mosges, R. Computer-integrated Surgery: Technology and Clinical Applications (MIT Press, 1995).

  14. 14.

    Zijdenbos, A. P. & Dawant, B. M. Brain segmentation and white matter lesion detection in MR images. Crit. Rev. Biomed. Eng.22, 401–465 (1994).

    Google Scholar

  15. 15.

    Worth, A. J., Makris, N., Caviness, V. S.Jr & Kennedy, D. N. Neuroanatomical segmentation in MRI: technological objectives. Int. J. Pattern Recognit. Artif. Intell.11, 1161–1187 (1997).

    Article Google Scholar

  16. 16.

    Khoo, V. S. et al. Magnetic resonance imaging (MRI): considerations and applications in radiotherapy treatment planning. Radiother. Oncol.42, 1–15 (1997).

    Article Google Scholar

  17. 17.

    Grimson, W. E. L. et al. Utilizing segmented mri data in image-guided surgery. Int. J. Pattern Recognit. Artif. Intell.11, 1367–1397 (1997).

    Article Google Scholar

  18. 18.

    LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature521, 436–444 (2015).

    Article Google Scholar

  19. 19.

    LeCun, Y., Bottou, L., Bengio, Y. & Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE86, 2278–2324 (1998).

    Article Google Scholar

  20. 20.

    Fukushima, K. Neocognitron: a self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biol. Cybern.36, 193–202 (1980).

    Article Google Scholar

  21. 21.

    Pahariya, G. et al. High precision automated detection of labeled nuclei in gigapixel resolution image data of mouse brain. Preprint at BioRxiv (2019).

  22. 22.

    Ramesh, N., Yoo, J.-H. & Sethi, I. Thresholding based on histogram approximation. In IEEE Proc.—Vision, Image and Signal Processing Vol. 142, 271–279 (IEEE, 1995).

  23. 23.

    Sharma, N. et al. Segmentation and classification of medical images using texture-primitive features: application of BAM-type artificial neural network. J. Med. Phys.33, 119–126 (2008).

    Article Google Scholar

  24. 24.

    Boykov, Y. Y. & Jolly, M.-P. Interactive graph cuts for optimal boundary and region segmentation of objects in nd images. In Proc. Eighth IEEE International Conference on Computer Vision, ICCV 2001 Vol. 1, 105–112 (IEEE, 2001).

  25. 25.

    Litjens, G. et al. A survey on deep learning in medical image analysis. Med. Image Anal.42, 60–88 (2017).

    Article Google Scholar

  26. 26.

    Krizhevsky, A., Sutskever, I. & Hinton, G. E. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25 (NIPS 2012) 1097–1105 (NIPS, 2012).

  27. 27.

    He, K., Gkioxari, G., Dollár, P. & Girshick, R. Mask R-CNN. In Proc. IEEE International Conference on Computer Vision 2961–2969 (IEEE, 2017).

  28. 28.

    Redmon, J., Divvala, S., Girshick, R. & Farhadi, A. You only look once: unified, real-time object detection. In Proc. IEEE Conference on Computer Vision and Pattern Recognition 779–788 (IEEE, 2016).

  29. 29.

    Vinyals, O., Toshev, A., Bengio, S. & Erhan, D. Show and tell: lessons learned from the 2015 MSCOCO image captioning challenge. IEEE Trans. Pattern Anal. Mach. Intell.39, 652–663 (2016).

    Article Google Scholar

  30. 30.

    Sabour, S., Frosst, N. & Hinton, G. E. Dynamic routing between capsules. In Advances in Neural Information Processing Systems 30 (NIPS 2017) 3856–3866 (NIPS, 2017).

  31. 31.

    Badrinarayanan, V., Kendall, A. & Cipolla, R. Segnet: a deep convolutional encoder–decoder architecture for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell.39, 2481–2495 (2017).

    Article Google Scholar

  32. 32.

    Ronneberger, O., Fischer, P. & Brox, T. U-net: convolutional networks for biomedical image segmentation. In Int. Conference on Medical Image Computing and Computer-assisted Intervention 234–241 (Springer, 2015).

  33. 33.

    Buslaev, A., Seferbekov, S. S., Iglovikov, V. & Shvets, A. Fully convolutional network for automatic road extraction from satellite imagery. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) 197–1973 (IEEE, 2018).

  34. 34.

    Belkin, M., Hsu, D. J. & Mitra, P. Overfitting or perfect fitting? Risk bounds for classification and regression rules that interpolate. Advances in Neural Information Processing Systems 31 (NIPS 2018) 2300–2311 (NIPS, 2018).

  35. 35.

    Ciçek, Ö., Abdulkadir, A., Lienkamp, S. S., Brox, T. & Ronneberger, O. 3D U-Net: learning dense volumetric segmentation from sparse annotation. In International Conference on Medical Image Computing and Computer-assisted Intervention 424–432 (Springer, 2016).

  36. 36.

    Milletari, F., Navab, N. & Ahmadi, S.-A. V-Net: fully convolutional neural networks for volumetric medical image segmentation. In 2016 Fourth International Conference on 3D Vision (3DV) 565–571 (IEEE, 2016).

  37. 37.

    Johnson, J. M. & Khoshgoftaar, T. M. Survey on deep learning with class imbalance. J. Big Data6, 27 (2019).

    Article Google Scholar

  38. 38.

    Delgado-Friedrichs, O., Robins, V. & Sheppard, A. Skeletonization and partitioning of digital images using discrete Morse theory. IEEE Trans. Pattern Anal. Mach. Intell.37, 654–666 (2014).

    Article Google Scholar

  39. 39.

    Gyulassy, A., Bremer, P.-T., Hamann, B. & Pascucci, V. A practical approach to Morse–Smale complex computation: scalability and generality. IEEE Trans. Vis. Comput. Graph.14, 1619–1626 (2008).

    Article Google Scholar

  40. 40.

    Robins, V., Wood, P. J. & Sheppard, A. P. Theory and algorithms for constructing discrete Morse complexes from grayscale digital images. IEEE Trans. Pattern Anal. Mach. Intell.33, 1646–1658 (2011).

    Article Google Scholar

  41. 41.

    Dey, T. K., Wang, J. & Wang, Y. Road network reconstruction from satellite images with machine learning supported by topological methods. In Proc. 27th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems 520–523 (ACM, 2019).

  42. 42.

    Edelsbrunner, H. & Harer, J. Persistent homology—a survey. Contemp. Math.453, 257–282 (2008).

    MathSciNetArticle Google Scholar

  43. 43.

    Forman, R. A user’s guide to discrete Morse theory. Sém. Lothar. Combin.48, B48c (2002).

  44. 44.

    Sousbie, T., Pichon, C., Colombi, S., Novikov, D. & Pogosyan, D. The 3D skeleton: tracing the filamentary structure of the Universe. Mon. Not. R. Astron. Soc.383, 1655–1670 (2008).

    Article Google Scholar

Download references


We gratefully acknowledge IHC Brightfield Imaged data from P. Strick at U Pitt, and thank J. Nagashima and M. Hanada also at U Pitt for annotating these images. We acknowledge the effort from annotators at the Center for Computational Brain Research at IIT Madras for the bulk of the data annotation and proofreading for this project. This work was supported by the NIH (EB022899, MH114824, MH114821, NS107466, AT010414), the Crick-Clay Professorship (Cold Spring Harbor Laboratory), the Mathers Charitable Foundation, and H. N. Mahabala Chair (IIT Madras). Work at Ohio State University was in addition partly supported by the NSF under grants CCF-1740761, RI-1815697 and DMS-1547357.

Author information


  1. Cold Spring Harbor Laboratory, New York, NY, USA

    Samik Banerjee, Xu Li, Bing-Xing Huo, Katherine Matho, Meng-Kuan Lin, Josh Huang & Partha P. Mitra

  2. Computer Science and Engineering Department, The Ohio State University, Columbus, OH, USA

    Lucas Magee, Dingkang Wang & Yusu Wang

  3. Center for Computational Brain Research, Indian Institute of Technology, Chennai, India

    Jaikishan Jayakumar, Keerthi Ram & Mohanasankar Sivaprakasam


The idea of using topological priors in the pipeline was conceptualized by Y.W. and P.P.M. Algorithmic design and development was performed by S.B. and L.M. Proofreading assistance and neuroanatomical expertise for neuroanatomical ground truth data was provided by J.J. and K.M. Data preparation, including quality control and acquisition, was performed by B.-X.H., J.J. and K.M. under the supervision of J.H. and P.P.M. The ALBU baseline was tested by D.W. Evaluation of the algorithm was conducted by S.B., D.W., L.M. and X.L. Data preparation, including design of an online proofreading interface and hosting, was done by M.-K.L., M.S. and K.R. S.B., L.M., J.J. and P.P.M. prepared and edited the paper.

Corresponding author

Correspondence to Partha P. Mitra.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Working of Discrete Morse Algorithm.

The Discrete Morse algorithm is given an input image (a). A Gaussian filter is applied to the image (b) - a density function is defined at the pixels. Then the algorithm extracts the ridges of the function across the domain (c) - these ridges form the 1-stable manifold. Finally, each path in the 1-stable manifold is assigned an grayscale value based on intensity along the path, and a grayscale mask is outputted in (d).

Supplementary information

About this article

Verify currency and authenticity via CrossMark

Cite this article

Banerjee, S., Magee, L., Wang, D. et al. Semantic segmentation of microscopic neuroanatomical data by combining topological priors with encoder–decoder deep networks. Nat Mach Intell2, 585–594 (2020).

Download citation

Share this article

Anyone you share the following link with will be able to read this content:

Sorry, a shareable link is not currently available for this article.

Provided by the Springer Nature SharedIt content-sharing initiative

  1. Tupperbox invite
  2. Dell xps external monitor
  3. Oem yamaha graphics
  4. 2011 bmw 528i starter
  5. Basenji pic

Subscribe to Jeremy Jordan

In this post, I'll discuss how to use convolutional neural networks for the task of semantic image segmentation. Image segmentation is a computer vision task in which we label specific regions of an image according to what's being shown.

"What's in this image, and where in the image is it located?"

Jump to:

More specifically, the goal of semantic image segmentation is to label each pixel of an image with a corresponding class of what is being represented. Because we're predicting for every pixel in the image, this task is commonly referred to as dense prediction.

An example of semantic segmentation, where the goal is to predict class labels for each pixel in the image. (Source)

One important thing to note is that we're not separating instances of the same class; we only care about the category of each pixel. In other words, if you have two objects of the same category in your input image, the segmentation map does not inherently distinguish these as separate objects. There exists a different class of models, known as instance segmentation models, which do distinguish between separate objects of the same class.

Segmentation models are useful for a variety of tasks, including:

  • Autonomous vehicles
    We need to equip cars with the necessary perception to understand their environment so that self-driving cars can safely integrate into our existing roads.

A real-time segmented road scene for autonomous driving. (Source)

  • Medical image diagnostics
    Machines can augment analysis performed by radiologists, greatly reducing the time required to run diagnositic tests.

chest xray
A chest x-ray with the heart (red), lungs (green), and clavicles (blue) are segmented. (Source)

Representing the task

Simply, our goal is to take either a RGB color image ($height \times width \times 3$) or a grayscale image ($height \times width \times 1$) and output a segmentation map where each pixel contains a class label represented as an integer ($height \times width \times 1$).

input to label

Note: For visual clarity, I've labeled a low-resolution prediction map. In reality, the segmentation label resolution should match the original input's resolution.

Similar to how we treat standard categorical values, we'll create our target by one-hot encoding the class labels - essentially creating an output channel for each of the possible classes.

one hot

A prediction can be collapsed into a segmentation map (as shown in the first image) by taking the of each depth-wise pixel vector.

We can easily inspect a target by overlaying it onto the observation.


When we overlay a single channel of our target (or prediction), we refer to this as a mask which illuminates the regions of an image where a specific class is present.

Constructing an architecture

A naive approach towards constructing a neural network architecture for this task is to simply stack a number of convolutional layers (with padding to preserve dimensions) and output a final segmentation map. This directly learns a mapping from the input image to its corresponding segmentation through the successive transformation of feature mappings; however, it's quite computationally expensive to preserve the full resolution throughout the network.

Image credit

Recall that for deep convolutional networks, earlier layers tend to learn low-level concepts while later layers develop more high-level (and specialized) feature mappings. In order to maintain expressiveness, we typically need to increase the number of feature maps (channels) as we get deeper in the network.

This didn't necessarily pose a problem for the task of image classification, because for that task we only care about what the image contains (and not where it is located). Thus, we could alleviate computational burden by periodically downsampling our feature maps through pooling or strided convolutions (ie. compressing the spatial resolution) without concern. However, for image segmentation, we would like our model to produce a full-resolution semantic prediction.

One popular approach for image segmentation models is to follow an encoder/decoder structure where we downsample the spatial resolution of the input, developing lower-resolution feature mappings which are learned to be highly efficient at discriminating between classes, and the upsample the feature representations into a full-resolution segmentation map.

Image credit

Methods for upsampling

There are a few different approaches that we can use to upsample the resolution of a feature map. Whereas pooling operations downsample the resolution by summarizing a local area with a single value (ie. average or max pooling), "unpooling" operations upsample the resolution by distributing a single value into a higher resolution.

Image credit

However, transpose convolutions are by far the most popular approach as they allow for us to develop a learned upsampling.

Image credit

Whereas a typical convolution operation will take the dot product of the values currently in the filter's view and produce a single value for the corresponding output position, a transpose convolution essentially does the opposite. For a transpose convolution, we take a single value from the low-resolution feature map and multiply all of the weights in our filter by this value, projecting those weighted values into the output feature map.

A simplified 1D example of upsampling through a transpose operation. (Source)

For filter sizes which produce an overlap in the output feature map (eg. 3x3 filter with stride 2 - as shown in the below example), the overlapping values are simply added together. Unfortunately, this tends to produce a checkerboard artifact in the output and is undesirable, so it's best to ensure that your filter size does not produce an overlap.

Input in blue, output in green. (Source)

Fully convolutional networks

The approach of using a "fully convolutional" network trained end-to-end, pixels-to-pixels for the task of image segmentation was introduced by Long et al. in late 2014. The paper's authors propose adapting existing, well-studied image classification networks (eg. AlexNet) to serve as the encoder module of the network, appending a decoder module with transpose convolutional layers to upsample the coarse feature maps into a full-resolution segmentation map.

Image credit (with modification)

The full network, as shown below, is trained according to a pixel-wise cross entropy loss.

Image credit

However, because the encoder module reduces the resolution of the input by a factor of 32, the decoder module struggles to produce fine-grained segmentations (as shown below).


The paper's authors comment eloquently on this struggle:

Semantic segmentation faces an inherent tension between semantics and location: global information resolves what while local information resolves where... Combining fine layers and coarse layers lets the model make local predictions that respect global structure. ― Long et al.

Adding skip connections

The authors address this tension by slowly upsampling (in stages) the encoded representation, adding "skip connections" from earlier layers, and summing these two feature maps.

Image credit (with modification)

These skip connections from earlier layers in the network (prior to a downsampling operation) should provide the necessary detail in order to reconstruct accurate shapes for segmentation boundaries. Indeed, we can recover more fine-grain detail with the addition of these skip connections.


Ronneberger et al. improve upon the "fully convolutional" architecture primarily through expanding the capacity of the decoder module of the network. More concretely, they propose the U-Net architecture which "consists of a contracting path to capture context and a symmetric expanding path that enables precise localization." This simpler architecture has grown to be very popular and has been adapted for a variety of segmentation problems.

U Net
Image credit

Note: The original architecture introduces a decrease in resolution due to the use of padding. However, some practitioners opt to use padding where the padding values are obtained by image reflection at the border.

Whereas Long et al. (FCN paper) reported that data augmentation ("randomly mirroring and “jittering” the images by translating them up to 32 pixels") did not result in a noticeable improvement in performance, Ronneberger et al. (U-Net paper) credit data augmentations ("random elastic deformations of the training samples") as a key concept for learning. It appears as if the usefulness (and type) of data augmentation depends on the problem domain.

Advanced U-Net variants

The standard U-Net model consists of a series of convolution operations for each "block" in the architecture. As I discussed in my post on common convolutional network architectures, there exist a number of more advanced "blocks" that can be substituted in for stacked convolutional layers.

Drozdzal et al. swap out the basic stacked convolution blocks in favor of residual blocks. This residual block introduces short skip connections (within the block) alongside the existing long skip connections (between the corresponding feature maps of encoder and decoder modules) found in the standard U-Net structure. They report that the short skip connections allow for faster convergence when training and allow for deeper models to be trained.

Expanding on this, Jegou et al. proposed the use of dense blocks, still following a U-Net structure, arguing that the "characteristics of DenseNets make them a very good fit for semantic segmentation as they naturally induce skip connections and multi-scale supervision." These dense blocks are useful as they carry low level features from previous layers directly alongside higher level features from more recent layers, allowing for highly efficient feature reuse.

FC DenseNet
Image credit (with modification)

One very important aspect of this architecture is the fact that the upsampling path does not have a skip connection between the input and output of a dense block. The authors note that because the "upsampling path increases the feature maps spatial resolution, the linear growth in the number of features would be too memory demanding." Thus, only the output of a dense block is passed along in the decoder module.

The FC-DenseNet103 model acheives state of the art results (Oct 2017) on the CamVid dataset.

Dilated/atrous convolutions

One benefit of downsampling a feature map is that it broadens the receptive field (with respect to the input) for the following filter, given a constant filter size. Recall that this approach is more desirable than increasing the filter size due to the parameter inefficiency of large filters (discussed here in Section 3.1). However, this broader context comes at the cost of reduced spatial resolution.

Dilated convolutions provide alternative approach towards gaining a wide field of view while preserving the full spatial dimension. As shown in the figure below, the values used for a dilated convolution are spaced apart according to some specified dilation rate.

Image credit

Some architectures swap out the last few pooling layers for dilated convolutions with successively higher dilation rates to maintain the same field of view while preventing loss of spatial detail. However, it is often still too computationally expensive to completely replace pooling layers with dilated convolutions.

Defining a loss function

The most commonly used loss function for the task of image segmentation is a pixel-wise cross entropy loss. This loss examines each pixel individually, comparing the class predictions (depth-wise pixel vector) to our one-hot encoded target vector.

cross entropy

Because the cross entropy loss evaluates the class predictions for each pixel vector individually and then averages over all pixels, we're essentially asserting equal learning to each pixel in the image. This can be a problem if your various classes have unbalanced representation in the image, as training can be dominated by the most prevalent class. Long et al. (FCN paper) discuss weighting this loss for each output channel in order to counteract a class imbalance present in the dataset.

Meanwhile, Ronneberger et al. (U-Net paper) discuss a loss weighting scheme for each pixel such that there is a higher weight at the border of segmented objects. This loss weighting scheme helped their U-Net model segment cells in biomedical images in a discontinuous fashion such that individual cells may be easily identified within the binary segmentation map.

pixel loss weights
Notice how the binary segmentation map produces clear borders around the cells. (Source)

Another popular loss function for image segmentation tasks is based on the Dice coefficient, which is essentially a measure of overlap between two samples. This measure ranges from 0 to 1 where a Dice coefficient of 1 denotes perfect and complete overlap. The Dice coefficient was originally developed for binary data, and can be calculated as:

$$ Dice = \frac{{2\left| {A \cap B} \right|}}{{\left| A \right| + \left| B \right|}} $$

where ${\left| {A \cap B} \right|}$ represents the common elements between sets A and B, and $\left| A \right|$ represents the number of elements in set A (and likewise for set B).

For the case of evaluating a Dice coefficient on predicted segmentation masks, we can approximate ${\left| {A \cap B} \right|}$ as the element-wise multiplication between the prediction and target mask, and then sum the resulting matrix.


Because our target mask is binary, we effectively zero-out any pixels from our prediction which are not "activated" in the target mask. For the remaining pixels, we are essentially penalizing low-confidence predictions; a higher value for this expression, which is in the numerator, leads to a better Dice coefficient.

In order to quantify $\left| A \right|$ and $\left| B \right|$, some researchers use the simple sum whereas other researchers prefer to use the squared sum for this calculation. I don't have the practical experience to know which performs better empirically over a wide range of tasks, so I'll leave you to try them both and see which works better.


In case you were wondering, there's a 2 in the numerator in calculating the Dice coefficient because our denominator "double counts" the common elements between the two sets. In order to formulate a loss function which can be minimized, we'll simply use $1 - Dice$. This loss function is known as the soft Dice loss because we directly use the predicted probabilities instead of thresholding and converting them into a binary mask.

With respect to the neural network output, the numerator is concerned with the common activations between our prediction and target mask, where as the denominator is concerned with the quantity of activations in each mask separately. This has the effect of normalizing our loss according to the size of the target mask such that the soft Dice loss does not struggle learning from classes with lesser spatial representation in an image.

soft dice

A soft Dice loss is calculated for each class separately and then averaged to yield a final score. An example implementation is provided below.

Common datasets and segmentation competitions

Below, I've listed a number of common datasets that researchers use to train new models and benchmark against the state of the art. You can also explore previous Kaggle competitions and read about how winning solutions implemented segmentation models for their given task.


Past Kaggle Competitions

Further Reading



Blog posts

Image labeling tools

Useful Github repos

Fully Convolutional Networks for Semantic Segmentation - 허다운

Image Segmentation in 2021: Architectures, Losses, Datasets, and Frameworks

In this piece, we’ll take a plunge into the world of image segmentation using deep learning.

Let’s dive in.

What is image segmentation?

As the term suggests this is the process of dividing an image into multiple segments. In this process, every pixel in the image is associated with an object type. There are two major types of image segmentation — semantic segmentation and instance segmentation.

In semantic segmentation, all objects of the same type are marked using one class label while in instance segmentation similar objects get their own separate labels.

image segmentation

👉Image Segmentation: Tips and Tricks from 39 Kaggle Competitions
👉How to Do Data Exploration for Image Segmentation and Object Detection (Things I Had to Learn the Hard Way)

Image segmentation architectures

The basic architecture in image segmentation consists of an encoder and a decoder.

The encoder extracts features from the image through filters. The decoder is responsible for generating the final output which is usually a segmentation mask containing the outline of the object. Most of the architectures have this architecture or a variant of it.

Let’s look at a couple.


U-Net is a convolutional neural network originally developed for segmenting biomedical images. When visualized its architecture looks like the letter U and hence the name U-Net. Its architecture is made up of two parts, the left part — the contracting path and the right part — the expansive path. The purpose of the contracting path is to capture context while the role of the expansive path is to aid in precise localization.

U-net architecture image segmentation

U-Net is made up of an expansive path on the right and a contracting path on the left. The contracting path is made up of two three-by-three convolutions. The convolutions are followed by a rectified linear unit and a two-by-two max-pooling computation for downsampling.

U-Net’s full implementation can be found here.

FastFCN —Fast Fully-connected network

In this architecture, a Joint Pyramid Upsampling(JPU) module is used to replace dilated convolutions since they consume a lot of memory and time. It uses a fully-connected network at its core while applying JPU for upsampling. JPU upsamples the low-resolution feature maps to high-resolution feature maps.

If you’d like to get your hands dirty with some code implementation, here you go.


This architecture consists of a two-stream CNN architecture. In this model, a separate branch is used to process image shape information. The shape stream is used to process boundary information.

You can implement it by checking out the code here.


In this architecture, convolutions with upsampled filters are used for tasks that involve dense prediction. Segmentation of objects at multiple scales is done via atrous spatial pyramid pooling. Finally, DCNNs are used to improve the localization of object boundaries. Atrous convolution is achieved by upsampling the filters through the insertion of zeros or sparse sampling of input feature maps.

You can try its implementation on either PyTorch or TensorFlow.

Mask R-CNN

In this architecture, objects are classified and localized using a bounding box and semantic segmentation that classifies each pixel into a set of categories. Every region of interest gets a segmentation mask. A class label and a bounding box are produced as the final output. The architecture is an extension of the Faster R-CNN. The Faster R-CNN is made up of a deep convolutional network that proposes the regions and a detector that utilizes the regions.

Here is an image of the result obtained on the COCO test set.

Mask R-CNN results on the COCO test set

Later, we will work on a couple of Mask R-CNN use cases to automatically segment and construct pixel-wise masks for each object in an image.

Mask R-CNN use cases

As we alluded to in the previous section, for today’s introductory use case demonstration, we will be focusing on the Mask R-CNN framework for image segmentation. Specifically, we will utilize the weights of the Mask R-CNN model pretrained on the COCO dataset aforementioned to build an inference type of model.

During the model building process, we will also set up Neptune experiments to track and compare prediction performance with different hyperparameter tuning.

Now, let’s dive right in!

Install Mask R-CNN

First and foremost, we need to install the required packages and set up our environment. For this exercise, the algorithm implementation by Matterport will be used. Since there is no distributed version of this package so far, I put together several steps to install it by cloning from the Github repo:

One caveat here is that the original Matterport code has not been updated to be compatible with Tensorflow 2+. Hence, for all Tensorflow 2+ users, myself included, getting it to work becomes quite challenging as it would require significant modifications to the source code. If you prefer not to customize your code, an updated version for Tensorflow 2+ is also available here. Therefore, please make sure to clone the correct repo according to your Tensorflow versions.

Step 1: Clone the Mask R-CNN GitHub repo

  • Tensorflow 1+ and keras prior to 2.2.4: 
git clone clone updated_mask_rcnn

This will create a new folder named “updated_mask_rcnn” to differentiate the updated version from the original one.

Step 2: 

  • Check and Install package dependencies
  • Navigate to the folder containing the repo
  • Run:  pip install -r requirements.txt

Step 3:

  • Run setup to install the package
  • Run:  python clean -all install

Few points to ponder:

  1. If you encounter this error message: then upgrade your setuptools.
  2. For Windows users, if you are asked to install the pycocotools, be sure to use pip install pycocotools-windows, rather than the pycocotools as it may have compatibility issues with Windows.

Load the pretrained model

Next, from the Mask_RCNN project Github, let’s download the model weights into the current working directory: mask_rcnn_coco.h5

Image segmentation model tracking with Neptune

When it comes to the model training process, Neptune offers an effective yet easy-to-use way to track and log almost everything model-related, from hyperparameters specification to best model saving, to result from plots logging and so much more. What’s cool about experiment tracking with Neptune is that it will automatically generate performance charts for practitioners to compare different runs, and thus to select an optimal one.

For a more detailed explanation of configuring your Neptune environment and setting up your experiment, please check out this complete guide and my other blog here on Implementing macro F1 scores in Keras.

In this blog, I will also be demonstrating how to leverage Neptune during the image segmentation implementation. Yes, Neptune can well be used to track image processing models!

Importing all the required packages:

import neptune import os import sys import numpy as np import import matplotlib import matplotlib.pyplot as plt ROOT_DIR = os.path.abspath(PATH_TO_YOUR_WORK_DIRECTORY) sys.path.append(ROOT_DIR) from mrcnn import utils import mrcnn.model as modellib from mrcnn import visualize from mrcnn.config import Config from mrcnn.model import MaskRCNN from mrcnn.visualize import display_instances from mrcnn.model import log from keras.preprocessing.image import load_img from keras.preprocessing.image import img_to_array sys.path.append(os.path.join(ROOT_DIR, "samples/coco/")) import coco MODEL_DIR = os.path.join(ROOT_DIR, "logs") COCO_MODEL_PATH = os.path.join(ROOT_DIR, "mask_rcnn_coco.h5")

Now, let’s create a project with Neptune specifically for this image segmentation excise:

Next, in Python, creating a Neptune experiment connected to our Image Segmentation Project project, so that we can log and monitor the model information and outputs to Neptune:

import neptune import os project = neptune.init(api_token=os.getenv('NEPTUNE_API_TOKEN'), project_qualified_name='YourUserName/YourProjectName') npt_exp = project.create_experiment('implement-MaskRCNN-Neptune', tags=['image segmentation', 'mask rcnn', 'keras', 'neptune'])

Few notes:

  1. The api_token arg in the neptune.init() takes your Neptune API generated from the config steps;
  2. The tags arg in the project.create_experiment() is optional, but it’s good to specify tags for a given project for easy sharing and tracking.

Having ImageSegmentationProject in my demo, along with its initial experiment successfully set up, we can move onto the modeling part.

Config the Mask R-CNN model

To run image segmentation and inference, we need to define our model as an instance of MaskRCNN class and construct a config object as one parameter fed into the class. The purpose of this config object is to specify how our model is leveraged to train and make predictions.

To warm-up, let’s only specify the batch size for the simplest implementation.

classInferenceConfig(coco.CocoConfig): GPU_COUNT = 1 IMAGES_PER_GPU = 1 config = InferenceConfig() npt_exp.send_text('Model Config Pars', str(config.to_dict()))

Here, batch size = GPU_COUNT * IMAGES_PER_GPU, where both values are set to 1 as we will do segmentations on one image at a time. We also sent the config info to Neptune so that we can keep track of our experiments.

This video clip shows what we will see in our Neptune project, which I zoomed in to show details.

Image segmentation task # 1 with simple model configuration

With all the preparation work completed, we continue with the most exciting part — to make inferences on real images and see how the model is doing.

For task #1, we will work with this image, which can be downloaded here for free.

The following demonstrates how our MASKRCNN model instance is defined:

model = modellib.MaskRCNN(mode="inference", model_dir=MODEL_DIR, config=config) model.load_weights(COCO_MODEL_PATH, by_name=True) image_path = path_to_image_monks img = load_img(image_path) img = img_to_array(img) results = model.detect([img], verbose=1)

Points to ponder:

  1. We specify the type of our current model to be “inference”, indicating that we are making image predictions/inference.
  2. For the Mask R-CNN model to do prediction, the image must be converted to a Numpy array.
  3. Rather than using model.predict() as we would for a Keras model prediction, we call the model.detect() function.

Sweet! Now we have the segmentation result, but how should we inspect the result and get a corresponding image out of it? Well, the model output is a dictionary containing multiple components,

  • ROIs: the regions-of-interest(ROI) for the segmented objects.
  • Masks: the masks for the segmented objects.
  • class_ids: the class ID integer for the segmented objects.
  • scores: the predicted probability of each segment belonging to a class.

To visualize the output, we can use the following code.

image_results = results[0] box, mask, classID, score = image_results['rois'], image_results['masks'], image_results['class_ids'], image_results['scores'] fig_images, cur_ax = plt.subplots(figsize=(15, 15)) display_instances(img, box, mask, classID, class_names, score, ax=cur_ax) npt_exp.log_image('Predicted Image', fig_images)

Here the class_names refers to a list of 80 object labels/categories in the COCO dataset. You can copy and paste it from my Github.

Running the code above returns this predicted output image in our Neptune experiment,

Image segmentation task 2

Impressive isn’t it! Our model successfully segmented the monks/humans and the dogs. What’s more impressive is that the model assigns a very high probability/confidence score (i.e., close to 1) to each segmentation!

Image segmentation task # 2 with model hyperparameter tuning

Now You may think that our model did a great job on the last image probably because every object is sort of at the focus, which makes the segmentation task easier because there weren’t too many confounding background objects. How about images with blurred backgrounds? Would the model achieve an equal level of performance?

Let’s experiment together.

Shown below is an image of an adorable teddy bear with a blurred background of cakes.

For better code organization, we can compile the aforementioned model inference steps into a function runMaskRCNN, which takes in two primary args.: modelConfig and imagePath:

defrunMaskRCNN(modelConfig, imagePath, MODEL_DIR=MODEL_DIR, COCO_MODEL_PATH=COCO_MODEL_PATH):''' Args: modelConfig: config object imagePath: full path to the image ''' model = modellib.MaskRCNN(mode="inference", model_dir=MODEL_DIR, config=modelConfig) model.load_weights(COCO_MODEL_PATH, by_name=True) image_path = imagePath img = load_img(image_path) img = img_to_array(img) results = model.detect([img], verbose=1) modelOutput = results[0] return modelOutput, img

Experimenting with the first model, we are using the same config as for task #1. This information will be sent to Neptune for tracking and comparing later.

cur_image_path = path_to_image_teddybear image_results, img = runMaskRCNN(modelConfig=config, imagePath=cur_image_path) npt_exp.send_text('Model Config Pars', str(config.to_dict())) fig_images, cur_ax = plt.subplots(figsize=(15, 15)) display_instances(img, image_results['rois'], image_results['masks'], image_results['class_ids'], class_names, image_results['scores'], ax=cur_ax) npt_exp.log_image('Predicted Image', fig_images)

Predicted by this model, the following output image should show up in our Neptune experiment log.

Image segmentation task 4

As we can see, the model successfully segmented the teddy bear and cupcakes in the background. As far as the cupcake cover goes, the model labeled it as “bottle” with a fairly high probability/confidence score, the same applies to the cupcake tray underneath, which was identified as “bowl”. Both make sense!

Overall, our model did a decent job identifying each object. However, we do also notice that a part of the cake was mislabeled as “teddy bear” with a probability score of 0.702 (i.e., the green box in the middle).

How can we fix this?

Customize the config object for model hyperparameter tuning

We can construct a custom model config to override the hyperparameters in the base config class. Hence to tailor the modeling process specifically for this teddy bear image:

classCustomConfig(coco.CocoConfig):"""Configuration for inference on the teddybear image. Derives from the base Config class and overrides values specific to the teddybear image. """ NAME = "customized" NUM_CLASSES = 1 + 80 GPU_COUNT = 1 IMAGES_PER_GPU = 1 STEPS_PER_EPOCH = 500 DETECTION_MIN_CONFIDENCE = 0.71 LEARNING_RATE = 0.06 LEARNING_MOMENTUM = 0.7 WEIGHT_DECAY = 0.0002 VALIDATION_STEPS = 30 config = CustomConfig() npt_exp.send_text('Model Config Pars', str(config.to_dict()))

After running the model with this new config, we will see this image with correct segmentations in the Neptune project log,

Image segmentation task 5

In our custom config class, we specified the number of classes, steps in each epoch, learning rate, weight decay, and so on. For a complete list of hyperparameters, please refer to the file in the package.

We encourage you to play around with different (hyperparameter) combinations and set up your Neptune project to track and compare their performance. The video clip below showcases the two models we just built along with their prediction results in Neptune.

Mask R-CNN layer weights

For the geeky audience out there who want to go deep into the weeds with our model, we can also collect and visualize the weights and biases for each layer of this CNN model. The following code snippet demonstrates how to do this for the first 5 convolutional layers.

LAYER_TYPES = ['Conv2D'] layers = model.get_trainable_layers() layers = list(filter(lambda l: l.__class__.__name__ in LAYER_TYPES, layers)) print(f'Total layers = {len(layers)}') layers = layers[:5] fig, ax = plt.subplots(len(layers), 2, figsize=(10, 3*len(layers)+10), gridspec_kw={"hspace":1}) for l, layer in enumerate(layers): weights = layer.get_weights() for w, weight in enumerate(weights): tensor = layer.weights[w] ax[l, w].set_title( _ = ax[l, w].hist(weight[w].flatten(), 50) npt_exp.log_image('Model_Weights', fig)

This is a screenshot of what displays in Neptune, showing the histogram of layer weights,

Image segmentation histograms

Image segmentation loss functions

Semantic segmentation models usually use a simple cross-categorical entropy loss function during training. However, if you are interested in getting the granular information of an image, then you have to revert to slightly more advanced loss functions.

Let’s go through a couple of them.

Focal Loss

This loss is an improvement to the standard cross-entropy criterion. This is done by changing its shape such that the loss assigned to well-classified examples is down-weighted. Ultimately, this ensures that there is no class imbalance. In this loss function, the cross-entropy loss is scaled with the scaling factors decaying at zero as the confidence in the correct classes increases. The scaling factor automatically down weights the contribution of easy examples at training time and focuses on the hard ones.

Dice loss

This loss is obtained by calculating smooth dice coefficient function. This loss is the most commonly used loss is segmentation problems.  

Intersection over Union (IoU)-balanced Loss

The IoU-balanced classification loss aims at increasing the gradient of samples with high IoU and decreasing the gradient of samples with low IoU. In this way, the localization accuracy of machine learning models is increased.

Boundary loss

One variant of the boundary loss is applied to tasks with highly unbalanced segmentations. This loss’s form is that of a distance metric on space contours and not regions. In this manner, it tackles the problem posed by regional losses for highly imbalanced segmentation tasks.

Weighted cross-entropy

In one variant of cross-entropy, all positive examples are weighted by a certain coefficient. It is used in scenarios that involve class imbalance.

Lovász-Softmax loss

This loss performs direct optimization of the mean intersection-over-union loss in neural networks based on the convex Lovasz extension of sub-modular losses.

Other losses worth mentioning are:

  • TopK loss whose aim is to ensure that networks concentrate on hard samples during the training process.
  • Distance penalized CE loss that directs the network to boundary regions that are hard to segment.
  • Sensitivity-Specificity (SS) loss that computes the weighted sum of the mean squared difference of specificity and sensitivity.
  • Hausdorff distance(HD) loss that estimated the Hausdorff distance from a convolutional neural network.

These are just a couple of loss functions used in image segmentation. To explore many more check out this repo.

Image segmentation datasets

If you are still here, chances are that you might be asking yourself where you can get some datasets to get started.

Let’s look at a few.

Common Objects in COntext — Coco Dataset

COCO is a large-scale object detection, segmentation, and captioning dataset. The dataset contains 91 classes. It has 250,000 people with key points. Its download size is 37.57 GiB. It contains 80 object categories. It is available under the Apache 2.0 License and can be downloaded from here.

PASCAL Visual Object Classes (PASCAL VOC)

PASCAL has 9963 images with 20 different classes. The training/validation set is a 2GB tar file. The dataset can be downloaded from the official website.

The Cityscapes Dataset

This dataset contains images of city scenes. It can be used to evaluate the performance of vision algorithms in urban scenarios. The dataset can be downloaded from here.

The Cambridge-driving Labeled Video Database — CamVid

This is a motion-based segmentation and recognition dataset. It contains 32 semantic classes. This link contains further explanations and download links to the dataset.

Image segmentation frameworks

Now that you are armed with possible datasets, let’s mention a few tools/frameworks that you can use to get started.

  • FastAI library — given an image this library is able to create a mask of the objects in the image.
  • Sefexa Image Segmentation Tool — Sefexa is a free tool that can be used for Semi-automatic image segmentation, analysis of images, and creation of ground truth
  • Deepmask — Deepmask by Facebook Research is a Torch implementation of DeepMask and SharpMask
  • MultiPath — This a Torch implementation of the object detection network from “A MultiPath Network for Object Detection”.
  • OpenCV — This is an open-source computer vision library with over 2500 optimized algorithms.
  • MIScnn — is a medical image segmentation open-source library. It allows setting up pipelines with state-of-the-art convolutional neural networks and deep learning models in a few lines of code.
  • Fritz: Fritz offers several computer vision tools including image segmentation tools for mobile devices.

Final thoughts

Hopefully, this article gave you some background into image segmentation as well as some tools and frameworks. We also hope that the use case demonstration could spark your interest in getting started exploring this fascinating area of deep neural networks

We’ve covered:

  • what image segmentation is,
  • a couple of image segmentation architectures,
  • some image segmentation losses,
  • image segmentation tools and frameworks,
  • use case implementation with the Mask R-CNN algorithm.

For more information check out the links attached to each of the architectures and frameworks. In addition, the Neptune project is made available here, and the full code can be accessed at this Github repo here. 

Happy segmenting!

Derrick Mwiti

Derrick Mwiti

Derrick Mwiti is a data scientist who has a great passion for sharing knowledge. He is an avid contributor to the data science community via blogs such as Heartbeat, Towards Data Science, Datacamp, Neptune AI, KDnuggets just to mention a few. His content has been viewed over a million times on the internet. Derrick is also an author and online instructor. He also trains and works with various institutions to implement data science solutions as well as to upskill their staff. You might want to check his Complete Data Science & Machine Learning Bootcamp in Python course.

Katherine (Yi) Li

Katherine (Yi) Li

Data Scientist | Data Science Writer
A data enthusiast specializing in machine learning and data mining. Programming, coding and delivering data-driven insights are her passion. She believes that knowledge increases upon sharing; hence she writes about data science in hope of inspiring individuals who are embarking on a similar data science career.


Image Processing in Python: Algorithms, Tools, and Methods You Should Know

9 mins read | Author Neetika Khandelwal | Updated May 27th, 2021

Images define the world, each image has its own story, it contains a lot of crucial information that can be useful in many ways. This information can be obtained with the help of the technique known as Image Processing.

It is the core part of computer vision which plays a crucial role in many real-world examples like robotics, self-driving cars, and object detection. Image processing allows us to transform and manipulate thousands of images at a time and extract useful insights from them. It has a wide range of applications in almost every field. 

Python is one of the widely used programming languages for this purpose. Its amazing libraries and tools help in achieving the task of image processing very efficiently. 

Through this article, you will learn about classical algorithms, techniques, and tools to process the image and get the desired output.

Let’s get into it!

Continue reading ->

Segmentation networks semantic

A 2021 guide to Semantic Segmentation


Deep learning has been very successful when working with images as data and is currently at a stage where it works better than humans on multiple use-cases. The most important problems that humans have been  interested in solving with computer vision are image classification, object detection and segmentation in the increasing order of their difficulty.

In the plain old task of image classification we are just interested in getting the labels of all the objects that are present in an image. In object detection we come further a step and try to know along with what all objects that are present in an image, the location at which the objects are present with the help of bounding boxes. Image segmentation takes it to a new level by trying to find out accurately the exact boundary of the objects in the image.

In this article we will go through this concept of image segmentation, discuss the relevant use-cases, different neural network architectures involved in achieving the results, metrics and datasets to explore.

What is image segmentation

We know an image is nothing but a collection of pixels. Image segmentation is the process of classifying each pixel in an image belonging to a certain class and hence can be thought of as a classification problem per pixel. There are two types of segmentation techniques

  1. Semantic segmentation :- Semantic segmentation is the process of classifying each pixel belonging to a particular label. It doesn't different across different instances of the same object. For example if there are 2 cats in an image, semantic segmentation gives same label to all the pixels of both cats
  2. Instance segmentation :- Instance segmentation differs from semantic segmentation in the sense that it gives a unique label to every instance of a particular object in the image. As can be seen in the image above all 3 dogs are assigned different colours i.e different labels. With semantic segmentation all of them would have been assigned the same colour.

So we will now come to the point where would we need this kind of an algorithm

Use-cases of image segmentation

Handwriting Recognition :- Junjo et all demonstrated how semantic segmentation is being used to extract words and lines from handwritten documents in their 2019 research paper to recognise handwritten characters

Google portrait mode :- There are many use-cases where it is absolutely essential to separate foreground from background. For example in Google's portrait mode we can see the background blurred out while the foreground remains unchanged to give a cool effect

YouTube stories :- Google recently released a feature YouTube stories for content creators to show different backgrounds while creating stories.

Virtual make-up :- Applying virtual lip-stick is possible now with the help of image segmentation

4.Virtual try-on :- Virtual try on of clothes is an interesting feature which was available in stores using specialized hardware which creates a 3d model. But with deep learning and image segmentation the same can be obtained using just a 2d image

Visual Image Search :- The idea of segmenting out clothes is also used in image retrieval algorithms in eCommerce. For example Pinterest/Amazon allows you to upload any picture and get related similar looking products by doing an image search based on segmenting out the cloth portion

Self-driving cars :- Self driving cars need a complete understanding of their surroundings to a pixel perfect level. Hence image segmentation is used to identify lanes and other necessary information

Nanonets helps fortune 500 companies enable better customer experiences at scale using Semantic Segmentation.

Methods and Techniques

Before the advent of deep learning, classical machine learning techniques like SVM, Random Forest, K-means Clustering were used to solve the problem of image segmentation. But as with most of the image related problem statements deep learning has worked comprehensively better than the existing techniques and has become a norm now when dealing with Semantic Segmentation. Let's review the techniques which are being used to solve the problem

Fully Convolutional Network

The general architecture of a CNN consists of few convolutional and pooling layers followed by few fully connected layers at the end. The paper of Fully Convolutional Network released in 2014 argues that the final fully connected layer can be thought of as doing a 1x1 convolution that cover the entire region.

Hence the final dense layers can be replaced by a convolution layer achieving the same result. But now the advantage of doing this is the size of input need not be fixed anymore. When involving dense layers the size of input is constrained and hence when a different sized input has to be provided it has to be resized. But by replacing a dense layer with convolution, this constraint doesn't exist.

Also when a bigger size of image is provided as input the output produced will be a feature map and not just a class output like for a normal input sized image. Also the observed behavior of the final feature map represents the heatmap of the required class i.e the position of the object is highlighted in the feature map. Since the output of the feature map is a heatmap of the required object it is valid information for our use-case of segmentation.

Since the feature map obtained at the output layer is a down sampled due to the set of convolutions performed, we would want to up-sample it using an interpolation technique. Bilinear up sampling works but the paper proposes using learned up sampling with deconvolution which can even learn a non-linear up sampling.

The down sampling part of the network is called an encoder and the up sampling part is called a decoder. This is a pattern we will see in many architectures i.e reducing the size with encoder and then up sampling with decoder. In an ideal world we would not want to down sample using pooling and keep the same size throughout but that would lead to a huge amount of parameters and would be computationally infeasible.

Although the output results obtained have been decent the output observed is rough and not smooth. The reason for this is loss of information at the final feature layer due to downsampling by 32 times using convolution layers. Now it becomes very difficult for the network to do 32x upsampling by using this little information. This architecture is called FCN-32

To address this issue, the paper proposed 2 other architectures FCN-16, FCN-8. In FCN-16 information from the previous pooling layer is used along with the final feature map and hence now the task of the network is to learn 16x up sampling which is better compared to FCN-32. FCN-8 tries to make it even better by including information from one more previous pooling layer.


U-net builds on top of the fully convolutional network from above. It was built for medical purposes to find tumours in lungs or the brain. It also consists of an encoder which down-samples the input image to a feature map and the decoder which up samples the feature map to input image size using learned deconvolution layers.

The main contribution of the U-Net architecture is the shortcut connections. We saw above in FCN that since we down-sample an image as part of the encoder we lost a lot of information which can't be easily recovered in the encoder part. FCN tries to address this by taking information from pooling layers before the final feature layer.

U-Net proposes a new approach to solve this information loss problem. It proposes to send information to every up sampling layer in decoder from the corresponding down sampling layer in the encoder as can be seen in the figure above thus capturing finer information whilst also keeping the computation low. Since the layers at the beginning of the encoder would have more information they would bolster the up sampling operation of decoder by providing fine details corresponding to the input images thus improving the results a lot. The paper also suggested use of a novel loss function which we will discuss below.


Deeplab from a group of researchers from Google have proposed a multitude of techniques to improve the existing results and get finer output at lower computational costs. The 3 main improvements suggested as part of the research are

1) Atrous convolutions
2) Atrous Spatial Pyramidal Pooling
3) Conditional Random Fields usage for improving final output
Let's discuss about all these

Atrous Convolution

One of the major problems with FCN approach is the excessive downsizing due to consecutive pooling operations. Due to series of pooling the input image is down sampled by 32x which is again up sampled to get the segmentation result. Downsampling by 32x results in a loss of information which is very crucial for getting fine output in a segmentation task. Also deconvolution to up sample by 32x is a computation and memory expensive operation since there are additional parameters involved in forming a learned up sampling.

The paper proposes the usage of Atrous convolution or the hole convolution or dilated convolution which helps in getting an understanding of large context using the same number of parameters.

Dilated convolution works by increasing the size of the filter by appending zeros(called holes) to fill the gap between parameters. The number of holes/zeroes filled in between the filter parameters is called by a term dilation rate. When the rate is equal to 1 it is nothing but the normal convolution. When rate is equal to 2 one zero is inserted between every other parameter making the filter look like a 5x5 convolution. Now it has the capacity to get the context of 5x5 convolution while having 3x3 convolution parameters. Similarly for rate 3 the receptive field goes to 7x7.

In Deeplab last pooling layers are replaced to have stride 1 instead of 2 thereby keeping the down sampling rate to only 8x. Then a series of atrous convolutions are applied to capture the larger context. For training the output labelled mask is down sampled by 8x to compare each pixel. For inference, bilinear up sampling is used to produce output of the same size which gives decent enough results at lower computational/memory costs since bilinear up sampling doesn't need any parameters as opposed to deconvolution for up sampling.


Spatial Pyramidal Pooling is a concept introduced in SPPNet to capture multi-scale information from a feature map. Before the introduction of SPP input images at different resolutions are supplied and the computed feature maps are used together to get the multi-scale information but this takes more computation and time. With Spatial Pyramidal Pooling multi-scale information can be captured with a single input image.

With the SPP module the network produces 3 outputs of dimensions 1x1(i.e GAP), 2x2 and 4x4. These values are concatenated by converting to a 1d vector thus capturing information at multiple scales. Another advantage of using SPP is input images of any size can be provided.

ASPP takes the concept of fusing information from different scales and applies it to Atrous convolutions. The input is convolved with different dilation rates and the outputs of these are fused together.

As can be seen the input is convolved with 3x3 filters of dilation rates 6, 12, 18 and 24 and the outputs are concatenated together since they are of same size. A 1x1 convolution output is also added to the fused output. To also provide the global information, the GAP output is also added to above after up sampling. The fused output of 3x3 varied dilated outputs, 1x1 and GAP output is passed through 1x1 convolution to get to the required number of channels.

Since the required image to be segmented can be of any size in the input the multi-scale information from ASPP helps in improving the results.

Improving output with CRF

Pooling is an operation which helps in reducing the number of parameters in a neural network but it also brings a property of invariance along with it. Invariance is the quality of a neural network being unaffected by slight translations in input. Due to this property obtained with pooling the segmentation output obtained by a neural network is coarse and the boundaries are not concretely defined.

To deal with this the paper proposes use of graphical model CRF. Conditional Random Field operates a post-processing step and tries to improve the results produced to define shaper boundaries. It works by classifying a pixel based not only on it's label but also based on other pixel labels. As can be seen from the above figure the coarse boundary produced by the neural network gets more refined after passing through CRF.

Deeplab-v3 introduced batch normalization and suggested dilation rate multiplied by (1,2,4) inside each layer in a Resnet block.  Also adding image level features to ASPP module which was discussed in the above discussion on ASPP was proposed as part of this paper

Deeplab-v3+ suggested to have a decoder instead of plain bilinear up sampling 16x. The decoder takes a hint from the decoder used by architectures like U-Net which take information from encoder layers to improve the results. The encoder output is up sampled 4x using bilinear up sampling and concatenated with the features from encoder which is again up sampled 4x after performing a 3x3 convolution. This approach yields better results than a direct 16x up sampling. Also modified Xception architecture is proposed to be used instead of Resnet as part of encoder and depthwise separable convolutions are now used on top of Atrous convolutions to reduce the number of computations.

Global Convolution Network

Semantic segmentation involves performing two tasks concurrently

i) Classification
ii) Localization

The classification networks are created to be invariant to translation and rotation thus giving no importance to location information whereas the localization involves getting accurate details w.r.t the location. Thus inherently these two tasks are contradictory. Most segmentation algorithms give more importance to localization i.e the second in the above figure and thus lose sight of global context. In this work the author proposes a way to give importance to classification task too while at the same time not losing the localization information

The author proposes to achieve this by using large kernels as part of the network thus enabling dense connections and hence more information. This is achieved with the help of a GCN block as can be seen in the above figure. GCN block can be thought of as a k x k convolution filter where k can be a number bigger than 3. To reduce the number of parameters a k x k filter is further split into 1 x k and k x 1, kx1 and 1xk blocks which are then summed up. Thus by increasing value k, larger context is captured.

In addition, the author proposes a Boundary Refinement block which is similar to a residual block seen in Resnet consisting of a shortcut connection and a residual connection which are summed up to get the result. It is observed that having a Boundary Refinement block resulted in improving the results at the boundary of segmentation.

Results showed that GCN block improved the classification accuracy of pixels closer to the center of object indicating the improvement caused due to capturing long range context whereas Boundary Refinement block helped in improving accuracy of pixels closer to boundary.

See More Than Once – KSAC for Semantic Segmentation

Deeplab family uses ASPP to have multiple receptive fields capture information using different atrous convolution rates. Although ASPP has been significantly useful in improving the segmentation of results there are some inherent problems caused due to the architecture. There is no information shared across the different parallel layers in ASPP thus affecting the generalization power of the kernels in each layer. Also since each layer caters to different sets of training samples(smaller objects to smaller atrous rate and bigger objects to bigger atrous rates), the amount of data for each parallel layer would be less thus affecting the overall generalizability.  Also the number of parameters in the network increases linearly with the number of parameters and thus can lead to overfitting.

To handle all these issues the author proposes a novel network structure called Kernel-Sharing Atrous Convolution (KSAC). As can be seen in the above figure, instead of having a different kernel for each parallel layer is ASPP a single kernel is shared across thus improving the generalization capability of the network. By using KSAC instead of ASPP 62% of the parameters are saved when dilation rates of 6,12 and 18 are used.

Another advantage of using a KSAC structure is the number of parameters are independent of the number of dilation rates used. Thus we can add as many rates as possible without increasing the model size. ASPP gives best results with rates 6,12,18 but accuracy decreases with 6,12,18,24 indicating possible overfitting. But KSAC accuracy still improves considerably indicating the enhanced generalization capability.

This kernel sharing technique can also be seen as an augmentation in the feature space since the same kernel is applied over multiple rates. Similar to how input augmentation gives better results, feature augmentation performed in the network should help improve the representation capability of the network.

Video Segmentation

For use cases like self-driving cars, robotics etc. there is a need for real-time segmentation on the observed video. The architectures discussed so far are pretty much designed for accuracy and not for speed. So if they are applied on a per-frame basis on a video the result would come at very low speed.

Also generally in a video there is a lot of overlap in scenes across consecutive frames which could be used for improving the results and speed which won't come into picture if analysis is done on a per-frame basis. Using these cues let's discuss architectures which are specifically designed for videos


Spatio-Temporal FCN proposes to use FCN along with LSTM to do video segmentation. We are already aware of how FCN can be used to extract features for segmenting an image. LSTM are a kind of neural networks which can capture sequential information over time. STFCN combines the power of FCN with LSTM to capture both the spatial information and temporal information

As can be seen from the above figure STFCN consists of a FCN, Spatio-temporal module followed by deconvolution. The feature map produced by a FCN is sent to Spatio-Temporal Module which also has an input from the previous frame's module. The module based on both these inputs captures the temporal information in addition to the spatial information and sends it across which is up sampled to the original size of image using deconvolution similar to how it's done in FCN

Since both FCN and LSTM are working together as part of STFCN the network is end to end trainable and outperforms single frame segmentation approaches.  There are similar approaches where LSTM is replaced by GRU but the concept is same of capturing both the spatial and temporal information

Semantic Video CNNs through Representation Warping

This paper proposes the use of optical flow across adjacent frames as an extra input to improve the segmentation results

The approach suggested can be roped in to any standard architecture as a plug-in. The key ingredient that is at play is the NetWarp module. To compute the segmentation map the optical flow between the current frame and previous frame is calculated i.e Ft and is passed through a FlowCNN to get Λ(Ft) . This process is called Flow Transformation. This value is passed through a warp module which also takes as input the feature map of an intermediate layer calculated by passing through the network. This gives a warped feature map which is then combined with the intermediate feature map of the current layer and the entire network is end to end trained. This architecture achieved SOTA results on CamVid and Cityscapes video benchmark datasets.

Clockwork Convnets for Video Semantic Segmentation

This paper proposes to improve the speed of execution of a neural network for segmentation task on videos by taking advantage of the fact that semantic information in a video changes slowly compared to pixel level information. So the information in the final layers changes at a much slower pace compared to the beginning layers. The paper suggests different times

The above figure represents the rate of change comparison for a mid level layer pool4 and a deep layer fc7. On the left we see that since there is a lot of change across the frames both the layers show a change but the change for pool4 is higher.  In the right we see that there is not a lot of change across the frames. Hence pool4 shows marginal change whereas fc7 shows almost nil change.

The research utilizes this concept and suggests that in cases where there is not much of a change across the frames there is no need of computing the features/outputs again and the cached values from the previous frame can be used. Since the rate of change varies with layers different clocks can be set for different sets of layers. When the clock ticks the new outputs are calculated, otherwise the cached results are used. The rate of clock ticks can be statically fixed or can be dynamically learnt

Low-Latency Video Semantic Segmentation

This paper improves on top of the above discussion by adaptively selecting the frames to compute the segmentation map or to use the cached result instead of using a fixed timer or a heuristic.

The paper proposes to divide the network into 2 parts, low level features and high level features. The cost of computing low level features in a network is much less compared to higher features. The research suggests to use the low level network features as an indicator of the change in segmentation map. In their observations they found strong correlation between low level features change and the segmentation map change. So to understand if there is a need to compute if the higher features are needed to be calculated, the lower features difference across 2 frames is found and is compared if it crosses a particular threshold. This entire process is automated by a small neural network whose task is to take lower features of two frames and to give a prediction as to whether higher features should be computed or not. Since the network decision is based on the input frames the decision taken is dynamic compared to the above approach.

Segmentation for point clouds

Data coming from a sensor such as lidar is stored in a format called Point Cloud. Point cloud is nothing but a collection of unordered set of 3d data points(or any dimension). It is a sparse representation of the scene in 3d and CNN can't be directly applied in such a case. Also any architecture designed to deal with point clouds should take into consideration that it is an unordered set and hence can have a lot of possible permutations. So the network should be permutation invariant. Also the points defined in the point cloud can be described by the distance between them. So closer points in general carry useful information which is useful for segmentation tasks


PointNet is an important paper in the history of research on point clouds using deep learning to solve the tasks of classification and segmentation.  Let's study the architecture of Pointnet

Input of the network for n points is an n x 3 matrix. n x 3 matrix is mapped to n x 64 using a shared multi-perceptron layer(fully connected network) which is then mapped to n x 64 and then to n x 128 and n x 1024. Max pooling is applied to get a 1024 vector which is converted to k outputs by passing through MLP's with sizes 512, 256 and k. Finally k class outputs are produced similar to any classification network.

Classification deals only with the global features but segmentation needs local features as well. So the local features from intermediate layer at n x 64 is concatenated with global features to get a n x 1088 matrix which is sent through mlp of 512 and 256 to get to n x 256 and then though MLP's of 128 and m to give m output classes for every point in point cloud.

Also the network involves an input transform and feature transform as part of the network whose task is to not change the shape of input but add invariance to affine transformations i.e translation, rotation etc.


A-CNN proposes the usage of Annular convolutions to capture spatial information. We know from CNN that convolution operations capture the local information which is essential to get an understanding of the image. A-CNN devised a new convolution called Annular convolution which is applied to neighbourhood points in a point-cloud.

The architecture takes as input n x 3 points and finds normals for them which is used for ordering of points. A subsample of points is taken using the FPS algorithm resulting in ni x 3 points. On these annular convolution is applied to increase to 128 dimensions. Annular convolution is performed on the neighbourhood points which are determined using a KNN algorithm.

Another set of the above operations are performed to increase the dimensions to 256. Then an mlp is applied to change the dimensions to 1024 and pooling is applied to get a 1024 global vector similar to point-cloud. This entire part is considered the encoder.  For classification the encoder global output is passed through mlp to get c class outputs. For segmentation task both the global and local features are considered similar to PointCNN and is then passed through an MLP to get m class outputs for each point.


Let's discuss the metrics which are generally used to understand and evaluate the results of a model.

Pixel Accuracy

Pixel accuracy is the most basic metric which can be used to validate the results. Accuracy is obtained by taking the ratio of correctly classified pixels w.r.t total pixels

Accuracy = (TP+TN)/(TP+TN+FP+FN)

The main disadvantage of using such a technique is the result might look good if one class overpowers the other. Say for example the background class covers 90% of the input image we can get an accuracy of 90% by just classifying every pixel as background

Intersection Over Union

IOU is defined as the ratio of intersection of ground truth and predicted segmentation outputs over their union. If we are calculating for multiple classes, IOU of each class is calculated and their mean is taken. It is a better metric compared to pixel accuracy as if every pixel is given as background in a 2 class input the IOU value is (90/100+0/100)/2 i.e 45% IOU which gives a better representation as compared to 90% accuracy.

Frequency weighted IOU

This is an extension over mean IOU which we discussed and is used to combat class imbalance. If one class dominates most part of the images in a dataset like for example background, it needs to be weighed down compared to other classes. Thus instead of taking the mean of all the class results, a weighted mean is taken based on the frequency of the class region in the dataset.

F1 Score

The metric popularly used in classification F1 Score can be used for segmentation task as well to deal with class imbalance.

Average Precision

Area under the Precision - Recall curve for a chosen threshold IOU average over different classes is used for validating the results.

Loss functions

Loss function is used to guide the neural network towards optimization. Let's discuss a few popular loss functions for semantic segmentation task.

Cross Entropy Loss

Simple average of cross-entropy classification loss for every pixel in the image can be used as an overall function. But this again suffers due to class imbalance which FCN proposes to rectify using class weights

UNet tries to improve on this by giving more weight-age to the pixels near the border which are part of the boundary as compared to inner pixels as this makes the network focus more on identifying borders and not give a coarse output.

Focal Loss

Focal loss was designed to make the network focus on hard examples by giving more weight-age and also to deal with extreme class imbalance observed in single-stage object detectors. The same can be applied in semantic segmentation tasks as well

Dice Loss

Dice function is nothing but F1 score. This loss function directly tries to optimize F1 score. Similarly direct IOU score can be used to run optimization as well

Tversky Loss

It is a variant of Dice loss which gives different weight-age to FN and FP

Hausdorff distance

It is a technique used to measure similarity between boundaries of ground truth and predicted. It is calculated by finding out the max distance from any point in one boundary to the closest point in the other. Reducing directly the boundary loss function is a recent trend and has been shown to give better results especially in use-cases like medical image segmentation where identifying the exact boundary plays a key role.

The advantage of using a boundary loss as compared to a region based loss like IOU or Dice Loss is it is unaffected by class imbalance since the entire region is not considered for optimization, only the boundary is considered.

The two terms considered here are for two boundaries i.e the ground truth and the output prediction.

Annotation tools

LabelMe :-

Image annotation tool written in python.
Supports polygon annotation.
Open Source and free.
Runs on Windows, Mac, Ubuntu or via Anaconda, Docker
Link :-

Computer Vision Annotation Tool :-

Video and image annotation tool developed by Intel
Free and available online
Runs on Windows, Mac and Ubuntu
Link :-

Vgg image annotator :-

Free open source image annotation tool
Simple html page < 200kb and can run offline
Supports polygon annotation and points.
Link :-

Rectlabel :-

Paid annotation tool for Mac
Can use core ML models to pre-annotate the images
Supports polygons, cubic-bezier, lines, and points
Link :-

Labelbox :-

Paid annotation tool
Supports pen tool for faster and accurate annotation
Link :-


As part of this section let's discuss various popular and diverse datasets available in the public which one can use to get started with training.

Pascal Context

This dataset is an extension of Pascal VOC 2010 dataset and goes beyond the original dataset by providing annotations for the whole scene and has 400+ classes of real-world data.

Link :-

COCO Dataset

The COCO stuff dataset has 164k images of the original COCO dataset with pixel level annotations and is a common benchmark dataset. It covers 172 classes: 80 thing classes, 91 stuff classes and 1 class 'unlabeled'

Link :-

Cityscapes Dataset

This dataset consists of segmentation ground truths for roads, lanes, vehicles and objects on road. The dataset contains 30 classes and of 50 cities collected over different environmental and weather conditions. Has also a video dataset of finely annotated images which can be used for video segmentation. KITTI and CamVid are similar kinds of datasets which can be used for training self-driving cars.

Link :-

Lits Dataset

The dataset was created as part of a challenge to identify tumor lesions from liver CT scans. The dataset contains 130 CT scans of training data and 70 CT scans of testing data.

Link :-

CCP Dataset

Cloth Co-Parsing is a dataset which is created as part of research paper Clothing Co-Parsing by Joint Image Segmentation and Labeling . The dataset contains 1000+ images with pixel level annotations for a total of 59 tags.

Source :-

Pratheepan Dataset

A dataset created for the task of skin segmentation based on images from google containing 32 face photos and 46 family photos

Link :-

Inria Aerial Image Labeling

A dataset of aerial segmentation maps created from public domain images. Has a coverage of 810 sq km and has 2 classes building and not-building.

Link :-


This dataset contains the point clouds of six large scale indoor parts in 3 buildings with over 70000 images.

Link :-


We have discussed a taxonomy of different algorithms which can be used for solving the use-case of semantic segmentation be it on images, videos or point-clouds and also their contributions and limitations. We also looked through the ways to evaluate the results and the datasets to get started on. This should give a comprehensive understanding on semantic segmentation as a topic in general.

To get a list of more resources for semantic segmentation, get started with

Further Reading

You might be interested in our latest posts on:

Added further reading material.
Fully Convolutional Networks for Image Segmentation - SciPy 2017 - Daniil Pakhomov

How to do Semantic Segmentation using Deep learning

This article is a comprehensive overview including a step-by-step guide to implement a deep learning image segmentation model.

We shared a new updated blog on Semantic Segmentation here: A 2021 guide to Semantic Segmentation

Nowadays, semantic segmentation is one of the key problems in the field of computer vision. Looking at the big picture, semantic segmentation is one of the high-level task that paves the way towards complete scene understanding. The importance of scene understanding as a core computer vision problem is highlighted by the fact that an increasing number of applications nourish from inferring knowledge from imagery. Some of those applications include self-driving vehicles, human-computer interaction, virtual reality etc. With the popularity of deep learning in recent years, many semantic segmentation problems are being tackled using deep architectures, most often Convolutional Neural Nets, which surpass other approaches by a large margin in terms of accuracy and efficiency.

What is Semantic Segmentation?

Semantic segmentation is a natural step in the progression from coarse to fine inference:The origin could be located at classification, which consists of making a prediction for a whole input.The next step is localization / detection, which provide not only the classes but also additional information regarding the spatial location of those classes.Finally, semantic segmentation achieves fine-grained inference by making dense predictions inferring labels for every pixel, so that each pixel is labeled with the class of its enclosing object ore region.

example of semantic segmentation in street view

It is also worthy to review some standard deep networks that have made significant contributions to the field of computer vision, as they are often used as the basis of semantic segmentation systems:

  • AlexNet: Toronto’s pioneering deep CNN that won the 2012 ImageNet competition with a test accuracy of 84.6%. It consists of 5 convolutional layers, max-pooling ones, ReLUs as non-linearities, 3 fully-convolutional layers, and dropout.
  • VGG-16: This Oxford’s model won the 2013 ImageNet competition with 92.7% accuracy. It uses a stack of convolution layers with small receptive fields in the first layers instead of few layers with big receptive fields.
  • GoogLeNet: This Google’s network won the 2014 ImageNet competition with accuracy of 93.3%. It is composed by 22 layers and a newly introduced building block called inception module. The module consists of a Network-in-Network layer, a pooling operation, a large-sized convolution layer, and small-sized convolution layer.
  • ResNet: This Microsoft’s model won the 2016 ImageNet competition with 96.4 % accuracy. It is well-known due to its depth (152 layers) and the introduction of residual blocks. The residual blocks address the problem of training a really deep architecture by introducing identity skip connections so that layers can copy their inputs to the next layer.
Analysis of Deep Neural Network Models

What are the existing Semantic Segmentation approaches?

A general semantic segmentation architecture can be broadly thought of as an encoder network followed by a decoder network:

  • The encoder is usually is a pre-trained classification network like VGG/ResNet followed by a decoder network.
  • The task of the decoder is to semantically project the discriminative features (lower resolution) learnt by the encoder onto the pixel space (higher resolution) to get a dense classification.

Unlike classification where the end result of the very deep network is the only important thing, semantic segmentation not only requires discrimination at pixel level but also a mechanism to project the discriminative features learnt at different stages of the encoder onto the pixel space. Different approaches employ different mechanisms as a part of the decoding mechanism. Let’s explore the 3 main approaches:

1 — Region-Based Semantic Segmentation

The region-based methods generally follow the “segmentation using recognition” pipeline, which first extracts free-form regions from an image and describes them, followed by region-based classification. At test time, the region-based predictions are transformed to pixel predictions, usually by labeling a pixel according to the highest scoring region that contains it.

R-CNN architecture - general framework

R-CNN(Regions with CNN feature) is one representative work for the region-based methods. It performs the semantic segmentation based on the object detection results. To be specific, R-CNN first utilizes selective search to extract a large quantity of object proposals and then computes CNN features for each of them. Finally, it classifies each region using the class-specific linear SVMs. Compared with traditional CNN structures which are mainly intended for image classification, R-CNN can address more complicated tasks, such as object detection and image segmentation, and it even becomes one important basis for both fields. Moreover, R-CNN can be built on top of any CNN benchmark structures, such as AlexNet, VGG, GoogLeNet, and ResNet.

For the image segmentation task, R-CNN extracted 2 types of features for each region: full region feature and foreground feature, and found that it could lead to better performance when concatenating them together as the region feature. R-CNN achieved significant performance improvements due to using the highly discriminative CNN features. However, it also suffers from a couple of drawbacks for the segmentation task:

  • The feature is not compatible with the segmentation task.
  • The feature does not contain enough spatial information for precise boundary generation.
  • Generating segment-based proposals takes time and would greatly affect the final performance.

Due to these bottlenecks, recent research has been proposed to address the problems, including SDS, Hypercolumns, Mask R-CNN.

2 — Fully Convolutional Network-Based Semantic Segmentation

The original Fully Convolutional Network (FCN) learns a mapping from pixels to pixels, without extracting the region proposals. The FCN network pipeline is an extension of the classical CNN. The main idea is to make the classical CNN take as input arbitrary-sized images. The restriction of CNNs to accept and produce labels only for specific sized inputs comes from the fully-connected layers which are fixed. Contrary to them, FCNs only have convolutional and pooling layers which give them the ability to make predictions on arbitrary-sized inputs.

Fully convolutional Network (FCN) Architecture

One issue in this specific FCN is that by propagating through several alternated convolutional and pooling layers, the resolution of the output feature maps is down sampled. Therefore, the direct predictions of FCN are typically in low resolution, resulting in relatively fuzzy object boundaries. A variety of more advanced FCN-based approaches have been proposed to address this issue, including SegNet, DeepLab-CRF, and Dilated Convolutions.

3 — Weakly Supervised Semantic Segmentation

Most of the relevant methods in semantic segmentation rely on a large number of images with pixel-wise segmentation masks. However, manually annotating these masks is quite time-consuming, frustrating and commercially expensive. Therefore, some weakly supervised methods have recently been proposed, which are dedicated to fulfilling the semantic segmentation by utilizing annotated bounding boxes.

semantic segmentation

For example, Boxsup employed the bounding box annotations as a supervision to train the network and iteratively improve the estimated masks for semantic segmentation. Simple Does It treated the weak supervision limitation as an issue of input label noise and explored recursive training as a de-noising strategy. Pixel-level Labeling interpreted the segmentation task within the multiple-instance learning framework and added an extra layer to constrain the model to assign more weight to important pixels for image-level classification.

Doing Semantic Segmentation with Fully-Convolutional Network

In this section, let’s walk through a step-by-step implementation of the most popular architecture for semantic segmentation — the Fully-Convolutional Net (FCN). We’ll implement it using the TensorFlow library in Python 3, along with other dependencies such as Numpy and Scipy.In this exercise we will label the pixels of a road in images using FCN. We’ll work with the Kitti Road Dataset for road/lane detection. This is a simple exercise from the Udacity’s Self-Driving Car Nano-degree program, which you can learn more about the setup in this GitHub repo.

Kitti road dataset for semantic segmentation

Here are the key features of the FCN architecture:

  • FCN transfers knowledge from VGG16 to perform semantic segmentation.
  • The fully connected layers of VGG16 is converted to fully convolutional layers, using 1x1 convolution. This process produces a class presence heat map in low resolution.
  • The upsampling of these low resolution semantic feature maps is done using transposed convolutions (initialized with bilinear interpolation filters).
  • At each stage, the upsampling process is further refined by adding features from coarser but higher resolution feature maps from lower layers in VGG16.
  • Skip connection is introduced after each convolution block to enable the subsequent block to extract more abstract, class-salient features from the previously pooled features.

There are 3 versions of FCN (FCN-32, FCN-16, FCN-8). We’ll implement FCN-8, as detailed step-by-step below:

  • Encoder: A pre-trained VGG16 is used as an encoder. The decoder starts from Layer 7 of VGG16.
  • FCN Layer-8: The last fully connected layer of VGG16 is replaced by a 1x1 convolution.
  • FCN Layer-9: FCN Layer-8 is upsampled 2 times to match dimensions with Layer 4 of VGG 16, using transposed convolution with parameters: (kernel=(4,4), stride=(2,2), paddding=’same’). After that, a skip connection was added between Layer 4 of VGG16 and FCN Layer-9.
  • FCN Layer-10: FCN Layer-9 is upsampled 2 times to match dimensions with Layer 3 of VGG16, using transposed convolution with parameters: (kernel=(4,4), stride=(2,2), paddding=’same’). After that, a skip connection was added between Layer 3 of VGG 16 and FCN Layer-10.
  • FCN Layer-11: FCN Layer-10 is upsampled 4 times to match dimensions with input image size so we get the actual image back and depth is equal to number of classes, using transposed convolution with parameters:(kernel=(16,16), stride=(8,8), paddding=’same’).
FCN-8 Architecture

Step 1

We first load the pre-trained VGG-16 model into TensorFlow. Taking in the TensorFlow session and the path to the VGG Folder (which is downloadable here), we return the tuple of tensors from VGG model, including the image input, keep_prob (to control dropout rate), layer 3, layer 4, and layer 7.

VGG16 function

Step 2

Now we focus on creating the layers for a FCN, using the tensors from the VGG model. Given the tensors for VGG layer output and the number of classes to classify, we return the tensor for the last layer of that output. In particular, we apply a 1x1 convolution to the encoder layers, and then add decoder layers to the network with skip connections and upsampling.

Layers function

Step 3

The next step is to optimize our neural network, aka building TensorFlow loss functions and optimizer operations. Here we use cross entropy as our loss function and Adam as our optimization algorithm.

Optimize function

Step 4

Here we define the train_nn function, which takes in important parameters including number of epochs, batch size, loss function, optimizer operation, and placeholders for input images, label images, learning rate. For the training process, we also set keep_probability to 0.5 and learning_rate to 0.001. To keep track of the progress, we also print out the loss during training.

Step 5

Finally, it’s time to train our net! In this run function, we first build our net using the load_vgg, layers, and optimize function. Then we train the net using the train_nn function and save the inference data for records.

Run function

About our parameters, we choose epochs = 40, batch_size = 16, num_classes = 2, and image_shape = (160, 576). After doing 2 trial passes with dropout = 0.5 and dropout = 0.75, we found that the 2nd trial yields better results with better average losses.

semantic segmentation training sample results

To see the full code, check out this link:

If you enjoyed this piece, I’d love it share it  👏 and spread the knowledge.

You might be interested in our latest posts on:


Similar news:


981 982 983 984 985