What are Fully Convolutional Networks?

FCNs, or Fully Convolutional Networks, are an architecture used primarily for semantic segmentation. They employ only locally connected layers: convolution, pooling, and upsampling. Because no dense (fully connected) layers are used, the networks have fewer parameters and are faster to train.
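The parameter savings from dropping dense layers can be made concrete with a little arithmetic. The sizes below (a 64×64 single-channel input, 16 filters of size 3×3) are illustrative assumptions, not values from the text:

```python
# Parameter count: a dense layer vs. a convolutional layer on the same
# (hypothetical) 64x64 single-channel input.

h, w = 64, 64                   # input spatial size (assumed)
dense_units = h * w             # a dense layer producing one output per pixel
dense_params = (h * w) * dense_units + dense_units  # weights + biases

k, in_ch, out_ch = 3, 1, 16     # a 3x3 conv layer with 16 filters (assumed)
conv_params = (k * k * in_ch) * out_ch + out_ch     # weights + biases

print(dense_params)  # 16781312
print(conv_params)   # 160
```

Because a convolution's parameters depend only on the filter size and channel counts, not on the spatial size of the input, the count stays tiny no matter how large the image is.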

  • CONVOLUTION: The first layer, which extracts features from an input image. A convolution slides a learnable filter (kernel) over the image, taking the element-wise product between the filter and each image patch and summing the result. Different filters extract different features: for example, filter A might capture vertical lines, while filter B captures horizontal lines. The features become more and more complex as we go deeper into a convolutional net, giving the network a hierarchical, holistic understanding of the image, very similar to how a human would process one.
  • POOLING: Convolution is followed by pooling, which reduces the resolution of the convolved features further, lowering the computational cost. Pooling also suppresses noise and keeps only the dominant features, which are approximately invariant to small shifts and rotations.
  • UNPOOLING: After the resolution has been reduced by convolution and pooling, the next step in semantic segmentation is to upscale the low-resolution feature maps back to the original resolution of the input image. Pooling converts a patch of values to a single value, whereas unpooling does the opposite: it converts a single value into a patch of values.
  • TRANSPOSED CONVOLUTION: Transposed convolution, sometimes incorrectly called deconvolution, can be seen as the reverse of convolution. Unlike a convolution, a transposed convolution layer upsamples a reduced-resolution feature map back to a higher resolution. The stride and padding are fixed hyperparameters; it is the kernel weights that are learned to produce the final output from the lower-resolution features. The illustration below explains the procedure in a very easy-to-understand manner.
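The convolution step above, the sliding dot product between a filter and each image patch, can be sketched in a few lines of plain Python. The image and the two edge filters are made-up examples, not from the text:

```python
# Minimal "valid" (no padding, stride 1) 2-D convolution: slide the kernel
# over the image, multiply element-wise with each patch, and sum.

def conv2d(image, kernel):
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = [[0] * out_w for _ in range(out_h)]
    for i in range(out_h):
        for j in range(out_w):
            out[i][j] = sum(
                image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh) for dj in range(kw)
            )
    return out

# A tiny image with a vertical edge down the middle (illustrative).
img = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# "Filter A": responds to vertical edges.
vertical = [[-1, 0, 1],
            [-1, 0, 1],
            [-1, 0, 1]]
# "Filter B": responds to horizontal edges.
horizontal = [[-1, -1, -1],
              [ 0,  0,  0],
              [ 1,  1,  1]]

print(conv2d(img, vertical))    # [[3, 3], [3, 3]]  -- edge detected
print(conv2d(img, horizontal))  # [[0, 0], [0, 0]]  -- no horizontal edge
```

In a real network the filter values are not hand-crafted like this; they are learned during training.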
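Max pooling, the most common pooling variant, can be sketched as collapsing each 2×2 patch to its maximum. The feature map below is an illustrative assumption:

```python
# 2x2 max pooling with stride 2: each 2x2 patch collapses to its maximum,
# halving the resolution and keeping only the dominant activation.

def max_pool_2x2(feature_map):
    out = []
    for i in range(0, len(feature_map) - 1, 2):
        row = []
        for j in range(0, len(feature_map[0]) - 1, 2):
            row.append(max(feature_map[i][j], feature_map[i][j + 1],
                           feature_map[i + 1][j], feature_map[i + 1][j + 1]))
        out.append(row)
    return out

fmap = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 1, 5, 6],
    [2, 2, 7, 8],
]
print(max_pool_2x2(fmap))  # [[4, 2], [2, 8]]
```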
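Unpooling, turning a single value back into a patch of values, can be sketched in its simplest nearest-neighbour form (other variants, such as max-unpooling, instead remember where each maximum came from):

```python
# Nearest-neighbour 2x2 unpooling: each single value expands back into a
# 2x2 patch, restoring the resolution lost by 2x2 pooling.

def unpool_2x2(pooled):
    out = []
    for row in pooled:
        expanded = []
        for v in row:
            expanded.extend([v, v])   # duplicate horizontally
        out.append(expanded)
        out.append(list(expanded))    # duplicate vertically
    return out

print(unpool_2x2([[4, 2], [2, 8]]))
# [[4, 4, 2, 2], [4, 4, 2, 2], [2, 2, 8, 8], [2, 2, 8, 8]]
```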
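Transposed convolution itself can be sketched as the scatter counterpart of convolution's gather: each input value scales a copy of the kernel, and the scaled copies are added into an upsampled output grid. The input and kernel values here are illustrative; in a trained network the kernel weights are learned:

```python
# Minimal transposed convolution. Stride and padding are fixed
# hyperparameters; only the kernel weights would be learned in training.

def conv_transpose2d(x, kernel, stride=2):
    kh, kw = len(kernel), len(kernel[0])
    out_h = (len(x) - 1) * stride + kh
    out_w = (len(x[0]) - 1) * stride + kw
    out = [[0] * out_w for _ in range(out_h)]
    for i in range(len(x)):
        for j in range(len(x[0])):
            # scatter x[i][j] * kernel into the output at the strided offset
            for di in range(kh):
                for dj in range(kw):
                    out[i * stride + di][j * stride + dj] += x[i][j] * kernel[di][dj]
    return out

x = [[1, 2],
     [3, 4]]
k = [[1, 1],
     [1, 1]]
print(conv_transpose2d(x, k, stride=2))
# [[1, 1, 2, 2], [1, 1, 2, 2], [3, 3, 4, 4], [3, 3, 4, 4]]
```

With stride 2 and a 2×2 kernel the scattered copies do not overlap, so each input value simply becomes a 2×2 patch; with a larger kernel or smaller stride, overlapping copies would sum.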