**What is a Convolutional Neural Network?**

This question has been answered a million times, almost everywhere on the internet. I myself have answered it a hundred times (OK, a few times :)), and guess what: the answer is pretty simple. A Convolutional Neural Network is **“a class of deep, feed-forward artificial neural networks”**!

What? It’s this simple? Am I joking?

No, I am not.

**But the question again – What is so special about them?**

Well, there are many characteristics that make them special, and nobody can give a complete answer. But since you have asked for it, I shall give you a few. :)

- Convolutional networks have a property called ‘**Spatial Invariance**’, meaning they learn to recognize image features anywhere in the image. Pooling adds a degree of invariance to small translations (and, to a lesser extent, to small rotations and changes in scale). This will be explained in the later sections of the blog.

  Spatial invariance can be understood through a simple example. Imagine you have a face detector for dogs. By identifying characteristics like a dog’s ears, nose, and mouth you will be able to identify the face. Your dog face detector goes off whenever there is a dog’s ear, nose, and mouth in the region of the detector. It doesn’t matter where you see a feature, as long as you see it at some level.
- ‘**Parameter Sharing**’: all neurons in a particular feature map share the same weights. This gives the network the ability to look for a given feature everywhere in the image, rather than in just one particular area.
- ‘**Local connectivity**’: each neuron is connected only to a subset of the input image (unlike a fully connected neural network, where every neuron is connected to every input). In a convolutional network, each neuron only receives input from a small local group of pixels in the input image, so all of the inputs that go into a given neuron are close to each other. This builds an assumption of locality into the network. As they say, a stronger assumption gives better results, as long as the assumption actually holds, and for images it usually does.
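To make the "same feature, anywhere in the image" idea concrete, here is a minimal NumPy sketch. The `convolve2d` helper, the 6×6 toy image and the 2×2 all-ones filter are all assumptions made up for illustration. Because the same weights are applied at every position, moving the pattern in the input simply moves the peak in the output.

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide the kernel over the image (no padding, stride 1)."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def peak_location(feature_map):
    """Row and column of the strongest response."""
    idx = np.unravel_index(feature_map.argmax(), feature_map.shape)
    return tuple(int(v) for v in idx)

# Toy image: a 2x2 bright blob near the top-left corner.
image = np.zeros((6, 6))
image[1:3, 1:3] = 1.0

# The same blob moved two pixels down and two pixels right.
shifted = np.roll(image, shift=(2, 2), axis=(0, 1))

# One shared filter that responds to 2x2 bright blobs.
kernel = np.ones((2, 2))

peak_original = peak_location(convolve2d(image, kernel))
peak_shifted = peak_location(convolve2d(shifted, kernel))
print(peak_original, peak_shifted)  # (1, 1) (3, 3)
```

The filter never had to be "told" about the new location; the shared weights find the blob wherever it is.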

OK, we have seen enough of its characteristics. Now let’s see how it works.

**How does it work?**

**Let’s understand the concept with an example.**

Consider an image identification problem using a Convolutional Neural Network. Suppose there is an image represented through pixel values (every image can be considered a matrix of pixel values). In this case, suppose our image can be represented by a 5×5 matrix.

The next thing you do is take a 3×3 filter (which is another matrix), slide it over the complete image, and along the way take the dot product between the filter and each chunk of the input image (multiply the overlapping values element by element and add them up).

Here one thing worth noting is that the 3×3 matrix “sees” only a part of the input image at each step (what is a stride? We will see it later). In Convolutional Neural Network terminology, this 3×3 matrix is called a ‘**filter**’ or ‘kernel’, and the matrix formed by sliding the filter over the image and computing the dot product is called the ‘**Convolved Feature**’ or the ‘**Feature Map**’.
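The sliding dot product described above can be sketched in a few lines of NumPy. The pixel values and filter values below are made up for illustration, and `convolve2d` is a hypothetical helper, not a library function. Strictly speaking this computes a cross-correlation (the filter is not flipped), which is what deep learning libraries generally mean by "convolution".

```python
import numpy as np

def convolve2d(image, kernel, stride=1):
    """Slide `kernel` over `image` (no padding) and return the feature map."""
    ih, iw = image.shape
    kh, kw = kernel.shape
    oh = (ih - kh) // stride + 1
    ow = (iw - kw) // stride + 1
    feature_map = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # The part of the image the filter currently "sees".
            patch = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
            # Element-wise multiply and sum: the dot product of patch and filter.
            feature_map[i, j] = np.sum(patch * kernel)
    return feature_map

# A 5x5 "image" of pixel values (illustrative).
image = np.array([
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
])

# A 3x3 filter (illustrative).
kernel = np.array([
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
])

feature_map = convolve2d(image, kernel)
print(feature_map)
# [[4. 3. 4.]
#  [2. 4. 3.]
#  [2. 3. 4.]]
```

A 5×5 image and a 3×3 filter with stride 1 give a 3×3 feature map, because the filter fits in three positions along each axis.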

It is important to note that filters act as feature detectors on the original input image. Different values in the filter matrix will produce different feature maps for the same input image. For example, depending on the operation you want to perform on an image, you use a different filter (i.e. the values inside the 3×3 matrix change): for edge detection you will see negative values in the filter matrix, whereas for a blur operation all the values are positive.
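Here is a small sketch of that difference, assuming a toy 6×6 image with a single vertical edge. The two filters are common textbook choices (a Prewitt-style vertical edge detector and a box blur) picked for illustration; they are not taken from the figure credited below.

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide the kernel over the image (no padding, stride 1)."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Toy image: dark on the left (0), bright on the right (10).
image = np.array([[0, 0, 0, 10, 10, 10]] * 6, dtype=float)

# Edge detection: negative and positive values, summing to zero.
edge_kernel = np.array([
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
], dtype=float)

# Blur: all values positive, here a simple average of the 3x3 neighbourhood.
blur_kernel = np.ones((3, 3)) / 9.0

edges = convolve2d(image, edge_kernel)
blurred = convolve2d(image, blur_kernel)

print(edges[0])    # [ 0. 30. 30.  0.]  strong response only around the edge
print(blurred[0])  # a smooth ramp from 0 up to 10
```

The edge filter is silent on flat regions and fires where the brightness changes; the blur filter smears the sharp step into a gradual ramp.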

**https://ujjwalkarn.me/2016/08/11/intuitive-explanation-convnets/**

A CNN *learns* the values of these filters on its own during the training process. However, we need to specify a few things beforehand, such as the number of filters, the filter size etc. In general, the more filters we use, the more image features get extracted and the better our network tends to become at recognizing patterns in unseen images, at the cost of more computation.

A few parameters for feature extraction that you need to understand are:

**Depth:** Depth corresponds to the number of filters we use for the convolution operation. Each filter produces its own feature map, so different filters are used to pick up different features.

**Stride:** Stride is the number of pixels by which we slide our filter matrix over the input matrix. A stride of 1 means we move the filter one pixel at a time.

**Zero-padding:** The input matrix is padded with zeros around the border, so that we can apply the filter to the bordering elements of the input image matrix. It ensures that the filter covers the entire area of the input image.
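Stride and zero-padding together decide how big the feature map will be. A minimal sketch, using the standard output-size formula `(input - filter + 2 * padding) // stride + 1`; the helper name `conv_output_size` is made up for this example, and `np.pad` is used only to show what zero-padding does to the input.

```python
import numpy as np

def conv_output_size(input_size, filter_size, stride=1, padding=0):
    """Output width (or height) of a convolution along one dimension."""
    return (input_size - filter_size + 2 * padding) // stride + 1

print(conv_output_size(5, 3))                       # 3: no padding, stride 1
print(conv_output_size(5, 3, stride=1, padding=1))  # 5: padding keeps the size
print(conv_output_size(5, 3, stride=2, padding=1))  # 3: a larger stride shrinks it

# Zero-padding a 5x5 input with a border of one pixel gives a 7x7 matrix.
image = np.ones((5, 5))
padded = np.pad(image, pad_width=1, mode="constant", constant_values=0)
print(padded.shape)  # (7, 7)
```

With a 3×3 filter, a padding of 1 and a stride of 1, the output has the same size as the input, which is why this combination is so common.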

**Dealing with non-linearity:** After each convolutional layer, it is a general convention to apply a non-linear layer (or **activation layer**) immediately afterward. The reason is that convolution is a linear operation (element-wise multiplications and summations), whereas most real-life problems are not linear, so we account for this by introducing a non-linear function. In most cases people prefer the ReLU (Rectified Linear Unit) function. Tanh and sigmoid are other popular non-linear functions.
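The three activation functions mentioned above are one-liners in NumPy. The feature map values below are made up for illustration.

```python
import numpy as np

def relu(x):
    """Rectified Linear Unit: negative values become 0, positive values pass through."""
    return np.maximum(0, x)

def sigmoid(x):
    """Squashes any value into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

# A small feature map with some negative responses (illustrative values).
feature_map = np.array([
    [-3.0, 0.0, 2.0],
    [1.0, -1.0, 4.0],
])

activated = relu(feature_map)
print(activated)
# [[0. 0. 2.]
#  [1. 0. 4.]]

print(sigmoid(feature_map))  # every value between 0 and 1
print(np.tanh(feature_map))  # every value between -1 and 1
```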

**Pooling Layers:** A pooling layer is also referred to as a down-sampling layer. It is used to reduce the spatial size of the representation, so that the number of parameters and the amount of computation in the network can be reduced. A pooling layer operates on each feature map independently.

The most common type of pooling is *max pooling*. This basically takes a filter (normally of size 2×2) and a stride of the same length. It then applies the filter to the input volume and outputs the maximum number in every sub-region that the filter slides over.
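A minimal sketch of 2×2 max pooling with stride 2, assuming a made-up 4×4 feature map. `max_pool2d` is a hypothetical helper written for this example.

```python
import numpy as np

def max_pool2d(feature_map, size=2, stride=2):
    """Keep the maximum of every size x size sub-region."""
    h, w = feature_map.shape
    oh = (h - size) // stride + 1
    ow = (w - size) // stride + 1
    pooled = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            region = feature_map[i * stride:i * stride + size,
                                 j * stride:j * stride + size]
            pooled[i, j] = region.max()
    return pooled

feature_map = np.array([
    [1, 1, 2, 4],
    [5, 6, 7, 8],
    [3, 2, 1, 0],
    [1, 2, 3, 4],
])

pooled = max_pool2d(feature_map)
print(pooled)
# [[6. 8.]
#  [3. 4.]]
```

The 4×4 map shrinks to 2×2: a quarter of the values, while the strongest response in each region survives.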

**Fully Connected Network:** Convolution, ReLU and pooling work together for feature extraction. These layers are the basic building blocks of any Convolutional Neural Network. But the idea behind extracting these features is to classify the image (or to serve some other purpose, for that matter), and that is why we need a fully connected neural network. In other words, each neuron in a fully connected layer takes a weighted sum over the entire input to that layer.

The fully connected layer is a traditional Multi-Layer Perceptron that uses a softmax activation function (or another such function, e.g. sigmoid) in the output layer. The term “Fully Connected” implies that every neuron in the previous layer is connected to every neuron in the next layer. Another way to understand it: the convolutional layers extract high-level features, and the fully connected layers learn a non-linear function of these features.
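As a rough sketch, a single fully connected output layer with softmax looks like this. The feature vector, the weights and the number of classes (3) are random placeholders invented for illustration; in a real network the weights would be learned.

```python
import numpy as np

def softmax(z):
    """Turn raw scores into probabilities that sum to 1."""
    e = np.exp(z - np.max(z))  # subtracting the max avoids overflow
    return e / e.sum()

rng = np.random.default_rng(0)

# Flattened features coming out of the conv/pool layers (hypothetical values).
features = rng.standard_normal(8)

# Fully connected: every one of the 3 output neurons sees all 8 inputs.
weights = rng.standard_normal((3, 8)) * 0.1
bias = np.zeros(3)

scores = weights @ features + bias  # one weighted sum per output neuron
probs = softmax(scores)

print(probs)        # one probability per class
print(probs.sum())  # the probabilities add up to 1 (up to rounding)
```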

Finally, backpropagation is used to calculate the *gradients* of the error with respect to all weights in the network, and *gradient descent* is used to update all filter weights and parameter values to minimize the output error.
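To see gradient descent act on a filter, here is a deliberately tiny sketch: one 3×3 filter, a mean squared error, and a hand-derived gradient. Everything here (the random 8×8 image, the "true" edge filter that generates the target, the learning rate of 0.1, the 200 steps) is an assumption chosen for illustration. A real CNN would backpropagate through many layers, normally with the help of a framework.

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide the kernel over the image (no padding, stride 1)."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

rng = np.random.default_rng(0)
image = rng.random((8, 8))  # pixel values in [0, 1)

# The target feature map is produced by a filter the network does not know.
true_kernel = np.array([[-1.0, 0.0, 1.0]] * 3)
target = convolve2d(image, true_kernel)

kernel = np.zeros((3, 3))  # start from an "untrained" filter
learning_rate = 0.1
losses = []

for step in range(200):
    # Forward pass: feature map and error for the current filter.
    error = convolve2d(image, kernel) - target
    losses.append(0.5 * np.mean(error ** 2))

    # Backward pass: the gradient of the loss with respect to kernel[a, b] is
    # the mean of (error * the input pixel that weight touched) over all positions.
    oh, ow = error.shape
    grad = np.zeros_like(kernel)
    for a in range(3):
        for b in range(3):
            grad[a, b] = np.mean(error * image[a:a + oh, b:b + ow])

    # Gradient descent update.
    kernel -= learning_rate * grad

print(losses[0], losses[-1])  # the error shrinks as the filter is updated
```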

LeNet, AlexNet and VGGNet are some famous Convolutional Neural Networks. I would advise you to explore them.

**https://www.quora.com/What-is-a-convolutional-neural-network**

**Why are fully connected layers used at the “very end” of convolutional NNs? Why not earlier?**

Once a fully connected layer is applied, the output is “scrambled,” i.e. it has no spatial structure. The idea of a convolutional layer is to find local structure across every part of the input. A convolutional layer can be applied only when there is some spatial or quasi-spatial structure in the input, so it does not make sense to apply a fully connected layer first and a convolutional layer later, does it?

So the general order followed in a Convolutional Neural Network is: convolutional layer, then non-linear unit, then max pooling (or any other pooling layer), and finally the fully connected layer.
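Put together, that order can be sketched as a single forward pass. All sizes and weights here are random placeholders (a 6×6 input, one 3×3 filter, 2 output classes), so the predicted probabilities mean nothing; the point is only the sequence of steps and how the shape changes along the way.

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide the kernel over the image (no padding, stride 1)."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def relu(x):
    return np.maximum(0, x)

def max_pool2d(x, size=2):
    """Non-overlapping max pooling (stride equal to the window size)."""
    h, w = x.shape[0] // size, x.shape[1] // size
    return x[:h * size, :w * size].reshape(h, size, w, size).max(axis=(1, 3))

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

rng = np.random.default_rng(0)
image = rng.random((6, 6))
kernel = rng.standard_normal((3, 3))      # would be learned in practice
fc_weights = rng.standard_normal((2, 4))  # 2 classes, 4 pooled features

x = convolve2d(image, kernel)             # 1. convolution     -> 4x4
x = relu(x)                               # 2. non-linearity   -> 4x4
x = max_pool2d(x)                         # 3. pooling         -> 2x2
probs = softmax(fc_weights @ x.ravel())   # 4. fully connected -> 2 probabilities

print(x.shape, probs)
```

Notice that the spatial layout is only thrown away at the last step, when `ravel()` flattens the pooled map for the fully connected layer.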

**Fully Convolutional Network (FCN) – Fully connected layer in a deep neural network and an equivalent convolutional layer:**

In recent years, the idea of the fully convolutional network has emerged. The main difference from a traditional convolutional neural network is that the fully **convolutional** net learns filters everywhere; even the decision-making layers at the end of the network act as filters. A fully convolutional net tries to learn representations and make decisions based on **local** spatial input (while still capturing global context) instead of global spatial input.

In this section we will try to see the basic advantage that a Convolutional Neural Network has over a fully connected neural network. In a convolutional layer there are fewer parameters to adjust, because the weights are shared.
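A quick back-of-the-envelope count shows how much weight sharing saves. The sizes are assumptions picked for illustration: a 32×32 grayscale input, 100 fully connected neurons versus 100 filters of size 3×3.

```python
def fc_params(n_inputs, n_outputs):
    """Fully connected: one weight per (input, output) pair, plus one bias per output."""
    return n_inputs * n_outputs + n_outputs

def conv_params(filter_h, filter_w, in_channels, n_filters):
    """Convolution: each filter has its own small set of shared weights, plus one bias."""
    return (filter_h * filter_w * in_channels + 1) * n_filters

fc_total = fc_params(32 * 32, 100)
conv_total = conv_params(3, 3, 1, 100)

print(fc_total)    # 102500
print(conv_total)  # 1000
```

The convolutional layer's parameter count does not depend on the image size at all, only on the filter size and the number of filters.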

Additionally, max pooling can be used just after a convolutional layer to reduce the dimensionality of the layer. Also, if a large input image is to be processed, the whole fully connected network would have to be scanned over the large image window by window, which is not efficient. If the fully connected layers are converted to their equivalent convolution layers, a single forward pass is sufficient.
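The equivalence is easy to check on a toy case. Assume one fully connected neuron that was built for flattened 3×3 inputs (its 9 weights below are random placeholders) and a larger 5×5 image. Scanning the neuron over every 3×3 window gives exactly the same numbers as reshaping its weights into a 3×3 filter and convolving once.

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide the kernel over the image (no padding, stride 1)."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

rng = np.random.default_rng(0)
fc_weights = rng.standard_normal(9)  # one FC neuron expecting 9 inputs (hypothetical)
large_image = rng.random((5, 5))     # bigger than the 3x3 input the neuron expects

# Option 1: crop every 3x3 window, flatten it, and feed it to the FC neuron.
scanned = np.array([
    [fc_weights @ large_image[i:i + 3, j:j + 3].ravel() for j in range(3)]
    for i in range(3)
])

# Option 2: view the same 9 weights as a 3x3 filter and convolve once.
kernel = fc_weights.reshape(3, 3)
convolved = convolve2d(large_image, kernel)

print(np.allclose(scanned, convolved))  # True
```

In this toy NumPy version both options do the same arithmetic. The practical gain shows up in a deep network, where the convolutional form lets the earlier layers be computed once for the whole image instead of once per window.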

All these things lead to an efficient and robust model. This is the reason why the fully convolutional network is becoming more popular day by day.

**DeepFix: a fully convolutional neural network for predicting human fixations (UPC Reading Group)** from **Universitat Politècnica de Catalunya**

**Sparsely-Connected Neural Networks:**

Deep Neural Networks and Convolutional Neural Networks have received considerable attention in the data science community due to their ability to extract and represent high-level abstractions in data sets. They can be applied to a wide range of recognition and classification tasks, but at the cost of a large number of parameters and high computational complexity. The high power consumption that comes with this complexity has always been a matter of concern.

The power/energy consumption of neural networks is dominated by memory accesses, the majority of which occur in fully-connected networks. By using sparsely-connected networks, the number of connections in fully-connected networks can be reduced by up to 90% while retaining accuracy and performance.

A surprisingly effective approach is to simply reduce the number of channels in each convolutional layer by a fixed fraction and retrain the network. In many cases this leads to significantly smaller networks with only minimal changes in accuracy. Maximum sparsity is obtained by exploiting both inter-channel and intra-channel redundancy, with a fine-tuning step that minimizes the recognition loss caused by maximizing sparsity.
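What "removing 90% of the connections" means for a fully connected layer can be sketched with a fixed binary mask over the weight matrix. This is only an illustration of the idea: the random mask, the layer sizes (100 inputs, 50 outputs) and the weights are all assumptions, and this is not the specific method of any paper mentioned here. In practice the network is trained or fine-tuned with the mask in place so that accuracy is retained.

```python
import numpy as np

rng = np.random.default_rng(0)

n_inputs, n_outputs = 100, 50
weights = rng.standard_normal((n_outputs, n_inputs))  # dense layer: 5000 connections

# Keep a random 10% of the connections and switch the other 90% off.
keep_fraction = 0.1
n_keep = int(round(weights.size * keep_fraction))
mask = np.zeros(weights.size, dtype=bool)
mask[rng.choice(weights.size, size=n_keep, replace=False)] = True
mask = mask.reshape(weights.shape)

sparse_weights = weights * mask  # masked-out connections contribute nothing

x = rng.standard_normal(n_inputs)
y = sparse_weights @ x           # forward pass through the sparse layer

print(mask.sum(), "of", weights.size, "connections kept")  # 500 of 5000 connections kept
```

Only the kept weights would need to be stored and fetched from memory, which is where the energy saving comes from.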

**https://www.slideshare.net/JeremyNixon/understanding-convolutional-neural-networks**

I think we have covered almost everything that is needed for a basic understanding of Convolutional Neural Networks. This should be good enough to take the first step towards building our own Convolutional Neural Network model. Still, if you have trouble understanding a concept, feel free to drop your question in the comment section below and we will try our best to help. :)

This is it for today. Soon we will be back with the next one in the series, **“Implementation of Convolutional Neural Network”**. Till then, adios.