This post consists of the following two sections:
- Section 1: Basics of Neural Networks
- Section 2: Understanding Backward Propagation and Gradient Descent
For decades, researchers have been trying to deconstruct the inner workings of our incredible and fascinating brains, hoping to learn how to infuse a brain-like intelligence into machines. For example, when we were toddlers, we did not learn to recognize objects by learning their distinctive features. A child learns to call a cat a cat and a dog a dog by seeing many examples and by being corrected for wrong guesses. This extremely active, brain-inspired field of artificial intelligence is called Deep Learning, and the corresponding programming paradigm, which allows computers to learn from data, is called Artificial Neural Networks (ANNs).
Just like a biological neuron has dendrites to receive signals, a cell body to process them, and an axon to output the signals to other neurons, the artificial neuron (also referred to as a perceptron) has a number of input channels, a processing unit, and an output channel that can send signals to multiple other nodes.
Each input (x0, x1, x2) to the neuron has an associated weight (w0, w1, w2), which is assigned on the basis of its relative importance to other inputs. Each of the inputs is multiplied by its weight and then the processing unit applies a transformation function f, called the Activation Function, to the weighted sum of its inputs as shown in the figure below.
The neuron additionally takes another input, 1, with weight b (called the bias, shown in the blue circle). The activation function f, used to compute the final output Y = f(x0·w0 + x1·w1 + x2·w2 + b), needs to be a non-linear function. This is because ANNs need to learn complex relationships. We will see exactly what this means in a while.
Before moving on, let’s break down this whole process into steps and visualize it with an animation:
- Each input is multiplied by its weight.
- The weighted inputs are summed to a single number and a bias is added.
- The resulting sum is passed through a non-linear function to produce the final output.
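The three steps above can be sketched in a few lines of Python. This is only an illustration: the input values, weights, and bias are made up, and the sigmoid is just one possible choice for the non-linear function f.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron_output(inputs, weights, bias, activation=sigmoid):
    # Multiply each input by its weight, sum the results, and add the bias.
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    # Pass the weighted sum through the non-linear activation function.
    return activation(z)

# Hypothetical example values, chosen only for illustration.
x = [1.0, 2.0, 3.0]
w = [0.5, -0.25, 0.1]
b = 0.2

y = neuron_output(x, w, b)
print(y)
```

Here the weighted sum is 0.5, so the neuron outputs sigmoid(0.5), roughly 0.62.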
Why do we need Activation Functions?
For example, if we simply want to estimate the cost of a meal, we could use a neuron based on a linear function, z = wx + b. Using the identity activation f(z) = z and weights equal to the price of each item, the linear neuron would take in the number of burgers, fries, and sodas and output the price of the meal.
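As a quick sketch of that meal example, the linear neuron is nothing more than a weighted sum. The prices and the order below are invented for illustration, and `linear_neuron` is a hypothetical helper name.

```python
def linear_neuron(quantities, prices, bias=0.0):
    # Identity activation f(z) = z: the output is just the weighted sum.
    return sum(q * p for q, p in zip(quantities, prices)) + bias

# Hypothetical prices (the weights) for a burger, fries, and a soda.
prices = [5.0, 2.5, 1.5]
# Hypothetical order: 2 burgers, 1 fries, 3 sodas.
order = [2, 1, 3]

meal_cost = linear_neuron(order, prices)
print(meal_cost)  # 17.0
```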
The above kind of linear neuron is very cheap to compute. However, with just an interconnected set of linear neurons, we won’t be able to learn complex relationships: combining multiple linear functions still gives us a linear function, so we could represent the whole network by a single matrix transformation.
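To see why stacking linear layers buys us nothing, here is a small sketch with two linear layers and no activation function. The weight matrices are made-up values and the biases are left out to keep it short; `matvec` and `matmul` are hypothetical helpers written for this example.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Multiply matrix A by matrix B (both given as lists of rows)."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

# Hypothetical weights for two linear layers (no activation, no bias).
W1 = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]]   # 2 inputs -> 3 hidden nodes
W2 = [[2.0, 1.0, 0.0], [-1.0, 0.0, 1.0]]     # 3 hidden nodes -> 2 outputs

x = [4.0, -2.0]

# Pass x through both layers, one after the other.
two_layers = matvec(W2, matvec(W1, x))

# Get the same result from a single combined matrix W = W2 * W1.
W = matmul(W2, W1)
one_layer = matvec(W, x)

print(two_layers, one_layer)
```

Both paths give the same output, so the two-layer linear network is exactly as expressive as a single linear layer.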
Real-world data is mostly non-linear, and this is where activation functions come to help. Activation functions introduce non-linearity into the output of a neuron, which lets the network learn more complex functions by approximating a non-linear function as a combination of a large number of simpler non-linear functions.
Some activation functions, such as sigmoid and tanh, also limit the output to a finite range.
Types of Activation Functions
The activation function (non-linearity) simply needs to take a number as input and apply a mathematical operation on it. There is a rich variety of activation functions but the most commonly used ones are: Sigmoid, Tanh, and ReLU.
- Sigmoid: It takes a real-valued input and outputs a number between 0 and 1, as shown in the figure below. Intuitively, when the input is very negative, the output of the neuron is very close to 0. When the input is large and positive, the output of the neuron is close to 1. In between the two extremes, the curve is S-shaped.
f(z) = 1 / (1 + exp(−z))
- tanh: It takes a real-valued input and outputs a number in the range from -1 to 1.
tanh(z) = 2σ(2z) − 1, where σ is the sigmoid function defined above
- ReLU: It takes a real-valued input and replaces negative values with zero. Despite some drawbacks, ReLU has become the activation function of choice for many tasks (especially in computer vision).
f(z) = max(0, z)
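The three formulas above translate almost directly into Python. This is a minimal sketch using only the standard library; tanh is deliberately written in terms of the sigmoid to mirror the identity given above, although in practice `math.tanh` would be the usual choice.

```python
import math

def sigmoid(z):
    # f(z) = 1 / (1 + exp(-z)), output in (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

def tanh(z):
    # tanh(z) = 2 * sigmoid(2z) - 1, output in (-1, 1)
    return 2.0 * sigmoid(2.0 * z) - 1.0

def relu(z):
    # f(z) = max(0, z): negative inputs become zero
    return max(0.0, z)

for z in (-5.0, -1.0, 0.0, 1.0, 5.0):
    print(f"z={z:+.1f}  sigmoid={sigmoid(z):.4f}  tanh={tanh(z):.4f}  relu={relu(z):.1f}")
```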
Neural Network Architecture and Types
The components of a neural network:
- Input layer: These nodes are just responsible for passing the input information to the hidden layer, without performing any computations. (A set of nodes is called a layer.)
- Hidden layer (“not an input or an output”): These are the nodes where the intermediate computations occur; they transfer information from the input nodes to the output nodes. While a network has exactly one input layer and one output layer, it can have any number of hidden layers, including none.
- Output layer: This is where the final output is calculated (using activation functions).
- Weights: Each of the connections between nodes has an associated weight.
So far we’ve looked at models where the output from one layer is used as input to the next layer. Such networks are called feedforward networks: there are no loops in the network, and information is always passed forward and never fed back. Having loops means that the input to a node can depend on its own earlier output. Models with such feedback loops are called recurrent neural networks (RNNs). Because of these loops, RNNs have a form of internal memory that carries information from earlier inputs, which makes them very handy for tasks that deal with sequences, e.g., understanding speech or text.
Feedforward networks can then again be split into two main categories:
- Single Layer Perceptron: The simplest form of NN, with no hidden layers (like in the initial examples/figures).
- Multi Layer Perceptron: A network with one or more hidden layers. These are the ones used for practical purposes and not just for theoretical explanations.
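To make the layered structure concrete, here is a rough sketch of a forward pass through a small multi layer perceptron. The layer sizes, weights, and biases are invented for illustration, sigmoid is assumed as the activation in every layer, and `layer_forward` and `mlp_forward` are hypothetical helper names.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer_forward(inputs, weights, biases):
    """One fully connected layer: each node takes a weighted sum of the
    inputs plus its bias, then applies the sigmoid activation."""
    return [
        sigmoid(sum(x * w for x, w in zip(inputs, node_weights)) + b)
        for node_weights, b in zip(weights, biases)
    ]

def mlp_forward(inputs, layers):
    """Feed the inputs forward through each (weights, biases) layer in turn."""
    activations = inputs  # the input layer passes its values on unchanged
    for weights, biases in layers:
        activations = layer_forward(activations, weights, biases)
    return activations

# Hypothetical network: 2 inputs -> 3 hidden nodes -> 1 output node.
hidden = ([[0.5, -0.4], [0.3, 0.8], [-0.6, 0.1]], [0.0, 0.1, -0.1])
output = ([[1.0, -1.0, 0.5]], [0.2])

prediction = mlp_forward([1.0, 2.0], [hidden, output])
print(prediction)
```

Passing a list with only an output layer to `mlp_forward` would give the single layer case, as long as that layer's weights match the number of inputs.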
How does the model learn?
One of the most frequently used methods to train ANNs is called Backward Propagation (commonly shortened to backpropagation), or “learning from mistakes”. As we have seen, for any given set of inputs, the output of an ANN depends on the weights of the edges connecting the nodes in the network. So we want to learn the values of the weights that minimize the errors we make on the training examples.
Without “Mathy” details: Initially, all the weights are assigned random values. For every input sample we calculate the final output and the prediction error. Then we propagate this error back to the previous layer, where it is used to adjust weights. Once all the layers are done adjusting weights, we use the new weights to calculate the new prediction error. This process of adjusting weights based on the final error is repeated until we have achieved an error below the desired threshold.
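For readers who prefer code to words, the loop described above might look roughly like this. It is a sketch under several assumptions that are ours, not part of the description: a tiny 2-2-1 network with sigmoid activations, a squared error, a made-up training set (logical OR), and a fixed number of epochs instead of an error threshold.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w_hidden, b_hidden, w_out, b_out):
    """Compute the hidden activations h and the final output y."""
    h = [sigmoid(sum(xi * wi for xi, wi in zip(x, w)) + b)
         for w, b in zip(w_hidden, b_hidden)]
    y = sigmoid(sum(hi * wi for hi, wi in zip(h, w_out)) + b_out)
    return h, y

def train(data, epochs=2000, learning_rate=0.5, seed=0):
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    # Initially, all the weights are assigned random values.
    w_hidden = [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(2)]
    b_hidden = [0.0, 0.0]
    w_out = [rng.uniform(-1, 1) for _ in range(2)]
    b_out = 0.0
    errors = []
    for _ in range(epochs):
        total = 0.0
        for x, target in data:
            # Calculate the final output and the prediction error.
            h, y = forward(x, w_hidden, b_hidden, w_out, b_out)
            total += (target - y) ** 2
            # Error signal at the output node (the constant factor 2 of the
            # squared error is absorbed into the learning rate).
            delta_out = (y - target) * y * (1.0 - y)
            # Propagate the error back to the hidden layer.
            delta_hidden = [delta_out * w_out[j] * h[j] * (1.0 - h[j])
                            for j in range(2)]
            # Adjust the weights of the output layer, then the hidden layer.
            for j in range(2):
                w_out[j] -= learning_rate * delta_out * h[j]
            b_out -= learning_rate * delta_out
            for j in range(2):
                for i in range(2):
                    w_hidden[j][i] -= learning_rate * delta_hidden[j] * x[i]
                b_hidden[j] -= learning_rate * delta_hidden[j]
        errors.append(total / len(data))
    return errors

# Hypothetical training set: the logical OR of two inputs.
data = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0), ([1.0, 0.0], 1.0), ([1.0, 1.0], 1.0)]

errors = train(data)
print(errors[0], errors[-1])
```

The first printed number is the average error during the first pass over the data and the second is the average error during the last pass, which should be noticeably smaller.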
Now let’s understand it with some Math:
Let’s say that our output is given by the function Y = mx + b and our error function E, also called the loss function, is given by:
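A common loss that fits this description is the mean squared error. Assuming that choice (N, the number of training examples, is a symbol introduced here for illustration), E can be written as:

```latex
E(m, b) = \frac{1}{N} \sum_{i=1}^{N} \bigl( y_i - (m x_i + b) \bigr)^2
```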
where m and b are the weight and bias, respectively, xi is the i-th training example, and yi is the actual output for that example. (In simple words, E measures the difference between the actual output and the predicted output.)
The error E given by the loss function is zero when our model makes a perfectly correct prediction on every training example. Moreover, the closer E is to 0, the better our model is. As a result, our goal will be to select the values for all the weights in our model such that E is as close to 0 as possible. So the problem we are trying to solve is: “Which values to assign to the weights such that we can have the minimum E?” An optimization algorithm called Gradient Descent can provide us a solution.
Suppose you are at the top of a hill, and you have to reach the valley (or the lowest point) from the hill. And let’s also say that it is a foggy day and you cannot see far enough to determine exactly which direction leads to the valley. How would you go about your descent?
Starting at the top of the hill, we could take a first step along the path with a descending slope (i.e., limiting ourselves to analyzing only the terrain in our immediate surroundings), hoping that the descending trend will continue and that it will eventually lead us to the valley. So we will move step by step, assess the slope around us, and keep moving until we reach the valley or a point where we can no longer move downhill – a local minimum.
Now, why this analogy? The hill is an analogy for the values of weights that give high prediction error while the valley is the location where the error is minimal. Gradient descent measures the gradient/slope (the change in cost caused by a change in weight, yellow lines in the figure below) for the starting value of the weight (green cross) and moves it towards the bottom of the hill.
We know that at a point where the curve representing our function is flat, the derivative (its rate of change) is 0. In Gradient Descent, we take the derivative of the loss function to produce the gradient. Based on the sign of the gradient, the algorithm can tell whether the weight should be increased or decreased in order to move towards the point where the slope flattens out.
Now let’s apply gradient descent to the cost function we defined above. Our cost function has two parameters: m (weight) and b (bias). Since we need to consider the impact each one has on the final prediction, we obtain the gradient by calculating the partial derivative of the cost function with respect to each of the parameters:
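Assuming the loss is the mean squared error, E(m, b) = (1/N) Σ (y_i − (m x_i + b))², with N the number of training examples (this specific form is an assumption on our part), the chain rule gives the two partial derivatives:

```latex
\frac{\partial E}{\partial m} = \frac{1}{N} \sum_{i=1}^{N} -2\, x_i \bigl( y_i - (m x_i + b) \bigr)
\qquad
\frac{\partial E}{\partial b} = \frac{1}{N} \sum_{i=1}^{N} -2 \bigl( y_i - (m x_i + b) \bigr)
```

Together these two values form the gradient of E at the current (m, b).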
So we iterate through our input data points and compute the partial derivatives for the current values of our parameters. The resulting gradient tells us the slope of our cost function at our current position and the direction in which we should move to update our parameters, i.e., whether we should increase or decrease the current values. We always take a step in the direction of the negative gradient in order to reduce the loss as quickly as possible. The speed with which we move towards the valley is controlled by the size of our steps, also known as the learning rate.
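Here is a minimal sketch of that update loop for the model Y = mx + b. It assumes a mean squared error loss, and the data points, the learning rate, and the number of iterations are made-up values chosen so that the example converges; `gradient_step` is a hypothetical helper name.

```python
def gradient_step(m, b, xs, ys, learning_rate):
    """One gradient descent update for y = m*x + b under mean squared error."""
    n = len(xs)
    grad_m = 0.0
    grad_b = 0.0
    # Iterate through the data points and accumulate the partial derivatives.
    for x, y in zip(xs, ys):
        error = y - (m * x + b)
        grad_m += -2.0 * x * error / n
        grad_b += -2.0 * error / n
    # Take a step in the direction of the negative gradient.
    return m - learning_rate * grad_m, b - learning_rate * grad_b

# Hypothetical data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

m, b = 0.0, 0.0
for _ in range(5000):
    m, b = gradient_step(m, b, xs, ys, learning_rate=0.05)

print(round(m, 3), round(b, 3))
```

Starting from m = 0 and b = 0, the parameters settle very close to the values used to generate the data (m = 2, b = 1).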
A high learning rate means that we can potentially reach our target faster, but at the risk of jumping over the lowest point and ending on the other side (above the valley again, but on the opposite side). On the other hand, with a very low learning rate, we are playing it safe. However, this safety comes at the cost of time, because we need many small steps, and have to compute the gradient at each one, before we reach the bottom.
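The trade-off is easy to see on a toy problem. The sketch below runs gradient descent on E(w) = w², a stand-in cost function chosen only because its gradient, 2w, is simple and its minimum is at w = 0. The three learning rates are arbitrary illustrative values.

```python
def descend(learning_rate, steps=20, start=10.0):
    """Run gradient descent on E(w) = w**2, whose gradient is 2*w."""
    w = start
    for _ in range(steps):
        w -= learning_rate * 2.0 * w
    return w

small = descend(0.01)    # safe but slow: still far from the minimum at w = 0
good = descend(0.3)      # gets very close to 0 within the same 20 steps
too_big = descend(1.1)   # jumps over the valley on every step and diverges

print(small, good, too_big)
```

With the same 20 steps, the smallest rate has barely moved, the middle one has essentially arrived, and the largest one has ended up further from the minimum than where it started.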
Hopefully, now you have a clear understanding of the basics of Neural Networks. I would like to point out that Deep Learning solutions are ideal to tackle large and highly complex ML tasks, such as recommending the best videos to watch to hundreds of millions of users every day (like YouTube, Netflix) or powering speech recognition services (like Alexa, Cortana). Also, Deep Learning algorithms are black-box algorithms, so when starting a new ML project you should not jump to them just because Deep Learning solutions sound cooler — if possible, start simple.