Intro to Neural Networks with Keras
The perceptron
The Perceptron is one of the simplest ANN architectures. It is based on an artificial neuron called a threshold logic unit (TLU). The inputs are numbers, and each input connection is associated with a weight. The TLU computes a weighted sum of its inputs, applies a step function to that sum, and outputs the result. The most common step function used in Perceptrons is the Heaviside step function.
A single TLU can be used for simple linear binary classification: it computes a linear combination of the inputs, and if the result exceeds a threshold, it outputs the positive class. For example, you could use a single TLU to classify iris flowers based on petal length and width. Training a TLU in this case means finding the right values for w0, w1, and w2.
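As a quick illustration, here is a minimal sketch of such a TLU written directly in NumPy. The weight values w0 (bias), w1, and w2 are arbitrary assumptions chosen just for the example, not trained values:

```python
import numpy as np

def heaviside(z):
    """Heaviside step function: 1 if z >= 0, else 0."""
    return (np.asarray(z) >= 0).astype(int)

# Arbitrary example weights: w0 is the bias term, w1 and w2 weight the two inputs.
w0, w1, w2 = -1.0, 0.5, 1.2

def tlu_predict(petal_length, petal_width):
    # Weighted sum of the inputs plus the bias, then the step function.
    z = w0 + w1 * petal_length + w2 * petal_width
    return heaviside(z)

print(tlu_predict(2.0, 0.5))  # 1 (positive class) for these example weights
```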
A Perceptron is basically a single layer of TLUs, with each TLU connected to all the inputs. When all the neurons in a layer are connected to every neuron in the previous layer, the layer is called a fully connected layer, or dense layer.
The decision boundary is linear, so just like Logistic Regression, Perceptrons are unable to learn complex patterns. However, if the training instances are linearly separable, the algorithm will converge to a solution. This is called the Perceptron Convergence Theorem.
Scikit-Learn provides a Perceptron class that implements a single-TLU network.
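A minimal sketch of using it on the iris petal features mentioned above (the choice of target class and feature columns here is just for illustration):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import Perceptron

iris = load_iris()
X = iris.data[:, (2, 3)]              # petal length, petal width
y = (iris.target == 0).astype(int)    # 1 if Iris setosa, else 0

per_clf = Perceptron()                # a single-TLU network
per_clf.fit(X, y)

y_pred = per_clf.predict([[2.0, 0.5]])  # predict the class of a new flower
print(y_pred)
```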
The Perceptron learning algorithm strongly resembles Stochastic Gradient Descent.
NOTE:
Perceptrons do not output class probabilities the way Logistic Regression does; instead, they make predictions based on a hard threshold. This can be a key factor when choosing between a Perceptron and Logistic Regression.
Perceptrons have a number of weaknesses, most notably their inability to solve some trivial problems (such as the XOR classification problem), which is true of any other linear classification model as well. Over time, it was found that some of the limitations of Perceptrons can be overcome by stacking multiple Perceptrons. The resulting ANN is called a Multi-Layer Perceptron (MLP).
Multi-Layer Perceptron and Backpropagation
An MLP is composed of one (passthrough) input layer, one or more layers of TLUs called hidden layers, and one final layer of TLUs called the output layer. The layers close to the input layer are usually called the lower layers, and the ones close to the outputs are usually called the upper layers. Every layer except the output layer includes a bias neuron and is fully connected to the next layer.
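Since these notes are about Keras, here is a minimal sketch of such an MLP as a Keras Sequential model. The input size, layer sizes, and activations are arbitrary assumptions for illustration:

```python
import tensorflow as tf

# A small MLP: a passthrough input layer, two hidden layers, and an output layer.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),                      # input layer (passthrough)
    tf.keras.layers.Dense(30, activation="relu"),    # lower hidden layer
    tf.keras.layers.Dense(10, activation="relu"),    # upper hidden layer
    tf.keras.layers.Dense(3, activation="softmax"),  # output layer
])
model.summary()  # each Dense layer is fully connected and includes bias terms
```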
When an ANN contains a deep stack of hidden layers, it is called a deep neural network (DNN).
Backpropagation is, in short, Gradient Descent combined with an efficient technique for computing the gradients automatically: in just two passes through the network (one forward, one backward), the algorithm is able to compute the gradient of the network's error with regard to every single model parameter.
In more detail, the algorithm works as follows (a code sketch of one training step follows the list):
- It handles one mini-batch at a time, and it goes through the full training set multiple times. Each pass through the training set is called an epoch.
- Each mini-batch is passed to the network's input layer; this is where the forward pass takes place. All intermediate results are preserved, since they are needed for the backward pass.
- The algorithm then measures the network's output error (using a loss function that compares the desired output with the actual output).
- It then computes how much each output connection contributed to that error. This is done analytically using the chain rule.
- The algorithm then measures how much of these error contributions came from each connection in the layer below, again using the chain rule, and so on until it reaches the input layer.
- Finally, the algorithm performs a Gradient Descent step to tweak all the connection weights in the network, using the error gradients it just computed.
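The sketch below illustrates these steps for a single mini-batch using TensorFlow's GradientTape. The loss function and optimizer are arbitrary choices for illustration, and the `model` argument could be the MLP built in the earlier sketch:

```python
import tensorflow as tf

loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)

def train_step(model, X_batch, y_batch):
    with tf.GradientTape() as tape:
        y_pred = model(X_batch, training=True)  # forward pass (intermediate results recorded)
        loss = loss_fn(y_batch, y_pred)         # measure the output error
    # Backward pass: the chain rule is applied from the output layer back to the
    # input layer to get the gradient of the loss w.r.t. every model parameter.
    grads = tape.gradient(loss, model.trainable_variables)
    # Gradient Descent step: tweak all connection weights using those gradients.
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss
```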
Activation functions for backpropagation
- The logistic (sigmoid) function: σ(z) = 1 / (1 + exp(−z)). It is continuous and differentiable, S-shaped, with outputs ranging from 0 to 1.
- The hyperbolic tangent function: tanh(z) = 2σ(2z) − 1. Like the logistic function it is an S-shaped, continuous, and differentiable curve, but its range is from −1 to 1. This tends to make each layer's output centered around 0 at the beginning of training, which often helps the network converge sooner.
- The rectified linear unit function: ReLU(z) = max(0, z). It is continuous but not differentiable at z = 0. In practice it works very well and has the advantage of being fast to compute.
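As a quick sketch (plain NumPy, purely for illustration), the three activation functions can be written as:

```python
import numpy as np

def sigmoid(z):
    """Logistic function: S-shaped, outputs in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    """Hyperbolic tangent: S-shaped, outputs in (-1, 1)."""
    return np.tanh(z)

def relu(z):
    """Rectified linear unit: 0 for z < 0, z otherwise; fast to compute."""
    return np.maximum(0.0, z)

z = np.linspace(-5, 5, 11)
print(sigmoid(z).round(2))
print(tanh(z).round(2))
print(relu(z))
```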