Each input feature (x1, x2, …, xm) is multiplied by a corresponding weight (w1, w2, …, wm) and the products are summed together; the node's output is the activation function applied to that weighted sum:
output = activation(w1x1 + w2x2 + … + wmxm)   (1)
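Equation (1) can be sketched in a few lines of Python. The sigmoid used as the activation here, and the optional bias term, are common choices of my own rather than details mandated by the text:

```python
import math

def node_output(inputs, weights, bias=0.0):
    """Weighted sum of inputs passed through an activation (sigmoid here)."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-weighted_sum))  # sigmoid activation

print(node_output([1.0, 2.0], [0.5, -0.25]))  # sigmoid(0.5*1 - 0.25*2) = sigmoid(0) = 0.5
```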
- Forward propagation is the process of passing the inputs through the network, layer by layer, from the input layer to the output layer, in order to produce the actual output.
- Hidden layers are neuron nodes stacked in between inputs and outputs, allowing neural networks to learn more complicated features (such as XOR logic).
- Backpropagation is a procedure for repeatedly adjusting the weights so as to minimize the difference between the actual output and the desired output. It lets information flow backward from the cost through the network in order to compute the gradient: we loop over the nodes, starting from the final node, in reverse topological order, computing the derivative of the final output with respect to each node. Doing so tells us which parameters are responsible for the most error, so we can change them appropriately in that direction.
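A toy sketch of that reverse-order traversal follows. The two-node chain below (x → a = w·x → loss = a²) is my own illustration, not an example from the article:

```python
# Toy backpropagation on the chain x -> a = w*x -> loss = a**2,
# differentiated in reverse (topological) order via the chain rule.
def forward_backward(x, w):
    a = w * x                  # forward pass
    loss = a ** 2
    # backward pass: start at the final node, move in reverse order
    dloss_da = 2 * a           # d(loss)/da
    dloss_dw = dloss_da * x    # chain rule: d(loss)/dw = d(loss)/da * da/dw
    return loss, dloss_dw

loss, grad = forward_backward(x=3.0, w=0.5)
print(loss, grad)  # 2.25 9.0
```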
- Gradient descent is an optimization algorithm used while training a machine learning model: it tweaks the parameters iteratively to minimize a given function toward a local minimum. A gradient measures how much the output of a function changes if you change the inputs a little bit. Note: if gradient descent is working properly, the cost function should decrease after every iteration.
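A minimal sketch of the iterative update, using the hypothetical convex cost f(x) = (x − 2)² with gradient 2(x − 2) (a function of my own choosing for illustration):

```python
def gradient_descent(grad, x0, lr=0.1, steps=50):
    """Repeatedly step against the gradient to minimize the cost."""
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)  # move a little bit in the downhill direction
    return x

# Minimize f(x) = (x - 2)**2; its gradient is 2*(x - 2).
x_min = gradient_descent(lambda x: 2 * (x - 2), x0=0.0)
print(round(x_min, 4))  # approaches the minimum at x = 2
```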
Types of Activation Functions:
Activation functions are of basically two types:
1. Linear Activation Function –
Equation : f(x) = x
Range : (-infinity to infinity)
2. Non-linear Activation Functions –
It makes it easy for the model to generalize to a variety of data and to differentiate between outputs. Nonlinear means that the output cannot be reproduced from a linear combination of the inputs. (In simulations, ReLUs have been found to result in much faster training for large networks.)
The main terms needed to understand nonlinear functions are:
1. Derivative: Change in y-axis w.r.t. change in x-axis. It is also known as slope.
2. Monotonic function: A function which is either entirely non-increasing or non-decreasing.
The nonlinear activation functions are mainly divided on the basis of their range or curves, as follows. Let's take a deeper look at each activation function.
1. Sigmoid:
It is also called the logistic activation function, and is often used as a binary classifier, because its output is typically thresholded to either 0 (False) or 1 (True).
The sigmoid function produces results similar to a step function in that the output is between 0 and 1. The curve crosses 0.5 at z = 0, so we can set up rules for the activation function, such as: if the sigmoid neuron's output is larger than or equal to 0.5, output 1; if it is smaller than 0.5, output 0.
The sigmoid function has no kinks in its curve: it is smooth, differentiable everywhere, and has a very simple derivative.
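A small sketch of the sigmoid, its closed-form derivative σ'(z) = σ(z)(1 − σ(z)), and the 0.5 threshold rule described above (the helper names are my own):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)  # smooth and defined everywhere on the curve

def binary_output(z):
    return 1 if sigmoid(z) >= 0.5 else 0  # the 0.5 threshold rule

print(sigmoid(0.0))             # 0.5 -- the curve crosses 0.5 at z = 0
print(sigmoid_derivative(0.0))  # 0.25 -- the slope is steepest at z = 0
print(binary_output(1.3), binary_output(-1.3))  # 1 0
```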
ReLU (Rectified Linear Unit) –
Equation : f(x) = max(0, x)
Models that are close to linear are easy to optimize. Since ReLU shares many of the properties of linear functions, it tends to work well on most problems. The only issue is that the derivative is not defined at z = 0, which we can overcome by assigning the derivative the value 0 at z = 0. However, this means that for z <= 0 the gradient is zero, and those units can no longer learn.
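A minimal sketch of ReLU and the derivative convention just described (assigning the derivative the value 0 at z = 0):

```python
def relu(x):
    return max(0.0, x)

def relu_derivative(x):
    # the derivative is undefined at x = 0; by convention we assign 0 there
    return 1.0 if x > 0 else 0.0

print(relu(3.5), relu(-2.0))   # 3.5 0.0
print(relu_derivative(-2.0))   # 0.0 -- gradient is zero for z <= 0, so no learning
```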
Tanh –
Equation : tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x))   (2)
Tanh can also be written in terms of the sigmoid: tanh(x) = 2 sigmoid(2x) − 1.
5. Softmax :
The sigmoid function is easy to apply, and ReLUs avoid the vanishing-gradient effect during training. However, when you want to deal with classification problems, they cannot help much: the sigmoid function can only handle two classes, and we want something more. The softmax function squashes the output of each unit to be between 0 and 1, just like a sigmoid function, and it also divides each output by the sum of all the outputs, so that the total sum of the outputs is equal to 1. The output of the softmax function is therefore equivalent to a categorical probability distribution: it tells you the probability that each of the classes is true.
softmax(z)_j = exp(z_j) / Σ_{k=1}^{K} exp(z_k)   (3)
where z is the vector of inputs to the output layer (if you have 10 output units, then there are 10 elements in z) and, again, j indexes the output units, so j = 1, 2, …, K.
Properties of the Softmax Function –
1. The calculated probabilities will be in the range 0 to 1.
2. The sum of all the probabilities equals 1.
Softmax Function Usage –
1. Used in multiclass logistic regression models.
2. Used at different layer levels when building neural networks and multilayer perceptrons.
Example:
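As a hypothetical numeric example of the softmax (the input vector below is my own choice):

```python
import math

def softmax(z):
    exps = [math.exp(v) for v in z]    # exponentiate each output unit
    total = sum(exps)
    return [e / total for e in exps]   # divide each output by the sum

probs = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probs])  # [0.659, 0.242, 0.099]
print(round(sum(probs), 6))          # 1.0 -- the outputs form a probability distribution
```

Note that the largest input (2.0) receives the largest probability, as a categorical distribution should.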