Gradient descent
To find a local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient of the function at the current point.
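A minimal sketch of the update rule, assuming a differentiable function whose gradient is available as grad_f and a fixed step size lr (both names are illustrative):

```python
import numpy as np

def gradient_descent(grad_f, x0, lr=0.1, steps=100):
    """Minimize a function by repeatedly stepping against its gradient."""
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        x = x - lr * grad_f(x)  # step proportional to the negative gradient
    return x

# Example: minimize f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
minimum = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
print(minimum)  # approaches 3
```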
Stochastic gradient descent (SGD)
A stochastic approximation of gradient descent for minimizing an objective function that is a sum of functions.
The true gradient is approximated by the gradient of a single, randomly chosen function in the sum.
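A minimal sketch, assuming the objective is a sum of per-example losses and that per_example_grad (an illustrative name) returns the gradient of a single term:

```python
import numpy as np

def sgd(per_example_grad, x0, n_examples, lr=0.01, steps=1000, seed=0):
    """At each step, follow the gradient of one randomly chosen term of the sum."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        i = rng.integers(n_examples)         # pick a single term at random
        x = x - lr * per_example_grad(x, i)  # noisy estimate of the true gradient
    return x
```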
Initialization of a network
Usually, the biases of a neural network are set to zero, while the weights are initialized with independent and identically distributed zero-mean Gaussian noise.
The variance of the noise is chosen so that the magnitudes of the signals do not change drastically as they propagate through the layers.
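A minimal sketch of one such scheme: biases are set to zero and weights are drawn from a zero-mean Gaussian whose variance is 1 / fan_in (this particular scaling is a common choice, not the only one):

```python
import numpy as np

def init_layer(fan_in, fan_out, seed=0):
    """Zero biases; i.i.d. zero-mean Gaussian weights with variance 1 / fan_in,
    so the variance of the layer's pre-activations roughly matches that of its inputs."""
    rng = np.random.default_rng(seed)
    W = rng.normal(loc=0.0, scale=np.sqrt(1.0 / fan_in), size=(fan_out, fan_in))
    b = np.zeros(fan_out)
    return W, b
```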
Learning rate
The scalar by which the negative of the gradient is multiplied in gradient descent.
Backpropagation
An algorithm, relying on iterative application of the chain rule, for efficiently computing the derivatives of a neural network's loss with respect to all of its parameters and feature vectors.
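A minimal sketch for a single hidden layer with a tanh nonlinearity and a squared-error loss (the architecture and the names are illustrative); each backward line applies the chain rule to the corresponding forward line:

```python
import numpy as np

def forward_backward(x, y, W1, b1, W2, b2):
    """Forward pass, then chain-rule backward pass through one hidden layer."""
    # Forward pass.
    z1 = W1 @ x + b1          # pre-activation of the hidden layer
    h1 = np.tanh(z1)          # hidden feature vector
    y_hat = W2 @ h1 + b2      # network output
    loss = 0.5 * np.sum((y_hat - y) ** 2)

    # Backward pass: chain rule applied layer by layer, from output to input.
    d_yhat = y_hat - y             # dL/dy_hat
    dW2 = np.outer(d_yhat, h1)     # dL/dW2
    db2 = d_yhat                   # dL/db2
    d_h1 = W2.T @ d_yhat           # dL/dh1 (gradient w.r.t. a feature vector)
    d_z1 = d_h1 * (1 - h1 ** 2)    # dL/dz1, using d tanh(z)/dz = 1 - tanh(z)^2
    dW1 = np.outer(d_z1, x)        # dL/dW1
    db1 = d_z1                     # dL/db1
    return loss, (dW1, db1, dW2, db2)
```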
Goal function
The function being minimized in an optimization process, such as SGD.
Data preprocessing
The input to a neural network is often mean subtracted, contrast normalized and whitened.
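A minimal sketch of these steps on a data matrix X with one example per row; the ordering of the steps, the small epsilon, and the use of PCA whitening are illustrative choices:

```python
import numpy as np

def preprocess(X, eps=1e-5):
    """Contrast-normalize each example, mean-subtract each feature, then whiten."""
    X = X / (np.linalg.norm(X, axis=1, keepdims=True) + eps)  # contrast normalization
    X = X - X.mean(axis=0)                                    # mean subtraction
    cov = X.T @ X / X.shape[0]                                # covariance of centered data
    eigvals, eigvecs = np.linalg.eigh(cov)
    return X @ eigvecs / np.sqrt(eigvals + eps)               # PCA whitening
```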
One-hot vector
A vector containing a one in a single entry and zeros elsewhere.
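A minimal sketch, assuming classes are indexed 0 .. num_classes - 1:

```python
import numpy as np

def one_hot(label, num_classes):
    """Return a vector with a one at position `label` and zeros elsewhere."""
    v = np.zeros(num_classes)
    v[label] = 1.0
    return v

print(one_hot(2, 5))  # [0. 0. 1. 0. 0.]
```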
Cross entropy
Commonly used to quantify the difference between two probability distributions. In the case of neural networks, one of the distributions is the output of the softmax, while the other is a one-hot vector corresponding to the correct class.
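A minimal sketch: with a one-hot target, the cross entropy reduces to the negative log probability that the softmax assigns to the correct class (the epsilon guarding against log(0) is an illustrative detail):

```python
import numpy as np

def cross_entropy(softmax_output, one_hot_target, eps=1e-12):
    """H(p, q) = -sum_i p_i * log(q_i); with a one-hot p this is -log(q_correct)."""
    return -np.sum(one_hot_target * np.log(softmax_output + eps))

probs = np.array([0.1, 0.7, 0.2])    # softmax output
target = np.array([0.0, 1.0, 0.0])   # one-hot vector for the correct class
print(cross_entropy(probs, target))  # -log(0.7) ≈ 0.357
```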
Added noise
A perturbation added to the input of the network or to one of the feature vectors it computes.
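A minimal sketch, assuming the perturbation is zero-mean Gaussian noise with standard deviation sigma (the choice of distribution is illustrative):

```python
import numpy as np

def add_noise(x, sigma=0.1, rng=None):
    """Perturb a vector with i.i.d. zero-mean Gaussian noise."""
    rng = np.random.default_rng() if rng is None else rng
    return x + rng.normal(loc=0.0, scale=sigma, size=x.shape)
```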