LLM Study Notes - Basic Components of Neural Networks

Basic Structure of Neural Networks#


Neuron (simple neuron)#


$w$: weight
$x$: input value
$b$: bias
The bias is similar to the intercept in linear equations.
Main functions:
Adjusting the threshold of the activation function: The bias $b$ can be seen as shifting the threshold of the activation function. A neuron without a bias relies entirely on the weighted sum of its inputs; whenever that weighted sum is zero, it always produces the same output. With a bias $b$, the neuron can output different values even when the weighted sum is zero, which increases the flexibility of the model (the code sketch at the end of this section illustrates this).
Improving the expressive power of the model: By adding bias terms, neural networks can fit more types of data distributions. Bias terms allow neural networks to better learn the complex features of the data, thereby improving the expressive power and generalization ability of the model.
Avoiding underfitting: Bias terms can help the model avoid underfitting problems. Underfitting occurs when the model fails to capture the underlying patterns of the training data. Bias terms allow neurons to activate even without significant input signals, helping the model better fit the training data.
$f$: activation function
Note:
The output here is a scalar, and $\mathbf{w}^T\mathbf{x}$ is a dot product.
The formula can also be written as:

$$y = f\left(\sum_i w_i x_i + b\right)$$
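A minimal NumPy sketch of this single-neuron computation (the input, weight, and bias values are made up, and sigmoid stands in for the activation $f$); the last line also illustrates the earlier point about the bias shifting the output:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.2, 3.0])   # input vector x
w = np.array([0.8, 0.1, -0.4])   # weight vector w
b = 0.2                          # bias b

z = np.dot(w, x) + b             # weighted sum w^T x + b (a scalar)
y = sigmoid(z)                   # activation f applied to that scalar
print(z, y)

# the bias shifts the output even though the weighted sum is unchanged
print(sigmoid(np.dot(w, x)), sigmoid(np.dot(w, x) + b))
```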

Single Layer Neural Network#


The multiple neurons in a single-layer neural network can be computed in parallel.

See note for parallel computation method.

Each neuron here can output a scalar.
Note:
The parallel computation method here changes the weights from a vector $\mathbf{w}$ to a matrix $\mathbf{W}$, and the bias from a scalar $b$ to a vector $\mathbf{b}$. Note that this is not parallel computation using multiple threads as we usually understand it.
The calculation of the activation function is performed after multiplying the input vector with the weight matrix and adding the bias.
Weighted sum calculation:

  • First, multiply the input vector $\mathbf{x}$ by the weight matrix $\mathbf{W}$ to obtain the weighted-sum vector:
    $\mathbf{z} = \mathbf{W}\mathbf{x}$
    Here, $\mathbf{z}$ is a vector, and each element corresponds to the weighted sum of one neuron.
  • Then, add the bias vector $\mathbf{b}$ to the weighted-sum vector: $\mathbf{z} = \mathbf{W}\mathbf{x} + \mathbf{b}$

Activation function calculation:

  • The activation function $f$ is applied to each element of the weighted-sum vector $\mathbf{z}$ to obtain the output vector $\mathbf{y}$:

    $$\mathbf{y} = f(\mathbf{z})$$

  • Specifically, if we have $m$ neurons, both $\mathbf{z}$ and $\mathbf{y}$ are vectors of length $m$. The activation function $f$ is applied element-wise to each element $z_i$ of $\mathbf{z}$ to obtain the corresponding output $y_i$:

    $$y_i = f(z_i)$$
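A minimal NumPy sketch of the layer computation described above (the sizes and values are made up): the weight matrix has one row per neuron, so a single matrix-vector product computes all weighted sums at once, and the activation is then applied element-wise.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n_in, m = 3, 4                      # 3 inputs, m = 4 neurons in the layer
rng = np.random.default_rng(0)

x = rng.normal(size=n_in)           # input vector x
W = rng.normal(size=(m, n_in))      # weight matrix W: one row per neuron
b = rng.normal(size=m)              # bias vector b: one entry per neuron

z = W @ x + b                       # weighted sums of all m neurons at once
y = sigmoid(z)                      # f applied element-wise: y_i = f(z_i)
print(z.shape, y.shape)             # both (4,)
```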

Multilayer Neural Network#


By stacking multiple similar layers, we can obtain a multilayer neural network.

Forward computation: Calculate the results of each layer in sequence starting from the input.

Input layer -> Hidden layer -> Output layer

Hidden layer: the layers added between the input layer and the output layer.

Note: The output of a hidden layer is usually denoted $\mathbf{h}$; here $\mathbf{h}$ is a vector, obtained mainly by a linear transformation followed by an activation function.
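A minimal sketch of forward computation through a small multilayer network (the layer sizes, random weights, and the choice of ReLU for the hidden layers are all made-up assumptions): each hidden layer applies a linear transformation followed by an activation, and its output is fed into the next layer.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

rng = np.random.default_rng(0)
sizes = [3, 5, 4, 2]                    # input dim, two hidden layers, output dim

# one (W, b) pair per layer
params = [(rng.normal(size=(m, n)), rng.normal(size=m))
          for n, m in zip(sizes[:-1], sizes[1:])]

def forward(x, params):
    h = x
    for W, b in params[:-1]:            # hidden layers: h = f(Wh + b)
        h = relu(W @ h + b)
    W, b = params[-1]                   # output layer: plain linear here
    return W @ h + b

x = rng.normal(size=sizes[0])
print(forward(x, params))               # output vector of length 2
```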

Important!!!:

Role of the activation function:

  1. Introducing nonlinearity: The activation function turns a linear model into a nonlinear one. A neural network without activation functions is just a stack of linear transformations, which is equivalent to a single linear transformation no matter how many layers it has. The activation function prevents a multilayer network from collapsing into a single-layer one and allows the network to learn and represent complex nonlinear relationships.

If a neural network contains only linear operations, a multilayer network can be collapsed into an equivalent single-layer network (see the numeric check after this list).


  2. Increasing the expressive power of the network: By introducing nonlinearity, the activation function allows the neural network to approximate very complex functions. This greatly increases the expressive power and generalization ability of the network, enabling it to handle a wide range of datasets and tasks.
  3. Aiding convergence of gradient descent: Certain activation functions (such as ReLU) can alleviate the vanishing-gradient problem, helping the gradient descent algorithm converge faster. The choice of activation function therefore affects the training efficiency and effectiveness of the model.
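To make point 1 concrete, here is a small numeric check (made-up sizes and random values) that two stacked linear layers with no activation between them are exactly equivalent to one linear layer:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)

# two purely linear "layers" (no activation between them)
W1, b1 = rng.normal(size=(5, 3)), rng.normal(size=5)
W2, b2 = rng.normal(size=(2, 5)), rng.normal(size=2)

two_layers = W2 @ (W1 @ x + b1) + b2

# the same mapping written as a single linear layer
W = W2 @ W1
b = W2 @ b1 + b2
one_layer = W @ x + b

print(np.allclose(two_layers, one_layer))   # True: the stack collapses
```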

Activation Functions#


Sigmoid: maps any input in $(-\infty, +\infty)$ to the range $(0, 1)$.

Tanh: maps any input in $(-\infty, +\infty)$ to the range $(-1, 1)$.

Note: Tanh outputs 0 when the input is 0.

ReLU: For positive inputs, the output remains the same as the original number. For negative inputs, the output is 0.

Softmax: Used for the output layer of multi-class classification problems, it converts the input into a probability distribution. The output values are between 0 and 1, and the sum is 1.

Formula:

$$f(z_i) = \frac{e^{z_i}}{\sum_j e^{z_j}}$$
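A minimal NumPy sketch of the four activation functions above (the test vector is made up; subtracting the maximum in softmax is only for numerical stability and does not change the result):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # (-inf, inf) -> (0, 1)

def tanh(z):
    return np.tanh(z)                 # (-inf, inf) -> (-1, 1), tanh(0) = 0

def relu(z):
    return np.maximum(0.0, z)         # negatives -> 0, positives unchanged

def softmax(z):
    e = np.exp(z - np.max(z))         # shift for numerical stability
    return e / e.sum()                # entries in (0, 1), summing to 1

z = np.array([-2.0, 0.0, 3.0])
print(sigmoid(z), tanh(z), relu(z), softmax(z), softmax(z).sum())
```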

Output Layer#


The output layer is similar in structure to a hidden layer; its exact form depends on the kind of output you want the model to produce.

Linear output: Add another linear layer after the hidden layer to obtain a single value. Mainly used for regression problems.

Sigmoid: as with the sigmoid activation function, first use a regular linear layer to obtain a value $y$, then apply the sigmoid function to compress it into the range 0 to 1. Mainly used for binary classification problems, where the output $y$ represents the probability that the current input belongs to one class and $1-y$ the probability that it belongs to the other class.

Softmax: Mainly used for multi-class classification problems (number of classes > 2). First apply a linear layer to the last hidden layer to obtain a score vector $\mathbf{z}$, then apply the softmax activation function to obtain a probability distribution over the classes, so the model gives the probability of the input belonging to each class.
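A minimal sketch of the three output-layer choices (the hidden vector and all weights are made-up values): a linear head for regression, a sigmoid head for binary classification, and a softmax head for multi-class classification.

```python
import numpy as np

rng = np.random.default_rng(0)
h = rng.normal(size=5)                     # output of the last hidden layer

# linear head: a single value, for regression
w_reg, b_reg = rng.normal(size=5), 0.1
y_reg = w_reg @ h + b_reg

# sigmoid head: P(class 1); 1 - y is P(class 0), for binary classification
w_bin, b_bin = rng.normal(size=5), 0.0
y_bin = 1.0 / (1.0 + np.exp(-(w_bin @ h + b_bin)))

# softmax head: linear layer to a score vector z, then one probability per class
W_cls, b_cls = rng.normal(size=(3, 5)), rng.normal(size=3)
z = W_cls @ h + b_cls
e = np.exp(z - z.max())
probs = e / e.sum()

print(y_reg, y_bin, probs, probs.sum())
```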
