LLM Study Notes - Basic Components of Neural Networks

Basic Structure of Neural Networks#


Neuron (simple neuron)#


$w$: weight
$x$: input value
$b$: bias
The bias is similar to the intercept in linear equations.
Main functions:
Adjusting the threshold of the activation function: The bias $b$ can be seen as shifting the threshold of the activation function. A neuron without a bias relies entirely on the weighted sum of its inputs; whenever that weighted sum is zero, it always produces the same output. With a bias $b$, the neuron can output different values even when the weighted sum is zero, which increases the flexibility of the model (the code sketch at the end of this section illustrates this).
Improving the expressive power of the model: By adding bias terms, neural networks can fit more types of data distributions. Bias terms allow neural networks to better learn the complex features of the data, thereby improving the expressive power and generalization ability of the model.
Avoiding underfitting: Bias terms can help the model avoid underfitting problems. Underfitting occurs when the model fails to capture the underlying patterns of the training data. Bias terms allow neurons to activate even without significant input signals, helping the model better fit the training data.
$f$: activation function
Note:
The output here is a scalar, and $\mathbf{w}^T\mathbf{x}$ is a dot product.
The formula can also be written as:

$$y = f\left(\sum_i w_i x_i + b\right)$$
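A minimal NumPy sketch of this single-neuron computation (the input, weight, and bias values are made up, and sigmoid stands in for the activation $f$); the last line also illustrates the earlier point about the bias shifting the output:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.2, 3.0])   # input vector x
w = np.array([0.8, 0.1, -0.4])   # weight vector w
b = 0.2                          # bias b

z = np.dot(w, x) + b             # weighted sum w^T x + b (a scalar)
y = sigmoid(z)                   # activation f applied to that scalar
print(z, y)

# the bias shifts the output even though the weighted sum is unchanged
print(sigmoid(np.dot(w, x)), sigmoid(np.dot(w, x) + b))
```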

Single Layer Neural Network#


The multiple neurons in a single-layer neural network can be computed in parallel.

See note for parallel computation method.

Each neuron here can output a scalar.
Note:
The parallel computation method here changes the weights from a vector $\mathbf{w}$ to a matrix $\mathbf{W}$, and the bias from a scalar $b$ to a vector $\mathbf{b}$. Note that this is not parallel computation using multiple threads as we usually understand it.
The calculation of the activation function is performed after multiplying the input vector with the weight matrix and adding the bias.
Weighted sum calculation:

  • First, multiply the input vector $\mathbf{x}$ by the weight matrix $\mathbf{W}$ to obtain the weighted-sum vector:
    $\mathbf{z} = \mathbf{W}\mathbf{x}$
    Here, $\mathbf{z}$ is a vector, and each element corresponds to the weighted sum of one neuron.
  • Then, add the bias vector $\mathbf{b}$ to the weighted-sum vector: $\mathbf{z} = \mathbf{W}\mathbf{x} + \mathbf{b}$

Activation function calculation:

  • The activation function $f$ is applied to each element of the weighted-sum vector $\mathbf{z}$ to obtain the output vector $\mathbf{y}$:

    $$\mathbf{y} = f(\mathbf{z})$$

  • Specifically, if we have $m$ neurons, both $\mathbf{z}$ and $\mathbf{y}$ are vectors of length $m$. The activation function $f$ is applied element-wise to each element $z_i$ of $\mathbf{z}$ to obtain the corresponding output $y_i$:

    $$y_i = f(z_i)$$
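A minimal NumPy sketch of the layer computation described above (the sizes and values are made up): the weight matrix has one row per neuron, so a single matrix-vector product computes all weighted sums at once, and the activation is then applied element-wise.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n_in, m = 3, 4                      # 3 inputs, m = 4 neurons in the layer
rng = np.random.default_rng(0)

x = rng.normal(size=n_in)           # input vector x
W = rng.normal(size=(m, n_in))      # weight matrix W: one row per neuron
b = rng.normal(size=m)              # bias vector b: one entry per neuron

z = W @ x + b                       # weighted sums of all m neurons at once
y = sigmoid(z)                      # f applied element-wise: y_i = f(z_i)
print(z.shape, y.shape)             # both (4,)
```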

Multilayer Neural Network#


By stacking multiple similar layers, we can obtain a multilayer neural network.

Forward computation: Calculate the results of each layer in sequence starting from the input.

Input layer -> Hidden layer -> Output layer

Hidden layer: the layers added between the input layer and the output layer.

Note: The output of a hidden layer is usually denoted $\mathbf{h}$; here $\mathbf{h}$ is a vector, obtained mainly by a linear transformation followed by an activation function.
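A minimal sketch of forward computation through a small multilayer network (the layer sizes, random weights, and the choice of ReLU for the hidden layers are all made-up assumptions): each hidden layer applies a linear transformation followed by an activation, and its output is fed into the next layer.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

rng = np.random.default_rng(0)
sizes = [3, 5, 4, 2]                    # input dim, two hidden layers, output dim

# one (W, b) pair per layer
params = [(rng.normal(size=(m, n)), rng.normal(size=m))
          for n, m in zip(sizes[:-1], sizes[1:])]

def forward(x, params):
    h = x
    for W, b in params[:-1]:            # hidden layers: h = f(Wh + b)
        h = relu(W @ h + b)
    W, b = params[-1]                   # output layer: plain linear here
    return W @ h + b

x = rng.normal(size=sizes[0])
print(forward(x, params))               # output vector of length 2
```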

Important!!!:

Role of the activation function:

  1. Introducing nonlinearity: The activation function turns a linear model into a nonlinear one. A neural network without activation functions is just a stack of linear transformations, which is equivalent to a single linear transformation no matter how many layers it has. The activation function prevents a multilayer network from collapsing into a single-layer one and allows the network to learn and represent complex nonlinear relationships.

If a neural network contains only linear operations, a multilayer network can be collapsed into an equivalent single-layer network (see the numeric check after this list).


  2. Increasing the expressive power of the network: By introducing nonlinearity, the activation function allows the neural network to approximate very complex functions. This greatly increases the expressive power and generalization ability of the network, enabling it to handle a wide range of datasets and tasks.
  3. Aiding convergence of gradient descent: Certain activation functions (such as ReLU) can alleviate the vanishing-gradient problem, helping the gradient descent algorithm converge faster. The choice of activation function therefore affects the training efficiency and effectiveness of the model.
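To make point 1 concrete, here is a small numeric check (made-up sizes and random values) that two stacked linear layers with no activation between them are exactly equivalent to one linear layer:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)

# two purely linear "layers" (no activation between them)
W1, b1 = rng.normal(size=(5, 3)), rng.normal(size=5)
W2, b2 = rng.normal(size=(2, 5)), rng.normal(size=2)

two_layers = W2 @ (W1 @ x + b1) + b2

# the same mapping written as a single linear layer
W = W2 @ W1
b = W2 @ b1 + b2
one_layer = W @ x + b

print(np.allclose(two_layers, one_layer))   # True: the stack collapses
```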

Activation Functions#


Sigmoid: maps any input in $(-\infty, +\infty)$ to the range $(0, 1)$.

Tanh: maps any input in $(-\infty, +\infty)$ to the range $(-1, 1)$.

Note: Tanh outputs 0 when the input is 0.

ReLU: For positive inputs, the output remains the same as the original number. For negative inputs, the output is 0.

Softmax: Used for the output layer of multi-class classification problems, it converts the input into a probability distribution. The output values are between 0 and 1, and the sum is 1.

Formula:

$$f(z_i) = \frac{e^{z_i}}{\sum_j e^{z_j}}$$
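A minimal NumPy sketch of the four activation functions above (the test vector is made up; subtracting the maximum in softmax is only for numerical stability and does not change the result):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # (-inf, inf) -> (0, 1)

def tanh(z):
    return np.tanh(z)                 # (-inf, inf) -> (-1, 1), tanh(0) = 0

def relu(z):
    return np.maximum(0.0, z)         # negatives -> 0, positives unchanged

def softmax(z):
    e = np.exp(z - np.max(z))         # shift for numerical stability
    return e / e.sum()                # entries in (0, 1), summing to 1

z = np.array([-2.0, 0.0, 3.0])
print(sigmoid(z), tanh(z), relu(z), softmax(z), softmax(z).sum())
```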

Output Layer#


The output layer is similar in structure to a hidden layer; its exact form depends on the kind of output you want the model to produce.

Linear output: Add another linear layer after the hidden layer to obtain a single value. Mainly used for regression problems.

Sigmoid: as with the sigmoid activation function, first use a regular linear layer to obtain a value $y$, then apply the sigmoid function to compress it into the range 0 to 1. Mainly used for binary classification problems, where the output $y$ represents the probability that the current input belongs to one class and $1-y$ the probability that it belongs to the other class.

Softmax: Mainly used for multi-class classification problems (number of classes > 2). First apply a linear layer to the last hidden layer to obtain a score vector $\mathbf{z}$, then apply the softmax activation function to obtain a probability distribution over the classes, so the model gives the probability of the input belonging to each class.
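A minimal sketch of the three output-layer choices (the hidden vector and all weights are made-up values): a linear head for regression, a sigmoid head for binary classification, and a softmax head for multi-class classification.

```python
import numpy as np

rng = np.random.default_rng(0)
h = rng.normal(size=5)                     # output of the last hidden layer

# linear head: a single value, for regression
w_reg, b_reg = rng.normal(size=5), 0.1
y_reg = w_reg @ h + b_reg

# sigmoid head: P(class 1); 1 - y is P(class 0), for binary classification
w_bin, b_bin = rng.normal(size=5), 0.0
y_bin = 1.0 / (1.0 + np.exp(-(w_bin @ h + b_bin)))

# softmax head: linear layer to a score vector z, then one probability per class
W_cls, b_cls = rng.normal(size=(3, 5)), rng.normal(size=3)
z = W_cls @ h + b_cls
e = np.exp(z - z.max())
probs = e / e.sum()

print(y_reg, y_bin, probs, probs.sum())
```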
