
Artificial Neural Network

Intro

Basic Structure

  • Input nodes: Predictor / Variable values
  • Edges: for incoming value $x$, the outgoing value is $weight \cdot x$
  • Hidden layer nodes: sum the incoming values plus a bias term (i.e. $\sum_{i=1}^{n} weight_i \cdot x_i + bias$), then apply an activation function to the transformed sum.
  • Output nodes: may apply an additional function (e.g. Softmax) before the final prediction.

  • Neurons = nodes

  • Perceptron = single-node binary classifier (only an input and an output layer)

  • Architecture / Topology: number and organization of nodes, layers, connectors

  • Forward Propagation: Input → Output, applying weights, biases, activation functions

  • Backward Propagation: ← Error (pushed back), adjusting parameters

Activation Function

  • ReLU: $f(x) = \max(0, x)$
  • Leaky ReLU: $f(x) = \max(\lambda x, x)$, e.g. $\lambda = 0.1$ (hyperparameter)
  • Sigmoid: $f(x) = \frac{1}{1+e^{-x}}$
  • Tanh: $f(x) = \tanh(x)$
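The four activation functions above can be written directly in NumPy (a minimal sketch; $\lambda = 0.1$ follows the note above):

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def leaky_relu(x, lam=0.1):          # lam is the hyperparameter lambda
    return np.maximum(lam * x, x)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

x = np.array([-2.0, 0.0, 3.0])
print(relu(x))        # [0. 0. 3.]
print(leaky_relu(x))  # negative inputs are scaled by lambda instead of zeroed
```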

Calculation

ANN Calculation

From the diagram above:
f(1) = w_1 * x + b_1
f(2) = w_2 * x + b_2
f(3) = w_3 * f(1) + w_4 * f(2) + b_3
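The three equations above as a tiny forward pass (the weights and biases here are made up for illustration; as in the hand calculation, no activation function is applied):

```python
# Forward pass for the 2-hidden-node, 1-output network above.
def forward(x, w1, b1, w2, b2, w3, w4, b3):
    f1 = w1 * x + b1                  # hidden node 1
    f2 = w2 * x + b2                  # hidden node 2
    return w3 * f1 + w4 * f2 + b3    # output node

# Example with made-up parameters:
print(forward(x=2.0, w1=0.5, b1=0.1, w2=-0.3, b2=0.2, w3=1.0, w4=2.0, b3=0.05))
```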

Why multi output nodes?

For a multi-class problem, each output node corresponds to one predicted class.

Multi Output Example

As this example shows, multiple input features (Age, Salary, ...) map to multiple output classes.

The raw scores $(-1, 2.5, 0)$ obviously cannot serve as the final probabilities, so we add an activation function such as Softmax to convert the raw scores into a probability distribution:
$f(x_i) = \frac{e^{x_i}}{\sum_{j=1}^{n} e^{x_j}}$

$\begin{cases} e^{-1} = 0.3679, \quad e^{2.5} = 12.1825, \quad e^{0} = 1 \\ \text{Normalization factor} = 0.3679 + 12.1825 + 1 = 13.5504 \end{cases}$

$\begin{cases} P(1) = \frac{0.3679}{13.5504} = 0.027 \\ P(2) = \frac{12.1825}{13.5504} = 0.899 \\ P(3) = \frac{1}{13.5504} = 0.074 \end{cases}$

So the final probability distribution is $(0.027, 0.899, 0.074)$.
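The softmax computation above can be verified in a few lines (assuming NumPy):

```python
import numpy as np

def softmax(scores):
    exp = np.exp(scores)      # e^{x_i} for each raw score
    return exp / exp.sum()    # divide by the normalization factor

probs = softmax(np.array([-1.0, 2.5, 0.0]))
print(probs.round(3))  # [0.027 0.899 0.074]
```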

And that is just one sample:
Multi Output Example_2
Across all the results, Cross-Entropy Loss then measures the error: $L = -\log_2(0.90) + [-\log_2(0.53)] + [-\log_2(0.35)] = 0.15 + 0.92 + 1.51 = 2.58$. This loss is propagated back through the model to adjust its parameters.
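The loss calculation above, reproduced with NumPy (log base 2, matching the notes):

```python
import numpy as np

# Probability the model assigned to the TRUE class of each of the three samples.
true_class_probs = np.array([0.90, 0.53, 0.35])

# Cross-entropy loss: sum of -log2(p_true) over the samples.
loss = -np.log2(true_class_probs).sum()
print(round(loss, 2))  # 2.58
```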

Wide vs. Deep

In theory, a sufficiently wide network can fit any function.

In practice, however, deep beats wide:

  1. A wide network tends to memorize rather than generalize; it does not learn the underlying patterns well
  2. The number of neurons required grows exponentially with task complexity
  3. A deep network can automatically learn abstract features
  4. Small hidden layers force the network to perform feature abstraction (Feature Abstraction)
  5. Complex real-world tasks usually have a hierarchical structure (Hierarchical Structure) themselves, e.g. image recognition (pixels → edges → shapes → objects)

Dimensionality Reduction & Compression

Pooling layers

  • Dimensionality reduction: reduce size (summarize the input)
    Pooling Layers
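A minimal sketch of 2x2 max pooling (a common pooling layer), which summarizes each window of the input by its largest value:

```python
import numpy as np

def max_pool2d(x, pool=2, stride=2):
    """Max pooling: keep the largest value in each pool x pool window."""
    h, w = x.shape
    out_h = (h - pool) // stride + 1
    out_w = (w - pool) // stride + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = x[i * stride : i * stride + pool,
                       j * stride : j * stride + pool]
            out[i, j] = window.max()
    return out

x = np.array([[1, 3, 2, 1],
              [4, 6, 5, 0],
              [8, 2, 0, 1],
              [3, 4, 7, 6]])
print(max_pool2d(x))   # the 4x4 input shrinks to 2x2
```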

Autoencoders

  • Compression: compress information
    Autoencoders

As shown above, the Input layer plus the Code forms the Encoder, while the Code plus the Output layer forms the Decoder. The Encoder transforms data into a more computationally friendly format, while the Decoder transforms it back into a human-friendly one.

In practice, though, the most common use of Autoencoders is detecting anomalies by learning what normal looks like:

  1. During training, use only normal data
  2. The encoder learns the normal patterns; the decoder reconstructs the input
  3. The larger the reconstruction error, the more likely the sample is an anomaly

Beyond that, there are other applications as well, such as image denoising (e.g. for medical imaging).
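The anomaly-detection recipe can be sketched with a linear autoencoder, which for a 1-unit code is equivalent to 1-component PCA; the "normal" data here is invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Step 1: "normal" training data lies near the line y = 2x (plus small noise).
x = rng.normal(size=(200, 1))
normal = np.hstack([x, 2 * x]) + rng.normal(scale=0.05, size=(200, 2))

# Step 2: a linear autoencoder with a 1-unit code reduces to PCA: the encoder
# projects onto the top principal direction, the decoder maps back out.
mean = normal.mean(axis=0)
_, _, vt = np.linalg.svd(normal - mean, full_matrices=False)
direction = vt[0]                        # the learned "normal pattern"

def reconstruction_error(sample):
    centered = sample - mean
    code = centered @ direction          # encode: 2D -> 1D
    reconstructed = code * direction     # decode: 1D -> 2D
    return np.linalg.norm(centered - reconstructed)

# Step 3: a point off the learned pattern has a much larger error.
print(reconstruction_error(np.array([1.0, 2.0])))   # small: fits the pattern
print(reconstruction_error(np.array([1.0, -2.0])))  # large: likely anomaly
```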

Convolutional Layers

$\text{Size of feature map} = \frac{N - F + 2P}{S} + 1$
  • N: input size
  • F: filter size
  • P: padding
  • S: stride
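The formula as a helper function (the 32/5/2/1 example values below are assumptions for illustration):

```python
def feature_map_size(n, f, p=0, s=1):
    """Conv output size along one dimension: (N - F + 2P) / S + 1."""
    span = n - f + 2 * p
    assert span % s == 0, "filter placement does not divide evenly"
    return span // s + 1

# e.g. a 32x32 input, 5x5 filter, padding 2, stride 1 -> 32x32 ("same" size)
print(feature_map_size(n=32, f=5, p=2, s=1))  # 32
```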