
Artificial Neural Network

Intro

Basic Structure

  • Input nodes: Predictor / Variable values
  • Edges: for incoming value $x$, the outgoing value is $weight \cdot x$
  • Hidden layer nodes: sum the incoming values plus a bias term (i.e. $\sum_{i=1}^{n} weight_i \cdot x_i + bias$), then apply an activation function to the transformed sum.
  • Output nodes: may apply an additional function (e.g. Softmax) before the final prediction.

  • Neurons = nodes

  • Perceptron = single-node binary classifier (only an input and an output layer)

  • Architecture / Topology: number and organization of nodes, layers, connectors

  • Forward Propagation: Input → Output, applying weights, biases, activation functions

  • Backward Propagation: ← Error (pushed back), adjusting parameters

Activation Function

  • ReLU: $f(x) = \max(0, x)$
  • Leaky ReLU: $f(x) = \max(\lambda x, x)$, e.g. $\lambda = 0.1$ (hyperparameter)
  • Sigmoid: $f(x) = \frac{1}{1+e^{-x}}$
  • Tanh: $f(x) = \tanh(x)$
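The four activation functions above can be written directly in NumPy (a minimal sketch; $\lambda = 0.1$ follows the note above):

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def leaky_relu(x, lam=0.1):          # lam is the hyperparameter lambda
    return np.maximum(lam * x, x)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

x = np.array([-2.0, 0.0, 3.0])
print(relu(x))        # [0. 0. 3.]
print(leaky_relu(x))  # negative inputs are scaled by lambda instead of zeroed
```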

Calculation

ANN Calculation

From the diagram above:
f(1) = w_1 * x + b_1
f(2) = w_2 * x + b_2
f(3) = w_3 * f(1) + w_4 * f(2) + b_3
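The three equations above as a tiny forward pass (the weights and biases here are made up for illustration; as in the hand calculation, no activation function is applied):

```python
# Forward pass for the 2-hidden-node, 1-output network above.
def forward(x, w1, b1, w2, b2, w3, w4, b3):
    f1 = w1 * x + b1                  # hidden node 1
    f2 = w2 * x + b2                  # hidden node 2
    return w3 * f1 + w4 * f2 + b3    # output node

# Example with made-up parameters:
print(forward(x=2.0, w1=0.5, b1=0.1, w2=-0.3, b2=0.2, w3=1.0, w4=2.0, b3=0.05))
```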

Why multi output nodes?

For a multi-class problem, each output node corresponds to one predicted class.

Multi Output Example

As this example shows, multiple input features (Age, Salary, ...) map to multiple output classes.

The raw scores $(-1, 2.5, 0)$ obviously cannot serve as the final probabilities, so we add an activation function such as Softmax to convert the raw scores into a probability distribution:
$f(x_i) = \frac{e^{x_i}}{\sum_{j=1}^{n} e^{x_j}}$

$\begin{cases} e^{-1} = 0.3679, \quad e^{2.5} = 12.1825, \quad e^{0} = 1 \\ \text{Normalization factor} = 0.3679 + 12.1825 + 1 = 13.5504 \end{cases}$

$\begin{cases} P(1) = \frac{0.3679}{13.5504} = 0.027 \\ P(2) = \frac{12.1825}{13.5504} = 0.899 \\ P(3) = \frac{1}{13.5504} = 0.074 \end{cases}$

So the final probability distribution is $(0.027, 0.899, 0.074)$.
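The softmax computation above can be verified in a few lines (assuming NumPy):

```python
import numpy as np

def softmax(scores):
    exp = np.exp(scores)      # e^{x_i} for each raw score
    return exp / exp.sum()    # divide by the normalization factor

probs = softmax(np.array([-1.0, 2.5, 0.0]))
print(probs.round(3))  # [0.027 0.899 0.074]
```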

And that is just one sample:
Multi Output Example_2
Across all the results, Cross-Entropy Loss then measures the error: $L = -\log_2(0.90) + [-\log_2(0.53)] + [-\log_2(0.35)] = 0.15 + 0.92 + 1.51 = 2.58$. This loss is propagated back through the model to adjust its parameters.
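The loss calculation above, reproduced with NumPy (log base 2, matching the notes):

```python
import numpy as np

# Probability the model assigned to the TRUE class of each of the three samples.
true_class_probs = np.array([0.90, 0.53, 0.35])

# Cross-entropy loss: sum of -log2(p_true) over the samples.
loss = -np.log2(true_class_probs).sum()
print(round(loss, 2))  # 2.58
```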

Wide vs. Deep

In theory, a sufficiently wide network can fit any function.

In practice, however, deep beats wide:

  1. A wide network tends to memorize rather than generalize; it does not learn the underlying patterns well
  2. The number of neurons required grows exponentially with task complexity
  3. A deep network can automatically learn abstract features
  4. Small hidden layers force the network to perform feature abstraction (Feature Abstraction)
  5. Complex real-world tasks usually have a hierarchical structure (Hierarchical Structure) themselves, e.g. image recognition (pixels → edges → shapes → objects)

Dimensionality Reduction & Compression

Pooling layers

  • Dimensionality reduction: reduce size (summarize the input)
    Pooling Layers
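A minimal sketch of 2x2 max pooling (a common pooling layer), which summarizes each window of the input by its largest value:

```python
import numpy as np

def max_pool2d(x, pool=2, stride=2):
    """Max pooling: keep the largest value in each pool x pool window."""
    h, w = x.shape
    out_h = (h - pool) // stride + 1
    out_w = (w - pool) // stride + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = x[i * stride : i * stride + pool,
                       j * stride : j * stride + pool]
            out[i, j] = window.max()
    return out

x = np.array([[1, 3, 2, 1],
              [4, 6, 5, 0],
              [8, 2, 0, 1],
              [3, 4, 7, 6]])
print(max_pool2d(x))   # the 4x4 input shrinks to 2x2
```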

Autoencoders

  • Compression: compress information
    Autoencoders

As shown above, the Input layer plus the Code forms the Encoder, while the Code plus the Output layer forms the Decoder. The Encoder transforms data into a more computationally friendly format, while the Decoder transforms it back into a human-friendly one.

In practice, though, the most common use of Autoencoders is detecting anomalies by learning what normal looks like:

  1. During training, use only normal data
  2. The encoder learns the normal patterns; the decoder reconstructs the input
  3. The larger the reconstruction error, the more likely the sample is an anomaly

Beyond that, there are other applications as well, such as image denoising (e.g. for medical imaging).
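The anomaly-detection recipe can be sketched with a linear autoencoder, which for a 1-unit code is equivalent to 1-component PCA; the "normal" data here is invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Step 1: "normal" training data lies near the line y = 2x (plus small noise).
x = rng.normal(size=(200, 1))
normal = np.hstack([x, 2 * x]) + rng.normal(scale=0.05, size=(200, 2))

# Step 2: a linear autoencoder with a 1-unit code reduces to PCA: the encoder
# projects onto the top principal direction, the decoder maps back out.
mean = normal.mean(axis=0)
_, _, vt = np.linalg.svd(normal - mean, full_matrices=False)
direction = vt[0]                        # the learned "normal pattern"

def reconstruction_error(sample):
    centered = sample - mean
    code = centered @ direction          # encode: 2D -> 1D
    reconstructed = code * direction     # decode: 1D -> 2D
    return np.linalg.norm(centered - reconstructed)

# Step 3: a point off the learned pattern has a much larger error.
print(reconstruction_error(np.array([1.0, 2.0])))   # small: fits the pattern
print(reconstruction_error(np.array([1.0, -2.0])))  # large: likely anomaly
```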

Convolutional Layers

$\text{Size of feature map} = \frac{N - F + 2P}{S} + 1$
  • N: input size
  • F: filter size
  • P: padding
  • S: stride
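The formula as a helper function (the 32/5/2/1 example values below are assumptions for illustration):

```python
def feature_map_size(n, f, p=0, s=1):
    """Conv output size along one dimension: (N - F + 2P) / S + 1."""
    span = n - f + 2 * p
    assert span % s == 0, "filter placement does not divide evenly"
    return span // s + 1

# e.g. a 32x32 input, 5x5 filter, padding 2, stride 1 -> 32x32 ("same" size)
print(feature_map_size(n=32, f=5, p=2, s=1))  # 32
```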