# Introduction to Machine Learning

## Collection of formulas

### Quadratic error function

$$ E(\textbf{w})=\frac{1}{2}\sum\limits_{n=1}^{N}(y(x_n, \textbf{w}) - t_n)^2 $$

### Quadratic error function with regularization

$$ E(\textbf{w})=\frac{1}{2}\sum\limits_{n=1}^{N}(y(x_n, \textbf{w}) - t_n)^2 + \frac{\lambda}{2}\left\|\textbf{w}\right\|^2 $$
$$ \lambda := \text{Penalty factor} $$

- "ridge regression"

### Gaussian distribution in 1-D

$$ \mathcal{N}(t|\mu,{\sigma}^{2}) = \frac{1}{\sqrt{2 \pi {\sigma}^{2}}} \text{exp}(-\frac{(t - \mu)^2}{2 {\sigma}^{2}}) $$

### Probabilistic modelling: likelihood in 1-D

$$ p(t | x_0, \textbf{w}, \beta) = \mathcal{N}(t | y(\textbf{w}, x_0), {\sigma}^{2}) $$
$$ \beta = \frac{1}{{\sigma}^{2}}\ \text{(precision)} $$
$$ y(\textbf{w}, x_0) := \text{Output of the model at $x_0$ with parameters \textbf{w}} $$

### Probabilistic modelling: likelihood multidimensional

$$ p(\textbf{t} | \textbf{x}_0, \textbf{w}, {\Sigma}^{-1}) = \mathcal{N}(\textbf{t} | y(\textbf{w}, \textbf{x}_0), {\Sigma}^{-1}) $$
$$ \Sigma := \text{Covariance matrix} $$
$$ y(\textbf{w}, \textbf{x}_0) := \text{Output of the model at $\textbf{x}_0$ with parameters \textbf{w}} $$

### Data-likelihood

- Joint distribution over all data together
- Individual data points are assumed to be independent

$$ L(\textbf{w}) = P(T | X, \textbf{w}, \beta) = \prod\limits_{n=1}^{N} \frac{1}{c} \text{exp}(-\frac{(t_n - y(x_n, \textbf{w}))^2}{2 {\sigma}^{2}}) $$
$$ T := \text{Set of all target points (data)} $$
$$ X := \text{Set of all inputs} $$
$$ c := \text{Normalization constant} $$
$$ N := \text{Number of data points} $$

### Parameter optimization from data-likelihood

$$ \text{maximize}\ L(\textbf{w}) \Leftrightarrow \text{minimize}\ -\log L(\textbf{w}) $$

- The sum-of-squares error is contained in $-\log L(\textbf{w})$; the remaining terms are constant with respect to $\textbf{w}$
- It is therefore sufficient to minimize the sum-of-squares error

$$ \textbf{w}_{\text{ML}} = \text{argmax}_{\textbf{w}}(L(\textbf{w})) = \text{argmin}_{\textbf{w}}(\frac{1}{2} \sum\limits_{n=1}^{N} (y(x_n, \textbf{w}) - t_n)^2) $$

$$ \frac{1}{{\beta}_{\text{ML}}} = \frac{1}{N} \sum\limits_{n=1}^{N} (y(x_n, \textbf{w}_{\text{ML}}) - t_n)^2 $$

### Bayesian inference

$$ P(\textbf{w} | D) = \frac{P(D | \textbf{w}) P(\textbf{w})}{P(D)} $$
$$ P(\textbf{w} | D) := \text{Posterior} $$
$$ P(D | \textbf{w}) := \text{Likelihood (model as before)} $$
$$ P(\textbf{w}) := \text{A-priori probability for \textbf{w} (e.g. a zero-mean Gaussian, assigning higher probability to smaller parameters)} $$

### Parameter optimization for Bayesian approach

$$ \text{maximize}\ P(\textbf{w} | D) \Leftrightarrow \text{minimize}\ -\log P(\textbf{w} | D) $$
$$ \textbf{w}_{\text{MAP}} = \text{argmax}_{\textbf{w}}(P(\textbf{w} | D)) = \text{argmin}_{\textbf{w}}(\frac{1}{2} \sum\limits_{n=1}^{N} (y(x_n, \textbf{w}) - t_n)^2 + \frac{\alpha}{2} \textbf{w}^{T} \textbf{w}) $$
$$ \alpha := \text{Hyperparameter (prior precision), denoting the initial uncertainty about \textbf{w}} $$

### Bayesian predictive distribution

$$ \int p(t | x, \textbf{w}) p(\textbf{w} | \textbf{x}, \textbf{t}) d \textbf{w} = \mathcal{N}(t | m(x), s^{2}(x)) $$
$$ s^{2}(x) := \text{Estimated variance} $$

### Incremental Bayesian learning

1. $P(\textbf{w} | D_1) = \frac{P(D_1 | \textbf{w}) P(\textbf{w})}{P(D_1)}$
2. $P(\textbf{w} | D_2) = \frac{P(D_2 | \textbf{w}) P(\textbf{w} | D_1)}{P(D_2)}$

...

### Linear model

$$ y(\textbf{x}, \textbf{w}) = \sum\limits_{m=0}^{M} w_m {\Phi}_{m}(\textbf{x}) = \textbf{w}^T {\Phi}(\textbf{x}) $$
$$ {\Phi}(\textbf{x}) := \text{Vector of basis functions evaluated at $\textbf{x}$ (one row of the design matrix)} $$

### Design matrix

$$ {\Phi}(X) = 
\begin{bmatrix}
{\Phi}_{1}(\textbf{x}_1) & \cdots & {\Phi}_{M}(\textbf{x}_1) \\
\vdots & \ddots & \vdots \\
{\Phi}_{1}(\textbf{x}_N) & \cdots & {\Phi}_{M}(\textbf{x}_N)
\end{bmatrix}
$$
$$ X := \text{Input matrix (each column is one input vector; its rows are the coordinates of that input)} $$
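
A minimal sketch of how such a design matrix can be assembled, assuming 1-D inputs and polynomial basis functions ${\Phi}_m(x) = x^m$ (the helper name `design_matrix` and the choice of basis are illustrative, not from the lecture):

```python
import numpy as np

# Sketch: build an N x M design matrix for 1-D inputs, assuming Phi_m(x) = x^m.
def design_matrix(x, M):
    # Row n contains Phi_1(x_n), ..., Phi_M(x_n).
    return np.column_stack([x ** m for m in range(1, M + 1)])

x = np.linspace(0, 1, 5)      # five 1-D inputs
Phi = design_matrix(x, M=3)   # shape (5, 3)
```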

### Quadratic error function for linear model

$$ E(\textbf{w}) = \frac{1}{2} \sum\limits_{n=1}^{N} (t_n - \textbf{w}^T {\Phi}(\textbf{x}_n))^2 $$

### Parameter optimization for linear model

$$ \textbf{w}_{ML} = {\Phi}^{\#} \textbf{t} $$
$$ {\Phi}^{\#} = ({\Phi}^T \Phi)^{-1} {\Phi}^T $$
$$ {\Phi}^{\#} := \text{Moore-Penrose pseudo-inverse} $$

- Get the parameters $\textbf{w}$ for the given target vector $\textbf{t}$
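
As a rough illustration (the toy data and the polynomial basis are assumptions of this sketch), the pseudo-inverse solution in NumPy:

```python
import numpy as np

# Toy data: noisy samples of a sine curve (illustrative only).
x = np.linspace(0, 1, 20)
t = np.sin(2 * np.pi * x) + 0.1 * np.random.randn(20)
Phi = np.column_stack([x ** m for m in range(1, 4)])   # design matrix, M = 3

# w_ML = Phi^# t with the Moore-Penrose pseudo-inverse (Phi^T Phi)^-1 Phi^T.
w_ml = np.linalg.pinv(Phi) @ t
```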

### Maximum a-posteriori approach for linear model

$$ E(\textbf{w}) = \frac{1}{2} \sum\limits_{n=1}^{N} (t_n - \textbf{w}^T {\Phi}(\textbf{x}_n))^2 + \frac{\lambda}{2}\left\|\textbf{w}\right\|^2 $$
$$ \textbf{w}_{\text{MAP}} = (\lambda \textbf{I} + {\Phi}^T \Phi)^{-1} {\Phi}^T \textbf{t} $$
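
A small sketch of the closed-form MAP (ridge) solution, assuming `Phi` is a design matrix, `t` the target vector and `lam` the penalty factor $\lambda$:

```python
import numpy as np

# w_MAP = (lambda * I + Phi^T Phi)^-1 Phi^T t  (ridge regression).
def fit_map(Phi, t, lam):
    M = Phi.shape[1]
    return np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)
```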

### Bayesian predictive distribution for the linear model

$$ p(t | x, \textbf{x}, \textbf{t}) = \int p(t | x, \textbf{w}) p(\textbf{w} | \textbf{x}, \textbf{t}) d \textbf{w} = \mathcal{N}(t | m(x), s^2(x)) $$
$$ m(x) = \beta {\Phi}(x)^T \textbf{S} \sum\limits_{n=1}^{N} {\Phi}(x_n) t_n $$
$$ s^2(x) = {\beta}^{-1} + {\Phi}(x)^T \textbf{S} {\Phi}(x) $$
$$ \textbf{S}^{-1} = \alpha \textbf{I} + \beta \sum\limits_{n=1}^{N} {\Phi}(x_n) {\Phi}(x_n)^T $$
$$ \textbf{S} := \text{Covariance matrix} $$
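
A sketch of the predictive mean and variance, assuming `phi` returns the basis-function vector ${\Phi}(x)$ and `alpha`, `beta` are the prior and noise precisions (function and parameter names are illustrative):

```python
import numpy as np

def predictive(phi, X_train, t_train, x_query, alpha, beta):
    Phis = np.array([phi(x) for x in X_train])            # N x M matrix of Phi(x_n)
    S = np.linalg.inv(alpha * np.eye(Phis.shape[1]) + beta * Phis.T @ Phis)
    phi_q = phi(x_query)
    mean = beta * phi_q @ S @ (Phis.T @ t_train)           # m(x)
    var = 1.0 / beta + phi_q @ S @ phi_q                   # s^2(x)
    return mean, var
```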

### Gradient descent

- Update parameters $\textbf{w}$ iteratively

$$ \textbf{w}_{k+1} = \textbf{w}_k - \eta \frac{dE(\textbf{w})}{d \textbf{w}} $$
$$ \eta := \text{Learning rate} $$
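
A sketch of batch gradient descent on the quadratic error of the linear model (the step count and learning rate are arbitrary choices of this example):

```python
import numpy as np

# E(w) = 1/2 * sum_n (t_n - w^T Phi(x_n))^2  =>  dE/dw = -Phi^T (t - Phi w)
def gradient_descent(Phi, t, eta=0.01, steps=1000):
    w = np.zeros(Phi.shape[1])
    for _ in range(steps):
        grad = -Phi.T @ (t - Phi @ w)
        w = w - eta * grad
    return w
```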

### Weighted least squares error function

$$ E(\textbf{w}) = \frac{1}{2} \sum\limits_{n=1}^{N} d_n (t_n - \textbf{w}^T \textbf{x}_n)^2 $$
$$ d_n := \text{Weight for term $n$} $$

### Weighted least squares parameter optimization

$$ \textbf{w}_{\text{ML}} = ({\Phi}^T D \Phi)^{-1} {\Phi}^T D T = (X^T D X)^{-1} X^T D T $$
$$ D := \text{Diagonal weight matrix with $D_{nn} = d_n$} $$
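
A sketch of the weighted least-squares solution, assuming the per-sample weights $d_n$ are passed as a vector `d`:

```python
import numpy as np

# w = (Phi^T D Phi)^-1 Phi^T D t with the diagonal weight matrix D_nn = d_n.
def fit_weighted_ls(Phi, t, d):
    D = np.diag(d)
    return np.linalg.solve(Phi.T @ D @ Phi, Phi.T @ D @ t)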

### Locally weighted regression (unified model)

$$ t(\textbf{x}) = \sum\limits_{e=1}^{E} {\Phi}_e (\textbf{x}, {\theta}_e) (\textbf{w}_e^T \textbf{x} + b_e) $$
$$ {\theta}_e = ({\mu}_e, {\Sigma}_e) := \text{Gaussian parameters} $$
$$ {\Phi}(\textbf{x}, \theta) = \mathcal{N}(\textbf{x} | \mu, \Sigma) $$
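
A sketch that evaluates the unified model exactly as written above, assuming the $E$ local models are given as lists `mus`, `Sigmas`, `W`, `b` (names chosen for this example) and using SciPy for the Gaussian weights:

```python
import numpy as np
from scipy.stats import multivariate_normal

# t(x) = sum_e Phi_e(x, theta_e) * (w_e^T x + b_e) with Gaussian weights Phi_e.
def lwr_predict(x, mus, Sigmas, W, b):
    pred = 0.0
    for mu, Sigma, w_e, b_e in zip(mus, Sigmas, W, b):
        phi_e = multivariate_normal.pdf(x, mean=mu, cov=Sigma)  # Phi_e(x, theta_e)
        pred += phi_e * (w_e @ x + b_e)                          # local linear model
    return pred
```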

### Classification (K=2)

$$ \textbf{w}^T \textbf{x} + w_0 = 0\ \text{(Hyperplane)} $$
$$ \textbf{w}^T \textbf{x} + w_0 < 0 \Rightarrow \text{Class 0} $$
$$ \textbf{w}^T \textbf{x} + w_0 > 0 \Rightarrow \text{Class 1} $$
$$ \textbf{w}^T \textbf{x} + w_0 = 0 \Rightarrow \text{On boundary} $$

### Fisher discriminant

- Class means: $\textbf{m}_1 = \frac{1}{N_1} \sum\limits_{n \in C_1} \textbf{x}_n$, $\textbf{m}_2 = \frac{1}{N_2} \sum\limits_{n \in C_2} \textbf{x}_n$
- Projected within-class variance: $s_k^2 = \sum\limits_{n \in C_k} (\textbf{w}^T \textbf{x}_n - \textbf{w}^T \textbf{m}_k)^2 = \sum\limits_{n \in C_k} (\textbf{w}^T \textbf{x}_n - m'_k)^2$
- Optimization (see the sketch below): $\textbf{w} = \textbf{S}_{W}^{-1}(\textbf{m}_2 - \textbf{m}_1)$
- Within-class scatter matrix: $\textbf{S}_{W} = \sum\limits_{n \in C_1} (\textbf{x}_n - \textbf{m}_1) (\textbf{x}_n - \textbf{m}_1)^T + \sum\limits_{n \in C_2} (\textbf{x}_n - \textbf{m}_2) (\textbf{x}_n - \textbf{m}_2)^T$
- Classification: $\textbf{w}^T (\textbf{x} - \textbf{m}) > 0 \Rightarrow C_1$, $\textbf{w}^T (\textbf{x} - \textbf{m}) < 0 \Rightarrow C_2$
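
A sketch of the Fisher direction, assuming the samples of the two classes are given as arrays `X1` and `X2` of shape $(N_k, D)$:

```python
import numpy as np

# w = S_W^-1 (m2 - m1) with the within-class scatter matrix S_W.
def fisher_direction(X1, X2):
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S_w = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
    return np.linalg.solve(S_w, m2 - m1)
```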

### Bayesian classification

$$ p(C_k | \textbf{x}) = \frac{p(\textbf{x} | C_k) p(C_k)}{\sum\limits_{j} p(\textbf{x} | C_j) p(C_j)} = \frac{\text{exp}(a_k)}{\sum\limits_{j} \text{exp}(a_j)} $$
$$ a_j = \ln(p(\textbf{x} | C_j)\, p(C_j)) $$
$$ j \in \{1, ..., K\} $$

### Bayesian classification with Gaussian class-conditional densities

$$ p(\textbf{x} | C_k) = \frac{1}{(2 \pi)^{\frac{D}{2}}} \frac{1}{{| \Sigma |}^{\frac{1}{2}}} \text{exp}(- \frac{1}{2} (\textbf{x} - {\mu}_k)^T {\Sigma}^{-1} (\textbf{x} - {\mu}_k)) $$
$$ p(C_1 | \textbf{x}) = \sigma (\ln \frac{p(\textbf{x} | C_1) p(C_1)}{p(\textbf{x} | C_2) p(C_2)}) = \sigma (\textbf{w}^T \textbf{x} + w_0) $$
$$ \sigma(a) := \text{Logistic sigmoid}\ \frac{1}{1 + \text{exp}(-a)} $$
$$ \textbf{w} = {\Sigma}^{-1} ({\mu}_1 - {\mu}_2) $$
$$ w_0 = - \frac{1}{2} {\mu}_1^T {\Sigma}^{-1} {\mu}_1 + \frac{1}{2} {\mu}_2^T {\Sigma}^{-1} {\mu}_2 + \ln \frac{p(C_1)}{p(C_2)} $$
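
A sketch of the resulting posterior for class $C_1$, assuming the class means, the shared covariance and the priors are known (parameter names are illustrative):

```python
import numpy as np

# p(C1 | x) = sigma(w^T x + w0) for two Gaussian classes with shared covariance.
def posterior_c1(x, mu1, mu2, Sigma, p1, p2):
    Sigma_inv = np.linalg.inv(Sigma)
    w = Sigma_inv @ (mu1 - mu2)
    w0 = (-0.5 * mu1 @ Sigma_inv @ mu1
          + 0.5 * mu2 @ Sigma_inv @ mu2
          + np.log(p1 / p2))
    return 1.0 / (1.0 + np.exp(-(w @ x + w0)))   # logistic sigmoid
```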

### Generalization algorithm

1. Initialize $h$ with $< \emptyset , \emptyset , ..., \emptyset >$
2. For all positive examples $x$ (sketched below):  
2.1. For all constraints $a_i$:  
2.1.1. If $a_i$ is fulfilled by $x$: keep $a_i$  
2.1.2. Else: generalize $a_i$
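
A sketch of this generalization step for conjunctive hypotheses, assuming examples are tuples of attribute values and `'?'` marks a fully generalized constraint (this representation is an assumption of the sketch):

```python
# Generalize a hypothesis over all positive examples (Find-S style).
def generalize(positive_examples):
    h = None                        # most specific hypothesis <0, 0, ..., 0>
    for x in positive_examples:
        if h is None:
            h = list(x)             # first positive example fixes all constraints
            continue
        for i, a_i in enumerate(h):
            if a_i != x[i]:         # constraint a_i is not fulfilled by x
                h[i] = '?'          # generalize it to accept any value
    return h

# Usage: generalize([('sunny', 'warm'), ('sunny', 'cold')]) -> ['sunny', '?']
```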

### Specialization algorithm

1. Initialize $h$ with $< ?, ?, ..., ? >$
2. For all negative examples $x$:  
2.1. For all constraints $a_i$:  
2.1.1. If $a_i$ is fulfilled by $x$: keep $a_i$  
2.1.2. Else: replace $a_i$ with the most specific constraint

### Candidate elimination

- Initialize $G$ to the set of maximally general hypotheses in $H$
- Initialize $S$ to the set of maximally specific hypotheses in $H$

1. For each training instance $x$:
2. If $x$ is a positive example  
2.1. Remove from $G$ any hypothesis inconsistent with $x$  
2.2. For each hypothesis $s$ in $S$ that is not consistent with $x$  
2.2.1. Remove $s$ from $S$  
2.2.2. Add to $S$ all minimal generalizations $h$ of $s$ such that $h$ is consistent with $x$, and some member of $G$ is more general than $h$  
2.2.3. Remove from $S$ any hypothesis that is more general than another hypothesis in $S$  
3. If $x$ is a negative example  
3.1. Remove from $S$ any hypothesis inconsistent with $x$  
3.2. For each hypothesis $g$ in $G$ that is not consistent with $x$  
3.2.1. Remove $g$ from $G$  
3.2.2. Add to $G$ all minimal specializations $h$ of $g$ such that $h$ is consistent with $x$, and some member of $S$ is more specific than $h$  
3.2.3. Remove from $G$ any hypothesis that is less general than another hypothesis in $G$

### K-Nearest neighbours

$$ f(x_q) = \frac{1}{k} \sum\limits_{i=1}^{k} f(\hat{x}_i) $$
$$ \hat{x}_i := \text{$i$-th nearest neighbour of $x_q$} $$

### Weighted K-Nearest neighbours

$$ f(x_q) = \frac{\sum\limits_{i=1}^{k} w_i f(x_i)}{\sum\limits_{i=1}^{k} w_i} $$
$$ w_i = d(x_i, x_q)^{-1} $$
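
A sketch covering both the plain and the weighted variant, assuming `X` holds the training inputs row-wise and `t` the corresponding target values:

```python
import numpy as np

def knn_predict(x_q, X, t, k, weighted=False):
    dist = np.linalg.norm(X - x_q, axis=1)    # distances d(x_i, x_q)
    idx = np.argsort(dist)[:k]                # the k nearest neighbours
    if not weighted:
        return t[idx].mean()                  # plain k-NN average
    w = 1.0 / dist[idx]                       # w_i = d(x_i, x_q)^-1 (assumes x_q not in X)
    return np.sum(w * t[idx]) / np.sum(w)
```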

### Gaussian kernel regression

$$ f(x_q) = \sum\limits_{i} y_i \frac{K_{\sigma}(x_i - x_q)}{\sum\limits_{j} K_{\sigma} (x_j - x_q)} $$
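
A sketch for 1-D inputs with a Gaussian kernel of bandwidth `sigma` (the parameter name is an assumption):

```python
import numpy as np

# Nadaraya-Watson estimate: weighted average of targets y with Gaussian kernel weights.
def kernel_regression(x_q, x, y, sigma):
    K = np.exp(-(x - x_q) ** 2 / (2 * sigma ** 2))   # K_sigma(x_i - x_q)
    return np.sum(y * K) / np.sum(K)
```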

### Vector quantization

1. Choose the number of prototypes $K$
2. Initialize the positions of all $K$ prototypes
3. Iterate over the data (see the sketch below):  
3.1. Choose a random $\textbf{x}_n$  
3.2. Find the nearest prototype $\textbf{w}_c$  
3.3. Move the prototype towards the data point: $\textbf{w}_c^{\text{new}} = \textbf{w}_c^{\text{old}} + \eta (\textbf{x}_n - \textbf{w}_c^{\text{old}})$
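
A sketch of the online update loop, assuming the prototypes are initialized on randomly chosen data points and a fixed number of update steps (both choices are assumptions of this example):

```python
import numpy as np

def vector_quantization(X, K, eta=0.1, steps=1000, rng=np.random.default_rng(0)):
    W = X[rng.choice(len(X), K, replace=False)].astype(float)  # init prototypes
    for _ in range(steps):
        x = X[rng.integers(len(X))]                            # 3.1 random data point
        c = np.argmin(np.linalg.norm(W - x, axis=1))           # 3.2 nearest prototype
        W[c] += eta * (x - W[c])                               # 3.3 move towards x_n
    return W
```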

### K-Means algorithm

1. Choose the number of prototypes $K$
2. Initialize prototypes $\textbf{w}_k$
3. Iterate until convergence (see the sketch below):  
3.1. Minimize $J$ with respect to $r_{nk}$ while keeping $\textbf{w}_k$ fixed (assign each point to its nearest prototype)  
3.2. Minimize $J$ with respect to $\textbf{w}_k$ while keeping $r_{nk}$ fixed (move each prototype to the mean of its assigned points)

$$ J = \sum\limits_{n=1}^{N} \sum\limits_{k=1}^{K} r_{nk} \left\|\textbf{x}_n - \textbf{w}_k\right\|^2 $$
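
A sketch of the alternating minimization of $J$, assuming the prototypes are initialized on random data points and a fixed number of iterations (illustrative choices):

```python
import numpy as np

def k_means(X, K, iters=100, rng=np.random.default_rng(0)):
    W = X[rng.choice(len(X), K, replace=False)].astype(float)      # init prototypes
    for _ in range(iters):
        # 3.1: assignments r_nk (index of the nearest prototype for each point)
        r = np.argmin(np.linalg.norm(X[:, None, :] - W[None, :, :], axis=2), axis=1)
        # 3.2: move each prototype to the mean of its assigned points
        for k in range(K):
            if np.any(r == k):
                W[k] = X[r == k].mean(axis=0)
    return W, r
```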

### Gaussian mixture model

- Number of Gaussians $K$
- Initialize priors ${\pi}_k$, centers $\textbf{w}_k$, covariances ${\Sigma}_k$ (by e.g. K-means)

1. E-Step: For all data points $\textbf{x}_n$, update the assignment probabilities  
$$P_k(\textbf{x}_n) = \frac{{\pi}_k \mathcal{N}(\textbf{x}_n | \textbf{w}_k, {\Sigma}_k)}{\sum\limits_{i=1}^K {\pi}_i \mathcal{N}(\textbf{x}_n | \textbf{w}_i, {\Sigma}_i)}$$
2. M-Step: Update all $K$ prior probabilities, cluster centers and covariances  
$${\pi}_k = \frac{1}{N} \sum\limits_{n=1}^{N} P_k(\textbf{x}_n)$$
$$ \textbf{w}_k = \frac{1}{N {\pi}_k} \sum\limits_{n=1}^{N} P_k(\textbf{x}_n) \textbf{x}_n $$
$$ {\Sigma}_k = \frac{1}{N {\pi}_k} \sum\limits_{n=1}^{N} P_k(\textbf{x}_n) (\textbf{x}_n - \textbf{w}_k) (\textbf{x}_n - \textbf{w}_k)^T $$
3. Repeat E- and M-step until convergence (see the sketch below)
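
A sketch of a single E/M pass following the update equations above, using SciPy for the Gaussian densities; `pis`, `W`, `Sigmas` hold the $K$ priors, centers and covariances (names are illustrative):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_step(X, pis, W, Sigmas):
    N, K = len(X), len(pis)
    # E-step: assignment probabilities P_k(x_n), shape (N, K)
    P = np.column_stack([pis[k] * multivariate_normal.pdf(X, mean=W[k], cov=Sigmas[k])
                         for k in range(K)])
    P /= P.sum(axis=1, keepdims=True)
    # M-step: update priors, centers and covariances
    Nk = P.sum(axis=0)                                   # N * pi_k for each cluster
    pis = Nk / N
    W = (P.T @ X) / Nk[:, None]
    Sigmas = [(P[:, k, None] * (X - W[k])).T @ (X - W[k]) / Nk[k] for k in range(K)]
    return pis, W, Sigmas
```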

## Definitions

### Likelihood

Function describing the joint probability of the data $\textbf{x}$ as a function of the parameters $\textbf{w}$ of the statistical model.

### Bayesian approach

Probabilistic model for the parameters themselves, not only for the observed data.

### Classification approaches

1. Find a discriminant function $f : \textbf{x} \rightarrow \{0,1\}$
2. Bayesian approach  
2.1 Model class-conditional likelihood $p(x | C_k)$  
2.2 Model an a-priori probability distribution $p(C_k)$  
2.3 Calculate the posterior $p(C_k | x)$
3. Direct posterior modelling $p(C_k | x)$

### Concept

The set of properties by which membership of a class is identified.

### Hypothesis

A candidate function that decides whether an input belongs to the class or not.