# Introduction to Machine Learning

## Collection of formulas

### Quadratic error function

$$ E(\textbf{w})=\frac{1}{2}\sum\limits_{n=1}^{N}(y(x_n, \textbf{w}) - t_n)^2 $$

### Quadratic error function with regularization

$$ E(\textbf{w})=\frac{1}{2}\sum\limits_{n=1}^{N}(y(x_n, \textbf{w}) - t_n)^2 + \frac{\lambda}{2}\left\|\textbf{w}\right\|^2 $$

$$ \lambda := \text{Penalty factor} $$

- Also known as "ridge regression"

### Gaussian distribution in 1-D

$$ \mathcal{N}(t|\mu,{\sigma}^{2}) = \frac{1}{\sqrt{2 \pi {\sigma}^{2}}} \text{exp}\left(-\frac{(t - \mu)^2}{2 {\sigma}^{2}}\right) $$

### Probabilistic modelling: likelihood in 1-D

$$ p(t | x_0, \textbf{w}, \beta) = \mathcal{N}(t | y(\textbf{w}, x_0), {\sigma}^{2}) $$

$$ \beta = \frac{1}{{\sigma}^{2}}\ (\text{precision}) $$

$$ y(\textbf{w}, x_0) := \text{Output of the model at $x_0$ with parameters $\textbf{w}$} $$

### Probabilistic modelling: likelihood multidimensional

$$ p(\textbf{t} | \textbf{x}_0, \textbf{w}, {\Sigma}^{-1}) = \mathcal{N}(\textbf{t} | y(\textbf{w}, \textbf{x}_0), \Sigma) $$

$$ \Sigma := \text{Covariance matrix} $$

$$ y(\textbf{w}, \textbf{x}_0) := \text{Output of the model at $\textbf{x}_0$ with parameters $\textbf{w}$} $$

### Data likelihood

- Joint distribution over all data together
- Individual data points are assumed to be independent

$$ L(\textbf{w}) = P(T | X, \textbf{w}, \beta) = \prod\limits_{n=1}^{N} \frac{1}{c} \text{exp}\left(-\frac{(t_n - y(x_n, \textbf{w}))^2}{2 {\sigma}^{2}}\right) $$

$$ T := \text{Set of all target points (data)} $$

$$ X := \text{Set of all inputs} $$

$$ c := \text{Normalization constant} $$

$$ N := \text{Number of all input data points} $$

### Parameter optimization from data likelihood

$$ \text{maximize}\ L(\textbf{w}) \Leftrightarrow \text{minimize}\ -\text{log}\ L(\textbf{w}) $$

- The sum-of-squares error is contained in $-\text{log}\ L(\textbf{w})$; the remaining terms are constants with respect to $\textbf{w}$
- It is therefore sufficient to minimize the sum-of-squares error

$$ \textbf{w}_{\text{ML}} = \text{argmax}_{\textbf{w}}(L(\textbf{w})) = \text{argmin}_{\textbf{w}}\left(\frac{1}{2} \sum\limits_{n=1}^{N} (y(x_n, \textbf{w}) - t_n)^2\right) $$

$$ \frac{1}{{\beta}_{\text{ML}}} = \frac{1}{N} \sum\limits_{n=1}^{N} (y(x_n, \textbf{w}_{\text{ML}}) - t_n)^2 $$

### Bayesian inference

$$ P(\textbf{w} | D) = \frac{P(D | \textbf{w}) P(\textbf{w})}{P(D)} $$

$$ P(\textbf{w} | D) := \text{Posterior} $$

$$ P(D | \textbf{w}) := \text{Likelihood (model as before)} $$

$$ P(\textbf{w}) := \text{A-priori probability for $\textbf{w}$ (higher probability for smaller parameters)} $$

### Parameter optimization for the Bayesian approach

$$ \text{maximize}\ P(\textbf{w} | D) \Leftrightarrow \text{minimize}\ -\text{log}\ P(\textbf{w} | D) $$

$$ \textbf{w}_{\text{MAP}} = \text{argmax}_{\textbf{w}}(P(\textbf{w} | D)) = \text{argmin}_{\textbf{w}}\left(\frac{1}{2} \sum\limits_{n=1}^{N} (y(x_n, \textbf{w}) - t_n)^2 + \frac{\alpha}{2} \textbf{w}^{T} \textbf{w}\right) $$

$$ \alpha := \text{Hyperparameter (precision of the prior over $\textbf{w}$), encoding the initial uncertainty} $$

### Bayesian predictive distribution

$$ p(t | x, \textbf{x}, \textbf{t}) = \int p(t | x, \textbf{w})\, p(\textbf{w} | \textbf{x}, \textbf{t})\, d \textbf{w} = \mathcal{N}(t | m(x), s^{2}(x)) $$

$$ s^{2}(x) := \text{Estimated variance} $$

### Incremental Bayesian learning

1. $P(\textbf{w} | D_1) = \frac{P(D_1 | \textbf{w}) P(\textbf{w})}{P(D_1)}$
2. $P(\textbf{w} | D_2) = \frac{P(D_2 | \textbf{w}) P(\textbf{w} | D_1)}{P(D_2)}$
3. ...
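To make the maximum-likelihood formulas above concrete, here is a minimal Python/numpy sketch (the cubic polynomial model and the synthetic sinusoidal data are illustrative assumptions, not from the lecture): it obtains $\textbf{w}_{\text{ML}}$ by minimizing the sum-of-squares error and estimates $\frac{1}{\beta_{\text{ML}}}$ as the mean squared residual.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: t_n = sin(2*pi*x_n) + Gaussian noise (illustrative choice)
N = 25
x = rng.uniform(0.0, 1.0, size=N)
t = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, size=N)

# Polynomial model y(x, w) = sum_m w_m * x^m with M = 3
M = 3
Phi = np.vander(x, M + 1, increasing=True)   # N x (M+1) design matrix

# w_ML = argmin_w 1/2 * sum_n (y(x_n, w) - t_n)^2  (ordinary least squares)
w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)

# 1/beta_ML = 1/N * sum_n (y(x_n, w_ML) - t_n)^2
residuals = Phi @ w_ml - t
inv_beta_ml = np.mean(residuals ** 2)

print("w_ML =", w_ml)
print("1/beta_ML (estimated noise variance) =", inv_beta_ml)
```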
### Linear model

$$ y(\textbf{x}, \textbf{w}) = \sum\limits_{m=0}^{M} w_m {\Phi}_{m}(\textbf{x}) = \textbf{w}^T {\Phi}(\textbf{x}) $$

$$ {\Phi}(\textbf{x}) := \text{Vector of basis functions evaluated at $\textbf{x}$} $$

### Design matrix

$$ {\Phi}(X) = \begin{bmatrix} {\Phi}_{1}(\textbf{x}_1) & \cdots & {\Phi}_{M}(\textbf{x}_1) \\ \vdots & \ddots & \vdots \\ {\Phi}_{1}(\textbf{x}_N) & \cdots & {\Phi}_{M}(\textbf{x}_N) \end{bmatrix} $$

$$ X := \text{Input matrix (each column is one input vector, each row one input coordinate)} $$

### Quadratic error function for linear model

$$ E(\textbf{w}) = \frac{1}{2} \sum\limits_{n=1}^{N} (t_n - \textbf{w}^T {\Phi}(\textbf{x}_n))^2 $$

### Parameter optimization for linear model

$$ \textbf{w}_{\text{ML}} = {\Phi}^{\#} \textbf{t} $$

$$ {\Phi}^{\#} = ({\Phi}^T \Phi)^{-1} {\Phi}^T $$

$$ {\Phi}^{\#} := \text{Moore-Penrose pseudo-inverse} $$

- Gives the parameters $\textbf{w}$ for the given target vector $\textbf{t}$

### Maximum a-posteriori approach for linear model

$$ E(\textbf{w}) = \frac{1}{2} \sum\limits_{n=1}^{N} (t_n - \textbf{w}^T {\Phi}(\textbf{x}_n))^2 + \frac{\lambda}{2}\left\|\textbf{w}\right\|^2 $$

$$ \textbf{w}_{\text{MAP}} = (\lambda \textbf{I} + {\Phi}^T \Phi)^{-1} {\Phi}^T \textbf{t} $$

### Bayesian predictive distribution for linear model

$$ p(t | x, \textbf{x}, \textbf{t}) = \int p(t | x, \textbf{w})\, p(\textbf{w} | \textbf{x}, \textbf{t})\, d \textbf{w} = \mathcal{N}(t | m(x), s^2(x)) $$

$$ m(x) = \beta {\Phi}(x)^T \textbf{S} \sum\limits_{n=1}^{N} {\Phi}(x_n) t_n $$

$$ s^2(x) = {\beta}^{-1} + {\Phi}(x)^T \textbf{S} {\Phi}(x) $$

$$ \textbf{S}^{-1} = \alpha \textbf{I} + \beta \sum\limits_{n=1}^{N} {\Phi}(x_n) {\Phi}(x_n)^T $$

$$ \textbf{S} := \text{Covariance matrix} $$

### Gradient descent

- Update parameters $\textbf{w}$ iteratively

$$ \textbf{w}_{k+1} = \textbf{w}_k - \eta \frac{dE(\textbf{w})}{d \textbf{w}} $$

$$ \eta := \text{Learning rate} $$

### Weighted least squares error function

$$ E(\textbf{w}) = \frac{1}{2} \sum\limits_{n=1}^{N} d_n (t_n - \textbf{w}^T \textbf{x}_n)^2 $$

$$ d_n := \text{Weight for term $n$} $$

### Weighted least squares parameter optimization

$$ \textbf{w}_{\text{ML}} = ({\Phi}^T D \Phi)^{-1} {\Phi}^T D \textbf{t} = (X^T D X)^{-1} X^T D \textbf{t} $$

$$ D := \text{Diagonal weight matrix with $D_{nn} = d_n$} $$

### Locally weighted regression (unified model)

$$ t(\textbf{x}) = \sum\limits_{e=1}^{E} {\Phi}_e (\textbf{x}, {\theta}_e) (\textbf{w}_e^T \textbf{x} + b_e) $$

$$ {\theta}_e = ({\mu}_e, {\Sigma}_e) := \text{Gaussian parameters} $$

$$ {\Phi}(\textbf{x}, \theta) = \mathcal{N}(\textbf{x} | \mu, \Sigma) $$

### Classification (K=2)

$$ \textbf{w}^T \textbf{x} + w_0 = 0\ \text{(Hyperplane)} $$

$$ \textbf{w}^T \textbf{x} + w_0 < 0 \Rightarrow \text{Class 0} $$

$$ \textbf{w}^T \textbf{x} + w_0 > 0 \Rightarrow \text{Class 1} $$

$$ \textbf{w}^T \textbf{x} + w_0 = 0 \Rightarrow \text{On boundary} $$

### Fisher discriminant

- Class means: $\textbf{m}_1 = \frac{1}{N_1} \sum\limits_{n \in C_1} \textbf{x}_n$, $\textbf{m}_2 = \frac{1}{N_2} \sum\limits_{n \in C_2} \textbf{x}_n$
- Intra-class variance: $s_k^2 = \sum\limits_{n \in C_k} (\textbf{w}^T \textbf{x}_n - \textbf{w}^T \textbf{m}_k)^2 = \sum\limits_{n \in C_k} (\textbf{w}^T \textbf{x}_n - m'_k)^2$ with $m'_k = \textbf{w}^T \textbf{m}_k$
- Optimization: $\textbf{w} = \textbf{S}_{W}^{-1}(\textbf{m}_2 - \textbf{m}_1)$
- Within-class scatter matrix: $\textbf{S}_{W} = \sum\limits_{n \in C_1} (\textbf{x}_n - \textbf{m}_1) (\textbf{x}_n - \textbf{m}_1)^T + \sum\limits_{n \in C_2} (\textbf{x}_n - \textbf{m}_2) (\textbf{x}_n - \textbf{m}_2)^T$
- Classification: $\textbf{w}^T (\textbf{x} - \textbf{m}) > 0 \Rightarrow C_1$, $\textbf{w}^T (\textbf{x} - \textbf{m}) < 0 \Rightarrow C_2$
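A minimal numpy sketch of the Fisher discriminant above, using synthetic 2-D data (the data and all names are illustrative assumptions). Since the sheet does not define the threshold point $\textbf{m}$, the sketch classifies by comparing the projection of a new point with the projected class means, which amounts to thresholding at the midpoint of the projected means:

```python
import numpy as np

rng = np.random.default_rng(1)

# Two synthetic 2-D classes (illustrative data, not from the lecture)
X1 = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(30, 2))   # samples of C_1
X2 = rng.normal(loc=[2.0, 1.5], scale=0.5, size=(30, 2))   # samples of C_2

# Class means m_1, m_2
m1, m2 = X1.mean(axis=0), X2.mean(axis=0)

# Within-class scatter matrix S_W
S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)

# Fisher direction w = S_W^{-1} (m_2 - m_1)
w = np.linalg.solve(S_W, m2 - m1)

# Classify a new point by projecting onto w and comparing with the
# projected class means (equivalent to thresholding w^T(x - m) at the
# midpoint of the projected means, up to the sign convention of w)
x_new = np.array([0.2, -0.1])
proj = w @ x_new
label = "C_1" if abs(proj - w @ m1) < abs(proj - w @ m2) else "C_2"
print("predicted class:", label)
```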
### Bayesian classification

$$ p(C_k | \textbf{x}) = \frac{p(\textbf{x} | C_k) p(C_k)}{\sum\limits_{j} p(\textbf{x} | C_j) p(C_j)} = \frac{\text{exp}(a_k)}{\sum\limits_{j} \text{exp}(a_j)} $$

$$ a_j = \text{ln}(p(\textbf{x} | C_j) p(C_j)) $$

$$ j \in \{1, ..., K\} $$

### Bayesian classification with Gaussian class-conditional densities

$$ p(\textbf{x} | C_k) = \frac{1}{(2 \pi)^{\frac{D}{2}}} \frac{1}{{| \Sigma |}^{\frac{1}{2}}} \text{exp}\left(- \frac{1}{2} (\textbf{x} - {\mu}_k)^T {\Sigma}^{-1} (\textbf{x} - {\mu}_k)\right) $$

$$ p(C_1 | \textbf{x}) = \sigma \left(\text{ln} \frac{p(\textbf{x} | C_1) p(C_1)}{p(\textbf{x} | C_2) p(C_2)}\right) = \sigma (\textbf{w}^T \textbf{x} + w_0) $$

$$ \textbf{w} = {\Sigma}^{-1} ({\mu}_1 - {\mu}_2) $$

$$ w_0 = - \frac{1}{2} {\mu}_1^T {\Sigma}^{-1} {\mu}_1 + \frac{1}{2} {\mu}_2^T {\Sigma}^{-1} {\mu}_2 + \text{ln} \frac{p(C_1)}{p(C_2)} $$

### Generalization algorithm

1. Initialize $h$ with $< \emptyset , \emptyset , ..., \emptyset >$
2. For each positive example $x$:
   2.1. For each constraint $a_i$:
      2.1.1. If $a_i$ is fulfilled by $x$: OK
      2.1.2. Else: generalize $a_i$

### Specialization algorithm

1. Initialize $h$ with $< ?, ?, ..., ? >$
2. For each negative example $x$:
   2.1. For each constraint $a_i$:
      2.1.1. If $a_i$ is fulfilled by $x$: replace $a_i$ with the most special constraint that excludes $x$
      2.1.2. Else: OK

### Candidate elimination

- Initialize $G$ to the set of maximally general hypotheses in $H$
- Initialize $S$ to the set of maximally specific hypotheses in $H$

1. For each training instance $x$:
2. If $x$ is a positive example:
   2.1. Remove from $G$ any hypothesis inconsistent with $x$
   2.2. For each hypothesis $s$ in $S$ that is not consistent with $x$:
      2.2.1. Remove $s$ from $S$
      2.2.2. Add to $S$ all minimal generalizations $h$ of $s$ such that $h$ is consistent with $x$, and some member of $G$ is more general than $h$
      2.2.3. Remove from $S$ any hypothesis that is more general than another hypothesis in $S$
3. If $x$ is a negative example:
   3.1. Remove from $S$ any hypothesis inconsistent with $x$
   3.2. For each hypothesis $g$ in $G$ that is not consistent with $x$:
      3.2.1. Remove $g$ from $G$
      3.2.2. Add to $G$ all minimal specializations $h$ of $g$ such that $h$ is consistent with $x$, and some member of $S$ is more specific than $h$
      3.2.3. Remove from $G$ any hypothesis that is less general than another hypothesis in $G$

### K-Nearest neighbours

$$ f(x_q) = \frac{1}{k} \sum\limits_{i=1}^{k} f(\hat{x}_i) $$

$$ \hat{x}_i := \text{$i$-th nearest neighbour of the query point $x_q$} $$

### Weighted K-Nearest neighbours

$$ f(x_q) = \frac{\sum\limits_{i=1}^{k} w_i f(x_i)}{\sum\limits_{i=1}^{k} w_i} $$

$$ w_i = d(x_i, x_q)^{-1} $$

$$ d(x_i, x_q) := \text{Distance between $x_i$ and the query point $x_q$} $$

### Gaussian kernel regression

$$ f(x_q) = \sum\limits_{i} y_i \frac{K_{\sigma}(x_i - x_q)}{\sum\limits_{j} K_{\sigma} (x_j - x_q)} $$

### Vector quantization

1. Choose number of prototypes $K$
2. Initialize the positions of all $K$ prototypes
3. Iterate over the data:
   3.1. Choose a random $\textbf{x}_n$
   3.2. Find the nearest prototype $\textbf{w}_c$
   3.3. Move the prototype towards the data point: $\textbf{w}_c^{\text{new}} = \textbf{w}_c^{\text{old}} + \eta (\textbf{x}_n - \textbf{w}_c^{\text{old}})$
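A minimal sketch of the online vector-quantization update above (the synthetic 2-D data, learning rate and step count are illustrative assumptions); prototypes learned this way are also a common initialization for the K-means and Gaussian mixture models that follow:

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic 2-D data drawn around three centres (illustrative)
centres = np.array([[0.0, 0.0], [3.0, 0.5], [1.5, 3.0]])
X = np.vstack([rng.normal(c, 0.4, size=(100, 2)) for c in centres])

K = 3          # number of prototypes
eta = 0.05     # learning rate
steps = 2000

# Initialize prototypes on randomly chosen data points
W = X[rng.choice(len(X), size=K, replace=False)].copy()

for _ in range(steps):
    x_n = X[rng.integers(len(X))]                    # 3.1 pick a random data point
    c = np.argmin(np.sum((W - x_n) ** 2, axis=1))    # 3.2 nearest prototype w_c
    W[c] += eta * (x_n - W[c])                       # 3.3 move w_c towards x_n

print("learned prototypes:\n", W)
```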
### K-Means algorithm

1. Choose number of prototypes $K$
2. Initialize prototypes $\textbf{w}_k$
3. Iterate until convergence:
   3.1. Minimize $J$ with respect to $r_{nk}$ while keeping $\textbf{w}_k$ fixed
   3.2. Minimize $J$ with respect to $\textbf{w}_k$ while keeping $r_{nk}$ fixed

$$ J = \sum\limits_{n=1}^{N} \sum\limits_{k=1}^{K} r_{nk} \left\|\textbf{x}_n - \textbf{w}_k\right\|^2 $$

$$ r_{nk} := \text{1 if $\textbf{x}_n$ is assigned to cluster $k$, 0 otherwise} $$

### Gaussian mixture model

- Choose number of Gaussians $K$
- Initialize priors ${\pi}_k$, centers $\textbf{w}_k$ and covariances ${\Sigma}_k$ (e.g. by K-means)

1. E-Step: for all data points $\textbf{x}_n$, update the assignment probabilities

$$ P_k(\textbf{x}_n) = \frac{{\pi}_k \mathcal{N}(\textbf{x}_n | \textbf{w}_k, {\Sigma}_k)}{\sum\limits_{i=1}^K {\pi}_i \mathcal{N}(\textbf{x}_n | \textbf{w}_i, {\Sigma}_i)} $$

2. M-Step: update all $K$ prior probabilities, cluster centers and covariances

$$ {\pi}_k = \frac{1}{N} \sum\limits_{n=1}^{N} P_k(\textbf{x}_n) $$

$$ \textbf{w}_k = \frac{1}{N {\pi}_k} \sum\limits_{n=1}^{N} P_k(\textbf{x}_n) \textbf{x}_n $$

$$ {\Sigma}_k = \frac{1}{N {\pi}_k} \sum\limits_{n=1}^{N} P_k(\textbf{x}_n) (\textbf{x}_n - \textbf{w}_k) (\textbf{x}_n - \textbf{w}_k)^T $$

3. Repeat the E- and M-steps until convergence

(A short numerical sketch of these updates is given at the end of this sheet.)

## Definitions

### Likelihood

Function describing the joint probability of the data $\textbf{x}$, viewed as a function of the parameters $\textbf{w}$ of the statistical model.

### Bayesian approach

Treats the parameters as random variables, i.e. places a probability distribution over the parameters and not only over the actual data.

### Classification approaches

1. Find a discriminant function $f(\textbf{x}) : x \rightarrow \{0,1\}$
2. Bayesian approach
   2.1. Model the class-conditional likelihood $p(x | C_k)$
   2.2. Model an a-priori probability distribution $p(C_k)$
   2.3. Calculate the posterior $p(C_k | x)$
3. Direct posterior modelling of $p(C_k | x)$

### Concept

The set of properties that characterizes membership in a class; concept learning tries to identify these properties from examples.

### Hypothesis

A candidate function that decides whether an input belongs to the concept (class) or not.
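As referenced in the Gaussian mixture model section above, here is a minimal numpy sketch of one E/M loop (the synthetic 2-D data, random initialization instead of K-means, and fixed iteration count are illustrative assumptions; the density helper is hand-rolled and not numerically hardened):

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic 2-D data from two clusters (illustrative)
X = np.vstack([rng.normal([0.0, 0.0], 0.5, size=(150, 2)),
               rng.normal([3.0, 2.0], 0.7, size=(150, 2))])
N, D = X.shape
K = 2

def gaussian(X, mu, Sigma):
    """Multivariate normal density N(x | mu, Sigma) for each row of X."""
    diff = X - mu
    inv = np.linalg.inv(Sigma)
    norm = 1.0 / np.sqrt((2 * np.pi) ** D * np.linalg.det(Sigma))
    return norm * np.exp(-0.5 * np.sum(diff @ inv * diff, axis=1))

# Initialization (random instead of K-means, for brevity)
pi = np.full(K, 1.0 / K)                             # priors pi_k
W = X[rng.choice(N, size=K, replace=False)].copy()   # centres w_k
Sigmas = np.array([np.eye(D) for _ in range(K)])     # covariances Sigma_k

for _ in range(50):
    # E-step: assignment probabilities P_k(x_n), shape (N, K)
    dens = np.stack([pi[k] * gaussian(X, W[k], Sigmas[k]) for k in range(K)], axis=1)
    P = dens / dens.sum(axis=1, keepdims=True)

    # M-step: update priors, centres and covariances
    Nk = P.sum(axis=0)                               # effective counts N * pi_k
    pi = Nk / N
    W = (P.T @ X) / Nk[:, None]
    for k in range(K):
        diff = X - W[k]
        Sigmas[k] = (P[:, k, None] * diff).T @ diff / Nk[k]

print("priors:", pi)
print("centres:\n", W)
```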