# Introduction to Machine Learning

## Collection of formulas

### Quadratic error function

$$ E(\textbf{w}) = \frac{1}{2}\sum\limits_{n=1}^{N} (y(x_n, \textbf{w}) - t_n)^2 $$

### Quadratic error function with regularization

$$ E(\textbf{w}) = \frac{1}{2}\sum\limits_{n=1}^{N} (y(x_n, \textbf{w}) - t_n)^2 + \frac{\lambda}{2}\left\|\textbf{w}\right\|^2 $$

$$ \lambda := \text{penalty factor} $$

- Known as "ridge regression"

### Gaussian distribution in 1-D

$$ \mathcal{N}(t \mid \mu, \sigma^2) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\left(-\frac{(t - \mu)^2}{2 \sigma^2}\right) $$

### Probabilistic modelling: likelihood in 1-D

$$ p(t \mid x_0, \textbf{w}, \beta) = \mathcal{N}(t \mid y(\textbf{w}, x_0), \sigma^2) $$

$$ \beta = \frac{1}{\sigma^2} \quad (\text{precision}) $$

$$ y(\textbf{w}, x_0) := \text{output of the model at } x_0 \text{ with parameters } \textbf{w} $$

### Probabilistic modelling: likelihood (multidimensional)

$$ p(\textbf{t} \mid \textbf{x}_0, \textbf{w}, \Sigma^{-1}) = \mathcal{N}(\textbf{t} \mid y(\textbf{w}, \textbf{x}_0), \Sigma) $$

$$ \Sigma := \text{covariance matrix } (\Sigma^{-1} \text{ is the precision matrix, analogous to } \beta) $$

$$ y(\textbf{w}, \textbf{x}_0) := \text{output of the model at } \textbf{x}_0 \text{ with parameters } \textbf{w} $$

### Data-likelihood

- Joint distribution over all data together
- Individual data points are assumed to be independent

$$ L(\textbf{w}) = P(T \mid X, \textbf{w}, \beta) = \prod\limits_{n=1}^{N} \frac{1}{c} \exp\left(-\frac{(t_n - y(x_n, \textbf{w}))^2}{2 \sigma^2}\right) $$

$$ T := \text{set of all target values (data)} $$

$$ X := \text{set of all inputs} $$

$$ c := \text{normalization constant } \left(= \sqrt{2 \pi \sigma^2}\right) $$

$$ N := \text{number of data points} $$

### Parameter optimization from data-likelihood

$$ \text{maximize } L(\textbf{w}) \Leftrightarrow \text{minimize } -\log L(\textbf{w}) $$

Taking the negative logarithm turns the product into a sum:

$$ -\log L(\textbf{w}) = \frac{\beta}{2} \sum\limits_{n=1}^{N} (y(x_n, \textbf{w}) - t_n)^2 - \frac{N}{2} \log \beta + \frac{N}{2} \log 2\pi $$

- The sum-of-squares error is contained in $-\log L(\textbf{w})$; the remaining terms are constant with respect to $\textbf{w}$
- It is therefore sufficient to minimize the sum-of-squares error

$$ \textbf{w}_{\text{ML}} = \operatorname{argmax}_{\textbf{w}} L(\textbf{w}) = \operatorname{argmin}_{\textbf{w}} \frac{1}{2} \sum\limits_{n=1}^{N} (y(x_n, \textbf{w}) - t_n)^2 $$

$$ \frac{1}{\beta_{\text{ML}}} = \frac{1}{N} \sum\limits_{n=1}^{N} (y(x_n, \textbf{w}_{\text{ML}}) - t_n)^2 $$

### Bayesian inference

$$ P(\textbf{w} \mid D) = \frac{P(D \mid \textbf{w}) \, P(\textbf{w})}{P(D)} $$

$$ P(\textbf{w} \mid D) := \text{posterior} $$

$$ P(D \mid \textbf{w}) := \text{likelihood (the model as before)} $$

$$ P(\textbf{w}) := \text{prior over } \textbf{w} \text{ (assigns higher probability to smaller parameters)} $$

### Parameter optimization for the Bayesian approach

$$ \text{maximize } P(\textbf{w} \mid D) \Leftrightarrow \text{minimize } -\log P(\textbf{w} \mid D) $$

$$ \textbf{w}_{\text{MAP}} = \operatorname{argmax}_{\textbf{w}} P(\textbf{w} \mid D) = \operatorname{argmin}_{\textbf{w}} \left( \frac{1}{2} \sum\limits_{n=1}^{N} (y(x_n, \textbf{w}) - t_n)^2 + \frac{\alpha}{2} \textbf{w}^{T} \textbf{w} \right) $$

$$ \alpha := \text{hyperparameter (precision of the prior), expressing the initial uncertainty about } \textbf{w} $$

- This has the same form as the regularized quadratic error function above, with $\lambda = \alpha$

## Definitions

### Likelihood

Function describing the joint probability of the data $\textbf{x}$ as a function of the parameters $\textbf{w}$ of the statistical model.

### Bayesian approach

A probabilistic model over the parameters themselves, not just over the observed data.
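## Code sketches

The following are minimal NumPy sketches of the formulas above, assuming a polynomial model $y(x, \textbf{w}) = \sum_j w_j x^j$; all function and variable names (`log_likelihood`, `fit_ml`, `fit_map`) are illustrative choices, not part of the original material.

First, evaluating the data-likelihood in log space, using the identity for $-\log L(\textbf{w})$ derived above:

```python
import numpy as np

def log_likelihood(w, x, t, beta):
    """log L(w) for Gaussian noise with precision beta and a polynomial model."""
    y = np.polyval(w[::-1], x)       # model outputs y(x_n, w); w is lowest-degree-first
    sq_err = np.sum((t - y) ** 2)    # sum-of-squares term (without the 1/2)
    n = len(x)
    # log L(w) = -beta/2 * sq_err + N/2 * log(beta) - N/2 * log(2*pi)
    return -0.5 * beta * sq_err + 0.5 * n * np.log(beta) - 0.5 * n * np.log(2 * np.pi)
```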
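The maximum-likelihood solution reduces to an ordinary least-squares fit, after which $\beta_{\text{ML}}$ follows from the mean squared residual, as in the formula above:

```python
import numpy as np

def fit_ml(x, t, degree):
    """w_ML minimizes the sum-of-squares error; beta_ML is the inverse mean squared residual."""
    Phi = np.vander(x, degree + 1, increasing=True)  # design matrix, Phi[n, j] = x_n**j
    w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)   # least-squares solution
    mse = np.mean((Phi @ w_ml - t) ** 2)             # this is 1 / beta_ML
    return w_ml, 1.0 / mse
```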
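For models that are linear in $\textbf{w}$, such as the polynomial used here, the MAP objective has the closed-form ridge-regression solution $\textbf{w}_{\text{MAP}} = (\Phi^T \Phi + \alpha I)^{-1} \Phi^T \textbf{t}$:

```python
import numpy as np

def fit_map(x, t, degree, alpha):
    """w_MAP minimizes 1/2 * sum-of-squares + alpha/2 * w^T w (ridge regression)."""
    Phi = np.vander(x, degree + 1, increasing=True)
    A = Phi.T @ Phi + alpha * np.eye(Phi.shape[1])   # regularized normal equations
    return np.linalg.solve(A, Phi.T @ t)
```

With `alpha = 0` this reduces to the unregularized `fit_ml` solution; larger `alpha` shrinks the weights toward zero.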