# Introduction to Machine Learning

## Collection of formulas

### Quadratic error function

$$ E(\textbf{w}) = \frac{1}{2}\sum\limits_{n=1}^{N} (y(x_n, \textbf{w}) - t_n)^2 $$

### Quadratic error function with regularization

$$ E(\textbf{w}) = \frac{1}{2}\sum\limits_{n=1}^{N} (y(x_n, \textbf{w}) - t_n)^2 + \frac{\lambda}{2}\left\|\textbf{w}\right\|^2 $$

$$ \lambda := \text{penalty factor} $$

- Known as "ridge regression"

### Gaussian distribution in 1-D

$$ \mathcal{N}(t \mid \mu, \sigma^2) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\left(-\frac{(t - \mu)^2}{2 \sigma^2}\right) $$

### Probabilistic modelling: likelihood in 1-D

$$ p(t \mid x_0, \textbf{w}, \beta) = \mathcal{N}(t \mid y(\textbf{w}, x_0), \sigma^2) $$

$$ \beta = \frac{1}{\sigma^2} \quad (\text{precision}) $$

$$ y(\textbf{w}, x_0) := \text{output of the model at } x_0 \text{ with parameters } \textbf{w} $$

### Probabilistic modelling: likelihood (multidimensional)

$$ p(\textbf{t} \mid \textbf{x}_0, \textbf{w}, \Sigma^{-1}) = \mathcal{N}(\textbf{t} \mid y(\textbf{w}, \textbf{x}_0), \Sigma) $$

$$ \Sigma := \text{covariance matrix } (\Sigma^{-1} \text{ is the precision matrix, analogous to } \beta) $$

$$ y(\textbf{w}, \textbf{x}_0) := \text{output of the model at } \textbf{x}_0 \text{ with parameters } \textbf{w} $$

### Data-likelihood

- Joint distribution over all data together
- Individual data points are assumed to be independent

$$ L(\textbf{w}) = P(T \mid X, \textbf{w}, \beta) = \prod\limits_{n=1}^{N} \frac{1}{c} \exp\left(-\frac{(t_n - y(x_n, \textbf{w}))^2}{2 \sigma^2}\right) $$

$$ T := \text{set of all target values (data)} $$

$$ X := \text{set of all inputs} $$

$$ c := \text{normalization constant } \left(= \sqrt{2 \pi \sigma^2}\right) $$

$$ N := \text{number of data points} $$

### Parameter optimization from data-likelihood

$$ \text{maximize } L(\textbf{w}) \Leftrightarrow \text{minimize } -\log L(\textbf{w}) $$

Taking the negative logarithm turns the product into a sum:

$$ -\log L(\textbf{w}) = \frac{\beta}{2} \sum\limits_{n=1}^{N} (y(x_n, \textbf{w}) - t_n)^2 - \frac{N}{2} \log \beta + \frac{N}{2} \log 2\pi $$

- The sum-of-squares error is contained in $-\log L(\textbf{w})$; the remaining terms are constant with respect to $\textbf{w}$
- It is therefore sufficient to minimize the sum-of-squares error

$$ \textbf{w}_{\text{ML}} = \operatorname{argmax}_{\textbf{w}} L(\textbf{w}) = \operatorname{argmin}_{\textbf{w}} \frac{1}{2} \sum\limits_{n=1}^{N} (y(x_n, \textbf{w}) - t_n)^2 $$

$$ \frac{1}{\beta_{\text{ML}}} = \frac{1}{N} \sum\limits_{n=1}^{N} (y(x_n, \textbf{w}_{\text{ML}}) - t_n)^2 $$

### Bayesian inference

$$ P(\textbf{w} \mid D) = \frac{P(D \mid \textbf{w}) \, P(\textbf{w})}{P(D)} $$

$$ P(\textbf{w} \mid D) := \text{posterior} $$

$$ P(D \mid \textbf{w}) := \text{likelihood (the model as before)} $$

$$ P(\textbf{w}) := \text{prior over } \textbf{w} \text{ (assigns higher probability to smaller parameters)} $$

### Parameter optimization for the Bayesian approach

$$ \text{maximize } P(\textbf{w} \mid D) \Leftrightarrow \text{minimize } -\log P(\textbf{w} \mid D) $$

$$ \textbf{w}_{\text{MAP}} = \operatorname{argmax}_{\textbf{w}} P(\textbf{w} \mid D) = \operatorname{argmin}_{\textbf{w}} \left( \frac{1}{2} \sum\limits_{n=1}^{N} (y(x_n, \textbf{w}) - t_n)^2 + \frac{\alpha}{2} \textbf{w}^{T} \textbf{w} \right) $$

$$ \alpha := \text{hyperparameter (precision of the prior), expressing the initial uncertainty about } \textbf{w} $$

- This has the same form as the regularized quadratic error function above, with $\lambda = \alpha$

## Definitions

### Likelihood

Function describing the joint probability of the data $\textbf{x}$ as a function of the parameters $\textbf{w}$ of the statistical model.

### Bayesian approach

A probabilistic model over the parameters themselves, not just over the observed data.
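## Code sketches

The following are minimal NumPy sketches of the formulas above, assuming a polynomial model $y(x, \textbf{w}) = \sum_j w_j x^j$; all function and variable names (`log_likelihood`, `fit_ml`, `fit_map`) are illustrative choices, not part of the original material.

First, evaluating the data-likelihood in log space, using the identity for $-\log L(\textbf{w})$ derived above:

```python
import numpy as np

def log_likelihood(w, x, t, beta):
    """log L(w) for Gaussian noise with precision beta and a polynomial model."""
    y = np.polyval(w[::-1], x)       # model outputs y(x_n, w); w is lowest-degree-first
    sq_err = np.sum((t - y) ** 2)    # sum-of-squares term (without the 1/2)
    n = len(x)
    # log L(w) = -beta/2 * sq_err + N/2 * log(beta) - N/2 * log(2*pi)
    return -0.5 * beta * sq_err + 0.5 * n * np.log(beta) - 0.5 * n * np.log(2 * np.pi)
```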
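The maximum-likelihood solution reduces to an ordinary least-squares fit, after which $\beta_{\text{ML}}$ follows from the mean squared residual, as in the formula above:

```python
import numpy as np

def fit_ml(x, t, degree):
    """w_ML minimizes the sum-of-squares error; beta_ML is the inverse mean squared residual."""
    Phi = np.vander(x, degree + 1, increasing=True)  # design matrix, Phi[n, j] = x_n**j
    w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)   # least-squares solution
    mse = np.mean((Phi @ w_ml - t) ** 2)             # this is 1 / beta_ML
    return w_ml, 1.0 / mse
```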
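For models that are linear in $\textbf{w}$, such as the polynomial used here, the MAP objective has the closed-form ridge-regression solution $\textbf{w}_{\text{MAP}} = (\Phi^T \Phi + \alpha I)^{-1} \Phi^T \textbf{t}$:

```python
import numpy as np

def fit_map(x, t, degree, alpha):
    """w_MAP minimizes 1/2 * sum-of-squares + alpha/2 * w^T w (ridge regression)."""
    Phi = np.vander(x, degree + 1, increasing=True)
    A = Phi.T @ Phi + alpha * np.eye(Phi.shape[1])   # regularized normal equations
    return np.linalg.solve(A, Phi.T @ t)
```

With `alpha = 0` this reduces to the unregularized `fit_ml` solution; larger `alpha` shrinks the weights toward zero.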