# Introduction to Machine Learning

## Collection of formulas

### Quadratic error function

$$ E(\textbf{w})=\frac{1}{2}\sum\limits_{n=1}^{N}(y(x_n, \textbf{w}) - t_n)^2 $$

### Quadratic error function with regularization

$$ E(\textbf{w})=\frac{1}{2}\sum\limits_{n=1}^{N}(y(x_n, \textbf{w}) - t_n)^2 + \frac{\lambda}{2}\left\|\textbf{w}\right\|^2 $$

$$ \lambda := \text{Penalty factor} $$

- Also known as "ridge regression"

### Gaussian distribution in 1-D

$$ \mathcal{N}(t|\mu,{\sigma}^{2}) = \frac{1}{\sqrt{2 \pi {\sigma}^{2}}} \text{exp}\left(-\frac{(t - \mu)^2}{2 {\sigma}^{2}}\right) $$

### Probabilistic modelling: likelihood in 1-D

$$ p(t | x_0, \textbf{w}, \beta) = \mathcal{N}(t | y(\textbf{w}, x_0), {\sigma}^{2}) $$

$$ \beta = \frac{1}{{\sigma}^{2}}\ (\text{precision}) $$

$$ y(\textbf{w}, x_0) := \text{Output of the model at $x_0$ with parameters $\textbf{w}$} $$

### Probabilistic modelling: likelihood multidimensional

$$ p(\textbf{t} | \textbf{x}_0, \textbf{w}, {\Sigma}^{-1}) = \mathcal{N}(\textbf{t} | y(\textbf{w}, \textbf{x}_0), \Sigma) $$

$$ \Sigma := \text{Covariance matrix} $$

$$ y(\textbf{w}, \textbf{x}_0) := \text{Output of the model at $\textbf{x}_0$ with parameters $\textbf{w}$} $$

### Data likelihood

- Joint distribution over all data together
- Individual data points are assumed to be independent

$$ L(\textbf{w}) = P(T | X, \textbf{w}, \beta) = \prod\limits_{n=1}^{N} \frac{1}{c} \text{exp}\left(-\frac{(t_n - y(x_n, \textbf{w}))^2}{2 {\sigma}^{2}}\right) $$

$$ T := \text{Set of all target points (data)} $$

$$ X := \text{Set of all inputs} $$

$$ c := \text{Normalization constant} $$

$$ N := \text{Number of all input data points} $$

### Parameter optimization from data likelihood

$$ \text{maximize}\ L(\textbf{w}) \Leftrightarrow \text{minimize}\ -\text{log}\ L(\textbf{w}) $$

- The sum-of-squares error is contained in $-\text{log}\ L(\textbf{w})$; the remaining terms are constants with respect to $\textbf{w}$
- It is therefore sufficient to minimize the sum-of-squares error

$$ \textbf{w}_{\text{ML}} = \text{argmax}_{\textbf{w}}(L(\textbf{w})) = \text{argmin}_{\textbf{w}}\left(\frac{1}{2} \sum\limits_{n=1}^{N} (y(x_n, \textbf{w}) - t_n)^2\right) $$

$$ \frac{1}{{\beta}_{\text{ML}}} = \frac{1}{N} \sum\limits_{n=1}^{N} (y(x_n, \textbf{w}_{\text{ML}}) - t_n)^2 $$

### Bayesian inference

$$ P(\textbf{w} | D) = \frac{P(D | \textbf{w}) P(\textbf{w})}{P(D)} $$

$$ P(\textbf{w} | D) := \text{Posterior} $$

$$ P(D | \textbf{w}) := \text{Likelihood (model as before)} $$

$$ P(\textbf{w}) := \text{A-priori probability for $\textbf{w}$ (higher probability for smaller parameters)} $$

### Parameter optimization for the Bayesian approach

$$ \text{maximize}\ P(\textbf{w} | D) \Leftrightarrow \text{minimize}\ -\text{log}\ P(\textbf{w} | D) $$

$$ \textbf{w}_{\text{MAP}} = \text{argmax}_{\textbf{w}}(P(\textbf{w} | D)) = \text{argmin}_{\textbf{w}}\left(\frac{1}{2} \sum\limits_{n=1}^{N} (y(x_n, \textbf{w}) - t_n)^2 + \frac{\alpha}{2} \textbf{w}^{T} \textbf{w}\right) $$

$$ \alpha := \text{Hyperparameter (precision of the prior over $\textbf{w}$), encoding the initial uncertainty} $$

### Bayesian predictive distribution

$$ p(t | x, \textbf{x}, \textbf{t}) = \int p(t | x, \textbf{w})\, p(\textbf{w} | \textbf{x}, \textbf{t})\, d \textbf{w} = \mathcal{N}(t | m(x), s^{2}(x)) $$

$$ s^{2}(x) := \text{Estimated variance} $$

### Incremental Bayesian learning

1. $P(\textbf{w} | D_1) = \frac{P(D_1 | \textbf{w}) P(\textbf{w})}{P(D_1)}$
2. $P(\textbf{w} | D_2) = \frac{P(D_2 | \textbf{w}) P(\textbf{w} | D_1)}{P(D_2)}$
3. ...
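To make the maximum-likelihood formulas above concrete, here is a minimal Python/numpy sketch (the cubic polynomial model and the synthetic sinusoidal data are illustrative assumptions, not from the lecture): it obtains $\textbf{w}_{\text{ML}}$ by minimizing the sum-of-squares error and estimates $\frac{1}{\beta_{\text{ML}}}$ as the mean squared residual.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: t_n = sin(2*pi*x_n) + Gaussian noise (illustrative choice)
N = 25
x = rng.uniform(0.0, 1.0, size=N)
t = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, size=N)

# Polynomial model y(x, w) = sum_m w_m * x^m with M = 3
M = 3
Phi = np.vander(x, M + 1, increasing=True)   # N x (M+1) design matrix

# w_ML = argmin_w 1/2 * sum_n (y(x_n, w) - t_n)^2  (ordinary least squares)
w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)

# 1/beta_ML = 1/N * sum_n (y(x_n, w_ML) - t_n)^2
residuals = Phi @ w_ml - t
inv_beta_ml = np.mean(residuals ** 2)

print("w_ML =", w_ml)
print("1/beta_ML (estimated noise variance) =", inv_beta_ml)
```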
### Linear model

$$ y(\textbf{x}, \textbf{w}) = \sum\limits_{m=0}^{M} w_m {\Phi}_{m}(\textbf{x}) = \textbf{w}^T {\Phi}(\textbf{x}) $$

$$ {\Phi}(\textbf{x}) := \text{Vector of basis functions evaluated at $\textbf{x}$} $$

### Design matrix

$$ {\Phi}(X) = \begin{bmatrix} {\Phi}_{1}(\textbf{x}_1) & \cdots & {\Phi}_{M}(\textbf{x}_1) \\ \vdots & \ddots & \vdots \\ {\Phi}_{1}(\textbf{x}_N) & \cdots & {\Phi}_{M}(\textbf{x}_N) \end{bmatrix} $$

$$ X := \text{Input matrix (each column is one input vector, each row one input coordinate)} $$

### Quadratic error function for linear model

$$ E(\textbf{w}) = \frac{1}{2} \sum\limits_{n=1}^{N} (t_n - \textbf{w}^T {\Phi}(\textbf{x}_n))^2 $$

### Parameter optimization for linear model

$$ \textbf{w}_{\text{ML}} = {\Phi}^{\#} \textbf{t} $$

$$ {\Phi}^{\#} = ({\Phi}^T \Phi)^{-1} {\Phi}^T $$

$$ {\Phi}^{\#} := \text{Moore-Penrose pseudo-inverse} $$

- Gives the parameters $\textbf{w}$ for the given target vector $\textbf{t}$

### Maximum a-posteriori approach for linear model

$$ E(\textbf{w}) = \frac{1}{2} \sum\limits_{n=1}^{N} (t_n - \textbf{w}^T {\Phi}(\textbf{x}_n))^2 + \frac{\lambda}{2}\left\|\textbf{w}\right\|^2 $$

$$ \textbf{w}_{\text{MAP}} = (\lambda \textbf{I} + {\Phi}^T \Phi)^{-1} {\Phi}^T \textbf{t} $$

### Bayesian predictive distribution for linear model

$$ p(t | x, \textbf{x}, \textbf{t}) = \int p(t | x, \textbf{w})\, p(\textbf{w} | \textbf{x}, \textbf{t})\, d \textbf{w} = \mathcal{N}(t | m(x), s^2(x)) $$

$$ m(x) = \beta {\Phi}(x)^T \textbf{S} \sum\limits_{n=1}^{N} {\Phi}(x_n) t_n $$

$$ s^2(x) = {\beta}^{-1} + {\Phi}(x)^T \textbf{S} {\Phi}(x) $$

$$ \textbf{S}^{-1} = \alpha \textbf{I} + \beta \sum\limits_{n=1}^{N} {\Phi}(x_n) {\Phi}(x_n)^T $$

$$ \textbf{S} := \text{Covariance matrix} $$

### Gradient descent

- Update parameters $\textbf{w}$ iteratively

$$ \textbf{w}_{k+1} = \textbf{w}_k - \eta \frac{dE(\textbf{w})}{d \textbf{w}} $$

$$ \eta := \text{Learning rate} $$

### Weighted least squares error function

$$ E(\textbf{w}) = \frac{1}{2} \sum\limits_{n=1}^{N} d_n (t_n - \textbf{w}^T \textbf{x}_n)^2 $$

$$ d_n := \text{Weight for term $n$} $$

### Weighted least squares parameter optimization

$$ \textbf{w}_{\text{ML}} = ({\Phi}^T D \Phi)^{-1} {\Phi}^T D \textbf{t} = (X^T D X)^{-1} X^T D \textbf{t} $$

$$ D := \text{Diagonal weight matrix with $D_{nn} = d_n$} $$

### Locally weighted regression (unified model)

$$ t(\textbf{x}) = \sum\limits_{e=1}^{E} {\Phi}_e (\textbf{x}, {\theta}_e) (\textbf{w}_e^T \textbf{x} + b_e) $$

$$ {\theta}_e = ({\mu}_e, {\Sigma}_e) := \text{Gaussian parameters} $$

$$ {\Phi}(\textbf{x}, \theta) = \mathcal{N}(\textbf{x} | \mu, \Sigma) $$

### Classification (K=2)

$$ \textbf{w}^T \textbf{x} + w_0 = 0\ \text{(Hyperplane)} $$

$$ \textbf{w}^T \textbf{x} + w_0 < 0 \Rightarrow \text{Class 0} $$

$$ \textbf{w}^T \textbf{x} + w_0 > 0 \Rightarrow \text{Class 1} $$

$$ \textbf{w}^T \textbf{x} + w_0 = 0 \Rightarrow \text{On boundary} $$

### Fisher discriminant

- Class means: $\textbf{m}_1 = \frac{1}{N_1} \sum\limits_{n \in C_1} \textbf{x}_n$, $\textbf{m}_2 = \frac{1}{N_2} \sum\limits_{n \in C_2} \textbf{x}_n$
- Intra-class variance: $s_k^2 = \sum\limits_{n \in C_k} (\textbf{w}^T \textbf{x}_n - \textbf{w}^T \textbf{m}_k)^2 = \sum\limits_{n \in C_k} (\textbf{w}^T \textbf{x}_n - m'_k)^2$ with $m'_k = \textbf{w}^T \textbf{m}_k$
- Optimization: $\textbf{w} = \textbf{S}_{W}^{-1}(\textbf{m}_2 - \textbf{m}_1)$
- Within-class scatter matrix: $\textbf{S}_{W} = \sum\limits_{n \in C_1} (\textbf{x}_n - \textbf{m}_1) (\textbf{x}_n - \textbf{m}_1)^T + \sum\limits_{n \in C_2} (\textbf{x}_n - \textbf{m}_2) (\textbf{x}_n - \textbf{m}_2)^T$
- Classification: $\textbf{w}^T (\textbf{x} - \textbf{m}) > 0 \Rightarrow C_1$, $\textbf{w}^T (\textbf{x} - \textbf{m}) < 0 \Rightarrow C_2$
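A minimal numpy sketch of the Fisher discriminant above, using synthetic 2-D data (the data and all names are illustrative assumptions). Since the sheet does not define the threshold point $\textbf{m}$, the sketch classifies by comparing the projection of a new point with the projected class means, which amounts to thresholding at the midpoint of the projected means:

```python
import numpy as np

rng = np.random.default_rng(1)

# Two synthetic 2-D classes (illustrative data, not from the lecture)
X1 = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(30, 2))   # samples of C_1
X2 = rng.normal(loc=[2.0, 1.5], scale=0.5, size=(30, 2))   # samples of C_2

# Class means m_1, m_2
m1, m2 = X1.mean(axis=0), X2.mean(axis=0)

# Within-class scatter matrix S_W
S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)

# Fisher direction w = S_W^{-1} (m_2 - m_1)
w = np.linalg.solve(S_W, m2 - m1)

# Classify a new point by projecting onto w and comparing with the
# projected class means (equivalent to thresholding w^T(x - m) at the
# midpoint of the projected means, up to the sign convention of w)
x_new = np.array([0.2, -0.1])
proj = w @ x_new
label = "C_1" if abs(proj - w @ m1) < abs(proj - w @ m2) else "C_2"
print("predicted class:", label)
```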
### Bayesian classification

$$ p(C_k | \textbf{x}) = \frac{p(\textbf{x} | C_k) p(C_k)}{\sum\limits_{j} p(\textbf{x} | C_j) p(C_j)} = \frac{\text{exp}(a_k)}{\sum\limits_{j} \text{exp}(a_j)} $$

$$ a_j = \text{ln}(p(\textbf{x} | C_j) p(C_j)) $$

$$ j \in \{1, ..., K\} $$

### Bayesian classification with Gaussian class-conditional densities

$$ p(\textbf{x} | C_k) = \frac{1}{(2 \pi)^{\frac{D}{2}}} \frac{1}{{| \Sigma |}^{\frac{1}{2}}} \text{exp}\left(- \frac{1}{2} (\textbf{x} - {\mu}_k)^T {\Sigma}^{-1} (\textbf{x} - {\mu}_k)\right) $$

$$ p(C_1 | \textbf{x}) = \sigma \left(\text{ln} \frac{p(\textbf{x} | C_1) p(C_1)}{p(\textbf{x} | C_2) p(C_2)}\right) = \sigma (\textbf{w}^T \textbf{x} + w_0) $$

$$ \textbf{w} = {\Sigma}^{-1} ({\mu}_1 - {\mu}_2) $$

$$ w_0 = - \frac{1}{2} {\mu}_1^T {\Sigma}^{-1} {\mu}_1 + \frac{1}{2} {\mu}_2^T {\Sigma}^{-1} {\mu}_2 + \text{ln} \frac{p(C_1)}{p(C_2)} $$

### Generalization algorithm

1. Initialize $h$ with $< \emptyset , \emptyset , ..., \emptyset >$
2. For each positive example $x$:
   2.1. For each constraint $a_i$:
      2.1.1. If $a_i$ is fulfilled by $x$: OK
      2.1.2. Else: generalize $a_i$

### Specialization algorithm

1. Initialize $h$ with $< ?, ?, ..., ? >$
2. For each negative example $x$:
   2.1. For each constraint $a_i$:
      2.1.1. If $a_i$ is fulfilled by $x$: replace $a_i$ with the most special constraint that excludes $x$
      2.1.2. Else: OK

### Candidate elimination

- Initialize $G$ to the set of maximally general hypotheses in $H$
- Initialize $S$ to the set of maximally specific hypotheses in $H$

1. For each training instance $x$:
2. If $x$ is a positive example:
   2.1. Remove from $G$ any hypothesis inconsistent with $x$
   2.2. For each hypothesis $s$ in $S$ that is not consistent with $x$:
      2.2.1. Remove $s$ from $S$
      2.2.2. Add to $S$ all minimal generalizations $h$ of $s$ such that $h$ is consistent with $x$, and some member of $G$ is more general than $h$
      2.2.3. Remove from $S$ any hypothesis that is more general than another hypothesis in $S$
3. If $x$ is a negative example:
   3.1. Remove from $S$ any hypothesis inconsistent with $x$
   3.2. For each hypothesis $g$ in $G$ that is not consistent with $x$:
      3.2.1. Remove $g$ from $G$
      3.2.2. Add to $G$ all minimal specializations $h$ of $g$ such that $h$ is consistent with $x$, and some member of $S$ is more specific than $h$
      3.2.3. Remove from $G$ any hypothesis that is less general than another hypothesis in $G$

### K-Nearest neighbours

$$ f(x_q) = \frac{1}{k} \sum\limits_{i=1}^{k} f(\hat{x}_i) $$

$$ \hat{x}_i := \text{$i$-th nearest neighbour of the query point $x_q$} $$

### Weighted K-Nearest neighbours

$$ f(x_q) = \frac{\sum\limits_{i=1}^{k} w_i f(x_i)}{\sum\limits_{i=1}^{k} w_i} $$

$$ w_i = d(x_i, x_q)^{-1} $$

$$ d(x_i, x_q) := \text{Distance between $x_i$ and the query point $x_q$} $$

### Gaussian kernel regression

$$ f(x_q) = \sum\limits_{i} y_i \frac{K_{\sigma}(x_i - x_q)}{\sum\limits_{j} K_{\sigma} (x_j - x_q)} $$

### Vector quantization

1. Choose number of prototypes $K$
2. Initialize the positions of all $K$ prototypes
3. Iterate over the data:
   3.1. Choose a random $\textbf{x}_n$
   3.2. Find the nearest prototype $\textbf{w}_c$
   3.3. Move the prototype towards the data point: $\textbf{w}_c^{\text{new}} = \textbf{w}_c^{\text{old}} + \eta (\textbf{x}_n - \textbf{w}_c^{\text{old}})$
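A minimal sketch of the online vector-quantization update above (the synthetic 2-D data, learning rate and step count are illustrative assumptions); prototypes learned this way are also a common initialization for the K-means and Gaussian mixture models that follow:

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic 2-D data drawn around three centres (illustrative)
centres = np.array([[0.0, 0.0], [3.0, 0.5], [1.5, 3.0]])
X = np.vstack([rng.normal(c, 0.4, size=(100, 2)) for c in centres])

K = 3          # number of prototypes
eta = 0.05     # learning rate
steps = 2000

# Initialize prototypes on randomly chosen data points
W = X[rng.choice(len(X), size=K, replace=False)].copy()

for _ in range(steps):
    x_n = X[rng.integers(len(X))]                    # 3.1 pick a random data point
    c = np.argmin(np.sum((W - x_n) ** 2, axis=1))    # 3.2 nearest prototype w_c
    W[c] += eta * (x_n - W[c])                       # 3.3 move w_c towards x_n

print("learned prototypes:\n", W)
```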
### K-Means algorithm

1. Choose number of prototypes $K$
2. Initialize prototypes $\textbf{w}_k$
3. Iterate until convergence:
   3.1. Minimize $J$ with respect to $r_{nk}$ while keeping $\textbf{w}_k$ fixed
   3.2. Minimize $J$ with respect to $\textbf{w}_k$ while keeping $r_{nk}$ fixed

$$ J = \sum\limits_{n=1}^{N} \sum\limits_{k=1}^{K} r_{nk} \left\|\textbf{x}_n - \textbf{w}_k\right\|^2 $$

$$ r_{nk} := \text{1 if $\textbf{x}_n$ is assigned to cluster $k$, 0 otherwise} $$

### Gaussian mixture model

- Choose number of Gaussians $K$
- Initialize priors ${\pi}_k$, centers $\textbf{w}_k$ and covariances ${\Sigma}_k$ (e.g. by K-means)

1. E-Step: for all data points $\textbf{x}_n$, update the assignment probabilities

$$ P_k(\textbf{x}_n) = \frac{{\pi}_k \mathcal{N}(\textbf{x}_n | \textbf{w}_k, {\Sigma}_k)}{\sum\limits_{i=1}^K {\pi}_i \mathcal{N}(\textbf{x}_n | \textbf{w}_i, {\Sigma}_i)} $$

2. M-Step: update all $K$ prior probabilities, cluster centers and covariances

$$ {\pi}_k = \frac{1}{N} \sum\limits_{n=1}^{N} P_k(\textbf{x}_n) $$

$$ \textbf{w}_k = \frac{1}{N {\pi}_k} \sum\limits_{n=1}^{N} P_k(\textbf{x}_n) \textbf{x}_n $$

$$ {\Sigma}_k = \frac{1}{N {\pi}_k} \sum\limits_{n=1}^{N} P_k(\textbf{x}_n) (\textbf{x}_n - \textbf{w}_k) (\textbf{x}_n - \textbf{w}_k)^T $$

3. Repeat the E- and M-steps until convergence

(A short numerical sketch of these updates is given at the end of this sheet.)

## Definitions

### Likelihood

Function describing the joint probability of the data $\textbf{x}$, viewed as a function of the parameters $\textbf{w}$ of the statistical model.

### Bayesian approach

Treats the parameters as random variables, i.e. places a probability distribution over the parameters and not only over the actual data.

### Classification approaches

1. Find a discriminant function $f(\textbf{x}) : x \rightarrow \{0,1\}$
2. Bayesian approach
   2.1. Model the class-conditional likelihood $p(x | C_k)$
   2.2. Model an a-priori probability distribution $p(C_k)$
   2.3. Calculate the posterior $p(C_k | x)$
3. Direct posterior modelling of $p(C_k | x)$

### Concept

The set of properties that characterizes membership in a class; concept learning tries to identify these properties from examples.

### Hypothesis

A candidate function that decides whether an input belongs to the concept (class) or not.
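As referenced in the Gaussian mixture model section above, here is a minimal numpy sketch of one E/M loop (the synthetic 2-D data, random initialization instead of K-means, and fixed iteration count are illustrative assumptions; the density helper is hand-rolled and not numerically hardened):

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic 2-D data from two clusters (illustrative)
X = np.vstack([rng.normal([0.0, 0.0], 0.5, size=(150, 2)),
               rng.normal([3.0, 2.0], 0.7, size=(150, 2))])
N, D = X.shape
K = 2

def gaussian(X, mu, Sigma):
    """Multivariate normal density N(x | mu, Sigma) for each row of X."""
    diff = X - mu
    inv = np.linalg.inv(Sigma)
    norm = 1.0 / np.sqrt((2 * np.pi) ** D * np.linalg.det(Sigma))
    return norm * np.exp(-0.5 * np.sum(diff @ inv * diff, axis=1))

# Initialization (random instead of K-means, for brevity)
pi = np.full(K, 1.0 / K)                             # priors pi_k
W = X[rng.choice(N, size=K, replace=False)].copy()   # centres w_k
Sigmas = np.array([np.eye(D) for _ in range(K)])     # covariances Sigma_k

for _ in range(50):
    # E-step: assignment probabilities P_k(x_n), shape (N, K)
    dens = np.stack([pi[k] * gaussian(X, W[k], Sigmas[k]) for k in range(K)], axis=1)
    P = dens / dens.sum(axis=1, keepdims=True)

    # M-step: update priors, centres and covariances
    Nk = P.sum(axis=0)                               # effective counts N * pi_k
    pi = Nk / N
    W = (P.T @ X) / Nk[:, None]
    for k in range(K):
        diff = X - W[k]
        Sigmas[k] = (P[:, k, None] * diff).T @ diff / Nk[k]

print("priors:", pi)
print("centres:\n", W)
```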