From 96605c48fda8fa631d8ba2fcb14eccb9e78b66ae Mon Sep 17 00:00:00 2001
From: Leonard Kugis <leonard@kug.is>
Date: Tue, 6 Sep 2022 23:29:55 +0200
Subject: Introduction to Machine Learning

Added last few chapters.
---
 .../introduction_to_machine_learning.md            | 206 ++++++++++++++++++++-
 1 file changed, 205 insertions(+), 1 deletion(-)

diff --git a/en_GB/Introduction to Machine Learning/introduction_to_machine_learning.md b/en_GB/Introduction to Machine Learning/introduction_to_machine_learning.md
index 0e17258..a765486 100644
--- a/en_GB/Introduction to Machine Learning/introduction_to_machine_learning.md	
+++ b/en_GB/Introduction to Machine Learning/introduction_to_machine_learning.md	
@@ -64,6 +64,193 @@ $$ \text{maximize}\ P(\textbf{w} | D) \Leftrightarrow \text{minimize}\ -\text{lo
 $$ \textbf{w}_{\text{MAP}} = \text{argmax}_{\textbf{w}}(P(\textbf{w} | D)) = \text{argmin}_{\textbf{w}}(\frac{1}{2} \sum\limits_{n=1}^{N} (y(x_n, \textbf{w}) - t_n)^2 + \frac{\alpha}{2} \textbf{w}^{T} \textbf{w}) $$
 $$ \alpha := \text{Hyperparameter, denoting initial uncertainty} $$
 
+### Bayesian predictive distribution
+
+$$ \int p(t | x, \textbf{w}) p(\textbf{w} | \textbf{x}, \textbf{t}) d \textbf{w} = \mathcal{N}(t | m(x), s^{2}(x)) $$
+$$ s^{2}(x) := \text{Estimated variance} $$
+
+### Incremental bayesian learning
+
+1. $P(\textbf{w} | D_1) = \frac{P(D_1, \textbf{w}) P(\textbf{w})}{P(D_1)}$
+2. $P(\textbf{w} | D_2) = \frac{P(D_2, \textbf{w}) P(\textbf{w} | D_1)}{P(D_2)}$
+
+...
+
+### Linear model
+
+$$ y(\textbf{x}, \textbf{w}) = \sum\limits_{m=0}^{M} w_m {\Phi}_{m}(\textbf{x}) = \textbf{w}^T {\Phi}(\textbf{x}) $$
+$$ {\Phi}(\textbf{x}) := \text{Design matrix} $$
+
+### Design matrix
+
+$$ {\Phi}(X) = 
+\begin{bmatrix}
+{\Phi}_{1}(\textbf{x}_1) & \cdots & {\Phi}_{M}(\textbf{x}_1) \\
+\vdots & \ddots & \vdots \\
+{\Phi}_{1}(\textbf{x}_N) & \cdots & {\Phi}_{M}(\textbf{x}_N)
+\end{bmatrix}
+$$
+$$ X := \text{Input matrix (columns are inputs and rows of each column are input coordinates)} $$
+
+### Quadratic-error-function for linear model
+
+$$ E(\textbf{w}) = \frac{1}{2} \sum\limits_{n=1}^{N} (t_n - \textbf{w}^T {\Phi}(\textbf{x}_n))^2 $$
+
+### Parameter optimization for linear model
+
+$$ \textbf{w}_{ML} = {\Phi}^{\#} \textbf{t} $$
+$$ {\Phi}^{\#} = ({\Phi}^T \Phi)^{-1} {\Phi}^T $$
+$$ {\Phi}^{\#} := \text{Moore-Penrose pseudo inverse} $$
+
+- Get the parameters $\textbf{w}$ for the given target vector $\textbf{t}$
+
+### Maximum a-posteriori approach for linear model
+
+$$ E(\textbf{w}) = \frac{1}{2} \sum\limits_{n=1}^{N} (t_n - \textbf{w}^T {\Phi}(\textbf{x}_n))^2 + \frac{\lambda}{2}\left\|\textbf{w}\right\|^2 $$
+$$ \textbf{w}_{\text{MAP}} = (\lambda \textbf{I} + {\Phi}^T \Phi)^{-1} {\Phi}^T \textbf{t} $$
+
+### Bayesian predictive distribution
+
+$$ p(t | x, \textbf{x}, \textbf{t}) = \int p(t | x, \textbf{w}) p(\textbf{w} | \textbf{x}, \textbf{t}) d \textbf{w} = \mathcal{N}(t | m(x), s^2(x)) $$
+$$ m(x) = \beta {\Phi}(x)^T \textbf{S} \sum\limits_{n=1}^{N} {\Phi}(x_n) t_n $$
+$$ s^2(x) = {\beta}^{-1} + {\Phi}(x)^T \textbf{S} {\Phi}(x) $$
+$$ \textbf{S}^{-1} = \alpha \textbf{I} + \beta \sum\limits_{n=1}^{N} {\Phi}(x_n) {\Phi}(x_n)^T $$
+$$ \textbf{S} := \text{Covariance matrix} $$
+
+### Gradient descent
+
+- Update parameters $\textbf{w}$ iteratively
+
+$$ \textbf{w}_{k+1} = w_k - \eta \frac{dE(\textbf{w})}{d \textbf{w}} $$
+$$ \eta := \text{Learning rate} $$
+
+### Weighted least squares error function
+
+$$ E(\textbf{w}) = \frac{1}{2} \sum\limits_{n=1}^{N} d_n (t_n - \textbf{w}^t \textbf{x}_n)^2 $$
+$$ d_n := \text{Weight for term $n$} $$
+
+### Weighted least squares parameter optimization
+
+$$ \textbf{w}_{\text{ML}} = (\Phi D \Phi)^{-1} {\Phi}^T D T = (X^T D X)^{-1} X^T D T $$
+$$ D := \text{Diagonal weight matrix with $D_{nn} = d_n$} $$
+
+### Locally weighted regression (unified model)
+
+$$ t(\textbf{x}) = \sum\limits_{e=1}^{E} {\Phi}_e (\textbf{x}, {\theta}_e) (\textbf{w}_e^T \textbf{x} + b_e) $$
+$$ {\theta}_e = ({\mu}_e, {\Sigma}_e) := \text{Gaussian parameters} $$
+$$ {\Phi}(\textbf{x}, \theta) = \mathcal{N}(\textbf{x} | \mu, \Sigma) $$
+
+### Classification (K=2)
+
+$$ \textbf{w}^T \textbf{x} + w_0 = 0\ \text{(Hyperplane)} $$
+$$ \textbf{w}^T \textbf{x} + w_0 < 0 \Rightarrow \text{Class 0} $$
+$$ \textbf{w}^T \textbf{x} + w_0 > 0 \Rightarrow \text{Class 1} $$
+$$ \textbf{w}^T \textbf{x} + w_0 = 0 \Rightarrow \text{On boundary} $$
+
+### Fisher discriminant
+
+- Midpoints $\textbf{m}_1 = \frac{1}{N_1} \sum\limits_{n \in C_1} \textbf{x}_n$, $\textbf{m}_2 = \frac{1}{N_2} \sum\limits_{n \in C_2} \textbf{x}_n$
+- Intra-class variance: $s_k^2 = \sum\limits_{n \in C_k} (\textbf{w}^T \textbf{x}_n - \textbf{w}^T \textbf{m}_k)^2 = \sum\limits_{n \in C_k} (\textbf{w}^T \textbf{x}_n - m'_k)^2$
+- Optimization: $\textbf{w} = \textbf{S}_{W}^{-1}(\textbf{m}_2 - \textbf{m}_1)$
+- Within-class variance: $\textbf{S}_{W} = \sum\limits_{n \in C_1} (\textbf{x}_n - \textbf{m}_1) (\textbf{x}_n - \textbf{m}_1)^T + \sum\limits_{n \in C_2} (\textbf{x}_n - \textbf{m}_2) (\textbf{x}_n - \textbf{m}_2)^T$
+- Classification: $\textbf{w}^T (\textbf{x} - \textbf{m}) > 0 \Rightarrow C_1$, $\textbf{w}^T (\textbf{x} - \textbf{m}) < 0 \Rightarrow C_2$
+
+### Bayesian classification
+
+$$ p(C_k | \textbf{x}) = \frac{p(\textbf{x} | C_k) p(C_k)}{\sum\limits_{j} p(\textbf{x} | C_j) p(C_j)} = \frac{\text{exp}(a_k)}{\sum\limits_{j} \text{exp}(a_j)} $$
+$$ a_j = \text{ln} p(\textbf{x} | C_j) p(C_j) $$
+$$ j \in \{1, ..., K\} $$
+
+### Bayesian classification with gaussian deviation
+
+$$ p(\textbf{x} | C_k) = \frac{1}{2 {\pi}^{\frac{D}{2}}} \frac{1}{{| \Sigma |}^{\frac{1}{2}}} \text{exp}(- \frac{1}{2} (\textbf{x} - {\mu}_k)^T {\Sigma}^{-1} (\textbf{x} - {\mu}_k)) $$
+$$ p(C_1 | \textbf{x}) = \sigma (\text{ln} \frac{p(\textbf{x} | C_1) p(C_1)}{p(\textbf{x} | C_2) p(C_2)}) = \sigma (\textbf{w}^T \textbf{x} + w_0) $$
+$$ \textbf{w} = {\Sigma}^{-1} ({\mu}_1 - {\mu}_2) $$
+$$ w_0 = - \frac{1}{2} {\mu}_1^T {\Sigma}^{-1} {\mu}_1 + {\mu}_2^T {\Sigma}^{-1} {\mu}_2 + \text{ln} \frac{p(C_1)}{p(C_2)} $$
+
+### Generalization algorithm
+
+1. Init $h$ with $< \emptyset , \emptyset , ..., \emptyset >$
+2. For all positive examples $x$
+3. For all constraints $a_i$
+4. If $a_i$ is fulfilled by $x$
+5. OK
+6. Else
+7. Generalize $a_i$
+
+### Specialization algorithm
+
+1. Init $h$ with $< ?, ?, ..., ? >$
+2. For all negative examples $x$
+3. For all constraints $a_i$
+4. If $a_i$ is fulfilled by $x$
+5. OK
+6. Else
+7. Replace $a_i$ with most special
+
+### Candidate elimination
+
+- Initialize $G$ to the set of maximally general hypotheses in $H$
+- Initialize $H$ to the set of maximally specific hypotheses in $H$
+
+1. For each training instance $x$:
+2. If $x$ is a positive example  
+2.1. Remove from $G$ any hypothesis inconsistent with $x$  
+2.2. For each hypothesis $s$ in $S$ that is not consistent with $x$  
+2.2.1. Remove $s$ from $S$  
+2.2.2. Add to $S$ all minimal generalizations $h$ of $s$ such that $h$ is consistent with $x$, and some member of $G$ is more general than $h$  
+2.2.3. Remove from $S$ any hypothesis that is more general than another hypothesis in $S$  
+3. If $x$ is a negative example  
+3.1. Remove from $S$ any hypothesis inconsistent with $x$  
+3.2. For each hypothesis $g$ in $G$ that is not consistent with with $x$  
+3.2.1. Remove $g$ from $G$  
+3.2.2. Add to $G$ all minimal specializations $h$ of $g$ such that $h$ is consistent with $x$, and some member of $S$ is more specific than $h$  
+3.2.3. Remove from $G$ any hypothesis that is less general than another hypothesis in $G$
+
+### K-Nearest neighbours
+
+$$ f(x_q) = \frac{1}{k} \sum\limits_{i=0}^{k} f(\hat{x}_i) $$
+
+### Weighted K-Nearest neighbours
+
+$$ f(x_q) = \frac{\sum\limits_{i=1}^{k} w_i f(x_i)}{\sum\limits_{i=1}^{k} w_i} $$
+$$ w_i = d(x_i, x_q)^{-1} $$
+
+### Gaussian kernel regression
+
+$$ f(x_q) = \sum\limits_{i} y_i \frac{K_{\sigma}(x_i - x_q)}{\sum\limits_{j} K_{\sigma} (x_j - x_q)} $$
+
+### Vector quantization
+
+1. Choose number of prototypes $K$
+2. Initialize position of all prototypes $K$
+3. Iterate over data:  
+3.1. Choose random $\textbf{x}_n$  
+3.2. Find nearest prototype $\textbf{w}_c$  
+3.3. Move the prototype to the direction of the data point $\textbf{w}_c^{\text{new}} = \textbf{w}_c^{\text{old}} + \eta (\textbf{w}_c^{\text{old}} - \textbf{x}_n)$
+
+### K-Means algorithm
+
+1. Number of prototypes $K$
+2. Initialize prototypes $\textbf{w}_k$  
+2.1. Minimize $J$ with respect to $r_{nk}$ while keeping $\textbf{w}_k$ fixed  
+2.2. Minimize $J$ with respect to $\textbf{w}_k$ while keeping $r_{nk}$ fixed
+
+$$ J = \sum\limits_{n=1}^{N} \sum\limits_{k=1}^{K} r_{nk} \left\|\textbf{x}_n - \textbf{w}_k\right\|^2 $$
+
+### Gaussian mixture model
+
+- Number of Gaussians $K$
+- Initialize priors ${\pi}_k$, centers $\textbf{w}_k$, covariances ${\Sigma}_k$ (by e.g. K-means)
+
+1. For all data points $\textbf{x}_n$
+2. E-Step: Update assignment probabilities  
+$$P_k(\textbf{x}_n) = \frac{{\pi}_k \mathcal{N}(\textbf{x}_n | \textbf{w}_k, {\Sigma}_k)}{\sum\limits_{i=1}^K {\pi}_k \mathcal{N}(\textbf{x}_n | \textbf{w}_i, {\Sigma}_i)}$$
+3. M-Step: Update all $k$ prior probabilites, cluster centers and covariances  
+$${\pi}_k = \frac{1}{N} \sum\limits_{n=1}^{N} P_k(\textbf{x}_n)$$
+$$ \textbf{w}_k = \frac{1}{N {\pi}_k} \sum\limits_{n=1}^{N} P_k(\textbf{x}_n) \textbf{x}_n $$
+$$ {\Sigma}_k = \frac{1}{N {\pi}_k} \sum\limits_{n=1}^{N} P_k(\textbf{x}_n) (\textbf{x}_n - \textbf{w}_k) (\textbf{x}_n - \textbf{w}_k)^T $$
+
 ## Definitions
 
 ### Likelihood
@@ -72,4 +259,21 @@ Function, describing the joint probability of the data $\textbf{x}$ as function
 
 ### Bayesian-approach
 
-Probabilistic model for the parameters, not the actual data.
\ No newline at end of file
+Probabilistic model for the parameters, not the actual data.
+
+### Classification approaches
+
+1. Find discriminant function $f(\textbf{x}) : x \rightarrow \{0,1\}$
+2. Bayesian approach  
+2.1 Model class-conditional likelihood $p(x | C_k)$  
+2.2 Model an a-priori probability distribution $p(C_k)$  
+2.3 Calculate the posterior $p(C_k | x)$
+3. Direct posterior modeling $p(C_k | x)$
+
+### Concept
+
+Trying to identify properties to which to identify the classes.
+
+### Hypothesis
+
+Function to identify if input is in class or not.
\ No newline at end of file
-- 
cgit v1.2.1