# Study Guide: Learning in Generative Methods & Bayes Optimal Classifier

**Date:** 2025.11.13

**Topic:** Maximum Likelihood Estimation (MLE), Missing Data Handling, and Bayes Optimal Classifier.

---

### **1. Overview: Learning in Generative Methods**

The fundamental goal of generative methods is to **estimate the underlying distribution** of the data. Unlike discriminative methods (e.g., Logistic Regression, SVM) which focus on finding a separating boundary, generative methods introduce a probabilistic model and learn its parameters.

* **Discriminative Model:** Learns specific parameters (like $w, b$ in linear models) to separate classes.
* **Generative Model:** Learns parameters (like $\mu, \Sigma$ in Gaussian models) that best describe how the data is distributed.

#### **Why Gaussian?**

The Gaussian distribution is the standard model for generative methods because of its mathematical convenience: **both its conditional and marginal distributions are also Gaussian**. This property simplifies probabilistic inference significantly.
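
To make this closure property concrete, here is a minimal NumPy sketch (not from the lecture; the function name and index layout are illustrative assumptions) of the standard partitioned-Gaussian identities: conditioning a joint Gaussian $z = [z_a, z_b]$ on $z_b$ again yields a Gaussian, and marginalizing simply keeps a sub-vector and sub-matrix.

```python
import numpy as np

def conditional_gaussian(mu, Sigma, idx_a, idx_b, z_b_value):
    """Parameters of P(z_a | z_b = z_b_value) for a joint Gaussian N(mu, Sigma).

    Standard partitioned-Gaussian identities (illustrative helper, not lecture code):
        mu_{a|b}    = mu_a + Sigma_ab @ inv(Sigma_bb) @ (z_b - mu_b)
        Sigma_{a|b} = Sigma_aa - Sigma_ab @ inv(Sigma_bb) @ Sigma_ba
    """
    mu, Sigma = np.asarray(mu, dtype=float), np.asarray(Sigma, dtype=float)
    S_aa = Sigma[np.ix_(idx_a, idx_a)]
    S_ab = Sigma[np.ix_(idx_a, idx_b)]
    S_bb = Sigma[np.ix_(idx_b, idx_b)]
    gain = S_ab @ np.linalg.inv(S_bb)   # maps deviations in z_b to deviations in z_a
    mu_cond = mu[idx_a] + gain @ (np.asarray(z_b_value, dtype=float) - mu[idx_b])
    Sigma_cond = S_aa - gain @ S_ab.T
    return mu_cond, Sigma_cond

# Marginalization is even simpler: P(z_a) = N(mu[idx_a], Sigma[idx_a][:, idx_a]),
# i.e. just keep the sub-vector and sub-matrix of the retained coordinates.
```
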

*[Image: 3-D plot of a multivariate Gaussian distribution]*

---

### **2. The Learning Process: Parameter Estimation**

"Learning" in this context means finding the best parameters ($\mu, \Sigma$) for the Gaussian model given the training data.

#### **Step 1: Define the Objective Function**

We need a metric to evaluate how well our model fits the data. The core idea is **Likelihood**:

* **Goal:** We want the model to assign **high probability** to the observed (empirical) data points.
* **Likelihood Function:** For independent data points, the likelihood is the product of their individual probabilities:

$$P(Z | \mu, \Sigma) = \prod_{i=1}^{N} P(z_i | \mu, \Sigma)$$

#### **Step 2: Log-Likelihood (MLE)**

Directly maximizing the product is difficult. We therefore apply the **logarithm** to convert the product into a sum, giving the **Log-Likelihood** function. Because the logarithm is monotonically increasing, this does not change the location of the maximum.

* **Objective:** Maximize $\sum_{i=1}^{N} \ln P(z_i | \mu, \Sigma)$.
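
As a quick illustration (not from the lecture), the log-likelihood objective can be evaluated directly with SciPy; `Z` is assumed to be an `N x d` array whose rows are the data points $z_i$.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gaussian_log_likelihood(Z, mu, Sigma):
    """Log-likelihood sum_i ln N(z_i | mu, Sigma) for the rows z_i of the N x d array Z."""
    return float(np.sum(multivariate_normal(mean=mu, cov=Sigma).logpdf(Z)))

# MLE picks the (mu, Sigma) that maximize this value; Step 3 shows that the
# maximizer has a closed form, so no numerical optimizer is actually needed.
```
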

#### **Step 3: Optimization (Derivation)**

We calculate the partial derivatives of the log-likelihood function with respect to the parameters and set them to zero to find the maximum.

* **Optimal Mean ($\hat{\mu}$):** The derivation yields the **Empirical Mean**, which is simply the average of the data points.

$$\hat{\mu} = \frac{1}{N} \sum_{i=1}^{N} z_i$$

* **Optimal Covariance ($\hat{\Sigma}$):** The derivation yields the **Empirical Covariance**.

$$\hat{\Sigma} = \frac{1}{N} \sum_{i=1}^{N} (z_i - \hat{\mu})(z_i - \hat{\mu})^T$$

**Conclusion:** The "learning" for a Gaussian generative model is simply calculating the sample mean and sample covariance of the training data. This is a closed-form solution, meaning no iterative updates are strictly necessary.
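
A minimal NumPy sketch of this closed-form fit (illustrative, not lecture code; `Z` is assumed to be an `N x d` array whose rows are the training points):

```python
import numpy as np

def fit_gaussian_mle(Z):
    """Closed-form MLE for a Gaussian: empirical mean and empirical covariance.

    Note the 1/N normalization (the MLE), not the 1/(N-1) unbiased estimate
    that np.cov uses by default.
    """
    Z = np.asarray(Z, dtype=float)
    N = Z.shape[0]
    mu_hat = Z.mean(axis=0)
    centered = Z - mu_hat
    Sigma_hat = centered.T @ centered / N
    return mu_hat, Sigma_hat
```
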

---

### **3. Inference: Making Predictions**

Once the joint distribution $P(z)$ (where $z$ contains both the input features $x$ and the class label $y$) is learned, we can perform inference.

#### **Classification**

To classify a new data point $x_{new}$ (see the sketch after this list):

1. We aim to calculate the conditional probability $P(y | x_{new})$.
2. Using the properties of the multivariate Gaussian, we treat the label $y$ as just another dimension in the random vector.
3. We calculate probabilities for each class and compare them (e.g., $P(y=1 | x)$ vs $P(y=0 | x)$).
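
A minimal sketch of this idea (my construction; the function name and the convention that the label is the last coordinate are assumptions): fit the joint Gaussian over $z = [x, y]$, e.g. with the MLE sketch above, condition on the observed $x$, and compare the resulting density of the label dimension at $y = 1$ and $y = 0$.

```python
import numpy as np

def classify_from_joint(x_new, mu, Sigma):
    """Condition the joint Gaussian over z = [x, y] on x, then compare y = 1 vs y = 0.

    Assumes the label y was appended as the LAST coordinate of z when the joint
    Gaussian (mu, Sigma) was fit; all names here are illustrative.
    """
    mu, Sigma = np.asarray(mu, dtype=float), np.asarray(Sigma, dtype=float)
    x_new = np.asarray(x_new, dtype=float)
    d = mu.shape[0]
    ix, iy = np.arange(d - 1), np.array([d - 1])
    S_xx = Sigma[np.ix_(ix, ix)]
    S_yx = Sigma[np.ix_(iy, ix)]
    S_yy = Sigma[np.ix_(iy, iy)]
    gain = S_yx @ np.linalg.inv(S_xx)
    m = (mu[iy] + gain @ (x_new - mu[ix]))[0]   # conditional mean of the label coordinate
    v = (S_yy - gain @ S_yx.T)[0, 0]            # conditional variance of the label coordinate
    # Compare the conditional density at y = 1 and y = 0; the shared
    # normalizing constant cancels, so plain exponentials suffice.
    p1 = np.exp(-(1.0 - m) ** 2 / (2.0 * v))
    p0 = np.exp(-(0.0 - m) ** 2 / (2.0 * v))
    return 1 if p1 >= p0 else 0
```
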

#### **Handling Missing Data**

Generative models offer a principled way to handle missing variables.

* **Scenario:** We have inputs $x = [x_1, x_2]$, but $x_2$ is missing during inference.
* **Method:** **Marginalization** (see the sketch after this list).
  1. Start with the joint PDF.
  2. Integrate (marginalize) out the missing variable $x_2$:

     $$P(y | x_1) = \frac{\int P(x_1, x_2, y) \, dx_2}{P(x_1)}$$

  3. Because the model is Gaussian, marginalization is trivial: simply select the sub-vector of $\hat{\mu}$ and sub-matrix of $\hat{\Sigma}$ corresponding to the observed variables.
* This is superior to heuristic methods like imputing the mean.
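
A minimal sketch of that selection step (illustrative names; `mu_hat` and `Sigma_hat` are assumed to come from the MLE fit above):

```python
import numpy as np

def marginalize(mu, Sigma, observed_idx):
    """Marginal Gaussian over the observed coordinates.

    For a Gaussian, integrating out the missing variables reduces to keeping
    the matching sub-vector of mu and sub-matrix of Sigma.
    """
    observed_idx = np.asarray(observed_idx)
    mu, Sigma = np.asarray(mu, dtype=float), np.asarray(Sigma, dtype=float)
    return mu[observed_idx], Sigma[np.ix_(observed_idx, observed_idx)]

# Example: z = [x1, x2, y] with x2 missing at prediction time.
# Classification then runs exactly as before, but on the reduced joint over
# [x1, y] instead of [x1, x2, y]:
# mu_red, Sigma_red = marginalize(mu_hat, Sigma_hat, observed_idx=[0, 2])
```
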

---

### **4. Bayes Optimal Classifier**

The lecture introduces the concept of the theoretical "perfect" classifier.

* **Definition:** The **Bayes Optimal Classifier** is the ideal classifier that would exist if we knew the *true* underlying distribution of the data.
* **Decision Rule:** It assigns the class with the highest posterior probability $P(C_k | x)$:

$$P(C_1 | x_{new}) \ge P(C_2 | x_{new}) \;\Rightarrow\; \text{predict Class 1}$$

#### **Bayes Error**

* Even the optimal classifier has an irreducible error called the **Bayes Error**.
* **Cause:** Classes often overlap in the feature space. In the overlapping regions, even the best decision rule will make mistakes with some probability.
* **Implication:** No machine learning algorithm can genuinely achieve 0% error (100% accuracy) on non-trivial problems. The goal of ML is to approach the Bayes Error limit.
* **Mathematical Definition:** The error is the expected value of the smaller posterior probability, i.e. the minimum posterior integrated against the data density $p(x)$ (a numerical sketch follows this list):

$$\text{Error} = \int \min\big[P(C_1|x),\, P(C_2|x)\big]\, p(x)\, dx$$
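
To make the idea concrete, here is a small numerical sketch (illustrative parameters, not from the lecture): two overlapping one-dimensional Gaussian class conditionals with equal priors, for which the Bayes error comes out to roughly 15.9%.

```python
import numpy as np
from scipy.stats import norm

# Two overlapping 1-D class conditionals with equal priors (illustrative values).
prior_1, prior_2 = 0.5, 0.5
p_x_given_c1 = norm(loc=-1.0, scale=1.0)
p_x_given_c2 = norm(loc=+1.0, scale=1.0)

# Bayes error = E_x[min posterior] = integral of min[P(C1) p(x|C1), P(C2) p(x|C2)] dx,
# approximated here by a Riemann sum on a fine grid.
x = np.linspace(-10.0, 10.0, 20001)
dx = x[1] - x[0]
overlap = np.minimum(prior_1 * p_x_given_c1.pdf(x), prior_2 * p_x_given_c2.pdf(x))
print(f"Estimated Bayes error: {overlap.sum() * dx:.4f}")  # ~0.1587 for these settings
```
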