# Study Guide: Learning in Generative Methods & Bayes Optimal Classifier

**Date:** 2025.11.13

**Topic:** Maximum Likelihood Estimation (MLE), Missing Data Handling, and Bayes Optimal Classifier.

---

### **1. Overview: Learning in Generative Methods**

The fundamental goal of generative methods is to **estimate the underlying distribution** of the data. Unlike discriminative methods (e.g., Logistic Regression, SVM) which focus on finding a separating boundary, generative methods introduce a probabilistic model and learn its parameters.

* **Discriminative Model:** Learns specific parameters (like $w, b$ in linear models) to separate classes.
* **Generative Model:** Learns parameters (like $\mu, \Sigma$ in Gaussian models) that best describe how the data is distributed.

#### **Why Gaussian?**

The Gaussian distribution is the standard model for generative methods because of its mathematical convenience: **both its conditional and marginal distributions are also Gaussian**. This property simplifies probabilistic inference significantly.
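
To make this closure property concrete, here is a minimal NumPy sketch (not from the lecture; the function name and index layout are illustrative assumptions) of the standard partitioned-Gaussian identities: conditioning a joint Gaussian $z = [z_a, z_b]$ on $z_b$ again yields a Gaussian, and marginalizing simply keeps a sub-vector and sub-matrix.

```python
import numpy as np

def conditional_gaussian(mu, Sigma, idx_a, idx_b, z_b_value):
    """Parameters of P(z_a | z_b = z_b_value) for a joint Gaussian N(mu, Sigma).

    Standard partitioned-Gaussian identities (illustrative helper, not lecture code):
        mu_{a|b}    = mu_a + Sigma_ab @ inv(Sigma_bb) @ (z_b - mu_b)
        Sigma_{a|b} = Sigma_aa - Sigma_ab @ inv(Sigma_bb) @ Sigma_ba
    """
    mu, Sigma = np.asarray(mu, dtype=float), np.asarray(Sigma, dtype=float)
    S_aa = Sigma[np.ix_(idx_a, idx_a)]
    S_ab = Sigma[np.ix_(idx_a, idx_b)]
    S_bb = Sigma[np.ix_(idx_b, idx_b)]
    gain = S_ab @ np.linalg.inv(S_bb)   # maps deviations in z_b to deviations in z_a
    mu_cond = mu[idx_a] + gain @ (np.asarray(z_b_value, dtype=float) - mu[idx_b])
    Sigma_cond = S_aa - gain @ S_ab.T
    return mu_cond, Sigma_cond

# Marginalization is even simpler: P(z_a) = N(mu[idx_a], Sigma[idx_a][:, idx_a]),
# i.e. just keep the sub-vector and sub-matrix of the retained coordinates.
```
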

*[Image: 3-D plot of a multivariate Gaussian distribution]*

---

### **2. The Learning Process: Parameter Estimation**

"Learning" in this context means finding the best parameters ($\mu, \Sigma$) for the Gaussian model given the training data.

#### **Step 1: Define the Objective Function**

We need a metric to evaluate how well our model fits the data. The core idea is **Likelihood**:

* **Goal:** We want the model to assign **high probability** to the observed (empirical) data points.
* **Likelihood Function:** For independent data points, the likelihood is the product of their individual probabilities:

$$P(Z | \mu, \Sigma) = \prod_{i=1}^{N} P(z_i | \mu, \Sigma)$$

#### **Step 2: Log-Likelihood (MLE)**

Directly maximizing the product is difficult. We therefore apply the **logarithm** to convert the product into a sum, giving the **Log-Likelihood** function. Because the logarithm is monotonically increasing, this does not change the location of the maximum.

* **Objective:** Maximize $\sum_{i=1}^{N} \ln P(z_i | \mu, \Sigma)$.
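
As a quick illustration (not from the lecture), the log-likelihood objective can be evaluated directly with SciPy; `Z` is assumed to be an `N x d` array whose rows are the data points $z_i$.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gaussian_log_likelihood(Z, mu, Sigma):
    """Log-likelihood sum_i ln N(z_i | mu, Sigma) for the rows z_i of the N x d array Z."""
    return float(np.sum(multivariate_normal(mean=mu, cov=Sigma).logpdf(Z)))

# MLE picks the (mu, Sigma) that maximize this value; Step 3 shows that the
# maximizer has a closed form, so no numerical optimizer is actually needed.
```
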

#### **Step 3: Optimization (Derivation)**

We calculate the partial derivatives of the log-likelihood function with respect to the parameters and set them to zero to find the maximum.

* **Optimal Mean ($\hat{\mu}$):** The derivation yields the **Empirical Mean**, which is simply the average of the data points.

$$\hat{\mu} = \frac{1}{N} \sum_{i=1}^{N} z_i$$

* **Optimal Covariance ($\hat{\Sigma}$):** The derivation yields the **Empirical Covariance**.

$$\hat{\Sigma} = \frac{1}{N} \sum_{i=1}^{N} (z_i - \hat{\mu})(z_i - \hat{\mu})^T$$

**Conclusion:** The "learning" for a Gaussian generative model is simply calculating the sample mean and sample covariance of the training data. This is a closed-form solution, meaning no iterative updates are strictly necessary.
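
A minimal NumPy sketch of this closed-form fit (illustrative, not lecture code; `Z` is assumed to be an `N x d` array whose rows are the training points):

```python
import numpy as np

def fit_gaussian_mle(Z):
    """Closed-form MLE for a Gaussian: empirical mean and empirical covariance.

    Note the 1/N normalization (the MLE), not the 1/(N-1) unbiased estimate
    that np.cov uses by default.
    """
    Z = np.asarray(Z, dtype=float)
    N = Z.shape[0]
    mu_hat = Z.mean(axis=0)
    centered = Z - mu_hat
    Sigma_hat = centered.T @ centered / N
    return mu_hat, Sigma_hat
```
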

---

### **3. Inference: Making Predictions**

Once the joint distribution $P(z)$ (where $z$ contains both the input features $x$ and the class label $y$) is learned, we can perform inference.

#### **Classification**

To classify a new data point $x_{new}$ (see the sketch after this list):

1. We aim to calculate the conditional probability $P(y | x_{new})$.
2. Using the properties of the multivariate Gaussian, we treat the label $y$ as just another dimension in the random vector.
3. We calculate probabilities for each class and compare them (e.g., $P(y=1 | x)$ vs $P(y=0 | x)$).
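
A minimal sketch of this idea (my construction; the function name and the convention that the label is the last coordinate are assumptions): fit the joint Gaussian over $z = [x, y]$, e.g. with the MLE sketch above, condition on the observed $x$, and compare the resulting density of the label dimension at $y = 1$ and $y = 0$.

```python
import numpy as np

def classify_from_joint(x_new, mu, Sigma):
    """Condition the joint Gaussian over z = [x, y] on x, then compare y = 1 vs y = 0.

    Assumes the label y was appended as the LAST coordinate of z when the joint
    Gaussian (mu, Sigma) was fit; all names here are illustrative.
    """
    mu, Sigma = np.asarray(mu, dtype=float), np.asarray(Sigma, dtype=float)
    x_new = np.asarray(x_new, dtype=float)
    d = mu.shape[0]
    ix, iy = np.arange(d - 1), np.array([d - 1])
    S_xx = Sigma[np.ix_(ix, ix)]
    S_yx = Sigma[np.ix_(iy, ix)]
    S_yy = Sigma[np.ix_(iy, iy)]
    gain = S_yx @ np.linalg.inv(S_xx)
    m = (mu[iy] + gain @ (x_new - mu[ix]))[0]   # conditional mean of the label coordinate
    v = (S_yy - gain @ S_yx.T)[0, 0]            # conditional variance of the label coordinate
    # Compare the conditional density at y = 1 and y = 0; the shared
    # normalizing constant cancels, so plain exponentials suffice.
    p1 = np.exp(-(1.0 - m) ** 2 / (2.0 * v))
    p0 = np.exp(-(0.0 - m) ** 2 / (2.0 * v))
    return 1 if p1 >= p0 else 0
```
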

#### **Handling Missing Data**

Generative models offer a principled way to handle missing variables.

* **Scenario:** We have inputs $x = [x_1, x_2]$, but $x_2$ is missing during inference.
* **Method:** **Marginalization** (see the sketch after this list).
  1. Start with the joint PDF.
  2. Integrate (marginalize) out the missing variable $x_2$:

     $$P(y | x_1) = \frac{\int P(x_1, x_2, y) \, dx_2}{P(x_1)}$$

  3. Because the model is Gaussian, marginalization is trivial: simply select the sub-vector of $\hat{\mu}$ and sub-matrix of $\hat{\Sigma}$ corresponding to the observed variables.
* This is superior to heuristic methods like imputing the mean.
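
A minimal sketch of that selection step (illustrative names; `mu_hat` and `Sigma_hat` are assumed to come from the MLE fit above):

```python
import numpy as np

def marginalize(mu, Sigma, observed_idx):
    """Marginal Gaussian over the observed coordinates.

    For a Gaussian, integrating out the missing variables reduces to keeping
    the matching sub-vector of mu and sub-matrix of Sigma.
    """
    observed_idx = np.asarray(observed_idx)
    mu, Sigma = np.asarray(mu, dtype=float), np.asarray(Sigma, dtype=float)
    return mu[observed_idx], Sigma[np.ix_(observed_idx, observed_idx)]

# Example: z = [x1, x2, y] with x2 missing at prediction time.
# Classification then runs exactly as before, but on the reduced joint over
# [x1, y] instead of [x1, x2, y]:
# mu_red, Sigma_red = marginalize(mu_hat, Sigma_hat, observed_idx=[0, 2])
```
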

---

### **4. Bayes Optimal Classifier**

The lecture introduces the concept of the theoretical "perfect" classifier.

* **Definition:** The **Bayes Optimal Classifier** is the ideal classifier that would exist if we knew the *true* underlying distribution of the data.
* **Decision Rule:** It assigns the class with the highest posterior probability $P(C_k | x)$:

$$P(C_1 | x_{new}) \ge P(C_2 | x_{new}) \;\Rightarrow\; \text{predict Class 1}$$

#### **Bayes Error**

* Even the optimal classifier has an irreducible error called the **Bayes Error**.
* **Cause:** Classes often overlap in the feature space. In the overlapping regions, even the best decision rule will make mistakes with some probability.
* **Implication:** No machine learning algorithm can genuinely achieve 0% error (100% accuracy) on non-trivial problems. The goal of ML is to approach the Bayes Error limit.
* **Mathematical Definition:** The error is the expected value of the smaller posterior probability, i.e. the minimum posterior integrated against the data density $p(x)$ (a numerical sketch follows this list):

$$\text{Error} = \int \min\big[P(C_1|x),\, P(C_2|x)\big]\, p(x)\, dx$$
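
To make the idea concrete, here is a small numerical sketch (illustrative parameters, not from the lecture): two overlapping one-dimensional Gaussian class conditionals with equal priors, for which the Bayes error comes out to roughly 15.9%.

```python
import numpy as np
from scipy.stats import norm

# Two overlapping 1-D class conditionals with equal priors (illustrative values).
prior_1, prior_2 = 0.5, 0.5
p_x_given_c1 = norm(loc=-1.0, scale=1.0)
p_x_given_c2 = norm(loc=+1.0, scale=1.0)

# Bayes error = E_x[min posterior] = integral of min[P(C1) p(x|C1), P(C2) p(x|C2)] dx,
# approximated here by a Riemann sum on a fine grid.
x = np.linspace(-10.0, 10.0, 20001)
dx = x[1] - x[0]
overlap = np.minimum(prior_1 * p_x_given_c1.pdf(x), prior_2 * p_x_given_c2.pdf(x))
print(f"Estimated Bayes error: {overlap.sum() * dx:.4f}")  # ~0.1587 for these settings
```
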