
Study Guide: Learning in Generative Methods & Bayes Optimal Classifier

Date: 2025.11.13
Topic: Maximum Likelihood Estimation (MLE), Missing Data Handling, and Bayes Optimal Classifier.


1. Overview: Learning in Generative Methods

The fundamental goal of generative methods is to estimate the underlying distribution of the data. Unlike discriminative methods (e.g., Logistic Regression, SVM), which focus on finding a separating boundary, generative methods introduce a probabilistic model and learn its parameters.

  • Discriminative Model: Learns specific parameters (like w, b in linear models) to separate classes.
  • Generative Model: Learns parameters (like \mu, \Sigma in Gaussian models) that best describe how the data is distributed.

Why Gaussian?

The Gaussian distribution is the standard model for generative methods because of its mathematical convenience: both its conditional and marginal distributions are also Gaussian. This property simplifies probabilistic inference significantly.

[Image: 3D plot of a multivariate Gaussian distribution]
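For reference, these are the standard identities behind that claim (a brief recap; the block indices a and b are introduced here only for notation): if z = (z_a, z_b) is jointly Gaussian with mean (\mu_a, \mu_b) and covariance blocks \Sigma_{aa}, \Sigma_{ab}, \Sigma_{ba}, \Sigma_{bb}, then

    P(z_a) = \mathcal{N}(z_a | \mu_a, \Sigma_{aa})

    P(z_a | z_b) = \mathcal{N}(z_a | \mu_a + \Sigma_{ab}\Sigma_{bb}^{-1}(z_b - \mu_b),\; \Sigma_{aa} - \Sigma_{ab}\Sigma_{bb}^{-1}\Sigma_{ba})

Both results are again Gaussian, which is exactly the property exploited for classification and missing-data handling in Section 3.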


2. The Learning Process: Parameter Estimation

"Learning" in this context means finding the best parameters (\mu, \Sigma) for the Gaussian model given the training data.

Step 1: Define the Objective Function

We need a metric to evaluate how well our model fits the data. The core idea is Likelihood:

  • Goal: We want to assign high probability to the observed (empirical) data points.
  • Likelihood Function: For independent data points, the likelihood is the product of their individual probabilities:

    P(Z | \mu, \Sigma) = \prod_{i=1}^{N} P(z_i | \mu, \Sigma)

Step 2: Log-Likelihood (MLE)

Directly maximizing a product of many probabilities is inconvenient (it is hard to differentiate and prone to numerical underflow). Applying the logarithm converts the product into a sum, giving the Log-Likelihood function. Because the logarithm is monotonically increasing, this does not change the location of the maximum.

  • Objective: Maximize \sum_{i=1}^{N} \ln P(z_i | \mu, \Sigma).
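For a d-dimensional Gaussian, this objective expands into the standard form below (stated here as a reference for the derivation in Step 3):

    \sum_{i=1}^{N} \ln P(z_i | \mu, \Sigma) = -\frac{Nd}{2}\ln(2\pi) - \frac{N}{2}\ln|\Sigma| - \frac{1}{2}\sum_{i=1}^{N}(z_i - \mu)^T \Sigma^{-1} (z_i - \mu)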

Step 3: Optimization (Derivation)

We calculate the partial derivatives of the log-likelihood function with respect to the parameters and set them to zero to find the maximum.

  • Optimal Mean (\hat{\mu}): The derivation yields the Empirical Mean. It is simply the average of the data points.

    \hat{\mu} = \frac{1}{N} \sum_{i=1}^{N} z_i
  • Optimal Covariance (\hat{\Sigma}): The derivation yields the Empirical Covariance.

    \hat{\Sigma} = \frac{1}{N} \sum_{i=1}^{N} (z_i - \hat{\mu})(z_i - \hat{\mu})^T

Conclusion: The "learning" for a Gaussian generative model is simply calculating the sample mean and sample covariance of the training data. This is a closed-form solution, meaning no iterative updates are strictly necessary.
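As a quick sanity check, here is a minimal NumPy sketch of this closed-form solution (the data, shapes, and variable names are illustrative, not from the lecture):

```python
import numpy as np

# Illustrative data: N samples of a d-dimensional variable z (values are made up).
rng = np.random.default_rng(0)
Z = rng.normal(size=(500, 3))
N = Z.shape[0]

# Closed-form MLE for the Gaussian: empirical mean and empirical covariance.
mu_hat = Z.mean(axis=0)                    # (1/N) * sum_i z_i
centered = Z - mu_hat
Sigma_hat = (centered.T @ centered) / N    # (1/N) * sum_i (z_i - mu)(z_i - mu)^T

# Sanity check against NumPy's own estimator (bias=True also divides by N).
assert np.allclose(Sigma_hat, np.cov(Z, rowvar=False, bias=True))
```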


3. Inference: Making Predictions

Once the joint distribution P(z) (where z contains both input features x and class labels y) is learned, we can perform inference.

Classification

To classify a new data point x_{new}:

  1. We aim to calculate the conditional probability P(y | x_{new}).
  2. Using the properties of the multivariate Gaussian, we treat the label y as just another dimension in the random vector.
  3. We calculate probabilities for each class and compare them (e.g., P(y=1 | x) vs P(y=0 | x)).
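One concrete way to realize this comparison is sketched below. Note that it fits one Gaussian per class and applies Bayes' rule, a common variant rather than necessarily the single joint-Gaussian formulation described above; all data and names are illustrative.

```python
import numpy as np
from scipy.stats import multivariate_normal

def fit_gaussian(X):
    # Closed-form MLE from Section 2: empirical mean and covariance (divide by N).
    return X.mean(axis=0), np.cov(X, rowvar=False, bias=True)

def p_class1_given_x(x_new, X0, X1):
    # Priors from class frequencies, likelihoods from the fitted Gaussians,
    # combined with Bayes' rule to obtain the posterior P(y=1 | x_new).
    n0, n1 = len(X0), len(X1)
    mu0, S0 = fit_gaussian(X0)
    mu1, S1 = fit_gaussian(X1)
    p0 = (n0 / (n0 + n1)) * multivariate_normal.pdf(x_new, mean=mu0, cov=S0)
    p1 = (n1 / (n0 + n1)) * multivariate_normal.pdf(x_new, mean=mu1, cov=S1)
    return p1 / (p0 + p1)

# Toy 2-D data for two classes.
rng = np.random.default_rng(0)
X0 = rng.normal(loc=[0.0, 0.0], size=(200, 2))
X1 = rng.normal(loc=[2.0, 2.0], size=(200, 2))
print(p_class1_given_x(np.array([1.5, 1.5]), X0, X1))
```

Predicting class 1 whenever the returned posterior is at least 0.5 reproduces the comparison in step 3.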

Handling Missing Data

Generative models offer a theoretically robust way to handle missing variables.

  • Scenario: We have inputs x = [x_1, x_2], but x_2 is missing during inference.
  • Method: Marginalization.
    1. Start with the Joint PDF.
    2. Integrate (marginalize) out the missing variable x_2:

       P(y | x_1) = \frac{\int P(x_1, x_2, y) \, dx_2}{P(x_1)}
    3. Because the model is Gaussian, marginalization is trivial: simply select the sub-vector and sub-matrix corresponding to the observed variables.
  • This is superior to heuristic methods like imputing the mean.
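A minimal sketch of the "select the sub-vector and sub-matrix" step, assuming NumPy and a joint Gaussian whose parameters have already been estimated (the index layout and numbers below are illustrative):

```python
import numpy as np

def marginalize(mu, Sigma, observed_idx):
    """Marginal of a joint Gaussian over the observed dimensions.

    For a Gaussian, integrating out the missing dimensions reduces to
    keeping the matching entries of mu and the matching block of Sigma.
    """
    idx = np.asarray(observed_idx)
    return mu[idx], Sigma[np.ix_(idx, idx)]

# Example: joint Gaussian over (x_1, x_2, y); x_2 (index 1) is missing at inference.
mu = np.array([0.0, 1.0, 0.5])
Sigma = np.array([[1.0, 0.3, 0.2],
                  [0.3, 1.0, 0.1],
                  [0.2, 0.1, 1.0]])
mu_obs, Sigma_obs = marginalize(mu, Sigma, observed_idx=[0, 2])  # keep x_1 and y
```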

4. Bayes Optimal Classifier

The lecture introduces the concept of the theoretical "perfect" classifier.

  • Definition: The Bayes Optimal Classifier is the ideal classifier that would exist if we knew the true underlying distribution of the data.
  • Decision Rule: It assigns the class with the highest posterior probability P(C_k | x):

    P(C_1 | x_{new}) \ge P(C_2 | x_{new}) \rightarrow \text{Class 1}

Bayes Error

  • Even the optimal classifier has an irreducible error called the Bayes Error.
  • Cause: Classes often overlap in the feature space. In the overlapping regions, even the best decision rule will make mistakes with some probability.
  • Implication: No machine learning algorithm can genuinely achieve 0% error (100% accuracy) on non-trivial problems; the best any model can do is approach the Bayes Error limit.
  • Mathematical Definition: The error is the integral of the smaller posterior probability, weighted by the data density p(x), over the feature space:

    \text{Error} = \int \min\left[P(C_1|x),\, P(C_2|x)\right] p(x) \, dx
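As an illustration, the sketch below numerically approximates the Bayes error for two overlapping one-dimensional classes with made-up parameters; it integrates the smaller of the two joint densities p(x, C_k) = P(C_k|x) p(x), which is equivalent to the definition above:

```python
import numpy as np
from scipy.stats import norm

# Two overlapping 1-D classes with equal priors (made-up parameters).
prior1, prior2 = 0.5, 0.5
joint1 = lambda x: prior1 * norm.pdf(x, loc=0.0, scale=1.0)   # p(x, C1) = P(C1|x) p(x)
joint2 = lambda x: prior2 * norm.pdf(x, loc=2.0, scale=1.0)   # p(x, C2) = P(C2|x) p(x)

# Bayes error: integrate the smaller of the two joint densities over x
# (a simple Riemann sum is enough for this 1-D illustration).
xs = np.linspace(-10.0, 12.0, 100_001)
dx = xs[1] - xs[0]
bayes_error = np.sum(np.minimum(joint1(xs), joint2(xs))) * dx
print(f"Estimated Bayes error: {bayes_error:.4f}")   # ~0.159 for this setup
```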