
Lecture Summary: Directed Graphical Models and Naive Bayes

Date: 2025.11.20
Topic: Parameter Reduction, Directed Graphical Models, Chain Rule, and Naive Bayes Classifier.


1. Motivation: The Need for Parameter Reduction

The lecture begins by reviewing Generative Methods using the Gaussian distribution.

  • The Problem: In high-dimensional settings (e.g., analyzing images or complex biological data), estimating the full Joint Probability Distribution is computationally expensive and data-intensive.
    • For a $D$-dimensional Multivariate Gaussian, we must estimate the mean vector \mu (D parameters) and the Covariance Matrix \Sigma (symmetric D \times D matrix).
    • The total number of parameters is roughly O(D^2), specifically D + \frac{D(D+1)}{2}.
    • For large D, this requires a massive amount of training data to avoid overfitting (the short calculation after this list shows how quickly the count grows).
  • The Solution: We use Prior Knowledge (domain knowledge) about the relationships between variables to reduce the number of parameters.
    • By assuming certain variables are independent, we can decompose the complex joint distribution into smaller, simpler conditional distributions.
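A minimal sketch of how quickly the full-Gaussian parameter count grows with D (the dimensions below are arbitrary examples, not values from the lecture):

```python
# Parameter count of a full D-dimensional multivariate Gaussian:
# D entries for the mean vector plus D(D+1)/2 for the symmetric covariance matrix.
def full_gaussian_params(D: int) -> int:
    return D + D * (D + 1) // 2

for D in (10, 100, 1000):
    print(D, full_gaussian_params(D))
# 10 -> 65, 100 -> 5150, 1000 -> 501500
```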

2. Directed Graphical Models (Bayesian Networks)

A Directed Graphical Model represents random variables as nodes in a graph, where a directed edge indicates that a variable depends directly on its parent; missing edges encode conditional independence assumptions.

Decomposition via Chain Rule

  • The chain rule first decomposes the joint probability exactly: P(x_1, ..., x_D) = \prod_{i=1}^{D} P(x_i | x_1, ..., x_{i-1})
  • The conditional independence assumptions encoded by the graph then let each factor drop all non-parent variables: P(x_1, ..., x_D) = \prod_{i=1}^{D} P(x_i | \text{parents}(x_i))
  • Example Structure: If x_1 has no parents and x_2, x_3 each depend only on x_1, the joint distribution splits into: P(x) = P(x_1)P(x_2|x_1)P(x_3|x_1)... (a quick parameter count for this example follows this list)
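Counting parameters for this three-variable example in the Gaussian case (an illustration, not worked out in the notes): a full 3-dimensional Gaussian needs 3 + \frac{3 \times 4}{2} = 9 parameters, while the factored form needs 2 for P(x_1) (mean and variance) plus 3 each for P(x_2|x_1) and P(x_3|x_1) (slope, intercept, variance), i.e. 2 + 3 + 3 = 8.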

Parameter Counting Example (Gaussian Case)

The lecture compares the number of parameters required for a "Full" Gaussian model vs. a "Reduced" Graphical Model.

  • Full Gaussian: Assumes all variables are correlated.
    • For a 10-dimensional vector (D=10), parameters = 10 + \frac{10 \times 11}{2} = 65.
  • Reduced Model: Uses a graph structure where variables are conditionally independent.
    • Instead of one giant covariance matrix, we estimate parameters for several smaller conditional distributions (often univariate Gaussians).
    • Calculation: For a conditional Gaussian P(x_i | \text{parents}(x_i)) with a linear mean, we need one coefficient per parent plus an intercept for the mean, and one variance parameter.
    • In the specific example provided, the parameter count dropped from 65 to 57. The reduction in this small example is modest, but for high-dimensional data with sparse connections it becomes drastic (a generic counting sketch follows this list).
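The exact graph from the lecture is not reproduced in these notes, so the sketch below uses a generic counting rule for linear-Gaussian conditionals (one coefficient per parent, plus an intercept and a variance per node) and an arbitrary sparse structure for illustration; only the fully connected count of 65 is taken from the lecture.

```python
# Parameter count for a linear-Gaussian Bayesian network:
# each node x_i ~ N(b_i + sum of w_ij * x_j over its parents, sigma_i^2)
# needs |parents| coefficients + 1 intercept + 1 variance.
def dag_gaussian_params(parents):  # parents: {node: list of parent nodes}
    return sum(len(p) + 2 for p in parents.values())

D = 10

# Fully connected DAG (x_i depends on all earlier variables) = full Gaussian.
full = {i: list(range(i)) for i in range(D)}
print(dag_gaussian_params(full))    # 65, matching 10 + 10*11/2

# Arbitrary sparse structure: each node keeps at most its two predecessors as parents.
sparse = {i: list(range(max(0, i - 2), i)) for i in range(D)}
print(dag_gaussian_params(sparse))  # 37
```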

3. The Naive Bayes Classifier

The Naive Bayes classifier is the most extreme (and popular) example of a Directed Graphical Model used for parameter reduction.

  • Assumption: Given the class label y, all input features x_1, ..., x_D are mutually independent.
  • Structure: The class y is the parent of all feature nodes x_i. There are no connections between the features themselves.
  • Formula: P(x|y) = P(x_1|y) P(x_2|y) \cdots P(x_D|y) = \prod_{d=1}^{D} P(x_d|y)
  • Advantage: We only need to estimate the distribution of each feature individually, rather than their complex joint interactions (the resulting classification rule is written out after this list).
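Although the notes stop at the likelihood factorization, the classification rule follows directly from Bayes' rule: P(y|x) \propto P(y) P(x|y), so the predicted class is \hat{y} = \arg\max_y P(y) \prod_{d=1}^{D} P(x_d|y).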

4. Application: Spam Classifier

The lecture applies the Naive Bayes framework to a discrete problem: classifying emails as Spam (y=1) or Not Spam (y=0).

Feature Engineering

  • Input: Emails with varying text lengths.
  • Transformation: A "Bag of Words" approach is used (a small vectorization sketch follows this list).
    1. Create a dictionary of N words (e.g., N=10,000).
    2. Represent each email as a fixed-length binary vector x \in \{0, 1\}^{10,000}.
    3. x_i = 1 if the $i$-th word appears in the email, 0 otherwise.
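A minimal sketch of the binary bag-of-words transformation; the tiny vocabulary and example email below are invented for illustration and stand in for the full N = 10,000-word dictionary.

```python
import numpy as np

# Toy vocabulary standing in for the 10,000-word dictionary.
vocab = ["free", "money", "meeting", "project", "winner"]
word_index = {w: i for i, w in enumerate(vocab)}

def email_to_binary_vector(text: str) -> np.ndarray:
    """x_i = 1 if the i-th dictionary word appears in the email, 0 otherwise."""
    x = np.zeros(len(vocab), dtype=np.int8)
    for word in text.lower().split():
        if word in word_index:
            x[word_index[word]] = 1
    return x

print(email_to_binary_vector("Free money for the winner"))  # [1 1 0 0 1]
```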

The "Curse of Dimensionality" (Without Naive Bayes)

  • Since the features are discrete (binary), we cannot use Gaussian distributions. We must use probability tables.
  • If we tried to model the full joint distribution P(x_1, ..., x_{10000} | y), we would need a probability table for every possible combination of words.
  • Parameter Count: the table would need 2^{10,000} entries per class, which is computationally impossible to store or estimate.
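For scale, 2^{10,000} \approx 10^{3,010}, which dwarfs the commonly cited estimate of roughly 10^{80} atoms in the observable universe; no amount of training data could populate such a table.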

Applying Naive Bayes

  • By assuming word independence given the class, we decompose the problem: P(x|y) \approx \prod_{i=1}^{10,000} P(x_i|y)
  • Parameter Estimation:
    • We only need to estimate P(x_i=1 | y=1) and P(x_i=1 | y=0) for each word.
    • This requires simply counting how often each word appears in Spam vs. Non-Spam emails; the estimation sketch after this list follows exactly this recipe.
  • Reduced Parameter Count:
    • Instead of 2^{10,000}, we need roughly 2 \times 10,000 parameters (one probability per word per class).
    • This transforms an impossible problem into a highly efficient and simple one.
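A minimal estimation and prediction sketch under the assumptions above: X is a binary email-word matrix as produced by the bag-of-words step, and the +1/+2 smoothing is only an implementation safeguard against zero counts, not something covered in the lecture.

```python
import numpy as np

def fit_naive_bayes(X: np.ndarray, y: np.ndarray):
    """Estimate P(y) and P(x_i = 1 | y) by counting word occurrences per class.

    X: (n_emails, n_words) binary matrix; y: (n_emails,) labels in {0, 1}.
    """
    prior = np.array([np.mean(y == c) for c in (0, 1)])       # P(y = c)
    cond = {}
    for c in (0, 1):
        Xc = X[y == c]
        # Fraction of class-c emails containing each word, with +1/+2 smoothing
        # (an implementation detail, not discussed in the lecture).
        cond[c] = (Xc.sum(axis=0) + 1) / (len(Xc) + 2)        # P(x_i = 1 | y = c)
    return prior, cond

def predict(x: np.ndarray, prior, cond) -> int:
    """Pick the class maximizing log P(y) + sum_i log P(x_i | y)."""
    scores = []
    for c in (0, 1):
        p = cond[c]
        log_likelihood = np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))
        scores.append(np.log(prior[c]) + log_likelihood)
    return int(np.argmax(scores))
```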

5. Summary

  • Generative Methods aim to model the underlying distribution P(x, y).
  • Graphical Models allow us to inject prior knowledge (independence assumptions) to make this feasible.
  • Naive Bayes assumes full conditional independence, reducing parameter estimation from exponential to linear complexity, making it ideal for high-dimensional discrete data like text classification.