
Lecture Summary: Directed Graphical Models and Naive Bayes

Date: 2025.11.20
Topic: Parameter Reduction, Directed Graphical Models, Chain Rule, and Naive Bayes Classifier.


1. Motivation: The Need for Parameter Reduction

The lecture begins by reviewing Generative Methods using the Gaussian distribution.

  • The Problem: In high-dimensional settings (e.g., analyzing images or complex biological data), estimating the full Joint Probability Distribution is computationally expensive and data-intensive.
    • For a $D$-dimensional Multivariate Gaussian, we must estimate the mean vector \mu (D parameters) and the Covariance Matrix \Sigma (symmetric D \times D matrix).
    • The total number of parameters is roughly O(D^2), specifically D + \frac{D(D+1)}{2}.
    • For large D, this requires a massive amount of training data to avoid overfitting (the short calculation after this list shows how quickly the count grows).
  • The Solution: We use Prior Knowledge (domain knowledge) about the relationships between variables to reduce the number of parameters.
    • By assuming certain variables are independent, we can decompose the complex joint distribution into smaller, simpler conditional distributions.
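A minimal sketch of how quickly the full-Gaussian parameter count grows with D (the dimensions below are arbitrary examples, not values from the lecture):

```python
# Parameter count of a full D-dimensional multivariate Gaussian:
# D entries for the mean vector plus D(D+1)/2 for the symmetric covariance matrix.
def full_gaussian_params(D: int) -> int:
    return D + D * (D + 1) // 2

for D in (10, 100, 1000):
    print(D, full_gaussian_params(D))
# 10 -> 65, 100 -> 5150, 1000 -> 501500
```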

2. Directed Graphical Models (Bayesian Networks)

A Directed Graphical Model represents random variables as nodes in a graph, where a directed edge indicates that a variable depends directly on its parent; missing edges encode conditional independence assumptions.

Decomposition via Chain Rule

  • The chain rule first decomposes the joint probability exactly: P(x_1, ..., x_D) = \prod_{i=1}^{D} P(x_i | x_1, ..., x_{i-1})
  • The conditional independence assumptions encoded by the graph then let each factor drop all non-parent variables: P(x_1, ..., x_D) = \prod_{i=1}^{D} P(x_i | \text{parents}(x_i))
  • Example Structure: If x_1 has no parents and x_2, x_3 each depend only on x_1, the joint distribution splits into: P(x) = P(x_1)P(x_2|x_1)P(x_3|x_1)... (a quick parameter count for this example follows this list)
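Counting parameters for this three-variable example in the Gaussian case (an illustration, not worked out in the notes): a full 3-dimensional Gaussian needs 3 + \frac{3 \times 4}{2} = 9 parameters, while the factored form needs 2 for P(x_1) (mean and variance) plus 3 each for P(x_2|x_1) and P(x_3|x_1) (slope, intercept, variance), i.e. 2 + 3 + 3 = 8.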

Parameter Counting Example (Gaussian Case)

The lecture compares the number of parameters required for a "Full" Gaussian model vs. a "Reduced" Graphical Model.

  • Full Gaussian: Assumes all variables are correlated.
    • For a 10-dimensional vector (D=10), parameters = 10 + \frac{10 \times 11}{2} = 65.
  • Reduced Model: Uses a graph structure where variables are conditionally independent.
    • Instead of one giant covariance matrix, we estimate parameters for several smaller conditional distributions (often univariate Gaussians).
    • Calculation: For a conditional Gaussian P(x_i | \text{parents}(x_i)) with a linear mean, we need one coefficient per parent plus an intercept for the mean, and one variance parameter.
    • In the specific example provided, the parameter count dropped from 65 to 57. The reduction in this small example is modest, but for high-dimensional data with sparse connections it becomes drastic (a generic counting sketch follows this list).
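The exact graph from the lecture is not reproduced in these notes, so the sketch below uses a generic counting rule for linear-Gaussian conditionals (one coefficient per parent, plus an intercept and a variance per node) and an arbitrary sparse structure for illustration; only the fully connected count of 65 is taken from the lecture.

```python
# Parameter count for a linear-Gaussian Bayesian network:
# each node x_i ~ N(b_i + sum of w_ij * x_j over its parents, sigma_i^2)
# needs |parents| coefficients + 1 intercept + 1 variance.
def dag_gaussian_params(parents):  # parents: {node: list of parent nodes}
    return sum(len(p) + 2 for p in parents.values())

D = 10

# Fully connected DAG (x_i depends on all earlier variables) = full Gaussian.
full = {i: list(range(i)) for i in range(D)}
print(dag_gaussian_params(full))    # 65, matching 10 + 10*11/2

# Arbitrary sparse structure: each node keeps at most its two predecessors as parents.
sparse = {i: list(range(max(0, i - 2), i)) for i in range(D)}
print(dag_gaussian_params(sparse))  # 37
```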

3. The Naive Bayes Classifier

The Naive Bayes classifier is the most extreme (and popular) example of a Directed Graphical Model used for parameter reduction.

  • Assumption: Given the class label y, all input features x_1, ..., x_D are mutually independent.
  • Structure: The class y is the parent of all feature nodes x_i. There are no connections between the features themselves.
  • Formula: P(x|y) = P(x_1|y) P(x_2|y) \cdots P(x_D|y) = \prod_{d=1}^{D} P(x_d|y)
  • Advantage: We only need to estimate the distribution of each feature individually, rather than their complex joint interactions (the resulting classification rule is written out after this list).
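Although the notes stop at the likelihood factorization, the classification rule follows directly from Bayes' rule: P(y|x) \propto P(y) P(x|y), so the predicted class is \hat{y} = \arg\max_y P(y) \prod_{d=1}^{D} P(x_d|y).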

4. Application: Spam Classifier

The lecture applies the Naive Bayes framework to a discrete problem: classifying emails as Spam (y=1) or Not Spam (y=0).

Feature Engineering

  • Input: Emails with varying text lengths.
  • Transformation: A "Bag of Words" approach is used (a small vectorization sketch follows this list).
    1. Create a dictionary of N words (e.g., N=10,000).
    2. Represent each email as a fixed-length binary vector x \in \{0, 1\}^{10,000}.
    3. x_i = 1 if the $i$-th word appears in the email, 0 otherwise.
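A minimal sketch of the binary bag-of-words transformation; the tiny vocabulary and example email below are invented for illustration and stand in for the full N = 10,000-word dictionary.

```python
import numpy as np

# Toy vocabulary standing in for the 10,000-word dictionary.
vocab = ["free", "money", "meeting", "project", "winner"]
word_index = {w: i for i, w in enumerate(vocab)}

def email_to_binary_vector(text: str) -> np.ndarray:
    """x_i = 1 if the i-th dictionary word appears in the email, 0 otherwise."""
    x = np.zeros(len(vocab), dtype=np.int8)
    for word in text.lower().split():
        if word in word_index:
            x[word_index[word]] = 1
    return x

print(email_to_binary_vector("Free money for the winner"))  # [1 1 0 0 1]
```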

The "Curse of Dimensionality" (Without Naive Bayes)

  • Since the features are discrete (binary), we cannot use Gaussian distributions. We must use probability tables.
  • If we tried to model the full joint distribution P(x_1, ..., x_{10000} | y), we would need a probability table for every possible combination of words.
  • Parameter Count: the table would need 2^{10,000} entries per class, which is computationally impossible to store or estimate.
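For scale, 2^{10,000} \approx 10^{3,010}, which dwarfs the commonly cited estimate of roughly 10^{80} atoms in the observable universe; no amount of training data could populate such a table.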

Applying Naive Bayes

  • By assuming word independence given the class, we decompose the problem: P(x|y) \approx \prod_{i=1}^{10,000} P(x_i|y)
  • Parameter Estimation:
    • We only need to estimate P(x_i=1 | y=1) and P(x_i=1 | y=0) for each word.
    • This requires simply counting how often each word appears in Spam vs. Non-Spam emails; the estimation sketch after this list follows exactly this recipe.
  • Reduced Parameter Count:
    • Instead of 2^{10,000}, we need roughly 2 \times 10,000 parameters (one probability per word per class).
    • This transforms an impossible problem into a highly efficient and simple one.
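A minimal estimation and prediction sketch under the assumptions above: X is a binary email-word matrix as produced by the bag-of-words step, and the +1/+2 smoothing is only an implementation safeguard against zero counts, not something covered in the lecture.

```python
import numpy as np

def fit_naive_bayes(X: np.ndarray, y: np.ndarray):
    """Estimate P(y) and P(x_i = 1 | y) by counting word occurrences per class.

    X: (n_emails, n_words) binary matrix; y: (n_emails,) labels in {0, 1}.
    """
    prior = np.array([np.mean(y == c) for c in (0, 1)])       # P(y = c)
    cond = {}
    for c in (0, 1):
        Xc = X[y == c]
        # Fraction of class-c emails containing each word, with +1/+2 smoothing
        # (an implementation detail, not discussed in the lecture).
        cond[c] = (Xc.sum(axis=0) + 1) / (len(Xc) + 2)        # P(x_i = 1 | y = c)
    return prior, cond

def predict(x: np.ndarray, prior, cond) -> int:
    """Pick the class maximizing log P(y) + sum_i log P(x_i | y)."""
    scores = []
    for c in (0, 1):
        p = cond[c]
        log_likelihood = np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))
        scores.append(np.log(prior[c]) + log_likelihood)
    return int(np.argmax(scores))
```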

5. Summary

  • Generative Methods aim to model the underlying distribution P(x, y).
  • Graphical Models allow us to inject prior knowledge (independence assumptions) to make this feasible.
  • Naive Bayes assumes full conditional independence, reducing parameter estimation from exponential to linear complexity, making it ideal for high-dimensional discrete data like text classification.