Lecture Summary: Directed Graphical Models and Naive Bayes
Date: 2025.11.20 Topic: Parameter Reduction, Directed Graphical Models, Chain Rule, and Naive Bayes Classifier.
1. Motivation: The Need for Parameter Reduction
The lecture begins by reviewing Generative Methods using the Gaussian distribution.
- The Problem: In high-dimensional settings (e.g., analyzing images or complex biological data), estimating the full Joint Probability Distribution is computationally expensive and data-intensive.
- For a $D$-dimensional Multivariate Gaussian, we must estimate the mean vector $\mu$ ($D$ parameters) and the Covariance Matrix $\Sigma$ (a symmetric $D \times D$ matrix).
- The total number of parameters is roughly $O(D^2)$, specifically $D + \frac{D(D+1)}{2}$ (a small sketch at the end of this list shows how fast this grows).
- For large $D$, this requires a massive amount of training data to avoid overfitting.
- The Solution: We use Prior Knowledge (domain knowledge) about the relationships between variables to reduce the number of parameters.
- By assuming certain variables are independent, we can decompose the complex joint distribution into smaller, simpler conditional distributions.
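As a quick sense of scale, here is a minimal sketch (the function name and the example dimensions are mine, not from the lecture) of how the full-Gaussian parameter count $D + \frac{D(D+1)}{2}$ grows with $D$:

```python
def full_gaussian_param_count(D: int) -> int:
    """Mean vector (D) plus symmetric covariance matrix (D*(D+1)/2)."""
    return D + D * (D + 1) // 2

for D in (10, 100, 1000):
    print(D, full_gaussian_param_count(D))
# Prints: 10 65, 100 5150, 1000 501500
```

The quadratic covariance term quickly dominates, which is exactly what the independence assumptions below are meant to cut down.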
2. Directed Graphical Models (Bayesian Networks)
A Directed Graphical Model represents random variables as nodes in a graph, where edges denote conditional dependencies.
Decomposition via Chain Rule
- The joint probability $P(x)$ can be decomposed using the chain rule: $P(x_1, \ldots, x_D) = \prod_{i=1}^{D} P(x_i \mid \text{parents}(x_i))$
- Example Structure: If we have a graph where $x_1$ has no parents, $x_2$ depends on $x_1$, etc., the joint distribution splits into: $P(x) = P(x_1)\,P(x_2 \mid x_1)\,P(x_3 \mid x_1)\cdots$
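Here is a minimal numerical check of this factorization for three binary variables, with made-up conditional probability tables (the numbers are illustrative only):

```python
import numpy as np

# Toy check of P(x) = P(x1) P(x2|x1) P(x3|x1) for binary variables,
# where x1 is the only parent of x2 and x3.
p_x1 = np.array([0.6, 0.4])               # P(x1)
p_x2_given_x1 = np.array([[0.7, 0.3],     # P(x2 | x1 = 0)
                          [0.2, 0.8]])    # P(x2 | x1 = 1)
p_x3_given_x1 = np.array([[0.9, 0.1],
                          [0.5, 0.5]])

# Multiply the factors to build the full 2x2x2 joint table.
joint = (p_x1[:, None, None]
         * p_x2_given_x1[:, :, None]
         * p_x3_given_x1[:, None, :])

assert np.isclose(joint.sum(), 1.0)       # it is a valid joint distribution
# Only 1 + 2 + 2 = 5 free parameters instead of 2**3 - 1 = 7 for the full table.
```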
Parameter Counting Example (Gaussian Case)
The lecture compares the number of parameters required for a "Full" Gaussian model vs. a "Reduced" Graphical Model.
- Full Gaussian: Assumes all variables are correlated.
- For a 10-dimensional vector ($D = 10$), parameters = $10 + \frac{10 \times 11}{2} = 65$.
- Reduced Model: Uses a graph structure where variables are conditionally independent.
- Instead of one giant covariance matrix, we estimate parameters for several smaller conditional distributions (often univariate Gaussians).
- Calculation: For a univariate conditional Gaussian $P(x_i \mid x_j)$, we need parameters for the linear relationship (mean coefficients) and the variance.
- In the specific example provided, the parameter count reduced from 65 to 57. While the reduction in this small example is modest, for high-dimensional data with sparse connections, the reduction is drastic.
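The counting itself can be sketched as follows, assuming a linear-Gaussian parameterization in which each conditional needs one coefficient per parent plus an intercept and a variance. The sparse graph below is hypothetical (the lecture's specific graph, which gives 57, is not reproduced here); it only illustrates that removing edges removes parameters:

```python
def dag_gaussian_param_count(parents: dict[int, list[int]]) -> int:
    """Linear-Gaussian DAG: each node needs |parents| coefficients + intercept + variance."""
    return sum(len(pa) + 2 for pa in parents.values())

D = 10

# Fully connected DAG (node i has all earlier nodes as parents)
# recovers the full-Gaussian count of 65.
full = {i: list(range(i)) for i in range(D)}
print(dag_gaussian_param_count(full))    # 65

# Hypothetical sparse graph: each node keeps at most its 2 predecessors as parents.
sparse = {i: list(range(max(0, i - 2), i)) for i in range(D)}
print(dag_gaussian_param_count(sparse))  # 37
```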
3. The Naive Bayes Classifier
The Naive Bayes classifier is the most extreme (and popular) example of a Directed Graphical Model used for parameter reduction.
- Assumption: Given the class label $y$, all input features $x_1, \ldots, x_D$ are mutually independent.
- Structure: The class $y$ is the parent of all feature nodes $x_i$. There are no connections between the features themselves.
- Formula: $P(x \mid y) = P(x_1 \mid y)\, P(x_2 \mid y) \cdots P(x_D \mid y) = \prod_{d=1}^{D} P(x_d \mid y)$
- Advantage: We only need to estimate the distribution of each feature individually, rather than their complex joint interactions.
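As an illustration of how little needs to be estimated under this assumption, here is a minimal sketch of a Naive Bayes fit, assuming (for this sketch only) univariate Gaussian class conditionals per feature; all names are illustrative, and the lecture's spam example below uses discrete features instead:

```python
import numpy as np

def fit_gaussian_nb(X: np.ndarray, y: np.ndarray) -> dict:
    """Estimate a class prior plus a per-feature mean and variance for each class."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (len(Xc) / len(X),        # prior P(y = c)
                     Xc.mean(axis=0),          # per-feature means
                     Xc.var(axis=0) + 1e-9)    # per-feature variances (jitter for stability)
    return params

def predict_gaussian_nb(params: dict, x: np.ndarray):
    """Pick the class maximizing log P(y) + sum_d log P(x_d | y)."""
    def log_joint(c):
        prior, mu, var = params[c]
        return np.log(prior) - 0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)
    return max(params, key=log_joint)
```

Each class contributes only $2D$ feature parameters (a mean and a variance per dimension), never a joint covariance.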
4. Application: Spam Classifier
The lecture applies the Naive Bayes framework to a discrete problem: classifying emails as Spam ($y = 1$) or Not Spam ($y = 0$).
Feature Engineering
- Input: Emails with varying text lengths.
- Transformation: A "Bag of Words" approach is used.
- Create a dictionary of $N$ words (e.g., $N = 10{,}000$).
- Represent each email as a fixed-length binary vector $x \in \{0, 1\}^{10{,}000}$, where $x_i = 1$ if the $i$-th word appears in the email and $x_i = 0$ otherwise.
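A minimal sketch of this transformation, using a tiny toy dictionary as a stand-in for the 10,000-word one:

```python
import numpy as np

dictionary = ["free", "money", "meeting", "tomorrow", "prize"]   # toy stand-in for N = 10,000 words
word_index = {w: i for i, w in enumerate(dictionary)}

def to_binary_vector(email: str) -> np.ndarray:
    """x_i = 1 if the i-th dictionary word appears in the email, 0 otherwise."""
    x = np.zeros(len(dictionary), dtype=int)
    for word in email.lower().split():
        if word in word_index:
            x[word_index[word]] = 1
    return x

print(to_binary_vector("free money prize inside"))   # [1 1 0 0 1]
```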
The "Curse of Dimensionality" (Without Naive Bayes)
- Since the features are discrete (binary), we cannot use Gaussian distributions. We must use probability tables.
- If we tried to model the full joint distribution $P(x_1, \ldots, x_{10000} \mid y)$, we would need a probability table for every possible combination of words.
- Parameter Count: $2^{10{,}000}$ entries. This is computationally impossible.
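For a sense of just how impossible, a quick check with Python's arbitrary-precision integers:

```python
# The full joint table over 10,000 binary features has on the order of 2**10_000 entries per class;
# the entry count itself is a number with over 3,000 decimal digits.
print(len(str(2 ** 10_000)))   # 3011
```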
Applying Naive Bayes
- By assuming word independence given the class, we decompose the problem: $P(x \mid y) \approx \prod_{i=1}^{10{,}000} P(x_i \mid y)$
- Parameter Estimation:
- We only need to estimate $P(x_i = 1 \mid y = 1)$ and $P(x_i = 1 \mid y = 0)$ for each word.
- This requires simply counting the frequency of each word in Spam vs. Non-Spam emails.
- Reduced Parameter Count:
- Instead of $2^{10{,}000}$, we need roughly $2 \times 10{,}000$ parameters (one probability per word per class).
- This transforms an impossible problem into a highly efficient and simple one.
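A minimal sketch of this estimation step, assuming binary bag-of-words vectors; the Laplace (add-one) smoothing and all names are my additions rather than lecture specifics:

```python
import numpy as np

def fit_bernoulli_nb(X: np.ndarray, y: np.ndarray, alpha: float = 1.0):
    """X: (n_emails, n_words) binary matrix, y: 0/1 labels.
    Returns class priors and P(x_i = 1 | y = c) for every word and class,
    i.e. roughly 2 * n_words parameters instead of 2**n_words."""
    priors, word_probs = {}, {}
    for c in (0, 1):
        Xc = X[y == c]
        priors[c] = len(Xc) / len(X)
        # Fraction of class-c emails containing each word, with add-one smoothing.
        word_probs[c] = (Xc.sum(axis=0) + alpha) / (len(Xc) + 2 * alpha)
    return priors, word_probs

def classify_email(priors, word_probs, x: np.ndarray) -> int:
    """Compare log P(y = c) + sum_i log P(x_i | y = c) for c in {0, 1}."""
    scores = {}
    for c in (0, 1):
        p = word_probs[c]
        scores[c] = np.log(priors[c]) + np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))
    return max(scores, key=scores.get)
```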
5. Summary
- Generative Methods aim to model the underlying distribution $P(x, y)$.
- Graphical Models allow us to inject prior knowledge (independence assumptions) to make this feasible.
- Naive Bayes assumes full conditional independence, reducing parameter estimation from exponential to linear complexity, making it ideal for high-dimensional discrete data like text classification.