# Study Guide: Discrete Probability Models & Undirected Graphical Models
**Date:** 2025.11.24
**Topic:** Multinomial Distribution, Maximum Likelihood Estimation (MLE), and Markov Random Fields (Undirected Graphical Models).
---
### **1. Discrete Probability Distributions**
The lecture shifts focus from continuous models (like the Gaussian) to discrete models, which are essential for tasks like text classification (e.g., Naive Bayes).
#### **Binomial Distribution**
* **Scenario:** A coin toss (Binary outcome: Head/Tail).
* **Random Variables:** $m_1$ (count of Heads), $m_2$ (count of Tails).
* **Parameters:** Probability of Head ($\mu$) and Tail ($1-\mu$).
* **Formula:** For a sequence of $N$ tosses (so $m_1 + m_2 = N$), we count the number of ways to arrange the outcomes:
$$P(m_1, m_2) = \frac{N!}{m_1!m_2!} \mu^{m_1} (1-\mu)^{m_2}$$
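As a quick numeric check of this formula, a minimal Python sketch (the values of $N$, $\mu$, and $m_1$ below are illustrative, not from the lecture):

```python
from math import comb

# Illustrative values: N = 10 tosses, mu = 0.6 probability of heads.
N, mu = 10, 0.6
m1 = 7          # count of heads
m2 = N - m1     # count of tails

# P(m1, m2) = N! / (m1! m2!) * mu^m1 * (1 - mu)^m2
p = comb(N, m1) * mu**m1 * (1 - mu)**m2
print(p)  # ~0.215
```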
#### **Multinomial Distribution**
* **Scenario:** Rolling a die with $K$ faces (e.g., $K=6$). This generalizes the binomial distribution.
* **Definition:**
* We have $N$ total events (trials).
* We observe counts $m_1, m_2, ..., m_K$ for each of the $K$ possible outcomes.
* Parameters $\mu_1, ..., \mu_K$ represent the probability of each outcome.
* **Probability Mass Function:**
$$P(m_1, ..., m_K | \mu) = \frac{N!}{m_1! \cdots m_K!} \prod_{k=1}^{K} \mu_k^{m_k}$$
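The same computation generalizes directly; a small sketch of the multinomial PMF (the die counts below are a made-up illustration):

```python
from math import factorial

def multinomial_pmf(m, mu):
    """P(m_1, ..., m_K | mu): multinomial coefficient times the product of mu_k^m_k."""
    N = sum(m)
    coeff = factorial(N)
    for m_k in m:
        coeff //= factorial(m_k)
    prob = 1.0
    for m_k, mu_k in zip(m, mu):
        prob *= mu_k ** m_k
    return coeff * prob

# Illustrative: a fair six-sided die rolled N = 12 times, each face observed twice.
print(multinomial_pmf([2, 2, 2, 2, 2, 2], [1/6] * 6))  # ~0.0034
```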
---
### **2. Learning: Maximum Likelihood Estimation (MLE)**
How do we estimate the parameters ($\mu_k$) from data?
* **Goal:** Maximize the likelihood of the observed data subject to the constraint that probabilities sum to 1 ($\sum \mu_k = 1$).
* **Method:** **Lagrange Multipliers**.
1. **Objective:** Maximize Log-Likelihood:
$$L = \ln(N!) - \sum_{k=1}^{K} \ln(m_k!) + \sum_{k=1}^{K} m_k \ln(\mu_k)$$
2. **Constraint:** $\sum_{k=1}^{K} \mu_k - 1 = 0$.
3. **Lagrangian:**
$$L' = \sum_{k=1}^{K} m_k \ln(\mu_k) + \lambda (\sum_{k=1}^{K} \mu_k - 1)$$
(Note: Constant terms like $N!$ vanish during differentiation).
4. **Derivation:** Setting $\frac{\partial L'}{\partial \mu_k} = \frac{m_k}{\mu_k} + \lambda = 0$ yields $\mu_k = -\frac{m_k}{\lambda}$. Substituting into the constraint (with $\sum_{k} m_k = N$) gives $\lambda = -N$.
* **Result:**
$$\mu_k = \frac{m_k}{N}$$
* The optimal parameter is simply the **empirical fraction** (count of specific events divided by total events).
* This provides the theoretical justification for the simple "counting" method used in the Naive Bayes classifier discussed in previous lectures.
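This result is easy to sanity-check numerically: the counting estimate $m_k / N$ should approach the true parameters as $N$ grows. A minimal sketch, with made-up "true" die probabilities:

```python
import random
from collections import Counter

random.seed(0)

# Hypothetical "true" parameters of a loaded 4-sided die (illustrative only).
true_mu = [0.1, 0.2, 0.3, 0.4]
K = len(true_mu)

# Draw N samples and compute the MLE: mu_k = m_k / N.
N = 100_000
samples = random.choices(range(K), weights=true_mu, k=N)
counts = Counter(samples)
mle_mu = [counts[k] / N for k in range(K)]

print(mle_mu)  # each entry close to the corresponding true_mu value
```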
---
### **3. Undirected Graphical Models (Markov Random Fields)**
When causal relationships (direction) are unclear or interactions are symmetric (e.g., neighboring pixels in an image, social network friends), we use **Undirected Graphs** instead of Bayesian Networks (Directed Acyclic Graphs).
#### **Comparison**
* **Directed (Bayesian Network):** Uses conditional probabilities (e.g., $P(A|B)$). Represents causality or asymmetric relationships.
* **Undirected (Markov Random Field - MRF):** Uses "Potential Functions" ($\psi$). Represents correlation or symmetric constraints.
#### **Conditional Independence in MRF**
Determining independence is simpler in undirected graphs than in directed graphs: no d-separation rules are needed, only simple graph separation.
* **Global Markov Property:** Two sets of nodes are conditionally independent given a separating set if all paths between them pass through the separating set.
* *Example:* If nodes $X_1$ and $X_5$ are not directly connected, they are conditionally independent given a set of intermediate nodes (e.g., $X_3$) that blocks every path between them; the sketch below checks this mechanically.
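Checking the global Markov property amounts to a graph-separation test: delete the conditioning set and see whether the two nodes are still connected. A minimal sketch, where the edge list is an assumed example graph (the lecture's exact $X_1, \dots, X_5$ edges are not reproduced here):

```python
from collections import deque

def separated(edges, a, b, given):
    """True if every path between a and b passes through the set `given`,
    i.e., a and b become disconnected once `given` is removed."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    seen, queue = {a}, deque([a])
    while queue:
        node = queue.popleft()
        for nxt in adj.get(node, ()):
            if nxt in given or nxt in seen:
                continue
            if nxt == b:
                return False        # found a path that avoids `given`
            seen.add(nxt)
            queue.append(nxt)
    return True

# Assumed example graph: x1 - x2 - x3 - x4 - x5, plus an extra edge x1 - x3.
edges = [(1, 2), (2, 3), (1, 3), (3, 4), (4, 5)]
print(separated(edges, 1, 5, given={3}))    # True:  X1 is independent of X5 given X3
print(separated(edges, 1, 5, given=set()))  # False: unblocked paths exist
```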
---
### **4. Factorization in Undirected Graphs**
Since undirected edges provide no natural ordering for a chain-rule factorization into conditional probabilities (and in general $P(A|B) \neq P(B|A)$), we model the joint distribution using **Cliques**.
#### **Cliques and Maximal Cliques**
* **Clique:** A subgraph where every pair of nodes is connected (fully connected).
* **Maximal Clique:** A clique that cannot be expanded by including any other adjacent node.
#### **The Joint Distribution Formula**
We associate a **Potential Function** ($\psi_C$) with each maximal clique $C$.
$$P(x) = \frac{1}{Z} \prod_{C} \psi_C(x_C)$$
* **Potential Function ($\psi$):** A non-negative function that scores the compatibility of variables in a clique. It is *not* a probability (doesn't sum to 1).
* **Partition Function ($Z$):** The normalization constant required to make the total probability sum to 1.
$$Z = \sum_x \prod_{C} \psi_C(x_C)$$
#### **Example Decomposition**
Given a graph with maximal cliques $\{x_1, x_2\}$, $\{x_1, x_3\}$, and $\{x_3, x_4, x_5\}$:
$$P(x) = \frac{1}{Z} \psi_{12}(x_1, x_2) \psi_{13}(x_1, x_3) \psi_{345}(x_3, x_4, x_5)$$
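For small binary graphs this factorization (and the partition function $Z$) can be evaluated by brute force. A sketch for the three maximal cliques above, where the potential tables are arbitrary made-up values rather than anything from the lecture:

```python
import itertools

# Arbitrary, illustrative potentials over binary variables (0/1).
def psi_12(x1, x2):      return 2.0 if x1 == x2 else 0.5   # favors x1 == x2
def psi_13(x1, x3):      return 1.5 if x1 == x3 else 1.0
def psi_345(x3, x4, x5): return 3.0 if x3 == x4 == x5 else 1.0

def unnormalized(x1, x2, x3, x4, x5):
    # Product of potentials over the maximal cliques {1,2}, {1,3}, {3,4,5}.
    return psi_12(x1, x2) * psi_13(x1, x3) * psi_345(x3, x4, x5)

# Partition function: sum the clique product over all 2^5 configurations.
Z = sum(unnormalized(*x) for x in itertools.product([0, 1], repeat=5))

def joint(*x):
    return unnormalized(*x) / Z

# Normalization check: the joint probabilities sum to 1 by construction.
print(Z, sum(joint(*x) for x in itertools.product([0, 1], repeat=5)))
```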
#### **Hammersley-Clifford Theorem**
This theorem provides the theoretical guarantee: a strictly positive distribution satisfies the conditional independence properties of an undirected graph if and only if it factorizes into potential functions over the graph's (maximal) cliques.