# Study Guide: Discrete Probability Models & Undirected Graphical Models
**Date:** 2025.11.24
**Topic:** Multinomial Distribution, Maximum Likelihood Estimation (MLE), and Markov Random Fields (Undirected Graphical Models).
---
### **1. Discrete Probability Distributions**
The lecture shifts focus from continuous models (like the Gaussian) to discrete models, which are essential for tasks like text classification (e.g., Naive Bayes).
#### **Binomial Distribution**
* **Scenario:** A coin toss (Binary outcome: Head/Tail).
* **Random Variables:** $m_1$ (count of Heads), $m_2$ (count of Tails).
* **Parameters:** Probability of Head ($\mu$) and Tail ($1-\mu$).
* **Formula:** For a sequence of $N$ tosses (so $m_1 + m_2 = N$), we count the number of ways to arrange the outcomes:
$$P(m_1, m_2) = \frac{N!}{m_1!m_2!} \mu^{m_1} (1-\mu)^{m_2}$$
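As a quick numeric check of this formula, a minimal Python sketch (the values of $N$, $\mu$, and $m_1$ below are illustrative, not from the lecture):

```python
from math import comb

# Illustrative values: N = 10 tosses, mu = 0.6 probability of heads.
N, mu = 10, 0.6
m1 = 7          # count of heads
m2 = N - m1     # count of tails

# P(m1, m2) = N! / (m1! m2!) * mu^m1 * (1 - mu)^m2
p = comb(N, m1) * mu**m1 * (1 - mu)**m2
print(p)  # ~0.215
```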
#### **Multinomial Distribution**
* **Scenario:** Rolling a die with $K$ faces (e.g., $K=6$). This generalizes the binomial distribution.
* **Definition:**
* We have $N$ total events (trials).
* We observe counts $m_1, m_2, ..., m_K$ for each of the $K$ possible outcomes.
* Parameters $\mu_1, ..., \mu_K$ represent the probability of each outcome.
* **Probability Mass Function:**
$$P(m_1, ..., m_K | \mu) = \frac{N!}{m_1! \cdots m_K!} \prod_{k=1}^{K} \mu_k^{m_k}$$
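The same computation generalizes directly; a small sketch of the multinomial PMF (the die counts below are a made-up illustration):

```python
from math import factorial

def multinomial_pmf(m, mu):
    """P(m_1, ..., m_K | mu): multinomial coefficient times the product of mu_k^m_k."""
    N = sum(m)
    coeff = factorial(N)
    for m_k in m:
        coeff //= factorial(m_k)
    prob = 1.0
    for m_k, mu_k in zip(m, mu):
        prob *= mu_k ** m_k
    return coeff * prob

# Illustrative: a fair six-sided die rolled N = 12 times, each face observed twice.
print(multinomial_pmf([2, 2, 2, 2, 2, 2], [1/6] * 6))  # ~0.0034
```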
---
### **2. Learning: Maximum Likelihood Estimation (MLE)**
How do we estimate the parameters ($\mu_k$) from data?
* **Goal:** Maximize the likelihood of the observed data subject to the constraint that probabilities sum to 1 ($\sum \mu_k = 1$).
* **Method:** **Lagrange Multipliers**.
1. **Objective:** Maximize Log-Likelihood:
$$L = \ln(N!) - \sum_{k=1}^{K} \ln(m_k!) + \sum_{k=1}^{K} m_k \ln(\mu_k)$$
2. **Constraint:** $\sum_{k=1}^{K} \mu_k - 1 = 0$.
3. **Lagrangian:**
$$L' = \sum_{k=1}^{K} m_k \ln(\mu_k) + \lambda (\sum_{k=1}^{K} \mu_k - 1)$$
(Note: Constant terms like $N!$ vanish during differentiation).
4. **Derivation:** Setting $\frac{\partial L'}{\partial \mu_k} = \frac{m_k}{\mu_k} + \lambda = 0$ yields $\mu_k = -\frac{m_k}{\lambda}$. Substituting into the constraint (with $\sum_{k} m_k = N$) gives $\lambda = -N$.
* **Result:**
$$\mu_k = \frac{m_k}{N}$$
* The optimal parameter is simply the **empirical fraction** (count of specific events divided by total events).
* This provides the theoretical justification for the simple "counting" method used in the Naive Bayes classifier discussed in previous lectures.
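This result is easy to sanity-check numerically: the counting estimate $m_k / N$ should approach the true parameters as $N$ grows. A minimal sketch, with made-up "true" die probabilities:

```python
import random
from collections import Counter

random.seed(0)

# Hypothetical "true" parameters of a loaded 4-sided die (illustrative only).
true_mu = [0.1, 0.2, 0.3, 0.4]
K = len(true_mu)

# Draw N samples and compute the MLE: mu_k = m_k / N.
N = 100_000
samples = random.choices(range(K), weights=true_mu, k=N)
counts = Counter(samples)
mle_mu = [counts[k] / N for k in range(K)]

print(mle_mu)  # each entry close to the corresponding true_mu value
```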
---
### **3. Undirected Graphical Models (Markov Random Fields)**
When causal relationships (direction) are unclear or interactions are symmetric (e.g., neighboring pixels in an image, social network friends), we use **Undirected Graphs** instead of Bayesian Networks (Directed Acyclic Graphs).
#### **Comparison**
* **Directed (Bayesian Network):** Uses conditional probabilities (e.g., $P(A|B)$). Represents causality or asymmetric relationships.
* **Undirected (Markov Random Field - MRF):** Uses "Potential Functions" ($\psi$). Represents correlation or symmetric constraints.
#### **Conditional Independence in MRF**
Determining independence is simpler in undirected graphs than in directed graphs: no d-separation rules are needed, only simple graph separation.
* **Global Markov Property:** Two sets of nodes are conditionally independent given a separating set if all paths between them pass through the separating set.
* *Example:* If nodes $X_1$ and $X_5$ are not directly connected, they are conditionally independent given a set of intermediate nodes (e.g., $X_3$) that blocks every path between them; the sketch below checks this mechanically.
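Checking the global Markov property amounts to a graph-separation test: delete the conditioning set and see whether the two nodes are still connected. A minimal sketch, where the edge list is an assumed example graph (the lecture's exact $X_1, \dots, X_5$ edges are not reproduced here):

```python
from collections import deque

def separated(edges, a, b, given):
    """True if every path between a and b passes through the set `given`,
    i.e., a and b become disconnected once `given` is removed."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    seen, queue = {a}, deque([a])
    while queue:
        node = queue.popleft()
        for nxt in adj.get(node, ()):
            if nxt in given or nxt in seen:
                continue
            if nxt == b:
                return False        # found a path that avoids `given`
            seen.add(nxt)
            queue.append(nxt)
    return True

# Assumed example graph: x1 - x2 - x3 - x4 - x5, plus an extra edge x1 - x3.
edges = [(1, 2), (2, 3), (1, 3), (3, 4), (4, 5)]
print(separated(edges, 1, 5, given={3}))    # True:  X1 is independent of X5 given X3
print(separated(edges, 1, 5, given=set()))  # False: unblocked paths exist
```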
---
### **4. Factorization in Undirected Graphs**
Since undirected edges provide no natural ordering for a chain-rule factorization into conditional probabilities (and in general $P(A|B) \neq P(B|A)$), we model the joint distribution using **Cliques**.
#### **Cliques and Maximal Cliques**
* **Clique:** A subgraph where every pair of nodes is connected (fully connected).
* **Maximal Clique:** A clique that cannot be expanded by including any other adjacent node.
#### **The Joint Distribution Formula**
We associate a **Potential Function** ($\psi_C$) with each maximal clique $C$.
$$P(x) = \frac{1}{Z} \prod_{C} \psi_C(x_C)$$
* **Potential Function ($\psi$):** A non-negative function that scores the compatibility of variables in a clique. It is *not* a probability (doesn't sum to 1).
* **Partition Function ($Z$):** The normalization constant required to make the total probability sum to 1.
$$Z = \sum_x \prod_{C} \psi_C(x_C)$$
#### **Example Decomposition**
Given a graph with maximal cliques $\{x_1, x_2\}$, $\{x_1, x_3\}$, and $\{x_3, x_4, x_5\}$:
$$P(x) = \frac{1}{Z} \psi_{12}(x_1, x_2) \psi_{13}(x_1, x_3) \psi_{345}(x_3, x_4, x_5)$$
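For small binary graphs this factorization (and the partition function $Z$) can be evaluated by brute force. A sketch for the three maximal cliques above, where the potential tables are arbitrary made-up values rather than anything from the lecture:

```python
import itertools

# Arbitrary, illustrative potentials over binary variables (0/1).
def psi_12(x1, x2):      return 2.0 if x1 == x2 else 0.5   # favors x1 == x2
def psi_13(x1, x3):      return 1.5 if x1 == x3 else 1.0
def psi_345(x3, x4, x5): return 3.0 if x3 == x4 == x5 else 1.0

def unnormalized(x1, x2, x3, x4, x5):
    # Product of potentials over the maximal cliques {1,2}, {1,3}, {3,4,5}.
    return psi_12(x1, x2) * psi_13(x1, x3) * psi_345(x3, x4, x5)

# Partition function: sum the clique product over all 2^5 configurations.
Z = sum(unnormalized(*x) for x in itertools.product([0, 1], repeat=5))

def joint(*x):
    return unnormalized(*x) / Z

# Normalization check: the joint probabilities sum to 1 by construction.
print(Z, sum(joint(*x) for x in itertools.product([0, 1], repeat=5)))
```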
#### **Hammersley-Clifford Theorem**
This theorem provides the theoretical guarantee: a strictly positive distribution satisfies the conditional independence properties of an undirected graph if and only if it factorizes into potential functions over the graph's (maximal) cliques.