add to final

2025-12-06 18:32:08 +09:00
parent ac1d2e744d
commit 0fc412e690
21 changed files with 935 additions and 0 deletions

final/1027.md Normal file

@@ -0,0 +1,68 @@
# Large Margin Classifiers and Optimization
**Date:** 2025.10.27
**Topic:** Large Margin Classifiers, Optimization, Margin Definition
---
### 1. Introduction to Robust Classification
The lecture begins by shifting focus from generative methods to discriminative methods, specifically within a **linearly separable setting**.
* **Problem Setting:** The goal is to classify data that can be perfectly separated by a linear boundary (hyperplane).
* **Robustness:** While infinite linear classifiers may separate the data, the objective is to find the "best" one. The best classifier is defined as the one that is most **robust**, meaning it generalizes well to new test data and handles potential outliers effectively.
* **Intuition:** A robust classifier places the decision boundary in the middle of the gap between classes, maximizing the distance to the nearest data points.
### 2. Defining the Margin
The concept of the **margin** is introduced to mathematically define robustness.
* **Definition:** The margin is the distance between the decision hyperplane and the closest data points.
* **Hyperplane Equation:** The decision boundary is defined as $w^T x - b = 0$.
* **Support Lines:** To define the margin, we establish two parallel lines passing through the closest data points:
* $w^T x - b = 1$ (for class +1).
* $w^T x - b = -1$ (for class -1).
* The region between these lines contains no data points.
### 3. Calculating the Margin Width
The lecture derives the mathematical expression for the margin width using vector projection.
* **Vector Projection:** The margin is calculated by projecting the vector connecting a point on the boundary ($x_0$) to a support vector ($x$) onto the normal vector $w$.
* **Derivation:**
* The distance is the projection of vector $(x - x_0)$ onto the unit normal vector $\frac{w}{||w||}$.
* Using the constraint $w^T x - b = 1$ and $w^T x_0 - b = 0$, the derived margin distance is $\frac{1}{||w||}$.
* **Conclusion:** Maximizing the margin is equivalent to **minimizing the norm of the weight vector $||w||$**.
### 4. The Optimization Problem
The task of finding the best classifier is formulated as a constrained optimization problem.
* **Objective Function:**
$$\min ||w||^2$$
  (Note: Minimizing $||w||^2$ yields the same solution as minimizing $||w||$ and is easier to differentiate)
* **Constraints:** All data points must be correctly classified and lie outside the margin. This is formalized as:
* $w^T x_i - b \ge 1$ for $y_i = 1$.
* $w^T x_i - b \le -1$ for $y_i = -1$.
* **Combined Constraint:** $y_i (w^T x_i - b) \ge 1$ for all $i$.
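
A minimal numeric sketch of this primal problem, assuming scikit-learn is available; the toy data points below are made up, and a very large `C` in `SVC` is used to approximate the hard-margin objective:

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data (hypothetical points, for illustration only).
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 3.5],   # class +1
              [0.0, 0.0], [1.0, 0.5], [0.5, 1.5]])  # class -1
y = np.array([1, 1, 1, -1, -1, -1])

# A very large C approximates the hard-margin problem:
#   min (1/2)||w||^2  subject to  y_i (w^T x_i - b) >= 1.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_[0]
margin = 1.0 / np.linalg.norm(w)   # distance from the hyperplane to the closest point
print("w =", w, "margin =", margin)
print("support vectors:\n", clf.support_vectors_)
```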
### 5. Optimization with Constraints (Lagrange Multipliers)
The lecture explains how to solve this optimization problem using **Lagrange Multipliers**, using a general example first.
* **Problem Setup:** Minimize an objective function $L(x)$ subject to a constraint $g(x) \ge 0$.
* **Lagrangian:** A new objective function is defined by combining the original loss and the constraint with a multiplier $\lambda$:
$$L'(x) = L(x) - \lambda g(x)$$
(Note: The transcript discusses combining components; the sign depends on the specific maximization/minimization formulation)
* **Solution Cases:**
The solution involves taking the derivative $\frac{dL'}{dx} = 0$ and considering two cases:
1. **Feasible Region ($\lambda = 0$):** The unconstrained minimum of $L(x)$ naturally satisfies the constraint ($g(x) > 0$). In this case, the constraint is inactive.
2. **Boundary Case ($\lambda > 0$):** The unconstrained minimum violates the constraint. Therefore, the optimal solution lies *on* the boundary where $g(x) = 0$.
### 6. Example: Constrained Minimization
A specific mathematical example is worked through to demonstrate the method.
* **Objective:** Minimize $x_1^2 + x_2^2$ (distance from origin).
* **Constraint:** $x_2 - x_1^2 - 1 \ge 0$ (must be above a parabola).
* **Solving:**
* The Lagrangian is set up: $L' = x_1^2 + x_2^2 - \lambda(x_2 - x_1^2 - 1)$.
* **Case 1 ($\lambda = 0$):** Leads to $x_1=0, x_2=0$, which violates the constraint ($0 - 0 - 1 = -1 \not\ge 0$). This solution is discarded.
* **Case 2 (Boundary, $\lambda \ne 0$):** The solution must lie on $x_2 - x_1^2 - 1 = 0$. Solving the system of equations yields the valid minimum at $x_1=0, x_2=1$.
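
A quick symbolic check of this worked example, sketched with SymPy; it solves the stationarity conditions of the Lagrangian together with the active constraint:

```python
import sympy as sp

x1, x2, lam = sp.symbols("x1 x2 lam", real=True)
L = x1**2 + x2**2                 # objective: squared distance from the origin
g = x2 - x1**2 - 1                # constraint g(x) >= 0 (above the parabola)
Lagr = L - lam * g

# Boundary case: stationarity of the Lagrangian plus the active constraint g = 0.
solutions = sp.solve([sp.diff(Lagr, x1), sp.diff(Lagr, x2), g], [x1, x2, lam], dict=True)
print(solutions)   # the real solution is x1 = 0, x2 = 1, lam = 2 (lam > 0: constraint active)
```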
### 7. Next Steps: Support Vector Machines
The lecture concludes by linking this optimization framework back to the classifier.
* **Support Vectors:** The data points that lie exactly on the margin boundary ($g(x)=0$) are called "Support Vectors".
* **Future Topic:** This foundation leads into the **Support Vector Machine (SVM)** algorithm, which will be discussed in the next session to handle non-linearly separable data.

final/1030.md Normal file

@@ -0,0 +1,125 @@
# Support Vector Machines: Optimization, Dual Problem & Kernel Methods
**Date:** 2025.10.30 and 2025.11.03
**Topic:** SVM Dual Form, Lagrange Multipliers, Kernel Trick, Cover's Theorem, Mercer's Theorem
---
### 1. Introduction to SVM Mathematics
The lecture focuses on the fundamental mathematical concepts behind Support Vector Machines (SVM), specifically the Large Margin Classifier.
* **Goal:** The objective is to understand the flow and connection of formulas rather than memorizing them.
* **Context:** SVMs were the dominant model for a decade before deep learning and remain powerful for specific problem types.
* **Core Concept:** The algorithm seeks to maximize the margin to ensure the most robust classifier.
### 2. General Optimization with Constraints
The lecture reviews and expands on the method of Lagrange multipliers for solving optimization problems with constraints.
* **Problem Setup:** To minimize an objective function $L(x)$ subject to constraints $g(x) \ge 0$, a new objective function (Lagrangian) is defined by combining the original function with the constraints using multipliers ($\lambda$).
* **KKT Conditions:** The Karush-Kuhn-Tucker (KKT) conditions are introduced to solve this. There are two main solution cases:
1. **Feasible Region:** The unconstrained minimum satisfies the constraint. Here, $\lambda = 0$.
2. **Boundary Case:** The solution lies on the boundary where $g(x) = 0$. Here, $\lambda > 0$.
### 3. Multi-Constraint Example
A specific example is provided to demonstrate optimization with multiple constraints.
* **Objective:** Minimize $x_1^2 + x_2^2$ subject to two linear constraints.
* **Lagrangian:** The function is defined as $L'(x) = L(x) - \lambda_1 g_1(x) - \lambda_2 g_2(x)$.
* **Solving Strategy:** With two constraints, there are four possible combinations for $\lambda$ values (both zero, one zero, or both positive).
* The lecture demonstrates testing these cases. For instance, assuming both $\lambda=0$ yields $x_1=0, x_2=0$, which violates the constraints.
* The valid solution is found where the constraints intersect (Boundary Case).
### 4. SVM Mathematical Formulation (Primal Problem)
The lecture applies these optimization principles specifically to the SVM Large Margin Classifier.
* **Objective Function:** Minimize $\frac{1}{2}||w||^2$ (equivalent to maximizing the margin).
* **Constraints:** All data points must be correctly classified outside the margin: $y_i(w^T x_i - b) \ge 1$.
* **Lagrangian Formulation:**
$$L(w, b) = \frac{1}{2}||w||^2 - \sum_{i=1}^{N} \alpha_i [y_i(w^T x_i - b) - 1]$$
Here, $\alpha_i$ represents the Lagrange multipliers.
### 5. Deriving the Dual Problem
To solve this, the partial derivatives of the Lagrangian with respect to the parameters $w$ and $b$ are set to zero.
* **Derivative w.r.t $w$:** Yields the relationship $w = \sum \alpha_i y_i x_i$. This shows $w$ is a linear combination of the data points.
* **Derivative w.r.t $b$:** Yields the constraint $\sum \alpha_i y_i = 0$.
* **Substitution:** By plugging these results back into the original Lagrangian equation, the "Primal" problem is converted into the "Dual" problem.
### 6. The Dual Form and Kernel Intuition
The final derived Dual objective function depends entirely on the dot product of data points.
* **Dual Equation:**
$$\text{Maximize } \sum \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j x_i^T x_j$$
Subject to $\sum \alpha_i y_i = 0$ and $\alpha_i \ge 0$.
* **Primal vs. Dual:**
* **Primal:** Depends on the number of features/parameters ($D$).
* **Dual:** Depends on the number of data points ($N$).
* **Significance:** The term $x_i^T x_j$ represents the inner product between data points. This structure allows for the "Kernel Trick" (discussed below), which handles non-linearly separable data by mapping it to higher dimensions without explicit calculation.
---
### 7. The Dual Form and Inner Products
In the previous section, the **Dual Form** of the SVM optimization problem was derived.
* **Objective Function:** The dual objective function to maximize involves the parameters $\alpha$ and the data points:
$$\sum \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j (x_i^T x_j)$$
* **Key Observation:** The optimization depends solely on the **inner product** ($x_i^T x_j$) between data points. This inner product represents the **similarity** between two vectors, which is the foundational concept for the Kernel Method.
---
### 8. Feature Mapping and Cover's Theorem
When data is not linearly separable in the original space (low-dimensional), we can transform it into a higher-dimensional space where a linear separator exists.
* **Mapping Function ($\Phi$):** We define a transformation rule, or mapping function $\Phi(x)$, that projects input vector $x$ from the original space to a high-dimensional feature space.
* **Example 1 (1D to 2D):** Mapping $x \to (x, x^2)$. A linear line in the 2D space (parabola) can separate classes that were mixed on the 1D line.
* **Example 2 (2D to 3D):** Mapping $x = (x_1, x_2)$ to $\Phi(x) = (x_1^2, x_2^2, \sqrt{2}x_1 x_2)$.
* **Cover's Theorem:** This theorem states that as the dimensionality of the feature space increases, the "power" of the linear method increases, making it more likely to find a linear separator.
* **Strategy:** Apply a mapping function $\Phi$ to the original data, then find a linear classifier in that high-dimensional space.
---
### 9. The Kernel Trick
Directly computing the mapping $\Phi(x)$ can be computationally expensive or impossible (e.g., infinite dimensions). The **Kernel Trick** allows us to compute the similarity in the high-dimensional space using only the original low-dimensional vectors.
* **Definition:** A Kernel function $K(x, y)$ calculates the inner product of the mapped vectors:
$$K(x, y) = \Phi(x)^T \Phi(y)$$
* **Efficiency:** The result is a scalar value calculated without knowing the explicit form of $\Phi$.
* **Derivation Example (Polynomial Kernel):**
For 2D vectors $x$ and $y$, consider the kernel $K(x, y) = (x^T y)^2$.
$$(x^T y)^2 = (x_1 y_1 + x_2 y_2)^2 = x_1^2 y_1^2 + x_2^2 y_2^2 + 2x_1 y_1 x_2 y_2$$
This is mathematically equivalent to the dot product of two mapped vectors where:
$$\Phi(x) = (x_1^2, x_2^2, \sqrt{2}x_1 x_2)$$
Thus, calculating $(x^T y)^2$ in the original space is equivalent to calculating similarity in the 3D space defined by $\Phi$.
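
A tiny numeric check of this equivalence, sketched with NumPy; the two example vectors are arbitrary:

```python
import numpy as np

def phi(v):
    # Explicit 2D -> 3D feature map from the lecture: (v1^2, v2^2, sqrt(2) v1 v2).
    return np.array([v[0]**2, v[1]**2, np.sqrt(2) * v[0] * v[1]])

x = np.array([1.0, 2.0])   # arbitrary example vectors
y = np.array([3.0, -1.0])

kernel_value = (x @ y) ** 2        # K(x, y) = (x^T y)^2, computed in the original 2D space
mapped_value = phi(x) @ phi(y)     # Phi(x)^T Phi(y), computed in the 3D feature space
print(kernel_value, mapped_value)  # both equal (here: (3 - 2)^2 = 1)
```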
---
### 10. Mercer's Theorem & Positive Definite Functions
How do we know if a function $K(x, y)$ is a valid kernel? **Mercer's Theorem** provides the condition.
* **The Theorem:** If a function $K(x, y)$ is **Positive Definite (P.D.)**, then there *always* exists a mapping function $\Phi$ such that $K(x, y) = \Phi(x)^T \Phi(y)$.
* **Implication:** We can choose any P.D. function as our kernel and be guaranteed that it corresponds to some high-dimensional space, without needing to derive $\Phi$ explicitly.
#### **Positive Definiteness (Matrix Definition)**
To check if a kernel is P.D., we analyze the Kernel Matrix (Gram Matrix) constructed from data points.
* A matrix $M$ is P.D. if $z^T M z > 0$ for every non-zero vector $z$.
* **Eigenvalue Condition:** A matrix is P.D. if and only if **all of its eigenvalues are positive**.
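
A small NumPy sketch of this check: build an RBF Gram matrix for a handful of random points (the bandwidth and data are made up) and inspect its eigenvalues:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))            # 5 random 2D data points

# RBF (Gaussian) kernel Gram matrix: K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)).
sigma = 1.0
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq_dists / (2 * sigma**2))

eigvals = np.linalg.eigvalsh(K)        # symmetric matrix -> real eigenvalues
print(eigvals)                         # all positive for distinct points => valid (P.D.) kernel matrix
```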
---
### 11. Infinite Dimensionality (RBF Kernel)
The lecture briefly touches upon the exponential (Gaussian/RBF) kernel.
* The exponential function can be expanded using a Taylor Series into an infinite sum.
* This implies that using an exponential-based kernel is equivalent to mapping the data into an **infinite-dimensional space**.
* Even though the dimension is infinite, the calculation $K(x, y)$ remains a simple scalar operation in the original space.
---
### 12. Final SVM Formulation with Kernels
By applying the Kernel Trick, the SVM formulation is generalized to non-linear problems.
* **Dual Objective:** Replace $x_i^T x_j$ with $K(x_i, x_j)$:
$$\text{Maximize: } \sum \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j K(x_i, x_j)$$
* **Decision Rule:** For a new test point $x'$, the classification is determined by:
$$\sum \alpha_i y_i K(x_i, x') - b \ge 0$$
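A hedged end-to-end sketch using scikit-learn's `SVC` on a made-up radial data set; its decision function has exactly this form, computed over the support vectors only, except that scikit-learn writes the offset as $+b$ rather than $-b$:

```python
import numpy as np
from sklearn.svm import SVC

# Toy non-linearly separable data: the class depends on the radius (hypothetical).
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = (np.linalg.norm(X, axis=1) > 1.0).astype(int)

clf = SVC(kernel="rbf", gamma=1.0, C=1.0).fit(X, y)

# decision_function(x') = sum_i alpha_i y_i K(x_i, x') + b over the support vectors
# (sklearn stores alpha_i y_i in dual_coef_ and the offset in intercept_).
x_new = np.array([[0.2, 0.1]])
print(clf.decision_function(x_new))    # >= 0 -> one class, < 0 -> the other
print("number of support vectors:", clf.support_vectors_.shape[0])
```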
**Next Lecture:** The course will move on to Generative Methods (probabilistic methods).

final/1106.md Normal file

@@ -0,0 +1,92 @@
# Lecture Summary: Generative Methods & Probability Review
**Date:** 2025.11.06
**Topic:** Discriminative vs. Generative Models, Probability Theory, Probabilistic Inference, and Gaussian Distributions.
---
### 1. Classification Approaches: Discriminative vs. Generative
The lecture begins by distinguishing between two fundamental approaches to machine learning classification, specifically for binary problems (labels 0 or 1).
#### **Discriminative Methods (e.g., Logistic Regression)**
* **Goal:** Directly model the decision boundary or the conditional probability $P(y|x)$.
* **Mechanism:** Focuses on distinguishing classes. It learns a function that maps inputs $x$ directly to class labels $y$.
* **Limitation:** It does not model the underlying distribution of the data itself.
#### **Generative Methods**
* **Goal:** Model the joint probability or the class-conditional density $P(x|y)$ and the class prior $P(y)$.
* **Mechanism:** It learns "how the data is generated" for each class.
* **Classification:** To classify a new point, it uses **Bayes' Rule** to invert the probabilities:
$$P(y|x) = \frac{P(x|y)P(y)}{P(x)}$$
* **Advantage:** If you know the generative model, you can solve the classification problem *and* generate new data samples.
---
### 2. Probability Theory Review
To understand Generative Methods, a strong foundation in probability is required.
#### **Random Variables**
* **Definition:** A random variable is technically a **function** (mapping) that assigns a real number to an outcome (event $\omega$) in the sample space $\Omega$.
* **Example:** Tossing a coin 4 times. An event might be "HHTH", and the random variable $X(\omega)$ could be "number of heads" (which equals 3).
#### **Probability vs. Probability Density Function (PDF)**
The lecture emphasizes distinguishing between discrete probability ($P$) and continuous density ($p$).
* **Discrete Probability ($P$):** Defined as the ratio of cardinalities (counts) or areas in discrete sets (e.g., Venn diagrams).
* **Probability Density Function ($p$):** Used for continuous variables.
* **Properties:** $p(x) \ge 0$ for all $x$, and $\int p(x)dx = 1$.
* **Relationship:** The probability of $x$ falling within a range is the **integral** (area under the curve) of the PDF. The probability of a specific point $P(x=x_0)$ is 0.
#### **Key Statistics**
* **Expectation ($E[x]$):** The mean or weighted average of a random variable.
$$E[x] = \int x p(x) dx$$
* **Covariance:** Measures the spread or variance of the data. For vectors, this results in a Covariance Matrix.
$$Cov[x] = E[(x - \mu)(x - \mu)^T]$$
---
### 3. The Trinity of Distributions: Joint, Conditional, and Marginal
Understanding the relationship between these three is crucial for probabilistic modeling.
#### **Joint PDF ($P(x_1, x_2)$)**
* This represents the probability of $x_1$ and $x_2$ occurring together.
* **Importance:** If you know the Joint PDF, you know *everything* about the system. You can derive all other probabilities (marginal, conditional) from it.
#### **Conditional PDF ($P(x_1 | x_2)$)**
* Represents the probability of $x_1$ given that $x_2$ is fixed to a specific value.
* Visually, this is like taking a "slice" of the joint distribution 3D surface at $x_2 = a$.
#### **Marginal PDF ($P(x_1)$)**
* Represents the probability of $x_1$ regardless of $x_2$.
* **Calculation:** You "marginalize out" (integrate or sum) the other variables.
* Continuous: $P(x_1) = \int P(x_1, x_2) dx_2$.
* Discrete: Summing rows or columns in a probability table.
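
A small NumPy sketch of these three objects on a made-up $2 \times 2$ joint probability table (slicing and renormalizing gives a conditional, summing an axis gives a marginal):

```python
import numpy as np

# Hypothetical joint probability table P(x1, x2) for two binary variables.
# Rows index x1 in {0, 1}, columns index x2 in {0, 1}; entries sum to 1.
P_joint = np.array([[0.30, 0.10],
                    [0.20, 0.40]])

P_x1 = P_joint.sum(axis=1)                   # marginal P(x1): sum out x2 (columns)
P_x2 = P_joint.sum(axis=0)                   # marginal P(x2): sum out x1 (rows)
P_x1_given_x2_1 = P_joint[:, 1] / P_x2[1]    # conditional P(x1 | x2 = 1): a normalized "slice"

print(P_x1, P_x2, P_x1_given_x2_1)
```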
---
### 4. Probabilistic Inference
**Inference** is defined as calculating a desired probability (e.g., a prediction) starting from the Joint Probability function using rules like Bayes' theorem and marginalization.
#### **Handling Missing Data**
A major practical benefit of generative models (Joint PDF modeling) over discriminative models (like Logistic Regression) is robust handling of missing data.
* **Scenario:** You have a model predicting disease ($y$) based on Age ($x_1$), Blood Pressure ($x_2$), and Oxygen ($x_3$).
* **Problem:** A patient arrives, but you cannot measure Age ($x_1$). A discriminative model might fail or require value imputation (guessing averages).
* **Probabilistic Solution:** You integrate (marginalize) out the missing variable $x_1$ from the joint distribution to get the probability based only on observed data:
$$P(y | x_2, x_3) = \frac{\int p(x_1, x_2, x_3, y) dx_1}{P(x_2, x_3)}$$.
---
### 5. The Gaussian Distribution
The lecture concludes with a review of the Gaussian (Normal) distribution, the most important function in AI/ML.
* **Univariate Gaussian:** Defined by mean $\mu$ and variance $\sigma^2$.
* **Multivariate Gaussian:** Defined for a vector $x \in R^D$.
$$P(x) = \frac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2} (x-\mu)^T \Sigma^{-1} (x-\mu)\right)$$.
* **Parameters:**
* $\mu$: Mean vector ($D$-dimensional).
* $\Sigma$: Covariance Matrix ($D \times D$). It must be **Symmetric** and **Positive Definite**.
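
A short sketch evaluating this density with SciPy and checking it against the closed-form expression; the 2D parameters below are made up:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Example parameters (made up): a 2D Gaussian with correlated components.
mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])          # symmetric, positive definite

x = np.array([0.5, 0.5])
p_scipy = multivariate_normal(mean=mu, cov=Sigma).pdf(x)

# Same density from the formula above.
D = 2
diff = x - mu
p_manual = np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff) / (
    (2 * np.pi) ** (D / 2) * np.sqrt(np.linalg.det(Sigma)))
print(p_scipy, p_manual)                # the two values agree
```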

final/1110.md Normal file

@@ -0,0 +1,104 @@
# Study Guide: Generative Methods & Multivariate Gaussian Distributions
**Date:** 2025.11.10
**Topic:** Generative vs. Discriminative Models, Multivariate Gaussian Properties, Conditional and Marginal Distributions.
---
### **1. Generative vs. Discriminative Methods**
The lecture begins by contrasting the new topic (Generative Methods) with previous topics (Discriminative Methods like Linear Regression, Logistic Regression, and SVM).
* **Discriminative Methods (Separating):**
* These methods focus on finding a boundary (separating line or hyperplane) between classes.
* **Limitation:** They cannot generate new data samples because they do not model the data distribution; they only know the boundary.
* **Hypothesis:** They assume a linear line or function as the hypothesis to separate data.
* **Generative Methods (Inferring Distribution):**
* **Goal:** To infer the **underlying distribution** (the rule or pattern) from which the data samples were drawn.
* **Assumption:** Data is not random; it follows a specific probabilistic structure (e.g., drawn from a distribution).
* **Capabilities:** Once the Joint Probability Distribution (underlying distribution) is known:
1. **Classification:** Can be performed using Bayes' Rule.
2. **Generation:** New samples can be created that follow the same patterns as the training data (e.g., generating new images or text).
---
### **2. The Gaussian (Normal) Distribution**
The Gaussian distribution is the most popular choice for modeling the "hypothesis" of the underlying distribution in generative models.
#### **Why Gaussian?**
1. **Simplicity:** Defined entirely by two parameters: Mean ($\mu$) and Covariance ($\Sigma$).
2. **Central Limit Theorem:** Sums of independent random events tend to follow a Gaussian distribution.
3. **Mathematical "Closure":** The most critical reason for its use in AI is that **Conditional** and **Marginal** distributions of a Multivariate Gaussian are *also* Gaussian.
#### **Multivariate Gaussian Definition**
For a $D$-dimensional vector $x$:
$$P(x) = \frac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2} (x-\mu)^T \Sigma^{-1} (x-\mu)\right)$$
* $\mu$: Mean vector ($D$-dimensional).
* $\Sigma$: Covariance Matrix ($D \times D$).
[Image of multivariate gaussian distribution 3d plot]
#### **Properties of the Covariance Matrix ($\Sigma$)**
* **Symmetric:** $\Sigma_{ij} = \Sigma_{ji}$.
* **Positive Definite:** All eigenvalues are positive.
* **Diagonal Terms:** Represent the variance of individual variables.
* **Off-Diagonal Terms:** Represent the correlation (covariance) between variables.
* If $\sigma_{12} = 0$, the variables are **independent** (for Gaussians).
* The matrix shape determines the geometry of the distribution contours (spherical vs. elliptical).
---
### **3. Independence and Factorization**
If the Covariance Matrix is **diagonal** (all off-diagonal elements are 0), the variables are independent.
* Mathematically, the inverse matrix $\Sigma^{-1}$ is also diagonal.
* The joint probability factorizes into the product of marginals:
$$P(x_1, x_2) = P(x_1)P(x_2)$$
* The "quadratic form" inside the exponential splits into a sum of separate squared terms.
---
### **4. Conditional Gaussian Distribution**
The lecture derives what happens when we observe a subset of variables (e.g., $x_2$) and want to determine the distribution of the remaining variables ($x_1$). This is $P(x_1 | x_2)$.
* **Concept:** Visually, this is equivalent to "slicing" the joint distribution at a specific value of $x_2$ (fixed constant).
* **Result:** The resulting cross-section is **also a Gaussian distribution**.
* **Parameters:** If we partition $x$, $\mu$, and $\Sigma$ into subsets, the conditional mean ($\mu_{1|2}$) and covariance ($\Sigma_{1|2}$) are given by:
* **Mean:** $\mu_{1|2} = \mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2)$.
* **Covariance:** $\Sigma_{1|2} = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}$.
*(Note: The derivation involves completing the square to identify the Gaussian form).*
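
A minimal sketch of these two formulas for a made-up 2D joint Gaussian; both blocks are scalars here, so the matrix inverses reduce to divisions:

```python
import numpy as np

# Partitioned joint Gaussian over (x1, x2); block values are made up.
mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])

mu1, mu2 = mu[0], mu[1]
S11, S12, S21, S22 = Sigma[0, 0], Sigma[0, 1], Sigma[1, 0], Sigma[1, 1]

x2_observed = 2.0
mu_cond = mu1 + S12 * (1.0 / S22) * (x2_observed - mu2)   # mu_{1|2}
Sigma_cond = S11 - S12 * (1.0 / S22) * S21                # Sigma_{1|2}
print(mu_cond, Sigma_cond)

# The marginal of x1 is even simpler: just take mu1 and Sigma_11.
print(mu1, S11)
```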
---
### **5. Marginal Gaussian Distribution**
The lecture explains how to find the distribution of a subset of variables ($x_1$) by ignoring the others ($x_2$). This is $P(x_1)$.
* **Concept:** This is equivalent to integrating out the unobserved variables:
$$P(x_1) = \int P(x_1, x_2) dx_2$$
* **Result:** The marginal distribution is **also a Gaussian distribution**.
* **Parameters:** Unlike the conditional case, calculating the marginal parameters is trivial. You simply select the corresponding sub-vector and sub-matrix from the joint parameters.
* Mean: $\mu_1$.
* Covariance: $\Sigma_{11}$.
### **Summary Table**
| Distribution | Type | Parameters Derived From Joint $(\mu, \Sigma)$ |
| :--- | :--- | :--- |
| **Joint** $P(x)$ | Gaussian | Given as $\mu, \Sigma$ |
| **Conditional** $P(x_1 \| x_2)$ | Gaussian | Complex formula (involves matrix inversion of $\Sigma_{22}$) |
| **Marginal** $P(x_1)$ | Gaussian | Simple subset (extract $\mu_1$ and $\Sigma_{11}$) |
The lecture concludes by emphasizing that understanding these Gaussian properties is essential for the second half of the semester, as they form the basis for probabilistic generative models.

final/1113.md Normal file

@@ -0,0 +1,85 @@
# Study Guide: Learning in Generative Methods & Bayes Optimal Classifier
**Date:** 2025.11.13
**Topic:** Maximum Likelihood Estimation (MLE), Missing Data Handling, and Bayes Optimal Classifier.
---
### **1. Overview: Learning in Generative Methods**
The fundamental goal of generative methods is to **estimate the underlying distribution** of the data. Unlike discriminative methods (e.g., Logistic Regression, SVM) which focus on finding a separating boundary, generative methods introduce a probabilistic model and learn its parameters.
* **Discriminative Model:** Learns specific parameters (like $w, b$ in linear models) to separate classes.
* **Generative Model:** Learns parameters (like $\mu, \Sigma$ in Gaussian models) that best describe how the data is distributed.
#### **Why Gaussian?**
The Gaussian distribution is the standard model for generative methods because of its mathematical convenience: **both its conditional and marginal distributions are also Gaussian**. This property simplifies probabilistic inference significantly.
[Image of multivariate gaussian distribution 3d plot]
---
### **2. The Learning Process: Parameter Estimation**
"Learning" in this context means finding the best parameters ($\mu, \Sigma$) for the Gaussian model given the training data.
#### **Step 1: Define the Objective Function**
We need a metric to evaluate how well our model fits the data. The core idea is **Likelihood**:
* **Goal:** We want to assign **high probability** to the observed (empirical) data points.
* **Likelihood Function:** For independent data points, the likelihood is the product of their individual probabilities.
$$P(Z | \mu, \Sigma) = \prod_{i=1}^{N} P(z_i | \mu, \Sigma)$$
#### **Step 2: Log-Likelihood (MLE)**
Directly maximizing the product is difficult. We apply the **logarithm** to convert the product into a sum, creating the **Log-Likelihood** function. This does not change the location of the maximum.
* **Objective:** Maximize $\sum_{i=1}^{N} \ln P(z_i | \mu, \Sigma)$.
#### **Step 3: Optimization (Derivation)**
We calculate the partial derivatives of the log-likelihood function with respect to the parameters and set them to zero to find the maximum.
* **Optimal Mean ($\hat{\mu}$):**
The derivation yields the **Empirical Mean**. It is simply the average of the data points.
$$\hat{\mu} = \frac{1}{N} \sum_{i=1}^{N} z_i$$
* **Optimal Covariance ($\hat{\Sigma}$):**
The derivation yields the **Empirical Covariance**.
$$\hat{\Sigma} = \frac{1}{N} \sum_{i=1}^{N} (z_i - \hat{\mu})(z_i - \hat{\mu})^T$$
**Conclusion:** The "learning" for a Gaussian generative model is simply calculating the sample mean and sample covariance of the training data. This is a closed-form solution, meaning no iterative updates are strictly necessary.
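
A short NumPy sketch of this closed-form "learning" step: sample from a known Gaussian (parameters made up) and recover it with the empirical mean and the $\frac{1}{N}$ covariance:

```python
import numpy as np

rng = np.random.default_rng(0)
true_mu = np.array([1.0, -1.0])
true_Sigma = np.array([[1.0, 0.5],
                       [0.5, 2.0]])
Z = rng.multivariate_normal(true_mu, true_Sigma, size=5000)   # N x D data matrix

mu_hat = Z.mean(axis=0)                                  # empirical mean (the MLE)
centered = Z - mu_hat
Sigma_hat = (centered.T @ centered) / Z.shape[0]         # empirical covariance (1/N, the MLE)

print(mu_hat, "\n", Sigma_hat)   # close to the true parameters for large N
```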
---
### **3. Inference: Making Predictions**
Once the joint distribution $P(z)$ (where $z$ contains both input features $x$ and class labels $y$) is learned, we can perform inference.
#### **Classification**
To classify a new data point $x_{new}$:
1. We aim to calculate the conditional probability $P(y | x_{new})$.
2. Using the properties of the multivariate Gaussian, we treat the label $y$ as just another dimension in the random vector.
3. We calculate probabilities for each class and compare them (e.g., $P(y=1 | x)$ vs $P(y=0 | x)$).
#### **Handling Missing Data**
Generative models offer a theoretically robust way to handle missing variables.
* **Scenario:** We have inputs $x = [x_1, x_2]$, but $x_2$ is missing during inference.
* **Method:** **Marginalization**.
1. Start with the Joint PDF.
2. Integrate (marginalize) out the missing variable $x_2$.
$$P(y | x_1) = \frac{\int P(x_1, x_2, y) dx_2}{P(x_1)}$$
3. Because the model is Gaussian, marginalization is trivial: simply select the sub-vector and sub-matrix corresponding to the observed variables.
* This is superior to heuristic methods like imputing the mean.
---
### **4. Bayes Optimal Classifier**
The lecture introduces the concept of the theoretical "perfect" classifier.
* **Definition:** The **Bayes Optimal Classifier** is the ideal classifier that would exist if we knew the *true* underlying distribution of the data.
* **Decision Rule:** It assigns the class with the highest posterior probability $P(C_k | x)$.
$$P_1(x_{new}) \ge P_2(x_{new}) \rightarrow \text{Class 1}$$
#### **Bayes Error**
* Even the optimal classifier has an irreducible error called the **Bayes Error**.
* **Cause:** Classes often overlap in the feature space. In the overlapping regions, even the best decision rule will make mistakes with some probability.
* **Implication:** No machine learning algorithm can genuinely achieve 0% error (100% accuracy) on non-trivial problems. The goal of ML is to approximate the Bayes Error limit.
* **Mathematical Definition:** The error is the integral of the minimum probability density over the overlapping region:
$$\text{Error} = \int \min[P(C_1|x), P(C_2|x)] dx$$
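
A small numeric sketch of this integral for two overlapping 1D class densities with equal priors; the means and variance are illustrative:

```python
import numpy as np
from scipy.stats import norm

# Two overlapping class-conditional densities with equal priors (illustrative values).
p1 = norm(loc=-1.0, scale=1.0)
p2 = norm(loc=+1.0, scale=1.0)
prior = 0.5

xs = np.linspace(-10.0, 10.0, 100001)
integrand = np.minimum(prior * p1.pdf(xs), prior * p2.pdf(xs))
bayes_error = integrand.sum() * (xs[1] - xs[0])   # simple Riemann sum
print(bayes_error)   # ~0.159 here: even the optimal rule errs where the classes overlap
```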

final/1117.md Normal file

@@ -0,0 +1,99 @@
# Study Guide: Bayes Optimal Classifier & Graphical Models (Bayesian Networks)
**Date:** 2025.11.17
**Topic:** Bayes Error, Graphical Models (Directed), Conditional Independence, and D-separation.
---
### **1. Recap: Bayes Optimal Classifier and Bayes Error**
The lecture begins by revisiting the concept of the **Bayes Optimal Classifier**.
* **Decision Rule:** For a new data point $x_{new}$, the classifier compares the posterior probabilities $P(C_1 | x_{new})$ and $P(C_2 | x_{new})$. It assigns the label associated with the higher probability.
* **Optimality:** No other classifier can outperform the Bayes Optimal Classifier. It achieves the theoretical minimum error rate.
#### **Bayes Error (Irreducible Error)**
* **Definition:** Even the perfect classifier will make mistakes because classes often overlap in the feature space. This inevitable error is called the **Bayes Error**.
* **Cause:** It arises from inherent uncertainty, noise, or ambiguity in the data itself, not from the classifier's limitations.
* **Goal of ML:** The objective of any machine learning algorithm is not to achieve 0% error (which is impossible) but to approach the Bayes Error limit as closely as possible.
* **Formula:** The risk (expected error) is the integral of the minimum probability over the domain:
$$R^* = \int \min[P_1(x), P_2(x)] dx$$
If priors are equal, this simplifies to the integral of the overlap region.
---
### **2. Introduction to Graphical Models**
The focus shifts to **Generative Models**, specifically **Graphical Models** (also known as Bayesian Networks).
* **Motivation:**
* A full multivariate Gaussian model requires estimating a **Covariance Matrix** with $D \times D$ elements.
    * The number of parameters grows quadratically, $O(D^2)$: a symmetric $D \times D$ covariance matrix has $\frac{D(D+1)}{2}$ free entries.
* For high-dimensional data (like images with millions of pixels), estimating these parameters requires an enormous amount of data, which is often infeasible.
* **Solution:** Use **Prior Knowledge** to simplify the model. If we know that certain variables are independent, we can set their covariance terms to zero, significantly reducing the number of parameters to learn.
---
### **3. The Chain Rule and Independence**
Graphical models leverage the **Chain Rule of Probability** to decompose a complex joint distribution into simpler conditional probabilities.
* **General Chain Rule:**
$$P(x_1, ..., x_D) = P(x_1) P(x_2|x_1) P(x_3|x_1, x_2) ... P(x_D|x_1...x_{D-1})$$
* **Simplification with Independence:**
If variable $x_3$ depends only on $x_1$ and is independent of $x_2$, then $P(x_3|x_1, x_2)$ simplifies to $P(x_3|x_1)$.
* **Structure:** This creates a **Directed Acyclic Graph (DAG)** (or Bayes Network) where:
* **Nodes** represent random variables.
* **Edges (Arrows)** represent conditional dependencies (causality).
---
### **4. Building a Bayesian Network (Causal Graph)**
The lecture illustrates this with a practical example involving a crying baby.
* **Scenario:** We want to model the causes of a baby crying.
* **Variables:**
* **Cry:** The observable effect.
* **Hungry, Sick, Diaper:** Direct causes of crying.
* **Pororo:** A distractor (e.g., watching a cartoon) that might stop the crying.
* **Dependencies:**
* "Hungry" and "Sick" might be independent of each other generally.
* "Cry" depends on all of them.
* "Pororo" depends on "Cry" (parent turns on TV *because* baby is crying) or affects "Cry".
---
### **5. The Three Canonical Patterns of Independence**
To understand complex graphs, we decompose them into three fundamental 3-node patterns. Understanding these patterns allows us to determine if variables are independent given some evidence.
#### **1. Tail-to-Tail (Common Cause)**
* **Structure:** $X \leftarrow Z \rightarrow Y$ (Z causes both X and Y).
* **Property:** $X$ and $Y$ are dependent. However, if $Z$ is observed (given), $X$ and $Y$ become **independent**.
* **Example:** If $Z$ (Cause) determines both $X$ and $Y$, knowing $Z$ explains the correlation, decoupling $X$ and $Y$.
#### **2. Head-to-Tail (Causal Chain)**
* **Structure:** $X \rightarrow Z \rightarrow Y$ (X causes Z, which causes Y).
* **Property:** $X$ and $Y$ are dependent. If $Z$ is observed, the path is blocked, and $X$ and $Y$ become **independent**.
* **Example:** $X$ influences $Y$ only through $Z$. If $Z$ is fixed, $X$ cannot influence $Y$ further.
#### **3. Head-to-Head (Common Effect / V-Structure)**
* **Structure:** $X \rightarrow Z \leftarrow Y$ (X and Y both cause Z).
* **Property:** **Crucial Difference.** $X$ and $Y$ are naturally **independent** (marginal independence). However, if $Z$ is observed (or a descendant is observed), they become **dependent** ("explaining away").
* **Example:** $X$ (Hungry) $\rightarrow$ $Z$ (Cry) $\leftarrow$ $Y$ (Sick).
* Being hungry tells us nothing about being sick (Independent).
* But if we *know* the baby is crying ($Z$ observed): finding out the baby is Hungry ($X$) makes it less likely they are Sick ($Y$). The causes compete to explain the effect.
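
A brute-force enumeration sketch of this "explaining away" effect; all conditional probability values below are made up for illustration:

```python
import itertools

# Hypothetical CPTs for the v-structure Hungry -> Cry <- Sick (all numbers made up).
P_H = {1: 0.3, 0: 0.7}                     # P(Hungry)
P_S = {1: 0.1, 0: 0.9}                     # P(Sick)
P_C1 = {(1, 1): 0.95, (1, 0): 0.80,        # P(Cry = 1 | Hungry, Sick)
        (0, 1): 0.70, (0, 0): 0.05}

def joint(h, s, c):
    p = P_C1[(h, s)] if c == 1 else 1 - P_C1[(h, s)]
    return P_H[h] * P_S[s] * p

def prob_sick(h_obs=None, c_obs=None):
    """P(Sick = 1 | observed Hungry / Cry), computed by brute-force enumeration."""
    def total(s_values):
        acc = 0.0
        for h, s, c in itertools.product([0, 1], s_values, [0, 1]):
            if h_obs is not None and h != h_obs:
                continue
            if c_obs is not None and c != c_obs:
                continue
            acc += joint(h, s, c)
        return acc
    return total([1]) / total([0, 1])

print(prob_sick())                    # = P(Sick) = 0.1 (independent of Hungry a priori)
print(prob_sick(c_obs=1))             # observing crying raises P(Sick)        (~0.24)
print(prob_sick(c_obs=1, h_obs=1))    # also observing Hungry "explains away"  (~0.12)
```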
---
### **6. D-Separation**
These rules form the basis of **D-separation** (Directed Separation), a formal method to determine conditional independence in any directed graph.
* If all paths between two variables are "blocked" by the evidence set, the variables are D-separated (independent).
* A path is blocked if:
* It contains a chain or fork where the middle node is **observed**.
    * It contains a collider (head-to-head node) where neither the middle node nor any of its descendants is **observed**.

final/1120.md Normal file

@@ -0,0 +1,79 @@
# Lecture Summary: Directed Graphical Models and Naive Bayes
**Date:** 2025.11.20
**Topic:** Parameter Reduction, Directed Graphical Models, Chain Rule, and Naive Bayes Classifier.
---
### **1. Motivation: The Need for Parameter Reduction**
The lecture begins by reviewing Generative Methods using the Gaussian distribution.
* **The Problem:** In high-dimensional settings (e.g., analyzing images or complex biological data), estimating the full Joint Probability Distribution is computationally expensive and data-intensive.
* For a $D$-dimensional Multivariate Gaussian, we must estimate the mean vector $\mu$ ($D$ parameters) and the Covariance Matrix $\Sigma$ (symmetric $D \times D$ matrix).
* The total number of parameters is roughly $O(D^2)$, specifically $D + \frac{D(D+1)}{2}$.
* For large $D$, this requires a massive amount of training data to avoid overfitting.
* **The Solution:** We use **Prior Knowledge** (domain knowledge) about the relationships between variables to reduce the number of parameters.
* By assuming certain variables are independent, we can decompose the complex joint distribution into smaller, simpler conditional distributions.
---
### **2. Directed Graphical Models (Bayesian Networks)**
A Directed Graphical Model represents random variables as nodes in a graph, where edges denote conditional dependencies.
#### **Decomposition via Chain Rule**
* The joint probability $P(x)$ can be decomposed using the chain rule:
$$P(x_1, ..., x_D) = \prod_{i=1}^{D} P(x_i | \text{parents}(x_i))$$
* **Example Structure:**
If we have a graph where $x_1$ has no parents, $x_2$ depends on $x_1$, etc., the joint distribution splits into:
$$P(x) = P(x_1)P(x_2|x_1)P(x_3|x_1)...$$
#### **Parameter Counting Example (Gaussian Case)**
The lecture compares the number of parameters required for a "Full" Gaussian model vs. a "Reduced" Graphical Model.
* **Full Gaussian:** Assumes all variables are correlated.
* For a 10-dimensional vector ($D=10$), parameters = $10 + \frac{10 \times 11}{2} = 65$.
* **Reduced Model:** Uses a graph structure where variables are conditionally independent.
* Instead of one giant covariance matrix, we estimate parameters for several smaller conditional distributions (often univariate Gaussians).
* **Calculation:** For a univariate conditional Gaussian $P(x_i | x_j)$, we need parameters for the linear relationship (mean coefficients) and variance.
    * In the specific example provided, the parameter count drops from 65 to 57. While the reduction in this small example is modest, for high-dimensional data with sparse connections it is drastic.
---
### **3. The Naive Bayes Classifier**
The **Naive Bayes** classifier is the most extreme (and popular) example of a Directed Graphical Model used for parameter reduction.
* **Assumption:** Given the class label $y$, all input features $x_1, ..., x_D$ are **mutually independent**.
* **Structure:** The class $y$ is the parent of all feature nodes $x_i$. There are no connections between the features themselves.
* **Formula:**
$$P(x|y) = P(x_1|y) P(x_2|y) \cdot ... \cdot P(x_D|y) = \prod_{d=1}^{D} P(x_d|y)$$
* **Advantage:** We only need to estimate the distribution of each feature individually, rather than their complex joint interactions.
---
### **4. Application: Spam Classifier**
The lecture applies the Naive Bayes framework to a discrete problem: classifying emails as **Spam ($y=1$)** or **Not Spam ($y=0$)**.
#### **Feature Engineering**
* **Input:** Emails with varying text lengths.
* **Transformation:** A "Bag of Words" approach is used.
1. Create a dictionary of $N$ words (e.g., $N=10,000$).
2. Represent each email as a fixed-length binary vector $x \in \{0, 1\}^{10,000}$.
3. $x_i = 1$ if the $i$-th word appears in the email, $0$ otherwise.
#### **The "Curse of Dimensionality" (Without Naive Bayes)**
* Since the features are discrete (binary), we cannot use Gaussian distributions. We must use probability tables.
* If we tried to model the full joint distribution $P(x_1, ..., x_{10000} | y)$, we would need a probability table for every possible combination of words.
* **Parameter Count:** $2^{10,000}$ entries. This is computationally impossible.
#### **Applying Naive Bayes**
* By assuming word independence given the class, we decompose the problem:
$$P(x|y) \approx \prod_{i=1}^{10,000} P(x_i|y)$$
* **Parameter Estimation:**
* We only need to estimate $P(x_i=1 | y=1)$ and $P(x_i=1 | y=0)$ for each word.
* This requires simply counting the frequency of each word in Spam vs. Non-Spam emails.
* **Reduced Parameter Count:**
* Instead of $2^{10,000}$, we need roughly $2 \times 10,000$ parameters (one probability per word per class).
* This transforms an impossible problem into a highly efficient and simple one.
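
A minimal counting sketch of this estimation step on a made-up bag-of-words matrix; Laplace smoothing (not discussed in the lecture) is added so that no probability is exactly zero:

```python
import numpy as np

# Toy bag-of-words data (made up): 6 emails x 5 dictionary words, binary features.
X = np.array([[1, 1, 0, 0, 0],
              [1, 0, 1, 0, 0],
              [1, 1, 1, 0, 0],
              [0, 0, 0, 1, 1],
              [0, 1, 0, 1, 0],
              [0, 0, 0, 1, 1]])
y = np.array([1, 1, 1, 0, 0, 0])       # 1 = spam, 0 = not spam

def estimate(X, y, c, alpha=1.0):
    """P(x_i = 1 | y = c) for every word i, by counting (with Laplace smoothing alpha)."""
    Xc = X[y == c]
    return (Xc.sum(axis=0) + alpha) / (Xc.shape[0] + 2 * alpha)

theta_spam = estimate(X, y, 1)
theta_ham = estimate(X, y, 0)
prior_spam = (y == 1).mean()

# Classify a new email with the Naive Bayes product (in log space for numerical stability).
x_new = np.array([1, 0, 1, 0, 0])
def log_score(x, theta, prior):
    return np.log(prior) + np.sum(x * np.log(theta) + (1 - x) * np.log(1 - theta))

print(log_score(x_new, theta_spam, prior_spam) > log_score(x_new, theta_ham, 1 - prior_spam))
```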
### **5. Summary**
* **Generative Methods** aim to model the underlying distribution $P(x, y)$.
* **Graphical Models** allow us to inject prior knowledge (independence assumptions) to make this feasible.
* **Naive Bayes** assumes full conditional independence, reducing parameter estimation from exponential to linear complexity, making it ideal for high-dimensional discrete data like text classification.

final/1124.md Normal file

@@ -0,0 +1,85 @@
# Study Guide: Discrete Probability Models & Undirected Graphical Models
**Date:** 2025.11.24
**Topic:** Multinomial Distribution, Maximum Likelihood Estimation (MLE), and Markov Random Fields (Undirected Graphical Models).
---
### **1. Discrete Probability Distributions**
The lecture shifts focus from continuous models (like Gaussian) to discrete models, which are essential for tasks like text classification (e.g., Naive Bayes).
#### **Binomial Distribution**
* **Scenario:** A coin toss (Binary outcome: Head/Tail).
* **Random Variables:** $m_1$ (count of Heads), $m_2$ (count of Tails).
* **Parameters:** Probability of Head ($\mu$) and Tail ($1-\mu$).
* **Formula:** For a sequence of tosses, we consider the number of ways to arrange the outcomes.
$$P(m_1, m_2) = \frac{N!}{m_1!m_2!} \mu^{m_1} (1-\mu)^{m_2}$$
#### **Multinomial Distribution**
* **Scenario:** Rolling a die with $K$ faces (e.g., $K=6$). This generalizes the binomial distribution.
* **Definition:**
* We have $N$ total events (trials).
* We observe counts $m_1, m_2, ..., m_k$ for each of the $K$ possible outcomes.
* Parameters $\mu_1, ..., \mu_k$ represent the probability of each outcome.
* **Probability Mass Function:**
$$P(m_1, ..., m_k | \mu) = \frac{N!}{m_1! ... m_k!} \prod_{k=1}^{K} \mu_k^{m_k}$$
---
### **2. Learning: Maximum Likelihood Estimation (MLE)**
How do we estimate the parameters ($\mu_k$) from data?
* **Goal:** Maximize the likelihood of the observed data subject to the constraint that probabilities sum to 1 ($\sum \mu_k = 1$).
* **Method:** **Lagrange Multipliers**.
1. **Objective:** Maximize Log-Likelihood:
$$L = \ln(N!) - \sum \ln(m_k!) + \sum m_k \ln(\mu_k)$$
2. **Constraint:** $\sum_{k=1}^{K} \mu_k - 1 = 0$.
3. **Lagrangian:**
$$L' = \sum_{k=1}^{K} m_k \ln(\mu_k) + \lambda (\sum_{k=1}^{K} \mu_k - 1)$$
(Note: Constant terms like $N!$ vanish during differentiation).
4. **Derivation:** Taking the derivative w.r.t $\mu_k$ and setting to 0 yields $\mu_k = - \frac{m_k}{\lambda}$. Solving for $\lambda$ using the constraint gives $\lambda = -N$.
* **Result:**
$$\mu_k = \frac{m_k}{N}$$
* The optimal parameter is simply the **empirical fraction** (count of specific events divided by total events).
* This provides the theoretical justification for the simple "counting" method used in the Naive Bayes classifier discussed in previous lectures.
---
### **3. Undirected Graphical Models (Markov Random Fields)**
When causal relationships (direction) are unclear or interactions are symmetric (e.g., neighboring pixels in an image, social network friends), we use **Undirected Graphs** instead of Bayesian Networks (Directed Acyclic Graphs).
#### **Comparison**
* **Directed (Bayesian Network):** Uses conditional probabilities (e.g., $P(A|B)$). Represents causality or asymmetric relationships.
* **Undirected (Markov Random Field - MRF):** Uses "Potential Functions" ($\psi$). Represents correlation or symmetric constraints.
#### **Conditional Independence in MRF**
Determining independence is simpler in undirected graphs than in directed graphs (no D-separation rules needed).
* **Global Markov Property:** Two sets of nodes are conditionally independent given a separating set if all paths between them pass through the separating set.
* *Example:* If nodes $X_1$ and $X_5$ are not directly connected, they are conditionally independent given the intermediate nodes (e.g., $X_3$) that block the path.
---
### **4. Factorization in Undirected Graphs**
Since we cannot use chain rules of conditional probabilities (because $P(A|B) \neq P(B|A)$ generally), we model the joint distribution using **Cliques**.
#### **Cliques and Maximal Cliques**
* **Clique:** A subgraph where every pair of nodes is connected (fully connected).
* **Maximal Clique:** A clique that cannot be expanded by including any other adjacent node.
#### **The Joint Distribution Formula**
We associate a **Potential Function** ($\psi_C$) with each maximal clique $C$.
$$P(x) = \frac{1}{Z} \prod_{C} \psi_C(x_C)$$
* **Potential Function ($\psi$):** A non-negative function that scores the compatibility of variables in a clique. It is *not* a probability (doesn't sum to 1).
* **Partition Function ($Z$):** The normalization constant required to make the total probability sum to 1.
$$Z = \sum_x \prod_{C} \psi_C(x_C)$$
#### **Example Decomposition**
Given a graph with maximal cliques $\{x_1, x_2\}$, $\{x_1, x_3\}$, and $\{x_3, x_4, x_5\}$:
$$P(x) = \frac{1}{Z} \psi_{12}(x_1, x_2) \psi_{13}(x_1, x_3) \psi_{345}(x_3, x_4, x_5)$$
#### **Hammersley-Clifford Theorem**
This theorem provides the theoretical guarantee that a strictly positive distribution can satisfy the conditional independence properties of an undirected graph if and only if it can be factorized over the graph's cliques.

final/1127.md Normal file

@@ -0,0 +1,63 @@
# Study Guide: Undirected Graphical Models (Markov Random Fields)
**Date:** 2025.11.27
**Topic:** Potential Functions, Partition Function, and Conditional Independence in MRFs.
---
### **1. Recap: Decomposition in Undirected Graphs**
Unlike Directed Graphical Models (Bayesian Networks) which use conditional probabilities, **Undirected Graphical Models (Markov Random Fields - MRFs)** cannot directly use probabilities because there is no direction/causality. Instead, they decompose the joint distribution based on **Maximal Cliques**.
* **Cliques:** Subsets of nodes where every node is connected to every other node.
* **Maximal Clique:** A clique that cannot be expanded by adding any adjacent node (in the example graph, the maximal cliques together cover all nodes).
* **Decomposition Rule:** The joint distribution is the product of functions defined over these maximal cliques.
---
### **2. Potential Functions ($\psi$)**
* **Definition:** For each maximal clique $C$, we define a **Potential Function** $\psi_C(x_C)$ (often denoted as $\phi$ or $\psi$).
    * It is a **non-negative function** ($\psi_C(x_C) \ge 0$) mapping the state of the clique variables to a real number.
* It represents the "compatibility" or "energy" of that configuration.
* **Key Distinction:** A potential function is **NOT a probability**. It does not sum to 1. It is just a score (non-negative function).
* *Example:* $\psi_{12}(x_1, x_2)$ scores the interaction between $x_1$ and $x_2$.
---
### **3. The Partition Function ($Z$)**
Since the product of potential functions is not a probability distribution (it doesn't sum to 1), we must normalize it.
* **Definition:** The normalization constant is called the **Partition Function** ($Z$).
$$Z = \sum_{x} \prod_{C} \psi_C(x_C)$$
* **Role:** It ensures that the resulting distribution sums to 1, making it a valid probability distribution.
$$P(x) = \frac{1}{Z} \prod_{C} \psi_C(x_C)$$
* **Calculation:** To find $Z$, we must sum the product of potentials over **all possible states** (combinations) of the random variables. This summation is often computationally expensive.
#### **Example Calculation**
The lecture walks through a simple example with 4 binary variables and two cliques: $\{x_1, x_2, x_3\}$ and $\{x_3, x_4\}$.
* **Step 1:** Define potential tables for $\psi_{123}$ and $\psi_{34}$.
* **Step 2:** Calculate the score for every combination.
* **Step 3:** Sum all scores to get $Z$. In the example, $Z=10$.
* **Step 4:** The probability of any specific state, e.g. $P(1,0,0,0)$, is its unnormalized score (the product of the relevant potential entries, e.g. $1 \times 3$ in the lecture's example) divided by $Z = 10$.
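
A brute-force sketch of this calculation; the potential tables below are hypothetical stand-ins, since the lecture's exact numbers are not reproduced in the notes:

```python
import itertools
import numpy as np

# Hypothetical potential tables for the cliques {x1, x2, x3} and {x3, x4} (binary variables).
psi_123 = np.array([[[1., 2.], [1., 3.]],
                    [[2., 1.], [3., 1.]]])   # psi_123[x1, x2, x3]
psi_34 = np.array([[1., 2.],
                   [3., 1.]])                # psi_34[x3, x4]

def score(x1, x2, x3, x4):
    # Unnormalized score: product of the maximal-clique potentials.
    return psi_123[x1, x2, x3] * psi_34[x3, x4]

# Partition function: sum of the score over all 2^4 states.
Z = sum(score(*state) for state in itertools.product([0, 1], repeat=4))

# Probability of a specific state = its score divided by Z.
print(Z, score(1, 0, 0, 0) / Z)

# Sanity check: the normalized scores sum to 1.
print(sum(score(*s) / Z for s in itertools.product([0, 1], repeat=4)))
```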
---
### **4. Parameter Estimation**
* **Discrete Case:** If variables are discrete (like the email spam example), the parameters are the entries in the potential tables. We estimate these values from data to maximize the likelihood.
* **Continuous Case:** If variables are continuous, potential functions are typically Gaussian distributions. We estimate means and covariances.
* **Reduction:** Just like in Bayesian Networks, using the graph structure reduces the number of parameters.
* *Without Graph:* A full table for 4 binary variables needs $2^4 = 16$ entries.
* *With Graph:* We only need tables for the cliques, significantly reducing complexity.
---
### **5. Verifying Conditional Independence**
The lecture demonstrates analytically that the potential function formulation preserves the conditional independence properties of the graph.
* **Scenario:** Graph with structure $x_1 - x_2 - x_3 - x_4$.
* Is $x_4$ independent of $x_1$ given $x_3$?
* **Analytical Check:**
* We calculate $P(x_4=1 | x_1=0, x_2=1, x_3=1)$.
* We also calculate $P(x_4=1 | x_1=0, x_2=0, x_3=1)$.
* **Result:** The calculation shows that as long as $x_3$ is fixed (given), the value of $x_1$ and $x_2$ cancels out in the probability ratio.
* $P(x_4|x_1, x_2, x_3) = \frac{\phi_1}{\phi_1 + \phi_0}$ (depends only on potentials involving $x_4$ and $x_3$).
* **Conclusion:** This confirms that $x_4 \perp \{x_1, x_2\} | x_3$. The formulation correctly encodes the global Markov property.

final/1201.md Normal file

@@ -0,0 +1,102 @@
# Study Guide: Bayesian Networks & Probabilistic Inference
**Date:** 2025.12.01 (Final Lecture)
**Topic:** Bayesian Networks, Probabilistic Inference Examples, Marginalization.
---
### **1. Recap: Directed vs. Undirected Models**
The lecture begins by briefly contrasting the two types of graphical models discussed:
* **Undirected Graphs (MRF):** Use potential functions ($\psi$) defined on maximal cliques. Requires a normalization constant (partition function $Z$) to become a probability distribution.
* **Directed Graphs (Bayesian Networks):** Use conditional probability distributions (CPDs). The joint distribution is the product of local conditional probabilities.
$$P(X) = \prod_{i} P(x_i | \text{parents}(x_i))$$
---
### **2. Example 1: The "Alarm" Network (Burglary/Earthquake)**
This is a classic example used to demonstrate inference in Bayesian Networks.
#### **Scenario & Structure**
* **Nodes:**
* **B:** Burglary (Parent, no prior causes).
* **E:** Earthquake (Parent, no prior causes).
* **A:** Alarm (Triggered by Burglary or Earthquake).
* **J:** JohnCalls (Triggered by Alarm).
* **M:** MaryCalls (Triggered by Alarm).
* **Dependencies:** $B \rightarrow A \leftarrow E$, $A \rightarrow J$, $A \rightarrow M$.
* **Probabilities (Given):**
* $P(B) = 0.05$, $P(E) = 0.1$.
* $P(A|B, E)$: Table given (e.g., $P(A|B, \neg E) = 0.85$, $P(A|\neg B, \neg E) = 0.05$, etc.).
* $P(J|A) = 0.7$, $P(M|A) = 0.8$.
#### **Task 1: Calculate a Specific Joint Probability**
Calculate the probability of the event: **Burglary, No Earthquake, Alarm rings, John calls, Mary does not call**.
$$P(B, \neg E, A, J, \neg M)$$
* **Decomposition:** Apply the Chain Rule based on the graph structure.
$$= P(B) \cdot P(\neg E) \cdot P(A | B, \neg E) \cdot P(J | A) \cdot P(\neg M | A)$$
* **Calculation:**
$$= 0.05 \times 0.9 \times 0.85 \times 0.7 \times 0.2$$
#### **Task 2: Inference (Conditional Probability)**
Calculate the probability that a **Burglary occurred**, given that **John called** and **Mary did not call**.
$$P(B | J, \neg M)$$
* **Formula (Bayes Rule):**
$$P(B | J, \neg M) = \frac{P(B, J, \neg M)}{P(J, \neg M)}$$
* **Numerator Calculation ($P(B, J, \neg M)$):**
We must **marginalize out** the unknown variables ($A$ and $E$) from the joint distribution.
$$P(B, J, \neg M) = \sum_{A \in \{T,F\}} \sum_{E \in \{T,F\}} P(B, E, A, J, \neg M)$$
This involves summing 4 terms (combinations of A and E).
* **Denominator Calculation ($P(J, \neg M)$):**
We further marginalize out $B$ from the numerator result.
$$P(J, \neg M) = P(B, J, \neg M) + P(\neg B, J, \neg M)$$
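
An enumeration sketch of both tasks; the CPT entries not listed in the notes (marked "assumed") are made up for illustration:

```python
import itertools

# CPTs from the lecture; entries marked "assumed" are not given in the notes.
P_B = 0.05
P_E = 0.10
P_A = {(1, 1): 0.95,              # P(A=1 | B=1, E=1)  (assumed)
       (1, 0): 0.85,              # P(A=1 | B=1, E=0)
       (0, 1): 0.30,              # P(A=1 | B=0, E=1)  (assumed)
       (0, 0): 0.05}              # P(A=1 | B=0, E=0)
P_J_given_A = {1: 0.7, 0: 0.05}   # P(J=1 | A); the A=0 entry is assumed
P_M_given_A = {1: 0.8, 0: 0.01}   # P(M=1 | A); the A=0 entry is assumed

def bern(p, value):
    return p if value == 1 else 1 - p

def joint(b, e, a, j, m):
    # Chain-rule decomposition along the graph structure.
    return (bern(P_B, b) * bern(P_E, e) * bern(P_A[(b, e)], a)
            * bern(P_J_given_A[a], j) * bern(P_M_given_A[a], m))

# Task 1: one fully specified state.
print(joint(b=1, e=0, a=1, j=1, m=0))   # = 0.05 * 0.9 * 0.85 * 0.7 * 0.2

# Task 2: P(B=1 | J=1, M=0) -- marginalize the hidden variables A and E, then normalize.
def p_b_j_notm(b):
    return sum(joint(b, e, a, 1, 0) for e, a in itertools.product([0, 1], repeat=2))

numerator = p_b_j_notm(1)
denominator = numerator + p_b_j_notm(0)
print(numerator / denominator)
```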
---
### **3. Example 2: 4-Node Tree Structure**
A simpler example to demonstrate how sums simplify during marginalization.
#### **Scenario & Structure**
* **Nodes:** $X_1, X_2, X_3, X_4 \in \{0, 1\}$ (Binary).
* **Dependencies:**
* $X_1 \rightarrow X_2$
* $X_2 \rightarrow X_3$
* $X_2 \rightarrow X_4$
* **Decomposition:** $P(X) = P(X_1)P(X_2|X_1)P(X_3|X_2)P(X_4|X_2)$.
* **Given Tables:** Probabilities for all priors and conditionals are provided.
#### **Task: Calculate Marginal Probability $P(X_3 = 1)$**
We need to find the probability of $X_3=1$ regardless of the other variables.
* **Definition:** Sum the joint probability over all other variables ($X_1, X_2, X_4$).
$$P(X_3=1) = \sum_{x_1} \sum_{x_2} \sum_{x_4} P(x_1, x_2, x_3=1, x_4)$$
* **Step 1: Expand using Graph Structure**
$$= \sum_{x_1} \sum_{x_2} \sum_{x_4} P(x_1)P(x_2|x_1)P(X_3=1|x_2)P(x_4|x_2)$$
* **Step 2: Simplify (Key Insight)**
Move the summation signs to push them as far right as possible. The sum over $x_4$ only affects the last term $P(x_4|x_2)$.
$$= \sum_{x_1} \sum_{x_2} P(x_1)P(x_2|x_1)P(X_3=1|x_2) \left[ \sum_{x_4} P(x_4|x_2) \right]$$
* **Property:** $\sum_{x_4} P(x_4|x_2) = 1$ (Sum of probabilities for a variable given a condition is always 1).
    * Therefore, the $X_4$ term vanishes. This matches intuition: $X_4$ is a leaf node on a different branch from $X_3$, so when nothing about it is observed, summing it out contributes nothing once $X_2$ is accounted for.
* **Step 3: Final Calculation**
We are left with summing over $X_1$ and $X_2$:
$$= \sum_{x_1} \sum_{x_2} P(x_1)P(x_2|x_1)P(X_3=1|x_2)$$
This expands to 4 terms (combinations of $x_1 \in \{0,1\}$ and $x_2 \in \{0,1\}$).
---
### **4. Semester Summary & Conclusion**
The lecture concludes the semester's material.
* **Key Themes Covered:**
* **Discriminative vs. Generative Methods:** The fundamental difference in approach (boundary vs. distribution).
* **Objective Functions:** Designing Loss functions vs. Likelihood functions.
* **Optimization:** Parameter estimation via derivatives (MLE).
* **Graphical Models:** Reducing parameter complexity using independence assumptions (Bayes Nets, MRFs).
* **Final Exam:** Scheduled for Thursday, December 11th. It will cover the concepts discussed, focusing on understanding the fundamentals (e.g., Likelihood, Generative principles) rather than rote memorization.

BIN
final/AI_Lecture_note_1027.pdf (Stored with Git LFS) Normal file

Binary file not shown.

BIN
final/AI_Lecture_note_1030.pdf (Stored with Git LFS) Normal file

Binary file not shown.

BIN
final/AI_Lecture_note_1103.pdf (Stored with Git LFS) Normal file

Binary file not shown.

BIN
final/AI_Lecture_note_1106.pdf (Stored with Git LFS) Normal file

Binary file not shown.

BIN
final/AI_Lecture_note_1110.pdf (Stored with Git LFS) Normal file

Binary file not shown.

BIN
final/AI_Lecture_note_1113.pdf (Stored with Git LFS) Normal file

Binary file not shown.

BIN
final/AI_Lecture_note_1117.pdf (Stored with Git LFS) Normal file

Binary file not shown.

BIN
final/AI_Lecture_note_1120.pdf (Stored with Git LFS) Normal file

Binary file not shown.

BIN
final/AI_Lecture_note_1124.pdf (Stored with Git LFS) Normal file

Binary file not shown.

BIN
final/AI_Lecture_note_1127.pdf (Stored with Git LFS) Normal file

Binary file not shown.

BIN
final/AI_Lecture_note_1201.pdf (Stored with Git LFS) Normal file

Binary file not shown.