add to final

2025-12-06 18:32:08 +09:00
parent ac1d2e744d
commit 0fc412e690
21 changed files with 935 additions and 0 deletions

final/1027.md Normal file

@@ -0,0 +1,68 @@
# Large Margin Classifiers and Optimization
**Date:** 2025.10.27
**Topic:** Large Margin Classifiers, Optimization, Margin Definition
---
### 1. Introduction to Robust Classification
The lecture begins by shifting focus from generative methods to discriminative methods, specifically within a **linearly separable setting**.
* **Problem Setting:** The goal is to classify data that can be perfectly separated by a linear boundary (hyperplane).
* **Robustness:** While infinite linear classifiers may separate the data, the objective is to find the "best" one. The best classifier is defined as the one that is most **robust**, meaning it generalizes well to new test data and handles potential outliers effectively.
* **Intuition:** A robust classifier places the decision boundary in the middle of the gap between classes, maximizing the distance to the nearest data points.
### 2. Defining the Margin
The concept of the **margin** is introduced to mathematically define robustness.
* **Definition:** The margin is the distance between the decision hyperplane and the closest data points.
* **Hyperplane Equation:** The decision boundary is defined as $w^T x - b = 0$.
* **Support Lines:** To define the margin, we establish two parallel lines passing through the closest data points:
* $w^T x - b = 1$ (for class +1).
* $w^T x - b = -1$ (for class -1).
* The region between these lines contains no data points.
### 3. Calculating the Margin Width
The lecture derives the mathematical expression for the margin width using vector projection.
* **Vector Projection:** The margin is calculated by projecting the vector connecting a point on the boundary ($x_0$) to a support vector ($x$) onto the normal vector $w$.
* **Derivation:**
* The distance is the projection of vector $(x - x_0)$ onto the unit normal vector $\frac{w}{||w||}$.
* Using the constraint $w^T x - b = 1$ and $w^T x_0 - b = 0$, the derived margin distance is $\frac{1}{||w||}$.
* **Conclusion:** Maximizing the margin is equivalent to **minimizing the norm of the weight vector $||w||$**.
### 4. The Optimization Problem
The task of finding the best classifier is formulated as a constrained optimization problem.
* **Objective Function:**
$$\min ||w||^2$$
  (Note: Minimizing $||w||^2$ yields the same solution as minimizing $||w||$ and is easier to differentiate)
* **Constraints:** All data points must be correctly classified and lie outside the margin. This is formalized as:
* $w^T x_i - b \ge 1$ for $y_i = 1$.
* $w^T x_i - b \le -1$ for $y_i = -1$.
* **Combined Constraint:** $y_i (w^T x_i - b) \ge 1$ for all $i$.
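
A minimal numeric sketch of this primal problem, assuming scikit-learn is available; the toy data points below are made up, and a very large `C` in `SVC` is used to approximate the hard-margin objective:

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data (hypothetical points, for illustration only).
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 3.5],   # class +1
              [0.0, 0.0], [1.0, 0.5], [0.5, 1.5]])  # class -1
y = np.array([1, 1, 1, -1, -1, -1])

# A very large C approximates the hard-margin problem:
#   min (1/2)||w||^2  subject to  y_i (w^T x_i - b) >= 1.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_[0]
margin = 1.0 / np.linalg.norm(w)   # distance from the hyperplane to the closest point
print("w =", w, "margin =", margin)
print("support vectors:\n", clf.support_vectors_)
```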
### 5. Optimization with Constraints (Lagrange Multipliers)
The lecture explains how to solve this optimization problem using **Lagrange Multipliers**, using a general example first.
* **Problem Setup:** Minimize an objective function $L(x)$ subject to a constraint $g(x) \ge 0$.
* **Lagrangian:** A new objective function is defined by combining the original loss and the constraint with a multiplier $\lambda$:
$$L'(x) = L(x) - \lambda g(x)$$
(Note: The transcript discusses combining components; the sign depends on the specific maximization/minimization formulation)
* **Solution Cases:**
The solution involves taking the derivative $\frac{dL'}{dx} = 0$ and considering two cases:
1. **Feasible Region ($\lambda = 0$):** The unconstrained minimum of $L(x)$ naturally satisfies the constraint ($g(x) > 0$). In this case, the constraint is inactive.
2. **Boundary Case ($\lambda > 0$):** The unconstrained minimum violates the constraint. Therefore, the optimal solution lies *on* the boundary where $g(x) = 0$.
### 6. Example: Constrained Minimization
A specific mathematical example is worked through to demonstrate the method.
* **Objective:** Minimize $x_1^2 + x_2^2$ (distance from origin).
* **Constraint:** $x_2 - x_1^2 - 1 \ge 0$ (must be above a parabola).
* **Solving:**
* The Lagrangian is set up: $L' = x_1^2 + x_2^2 - \lambda(x_2 - x_1^2 - 1)$.
* **Case 1 ($\lambda = 0$):** Leads to $x_1=0, x_2=0$, which violates the constraint ($0 - 0 - 1 = -1 \not\ge 0$). This solution is discarded.
* **Case 2 (Boundary, $\lambda \ne 0$):** The solution must lie on $x_2 - x_1^2 - 1 = 0$. Solving the system of equations yields the valid minimum at $x_1=0, x_2=1$.
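
A quick symbolic check of this worked example, sketched with SymPy; it solves the stationarity conditions of the Lagrangian together with the active constraint:

```python
import sympy as sp

x1, x2, lam = sp.symbols("x1 x2 lam", real=True)
L = x1**2 + x2**2                 # objective: squared distance from the origin
g = x2 - x1**2 - 1                # constraint g(x) >= 0 (above the parabola)
Lagr = L - lam * g

# Boundary case: stationarity of the Lagrangian plus the active constraint g = 0.
solutions = sp.solve([sp.diff(Lagr, x1), sp.diff(Lagr, x2), g], [x1, x2, lam], dict=True)
print(solutions)   # the real solution is x1 = 0, x2 = 1, lam = 2 (lam > 0: constraint active)
```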
### 7. Next Steps: Support Vector Machines
The lecture concludes by linking this optimization framework back to the classifier.
* **Support Vectors:** The data points that lie exactly on the margin boundary ($g(x)=0$) are called "Support Vectors".
* **Future Topic:** This foundation leads into the **Support Vector Machine (SVM)** algorithm, which will be discussed in the next session to handle non-linearly separable data.

final/1030.md Normal file

@@ -0,0 +1,125 @@
# Support Vector Machines: Optimization, Dual Problem & Kernel Methods
**Date:** 2025.10.30 and 2025.11.03
**Topic:** SVM Dual Form, Lagrange Multipliers, Kernel Trick, Cover's Theorem, Mercer's Theorem
---
### 1. Introduction to SVM Mathematics
The lecture focuses on the fundamental mathematical concepts behind Support Vector Machines (SVM), specifically the Large Margin Classifier.
* **Goal:** The objective is to understand the flow and connection of formulas rather than memorizing them.
* **Context:** SVMs were the dominant model for a decade before deep learning and remain powerful for specific problem types.
* **Core Concept:** The algorithm seeks to maximize the margin to ensure the most robust classifier.
### 2. General Optimization with Constraints
The lecture reviews and expands on the method of Lagrange multipliers for solving optimization problems with constraints.
* **Problem Setup:** To minimize an objective function $L(x)$ subject to constraints $g(x) \ge 0$, a new objective function (Lagrangian) is defined by combining the original function with the constraints using multipliers ($\lambda$).
* **KKT Conditions:** The Karush-Kuhn-Tucker (KKT) conditions are introduced to solve this. There are two main solution cases:
1. **Feasible Region:** The unconstrained minimum satisfies the constraint. Here, $\lambda = 0$.
2. **Boundary Case:** The solution lies on the boundary where $g(x) = 0$. Here, $\lambda > 0$.
### 3. Multi-Constraint Example
A specific example is provided to demonstrate optimization with multiple constraints.
* **Objective:** Minimize $x_1^2 + x_2^2$ subject to two linear constraints.
* **Lagrangian:** The function is defined as $L'(x) = L(x) - \lambda_1 g_1(x) - \lambda_2 g_2(x)$.
* **Solving Strategy:** With two constraints, there are four possible combinations for $\lambda$ values (both zero, one zero, or both positive).
* The lecture demonstrates testing these cases. For instance, assuming both $\lambda=0$ yields $x_1=0, x_2=0$, which violates the constraints.
* The valid solution is found where the constraints intersect (Boundary Case).
### 4. SVM Mathematical Formulation (Primal Problem)
The lecture applies these optimization principles specifically to the SVM Large Margin Classifier.
* **Objective Function:** Minimize $\frac{1}{2}||w||^2$ (equivalent to maximizing the margin).
* **Constraints:** All data points must be correctly classified outside the margin: $y_i(w^T x_i - b) \ge 1$.
* **Lagrangian Formulation:**
$$L(w, b) = \frac{1}{2}||w||^2 - \sum_{i=1}^{N} \alpha_i [y_i(w^T x_i - b) - 1]$$
Here, $\alpha_i$ represents the Lagrange multipliers.
### 5. Deriving the Dual Problem
To solve this, the partial derivatives of the Lagrangian with respect to the parameters $w$ and $b$ are set to zero.
* **Derivative w.r.t $w$:** Yields the relationship $w = \sum \alpha_i y_i x_i$. This shows $w$ is a linear combination of the data points.
* **Derivative w.r.t $b$:** Yields the constraint $\sum \alpha_i y_i = 0$.
* **Substitution:** By plugging these results back into the original Lagrangian equation, the "Primal" problem is converted into the "Dual" problem.
### 6. The Dual Form and Kernel Intuition
The final derived Dual objective function depends entirely on the dot product of data points.
* **Dual Equation:**
$$\text{Maximize } \sum \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j x_i^T x_j$$
Subject to $\sum \alpha_i y_i = 0$ and $\alpha_i \ge 0$.
* **Primal vs. Dual:**
* **Primal:** Depends on the number of features/parameters ($D$).
* **Dual:** Depends on the number of data points ($N$).
* **Significance:** The term $x_i^T x_j$ represents the inner product between data points. This structure allows for the "Kernel Trick" (discussed below), which handles non-linearly separable data by mapping it to higher dimensions without explicit calculation.
---
### 7. The Dual Form and Inner Products
In the previous section, the **Dual Form** of the SVM optimization problem was derived.
* **Objective Function:** The dual objective function to maximize involves the parameters $\alpha$ and the data points:
$$\sum \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j (x_i^T x_j)$$
* **Key Observation:** The optimization depends solely on the **inner product** ($x_i^T x_j$) between data points. This inner product represents the **similarity** between two vectors, which is the foundational concept for the Kernel Method.
---
### 8. Feature Mapping and Cover's Theorem
When data is not linearly separable in the original space (low-dimensional), we can transform it into a higher-dimensional space where a linear separator exists.
* **Mapping Function ($\Phi$):** We define a transformation rule, or mapping function $\Phi(x)$, that projects input vector $x$ from the original space to a high-dimensional feature space.
* **Example 1 (1D to 2D):** Mapping $x \to (x, x^2)$. A linear line in the 2D space (parabola) can separate classes that were mixed on the 1D line.
* **Example 2 (2D to 3D):** Mapping $x = (x_1, x_2)$ to $\Phi(x) = (x_1^2, x_2^2, \sqrt{2}x_1 x_2)$.
* **Cover's Theorem:** This theorem states that as the dimensionality of the feature space increases, the "power" of the linear method increases, making it more likely to find a linear separator.
* **Strategy:** Apply a mapping function $\Phi$ to the original data, then find a linear classifier in that high-dimensional space.
---
### 9. The Kernel Trick
Directly computing the mapping $\Phi(x)$ can be computationally expensive or impossible (e.g., infinite dimensions). The **Kernel Trick** allows us to compute the similarity in the high-dimensional space using only the original low-dimensional vectors.
* **Definition:** A Kernel function $K(x, y)$ calculates the inner product of the mapped vectors:
$$K(x, y) = \Phi(x)^T \Phi(y)$$
* **Efficiency:** The result is a scalar value calculated without knowing the explicit form of $\Phi$.
* **Derivation Example (Polynomial Kernel):**
For 2D vectors $x$ and $y$, consider the kernel $K(x, y) = (x^T y)^2$.
$$(x^T y)^2 = (x_1 y_1 + x_2 y_2)^2 = x_1^2 y_1^2 + x_2^2 y_2^2 + 2x_1 y_1 x_2 y_2$$
This is mathematically equivalent to the dot product of two mapped vectors where:
$$\Phi(x) = (x_1^2, x_2^2, \sqrt{2}x_1 x_2)$$
Thus, calculating $(x^T y)^2$ in the original space is equivalent to calculating similarity in the 3D space defined by $\Phi$.
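
A tiny numeric check of this equivalence, sketched with NumPy; the two example vectors are arbitrary:

```python
import numpy as np

def phi(v):
    # Explicit 2D -> 3D feature map from the lecture: (v1^2, v2^2, sqrt(2) v1 v2).
    return np.array([v[0]**2, v[1]**2, np.sqrt(2) * v[0] * v[1]])

x = np.array([1.0, 2.0])   # arbitrary example vectors
y = np.array([3.0, -1.0])

kernel_value = (x @ y) ** 2        # K(x, y) = (x^T y)^2, computed in the original 2D space
mapped_value = phi(x) @ phi(y)     # Phi(x)^T Phi(y), computed in the 3D feature space
print(kernel_value, mapped_value)  # both equal (here: (3 - 2)^2 = 1)
```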
---
### 10. Mercer's Theorem & Positive Definite Functions
How do we know if a function $K(x, y)$ is a valid kernel? **Mercer's Theorem** provides the condition.
* **The Theorem:** If a function $K(x, y)$ is **Positive Definite (P.D.)**, then there *always* exists a mapping function $\Phi$ such that $K(x, y) = \Phi(x)^T \Phi(y)$.
* **Implication:** We can choose any P.D. function as our kernel and be guaranteed that it corresponds to some high-dimensional space, without needing to derive $\Phi$ explicitly.
#### **Positive Definiteness (Matrix Definition)**
To check if a kernel is P.D., we analyze the Kernel Matrix (Gram Matrix) constructed from data points.
* A matrix $M$ is P.D. if $z^T M z > 0$ for every non-zero vector $z$.
* **Eigenvalue Condition:** A matrix is P.D. if and only if **all of its eigenvalues are positive**.
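
A small NumPy sketch of this check: build an RBF Gram matrix for a handful of random points (the bandwidth and data are made up) and inspect its eigenvalues:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))            # 5 random 2D data points

# RBF (Gaussian) kernel Gram matrix: K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)).
sigma = 1.0
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq_dists / (2 * sigma**2))

eigvals = np.linalg.eigvalsh(K)        # symmetric matrix -> real eigenvalues
print(eigvals)                         # all positive for distinct points => valid (P.D.) kernel matrix
```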
---
### 11. Infinite Dimensionality (RBF Kernel)
The lecture briefly touches upon the exponential (Gaussian/RBF) kernel.
* The exponential function can be expanded using a Taylor Series into an infinite sum.
* This implies that using an exponential-based kernel is equivalent to mapping the data into an **infinite-dimensional space**.
* Even though the dimension is infinite, the calculation $K(x, y)$ remains a simple scalar operation in the original space.
---
### 12. Final SVM Formulation with Kernels
By applying the Kernel Trick, the SVM formulation is generalized to non-linear problems.
* **Dual Objective:** Replace $x_i^T x_j$ with $K(x_i, x_j)$:
$$\text{Maximize: } \sum \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j K(x_i, x_j)$$
* **Decision Rule:** For a new test point $x'$, the classification is determined by:
$$\sum \alpha_i y_i K(x_i, x') - b \ge 0$$
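A hedged end-to-end sketch using scikit-learn's `SVC` on a made-up radial data set; its decision function has exactly this form, computed over the support vectors only, except that scikit-learn writes the offset as $+b$ rather than $-b$:

```python
import numpy as np
from sklearn.svm import SVC

# Toy non-linearly separable data: the class depends on the radius (hypothetical).
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = (np.linalg.norm(X, axis=1) > 1.0).astype(int)

clf = SVC(kernel="rbf", gamma=1.0, C=1.0).fit(X, y)

# decision_function(x') = sum_i alpha_i y_i K(x_i, x') + b over the support vectors
# (sklearn stores alpha_i y_i in dual_coef_ and the offset in intercept_).
x_new = np.array([[0.2, 0.1]])
print(clf.decision_function(x_new))    # >= 0 -> one class, < 0 -> the other
print("number of support vectors:", clf.support_vectors_.shape[0])
```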
**Next Lecture:** The course will move on to Generative Methods (probabilistic methods).

final/1106.md Normal file

@@ -0,0 +1,92 @@
# Lecture Summary: Generative Methods & Probability Review
**Date:** 2025.11.06
**Topic:** Discriminative vs. Generative Models, Probability Theory, Probabilistic Inference, and Gaussian Distributions.
---
### 1. Classification Approaches: Discriminative vs. Generative
The lecture begins by distinguishing between two fundamental approaches to machine learning classification, specifically for binary problems (labels 0 or 1).
#### **Discriminative Methods (e.g., Logistic Regression)**
* **Goal:** Directly model the decision boundary or the conditional probability $P(y|x)$.
* **Mechanism:** Focuses on distinguishing classes. It learns a function that maps inputs $x$ directly to class labels $y$.
* **Limitation:** It does not model the underlying distribution of the data itself.
#### **Generative Methods**
* **Goal:** Model the joint probability or the class-conditional density $P(x|y)$ and the class prior $P(y)$.
* **Mechanism:** It learns "how the data is generated" for each class.
* **Classification:** To classify a new point, it uses **Bayes' Rule** to invert the probabilities:
$$P(y|x) = \frac{P(x|y)P(y)}{P(x)}$$
* **Advantage:** If you know the generative model, you can solve the classification problem *and* generate new data samples.
---
### 2. Probability Theory Review
To understand Generative Methods, a strong foundation in probability is required.
#### **Random Variables**
* **Definition:** A random variable is technically a **function** (mapping) that assigns a real number to an outcome (event $\omega$) in the sample space $\Omega$.
* **Example:** Tossing a coin 4 times. An event might be "HHTH", and the random variable $X(\omega)$ could be "number of heads" (which equals 3).
#### **Probability vs. Probability Density Function (PDF)**
The lecture emphasizes distinguishing between discrete probability ($P$) and continuous density ($p$).
* **Discrete Probability ($P$):** Defined as the ratio of cardinalities (counts) or areas in discrete sets (e.g., Venn diagrams).
* **Probability Density Function ($p$):** Used for continuous variables.
* **Properties:** $p(x) \ge 0$ for all $x$, and $\int p(x)dx = 1$.
* **Relationship:** The probability of $x$ falling within a range is the **integral** (area under the curve) of the PDF. The probability of a specific point $P(x=x_0)$ is 0.
#### **Key Statistics**
* **Expectation ($E[x]$):** The mean or weighted average of a random variable.
$$E[x] = \int x p(x) dx$$
* **Covariance:** Measures the spread or variance of the data. For vectors, this results in a Covariance Matrix.
$$Cov[x] = E[(x - \mu)(x - \mu)^T]$$
---
### 3. The Trinity of Distributions: Joint, Conditional, and Marginal
Understanding the relationship between these three is crucial for probabilistic modeling.
#### **Joint PDF ($P(x_1, x_2)$)**
* This represents the probability of $x_1$ and $x_2$ occurring together.
* **Importance:** If you know the Joint PDF, you know *everything* about the system. You can derive all other probabilities (marginal, conditional) from it.
#### **Conditional PDF ($P(x_1 | x_2)$)**
* Represents the probability of $x_1$ given that $x_2$ is fixed to a specific value.
* Visually, this is like taking a "slice" of the joint distribution 3D surface at $x_2 = a$.
#### **Marginal PDF ($P(x_1)$)**
* Represents the probability of $x_1$ regardless of $x_2$.
* **Calculation:** You "marginalize out" (integrate or sum) the other variables.
* Continuous: $P(x_1) = \int P(x_1, x_2) dx_2$.
* Discrete: Summing rows or columns in a probability table.
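
A small NumPy sketch of these three objects on a made-up $2 \times 2$ joint probability table (slicing and renormalizing gives a conditional, summing an axis gives a marginal):

```python
import numpy as np

# Hypothetical joint probability table P(x1, x2) for two binary variables.
# Rows index x1 in {0, 1}, columns index x2 in {0, 1}; entries sum to 1.
P_joint = np.array([[0.30, 0.10],
                    [0.20, 0.40]])

P_x1 = P_joint.sum(axis=1)                   # marginal P(x1): sum out x2 (columns)
P_x2 = P_joint.sum(axis=0)                   # marginal P(x2): sum out x1 (rows)
P_x1_given_x2_1 = P_joint[:, 1] / P_x2[1]    # conditional P(x1 | x2 = 1): a normalized "slice"

print(P_x1, P_x2, P_x1_given_x2_1)
```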
---
### 4. Probabilistic Inference
**Inference** is defined as calculating a desired probability (e.g., a prediction) starting from the Joint Probability function using rules like Bayes' theorem and marginalization.
#### **Handling Missing Data**
A major practical benefit of generative models (Joint PDF modeling) over discriminative models (like Logistic Regression) is robust handling of missing data.
* **Scenario:** You have a model predicting disease ($y$) based on Age ($x_1$), Blood Pressure ($x_2$), and Oxygen ($x_3$).
* **Problem:** A patient arrives, but you cannot measure Age ($x_1$). A discriminative model might fail or require value imputation (guessing averages).
* **Probabilistic Solution:** You integrate (marginalize) out the missing variable $x_1$ from the joint distribution to get the probability based only on observed data:
$$P(y | x_2, x_3) = \frac{\int p(x_1, x_2, x_3, y) dx_1}{P(x_2, x_3)}$$.
---
### 5. The Gaussian Distribution
The lecture concludes with a review of the Gaussian (Normal) distribution, the most important function in AI/ML.
* **Univariate Gaussian:** Defined by mean $\mu$ and variance $\sigma^2$.
* **Multivariate Gaussian:** Defined for a vector $x \in R^D$.
$$P(x) = \frac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2} (x-\mu)^T \Sigma^{-1} (x-\mu)\right)$$.
* **Parameters:**
* $\mu$: Mean vector ($D$-dimensional).
* $\Sigma$: Covariance Matrix ($D \times D$). It must be **Symmetric** and **Positive Definite**.
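
A short sketch evaluating this density with SciPy and checking it against the closed-form expression; the 2D parameters below are made up:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Example parameters (made up): a 2D Gaussian with correlated components.
mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])          # symmetric, positive definite

x = np.array([0.5, 0.5])
p_scipy = multivariate_normal(mean=mu, cov=Sigma).pdf(x)

# Same density from the formula above.
D = 2
diff = x - mu
p_manual = np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff) / (
    (2 * np.pi) ** (D / 2) * np.sqrt(np.linalg.det(Sigma)))
print(p_scipy, p_manual)                # the two values agree
```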

final/1110.md Normal file

@@ -0,0 +1,104 @@
# Study Guide: Generative Methods & Multivariate Gaussian Distributions
**Date:** 2025.11.10
**Topic:** Generative vs. Discriminative Models, Multivariate Gaussian Properties, Conditional and Marginal Distributions.
---
### **1. Generative vs. Discriminative Methods**
The lecture begins by contrasting the new topic (Generative Methods) with previous topics (Discriminative Methods like Linear Regression, Logistic Regression, and SVM).
* **Discriminative Methods (Separating):**
* These methods focus on finding a boundary (separating line or hyperplane) between classes.
* **Limitation:** They cannot generate new data samples because they do not model the data distribution; they only know the boundary.
* **Hypothesis:** They assume a linear line or function as the hypothesis to separate data.
* **Generative Methods (Inferring Distribution):**
* **Goal:** To infer the **underlying distribution** (the rule or pattern) from which the data samples were drawn.
* **Assumption:** Data is not random; it follows a specific probabilistic structure (e.g., drawn from a distribution).
* **Capabilities:** Once the Joint Probability Distribution (underlying distribution) is known:
1. **Classification:** Can be performed using Bayes' Rule.
2. **Generation:** New samples can be created that follow the same patterns as the training data (e.g., generating new images or text).
---
### **2. The Gaussian (Normal) Distribution**
The Gaussian distribution is the most popular choice for modeling the "hypothesis" of the underlying distribution in generative models.
#### **Why Gaussian?**
1. **Simplicity:** Defined entirely by two parameters: Mean ($\mu$) and Covariance ($\Sigma$).
2. **Central Limit Theorem:** Sums of independent random events tend to follow a Gaussian distribution.
3. **Mathematical "Closure":** The most critical reason for its use in AI is that **Conditional** and **Marginal** distributions of a Multivariate Gaussian are *also* Gaussian.
#### **Multivariate Gaussian Definition**
For a $D$-dimensional vector $x$:
$$P(x) = \frac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2} (x-\mu)^T \Sigma^{-1} (x-\mu)\right)$$
* $\mu$: Mean vector ($D$-dimensional).
* $\Sigma$: Covariance Matrix ($D \times D$).
[Image of multivariate gaussian distribution 3d plot]
#### **Properties of the Covariance Matrix ($\Sigma$)**
* **Symmetric:** $\Sigma_{ij} = \Sigma_{ji}$.
* **Positive Definite:** All eigenvalues are positive.
* **Diagonal Terms:** Represent the variance of individual variables.
* **Off-Diagonal Terms:** Represent the correlation (covariance) between variables.
* If $\sigma_{12} = 0$, the variables are **independent** (for Gaussians).
* The matrix shape determines the geometry of the distribution contours (spherical vs. elliptical).
---
### **3. Independence and Factorization**
If the Covariance Matrix is **diagonal** (all off-diagonal elements are 0), the variables are independent.
* Mathematically, the inverse matrix $\Sigma^{-1}$ is also diagonal.
* The joint probability factorizes into the product of marginals:
$$P(x_1, x_2) = P(x_1)P(x_2)$$
* The "quadratic form" inside the exponential splits into a sum of separate squared terms.
---
### **4. Conditional Gaussian Distribution**
The lecture derives what happens when we observe a subset of variables (e.g., $x_2$) and want to determine the distribution of the remaining variables ($x_1$). This is $P(x_1 | x_2)$.
* **Concept:** Visually, this is equivalent to "slicing" the joint distribution at a specific value of $x_2$ (fixed constant).
* **Result:** The resulting cross-section is **also a Gaussian distribution**.
* **Parameters:** If we partition $x$, $\mu$, and $\Sigma$ into subsets, the conditional mean ($\mu_{1|2}$) and covariance ($\Sigma_{1|2}$) are given by:
* **Mean:** $\mu_{1|2} = \mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2)$.
* **Covariance:** $\Sigma_{1|2} = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}$.
*(Note: The derivation involves completing the square to identify the Gaussian form).*
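
A minimal sketch of these two formulas for a made-up 2D joint Gaussian; both blocks are scalars here, so the matrix inverses reduce to divisions:

```python
import numpy as np

# Partitioned joint Gaussian over (x1, x2); block values are made up.
mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])

mu1, mu2 = mu[0], mu[1]
S11, S12, S21, S22 = Sigma[0, 0], Sigma[0, 1], Sigma[1, 0], Sigma[1, 1]

x2_observed = 2.0
mu_cond = mu1 + S12 * (1.0 / S22) * (x2_observed - mu2)   # mu_{1|2}
Sigma_cond = S11 - S12 * (1.0 / S22) * S21                # Sigma_{1|2}
print(mu_cond, Sigma_cond)

# The marginal of x1 is even simpler: just take mu1 and Sigma_11.
print(mu1, S11)
```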
---
### **5. Marginal Gaussian Distribution**
The lecture explains how to find the distribution of a subset of variables ($x_1$) by ignoring the others ($x_2$). This is $P(x_1)$.
* **Concept:** This is equivalent to integrating out the unobserved variables:
$$P(x_1) = \int P(x_1, x_2) dx_2$$
* **Result:** The marginal distribution is **also a Gaussian distribution**.
* **Parameters:** Unlike the conditional case, calculating the marginal parameters is trivial. You simply select the corresponding sub-vector and sub-matrix from the joint parameters.
* Mean: $\mu_1$.
* Covariance: $\Sigma_{11}$.
### **Summary Table**
| Distribution | Type | Parameters Derived From Joint $(\mu, \Sigma)$ |
| :--- | :--- | :--- |
| **Joint** $P(x)$ | Gaussian | Given as $\mu, \Sigma$ |
| **Conditional** $P(x_1 \| x_2)$ | Gaussian | Complex formula (involves matrix inversion of $\Sigma_{22}$) |
| **Marginal** $P(x_1)$ | Gaussian | Simple subset (extract $\mu_1$ and $\Sigma_{11}$) |
The lecture concludes by emphasizing that understanding these Gaussian properties is essential for the second half of the semester, as they form the basis for probabilistic generative models.

final/1113.md Normal file

@@ -0,0 +1,85 @@
# Study Guide: Learning in Generative Methods & Bayes Optimal Classifier
**Date:** 2025.11.13
**Topic:** Maximum Likelihood Estimation (MLE), Missing Data Handling, and Bayes Optimal Classifier.
---
### **1. Overview: Learning in Generative Methods**
The fundamental goal of generative methods is to **estimate the underlying distribution** of the data. Unlike discriminative methods (e.g., Logistic Regression, SVM) which focus on finding a separating boundary, generative methods introduce a probabilistic model and learn its parameters.
* **Discriminative Model:** Learns specific parameters (like $w, b$ in linear models) to separate classes.
* **Generative Model:** Learns parameters (like $\mu, \Sigma$ in Gaussian models) that best describe how the data is distributed.
#### **Why Gaussian?**
The Gaussian distribution is the standard model for generative methods because of its mathematical convenience: **both its conditional and marginal distributions are also Gaussian**. This property simplifies probabilistic inference significantly.
[Image of multivariate gaussian distribution 3d plot]
---
### **2. The Learning Process: Parameter Estimation**
"Learning" in this context means finding the best parameters ($\mu, \Sigma$) for the Gaussian model given the training data.
#### **Step 1: Define the Objective Function**
We need a metric to evaluate how well our model fits the data. The core idea is **Likelihood**:
* **Goal:** We want to assign **high probability** to the observed (empirical) data points.
* **Likelihood Function:** For independent data points, the likelihood is the product of their individual probabilities.
$$P(Z | \mu, \Sigma) = \prod_{i=1}^{N} P(z_i | \mu, \Sigma)$$
#### **Step 2: Log-Likelihood (MLE)**
Directly maximizing the product is difficult. We apply the **logarithm** to convert the product into a sum, creating the **Log-Likelihood** function. This does not change the location of the maximum.
* **Objective:** Maximize $\sum_{i=1}^{N} \ln P(z_i | \mu, \Sigma)$.
#### **Step 3: Optimization (Derivation)**
We calculate the partial derivatives of the log-likelihood function with respect to the parameters and set them to zero to find the maximum.
* **Optimal Mean ($\hat{\mu}$):**
The derivation yields the **Empirical Mean**. It is simply the average of the data points.
$$\hat{\mu} = \frac{1}{N} \sum_{i=1}^{N} z_i$$
* **Optimal Covariance ($\hat{\Sigma}$):**
The derivation yields the **Empirical Covariance**.
$$\hat{\Sigma} = \frac{1}{N} \sum_{i=1}^{N} (z_i - \hat{\mu})(z_i - \hat{\mu})^T$$
**Conclusion:** The "learning" for a Gaussian generative model is simply calculating the sample mean and sample covariance of the training data. This is a closed-form solution, meaning no iterative updates are strictly necessary.
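
A short NumPy sketch of this closed-form "learning" step: sample from a known Gaussian (parameters made up) and recover it with the empirical mean and the $\frac{1}{N}$ covariance:

```python
import numpy as np

rng = np.random.default_rng(0)
true_mu = np.array([1.0, -1.0])
true_Sigma = np.array([[1.0, 0.5],
                       [0.5, 2.0]])
Z = rng.multivariate_normal(true_mu, true_Sigma, size=5000)   # N x D data matrix

mu_hat = Z.mean(axis=0)                                  # empirical mean (the MLE)
centered = Z - mu_hat
Sigma_hat = (centered.T @ centered) / Z.shape[0]         # empirical covariance (1/N, the MLE)

print(mu_hat, "\n", Sigma_hat)   # close to the true parameters for large N
```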
---
### **3. Inference: Making Predictions**
Once the joint distribution $P(z)$ (where $z$ contains both input features $x$ and class labels $y$) is learned, we can perform inference.
#### **Classification**
To classify a new data point $x_{new}$:
1. We aim to calculate the conditional probability $P(y | x_{new})$.
2. Using the properties of the multivariate Gaussian, we treat the label $y$ as just another dimension in the random vector.
3. We calculate probabilities for each class and compare them (e.g., $P(y=1 | x)$ vs $P(y=0 | x)$).
#### **Handling Missing Data**
Generative models offer a theoretically robust way to handle missing variables.
* **Scenario:** We have inputs $x = [x_1, x_2]$, but $x_2$ is missing during inference.
* **Method:** **Marginalization**.
1. Start with the Joint PDF.
2. Integrate (marginalize) out the missing variable $x_2$.
$$P(y | x_1) = \frac{\int P(x_1, x_2, y) dx_2}{P(x_1)}$$
3. Because the model is Gaussian, marginalization is trivial: simply select the sub-vector and sub-matrix corresponding to the observed variables.
* This is superior to heuristic methods like imputing the mean.
---
### **4. Bayes Optimal Classifier**
The lecture introduces the concept of the theoretical "perfect" classifier.
* **Definition:** The **Bayes Optimal Classifier** is the ideal classifier that would exist if we knew the *true* underlying distribution of the data.
* **Decision Rule:** It assigns the class with the highest posterior probability $P(C_k | x)$.
$$P_1(x_{new}) \ge P_2(x_{new}) \rightarrow \text{Class 1}$$
#### **Bayes Error**
* Even the optimal classifier has an irreducible error called the **Bayes Error**.
* **Cause:** Classes often overlap in the feature space. In the overlapping regions, even the best decision rule will make mistakes with some probability.
* **Implication:** No machine learning algorithm can genuinely achieve 0% error (100% accuracy) on non-trivial problems. The goal of ML is to approximate the Bayes Error limit.
* **Mathematical Definition:** The error is the integral of the minimum probability density over the overlapping region:
$$\text{Error} = \int \min[P(C_1|x), P(C_2|x)] dx$$
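
A small numeric sketch of this integral for two overlapping 1D class densities with equal priors; the means and variance are illustrative:

```python
import numpy as np
from scipy.stats import norm

# Two overlapping class-conditional densities with equal priors (illustrative values).
p1 = norm(loc=-1.0, scale=1.0)
p2 = norm(loc=+1.0, scale=1.0)
prior = 0.5

xs = np.linspace(-10.0, 10.0, 100001)
integrand = np.minimum(prior * p1.pdf(xs), prior * p2.pdf(xs))
bayes_error = integrand.sum() * (xs[1] - xs[0])   # simple Riemann sum
print(bayes_error)   # ~0.159 here: even the optimal rule errs where the classes overlap
```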

final/1117.md Normal file

@@ -0,0 +1,99 @@
# Study Guide: Bayes Optimal Classifier & Graphical Models (Bayesian Networks)
**Date:** 2025.11.17
**Topic:** Bayes Error, Graphical Models (Directed), Conditional Independence, and D-separation.
---
### **1. Recap: Bayes Optimal Classifier and Bayes Error**
The lecture begins by revisiting the concept of the **Bayes Optimal Classifier**.
* **Decision Rule:** For a new data point $x_{new}$, the classifier compares the posterior probabilities $P(C_1 | x_{new})$ and $P(C_2 | x_{new})$. It assigns the label associated with the higher probability.
* **Optimality:** No other classifier can outperform the Bayes Optimal Classifier. It achieves the theoretical minimum error rate.
#### **Bayes Error (Irreducible Error)**
* **Definition:** Even the perfect classifier will make mistakes because classes often overlap in the feature space. This inevitable error is called the **Bayes Error**.
* **Cause:** It arises from inherent uncertainty, noise, or ambiguity in the data itself, not from the classifier's limitations.
* **Goal of ML:** The objective of any machine learning algorithm is not to achieve 0% error (which is impossible) but to approach the Bayes Error limit as closely as possible.
* **Formula:** The risk (expected error) is the integral of the minimum probability over the domain:
$$R^* = \int \min[P_1(x), P_2(x)] dx$$
If priors are equal, this simplifies to the integral of the overlap region.
---
### **2. Introduction to Graphical Models**
The focus shifts to **Generative Models**, specifically **Graphical Models** (also known as Bayesian Networks).
* **Motivation:**
* A full multivariate Gaussian model requires estimating a **Covariance Matrix** with $D \times D$ elements.
    * The number of parameters grows quadratically, $O(D^2)$: a symmetric $D \times D$ covariance matrix has $\frac{D(D+1)}{2}$ free entries.
* For high-dimensional data (like images with millions of pixels), estimating these parameters requires an enormous amount of data, which is often infeasible.
* **Solution:** Use **Prior Knowledge** to simplify the model. If we know that certain variables are independent, we can set their covariance terms to zero, significantly reducing the number of parameters to learn.
---
### **3. The Chain Rule and Independence**
Graphical models leverage the **Chain Rule of Probability** to decompose a complex joint distribution into simpler conditional probabilities.
* **General Chain Rule:**
$$P(x_1, ..., x_D) = P(x_1) P(x_2|x_1) P(x_3|x_1, x_2) ... P(x_D|x_1...x_{D-1})$$
* **Simplification with Independence:**
If variable $x_3$ depends only on $x_1$ and is independent of $x_2$, then $P(x_3|x_1, x_2)$ simplifies to $P(x_3|x_1)$.
* **Structure:** This creates a **Directed Acyclic Graph (DAG)** (or Bayes Network) where:
* **Nodes** represent random variables.
* **Edges (Arrows)** represent conditional dependencies (causality).
---
### **4. Building a Bayesian Network (Causal Graph)**
The lecture illustrates this with a practical example involving a crying baby.
* **Scenario:** We want to model the causes of a baby crying.
* **Variables:**
* **Cry:** The observable effect.
* **Hungry, Sick, Diaper:** Direct causes of crying.
* **Pororo:** A distractor (e.g., watching a cartoon) that might stop the crying.
* **Dependencies:**
* "Hungry" and "Sick" might be independent of each other generally.
* "Cry" depends on all of them.
* "Pororo" depends on "Cry" (parent turns on TV *because* baby is crying) or affects "Cry".
---
### **5. The Three Canonical Patterns of Independence**
To understand complex graphs, we decompose them into three fundamental 3-node patterns. Understanding these patterns allows us to determine if variables are independent given some evidence.
#### **1. Tail-to-Tail (Common Cause)**
* **Structure:** $X \leftarrow Z \rightarrow Y$ (Z causes both X and Y).
* **Property:** $X$ and $Y$ are dependent. However, if $Z$ is observed (given), $X$ and $Y$ become **independent**.
* **Example:** If $Z$ (Cause) determines both $X$ and $Y$, knowing $Z$ explains the correlation, decoupling $X$ and $Y$.
#### **2. Head-to-Tail (Causal Chain)**
* **Structure:** $X \rightarrow Z \rightarrow Y$ (X causes Z, which causes Y).
* **Property:** $X$ and $Y$ are dependent. If $Z$ is observed, the path is blocked, and $X$ and $Y$ become **independent**.
* **Example:** $X$ influences $Y$ only through $Z$. If $Z$ is fixed, $X$ cannot influence $Y$ further.
#### **3. Head-to-Head (Common Effect / V-Structure)**
* **Structure:** $X \rightarrow Z \leftarrow Y$ (X and Y both cause Z).
* **Property:** **Crucial Difference.** $X$ and $Y$ are naturally **independent** (marginal independence). However, if $Z$ is observed (or a descendant is observed), they become **dependent** ("explaining away").
* **Example:** $X$ (Hungry) $\rightarrow$ $Z$ (Cry) $\leftarrow$ $Y$ (Sick).
* Being hungry tells us nothing about being sick (Independent).
* But if we *know* the baby is crying ($Z$ observed): finding out the baby is Hungry ($X$) makes it less likely they are Sick ($Y$). The causes compete to explain the effect.
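
A brute-force enumeration sketch of this "explaining away" effect; all conditional probability values below are made up for illustration:

```python
import itertools

# Hypothetical CPTs for the v-structure Hungry -> Cry <- Sick (all numbers made up).
P_H = {1: 0.3, 0: 0.7}                     # P(Hungry)
P_S = {1: 0.1, 0: 0.9}                     # P(Sick)
P_C1 = {(1, 1): 0.95, (1, 0): 0.80,        # P(Cry = 1 | Hungry, Sick)
        (0, 1): 0.70, (0, 0): 0.05}

def joint(h, s, c):
    p = P_C1[(h, s)] if c == 1 else 1 - P_C1[(h, s)]
    return P_H[h] * P_S[s] * p

def prob_sick(h_obs=None, c_obs=None):
    """P(Sick = 1 | observed Hungry / Cry), computed by brute-force enumeration."""
    def total(s_values):
        acc = 0.0
        for h, s, c in itertools.product([0, 1], s_values, [0, 1]):
            if h_obs is not None and h != h_obs:
                continue
            if c_obs is not None and c != c_obs:
                continue
            acc += joint(h, s, c)
        return acc
    return total([1]) / total([0, 1])

print(prob_sick())                    # = P(Sick) = 0.1 (independent of Hungry a priori)
print(prob_sick(c_obs=1))             # observing crying raises P(Sick)        (~0.24)
print(prob_sick(c_obs=1, h_obs=1))    # also observing Hungry "explains away"  (~0.12)
```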
---
### **6. D-Separation**
These rules form the basis of **D-separation** (Directed Separation), a formal method to determine conditional independence in any directed graph.
* If all paths between two variables are "blocked" by the evidence set, the variables are D-separated (independent).
* A path is blocked if:
* It contains a chain or fork where the middle node is **observed**.
    * It contains a collider (head-to-head node) where neither the middle node nor any of its descendants is **observed**.

final/1120.md Normal file

@@ -0,0 +1,79 @@
# Lecture Summary: Directed Graphical Models and Naive Bayes
**Date:** 2025.11.20
**Topic:** Parameter Reduction, Directed Graphical Models, Chain Rule, and Naive Bayes Classifier.
---
### **1. Motivation: The Need for Parameter Reduction**
The lecture begins by reviewing Generative Methods using the Gaussian distribution.
* **The Problem:** In high-dimensional settings (e.g., analyzing images or complex biological data), estimating the full Joint Probability Distribution is computationally expensive and data-intensive.
* For a $D$-dimensional Multivariate Gaussian, we must estimate the mean vector $\mu$ ($D$ parameters) and the Covariance Matrix $\Sigma$ (symmetric $D \times D$ matrix).
* The total number of parameters is roughly $O(D^2)$, specifically $D + \frac{D(D+1)}{2}$.
* For large $D$, this requires a massive amount of training data to avoid overfitting.
* **The Solution:** We use **Prior Knowledge** (domain knowledge) about the relationships between variables to reduce the number of parameters.
* By assuming certain variables are independent, we can decompose the complex joint distribution into smaller, simpler conditional distributions.
---
### **2. Directed Graphical Models (Bayesian Networks)**
A Directed Graphical Model represents random variables as nodes in a graph, where edges denote conditional dependencies.
#### **Decomposition via Chain Rule**
* The joint probability $P(x)$ can be decomposed using the chain rule:
$$P(x_1, ..., x_D) = \prod_{i=1}^{D} P(x_i | \text{parents}(x_i))$$
* **Example Structure:**
If we have a graph where $x_1$ has no parents, $x_2$ depends on $x_1$, etc., the joint distribution splits into:
$$P(x) = P(x_1)P(x_2|x_1)P(x_3|x_1)...$$
#### **Parameter Counting Example (Gaussian Case)**
The lecture compares the number of parameters required for a "Full" Gaussian model vs. a "Reduced" Graphical Model.
* **Full Gaussian:** Assumes all variables are correlated.
* For a 10-dimensional vector ($D=10$), parameters = $10 + \frac{10 \times 11}{2} = 65$.
* **Reduced Model:** Uses a graph structure where variables are conditionally independent.
* Instead of one giant covariance matrix, we estimate parameters for several smaller conditional distributions (often univariate Gaussians).
* **Calculation:** For a univariate conditional Gaussian $P(x_i | x_j)$, we need parameters for the linear relationship (mean coefficients) and variance.
    * In the specific example provided, the parameter count drops from 65 to 57. While the reduction in this small example is modest, for high-dimensional data with sparse connections it is drastic.
---
### **3. The Naive Bayes Classifier**
The **Naive Bayes** classifier is the most extreme (and popular) example of a Directed Graphical Model used for parameter reduction.
* **Assumption:** Given the class label $y$, all input features $x_1, ..., x_D$ are **mutually independent**.
* **Structure:** The class $y$ is the parent of all feature nodes $x_i$. There are no connections between the features themselves.
* **Formula:**
$$P(x|y) = P(x_1|y) P(x_2|y) \cdot ... \cdot P(x_D|y) = \prod_{d=1}^{D} P(x_d|y)$$
* **Advantage:** We only need to estimate the distribution of each feature individually, rather than their complex joint interactions.
---
### **4. Application: Spam Classifier**
The lecture applies the Naive Bayes framework to a discrete problem: classifying emails as **Spam ($y=1$)** or **Not Spam ($y=0$)**.
#### **Feature Engineering**
* **Input:** Emails with varying text lengths.
* **Transformation:** A "Bag of Words" approach is used.
1. Create a dictionary of $N$ words (e.g., $N=10,000$).
2. Represent each email as a fixed-length binary vector $x \in \{0, 1\}^{10,000}$.
3. $x_i = 1$ if the $i$-th word appears in the email, $0$ otherwise.
#### **The "Curse of Dimensionality" (Without Naive Bayes)**
* Since the features are discrete (binary), we cannot use Gaussian distributions. We must use probability tables.
* If we tried to model the full joint distribution $P(x_1, ..., x_{10000} | y)$, we would need a probability table for every possible combination of words.
* **Parameter Count:** $2^{10,000}$ entries. This is computationally impossible.
#### **Applying Naive Bayes**
* By assuming word independence given the class, we decompose the problem:
$$P(x|y) \approx \prod_{i=1}^{10,000} P(x_i|y)$$
* **Parameter Estimation:**
* We only need to estimate $P(x_i=1 | y=1)$ and $P(x_i=1 | y=0)$ for each word.
* This requires simply counting the frequency of each word in Spam vs. Non-Spam emails.
* **Reduced Parameter Count:**
* Instead of $2^{10,000}$, we need roughly $2 \times 10,000$ parameters (one probability per word per class).
* This transforms an impossible problem into a highly efficient and simple one.
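
A minimal counting sketch of this estimation step on a made-up bag-of-words matrix; Laplace smoothing (not discussed in the lecture) is added so that no probability is exactly zero:

```python
import numpy as np

# Toy bag-of-words data (made up): 6 emails x 5 dictionary words, binary features.
X = np.array([[1, 1, 0, 0, 0],
              [1, 0, 1, 0, 0],
              [1, 1, 1, 0, 0],
              [0, 0, 0, 1, 1],
              [0, 1, 0, 1, 0],
              [0, 0, 0, 1, 1]])
y = np.array([1, 1, 1, 0, 0, 0])       # 1 = spam, 0 = not spam

def estimate(X, y, c, alpha=1.0):
    """P(x_i = 1 | y = c) for every word i, by counting (with Laplace smoothing alpha)."""
    Xc = X[y == c]
    return (Xc.sum(axis=0) + alpha) / (Xc.shape[0] + 2 * alpha)

theta_spam = estimate(X, y, 1)
theta_ham = estimate(X, y, 0)
prior_spam = (y == 1).mean()

# Classify a new email with the Naive Bayes product (in log space for numerical stability).
x_new = np.array([1, 0, 1, 0, 0])
def log_score(x, theta, prior):
    return np.log(prior) + np.sum(x * np.log(theta) + (1 - x) * np.log(1 - theta))

print(log_score(x_new, theta_spam, prior_spam) > log_score(x_new, theta_ham, 1 - prior_spam))
```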
### **5. Summary**
* **Generative Methods** aim to model the underlying distribution $P(x, y)$.
* **Graphical Models** allow us to inject prior knowledge (independence assumptions) to make this feasible.
* **Naive Bayes** assumes full conditional independence, reducing parameter estimation from exponential to linear complexity, making it ideal for high-dimensional discrete data like text classification.

final/1124.md Normal file

@@ -0,0 +1,85 @@
# Study Guide: Discrete Probability Models & Undirected Graphical Models
**Date:** 2025.11.24
**Topic:** Multinomial Distribution, Maximum Likelihood Estimation (MLE), and Markov Random Fields (Undirected Graphical Models).
---
### **1. Discrete Probability Distributions**
The lecture shifts focus from continuous models (like Gaussian) to discrete models, which are essential for tasks like text classification (e.g., Naive Bayes).
#### **Binomial Distribution**
* **Scenario:** A coin toss (Binary outcome: Head/Tail).
* **Random Variables:** $m_1$ (count of Heads), $m_2$ (count of Tails).
* **Parameters:** Probability of Head ($\mu$) and Tail ($1-\mu$).
* **Formula:** For a sequence of tosses, we consider the number of ways to arrange the outcomes.
$$P(m_1, m_2) = \frac{N!}{m_1!m_2!} \mu^{m_1} (1-\mu)^{m_2}$$
#### **Multinomial Distribution**
* **Scenario:** Rolling a die with $K$ faces (e.g., $K=6$). This generalizes the binomial distribution.
* **Definition:**
* We have $N$ total events (trials).
* We observe counts $m_1, m_2, ..., m_k$ for each of the $K$ possible outcomes.
* Parameters $\mu_1, ..., \mu_k$ represent the probability of each outcome.
* **Probability Mass Function:**
$$P(m_1, ..., m_k | \mu) = \frac{N!}{m_1! ... m_k!} \prod_{k=1}^{K} \mu_k^{m_k}$$
---
### **2. Learning: Maximum Likelihood Estimation (MLE)**
How do we estimate the parameters ($\mu_k$) from data?
* **Goal:** Maximize the likelihood of the observed data subject to the constraint that probabilities sum to 1 ($\sum \mu_k = 1$).
* **Method:** **Lagrange Multipliers**.
1. **Objective:** Maximize Log-Likelihood:
$$L = \ln(N!) - \sum \ln(m_k!) + \sum m_k \ln(\mu_k)$$
2. **Constraint:** $\sum_{k=1}^{K} \mu_k - 1 = 0$.
3. **Lagrangian:**
$$L' = \sum_{k=1}^{K} m_k \ln(\mu_k) + \lambda (\sum_{k=1}^{K} \mu_k - 1)$$
(Note: Constant terms like $N!$ vanish during differentiation).
4. **Derivation:** Taking the derivative w.r.t $\mu_k$ and setting to 0 yields $\mu_k = - \frac{m_k}{\lambda}$. Solving for $\lambda$ using the constraint gives $\lambda = -N$.
* **Result:**
$$\mu_k = \frac{m_k}{N}$$
* The optimal parameter is simply the **empirical fraction** (count of specific events divided by total events).
* This provides the theoretical justification for the simple "counting" method used in the Naive Bayes classifier discussed in previous lectures.
---
### **3. Undirected Graphical Models (Markov Random Fields)**
When causal relationships (direction) are unclear or interactions are symmetric (e.g., neighboring pixels in an image, social network friends), we use **Undirected Graphs** instead of Bayesian Networks (Directed Acyclic Graphs).
#### **Comparison**
* **Directed (Bayesian Network):** Uses conditional probabilities (e.g., $P(A|B)$). Represents causality or asymmetric relationships.
* **Undirected (Markov Random Field - MRF):** Uses "Potential Functions" ($\psi$). Represents correlation or symmetric constraints.
#### **Conditional Independence in MRF**
Determining independence is simpler in undirected graphs than in directed graphs (no D-separation rules needed).
* **Global Markov Property:** Two sets of nodes are conditionally independent given a separating set if all paths between them pass through the separating set.
* *Example:* If nodes $X_1$ and $X_5$ are not directly connected, they are conditionally independent given the intermediate nodes (e.g., $X_3$) that block the path.
---
### **4. Factorization in Undirected Graphs**
Since we cannot use chain rules of conditional probabilities (because $P(A|B) \neq P(B|A)$ generally), we model the joint distribution using **Cliques**.
#### **Cliques and Maximal Cliques**
* **Clique:** A subgraph where every pair of nodes is connected (fully connected).
* **Maximal Clique:** A clique that cannot be expanded by including any other adjacent node.
#### **The Joint Distribution Formula**
We associate a **Potential Function** ($\psi_C$) with each maximal clique $C$.
$$P(x) = \frac{1}{Z} \prod_{C} \psi_C(x_C)$$
* **Potential Function ($\psi$):** A non-negative function that scores the compatibility of variables in a clique. It is *not* a probability (doesn't sum to 1).
* **Partition Function ($Z$):** The normalization constant required to make the total probability sum to 1.
$$Z = \sum_x \prod_{C} \psi_C(x_C)$$
#### **Example Decomposition**
Given a graph with maximal cliques $\{x_1, x_2\}$, $\{x_1, x_3\}$, and $\{x_3, x_4, x_5\}$:
$$P(x) = \frac{1}{Z} \psi_{12}(x_1, x_2) \psi_{13}(x_1, x_3) \psi_{345}(x_3, x_4, x_5)$$
#### **Hammersley-Clifford Theorem**
This theorem provides the theoretical guarantee that a strictly positive distribution can satisfy the conditional independence properties of an undirected graph if and only if it can be factorized over the graph's cliques.

final/1127.md Normal file

@@ -0,0 +1,63 @@
# Study Guide: Undirected Graphical Models (Markov Random Fields)
**Date:** 2025.11.27
**Topic:** Potential Functions, Partition Function, and Conditional Independence in MRFs.
---
### **1. Recap: Decomposition in Undirected Graphs**
Unlike Directed Graphical Models (Bayesian Networks) which use conditional probabilities, **Undirected Graphical Models (Markov Random Fields - MRFs)** cannot directly use probabilities because there is no direction/causality. Instead, they decompose the joint distribution based on **Maximal Cliques**.
* **Cliques:** Subsets of nodes where every node is connected to every other node.
* **Maximal Clique:** A clique that cannot be expanded by adding any adjacent node (in the example graph, the maximal cliques together cover all nodes).
* **Decomposition Rule:** The joint distribution is the product of functions defined over these maximal cliques.
---
### **2. Potential Functions ($\psi$)**
* **Definition:** For each maximal clique $C$, we define a **Potential Function** $\psi_C(x_C)$ (often denoted as $\phi$ or $\psi$).
    * It is a **non-negative function** ($\psi_C(x_C) \ge 0$) mapping the state of the clique variables to a real number.
* It represents the "compatibility" or "energy" of that configuration.
* **Key Distinction:** A potential function is **NOT a probability**. It does not sum to 1. It is just a score (non-negative function).
* *Example:* $\psi_{12}(x_1, x_2)$ scores the interaction between $x_1$ and $x_2$.
---
### **3. The Partition Function ($Z$)**
Since the product of potential functions is not a probability distribution (it doesn't sum to 1), we must normalize it.
* **Definition:** The normalization constant is called the **Partition Function** ($Z$).
$$Z = \sum_{x} \prod_{C} \psi_C(x_C)$$
* **Role:** It ensures that the resulting distribution sums to 1, making it a valid probability distribution.
$$P(x) = \frac{1}{Z} \prod_{C} \psi_C(x_C)$$
* **Calculation:** To find $Z$, we must sum the product of potentials over **all possible states** (combinations) of the random variables. This summation is often computationally expensive.
#### **Example Calculation**
The lecture walks through a simple example with 4 binary variables and two cliques: $\{x_1, x_2, x_3\}$ and $\{x_3, x_4\}$.
* **Step 1:** Define potential tables for $\psi_{123}$ and $\psi_{34}$.
* **Step 2:** Calculate the score for every combination.
* **Step 3:** Sum all scores to get $Z$. In the example, $Z=10$.
* **Step 4:** The probability of any specific state, e.g. $P(1,0,0,0)$, is its unnormalized score (the product of the relevant potential entries, e.g. $1 \times 3$ in the lecture's example) divided by $Z = 10$.
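
A brute-force sketch of this calculation; the potential tables below are hypothetical stand-ins, since the lecture's exact numbers are not reproduced in the notes:

```python
import itertools
import numpy as np

# Hypothetical potential tables for the cliques {x1, x2, x3} and {x3, x4} (binary variables).
psi_123 = np.array([[[1., 2.], [1., 3.]],
                    [[2., 1.], [3., 1.]]])   # psi_123[x1, x2, x3]
psi_34 = np.array([[1., 2.],
                   [3., 1.]])                # psi_34[x3, x4]

def score(x1, x2, x3, x4):
    # Unnormalized score: product of the maximal-clique potentials.
    return psi_123[x1, x2, x3] * psi_34[x3, x4]

# Partition function: sum of the score over all 2^4 states.
Z = sum(score(*state) for state in itertools.product([0, 1], repeat=4))

# Probability of a specific state = its score divided by Z.
print(Z, score(1, 0, 0, 0) / Z)

# Sanity check: the normalized scores sum to 1.
print(sum(score(*s) / Z for s in itertools.product([0, 1], repeat=4)))
```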
---
### **4. Parameter Estimation**
* **Discrete Case:** If variables are discrete (like the email spam example), the parameters are the entries in the potential tables. We estimate these values from data to maximize the likelihood.
* **Continuous Case:** If variables are continuous, potential functions are typically Gaussian distributions. We estimate means and covariances.
* **Reduction:** Just like in Bayesian Networks, using the graph structure reduces the number of parameters.
* *Without Graph:* A full table for 4 binary variables needs $2^4 = 16$ entries.
* *With Graph:* We only need tables for the cliques, significantly reducing complexity.
---
### **5. Verifying Conditional Independence**
The lecture demonstrates analytically that the potential function formulation preserves the conditional independence properties of the graph.
* **Scenario:** Graph with structure $x_1 - x_2 - x_3 - x_4$.
* Is $x_4$ independent of $x_1$ given $x_3$?
* **Analytical Check:**
* We calculate $P(x_4=1 | x_1=0, x_2=1, x_3=1)$.
* We also calculate $P(x_4=1 | x_1=0, x_2=0, x_3=1)$.
* **Result:** The calculation shows that as long as $x_3$ is fixed (given), the value of $x_1$ and $x_2$ cancels out in the probability ratio.
* $P(x_4|x_1, x_2, x_3) = \frac{\phi_1}{\phi_1 + \phi_0}$ (depends only on potentials involving $x_4$ and $x_3$).
* **Conclusion:** This confirms that $x_4 \perp \{x_1, x_2\} | x_3$. The formulation correctly encodes the global Markov property.

final/1201.md Normal file

@@ -0,0 +1,102 @@
# Study Guide: Bayesian Networks & Probabilistic Inference
**Date:** 2025.12.01 (Final Lecture)
**Topic:** Bayesian Networks, Probabilistic Inference Examples, Marginalization.
---
### **1. Recap: Directed vs. Undirected Models**
The lecture begins by briefly contrasting the two types of graphical models discussed:
* **Undirected Graphs (MRF):** Use potential functions ($\psi$) defined on maximal cliques. Requires a normalization constant (partition function $Z$) to become a probability distribution.
* **Directed Graphs (Bayesian Networks):** Use conditional probability distributions (CPDs). The joint distribution is the product of local conditional probabilities.
$$P(X) = \prod_{i} P(x_i | \text{parents}(x_i))$$
---
### **2. Example 1: The "Alarm" Network (Burglary/Earthquake)**
This is a classic example used to demonstrate inference in Bayesian Networks.
#### **Scenario & Structure**
* **Nodes:**
* **B:** Burglary (Parent, no prior causes).
* **E:** Earthquake (Parent, no prior causes).
* **A:** Alarm (Triggered by Burglary or Earthquake).
* **J:** JohnCalls (Triggered by Alarm).
* **M:** MaryCalls (Triggered by Alarm).
* **Dependencies:** $B \rightarrow A \leftarrow E$, $A \rightarrow J$, $A \rightarrow M$.
* **Probabilities (Given):**
* $P(B) = 0.05$, $P(E) = 0.1$.
* $P(A|B, E)$: Table given (e.g., $P(A|B, \neg E) = 0.85$, $P(A|\neg B, \neg E) = 0.05$, etc.).
* $P(J|A) = 0.7$, $P(M|A) = 0.8$.
#### **Task 1: Calculate a Specific Joint Probability**
Calculate the probability of the event: **Burglary, No Earthquake, Alarm rings, John calls, Mary does not call**.
$$P(B, \neg E, A, J, \neg M)$$
* **Decomposition:** Apply the Chain Rule based on the graph structure.
$$= P(B) \cdot P(\neg E) \cdot P(A | B, \neg E) \cdot P(J | A) \cdot P(\neg M | A)$$
* **Calculation:**
$$= 0.05 \times 0.9 \times 0.85 \times 0.7 \times 0.2$$
#### **Task 2: Inference (Conditional Probability)**
Calculate the probability that a **Burglary occurred**, given that **John called** and **Mary did not call**.
$$P(B | J, \neg M)$$
* **Formula (Bayes Rule):**
$$P(B | J, \neg M) = \frac{P(B, J, \neg M)}{P(J, \neg M)}$$
* **Numerator Calculation ($P(B, J, \neg M)$):**
We must **marginalize out** the unknown variables ($A$ and $E$) from the joint distribution.
$$P(B, J, \neg M) = \sum_{A \in \{T,F\}} \sum_{E \in \{T,F\}} P(B, E, A, J, \neg M)$$
This involves summing 4 terms (combinations of A and E).
* **Denominator Calculation ($P(J, \neg M)$):**
We further marginalize out $B$ from the numerator result.
$$P(J, \neg M) = P(B, J, \neg M) + P(\neg B, J, \neg M)$$
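
An enumeration sketch of both tasks; the CPT entries not listed in the notes (marked "assumed") are made up for illustration:

```python
import itertools

# CPTs from the lecture; entries marked "assumed" are not given in the notes.
P_B = 0.05
P_E = 0.10
P_A = {(1, 1): 0.95,              # P(A=1 | B=1, E=1)  (assumed)
       (1, 0): 0.85,              # P(A=1 | B=1, E=0)
       (0, 1): 0.30,              # P(A=1 | B=0, E=1)  (assumed)
       (0, 0): 0.05}              # P(A=1 | B=0, E=0)
P_J_given_A = {1: 0.7, 0: 0.05}   # P(J=1 | A); the A=0 entry is assumed
P_M_given_A = {1: 0.8, 0: 0.01}   # P(M=1 | A); the A=0 entry is assumed

def bern(p, value):
    return p if value == 1 else 1 - p

def joint(b, e, a, j, m):
    # Chain-rule decomposition along the graph structure.
    return (bern(P_B, b) * bern(P_E, e) * bern(P_A[(b, e)], a)
            * bern(P_J_given_A[a], j) * bern(P_M_given_A[a], m))

# Task 1: one fully specified state.
print(joint(b=1, e=0, a=1, j=1, m=0))   # = 0.05 * 0.9 * 0.85 * 0.7 * 0.2

# Task 2: P(B=1 | J=1, M=0) -- marginalize the hidden variables A and E, then normalize.
def p_b_j_notm(b):
    return sum(joint(b, e, a, 1, 0) for e, a in itertools.product([0, 1], repeat=2))

numerator = p_b_j_notm(1)
denominator = numerator + p_b_j_notm(0)
print(numerator / denominator)
```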
---
### **3. Example 2: 4-Node Tree Structure**
A simpler example to demonstrate how sums simplify during marginalization.
#### **Scenario & Structure**
* **Nodes:** $X_1, X_2, X_3, X_4 \in \{0, 1\}$ (Binary).
* **Dependencies:**
* $X_1 \rightarrow X_2$
* $X_2 \rightarrow X_3$
* $X_2 \rightarrow X_4$
* **Decomposition:** $P(X) = P(X_1)P(X_2|X_1)P(X_3|X_2)P(X_4|X_2)$.
* **Given Tables:** Probabilities for all priors and conditionals are provided.
#### **Task: Calculate Marginal Probability $P(X_3 = 1)$**
We need to find the probability of $X_3=1$ regardless of the other variables.
* **Definition:** Sum the joint probability over all other variables ($X_1, X_2, X_4$).
$$P(X_3=1) = \sum_{x_1} \sum_{x_2} \sum_{x_4} P(x_1, x_2, x_3=1, x_4)$$
* **Step 1: Expand using Graph Structure**
$$= \sum_{x_1} \sum_{x_2} \sum_{x_4} P(x_1)P(x_2|x_1)P(X_3=1|x_2)P(x_4|x_2)$$
* **Step 2: Simplify (Key Insight)**
Move the summation signs to push them as far right as possible. The sum over $x_4$ only affects the last term $P(x_4|x_2)$.
$$= \sum_{x_1} \sum_{x_2} P(x_1)P(x_2|x_1)P(X_3=1|x_2) \left[ \sum_{x_4} P(x_4|x_2) \right]$$
* **Property:** $\sum_{x_4} P(x_4|x_2) = 1$ (Sum of probabilities for a variable given a condition is always 1).
    * Therefore, the $X_4$ term vanishes. This matches intuition: $X_4$ is a leaf node on a different branch from $X_3$, so when nothing about it is observed, summing it out contributes nothing once $X_2$ is accounted for.
* **Step 3: Final Calculation**
We are left with summing over $X_1$ and $X_2$:
$$= \sum_{x_1} \sum_{x_2} P(x_1)P(x_2|x_1)P(X_3=1|x_2)$$
This expands to 4 terms (combinations of $x_1 \in \{0,1\}$ and $x_2 \in \{0,1\}$).
---
### **4. Semester Summary & Conclusion**
The lecture concludes the semester's material.
* **Key Themes Covered:**
* **Discriminative vs. Generative Methods:** The fundamental difference in approach (boundary vs. distribution).
* **Objective Functions:** Designing Loss functions vs. Likelihood functions.
* **Optimization:** Parameter estimation via derivatives (MLE).
* **Graphical Models:** Reducing parameter complexity using independence assumptions (Bayes Nets, MRFs).
* **Final Exam:** Scheduled for Thursday, December 11th. It will cover the concepts discussed, focusing on understanding the fundamentals (e.g., Likelihood, Generative principles) rather than rote memorization.

BIN
final/AI_Lecture_note_1027.pdf (Stored with Git LFS) Normal file

Binary file not shown.

BIN
final/AI_Lecture_note_1030.pdf (Stored with Git LFS) Normal file

Binary file not shown.

BIN
final/AI_Lecture_note_1103.pdf (Stored with Git LFS) Normal file

Binary file not shown.

BIN
final/AI_Lecture_note_1106.pdf (Stored with Git LFS) Normal file

Binary file not shown.

BIN
final/AI_Lecture_note_1110.pdf (Stored with Git LFS) Normal file

Binary file not shown.

BIN
final/AI_Lecture_note_1113.pdf (Stored with Git LFS) Normal file

Binary file not shown.

BIN
final/AI_Lecture_note_1117.pdf (Stored with Git LFS) Normal file

Binary file not shown.

BIN
final/AI_Lecture_note_1120.pdf (Stored with Git LFS) Normal file

Binary file not shown.

BIN
final/AI_Lecture_note_1124.pdf (Stored with Git LFS) Normal file

Binary file not shown.

BIN
final/AI_Lecture_note_1127.pdf (Stored with Git LFS) Normal file

Binary file not shown.

BIN
final/AI_Lecture_note_1201.pdf (Stored with Git LFS) Normal file

Binary file not shown.