diff --git a/final/1027.md b/final/1027.md new file mode 100644 index 0000000..54d0781 --- /dev/null +++ b/final/1027.md @@ -0,0 +1,68 @@ +# Large Margin Classifiers and Optimization + +**Date:** 2025.10.27 +**Topic:** Large Margin Classifiers, Optimization, Margin Definition + +--- + +### 1. Introduction to Robust Classification +The lecture begins by shifting focus from generative methods to discriminative methods, specifically within a **linearly separable setting**. +* **Problem Setting:** The goal is to classify data that can be perfectly separated by a linear boundary (hyperplane). +* **Robustness:** While infinite linear classifiers may separate the data, the objective is to find the "best" one. The best classifier is defined as the one that is most **robust**, meaning it generalizes well to new test data and handles potential outliers effectively. +* **Intuition:** A robust classifier places the decision boundary in the middle of the gap between classes, maximizing the distance to the nearest data points. + +### 2. Defining the Margin +The concept of the **margin** is introduced to mathematically define robustness. +* **Definition:** The margin is the distance between the decision hyperplane and the closest data points. +* **Hyperplane Equation:** The decision boundary is defined as $w^T x - b = 0$. +* **Support Lines:** To define the margin, we establish two parallel lines passing through the closest data points: + * $w^T x - b = 1$ (for class +1). + * $w^T x - b = -1$ (for class -1). + * The region between these lines contains no data points. + +### 3. Calculating the Margin Width +The lecture derives the mathematical expression for the margin width using vector projection. +* **Vector Projection:** The margin is calculated by projecting the vector connecting a point on the boundary ($x_0$) to a support vector ($x$) onto the normal vector $w$. +* **Derivation:** + * The distance is the projection of vector $(x - x_0)$ onto the unit normal vector $\frac{w}{||w||}$. + * Using the constraint $w^T x - b = 1$ and $w^T x_0 - b = 0$, the derived margin distance is $\frac{1}{||w||}$. +* **Conclusion:** Maximizing the margin is equivalent to **minimizing the norm of the weight vector $||w||$**. + +### 4. The Optimization Problem +The task of finding the best classifier is formulated as a constrained optimization problem. + +* **Objective Function:** + $$\min ||w||^2$$ + (Note: Minimizing $||w||$ is computationally equivalent to minimizing $||w||^2$) + +* **Constraints:** All data points must be correctly classified and lie outside the margin. This is formalized as: + * $w^T x_i - b \ge 1$ for $y_i = 1$. + * $w^T x_i - b \le -1$ for $y_i = -1$. + * **Combined Constraint:** $y_i (w^T x_i - b) \ge 1$ for all $i$. + +### 5. Optimization with Constraints (Lagrange Multipliers) +The lecture explains how to solve this optimization problem using **Lagrange Multipliers**, using a general example first. + +* **Problem Setup:** Minimize an objective function $L(x)$ subject to a constraint $g(x) \ge 0$. +* **Lagrangian:** A new objective function is defined by combining the original loss and the constraint with a multiplier $\lambda$: + $$L'(x) = L(x) - \lambda g(x)$$ + (Note: The transcript discusses combining components; the sign depends on the specific maximization/minimization formulation) + +* **Solution Cases:** + The solution involves taking the derivative $\frac{dL'}{dx} = 0$ and considering two cases: + 1. 
**Feasible Region ($\lambda = 0$):** The unconstrained minimum of $L(x)$ naturally satisfies the constraint ($g(x) > 0$). In this case, the constraint is inactive. + 2. **Boundary Case ($\lambda > 0$):** The unconstrained minimum violates the constraint. Therefore, the optimal solution lies *on* the boundary where $g(x) = 0$. + +### 6. Example: Constrained Minimization +A specific mathematical example is worked through to demonstrate the method. +* **Objective:** Minimize $x_1^2 + x_2^2$ (distance from origin). +* **Constraint:** $x_2 - x_1^2 - 1 \ge 0$ (must be above a parabola). +* **Solving:** + * The Lagrangian is set up: $L' = x_1^2 + x_2^2 - \lambda(x_2 - x_1^2 - 1)$. + * **Case 1 ($\lambda = 0$):** Leads to $x_1=0, x_2=0$, which violates the constraint ($0 - 0 - 1 = -1 \not\ge 0$). This solution is discarded. + * **Case 2 (Boundary, $\lambda \ne 0$):** The solution must lie on $x_2 - x_1^2 - 1 = 0$. Solving the system of equations yields the valid minimum at $x_1=0, x_2=1$. + +### 7. Next Steps: Support Vector Machines +The lecture concludes by linking this optimization framework back to the classifier. +* **Support Vectors:** The data points that lie exactly on the margin boundary ($g(x)=0$) are called "Support Vectors". +* **Future Topic:** This foundation leads into the **Support Vector Machine (SVM)** algorithm, which will be discussed in the next session to handle non-linearly separable data. diff --git a/final/1030.md b/final/1030.md new file mode 100644 index 0000000..471b85d --- /dev/null +++ b/final/1030.md @@ -0,0 +1,125 @@ +# Support Vector Machines: Optimization, Dual Problem & Kernel Methods + +**Date:** 2025.10.30 and 2025.11.03 +**Topic:** SVM Dual Form, Lagrange Multipliers, Kernel Trick, Cover's Theorem, Mercer's Theorem + +--- + +### 1. Introduction to SVM Mathematics +The lecture focuses on the fundamental mathematical concepts behind Support Vector Machines (SVM), specifically the Large Margin Classifier. +* **Goal:** The objective is to understand the flow and connection of formulas rather than memorizing them. +* **Context:** SVMs were the dominant model for a decade before deep learning and remain powerful for specific problem types. +* **Core Concept:** The algorithm seeks to maximize the margin to ensure the most robust classifier. + +### 2. General Optimization with Constraints +The lecture reviews and expands on the method of Lagrange multipliers for solving optimization problems with constraints. +* **Problem Setup:** To minimize an objective function $L(x)$ subject to constraints $g(x) \ge 0$, a new objective function (Lagrangian) is defined by combining the original function with the constraints using multipliers ($\lambda$). +* **KKT Conditions:** The Karush-Kuhn-Tucker (KKT) conditions are introduced to solve this. There are two main solution cases: + 1. **Feasible Region:** The unconstrained minimum satisfies the constraint. Here, $\lambda = 0$. + 2. **Boundary Case:** The solution lies on the boundary where $g(x) = 0$. Here, $\lambda > 0$. + +### 3. Multi-Constraint Example +A specific example is provided to demonstrate optimization with multiple constraints. +* **Objective:** Minimize $x_1^2 + x_2^2$ subject to two linear constraints. +* **Lagrangian:** The function is defined as $L'(x) = L(x) - \lambda_1 g_1(x) - \lambda_2 g_2(x)$. +* **Solving Strategy:** With two constraints, there are four possible combinations for $\lambda$ values (both zero, one zero, or both positive). + * The lecture demonstrates testing these cases. 
For instance, assuming both $\lambda=0$ yields $x_1=0, x_2=0$, which violates the constraints. + * The valid solution is found where the constraints intersect (Boundary Case). + +### 4. SVM Mathematical Formulation (Primal Problem) +The lecture applies these optimization principles specifically to the SVM Large Margin Classifier. +* **Objective Function:** Minimize $\frac{1}{2}||w||^2$ (equivalent to maximizing the margin). +* **Constraints:** All data points must be correctly classified outside the margin: $y_i(w^T x_i - b) \ge 1$. +* **Lagrangian Formulation:** + $$L(w, b) = \frac{1}{2}||w||^2 - \sum_{i=1}^{N} \alpha_i [y_i(w^T x_i - b) - 1]$$ + Here, $\alpha_i$ represents the Lagrange multipliers. + +### 5. Deriving the Dual Problem +To solve this, the Partial Derivatives with respect to the parameters $w$ and $b$ are set to zero. +* **Derivative w.r.t $w$:** Yields the relationship $w = \sum \alpha_i y_i x_i$. This shows $w$ is a linear combination of the data points. +* **Derivative w.r.t $b$:** Yields the constraint $\sum \alpha_i y_i = 0$. +* **Substitution:** By plugging these results back into the original Lagrangian equation, the "Primal" problem is converted into the "Dual" problem. + +### 6. The Dual Form and Kernel Intuition +The final derived Dual objective function depends entirely on the dot product of data points. +* **Dual Equation:** + $$\text{Maximize } \sum \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j x_i^T x_j$$ + Subject to $\sum \alpha_i y_i = 0$ and $\alpha_i \ge 0$. +* **Primal vs. Dual:** + * **Primal:** Depends on the number of features/parameters ($D$). + * **Dual:** Depends on the number of data points ($N$). +* **Significance:** The term $x_i^T x_j$ represents the inner product between data points. This structure allows for the "Kernel Trick" (discussed below), which handles non-linearly separable data by mapping it to higher dimensions without explicit calculation. + +--- + +### 7. The Dual Form and Inner Products +In the previous section, the **Dual Form** of the SVM optimization problem was derived. +* **Objective Function:** The dual objective function to maximize involves the parameters $\alpha$ and the data points: + $$\sum \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j (x_i^T x_j)$$ + +* **Key Observation:** The optimization depends solely on the **inner product** ($x_i^T x_j$) between data points. This inner product represents the **similarity** between two vectors, which is the foundational concept for the Kernel Method. + +--- + +### 8. Feature Mapping and Cover's Theorem +When data is not linearly separable in the original space (low-dimensional), we can transform it into a higher-dimensional space where a linear separator exists. + +* **Mapping Function ($\Phi$):** We define a transformation rule, or mapping function $\Phi(x)$, that projects input vector $x$ from the original space to a high-dimensional feature space. + * **Example 1 (1D to 2D):** Mapping $x \to (x, x^2)$. A linear line in the 2D space (parabola) can separate classes that were mixed on the 1D line. + * **Example 2 (2D to 3D):** Mapping $x = (x_1, x_2)$ to $\Phi(x) = (x_1^2, x_2^2, \sqrt{2}x_1 x_2)$. + + + +* **Cover's Theorem:** This theorem states that as the dimensionality of the feature space increases, the "power" of the linear method increases, making it more likely to find a linear separator. + * **Strategy:** Apply a mapping function $\Phi$ to the original data, then find a linear classifier in that high-dimensional space. + +--- + +### 9. 
The Kernel Trick +Directly computing the mapping $\Phi(x)$ can be computationally expensive or impossible (e.g., infinite dimensions). The **Kernel Trick** allows us to compute the similarity in the high-dimensional space using only the original low-dimensional vectors. + +* **Definition:** A Kernel function $K(x, y)$ calculates the inner product of the mapped vectors: + $$K(x, y) = \Phi(x)^T \Phi(y)$$ + +* **Efficiency:** The result is a scalar value calculated without knowing the explicit form of $\Phi$. + +* **Derivation Example (Polynomial Kernel):** + For 2D vectors $x$ and $y$, consider the kernel $K(x, y) = (x^T y)^2$. + $$(x^T y)^2 = (x_1 y_1 + x_2 y_2)^2 = x_1^2 y_1^2 + x_2^2 y_2^2 + 2x_1 y_1 x_2 y_2$$ + This is mathematically equivalent to the dot product of two mapped vectors where: + $$\Phi(x) = (x_1^2, x_2^2, \sqrt{2}x_1 x_2)$$ + Thus, calculating $(x^T y)^2$ in the original space is equivalent to calculating similarity in the 3D space defined by $\Phi$. + +--- + +### 10. Mercer's Theorem & Positive Definite Functions +How do we know if a function $K(x, y)$ is a valid kernel? **Mercer's Theorem** provides the condition. + +* **The Theorem:** If a function $K(x, y)$ is **Positive Definite (P.D.)**, then there *always* exists a mapping function $\Phi$ such that $K(x, y) = \Phi(x)^T \Phi(y)$. +* **Implication:** We can choose any P.D. function as our kernel and be guaranteed that it corresponds to some high-dimensional space, without needing to derive $\Phi$ explicitly. + +#### **Positive Definiteness (Matrix Definition)** +To check if a kernel is P.D., we analyze the Kernel Matrix (Gram Matrix) constructed from data points. +* For any non-zero vector $z$, a matrix $M$ is P.D. if $z^T M z > 0$ for all $z$. +* **Eigenvalue Condition:** A matrix is P.D. if and only if **all of its eigenvalues are positive**. + +--- + +### 11. Infinite Dimensionality (RBF Kernel) +The lecture briefly touches upon the exponential (Gaussian/RBF) kernel. +* The exponential function can be expanded using a Taylor Series into an infinite sum. +* This implies that using an exponential-based kernel is equivalent to mapping the data into an **infinite-dimensional space**. +* Even though the dimension is infinite, the calculation $K(x, y)$ remains a simple scalar operation in the original space. + +--- + +### 12. Final SVM Formulation with Kernels +By applying the Kernel Trick, the SVM formulation is generalized to non-linear problems. + +* **Dual Objective:** Replace $x_i^T x_j$ with $K(x_i, x_j)$: + $$\text{Maximize: } \sum \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j K(x_i, x_j)$$ + +* **Decision Rule:** For a new test point $x'$, the classification is determined by: + $$\sum \alpha_i y_i K(x_i, x') - b \ge 0$$ + +**Next Lecture:** The course will move on to Generative Methods (probability methods). diff --git a/final/1106.md b/final/1106.md new file mode 100644 index 0000000..4bf4760 --- /dev/null +++ b/final/1106.md @@ -0,0 +1,92 @@ +# Lecture Summary: Generative Methods & Probability Review + +**Date:** 2025.11.06 +**Topic:** Discriminative vs. Generative Models, Probability Theory, Probabilistic Inference, and Gaussian Distributions. + +--- + +### 1. Classification Approaches: Discriminative vs. Generative + +The lecture begins by distinguishing between two fundamental approaches to machine learning classification, specifically for binary problems (labels 0 or 1). 
+ +#### **Discriminative Methods (e.g., Logistic Regression)** +* **Goal:** Directly model the decision boundary or the conditional probability $P(y|x)$. +* **Mechanism:** Focuses on distinguishing classes. It learns a function that maps inputs $x$ directly to class labels $y$. +* **Limitation:** It does not model the underlying distribution of the data itself. + +#### **Generative Methods** +* **Goal:** Model the joint probability or the class-conditional density $P(x|y)$ and the class prior $P(y)$. +* **Mechanism:** It learns "how the data is generated" for each class. +* **Classification:** To classify a new point, it uses **Bayes' Rule** to invert the probabilities: + $$P(y|x) = \frac{P(x|y)P(y)}{P(x)}$$ +* **Advantage:** If you know the generative model, you can solve the classification problem *and* generate new data samples. + +--- + +### 2. Probability Theory Review + +To understand Generative Methods, a strong foundation in probability is required. + +#### **Random Variables** +* **Definition:** A random variable is technically a **function** (mapping) that assigns a real number to an outcome (event $\omega$) in the sample space $\Omega$. +* **Example:** Tossing a coin 4 times. An event might be "HHTH", and the random variable $X(\omega)$ could be "number of heads" (which equals 3). + +#### **Probability vs. Probability Density Function (PDF)** +The lecture emphasizes distinguishing between discrete probability ($P$) and continuous density ($p$). + +* **Discrete Probability ($P$):** Defined as the ratio of cardinalities (counts) or areas in discrete sets (e.g., Venn diagrams). + * **Probability Density Function ($p$):** Used for continuous variables. + * **Properties:** $p(x) \ge 0$ for all $x$, and $\int p(x)dx = 1$. + * **Relationship:** The probability of $x$ falling within a range is the **integral** (area under the curve) of the PDF. The probability of a specific point $P(x=x_0)$ is 0. + +#### **Key Statistics** +* **Expectation ($E[x]$):** The mean or weighted average of a random variable. + $$E[x] = \int x p(x) dx$$ +* **Covariance:** Measures the spread or variance of the data. For vectors, this results in a Covariance Matrix. + $$Cov[x] = E[(x - \mu)(x - \mu)^T]$$ + +--- + +### 3. The Trinity of Distributions: Joint, Conditional, and Marginal + +Understanding the relationship between these three is crucial for probabilistic modeling. + +#### **Joint PDF ($P(x_1, x_2)$)** +* This represents the probability of $x_1$ and $x_2$ occurring together. +* **Importance:** If you know the Joint PDF, you know *everything* about the system. You can derive all other probabilities (marginal, conditional) from it. + +#### **Conditional PDF ($P(x_1 | x_2)$)** +* Represents the probability of $x_1$ given that $x_2$ is fixed to a specific value. +* Visually, this is like taking a "slice" of the joint distribution 3D surface at $x_2 = a$. + +#### **Marginal PDF ($P(x_1)$)** +* Represents the probability of $x_1$ regardless of $x_2$. +* **Calculation:** You "marginalize out" (integrate or sum) the other variables. + * Continuous: $P(x_1) = \int P(x_1, x_2) dx_2$. + * Discrete: Summing rows or columns in a probability table. + +--- + +### 4. Probabilistic Inference + +**Inference** is defined as calculating a desired probability (e.g., a prediction) starting from the Joint Probability function using rules like Bayes' theorem and marginalization. 
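+
+As a small, concrete illustration of these rules, the sketch below builds a toy discrete joint table and recovers a marginal and a conditional from it. The table values are hypothetical (not from the lecture), and numpy is assumed purely for convenience.
+
+```python
+import numpy as np
+
+# Hypothetical joint distribution P(x1, x2) over two binary variables,
+# stored so that joint[a, b] = P(x1 = a, x2 = b). Entries sum to 1.
+joint = np.array([[0.30, 0.10],
+                  [0.20, 0.40]])
+
+# Marginal P(x1): "marginalize out" x2 by summing over its values.
+p_x1 = joint.sum(axis=1)                     # -> [0.40, 0.60]
+
+# Conditional P(x2 | x1 = 1): slice the joint at x1 = 1 and renormalize.
+p_x2_given_x1_is_1 = joint[1, :] / p_x1[1]   # -> [0.333..., 0.666...]
+
+print("P(x1)        :", p_x1)
+print("P(x2 | x1=1) :", p_x2_given_x1_is_1)
+```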
+ +#### **Handling Missing Data** +A major practical benefit of generative models (Joint PDF modeling) over discriminative models (like Logistic Regression) is robust handling of missing data. +* **Scenario:** You have a model predicting disease ($y$) based on Age ($x_1$), Blood Pressure ($x_2$), and Oxygen ($x_3$). +* **Problem:** A patient arrives, but you cannot measure Age ($x_1$). A discriminative model might fail or require value imputation (guessing averages). +* **Probabilistic Solution:** You integrate (marginalize) out the missing variable $x_1$ from the joint distribution to get the probability based only on observed data: + $$P(y | x_2, x_3) = \frac{\int p(x_1, x_2, x_3, y) dx_1}{P(x_2, x_3)}$$. + +--- + +### 5. The Gaussian Distribution + +The lecture concludes with a review of the Gaussian (Normal) distribution, the most important function in AI/ML. + +* **Univariate Gaussian:** Defined by mean $\mu$ and variance $\sigma^2$. +* **Multivariate Gaussian:** Defined for a vector $x \in R^D$. + $$P(x) = \frac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2} (x-\mu)^T \Sigma^{-1} (x-\mu)\right)$$. +* **Parameters:** + * $\mu$: Mean vector ($D$-dimensional). + * $\Sigma$: Covariance Matrix ($D \times D$). It must be **Symmetric** and **Positive Definite**. diff --git a/final/1110.md b/final/1110.md new file mode 100644 index 0000000..2969aa5 --- /dev/null +++ b/final/1110.md @@ -0,0 +1,104 @@ +# Study Guide: Generative Methods & Multivariate Gaussian Distributions + +**Date:** 2025.12.01 +**Topic:** Generative vs. Discriminative Models, Multivariate Gaussian Properties, Conditional and Marginal Distributions. + +--- + +### **1. Generative vs. Discriminative Methods** + +The lecture begins by contrasting the new topic (Generative Methods) with previous topics (Discriminative Methods like Linear Regression, Logistic Regression, and SVM). + +* **Discriminative Methods (Separating):** + * These methods focus on finding a boundary (separating line or hyperplane) between classes. + * **Limitation:** They cannot generate new data samples because they do not model the data distribution; they only know the boundary. + * **Hypothesis:** They assume a linear line or function as the hypothesis to separate data. + +* **Generative Methods (Inferring Distribution):** + * **Goal:** To infer the **underlying distribution** (the rule or pattern) from which the data samples were drawn. + * **Assumption:** Data is not random; it follows a specific probabilistic structure (e.g., drawn from a distribution). + * **Capabilities:** Once the Joint Probability Distribution (underlying distribution) is known: + 1. **Classification:** Can be performed using Bayes' Rule. + 2. **Generation:** New samples can be created that follow the same patterns as the training data (e.g., generating new images or text). + + + +--- + +### **2. The Gaussian (Normal) Distribution** + +The Gaussian distribution is the most popular choice for modeling the "hypothesis" of the underlying distribution in generative models. + +#### **Why Gaussian?** +1. **Simplicity:** Defined entirely by two parameters: Mean ($\mu$) and Covariance ($\Sigma$). +2. **Central Limit Theorem:** Sums of independent random events tend to follow a Gaussian distribution. +3. **Mathematical "Closure":** The most critical reason for its use in AI is that **Conditional** and **Marginal** distributions of a Multivariate Gaussian are *also* Gaussian. 
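+
+As a quick sanity check of the Central Limit Theorem point above, the following sketch sums independent uniform draws and compares the empirical mean and variance of the sum against the Gaussian limit. The choice of uniform summands and the sample sizes are arbitrary; numpy is assumed.
+
+```python
+import numpy as np
+
+rng = np.random.default_rng(0)
+
+# Sum many independent Uniform(0, 1) draws; by the CLT the sum is ~ Gaussian.
+n_terms, n_samples = 30, 100_000
+sums = rng.uniform(0.0, 1.0, size=(n_samples, n_terms)).sum(axis=1)
+
+# Each Uniform(0, 1) term has mean 1/2 and variance 1/12, so the limiting
+# Gaussian has mean n_terms/2 and variance n_terms/12.
+print("empirical mean:", sums.mean(), " theoretical:", n_terms * 0.5)
+print("empirical var :", sums.var(),  " theoretical:", n_terms / 12)
+```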
+ +#### **Multivariate Gaussian Definition** +For a $D$-dimensional vector $x$: +$$P(x) = \frac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2} (x-\mu)^T \Sigma^{-1} (x-\mu)\right)$$ +* $\mu$: Mean vector ($D$-dimensional). +* $\Sigma$: Covariance Matrix ($D \times D$). + + + +[Image of multivariate gaussian distribution 3d plot] + + +#### **Properties of the Covariance Matrix ($\Sigma$)** +* **Symmetric:** $\Sigma_{ij} = \Sigma_{ji}$. +* **Positive Definite:** All eigenvalues are positive. +* **Diagonal Terms:** Represent the variance of individual variables. +* **Off-Diagonal Terms:** Represent the correlation (covariance) between variables. + * If $\sigma_{12} = 0$, the variables are **independent** (for Gaussians). + * The matrix shape determines the geometry of the distribution contours (spherical vs. elliptical). + +--- + +### **3. Independence and Factorization** + +If the Covariance Matrix is **diagonal** (all off-diagonal elements are 0), the variables are independent. +* Mathematically, the inverse matrix $\Sigma^{-1}$ is also diagonal. +* The joint probability factorizes into the product of marginals: + $$P(x_1, x_2) = P(x_1)P(x_2)$$ +* The "quadratic form" inside the exponential splits into a sum of separate squared terms. + +--- + +### **4. Conditional Gaussian Distribution** + +The lecture derives what happens when we observe a subset of variables (e.g., $x_2$) and want to determine the distribution of the remaining variables ($x_1$). This is $P(x_1 | x_2)$. + +* **Concept:** Visually, this is equivalent to "slicing" the joint distribution at a specific value of $x_2$ (fixed constant). +* **Result:** The resulting cross-section is **also a Gaussian distribution**. +* **Parameters:** If we partition $x$, $\mu$, and $\Sigma$ into subsets, the conditional mean ($\mu_{1|2}$) and covariance ($\Sigma_{1|2}$) are given by: + * **Mean:** $\mu_{1|2} = \mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2)$. + * **Covariance:** $\Sigma_{1|2} = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}$. + *(Note: The derivation involves completing the square to identify the Gaussian form).* + + + +--- + +### **5. Marginal Gaussian Distribution** + +The lecture explains how to find the distribution of a subset of variables ($x_1$) by ignoring the others ($x_2$). This is $P(x_1)$. + +* **Concept:** This is equivalent to integrating out the unobserved variables: + $$P(x_1) = \int P(x_1, x_2) dx_2$$ +* **Result:** The marginal distribution is **also a Gaussian distribution**. +* **Parameters:** Unlike the conditional case, calculating the marginal parameters is trivial. You simply select the corresponding sub-vector and sub-matrix from the joint parameters. + * Mean: $\mu_1$. + * Covariance: $\Sigma_{11}$. + + + +### **Summary Table** + +| Distribution | Type | Parameters Derived From Joint $(\mu, \Sigma)$ | +| :--- | :--- | :--- | +| **Joint** $P(x)$ | Gaussian | Given as $\mu, \Sigma$ | +| **Conditional** $P(x_1 \| x_2)$ | Gaussian | Complex formula (involves matrix inversion of $\Sigma_{22}$) | +| **Marginal** $P(x_1)$ | Gaussian | Simple subset (extract $\mu_1$ and $\Sigma_{11}$) | + +The lecture concludes by emphasizing that understanding these Gaussian properties is essential for the second half of the semester, as they form the basis for probabilistic generative models. 
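+
+As a minimal numerical companion to the conditional and marginal formulas above, the sketch below partitions a hypothetical 2D Gaussian (block 1 = $x_1$, block 2 = $x_2$) and evaluates $\mu_{1|2}$, $\Sigma_{1|2}$, $\mu_1$, and $\Sigma_{11}$. All numbers are made up for illustration; numpy is assumed.
+
+```python
+import numpy as np
+
+# Hypothetical joint Gaussian over (x1, x2); the values are illustrative only.
+mu = np.array([1.0, 2.0])
+Sigma = np.array([[2.0, 0.8],
+                  [0.8, 1.0]])
+
+mu1, mu2 = mu[:1], mu[1:]
+S11, S12 = Sigma[:1, :1], Sigma[:1, 1:]
+S21, S22 = Sigma[1:, :1], Sigma[1:, 1:]
+
+# Conditional p(x1 | x2 = a): Gaussian, via the partition formulas above.
+a = np.array([2.5])
+mu_cond = mu1 + S12 @ np.linalg.solve(S22, a - mu2)     # mu_{1|2}
+Sigma_cond = S11 - S12 @ np.linalg.solve(S22, S21)      # Sigma_{1|2}
+
+# Marginal p(x1): Gaussian as well; simply read off the sub-blocks.
+mu_marg, Sigma_marg = mu1, S11
+
+print("conditional:", mu_cond, Sigma_cond)   # mean 1.4, variance 1.36
+print("marginal   :", mu_marg, Sigma_marg)   # mean 1.0, variance 2.0
+```
+
+Note how the marginal requires no computation at all, while the conditional only needs the inverse of the observed block $\Sigma_{22}$, exactly as the summary table above indicates.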
diff --git a/final/1113.md b/final/1113.md new file mode 100644 index 0000000..d1cafb4 --- /dev/null +++ b/final/1113.md @@ -0,0 +1,85 @@ +# Study Guide: Learning in Generative Methods & Bayes Optimal Classifier + +**Date:** 2025.11.13 +**Topic:** Maximum Likelihood Estimation (MLE), Missing Data Handling, and Bayes Optimal Classifier. + +--- + +### **1. Overview: Learning in Generative Methods** +The fundamental goal of generative methods is to **estimate the underlying distribution** of the data. Unlike discriminative methods (e.g., Logistic Regression, SVM) which focus on finding a separating boundary, generative methods introduce a probabilistic model and learn its parameters. + +* **Discriminative Model:** Learns specific parameters (like $w, b$ in linear models) to separate classes. +* **Generative Model:** Learns parameters (like $\mu, \Sigma$ in Gaussian models) that best describe how the data is distributed. + +#### **Why Gaussian?** +The Gaussian distribution is the standard model for generative methods because of its mathematical convenience: **both its conditional and marginal distributions are also Gaussian**. This property simplifies probabilistic inference significantly. + + + +[Image of multivariate gaussian distribution 3d plot] + + +--- + +### **2. The Learning Process: Parameter Estimation** +"Learning" in this context means finding the best parameters ($\mu, \Sigma$) for the Gaussian model given the training data. + +#### **Step 1: Define the Objective Function** +We need a metric to evaluate how well our model fits the data. The core idea is **Likelihood**: +* **Goal:** We want to assign **high probability** to the observed (empirical) data points. +* **Likelihood Function:** For independent data points, the likelihood is the product of their individual probabilities. + $$P(Z | \mu, \Sigma) = \prod_{i=1}^{N} P(z_i | \mu, \Sigma)$$ + +#### **Step 2: Log-Likelihood (MLE)** +Directly maximizing the product is difficult. We apply the **logarithm** to convert the product into a sum, creating the **Log-Likelihood** function. This does not change the location of the maximum. +* **Objective:** Maximize $\sum_{i=1}^{N} \ln P(z_i | \mu, \Sigma)$. + +#### **Step 3: Optimization (Derivation)** +We calculate the partial derivatives of the log-likelihood function with respect to the parameters and set them to zero to find the maximum. + +* **Optimal Mean ($\hat{\mu}$):** + The derivation yields the **Empirical Mean**. It is simply the average of the data points. + $$\hat{\mu} = \frac{1}{N} \sum_{i=1}^{N} z_i$$ + +* **Optimal Covariance ($\hat{\Sigma}$):** + The derivation yields the **Empirical Covariance**. + $$\hat{\Sigma} = \frac{1}{N} \sum_{i=1}^{N} (z_i - \hat{\mu})(z_i - \hat{\mu})^T$$ + +**Conclusion:** The "learning" for a Gaussian generative model is simply calculating the sample mean and sample covariance of the training data. This is a closed-form solution, meaning no iterative updates are strictly necessary. + +--- + +### **3. Inference: Making Predictions** +Once the joint distribution $P(z)$ (where $z$ contains both input features $x$ and class labels $y$) is learned, we can perform inference. + +#### **Classification** +To classify a new data point $x_{new}$: +1. We aim to calculate the conditional probability $P(y | x_{new})$. +2. Using the properties of the multivariate Gaussian, we treat the label $y$ as just another dimension in the random vector. +3. We calculate probabilities for each class and compare them (e.g., $P(y=1 | x)$ vs $P(y=0 | x)$). 
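+
+The closed-form learning step from Section 2 amounts to two lines of code. The sketch below fits $\hat{\mu}$ and $\hat{\Sigma}$ to a synthetic data set; the "true" parameters used to generate the data are hypothetical and only serve to check that the estimates come out close. numpy is assumed.
+
+```python
+import numpy as np
+
+rng = np.random.default_rng(0)
+
+# Hypothetical training set: N samples of a D = 3 dimensional vector z
+# (in the lecture, z would stack the input features x and the label y).
+true_mu = np.array([1.0, -2.0, 0.5])
+true_Sigma = np.array([[1.0, 0.3, 0.0],
+                       [0.3, 2.0, 0.4],
+                       [0.0, 0.4, 1.5]])
+Z = rng.multivariate_normal(true_mu, true_Sigma, size=1000)
+
+# MLE = empirical mean and empirical covariance (note the 1/N, not 1/(N-1)).
+N = Z.shape[0]
+mu_hat = Z.mean(axis=0)                       # (1/N) * sum_i z_i
+centered = Z - mu_hat
+Sigma_hat = centered.T @ centered / N         # (1/N) * sum_i (z_i - mu)(z_i - mu)^T
+
+print("mu_hat   :", mu_hat)          # close to true_mu
+print("Sigma_hat:\n", Sigma_hat)     # close to true_Sigma
+```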
+ +#### **Handling Missing Data** +Generative models offer a theoretically robust way to handle missing variables. +* **Scenario:** We have inputs $x = [x_1, x_2]$, but $x_2$ is missing during inference. +* **Method:** **Marginalization**. + 1. Start with the Joint PDF. + 2. Integrate (marginalize) out the missing variable $x_2$. + $$P(y | x_1) = \frac{\int P(x_1, x_2, y) dx_2}{P(x_1)}$$ + 3. Because the model is Gaussian, marginalization is trivial: simply select the sub-vector and sub-matrix corresponding to the observed variables. +* This is superior to heuristic methods like imputing the mean. + +--- + +### **4. Bayes Optimal Classifier** +The lecture introduces the concept of the theoretical "perfect" classifier. + +* **Definition:** The **Bayes Optimal Classifier** is the ideal classifier that would exist if we knew the *true* underlying distribution of the data. +* **Decision Rule:** It assigns the class with the highest posterior probability $P(C_k | x)$. + $$P_1(x_{new}) \ge P_2(x_{new}) \rightarrow \text{Class 1}$$ + +#### **Bayes Error** +* Even the optimal classifier has an irreducible error called the **Bayes Error**. +* **Cause:** Classes often overlap in the feature space. In the overlapping regions, even the best decision rule will make mistakes with some probability. +* **Implication:** No machine learning algorithm can genuinely achieve 0% error (100% accuracy) on non-trivial problems. The goal of ML is to approximate the Bayes Error limit. +* **Mathematical Definition:** The error is the integral of the minimum probability density over the overlapping region: + $$\text{Error} = \int \min[P(C_1|x), P(C_2|x)] dx$$ diff --git a/final/1117.md b/final/1117.md new file mode 100644 index 0000000..37da912 --- /dev/null +++ b/final/1117.md @@ -0,0 +1,99 @@ +# Study Guide: Bayes Optimal Classifier & Graphical Models (Bayesian Networks) + +**Date:** 2025.11.17 +**Topic:** Bayes Error, Graphical Models (Directed), Conditional Independence, and D-separation. + +--- + +### **1. Recap: Bayes Optimal Classifier and Bayes Error** + +The lecture begins by revisiting the concept of the **Bayes Optimal Classifier**. +* **Decision Rule:** For a new data point $x_{new}$, the classifier compares the posterior probabilities $P(C_1 | x_{new})$ and $P(C_2 | x_{new})$. It assigns the label associated with the higher probability. +* **Optimality:** No other classifier can outperform the Bayes Optimal Classifier. It achieves the theoretical minimum error rate. + +#### **Bayes Error (Irreducible Error)** +* **Definition:** Even the perfect classifier will make mistakes because classes often overlap in the feature space. This inevitable error is called the **Bayes Error**. +* **Cause:** It arises from inherent uncertainty, noise, or ambiguity in the data itself, not from the classifier's limitations. +* **Goal of ML:** The objective of any machine learning algorithm is not to achieve 0% error (which is impossible) but to approach the Bayes Error limit as closely as possible. +* **Formula:** The risk (expected error) is the integral of the minimum probability over the domain: + $$R^* = \int \min[P_1(x), P_2(x)] dx$$ + If priors are equal, this simplifies to the integral of the overlap region. + +--- + +### **2. Introduction to Graphical Models** + +The focus shifts to **Generative Models**, specifically **Graphical Models** (also known as Bayesian Networks). + +* **Motivation:** + * A full multivariate Gaussian model requires estimating a **Covariance Matrix** with $D \times D$ elements. 
+ * The number of parameters grows quadratically ($O(D^2)$), which corresponds to $\frac{D(D+1)}{2}$ parameters. + * For high-dimensional data (like images with millions of pixels), estimating these parameters requires an enormous amount of data, which is often infeasible. +* **Solution:** Use **Prior Knowledge** to simplify the model. If we know that certain variables are independent, we can set their covariance terms to zero, significantly reducing the number of parameters to learn. + +--- + +### **3. The Chain Rule and Independence** + +Graphical models leverage the **Chain Rule of Probability** to decompose a complex joint distribution into simpler conditional probabilities. + +* **General Chain Rule:** + $$P(x_1, ..., x_D) = P(x_1) P(x_2|x_1) P(x_3|x_1, x_2) ... P(x_D|x_1...x_{D-1})$$ +* **Simplification with Independence:** + If variable $x_3$ depends only on $x_1$ and is independent of $x_2$, then $P(x_3|x_1, x_2)$ simplifies to $P(x_3|x_1)$. +* **Structure:** This creates a **Directed Acyclic Graph (DAG)** (or Bayes Network) where: + * **Nodes** represent random variables. + * **Edges (Arrows)** represent conditional dependencies (causality). + + + +--- + +### **4. Building a Bayesian Network (Causal Graph)** + +The lecture illustrates this with a practical example involving a crying baby. + +* **Scenario:** We want to model the causes of a baby crying. +* **Variables:** + * **Cry:** The observable effect. + * **Hungry, Sick, Diaper:** Direct causes of crying. + * **Pororo:** A distractor (e.g., watching a cartoon) that might stop the crying. +* **Dependencies:** + * "Hungry" and "Sick" might be independent of each other generally. + * "Cry" depends on all of them. + * "Pororo" depends on "Cry" (parent turns on TV *because* baby is crying) or affects "Cry". + + + +--- + +### **5. The Three Canonical Patterns of Independence** + +To understand complex graphs, we decompose them into three fundamental 3-node patterns. Understanding these patterns allows us to determine if variables are independent given some evidence. + +#### **1. Tail-to-Tail (Common Cause)** +* **Structure:** $X \leftarrow Z \rightarrow Y$ (Z causes both X and Y). +* **Property:** $X$ and $Y$ are dependent. However, if $Z$ is observed (given), $X$ and $Y$ become **independent**. +* **Example:** If $Z$ (Cause) determines both $X$ and $Y$, knowing $Z$ explains the correlation, decoupling $X$ and $Y$. + +#### **2. Head-to-Tail (Causal Chain)** +* **Structure:** $X \rightarrow Z \rightarrow Y$ (X causes Z, which causes Y). +* **Property:** $X$ and $Y$ are dependent. If $Z$ is observed, the path is blocked, and $X$ and $Y$ become **independent**. +* **Example:** $X$ influences $Y$ only through $Z$. If $Z$ is fixed, $X$ cannot influence $Y$ further. + +#### **3. Head-to-Head (Common Effect / V-Structure)** +* **Structure:** $X \rightarrow Z \leftarrow Y$ (X and Y both cause Z). +* **Property:** **Crucial Difference.** $X$ and $Y$ are naturally **independent** (marginal independence). However, if $Z$ is observed (or a descendant is observed), they become **dependent** ("explaining away"). +* **Example:** $X$ (Hungry) $\rightarrow$ $Z$ (Cry) $\leftarrow$ $Y$ (Sick). + * Being hungry tells us nothing about being sick (Independent). + * But if we *know* the baby is crying ($Z$ observed): finding out the baby is Hungry ($X$) makes it less likely they are Sick ($Y$). The causes compete to explain the effect. + +--- + +### **6. 
D-Separation** + +These rules form the basis of **D-separation** (Directed Separation), a formal method to determine conditional independence in any directed graph. +* If all paths between two variables are "blocked" by the evidence set, the variables are D-separated (independent). +* A path is blocked if: + * It contains a chain or fork where the middle node is **observed**. + * It contains a collider where the middle node (and all its descendants) are **NOT observed**. diff --git a/final/1120.md b/final/1120.md new file mode 100644 index 0000000..fe6da78 --- /dev/null +++ b/final/1120.md @@ -0,0 +1,79 @@ +# Lecture Summary: Directed Graphical Models and Naive Bayes + +**Date:** 2025.11.20 +**Topic:** Parameter Reduction, Directed Graphical Models, Chain Rule, and Naive Bayes Classifier. + +--- + +### **1. Motivation: The Need for Parameter Reduction** +The lecture begins by reviewing Generative Methods using the Gaussian distribution. +* **The Problem:** In high-dimensional settings (e.g., analyzing images or complex biological data), estimating the full Joint Probability Distribution is computationally expensive and data-intensive. + * For a $D$-dimensional Multivariate Gaussian, we must estimate the mean vector $\mu$ ($D$ parameters) and the Covariance Matrix $\Sigma$ (symmetric $D \times D$ matrix). + * The total number of parameters is roughly $O(D^2)$, specifically $D + \frac{D(D+1)}{2}$. + * For large $D$, this requires a massive amount of training data to avoid overfitting. +* **The Solution:** We use **Prior Knowledge** (domain knowledge) about the relationships between variables to reduce the number of parameters. + * By assuming certain variables are independent, we can decompose the complex joint distribution into smaller, simpler conditional distributions. + +--- + +### **2. Directed Graphical Models (Bayesian Networks)** +A Directed Graphical Model represents random variables as nodes in a graph, where edges denote conditional dependencies. + +#### **Decomposition via Chain Rule** +* The joint probability $P(x)$ can be decomposed using the chain rule: + $$P(x_1, ..., x_D) = \prod_{i=1}^{D} P(x_i | \text{parents}(x_i))$$ +* **Example Structure:** + If we have a graph where $x_1$ has no parents, $x_2$ depends on $x_1$, etc., the joint distribution splits into: + $$P(x) = P(x_1)P(x_2|x_1)P(x_3|x_1)...$$ + +#### **Parameter Counting Example (Gaussian Case)** +The lecture compares the number of parameters required for a "Full" Gaussian model vs. a "Reduced" Graphical Model. +* **Full Gaussian:** Assumes all variables are correlated. + * For a 10-dimensional vector ($D=10$), parameters = $10 + \frac{10 \times 11}{2} = 65$. +* **Reduced Model:** Uses a graph structure where variables are conditionally independent. + * Instead of one giant covariance matrix, we estimate parameters for several smaller conditional distributions (often univariate Gaussians). + * **Calculation:** For a univariate conditional Gaussian $P(x_i | x_j)$, we need parameters for the linear relationship (mean coefficients) and variance. + * In the specific example provided, the parameters reduced from 65 to 57. While the reduction in this small example is modest, for high-dimensional data with sparse connections, the reduction is drastic. + +--- + +### **3. The Naive Bayes Classifier** +The **Naive Bayes** classifier is the most extreme (and popular) example of a Directed Graphical Model used for parameter reduction. 
+ +* **Assumption:** Given the class label $y$, all input features $x_1, ..., x_D$ are **mutually independent**. +* **Structure:** The class $y$ is the parent of all feature nodes $x_i$. There are no connections between the features themselves. +* **Formula:** + $$P(x|y) = P(x_1|y) P(x_2|y) \cdot ... \cdot P(x_D|y) = \prod_{d=1}^{D} P(x_d|y)$$ +* **Advantage:** We only need to estimate the distribution of each feature individually, rather than their complex joint interactions. + +--- + +### **4. Application: Spam Classifier** +The lecture applies the Naive Bayes framework to a discrete problem: classifying emails as **Spam ($y=1$)** or **Not Spam ($y=0$)**. + +#### **Feature Engineering** +* **Input:** Emails with varying text lengths. +* **Transformation:** A "Bag of Words" approach is used. + 1. Create a dictionary of $N$ words (e.g., $N=10,000$). + 2. Represent each email as a fixed-length binary vector $x \in \{0, 1\}^{10,000}$. + 3. $x_i = 1$ if the $i$-th word appears in the email, $0$ otherwise. + +#### **The "Curse of Dimensionality" (Without Naive Bayes)** +* Since the features are discrete (binary), we cannot use Gaussian distributions. We must use probability tables. +* If we tried to model the full joint distribution $P(x_1, ..., x_{10000} | y)$, we would need a probability table for every possible combination of words. +* **Parameter Count:** $2^{10,000}$ entries. This is computationally impossible. + +#### **Applying Naive Bayes** +* By assuming word independence given the class, we decompose the problem: + $$P(x|y) \approx \prod_{i=1}^{10,000} P(x_i|y)$$ +* **Parameter Estimation:** + * We only need to estimate $P(x_i=1 | y=1)$ and $P(x_i=1 | y=0)$ for each word. + * This requires simply counting the frequency of each word in Spam vs. Non-Spam emails. +* **Reduced Parameter Count:** + * Instead of $2^{10,000}$, we need roughly $2 \times 10,000$ parameters (one probability per word per class). + * This transforms an impossible problem into a highly efficient and simple one. + +### **5. Summary** +* **Generative Methods** aim to model the underlying distribution $P(x, y)$. +* **Graphical Models** allow us to inject prior knowledge (independence assumptions) to make this feasible. +* **Naive Bayes** assumes full conditional independence, reducing parameter estimation from exponential to linear complexity, making it ideal for high-dimensional discrete data like text classification. diff --git a/final/1124.md b/final/1124.md new file mode 100644 index 0000000..b8041ae --- /dev/null +++ b/final/1124.md @@ -0,0 +1,85 @@ +# Study Guide: Discrete Probability Models & Undirected Graphical Models + +**Date:** 2025.11.24 +**Topic:** Multinomial Distribution, Maximum Likelihood Estimation (MLE), and Markov Random Fields (Undirected Graphical Models). + +--- + +### **1. Discrete Probability Distributions** +The lecture shifts focus from continuous models (like Gaussian) to discrete models, which are essential for tasks like text classification (e.g., Naive Bayes). + +#### **Binomial Distribution** +* **Scenario:** A coin toss (Binary outcome: Head/Tail). +* **Random Variables:** $m_1$ (count of Heads), $m_2$ (count of Tails). +* **Parameters:** Probability of Head ($\mu$) and Tail ($1-\mu$). +* **Formula:** For a sequence of tosses, we consider the number of ways to arrange the outcomes. + $$P(m_1, m_2) = \frac{N!}{m_1!m_2!} \mu^{m_1} (1-\mu)^{m_2}$$ + +#### **Multinomial Distribution** +* **Scenario:** Rolling a die with $K$ faces (e.g., $K=6$). 
This generalizes the binomial distribution. +* **Definition:** + * We have $N$ total events (trials). + * We observe counts $m_1, m_2, ..., m_k$ for each of the $K$ possible outcomes. + * Parameters $\mu_1, ..., \mu_k$ represent the probability of each outcome. +* **Probability Mass Function:** + $$P(m_1, ..., m_k | \mu) = \frac{N!}{m_1! ... m_k!} \prod_{k=1}^{K} \mu_k^{m_k}$$ + +--- + +### **2. Learning: Maximum Likelihood Estimation (MLE)** +How do we estimate the parameters ($\mu_k$) from data? + +* **Goal:** Maximize the likelihood of the observed data subject to the constraint that probabilities sum to 1 ($\sum \mu_k = 1$). +* **Method:** **Lagrange Multipliers**. + 1. **Objective:** Maximize Log-Likelihood: + $$L = \ln(N!) - \sum \ln(m_k!) + \sum m_k \ln(\mu_k)$$ + 2. **Constraint:** $\sum_{k=1}^{K} \mu_k - 1 = 0$. + 3. **Lagrangian:** + $$L' = \sum_{k=1}^{K} m_k \ln(\mu_k) + \lambda (\sum_{k=1}^{K} \mu_k - 1)$$ + (Note: Constant terms like $N!$ vanish during differentiation). + 4. **Derivation:** Taking the derivative w.r.t $\mu_k$ and setting to 0 yields $\mu_k = - \frac{m_k}{\lambda}$. Solving for $\lambda$ using the constraint gives $\lambda = -N$. + +* **Result:** + $$\mu_k = \frac{m_k}{N}$$ + * The optimal parameter is simply the **empirical fraction** (count of specific events divided by total events). + * This provides the theoretical justification for the simple "counting" method used in the Naive Bayes classifier discussed in previous lectures. + +--- + +### **3. Undirected Graphical Models (Markov Random Fields)** + +When causal relationships (direction) are unclear or interactions are symmetric (e.g., neighboring pixels in an image, social network friends), we use **Undirected Graphs** instead of Bayesian Networks (Directed Acyclic Graphs). + +#### **Comparison** +* **Directed (Bayesian Network):** Uses conditional probabilities (e.g., $P(A|B)$). Represents causality or asymmetric relationships. +* **Undirected (Markov Random Field - MRF):** Uses "Potential Functions" ($\psi$). Represents correlation or symmetric constraints. + +#### **Conditional Independence in MRF** +Determining independence is simpler in undirected graphs than in directed graphs (no D-separation rules needed). +* **Global Markov Property:** Two sets of nodes are conditionally independent given a separating set if all paths between them pass through the separating set. + * *Example:* If nodes $X_1$ and $X_5$ are not directly connected, they are conditionally independent given the intermediate nodes (e.g., $X_3$) that block the path. + +--- + +### **4. Factorization in Undirected Graphs** + +Since we cannot use chain rules of conditional probabilities (because $P(A|B) \neq P(B|A)$ generally), we model the joint distribution using **Cliques**. + +#### **Cliques and Maximal Cliques** +* **Clique:** A subgraph where every pair of nodes is connected (fully connected). +* **Maximal Clique:** A clique that cannot be expanded by including any other adjacent node. + +#### **The Joint Distribution Formula** +We associate a **Potential Function** ($\psi_C$) with each maximal clique $C$. +$$P(x) = \frac{1}{Z} \prod_{C} \psi_C(x_C)$$ + +* **Potential Function ($\psi$):** A non-negative function that scores the compatibility of variables in a clique. It is *not* a probability (doesn't sum to 1). +* **Partition Function ($Z$):** The normalization constant required to make the total probability sum to 1. 
+ $$Z = \sum_x \prod_{C} \psi_C(x_C)$$ + +#### **Example Decomposition** +Given a graph with maximal cliques $\{x_1, x_2\}$, $\{x_1, x_3\}$, and $\{x_3, x_4, x_5\}$: +$$P(x) = \frac{1}{Z} \psi_{12}(x_1, x_2) \psi_{13}(x_1, x_3) \psi_{345}(x_3, x_4, x_5)$$ + +#### **Hammersley-Clifford Theorem** +This theorem provides the theoretical guarantee that a strictly positive distribution can satisfy the conditional independence properties of an undirected graph if and only if it can be factorized over the graph's cliques. diff --git a/final/1127.md b/final/1127.md new file mode 100644 index 0000000..f23d604 --- /dev/null +++ b/final/1127.md @@ -0,0 +1,63 @@ +# Study Guide: Undirected Graphical Models (Markov Random Fields) + +**Date:** 2025.11.27 +**Topic:** Potential Functions, Partition Function, and Conditional Independence in MRFs. + +--- + +### **1. Recap: Decomposition in Undirected Graphs** +Unlike Directed Graphical Models (Bayesian Networks) which use conditional probabilities, **Undirected Graphical Models (Markov Random Fields - MRFs)** cannot directly use probabilities because there is no direction/causality. Instead, they decompose the joint distribution based on **Maximal Cliques**. + +* **Cliques:** Subsets of nodes where every node is connected to every other node. +* **Maximal Clique:** A clique that cannot be expanded (e.g., in the example graph, the maximal cliques covers the graph). +* **Decomposition Rule:** The joint distribution is the product of functions defined over these maximal cliques. + +--- + +### **2. Potential Functions ($\psi$)** +* **Definition:** For each maximal clique $C$, we define a **Potential Function** $\psi_C(x_C)$ (often denoted as $\phi$ or $\psi$). + * It is a **positive function** ($\psi(x) \ge 0$) mapping the state of the clique variables to a real number. + * It represents the "compatibility" or "energy" of that configuration. +* **Key Distinction:** A potential function is **NOT a probability**. It does not sum to 1. It is just a score (non-negative function). + * *Example:* $\psi_{12}(x_1, x_2)$ scores the interaction between $x_1$ and $x_2$. + +--- + +### **3. The Partition Function ($Z$)** +Since the product of potential functions is not a probability distribution (it doesn't sum to 1), we must normalize it. + +* **Definition:** The normalization constant is called the **Partition Function** ($Z$). + $$Z = \sum_{x} \prod_{C} \psi_C(x_C)$$ +* **Role:** It ensures that the resulting distribution sums to 1, making it a valid probability distribution. + $$P(x) = \frac{1}{Z} \prod_{C} \psi_C(x_C)$$ +* **Calculation:** To find $Z$, we must sum the product of potentials over **all possible states** (combinations) of the random variables. This summation is often computationally expensive. + +#### **Example Calculation** +The lecture walks through a simple example with 4 binary variables and two cliques: $\{x_1, x_2, x_3\}$ and $\{x_3, x_4\}$. +* **Step 1:** Define potential tables for $\psi_{123}$ and $\psi_{34}$. +* **Step 2:** Calculate the score for every combination. +* **Step 3:** Sum all scores to get $Z$. In the example, $Z=10$. +* **Step 4:** The probability of any specific state (e.g., $P(1,0,0,0)$) is its specific score divided by $Z$ (e.g., $(1 \times 3)/10$ or similar depending on values). + +--- + +### **4. Parameter Estimation** +* **Discrete Case:** If variables are discrete (like the email spam example), the parameters are the entries in the potential tables. We estimate these values from data to maximize the likelihood. 
+* **Continuous Case:** If variables are continuous, potential functions are typically Gaussian distributions. We estimate means and covariances. +* **Reduction:** Just like in Bayesian Networks, using the graph structure reduces the number of parameters. + * *Without Graph:* A full table for 4 binary variables needs $2^4 = 16$ entries. + * *With Graph:* We only need tables for the cliques, significantly reducing complexity. + +--- + +### **5. Verifying Conditional Independence** +The lecture demonstrates analytically that the potential function formulation preserves the conditional independence properties of the graph. + +* **Scenario:** Graph with structure $x_1 - x_2 - x_3 - x_4$. + * Is $x_4$ independent of $x_1$ given $x_3$? +* **Analytical Check:** + * We calculate $P(x_4=1 | x_1=0, x_2=1, x_3=1)$. + * We also calculate $P(x_4=1 | x_1=0, x_2=0, x_3=1)$. +* **Result:** The calculation shows that as long as $x_3$ is fixed (given), the value of $x_1$ and $x_2$ cancels out in the probability ratio. + * $P(x_4|x_1, x_2, x_3) = \frac{\phi_1}{\phi_1 + \phi_0}$ (depends only on potentials involving $x_4$ and $x_3$). +* **Conclusion:** This confirms that $x_4 \perp \{x_1, x_2\} | x_3$. The formulation correctly encodes the global Markov property. diff --git a/final/1201.md b/final/1201.md new file mode 100644 index 0000000..c61a539 --- /dev/null +++ b/final/1201.md @@ -0,0 +1,102 @@ +# Study Guide: Bayesian Networks & Probabilistic Inference + +**Date:** 2025.12.01 (Final Lecture) +**Topic:** Bayesian Networks, Probabilistic Inference Examples, Marginalization. + +--- + +### **1. Recap: Directed vs. Undirected Models** +The lecture begins by briefly contrasting the two types of graphical models discussed: +* **Undirected Graphs (MRF):** Use potential functions ($\psi$) defined on maximal cliques. Requires a normalization constant (partition function $Z$) to become a probability distribution. +* **Directed Graphs (Bayesian Networks):** Use conditional probability distributions (CPDs). The joint distribution is the product of local conditional probabilities. + $$P(X) = \prod_{i} P(x_i | \text{parents}(x_i))$$ + +--- + +### **2. Example 1: The "Alarm" Network (Burglary/Earthquake)** +This is a classic example used to demonstrate inference in Bayesian Networks. + +#### **Scenario & Structure** +* **Nodes:** + * **B:** Burglary (Parent, no prior causes). + * **E:** Earthquake (Parent, no prior causes). + * **A:** Alarm (Triggered by Burglary or Earthquake). + * **J:** JohnCalls (Triggered by Alarm). + * **M:** MaryCalls (Triggered by Alarm). +* **Dependencies:** $B \rightarrow A \leftarrow E$, $A \rightarrow J$, $A \rightarrow M$. +* **Probabilities (Given):** + * $P(B) = 0.05$, $P(E) = 0.1$. + * $P(A|B, E)$: Table given (e.g., $P(A|B, \neg E) = 0.85$, $P(A|\neg B, \neg E) = 0.05$, etc.). + * $P(J|A) = 0.7$, $P(M|A) = 0.8$. + +#### **Task 1: Calculate a Specific Joint Probability** +Calculate the probability of the event: **Burglary, No Earthquake, Alarm rings, John calls, Mary does not call**. +$$P(B, \neg E, A, J, \neg M)$$ + +* **Decomposition:** Apply the Chain Rule based on the graph structure. + $$= P(B) \cdot P(\neg E) \cdot P(A | B, \neg E) \cdot P(J | A) \cdot P(\neg M | A)$$ +* **Calculation:** + $$= 0.05 \times 0.9 \times 0.85 \times 0.7 \times 0.2$$ + +#### **Task 2: Inference (Conditional Probability)** +Calculate the probability that a **Burglary occurred**, given that **John called** and **Mary did not call**. 
+$$P(B | J, \neg M)$$ + +* **Formula (Bayes Rule):** + $$P(B | J, \neg M) = \frac{P(B, J, \neg M)}{P(J, \neg M)}$$ + +* **Numerator Calculation ($P(B, J, \neg M)$):** + We must **marginalize out** the unknown variables ($A$ and $E$) from the joint distribution. + $$P(B, J, \neg M) = \sum_{A \in \{T,F\}} \sum_{E \in \{T,F\}} P(B, E, A, J, \neg M)$$ + This involves summing 4 terms (combinations of A and E). + +* **Denominator Calculation ($P(J, \neg M)$):** + We further marginalize out $B$ from the numerator result. + $$P(J, \neg M) = P(B, J, \neg M) + P(\neg B, J, \neg M)$$ + +--- + +### **3. Example 2: 4-Node Tree Structure** +A simpler example to demonstrate how sums simplify during marginalization. + +#### **Scenario & Structure** +* **Nodes:** $X_1, X_2, X_3, X_4 \in \{0, 1\}$ (Binary). +* **Dependencies:** + * $X_1 \rightarrow X_2$ + * $X_2 \rightarrow X_3$ + * $X_2 \rightarrow X_4$ +* **Decomposition:** $P(X) = P(X_1)P(X_2|X_1)P(X_3|X_2)P(X_4|X_2)$. +* **Given Tables:** Probabilities for all priors and conditionals are provided. + +#### **Task: Calculate Marginal Probability $P(X_3 = 1)$** +We need to find the probability of $X_3=1$ regardless of the other variables. + +* **Definition:** Sum the joint probability over all other variables ($X_1, X_2, X_4$). + $$P(X_3=1) = \sum_{x_1} \sum_{x_2} \sum_{x_4} P(x_1, x_2, x_3=1, x_4)$$ + +* **Step 1: Expand using Graph Structure** + $$= \sum_{x_1} \sum_{x_2} \sum_{x_4} P(x_1)P(x_2|x_1)P(X_3=1|x_2)P(x_4|x_2)$$ + +* **Step 2: Simplify (Key Insight)** + Move the summation signs to push them as far right as possible. The sum over $x_4$ only affects the last term $P(x_4|x_2)$. + $$= \sum_{x_1} \sum_{x_2} P(x_1)P(x_2|x_1)P(X_3=1|x_2) \left[ \sum_{x_4} P(x_4|x_2) \right]$$ + + * **Property:** $\sum_{x_4} P(x_4|x_2) = 1$ (Sum of probabilities for a variable given a condition is always 1). + * Therefore, the $X_4$ term vanishes. This makes sense intuitively: $X_4$ is a "leaf" node distinct from $X_3$; knowing nothing about it doesn't change $X_3$'s probability if $X_2$ is handled. + +* **Step 3: Final Calculation** + We are left with summing over $X_1$ and $X_2$: + $$= \sum_{x_1} \sum_{x_2} P(x_1)P(x_2|x_1)P(X_3=1|x_2)$$ + This expands to 4 terms (combinations of $x_1 \in \{0,1\}$ and $x_2 \in \{0,1\}$). + +--- + +### **4. Semester Summary & Conclusion** +The lecture concludes the semester's material. + +* **Key Themes Covered:** + * **Discriminative vs. Generative Methods:** The fundamental difference in approach (boundary vs. distribution). + * **Objective Functions:** Designing Loss functions vs. Likelihood functions. + * **Optimization:** Parameter estimation via derivatives (MLE). + * **Graphical Models:** Reducing parameter complexity using independence assumptions (Bayes Nets, MRFs). +* **Final Exam:** Scheduled for Thursday, December 11th. It will cover the concepts discussed, focusing on understanding the fundamentals (e.g., Likelihood, Generative principles) rather than rote memorization. 
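+
+As a closing illustration of Example 2, the sketch below computes $P(X_3 = 1)$ twice: by brute-force summation over the full joint, and by the simplified sum obtained after the $\sum_{x_4} P(x_4|x_2) = 1$ step. The two results agree. The conditional probability tables are hypothetical, since the lecture's actual numbers are not reproduced in these notes.
+
+```python
+import itertools
+
+# Hypothetical CPTs for the binary 4-node tree X1 -> X2 -> {X3, X4}.
+P_x1 = {0: 0.6, 1: 0.4}
+P_x2_given_x1 = {(0, 0): 0.7, (1, 0): 0.3,   # key = (x2, x1)
+                 (0, 1): 0.2, (1, 1): 0.8}
+P_x3_given_x2 = {(0, 0): 0.9, (1, 0): 0.1,   # key = (x3, x2)
+                 (0, 1): 0.4, (1, 1): 0.6}
+P_x4_given_x2 = {(0, 0): 0.5, (1, 0): 0.5,   # key = (x4, x2)
+                 (0, 1): 0.3, (1, 1): 0.7}
+
+def joint(x1, x2, x3, x4):
+    """Joint probability from the factorization P(x1)P(x2|x1)P(x3|x2)P(x4|x2)."""
+    return (P_x1[x1] * P_x2_given_x1[(x2, x1)]
+            * P_x3_given_x2[(x3, x2)] * P_x4_given_x2[(x4, x2)])
+
+# Brute force: marginalize x1, x2, x4 out of the full joint.
+p_x3_full = sum(joint(x1, x2, 1, x4)
+                for x1, x2, x4 in itertools.product([0, 1], repeat=3))
+
+# Simplified: sum_{x4} P(x4|x2) = 1, so the x4 factor drops out entirely.
+p_x3_short = sum(P_x1[x1] * P_x2_given_x1[(x2, x1)] * P_x3_given_x2[(1, x2)]
+                 for x1, x2 in itertools.product([0, 1], repeat=2))
+
+print(p_x3_full, p_x3_short)   # identical values
+```
+
+The same pattern (pushing the sums inward and dropping factors that sum to 1) is exactly the "Key Insight" in Step 2 above, and it is what keeps inference in tree-structured Bayesian networks tractable.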
diff --git a/final/AI_Lecture_note_1027.pdf b/final/AI_Lecture_note_1027.pdf new file mode 100644 index 0000000..9528ee1 --- /dev/null +++ b/final/AI_Lecture_note_1027.pdf @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:0de19c9741cf9727c433a0bb5ff4d6dc964f18fcb3f83a6460aacc0e18f95a73 +size 3822586 diff --git a/final/AI_Lecture_note_1030.pdf b/final/AI_Lecture_note_1030.pdf new file mode 100644 index 0000000..433ef1b --- /dev/null +++ b/final/AI_Lecture_note_1030.pdf @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:213b28289e9217e611f0a3f7f724490ebe46a3f46dab077a60a52ea2f78f7acc +size 4974234 diff --git a/final/AI_Lecture_note_1103.pdf b/final/AI_Lecture_note_1103.pdf new file mode 100644 index 0000000..aa73dae --- /dev/null +++ b/final/AI_Lecture_note_1103.pdf @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:c36fdb70b004399e3757fe1c1f829559acbda7027c88878be028b57a15804931 +size 3857954 diff --git a/final/AI_Lecture_note_1106.pdf b/final/AI_Lecture_note_1106.pdf new file mode 100644 index 0000000..75c3fdd --- /dev/null +++ b/final/AI_Lecture_note_1106.pdf @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:82560c38016d98be0d54aff881c898daf3127254a8e687c098ee604de574dc20 +size 5202704 diff --git a/final/AI_Lecture_note_1110.pdf b/final/AI_Lecture_note_1110.pdf new file mode 100644 index 0000000..4c83bbe --- /dev/null +++ b/final/AI_Lecture_note_1110.pdf @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:81bde77201036497774937cb91da60122b116e668cbb5e631ef2644497c5503b +size 5195473 diff --git a/final/AI_Lecture_note_1113.pdf b/final/AI_Lecture_note_1113.pdf new file mode 100644 index 0000000..3f44354 --- /dev/null +++ b/final/AI_Lecture_note_1113.pdf @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:fd978d86010265601a789d04c5f974d71a1d5c73ecaa1117ee3dc1cfbba886c0 +size 3923021 diff --git a/final/AI_Lecture_note_1117.pdf b/final/AI_Lecture_note_1117.pdf new file mode 100644 index 0000000..8abc774 --- /dev/null +++ b/final/AI_Lecture_note_1117.pdf @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:7659d9fb22418e13f7697a2689f5b7e23a7dc01823d1efc5830ec4c747c1a624 +size 4163676 diff --git a/final/AI_Lecture_note_1120.pdf b/final/AI_Lecture_note_1120.pdf new file mode 100644 index 0000000..439871f --- /dev/null +++ b/final/AI_Lecture_note_1120.pdf @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:24f3ae1edf3e406a9523a9c4ba27e5a905775d9945bd0249f5b434e40a56d3b2 +size 5361036 diff --git a/final/AI_Lecture_note_1124.pdf b/final/AI_Lecture_note_1124.pdf new file mode 100644 index 0000000..82e94da --- /dev/null +++ b/final/AI_Lecture_note_1124.pdf @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:a593bb3dbd242a303ef9ddf55d973d9ed1585f800452bd5e25d5e09fb00688f9 +size 3516978 diff --git a/final/AI_Lecture_note_1127.pdf b/final/AI_Lecture_note_1127.pdf new file mode 100644 index 0000000..75fbc0b --- /dev/null +++ b/final/AI_Lecture_note_1127.pdf @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:34a9756fce0bcf87e92d8f7055ffbfb327110ccddd85976be18b9b1621775125 +size 2381901 diff --git a/final/AI_Lecture_note_1201.pdf b/final/AI_Lecture_note_1201.pdf new file mode 100644 index 0000000..b8e816f --- /dev/null +++ b/final/AI_Lecture_note_1201.pdf @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:38eb9c14448ecb4988873f1631883bd778cecfe2d1d4cff9097db3f65972718b +size 3392169