diff --git a/final/1027.md b/final/1027.md new file mode 100644 index 0000000..54d0781 --- /dev/null +++ b/final/1027.md @@ -0,0 +1,68 @@ +# Large Margin Classifiers and Optimization + +**Date:** 2025.10.27 +**Topic:** Large Margin Classifiers, Optimization, Margin Definition + +--- + +### 1. Introduction to Robust Classification +The lecture begins by shifting focus from generative methods to discriminative methods, specifically within a **linearly separable setting**. +* **Problem Setting:** The goal is to classify data that can be perfectly separated by a linear boundary (hyperplane). +* **Robustness:** While infinite linear classifiers may separate the data, the objective is to find the "best" one. The best classifier is defined as the one that is most **robust**, meaning it generalizes well to new test data and handles potential outliers effectively. +* **Intuition:** A robust classifier places the decision boundary in the middle of the gap between classes, maximizing the distance to the nearest data points. + +### 2. Defining the Margin +The concept of the **margin** is introduced to mathematically define robustness. +* **Definition:** The margin is the distance between the decision hyperplane and the closest data points. +* **Hyperplane Equation:** The decision boundary is defined as $w^T x - b = 0$. +* **Support Lines:** To define the margin, we establish two parallel lines passing through the closest data points: + * $w^T x - b = 1$ (for class +1). + * $w^T x - b = -1$ (for class -1). + * The region between these lines contains no data points. + +### 3. Calculating the Margin Width +The lecture derives the mathematical expression for the margin width using vector projection. +* **Vector Projection:** The margin is calculated by projecting the vector connecting a point on the boundary ($x_0$) to a support vector ($x$) onto the normal vector $w$. +* **Derivation:** + * The distance is the projection of vector $(x - x_0)$ onto the unit normal vector $\frac{w}{||w||}$. + * Using the constraint $w^T x - b = 1$ and $w^T x_0 - b = 0$, the derived margin distance is $\frac{1}{||w||}$. +* **Conclusion:** Maximizing the margin is equivalent to **minimizing the norm of the weight vector $||w||$**. + +### 4. The Optimization Problem +The task of finding the best classifier is formulated as a constrained optimization problem. + +* **Objective Function:** + $$\min ||w||^2$$ + (Note: Minimizing $||w||$ is computationally equivalent to minimizing $||w||^2$) + +* **Constraints:** All data points must be correctly classified and lie outside the margin. This is formalized as: + * $w^T x_i - b \ge 1$ for $y_i = 1$. + * $w^T x_i - b \le -1$ for $y_i = -1$. + * **Combined Constraint:** $y_i (w^T x_i - b) \ge 1$ for all $i$. + +### 5. Optimization with Constraints (Lagrange Multipliers) +The lecture explains how to solve this optimization problem using **Lagrange Multipliers**, using a general example first. + +* **Problem Setup:** Minimize an objective function $L(x)$ subject to a constraint $g(x) \ge 0$. +* **Lagrangian:** A new objective function is defined by combining the original loss and the constraint with a multiplier $\lambda$: + $$L'(x) = L(x) - \lambda g(x)$$ + (Note: The transcript discusses combining components; the sign depends on the specific maximization/minimization formulation) + +* **Solution Cases:** + The solution involves taking the derivative $\frac{dL'}{dx} = 0$ and considering two cases: + 1. 
**Feasible Region ($\lambda = 0$):** The unconstrained minimum of $L(x)$ naturally satisfies the constraint ($g(x) > 0$). In this case, the constraint is inactive. + 2. **Boundary Case ($\lambda > 0$):** The unconstrained minimum violates the constraint. Therefore, the optimal solution lies *on* the boundary where $g(x) = 0$. + +### 6. Example: Constrained Minimization +A specific mathematical example is worked through to demonstrate the method. +* **Objective:** Minimize $x_1^2 + x_2^2$ (distance from origin). +* **Constraint:** $x_2 - x_1^2 - 1 \ge 0$ (must be above a parabola). +* **Solving:** + * The Lagrangian is set up: $L' = x_1^2 + x_2^2 - \lambda(x_2 - x_1^2 - 1)$. + * **Case 1 ($\lambda = 0$):** Leads to $x_1=0, x_2=0$, which violates the constraint ($0 - 0 - 1 = -1 \not\ge 0$). This solution is discarded. + * **Case 2 (Boundary, $\lambda \ne 0$):** The solution must lie on $x_2 - x_1^2 - 1 = 0$. Solving the system of equations yields the valid minimum at $x_1=0, x_2=1$. + +### 7. Next Steps: Support Vector Machines +The lecture concludes by linking this optimization framework back to the classifier. +* **Support Vectors:** The data points that lie exactly on the margin boundary ($g(x)=0$) are called "Support Vectors". +* **Future Topic:** This foundation leads into the **Support Vector Machine (SVM)** algorithm, which will be discussed in the next session to handle non-linearly separable data. diff --git a/final/1030.md b/final/1030.md new file mode 100644 index 0000000..471b85d --- /dev/null +++ b/final/1030.md @@ -0,0 +1,125 @@ +# Support Vector Machines: Optimization, Dual Problem & Kernel Methods + +**Date:** 2025.10.30 and 2025.11.03 +**Topic:** SVM Dual Form, Lagrange Multipliers, Kernel Trick, Cover's Theorem, Mercer's Theorem + +--- + +### 1. Introduction to SVM Mathematics +The lecture focuses on the fundamental mathematical concepts behind Support Vector Machines (SVM), specifically the Large Margin Classifier. +* **Goal:** The objective is to understand the flow and connection of formulas rather than memorizing them. +* **Context:** SVMs were the dominant model for a decade before deep learning and remain powerful for specific problem types. +* **Core Concept:** The algorithm seeks to maximize the margin to ensure the most robust classifier. + +### 2. General Optimization with Constraints +The lecture reviews and expands on the method of Lagrange multipliers for solving optimization problems with constraints. +* **Problem Setup:** To minimize an objective function $L(x)$ subject to constraints $g(x) \ge 0$, a new objective function (Lagrangian) is defined by combining the original function with the constraints using multipliers ($\lambda$). +* **KKT Conditions:** The Karush-Kuhn-Tucker (KKT) conditions are introduced to solve this. There are two main solution cases: + 1. **Feasible Region:** The unconstrained minimum satisfies the constraint. Here, $\lambda = 0$. + 2. **Boundary Case:** The solution lies on the boundary where $g(x) = 0$. Here, $\lambda > 0$. + +### 3. Multi-Constraint Example +A specific example is provided to demonstrate optimization with multiple constraints. +* **Objective:** Minimize $x_1^2 + x_2^2$ subject to two linear constraints. +* **Lagrangian:** The function is defined as $L'(x) = L(x) - \lambda_1 g_1(x) - \lambda_2 g_2(x)$. +* **Solving Strategy:** With two constraints, there are four possible combinations for $\lambda$ values (both zero, one zero, or both positive). + * The lecture demonstrates testing these cases. 
For instance, assuming both $\lambda=0$ yields $x_1=0, x_2=0$, which violates the constraints. + * The valid solution is found where the constraints intersect (Boundary Case). + +### 4. SVM Mathematical Formulation (Primal Problem) +The lecture applies these optimization principles specifically to the SVM Large Margin Classifier. +* **Objective Function:** Minimize $\frac{1}{2}||w||^2$ (equivalent to maximizing the margin). +* **Constraints:** All data points must be correctly classified outside the margin: $y_i(w^T x_i - b) \ge 1$. +* **Lagrangian Formulation:** + $$L(w, b) = \frac{1}{2}||w||^2 - \sum_{i=1}^{N} \alpha_i [y_i(w^T x_i - b) - 1]$$ + Here, $\alpha_i$ represents the Lagrange multipliers. + +### 5. Deriving the Dual Problem +To solve this, the Partial Derivatives with respect to the parameters $w$ and $b$ are set to zero. +* **Derivative w.r.t $w$:** Yields the relationship $w = \sum \alpha_i y_i x_i$. This shows $w$ is a linear combination of the data points. +* **Derivative w.r.t $b$:** Yields the constraint $\sum \alpha_i y_i = 0$. +* **Substitution:** By plugging these results back into the original Lagrangian equation, the "Primal" problem is converted into the "Dual" problem. + +### 6. The Dual Form and Kernel Intuition +The final derived Dual objective function depends entirely on the dot product of data points. +* **Dual Equation:** + $$\text{Maximize } \sum \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j x_i^T x_j$$ + Subject to $\sum \alpha_i y_i = 0$ and $\alpha_i \ge 0$. +* **Primal vs. Dual:** + * **Primal:** Depends on the number of features/parameters ($D$). + * **Dual:** Depends on the number of data points ($N$). +* **Significance:** The term $x_i^T x_j$ represents the inner product between data points. This structure allows for the "Kernel Trick" (discussed below), which handles non-linearly separable data by mapping it to higher dimensions without explicit calculation. + +--- + +### 7. The Dual Form and Inner Products +In the previous section, the **Dual Form** of the SVM optimization problem was derived. +* **Objective Function:** The dual objective function to maximize involves the parameters $\alpha$ and the data points: + $$\sum \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j (x_i^T x_j)$$ + +* **Key Observation:** The optimization depends solely on the **inner product** ($x_i^T x_j$) between data points. This inner product represents the **similarity** between two vectors, which is the foundational concept for the Kernel Method. + +--- + +### 8. Feature Mapping and Cover's Theorem +When data is not linearly separable in the original space (low-dimensional), we can transform it into a higher-dimensional space where a linear separator exists. + +* **Mapping Function ($\Phi$):** We define a transformation rule, or mapping function $\Phi(x)$, that projects input vector $x$ from the original space to a high-dimensional feature space. + * **Example 1 (1D to 2D):** Mapping $x \to (x, x^2)$. A linear line in the 2D space (parabola) can separate classes that were mixed on the 1D line. + * **Example 2 (2D to 3D):** Mapping $x = (x_1, x_2)$ to $\Phi(x) = (x_1^2, x_2^2, \sqrt{2}x_1 x_2)$. + + + +* **Cover's Theorem:** This theorem states that as the dimensionality of the feature space increases, the "power" of the linear method increases, making it more likely to find a linear separator. + * **Strategy:** Apply a mapping function $\Phi$ to the original data, then find a linear classifier in that high-dimensional space. + +--- + +### 9. 
The Kernel Trick +Directly computing the mapping $\Phi(x)$ can be computationally expensive or impossible (e.g., infinite dimensions). The **Kernel Trick** allows us to compute the similarity in the high-dimensional space using only the original low-dimensional vectors. + +* **Definition:** A Kernel function $K(x, y)$ calculates the inner product of the mapped vectors: + $$K(x, y) = \Phi(x)^T \Phi(y)$$ + +* **Efficiency:** The result is a scalar value calculated without knowing the explicit form of $\Phi$. + +* **Derivation Example (Polynomial Kernel):** + For 2D vectors $x$ and $y$, consider the kernel $K(x, y) = (x^T y)^2$. + $$(x^T y)^2 = (x_1 y_1 + x_2 y_2)^2 = x_1^2 y_1^2 + x_2^2 y_2^2 + 2x_1 y_1 x_2 y_2$$ + This is mathematically equivalent to the dot product of two mapped vectors where: + $$\Phi(x) = (x_1^2, x_2^2, \sqrt{2}x_1 x_2)$$ + Thus, calculating $(x^T y)^2$ in the original space is equivalent to calculating similarity in the 3D space defined by $\Phi$. + +--- + +### 10. Mercer's Theorem & Positive Definite Functions +How do we know if a function $K(x, y)$ is a valid kernel? **Mercer's Theorem** provides the condition. + +* **The Theorem:** If a function $K(x, y)$ is **Positive Definite (P.D.)**, then there *always* exists a mapping function $\Phi$ such that $K(x, y) = \Phi(x)^T \Phi(y)$. +* **Implication:** We can choose any P.D. function as our kernel and be guaranteed that it corresponds to some high-dimensional space, without needing to derive $\Phi$ explicitly. + +#### **Positive Definiteness (Matrix Definition)** +To check if a kernel is P.D., we analyze the Kernel Matrix (Gram Matrix) constructed from data points. +* For any non-zero vector $z$, a matrix $M$ is P.D. if $z^T M z > 0$ for all $z$. +* **Eigenvalue Condition:** A matrix is P.D. if and only if **all of its eigenvalues are positive**. + +--- + +### 11. Infinite Dimensionality (RBF Kernel) +The lecture briefly touches upon the exponential (Gaussian/RBF) kernel. +* The exponential function can be expanded using a Taylor Series into an infinite sum. +* This implies that using an exponential-based kernel is equivalent to mapping the data into an **infinite-dimensional space**. +* Even though the dimension is infinite, the calculation $K(x, y)$ remains a simple scalar operation in the original space. + +--- + +### 12. Final SVM Formulation with Kernels +By applying the Kernel Trick, the SVM formulation is generalized to non-linear problems. + +* **Dual Objective:** Replace $x_i^T x_j$ with $K(x_i, x_j)$: + $$\text{Maximize: } \sum \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j K(x_i, x_j)$$ + +* **Decision Rule:** For a new test point $x'$, the classification is determined by: + $$\sum \alpha_i y_i K(x_i, x') - b \ge 0$$ + +**Next Lecture:** The course will move on to Generative Methods (probability methods). diff --git a/final/1106.md b/final/1106.md new file mode 100644 index 0000000..4bf4760 --- /dev/null +++ b/final/1106.md @@ -0,0 +1,92 @@ +# Lecture Summary: Generative Methods & Probability Review + +**Date:** 2025.11.06 +**Topic:** Discriminative vs. Generative Models, Probability Theory, Probabilistic Inference, and Gaussian Distributions. + +--- + +### 1. Classification Approaches: Discriminative vs. Generative + +The lecture begins by distinguishing between two fundamental approaches to machine learning classification, specifically for binary problems (labels 0 or 1). 
+ +#### **Discriminative Methods (e.g., Logistic Regression)** +* **Goal:** Directly model the decision boundary or the conditional probability $P(y|x)$. +* **Mechanism:** Focuses on distinguishing classes. It learns a function that maps inputs $x$ directly to class labels $y$. +* **Limitation:** It does not model the underlying distribution of the data itself. + +#### **Generative Methods** +* **Goal:** Model the joint probability or the class-conditional density $P(x|y)$ and the class prior $P(y)$. +* **Mechanism:** It learns "how the data is generated" for each class. +* **Classification:** To classify a new point, it uses **Bayes' Rule** to invert the probabilities: + $$P(y|x) = \frac{P(x|y)P(y)}{P(x)}$$ +* **Advantage:** If you know the generative model, you can solve the classification problem *and* generate new data samples. + +--- + +### 2. Probability Theory Review + +To understand Generative Methods, a strong foundation in probability is required. + +#### **Random Variables** +* **Definition:** A random variable is technically a **function** (mapping) that assigns a real number to an outcome (event $\omega$) in the sample space $\Omega$. +* **Example:** Tossing a coin 4 times. An event might be "HHTH", and the random variable $X(\omega)$ could be "number of heads" (which equals 3). + +#### **Probability vs. Probability Density Function (PDF)** +The lecture emphasizes distinguishing between discrete probability ($P$) and continuous density ($p$). + +* **Discrete Probability ($P$):** Defined as the ratio of cardinalities (counts) or areas in discrete sets (e.g., Venn diagrams). + * **Probability Density Function ($p$):** Used for continuous variables. + * **Properties:** $p(x) \ge 0$ for all $x$, and $\int p(x)dx = 1$. + * **Relationship:** The probability of $x$ falling within a range is the **integral** (area under the curve) of the PDF. The probability of a specific point $P(x=x_0)$ is 0. + +#### **Key Statistics** +* **Expectation ($E[x]$):** The mean or weighted average of a random variable. + $$E[x] = \int x p(x) dx$$ +* **Covariance:** Measures the spread or variance of the data. For vectors, this results in a Covariance Matrix. + $$Cov[x] = E[(x - \mu)(x - \mu)^T]$$ + +--- + +### 3. The Trinity of Distributions: Joint, Conditional, and Marginal + +Understanding the relationship between these three is crucial for probabilistic modeling. + +#### **Joint PDF ($P(x_1, x_2)$)** +* This represents the probability of $x_1$ and $x_2$ occurring together. +* **Importance:** If you know the Joint PDF, you know *everything* about the system. You can derive all other probabilities (marginal, conditional) from it. + +#### **Conditional PDF ($P(x_1 | x_2)$)** +* Represents the probability of $x_1$ given that $x_2$ is fixed to a specific value. +* Visually, this is like taking a "slice" of the joint distribution 3D surface at $x_2 = a$. + +#### **Marginal PDF ($P(x_1)$)** +* Represents the probability of $x_1$ regardless of $x_2$. +* **Calculation:** You "marginalize out" (integrate or sum) the other variables. + * Continuous: $P(x_1) = \int P(x_1, x_2) dx_2$. + * Discrete: Summing rows or columns in a probability table. + +--- + +### 4. Probabilistic Inference + +**Inference** is defined as calculating a desired probability (e.g., a prediction) starting from the Joint Probability function using rules like Bayes' theorem and marginalization. 
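+
+As a small, concrete illustration of these rules, the sketch below builds a toy discrete joint table and recovers a marginal and a conditional from it. The table values are hypothetical (not from the lecture), and numpy is assumed purely for convenience.
+
+```python
+import numpy as np
+
+# Hypothetical joint distribution P(x1, x2) over two binary variables,
+# stored so that joint[a, b] = P(x1 = a, x2 = b). Entries sum to 1.
+joint = np.array([[0.30, 0.10],
+                  [0.20, 0.40]])
+
+# Marginal P(x1): "marginalize out" x2 by summing over its values.
+p_x1 = joint.sum(axis=1)                     # -> [0.40, 0.60]
+
+# Conditional P(x2 | x1 = 1): slice the joint at x1 = 1 and renormalize.
+p_x2_given_x1_is_1 = joint[1, :] / p_x1[1]   # -> [0.333..., 0.666...]
+
+print("P(x1)        :", p_x1)
+print("P(x2 | x1=1) :", p_x2_given_x1_is_1)
+```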
+ +#### **Handling Missing Data** +A major practical benefit of generative models (Joint PDF modeling) over discriminative models (like Logistic Regression) is robust handling of missing data. +* **Scenario:** You have a model predicting disease ($y$) based on Age ($x_1$), Blood Pressure ($x_2$), and Oxygen ($x_3$). +* **Problem:** A patient arrives, but you cannot measure Age ($x_1$). A discriminative model might fail or require value imputation (guessing averages). +* **Probabilistic Solution:** You integrate (marginalize) out the missing variable $x_1$ from the joint distribution to get the probability based only on observed data: + $$P(y | x_2, x_3) = \frac{\int p(x_1, x_2, x_3, y) dx_1}{P(x_2, x_3)}$$. + +--- + +### 5. The Gaussian Distribution + +The lecture concludes with a review of the Gaussian (Normal) distribution, the most important function in AI/ML. + +* **Univariate Gaussian:** Defined by mean $\mu$ and variance $\sigma^2$. +* **Multivariate Gaussian:** Defined for a vector $x \in R^D$. + $$P(x) = \frac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2} (x-\mu)^T \Sigma^{-1} (x-\mu)\right)$$. +* **Parameters:** + * $\mu$: Mean vector ($D$-dimensional). + * $\Sigma$: Covariance Matrix ($D \times D$). It must be **Symmetric** and **Positive Definite**. diff --git a/final/1110.md b/final/1110.md new file mode 100644 index 0000000..2969aa5 --- /dev/null +++ b/final/1110.md @@ -0,0 +1,104 @@ +# Study Guide: Generative Methods & Multivariate Gaussian Distributions + +**Date:** 2025.12.01 +**Topic:** Generative vs. Discriminative Models, Multivariate Gaussian Properties, Conditional and Marginal Distributions. + +--- + +### **1. Generative vs. Discriminative Methods** + +The lecture begins by contrasting the new topic (Generative Methods) with previous topics (Discriminative Methods like Linear Regression, Logistic Regression, and SVM). + +* **Discriminative Methods (Separating):** + * These methods focus on finding a boundary (separating line or hyperplane) between classes. + * **Limitation:** They cannot generate new data samples because they do not model the data distribution; they only know the boundary. + * **Hypothesis:** They assume a linear line or function as the hypothesis to separate data. + +* **Generative Methods (Inferring Distribution):** + * **Goal:** To infer the **underlying distribution** (the rule or pattern) from which the data samples were drawn. + * **Assumption:** Data is not random; it follows a specific probabilistic structure (e.g., drawn from a distribution). + * **Capabilities:** Once the Joint Probability Distribution (underlying distribution) is known: + 1. **Classification:** Can be performed using Bayes' Rule. + 2. **Generation:** New samples can be created that follow the same patterns as the training data (e.g., generating new images or text). + + + +--- + +### **2. The Gaussian (Normal) Distribution** + +The Gaussian distribution is the most popular choice for modeling the "hypothesis" of the underlying distribution in generative models. + +#### **Why Gaussian?** +1. **Simplicity:** Defined entirely by two parameters: Mean ($\mu$) and Covariance ($\Sigma$). +2. **Central Limit Theorem:** Sums of independent random events tend to follow a Gaussian distribution. +3. **Mathematical "Closure":** The most critical reason for its use in AI is that **Conditional** and **Marginal** distributions of a Multivariate Gaussian are *also* Gaussian. 
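+
+As a quick sanity check of the Central Limit Theorem point above, the following sketch sums independent uniform draws and compares the empirical mean and variance of the sum against the Gaussian limit. The choice of uniform summands and the sample sizes are arbitrary; numpy is assumed.
+
+```python
+import numpy as np
+
+rng = np.random.default_rng(0)
+
+# Sum many independent Uniform(0, 1) draws; by the CLT the sum is ~ Gaussian.
+n_terms, n_samples = 30, 100_000
+sums = rng.uniform(0.0, 1.0, size=(n_samples, n_terms)).sum(axis=1)
+
+# Each Uniform(0, 1) term has mean 1/2 and variance 1/12, so the limiting
+# Gaussian has mean n_terms/2 and variance n_terms/12.
+print("empirical mean:", sums.mean(), " theoretical:", n_terms * 0.5)
+print("empirical var :", sums.var(),  " theoretical:", n_terms / 12)
+```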
+ +#### **Multivariate Gaussian Definition** +For a $D$-dimensional vector $x$: +$$P(x) = \frac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2} (x-\mu)^T \Sigma^{-1} (x-\mu)\right)$$ +* $\mu$: Mean vector ($D$-dimensional). +* $\Sigma$: Covariance Matrix ($D \times D$). + + + +[Image of multivariate gaussian distribution 3d plot] + + +#### **Properties of the Covariance Matrix ($\Sigma$)** +* **Symmetric:** $\Sigma_{ij} = \Sigma_{ji}$. +* **Positive Definite:** All eigenvalues are positive. +* **Diagonal Terms:** Represent the variance of individual variables. +* **Off-Diagonal Terms:** Represent the correlation (covariance) between variables. + * If $\sigma_{12} = 0$, the variables are **independent** (for Gaussians). + * The matrix shape determines the geometry of the distribution contours (spherical vs. elliptical). + +--- + +### **3. Independence and Factorization** + +If the Covariance Matrix is **diagonal** (all off-diagonal elements are 0), the variables are independent. +* Mathematically, the inverse matrix $\Sigma^{-1}$ is also diagonal. +* The joint probability factorizes into the product of marginals: + $$P(x_1, x_2) = P(x_1)P(x_2)$$ +* The "quadratic form" inside the exponential splits into a sum of separate squared terms. + +--- + +### **4. Conditional Gaussian Distribution** + +The lecture derives what happens when we observe a subset of variables (e.g., $x_2$) and want to determine the distribution of the remaining variables ($x_1$). This is $P(x_1 | x_2)$. + +* **Concept:** Visually, this is equivalent to "slicing" the joint distribution at a specific value of $x_2$ (fixed constant). +* **Result:** The resulting cross-section is **also a Gaussian distribution**. +* **Parameters:** If we partition $x$, $\mu$, and $\Sigma$ into subsets, the conditional mean ($\mu_{1|2}$) and covariance ($\Sigma_{1|2}$) are given by: + * **Mean:** $\mu_{1|2} = \mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2)$. + * **Covariance:** $\Sigma_{1|2} = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}$. + *(Note: The derivation involves completing the square to identify the Gaussian form).* + + + +--- + +### **5. Marginal Gaussian Distribution** + +The lecture explains how to find the distribution of a subset of variables ($x_1$) by ignoring the others ($x_2$). This is $P(x_1)$. + +* **Concept:** This is equivalent to integrating out the unobserved variables: + $$P(x_1) = \int P(x_1, x_2) dx_2$$ +* **Result:** The marginal distribution is **also a Gaussian distribution**. +* **Parameters:** Unlike the conditional case, calculating the marginal parameters is trivial. You simply select the corresponding sub-vector and sub-matrix from the joint parameters. + * Mean: $\mu_1$. + * Covariance: $\Sigma_{11}$. + + + +### **Summary Table** + +| Distribution | Type | Parameters Derived From Joint $(\mu, \Sigma)$ | +| :--- | :--- | :--- | +| **Joint** $P(x)$ | Gaussian | Given as $\mu, \Sigma$ | +| **Conditional** $P(x_1 \| x_2)$ | Gaussian | Complex formula (involves matrix inversion of $\Sigma_{22}$) | +| **Marginal** $P(x_1)$ | Gaussian | Simple subset (extract $\mu_1$ and $\Sigma_{11}$) | + +The lecture concludes by emphasizing that understanding these Gaussian properties is essential for the second half of the semester, as they form the basis for probabilistic generative models. 
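+
+As a minimal numerical companion to the conditional and marginal formulas above, the sketch below partitions a hypothetical 2D Gaussian (block 1 = $x_1$, block 2 = $x_2$) and evaluates $\mu_{1|2}$, $\Sigma_{1|2}$, $\mu_1$, and $\Sigma_{11}$. All numbers are made up for illustration; numpy is assumed.
+
+```python
+import numpy as np
+
+# Hypothetical joint Gaussian over (x1, x2); the values are illustrative only.
+mu = np.array([1.0, 2.0])
+Sigma = np.array([[2.0, 0.8],
+                  [0.8, 1.0]])
+
+mu1, mu2 = mu[:1], mu[1:]
+S11, S12 = Sigma[:1, :1], Sigma[:1, 1:]
+S21, S22 = Sigma[1:, :1], Sigma[1:, 1:]
+
+# Conditional p(x1 | x2 = a): Gaussian, via the partition formulas above.
+a = np.array([2.5])
+mu_cond = mu1 + S12 @ np.linalg.solve(S22, a - mu2)     # mu_{1|2}
+Sigma_cond = S11 - S12 @ np.linalg.solve(S22, S21)      # Sigma_{1|2}
+
+# Marginal p(x1): Gaussian as well; simply read off the sub-blocks.
+mu_marg, Sigma_marg = mu1, S11
+
+print("conditional:", mu_cond, Sigma_cond)   # mean 1.4, variance 1.36
+print("marginal   :", mu_marg, Sigma_marg)   # mean 1.0, variance 2.0
+```
+
+Note how the marginal requires no computation at all, while the conditional only needs the inverse of the observed block $\Sigma_{22}$, exactly as the summary table above indicates.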
diff --git a/final/1113.md b/final/1113.md new file mode 100644 index 0000000..d1cafb4 --- /dev/null +++ b/final/1113.md @@ -0,0 +1,85 @@ +# Study Guide: Learning in Generative Methods & Bayes Optimal Classifier + +**Date:** 2025.11.13 +**Topic:** Maximum Likelihood Estimation (MLE), Missing Data Handling, and Bayes Optimal Classifier. + +--- + +### **1. Overview: Learning in Generative Methods** +The fundamental goal of generative methods is to **estimate the underlying distribution** of the data. Unlike discriminative methods (e.g., Logistic Regression, SVM) which focus on finding a separating boundary, generative methods introduce a probabilistic model and learn its parameters. + +* **Discriminative Model:** Learns specific parameters (like $w, b$ in linear models) to separate classes. +* **Generative Model:** Learns parameters (like $\mu, \Sigma$ in Gaussian models) that best describe how the data is distributed. + +#### **Why Gaussian?** +The Gaussian distribution is the standard model for generative methods because of its mathematical convenience: **both its conditional and marginal distributions are also Gaussian**. This property simplifies probabilistic inference significantly. + + + +[Image of multivariate gaussian distribution 3d plot] + + +--- + +### **2. The Learning Process: Parameter Estimation** +"Learning" in this context means finding the best parameters ($\mu, \Sigma$) for the Gaussian model given the training data. + +#### **Step 1: Define the Objective Function** +We need a metric to evaluate how well our model fits the data. The core idea is **Likelihood**: +* **Goal:** We want to assign **high probability** to the observed (empirical) data points. +* **Likelihood Function:** For independent data points, the likelihood is the product of their individual probabilities. + $$P(Z | \mu, \Sigma) = \prod_{i=1}^{N} P(z_i | \mu, \Sigma)$$ + +#### **Step 2: Log-Likelihood (MLE)** +Directly maximizing the product is difficult. We apply the **logarithm** to convert the product into a sum, creating the **Log-Likelihood** function. This does not change the location of the maximum. +* **Objective:** Maximize $\sum_{i=1}^{N} \ln P(z_i | \mu, \Sigma)$. + +#### **Step 3: Optimization (Derivation)** +We calculate the partial derivatives of the log-likelihood function with respect to the parameters and set them to zero to find the maximum. + +* **Optimal Mean ($\hat{\mu}$):** + The derivation yields the **Empirical Mean**. It is simply the average of the data points. + $$\hat{\mu} = \frac{1}{N} \sum_{i=1}^{N} z_i$$ + +* **Optimal Covariance ($\hat{\Sigma}$):** + The derivation yields the **Empirical Covariance**. + $$\hat{\Sigma} = \frac{1}{N} \sum_{i=1}^{N} (z_i - \hat{\mu})(z_i - \hat{\mu})^T$$ + +**Conclusion:** The "learning" for a Gaussian generative model is simply calculating the sample mean and sample covariance of the training data. This is a closed-form solution, meaning no iterative updates are strictly necessary. + +--- + +### **3. Inference: Making Predictions** +Once the joint distribution $P(z)$ (where $z$ contains both input features $x$ and class labels $y$) is learned, we can perform inference. + +#### **Classification** +To classify a new data point $x_{new}$: +1. We aim to calculate the conditional probability $P(y | x_{new})$. +2. Using the properties of the multivariate Gaussian, we treat the label $y$ as just another dimension in the random vector. +3. We calculate probabilities for each class and compare them (e.g., $P(y=1 | x)$ vs $P(y=0 | x)$). 
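+
+The closed-form learning step from Section 2 amounts to two lines of code. The sketch below fits $\hat{\mu}$ and $\hat{\Sigma}$ to a synthetic data set; the "true" parameters used to generate the data are hypothetical and only serve to check that the estimates come out close. numpy is assumed.
+
+```python
+import numpy as np
+
+rng = np.random.default_rng(0)
+
+# Hypothetical training set: N samples of a D = 3 dimensional vector z
+# (in the lecture, z would stack the input features x and the label y).
+true_mu = np.array([1.0, -2.0, 0.5])
+true_Sigma = np.array([[1.0, 0.3, 0.0],
+                       [0.3, 2.0, 0.4],
+                       [0.0, 0.4, 1.5]])
+Z = rng.multivariate_normal(true_mu, true_Sigma, size=1000)
+
+# MLE = empirical mean and empirical covariance (note the 1/N, not 1/(N-1)).
+N = Z.shape[0]
+mu_hat = Z.mean(axis=0)                       # (1/N) * sum_i z_i
+centered = Z - mu_hat
+Sigma_hat = centered.T @ centered / N         # (1/N) * sum_i (z_i - mu)(z_i - mu)^T
+
+print("mu_hat   :", mu_hat)          # close to true_mu
+print("Sigma_hat:\n", Sigma_hat)     # close to true_Sigma
+```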
+ +#### **Handling Missing Data** +Generative models offer a theoretically robust way to handle missing variables. +* **Scenario:** We have inputs $x = [x_1, x_2]$, but $x_2$ is missing during inference. +* **Method:** **Marginalization**. + 1. Start with the Joint PDF. + 2. Integrate (marginalize) out the missing variable $x_2$. + $$P(y | x_1) = \frac{\int P(x_1, x_2, y) dx_2}{P(x_1)}$$ + 3. Because the model is Gaussian, marginalization is trivial: simply select the sub-vector and sub-matrix corresponding to the observed variables. +* This is superior to heuristic methods like imputing the mean. + +--- + +### **4. Bayes Optimal Classifier** +The lecture introduces the concept of the theoretical "perfect" classifier. + +* **Definition:** The **Bayes Optimal Classifier** is the ideal classifier that would exist if we knew the *true* underlying distribution of the data. +* **Decision Rule:** It assigns the class with the highest posterior probability $P(C_k | x)$. + $$P_1(x_{new}) \ge P_2(x_{new}) \rightarrow \text{Class 1}$$ + +#### **Bayes Error** +* Even the optimal classifier has an irreducible error called the **Bayes Error**. +* **Cause:** Classes often overlap in the feature space. In the overlapping regions, even the best decision rule will make mistakes with some probability. +* **Implication:** No machine learning algorithm can genuinely achieve 0% error (100% accuracy) on non-trivial problems. The goal of ML is to approximate the Bayes Error limit. +* **Mathematical Definition:** The error is the integral of the minimum probability density over the overlapping region: + $$\text{Error} = \int \min[P(C_1|x), P(C_2|x)] dx$$ diff --git a/final/1117.md b/final/1117.md new file mode 100644 index 0000000..37da912 --- /dev/null +++ b/final/1117.md @@ -0,0 +1,99 @@ +# Study Guide: Bayes Optimal Classifier & Graphical Models (Bayesian Networks) + +**Date:** 2025.11.17 +**Topic:** Bayes Error, Graphical Models (Directed), Conditional Independence, and D-separation. + +--- + +### **1. Recap: Bayes Optimal Classifier and Bayes Error** + +The lecture begins by revisiting the concept of the **Bayes Optimal Classifier**. +* **Decision Rule:** For a new data point $x_{new}$, the classifier compares the posterior probabilities $P(C_1 | x_{new})$ and $P(C_2 | x_{new})$. It assigns the label associated with the higher probability. +* **Optimality:** No other classifier can outperform the Bayes Optimal Classifier. It achieves the theoretical minimum error rate. + +#### **Bayes Error (Irreducible Error)** +* **Definition:** Even the perfect classifier will make mistakes because classes often overlap in the feature space. This inevitable error is called the **Bayes Error**. +* **Cause:** It arises from inherent uncertainty, noise, or ambiguity in the data itself, not from the classifier's limitations. +* **Goal of ML:** The objective of any machine learning algorithm is not to achieve 0% error (which is impossible) but to approach the Bayes Error limit as closely as possible. +* **Formula:** The risk (expected error) is the integral of the minimum probability over the domain: + $$R^* = \int \min[P_1(x), P_2(x)] dx$$ + If priors are equal, this simplifies to the integral of the overlap region. + +--- + +### **2. Introduction to Graphical Models** + +The focus shifts to **Generative Models**, specifically **Graphical Models** (also known as Bayesian Networks). + +* **Motivation:** + * A full multivariate Gaussian model requires estimating a **Covariance Matrix** with $D \times D$ elements. 
+ * The number of parameters grows quadratically ($O(D^2)$), which corresponds to $\frac{D(D+1)}{2}$ parameters. + * For high-dimensional data (like images with millions of pixels), estimating these parameters requires an enormous amount of data, which is often infeasible. +* **Solution:** Use **Prior Knowledge** to simplify the model. If we know that certain variables are independent, we can set their covariance terms to zero, significantly reducing the number of parameters to learn. + +--- + +### **3. The Chain Rule and Independence** + +Graphical models leverage the **Chain Rule of Probability** to decompose a complex joint distribution into simpler conditional probabilities. + +* **General Chain Rule:** + $$P(x_1, ..., x_D) = P(x_1) P(x_2|x_1) P(x_3|x_1, x_2) ... P(x_D|x_1...x_{D-1})$$ +* **Simplification with Independence:** + If variable $x_3$ depends only on $x_1$ and is independent of $x_2$, then $P(x_3|x_1, x_2)$ simplifies to $P(x_3|x_1)$. +* **Structure:** This creates a **Directed Acyclic Graph (DAG)** (or Bayes Network) where: + * **Nodes** represent random variables. + * **Edges (Arrows)** represent conditional dependencies (causality). + + + +--- + +### **4. Building a Bayesian Network (Causal Graph)** + +The lecture illustrates this with a practical example involving a crying baby. + +* **Scenario:** We want to model the causes of a baby crying. +* **Variables:** + * **Cry:** The observable effect. + * **Hungry, Sick, Diaper:** Direct causes of crying. + * **Pororo:** A distractor (e.g., watching a cartoon) that might stop the crying. +* **Dependencies:** + * "Hungry" and "Sick" might be independent of each other generally. + * "Cry" depends on all of them. + * "Pororo" depends on "Cry" (parent turns on TV *because* baby is crying) or affects "Cry". + + + +--- + +### **5. The Three Canonical Patterns of Independence** + +To understand complex graphs, we decompose them into three fundamental 3-node patterns. Understanding these patterns allows us to determine if variables are independent given some evidence. + +#### **1. Tail-to-Tail (Common Cause)** +* **Structure:** $X \leftarrow Z \rightarrow Y$ (Z causes both X and Y). +* **Property:** $X$ and $Y$ are dependent. However, if $Z$ is observed (given), $X$ and $Y$ become **independent**. +* **Example:** If $Z$ (Cause) determines both $X$ and $Y$, knowing $Z$ explains the correlation, decoupling $X$ and $Y$. + +#### **2. Head-to-Tail (Causal Chain)** +* **Structure:** $X \rightarrow Z \rightarrow Y$ (X causes Z, which causes Y). +* **Property:** $X$ and $Y$ are dependent. If $Z$ is observed, the path is blocked, and $X$ and $Y$ become **independent**. +* **Example:** $X$ influences $Y$ only through $Z$. If $Z$ is fixed, $X$ cannot influence $Y$ further. + +#### **3. Head-to-Head (Common Effect / V-Structure)** +* **Structure:** $X \rightarrow Z \leftarrow Y$ (X and Y both cause Z). +* **Property:** **Crucial Difference.** $X$ and $Y$ are naturally **independent** (marginal independence). However, if $Z$ is observed (or a descendant is observed), they become **dependent** ("explaining away"). +* **Example:** $X$ (Hungry) $\rightarrow$ $Z$ (Cry) $\leftarrow$ $Y$ (Sick). + * Being hungry tells us nothing about being sick (Independent). + * But if we *know* the baby is crying ($Z$ observed): finding out the baby is Hungry ($X$) makes it less likely they are Sick ($Y$). The causes compete to explain the effect. + +--- + +### **6. 
D-Separation** + +These rules form the basis of **D-separation** (Directed Separation), a formal method to determine conditional independence in any directed graph. +* If all paths between two variables are "blocked" by the evidence set, the variables are D-separated (independent). +* A path is blocked if: + * It contains a chain or fork where the middle node is **observed**. + * It contains a collider where the middle node (and all its descendants) are **NOT observed**. diff --git a/final/1120.md b/final/1120.md new file mode 100644 index 0000000..fe6da78 --- /dev/null +++ b/final/1120.md @@ -0,0 +1,79 @@ +# Lecture Summary: Directed Graphical Models and Naive Bayes + +**Date:** 2025.11.20 +**Topic:** Parameter Reduction, Directed Graphical Models, Chain Rule, and Naive Bayes Classifier. + +--- + +### **1. Motivation: The Need for Parameter Reduction** +The lecture begins by reviewing Generative Methods using the Gaussian distribution. +* **The Problem:** In high-dimensional settings (e.g., analyzing images or complex biological data), estimating the full Joint Probability Distribution is computationally expensive and data-intensive. + * For a $D$-dimensional Multivariate Gaussian, we must estimate the mean vector $\mu$ ($D$ parameters) and the Covariance Matrix $\Sigma$ (symmetric $D \times D$ matrix). + * The total number of parameters is roughly $O(D^2)$, specifically $D + \frac{D(D+1)}{2}$. + * For large $D$, this requires a massive amount of training data to avoid overfitting. +* **The Solution:** We use **Prior Knowledge** (domain knowledge) about the relationships between variables to reduce the number of parameters. + * By assuming certain variables are independent, we can decompose the complex joint distribution into smaller, simpler conditional distributions. + +--- + +### **2. Directed Graphical Models (Bayesian Networks)** +A Directed Graphical Model represents random variables as nodes in a graph, where edges denote conditional dependencies. + +#### **Decomposition via Chain Rule** +* The joint probability $P(x)$ can be decomposed using the chain rule: + $$P(x_1, ..., x_D) = \prod_{i=1}^{D} P(x_i | \text{parents}(x_i))$$ +* **Example Structure:** + If we have a graph where $x_1$ has no parents, $x_2$ depends on $x_1$, etc., the joint distribution splits into: + $$P(x) = P(x_1)P(x_2|x_1)P(x_3|x_1)...$$ + +#### **Parameter Counting Example (Gaussian Case)** +The lecture compares the number of parameters required for a "Full" Gaussian model vs. a "Reduced" Graphical Model. +* **Full Gaussian:** Assumes all variables are correlated. + * For a 10-dimensional vector ($D=10$), parameters = $10 + \frac{10 \times 11}{2} = 65$. +* **Reduced Model:** Uses a graph structure where variables are conditionally independent. + * Instead of one giant covariance matrix, we estimate parameters for several smaller conditional distributions (often univariate Gaussians). + * **Calculation:** For a univariate conditional Gaussian $P(x_i | x_j)$, we need parameters for the linear relationship (mean coefficients) and variance. + * In the specific example provided, the parameters reduced from 65 to 57. While the reduction in this small example is modest, for high-dimensional data with sparse connections, the reduction is drastic. + +--- + +### **3. The Naive Bayes Classifier** +The **Naive Bayes** classifier is the most extreme (and popular) example of a Directed Graphical Model used for parameter reduction. 
+ +* **Assumption:** Given the class label $y$, all input features $x_1, ..., x_D$ are **mutually independent**. +* **Structure:** The class $y$ is the parent of all feature nodes $x_i$. There are no connections between the features themselves. +* **Formula:** + $$P(x|y) = P(x_1|y) P(x_2|y) \cdot ... \cdot P(x_D|y) = \prod_{d=1}^{D} P(x_d|y)$$ +* **Advantage:** We only need to estimate the distribution of each feature individually, rather than their complex joint interactions. + +--- + +### **4. Application: Spam Classifier** +The lecture applies the Naive Bayes framework to a discrete problem: classifying emails as **Spam ($y=1$)** or **Not Spam ($y=0$)**. + +#### **Feature Engineering** +* **Input:** Emails with varying text lengths. +* **Transformation:** A "Bag of Words" approach is used. + 1. Create a dictionary of $N$ words (e.g., $N=10,000$). + 2. Represent each email as a fixed-length binary vector $x \in \{0, 1\}^{10,000}$. + 3. $x_i = 1$ if the $i$-th word appears in the email, $0$ otherwise. + +#### **The "Curse of Dimensionality" (Without Naive Bayes)** +* Since the features are discrete (binary), we cannot use Gaussian distributions. We must use probability tables. +* If we tried to model the full joint distribution $P(x_1, ..., x_{10000} | y)$, we would need a probability table for every possible combination of words. +* **Parameter Count:** $2^{10,000}$ entries. This is computationally impossible. + +#### **Applying Naive Bayes** +* By assuming word independence given the class, we decompose the problem: + $$P(x|y) \approx \prod_{i=1}^{10,000} P(x_i|y)$$ +* **Parameter Estimation:** + * We only need to estimate $P(x_i=1 | y=1)$ and $P(x_i=1 | y=0)$ for each word. + * This requires simply counting the frequency of each word in Spam vs. Non-Spam emails. +* **Reduced Parameter Count:** + * Instead of $2^{10,000}$, we need roughly $2 \times 10,000$ parameters (one probability per word per class). + * This transforms an impossible problem into a highly efficient and simple one. + +### **5. Summary** +* **Generative Methods** aim to model the underlying distribution $P(x, y)$. +* **Graphical Models** allow us to inject prior knowledge (independence assumptions) to make this feasible. +* **Naive Bayes** assumes full conditional independence, reducing parameter estimation from exponential to linear complexity, making it ideal for high-dimensional discrete data like text classification. diff --git a/final/1124.md b/final/1124.md new file mode 100644 index 0000000..b8041ae --- /dev/null +++ b/final/1124.md @@ -0,0 +1,85 @@ +# Study Guide: Discrete Probability Models & Undirected Graphical Models + +**Date:** 2025.11.24 +**Topic:** Multinomial Distribution, Maximum Likelihood Estimation (MLE), and Markov Random Fields (Undirected Graphical Models). + +--- + +### **1. Discrete Probability Distributions** +The lecture shifts focus from continuous models (like Gaussian) to discrete models, which are essential for tasks like text classification (e.g., Naive Bayes). + +#### **Binomial Distribution** +* **Scenario:** A coin toss (Binary outcome: Head/Tail). +* **Random Variables:** $m_1$ (count of Heads), $m_2$ (count of Tails). +* **Parameters:** Probability of Head ($\mu$) and Tail ($1-\mu$). +* **Formula:** For a sequence of tosses, we consider the number of ways to arrange the outcomes. + $$P(m_1, m_2) = \frac{N!}{m_1!m_2!} \mu^{m_1} (1-\mu)^{m_2}$$ + +#### **Multinomial Distribution** +* **Scenario:** Rolling a die with $K$ faces (e.g., $K=6$). 
This generalizes the binomial distribution. +* **Definition:** + * We have $N$ total events (trials). + * We observe counts $m_1, m_2, ..., m_k$ for each of the $K$ possible outcomes. + * Parameters $\mu_1, ..., \mu_k$ represent the probability of each outcome. +* **Probability Mass Function:** + $$P(m_1, ..., m_k | \mu) = \frac{N!}{m_1! ... m_k!} \prod_{k=1}^{K} \mu_k^{m_k}$$ + +--- + +### **2. Learning: Maximum Likelihood Estimation (MLE)** +How do we estimate the parameters ($\mu_k$) from data? + +* **Goal:** Maximize the likelihood of the observed data subject to the constraint that probabilities sum to 1 ($\sum \mu_k = 1$). +* **Method:** **Lagrange Multipliers**. + 1. **Objective:** Maximize Log-Likelihood: + $$L = \ln(N!) - \sum \ln(m_k!) + \sum m_k \ln(\mu_k)$$ + 2. **Constraint:** $\sum_{k=1}^{K} \mu_k - 1 = 0$. + 3. **Lagrangian:** + $$L' = \sum_{k=1}^{K} m_k \ln(\mu_k) + \lambda (\sum_{k=1}^{K} \mu_k - 1)$$ + (Note: Constant terms like $N!$ vanish during differentiation). + 4. **Derivation:** Taking the derivative w.r.t $\mu_k$ and setting to 0 yields $\mu_k = - \frac{m_k}{\lambda}$. Solving for $\lambda$ using the constraint gives $\lambda = -N$. + +* **Result:** + $$\mu_k = \frac{m_k}{N}$$ + * The optimal parameter is simply the **empirical fraction** (count of specific events divided by total events). + * This provides the theoretical justification for the simple "counting" method used in the Naive Bayes classifier discussed in previous lectures. + +--- + +### **3. Undirected Graphical Models (Markov Random Fields)** + +When causal relationships (direction) are unclear or interactions are symmetric (e.g., neighboring pixels in an image, social network friends), we use **Undirected Graphs** instead of Bayesian Networks (Directed Acyclic Graphs). + +#### **Comparison** +* **Directed (Bayesian Network):** Uses conditional probabilities (e.g., $P(A|B)$). Represents causality or asymmetric relationships. +* **Undirected (Markov Random Field - MRF):** Uses "Potential Functions" ($\psi$). Represents correlation or symmetric constraints. + +#### **Conditional Independence in MRF** +Determining independence is simpler in undirected graphs than in directed graphs (no D-separation rules needed). +* **Global Markov Property:** Two sets of nodes are conditionally independent given a separating set if all paths between them pass through the separating set. + * *Example:* If nodes $X_1$ and $X_5$ are not directly connected, they are conditionally independent given the intermediate nodes (e.g., $X_3$) that block the path. + +--- + +### **4. Factorization in Undirected Graphs** + +Since we cannot use chain rules of conditional probabilities (because $P(A|B) \neq P(B|A)$ generally), we model the joint distribution using **Cliques**. + +#### **Cliques and Maximal Cliques** +* **Clique:** A subgraph where every pair of nodes is connected (fully connected). +* **Maximal Clique:** A clique that cannot be expanded by including any other adjacent node. + +#### **The Joint Distribution Formula** +We associate a **Potential Function** ($\psi_C$) with each maximal clique $C$. +$$P(x) = \frac{1}{Z} \prod_{C} \psi_C(x_C)$$ + +* **Potential Function ($\psi$):** A non-negative function that scores the compatibility of variables in a clique. It is *not* a probability (doesn't sum to 1). +* **Partition Function ($Z$):** The normalization constant required to make the total probability sum to 1. 
+ $$Z = \sum_x \prod_{C} \psi_C(x_C)$$ + +#### **Example Decomposition** +Given a graph with maximal cliques $\{x_1, x_2\}$, $\{x_1, x_3\}$, and $\{x_3, x_4, x_5\}$: +$$P(x) = \frac{1}{Z} \psi_{12}(x_1, x_2) \psi_{13}(x_1, x_3) \psi_{345}(x_3, x_4, x_5)$$ + +#### **Hammersley-Clifford Theorem** +This theorem provides the theoretical guarantee that a strictly positive distribution can satisfy the conditional independence properties of an undirected graph if and only if it can be factorized over the graph's cliques. diff --git a/final/1127.md b/final/1127.md new file mode 100644 index 0000000..f23d604 --- /dev/null +++ b/final/1127.md @@ -0,0 +1,63 @@ +# Study Guide: Undirected Graphical Models (Markov Random Fields) + +**Date:** 2025.11.27 +**Topic:** Potential Functions, Partition Function, and Conditional Independence in MRFs. + +--- + +### **1. Recap: Decomposition in Undirected Graphs** +Unlike Directed Graphical Models (Bayesian Networks) which use conditional probabilities, **Undirected Graphical Models (Markov Random Fields - MRFs)** cannot directly use probabilities because there is no direction/causality. Instead, they decompose the joint distribution based on **Maximal Cliques**. + +* **Cliques:** Subsets of nodes where every node is connected to every other node. +* **Maximal Clique:** A clique that cannot be expanded (e.g., in the example graph, the maximal cliques covers the graph). +* **Decomposition Rule:** The joint distribution is the product of functions defined over these maximal cliques. + +--- + +### **2. Potential Functions ($\psi$)** +* **Definition:** For each maximal clique $C$, we define a **Potential Function** $\psi_C(x_C)$ (often denoted as $\phi$ or $\psi$). + * It is a **positive function** ($\psi(x) \ge 0$) mapping the state of the clique variables to a real number. + * It represents the "compatibility" or "energy" of that configuration. +* **Key Distinction:** A potential function is **NOT a probability**. It does not sum to 1. It is just a score (non-negative function). + * *Example:* $\psi_{12}(x_1, x_2)$ scores the interaction between $x_1$ and $x_2$. + +--- + +### **3. The Partition Function ($Z$)** +Since the product of potential functions is not a probability distribution (it doesn't sum to 1), we must normalize it. + +* **Definition:** The normalization constant is called the **Partition Function** ($Z$). + $$Z = \sum_{x} \prod_{C} \psi_C(x_C)$$ +* **Role:** It ensures that the resulting distribution sums to 1, making it a valid probability distribution. + $$P(x) = \frac{1}{Z} \prod_{C} \psi_C(x_C)$$ +* **Calculation:** To find $Z$, we must sum the product of potentials over **all possible states** (combinations) of the random variables. This summation is often computationally expensive. + +#### **Example Calculation** +The lecture walks through a simple example with 4 binary variables and two cliques: $\{x_1, x_2, x_3\}$ and $\{x_3, x_4\}$. +* **Step 1:** Define potential tables for $\psi_{123}$ and $\psi_{34}$. +* **Step 2:** Calculate the score for every combination. +* **Step 3:** Sum all scores to get $Z$. In the example, $Z=10$. +* **Step 4:** The probability of any specific state (e.g., $P(1,0,0,0)$) is its specific score divided by $Z$ (e.g., $(1 \times 3)/10$ or similar depending on values). + +--- + +### **4. Parameter Estimation** +* **Discrete Case:** If variables are discrete (like the email spam example), the parameters are the entries in the potential tables. We estimate these values from data to maximize the likelihood. 
+* **Continuous Case:** If variables are continuous, potential functions are typically Gaussian distributions. We estimate means and covariances. +* **Reduction:** Just like in Bayesian Networks, using the graph structure reduces the number of parameters. + * *Without Graph:* A full table for 4 binary variables needs $2^4 = 16$ entries. + * *With Graph:* We only need tables for the cliques, significantly reducing complexity. + +--- + +### **5. Verifying Conditional Independence** +The lecture demonstrates analytically that the potential function formulation preserves the conditional independence properties of the graph. + +* **Scenario:** Graph with structure $x_1 - x_2 - x_3 - x_4$. + * Is $x_4$ independent of $x_1$ given $x_3$? +* **Analytical Check:** + * We calculate $P(x_4=1 | x_1=0, x_2=1, x_3=1)$. + * We also calculate $P(x_4=1 | x_1=0, x_2=0, x_3=1)$. +* **Result:** The calculation shows that as long as $x_3$ is fixed (given), the value of $x_1$ and $x_2$ cancels out in the probability ratio. + * $P(x_4|x_1, x_2, x_3) = \frac{\phi_1}{\phi_1 + \phi_0}$ (depends only on potentials involving $x_4$ and $x_3$). +* **Conclusion:** This confirms that $x_4 \perp \{x_1, x_2\} | x_3$. The formulation correctly encodes the global Markov property. diff --git a/final/1201.md b/final/1201.md new file mode 100644 index 0000000..c61a539 --- /dev/null +++ b/final/1201.md @@ -0,0 +1,102 @@ +# Study Guide: Bayesian Networks & Probabilistic Inference + +**Date:** 2025.12.01 (Final Lecture) +**Topic:** Bayesian Networks, Probabilistic Inference Examples, Marginalization. + +--- + +### **1. Recap: Directed vs. Undirected Models** +The lecture begins by briefly contrasting the two types of graphical models discussed: +* **Undirected Graphs (MRF):** Use potential functions ($\psi$) defined on maximal cliques. Requires a normalization constant (partition function $Z$) to become a probability distribution. +* **Directed Graphs (Bayesian Networks):** Use conditional probability distributions (CPDs). The joint distribution is the product of local conditional probabilities. + $$P(X) = \prod_{i} P(x_i | \text{parents}(x_i))$$ + +--- + +### **2. Example 1: The "Alarm" Network (Burglary/Earthquake)** +This is a classic example used to demonstrate inference in Bayesian Networks. + +#### **Scenario & Structure** +* **Nodes:** + * **B:** Burglary (Parent, no prior causes). + * **E:** Earthquake (Parent, no prior causes). + * **A:** Alarm (Triggered by Burglary or Earthquake). + * **J:** JohnCalls (Triggered by Alarm). + * **M:** MaryCalls (Triggered by Alarm). +* **Dependencies:** $B \rightarrow A \leftarrow E$, $A \rightarrow J$, $A \rightarrow M$. +* **Probabilities (Given):** + * $P(B) = 0.05$, $P(E) = 0.1$. + * $P(A|B, E)$: Table given (e.g., $P(A|B, \neg E) = 0.85$, $P(A|\neg B, \neg E) = 0.05$, etc.). + * $P(J|A) = 0.7$, $P(M|A) = 0.8$. + +#### **Task 1: Calculate a Specific Joint Probability** +Calculate the probability of the event: **Burglary, No Earthquake, Alarm rings, John calls, Mary does not call**. +$$P(B, \neg E, A, J, \neg M)$$ + +* **Decomposition:** Apply the Chain Rule based on the graph structure. + $$= P(B) \cdot P(\neg E) \cdot P(A | B, \neg E) \cdot P(J | A) \cdot P(\neg M | A)$$ +* **Calculation:** + $$= 0.05 \times 0.9 \times 0.85 \times 0.7 \times 0.2$$ + +#### **Task 2: Inference (Conditional Probability)** +Calculate the probability that a **Burglary occurred**, given that **John called** and **Mary did not call**. 
+$$P(B | J, \neg M)$$ + +* **Formula (Bayes Rule):** + $$P(B | J, \neg M) = \frac{P(B, J, \neg M)}{P(J, \neg M)}$$ + +* **Numerator Calculation ($P(B, J, \neg M)$):** + We must **marginalize out** the unknown variables ($A$ and $E$) from the joint distribution. + $$P(B, J, \neg M) = \sum_{A \in \{T,F\}} \sum_{E \in \{T,F\}} P(B, E, A, J, \neg M)$$ + This involves summing 4 terms (combinations of A and E). + +* **Denominator Calculation ($P(J, \neg M)$):** + We further marginalize out $B$ from the numerator result. + $$P(J, \neg M) = P(B, J, \neg M) + P(\neg B, J, \neg M)$$ + +--- + +### **3. Example 2: 4-Node Tree Structure** +A simpler example to demonstrate how sums simplify during marginalization. + +#### **Scenario & Structure** +* **Nodes:** $X_1, X_2, X_3, X_4 \in \{0, 1\}$ (Binary). +* **Dependencies:** + * $X_1 \rightarrow X_2$ + * $X_2 \rightarrow X_3$ + * $X_2 \rightarrow X_4$ +* **Decomposition:** $P(X) = P(X_1)P(X_2|X_1)P(X_3|X_2)P(X_4|X_2)$. +* **Given Tables:** Probabilities for all priors and conditionals are provided. + +#### **Task: Calculate Marginal Probability $P(X_3 = 1)$** +We need to find the probability of $X_3=1$ regardless of the other variables. + +* **Definition:** Sum the joint probability over all other variables ($X_1, X_2, X_4$). + $$P(X_3=1) = \sum_{x_1} \sum_{x_2} \sum_{x_4} P(x_1, x_2, x_3=1, x_4)$$ + +* **Step 1: Expand using Graph Structure** + $$= \sum_{x_1} \sum_{x_2} \sum_{x_4} P(x_1)P(x_2|x_1)P(X_3=1|x_2)P(x_4|x_2)$$ + +* **Step 2: Simplify (Key Insight)** + Move the summation signs to push them as far right as possible. The sum over $x_4$ only affects the last term $P(x_4|x_2)$. + $$= \sum_{x_1} \sum_{x_2} P(x_1)P(x_2|x_1)P(X_3=1|x_2) \left[ \sum_{x_4} P(x_4|x_2) \right]$$ + + * **Property:** $\sum_{x_4} P(x_4|x_2) = 1$ (Sum of probabilities for a variable given a condition is always 1). + * Therefore, the $X_4$ term vanishes. This makes sense intuitively: $X_4$ is a "leaf" node distinct from $X_3$; knowing nothing about it doesn't change $X_3$'s probability if $X_2$ is handled. + +* **Step 3: Final Calculation** + We are left with summing over $X_1$ and $X_2$: + $$= \sum_{x_1} \sum_{x_2} P(x_1)P(x_2|x_1)P(X_3=1|x_2)$$ + This expands to 4 terms (combinations of $x_1 \in \{0,1\}$ and $x_2 \in \{0,1\}$). + +--- + +### **4. Semester Summary & Conclusion** +The lecture concludes the semester's material. + +* **Key Themes Covered:** + * **Discriminative vs. Generative Methods:** The fundamental difference in approach (boundary vs. distribution). + * **Objective Functions:** Designing Loss functions vs. Likelihood functions. + * **Optimization:** Parameter estimation via derivatives (MLE). + * **Graphical Models:** Reducing parameter complexity using independence assumptions (Bayes Nets, MRFs). +* **Final Exam:** Scheduled for Thursday, December 11th. It will cover the concepts discussed, focusing on understanding the fundamentals (e.g., Likelihood, Generative principles) rather than rote memorization. 
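+
+As a closing illustration of Example 2, the sketch below computes $P(X_3 = 1)$ twice: by brute-force summation over the full joint, and by the simplified sum obtained after the $\sum_{x_4} P(x_4|x_2) = 1$ step. The two results agree. The conditional probability tables are hypothetical, since the lecture's actual numbers are not reproduced in these notes.
+
+```python
+import itertools
+
+# Hypothetical CPTs for the binary 4-node tree X1 -> X2 -> {X3, X4}.
+P_x1 = {0: 0.6, 1: 0.4}
+P_x2_given_x1 = {(0, 0): 0.7, (1, 0): 0.3,   # key = (x2, x1)
+                 (0, 1): 0.2, (1, 1): 0.8}
+P_x3_given_x2 = {(0, 0): 0.9, (1, 0): 0.1,   # key = (x3, x2)
+                 (0, 1): 0.4, (1, 1): 0.6}
+P_x4_given_x2 = {(0, 0): 0.5, (1, 0): 0.5,   # key = (x4, x2)
+                 (0, 1): 0.3, (1, 1): 0.7}
+
+def joint(x1, x2, x3, x4):
+    """Joint probability from the factorization P(x1)P(x2|x1)P(x3|x2)P(x4|x2)."""
+    return (P_x1[x1] * P_x2_given_x1[(x2, x1)]
+            * P_x3_given_x2[(x3, x2)] * P_x4_given_x2[(x4, x2)])
+
+# Brute force: marginalize x1, x2, x4 out of the full joint.
+p_x3_full = sum(joint(x1, x2, 1, x4)
+                for x1, x2, x4 in itertools.product([0, 1], repeat=3))
+
+# Simplified: sum_{x4} P(x4|x2) = 1, so the x4 factor drops out entirely.
+p_x3_short = sum(P_x1[x1] * P_x2_given_x1[(x2, x1)] * P_x3_given_x2[(1, x2)]
+                 for x1, x2 in itertools.product([0, 1], repeat=2))
+
+print(p_x3_full, p_x3_short)   # identical values
+```
+
+The same pattern (pushing the sums inward and dropping factors that sum to 1) is exactly the "Key Insight" in Step 2 above, and it is what keeps inference in tree-structured Bayesian networks tractable.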
diff --git a/final/AI_Lecture_note_1027.pdf b/final/AI_Lecture_note_1027.pdf new file mode 100644 index 0000000..9528ee1 --- /dev/null +++ b/final/AI_Lecture_note_1027.pdf @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:0de19c9741cf9727c433a0bb5ff4d6dc964f18fcb3f83a6460aacc0e18f95a73 +size 3822586 diff --git a/final/AI_Lecture_note_1030.pdf b/final/AI_Lecture_note_1030.pdf new file mode 100644 index 0000000..433ef1b --- /dev/null +++ b/final/AI_Lecture_note_1030.pdf @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:213b28289e9217e611f0a3f7f724490ebe46a3f46dab077a60a52ea2f78f7acc +size 4974234 diff --git a/final/AI_Lecture_note_1103.pdf b/final/AI_Lecture_note_1103.pdf new file mode 100644 index 0000000..aa73dae --- /dev/null +++ b/final/AI_Lecture_note_1103.pdf @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:c36fdb70b004399e3757fe1c1f829559acbda7027c88878be028b57a15804931 +size 3857954 diff --git a/final/AI_Lecture_note_1106.pdf b/final/AI_Lecture_note_1106.pdf new file mode 100644 index 0000000..75c3fdd --- /dev/null +++ b/final/AI_Lecture_note_1106.pdf @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:82560c38016d98be0d54aff881c898daf3127254a8e687c098ee604de574dc20 +size 5202704 diff --git a/final/AI_Lecture_note_1110.pdf b/final/AI_Lecture_note_1110.pdf new file mode 100644 index 0000000..4c83bbe --- /dev/null +++ b/final/AI_Lecture_note_1110.pdf @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:81bde77201036497774937cb91da60122b116e668cbb5e631ef2644497c5503b +size 5195473 diff --git a/final/AI_Lecture_note_1113.pdf b/final/AI_Lecture_note_1113.pdf new file mode 100644 index 0000000..3f44354 --- /dev/null +++ b/final/AI_Lecture_note_1113.pdf @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:fd978d86010265601a789d04c5f974d71a1d5c73ecaa1117ee3dc1cfbba886c0 +size 3923021 diff --git a/final/AI_Lecture_note_1117.pdf b/final/AI_Lecture_note_1117.pdf new file mode 100644 index 0000000..8abc774 --- /dev/null +++ b/final/AI_Lecture_note_1117.pdf @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:7659d9fb22418e13f7697a2689f5b7e23a7dc01823d1efc5830ec4c747c1a624 +size 4163676 diff --git a/final/AI_Lecture_note_1120.pdf b/final/AI_Lecture_note_1120.pdf new file mode 100644 index 0000000..439871f --- /dev/null +++ b/final/AI_Lecture_note_1120.pdf @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:24f3ae1edf3e406a9523a9c4ba27e5a905775d9945bd0249f5b434e40a56d3b2 +size 5361036 diff --git a/final/AI_Lecture_note_1124.pdf b/final/AI_Lecture_note_1124.pdf new file mode 100644 index 0000000..82e94da --- /dev/null +++ b/final/AI_Lecture_note_1124.pdf @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:a593bb3dbd242a303ef9ddf55d973d9ed1585f800452bd5e25d5e09fb00688f9 +size 3516978 diff --git a/final/AI_Lecture_note_1127.pdf b/final/AI_Lecture_note_1127.pdf new file mode 100644 index 0000000..75fbc0b --- /dev/null +++ b/final/AI_Lecture_note_1127.pdf @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:34a9756fce0bcf87e92d8f7055ffbfb327110ccddd85976be18b9b1621775125 +size 2381901 diff --git a/final/AI_Lecture_note_1201.pdf b/final/AI_Lecture_note_1201.pdf new file mode 100644 index 0000000..b8e816f --- /dev/null +++ b/final/AI_Lecture_note_1201.pdf @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:38eb9c14448ecb4988873f1631883bd778cecfe2d1d4cff9097db3f65972718b +size 3392169