
Support Vector Machines: Optimization, Dual Problem & Kernel Methods

Date: 2025.10.30 and 2025.11.03
Topic: SVM Dual Form, Lagrange Multipliers, Kernel Trick, Cover's Theorem, Mercer's Theorem


1. Introduction to SVM Mathematics

The lecture focuses on the fundamental mathematical concepts behind Support Vector Machines (SVM), specifically the Large Margin Classifier.

  • Goal: The objective is to understand the flow and connection of formulas rather than memorizing them.
  • Context: SVMs were the dominant model for a decade before deep learning and remain powerful for specific problem types.
  • Core Concept: The algorithm seeks to maximize the margin to ensure the most robust classifier.

2. General Optimization with Constraints

The lecture reviews and expands on the method of Lagrange multipliers for solving optimization problems with constraints.

  • Problem Setup: To minimize an objective function L(x) subject to constraints g(x) \ge 0, a new objective function (Lagrangian) is defined by combining the original function with the constraints using multipliers (\lambda).
  • KKT Conditions: The Karush-Kuhn-Tucker (KKT) conditions are introduced to solve this. There are two main solution cases:
    1. Feasible Region: The unconstrained minimum satisfies the constraint. Here, \lambda = 0.
    2. Boundary Case: The solution lies on the boundary where g(x) = 0. Here, \lambda > 0.
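
A minimal worked example (added for illustration, not verbatim from the lecture): minimize L(x) = x^2 subject to g(x) = x - 1 \ge 0. The Lagrangian is L'(x) = x^2 - \lambda(x - 1), with stationarity condition 2x - \lambda = 0.

  • Case \lambda = 0 (Feasible Region): stationarity gives x = 0, but g(0) = -1 < 0, so this case is rejected.
  • Case g(x) = 0 (Boundary): x = 1, and stationarity gives \lambda = 2x = 2 > 0, so the constrained minimum is x = 1.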

3. Multi-Constraint Example

A specific example is provided to demonstrate optimization with multiple constraints.

  • Objective: Minimize x_1^2 + x_2^2 subject to two linear constraints.
  • Lagrangian: The function is defined as L'(x) = L(x) - \lambda_1 g_1(x) - \lambda_2 g_2(x).
  • Solving Strategy: With two constraints, there are four possible combinations for \lambda values (both zero, one zero, or both positive).
    • The lecture demonstrates testing these cases. For instance, assuming \lambda_1 = \lambda_2 = 0 yields x_1 = 0, x_2 = 0, which violates the constraints.
    • The valid solution is found where the two constraints intersect, i.e. both are active (Boundary Case); the sketch below checks all four cases numerically.
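
The lecture notes do not record the specific pair of constraints, so the sketch below uses hypothetical ones, g_1(x) = x_1 + x_2 - 2 \ge 0 and g_2(x) = x_1 - x_2 - 1 \ge 0, purely to illustrate the case-enumeration strategy: only the case with both constraints active yields a feasible point with non-negative multipliers.

```python
import numpy as np

# Hypothetical constraints (the lecture's exact constraints are not recorded):
#   g1(x) = x1 + x2 - 2 >= 0
#   g2(x) = x1 - x2 - 1 >= 0
# Objective: minimize x1^2 + x2^2.
A = np.array([[1.0, 1.0],    # gradient of g1
              [1.0, -1.0]])  # gradient of g2
b = np.array([2.0, 1.0])     # g_i(x) = A[i] @ x - b[i]

def feasible(x, tol=1e-9):
    return np.all(A @ x - b >= -tol)

# Enumerate the four KKT cases: which constraints are active (g_i(x) = 0).
for active in [(), (0,), (1,), (0, 1)]:
    lam = np.zeros(2)
    if not active:
        x = np.zeros(2)                       # unconstrained minimum of x1^2 + x2^2
    else:
        idx = list(active)
        Aa = A[idx]                           # gradients of the active constraints
        # Stationarity 2x - Aa^T lam = 0 together with Aa x = b_active:
        K = np.block([[2 * np.eye(2), -Aa.T],
                      [Aa, np.zeros((len(idx), len(idx)))]])
        sol = np.linalg.solve(K, np.concatenate([np.zeros(2), b[idx]]))
        x = sol[:2]
        lam[idx] = sol[2:]
    valid = feasible(x) and np.all(lam >= 0)
    print(f"active={active}  x={x.round(3)}  lambda={lam.round(3)}  valid={valid}")
```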

4. SVM Mathematical Formulation (Primal Problem)

The lecture applies these optimization principles specifically to the SVM Large Margin Classifier.

  • Objective Function: Minimize \frac{1}{2}||w||^2 (equivalent to maximizing the margin).
  • Constraints: Every data point must be correctly classified and lie on or beyond the margin boundary: y_i(w^T x_i - b) \ge 1.
  • Lagrangian Formulation: L(w, b) = \frac{1}{2}||w||^2 - \sum_{i=1}^{N} \alpha_i [y_i(w^T x_i - b) - 1] Here, \alpha_i represents the Lagrange multipliers.
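
As a quick numerical sanity check (a minimal sketch, assuming scikit-learn is available; the toy data below is made up), a linear SVC with a very large C approximates this hard-margin problem, and the learned w and b can be checked against the constraints and the margin width 2/||w||:

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data (hypothetical, for illustration only).
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 3.5],
              [-2.0, -2.0], [-3.0, -1.0], [-1.5, -3.0]])
y = np.array([1, 1, 1, -1, -1, -1])

# A very large C approximates the hard-margin problem: minimize (1/2)||w||^2
# subject to y_i (w^T x_i - b) >= 1.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_[0]
b = -clf.intercept_[0]          # sklearn uses w^T x + intercept; the lecture writes w^T x - b
margin = 2.0 / np.linalg.norm(w)

print("w =", w, " b =", b)
print("margin width =", margin)
# Every training point should satisfy the constraint (>= 1, up to numerical tolerance).
print("constraints:", y * (X @ w - b))
```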

5. Deriving the Dual Problem

To derive the dual, the partial derivatives of the Lagrangian with respect to the parameters w and b are set to zero.

  • Derivative w.r.t w: Yields the relationship w = \sum \alpha_i y_i x_i. This shows w is a linear combination of the data points.
  • Derivative w.r.t b: Yields the constraint \sum \alpha_i y_i = 0.
  • Substitution: By plugging these results back into the original Lagrangian equation, the "Primal" problem is converted into the "Dual" problem.
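  • Substitution Detail (standard step, written out here for completeness): expanding the Lagrangian with w = \sum_i \alpha_i y_i x_i and \sum_i \alpha_i y_i = 0 gives

    L = \frac{1}{2}w^T w - \sum_i \alpha_i y_i w^T x_i + b\sum_i \alpha_i y_i + \sum_i \alpha_i

    The third term vanishes because \sum_i \alpha_i y_i = 0, and \sum_i \alpha_i y_i w^T x_i = w^T w, so

    L = \sum_i \alpha_i - \frac{1}{2}w^T w = \sum_i \alpha_i - \frac{1}{2}\sum_i \sum_j \alpha_i \alpha_j y_i y_j x_i^T x_j

    which is exactly the Dual objective of the next section.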

6. The Dual Form and Kernel Intuition

The final derived Dual objective function depends on the data only through dot products between data points.

  • Dual Equation: \text{Maximize } \sum \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j x_i^T x_j Subject to \sum \alpha_i y_i = 0 and \alpha_i \ge 0.
  • Primal vs. Dual:
    • Primal: Depends on the number of features/parameters (D).
    • Dual: Depends on the number of data points (N).
  • Significance: The term x_i^T x_j represents the inner product between data points. This structure allows for the "Kernel Trick" (discussed below), which handles non-linearly separable data by mapping it to higher dimensions without explicit calculation.
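
Continuing the hedged scikit-learn sketch from Section 4 (same made-up data), the dual side can be inspected directly: dual_coef_ stores \alpha_i y_i for the support vectors, w is recovered as \sum \alpha_i y_i x_i, and the dual objective can be evaluated from the Gram matrix of inner products alone:

```python
import numpy as np
from sklearn.svm import SVC

# Reusing the toy data from the previous sketch (hypothetical data).
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 3.5],
              [-2.0, -2.0], [-3.0, -1.0], [-1.5, -3.0]])
y = np.array([1, 1, 1, -1, -1, -1])
clf = SVC(kernel="linear", C=1e6).fit(X, y)

sv = clf.support_vectors_            # the x_i with alpha_i > 0
coef = clf.dual_coef_[0]             # equals alpha_i * y_i for each support vector
alpha = np.abs(coef)                 # alpha_i >= 0, so |alpha_i y_i| = alpha_i

# w is a linear combination of the data points: w = sum_i alpha_i y_i x_i
w_from_dual = coef @ sv
print("w (primal)     :", clf.coef_[0])
print("w (from alphas):", w_from_dual)

# The dual objective only needs the Gram matrix of inner products x_i^T x_j.
K = sv @ sv.T
dual_obj = alpha.sum() - 0.5 * coef @ K @ coef
print("dual objective :", dual_obj)
print("0.5 * ||w||^2  :", 0.5 * np.dot(w_from_dual, w_from_dual))
```

At the optimum the dual objective equals \frac{1}{2}||w||^2, which is why the last two printed values agree (strong duality for this convex problem).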

7. The Dual Form and Inner Products

In the previous section, the Dual Form of the SVM optimization problem was derived.

  • Objective Function: The dual objective function to maximize involves the parameters \alpha and the data points:

    \sum \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j (x_i^T x_j)
  • Key Observation: The optimization depends solely on the inner product (x_i^T x_j) between data points. This inner product represents the similarity between two vectors, which is the foundational concept for the Kernel Method.


8. Feature Mapping and Cover's Theorem

When data is not linearly separable in the original space (low-dimensional), we can transform it into a higher-dimensional space where a linear separator exists.

  • Mapping Function (\Phi): We define a transformation rule, or mapping function \Phi(x), that projects input vector x from the original space to a high-dimensional feature space.

    • Example 1 (1D to 2D): Mapping x \to (x, x^2). The mapped data lies on a parabola in the 2D space, where a straight line can separate classes that were interleaved on the original 1D line (see the sketch at the end of this section).
    • Example 2 (2D to 3D): Mapping x = (x_1, x_2) to \Phi(x) = (x_1^2, x_2^2, \sqrt{2}x_1 x_2).
  • Cover's Theorem: A classification problem cast non-linearly into a higher-dimensional feature space is more likely to be linearly separable than in the original low-dimensional space; increasing the dimensionality therefore increases the "power" of a linear method.

    • Strategy: Apply a mapping function \Phi to the original data, then find a linear classifier in that high-dimensional space.
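
A tiny numeric illustration of the 1D \to 2D case (the data and the separating threshold below are made up): the labels depend on |x|, so no single threshold on the line separates them, but after mapping x \to (x, x^2) a horizontal line in the mapped space does.

```python
import numpy as np

# Hypothetical 1D data: class depends on |x|, so no single threshold on x separates it.
x = np.array([-2.0, -1.5, -0.5, -0.2, 0.3, 0.8, 1.4, 2.2])
y = np.array([ 1,    1,   -1,   -1,  -1,  -1,   1,   1 ])

# Feature map Phi(x) = (x, x^2): the data now lies on a parabola in 2D.
Phi = np.stack([x, x**2], axis=1)

# In the mapped space a line works, e.g. the horizontal line x^2 = 1.2
# (second coordinate > 1.2 -> class +1). This is a hand-picked separator, not a trained one.
pred = np.where(Phi[:, 1] > 1.2, 1, -1)
print("separable after mapping:", np.all(pred == y))   # True
```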

9. The Kernel Trick

Directly computing the mapping \Phi(x) can be computationally expensive or impossible (e.g., infinite dimensions). The Kernel Trick allows us to compute the similarity in the high-dimensional space using only the original low-dimensional vectors.

  • Definition: A Kernel function K(x, y) calculates the inner product of the mapped vectors:

    K(x, y) = \Phi(x)^T \Phi(y)
  • Efficiency: The result is a scalar computed directly in the original space, without ever constructing \Phi(x) explicitly.

  • Derivation Example (Polynomial Kernel): For 2D vectors x and y, consider the kernel K(x, y) = (x^T y)^2.

    (x^T y)^2 = (x_1 y_1 + x_2 y_2)^2 = x_1^2 y_1^2 + x_2^2 y_2^2 + 2x_1 y_1 x_2 y_2

    This is mathematically equivalent to the dot product of two mapped vectors where:

    \Phi(x) = (x_1^2, x_2^2, \sqrt{2}x_1 x_2)

    Thus, calculating (x^T y)^2 in the original space is equivalent to calculating similarity in the 3D space defined by \Phi.
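
This identity is easy to verify numerically; a minimal check (the example vectors are arbitrary, any 2D inputs would do):

```python
import numpy as np

def phi(v):
    """Explicit feature map for the degree-2 polynomial kernel (2D -> 3D)."""
    return np.array([v[0]**2, v[1]**2, np.sqrt(2) * v[0] * v[1]])

def k_poly2(x, y):
    """Kernel computed entirely in the original 2D space."""
    return np.dot(x, y) ** 2

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])

print(k_poly2(x, y))                 # (1*3 + 2*(-1))^2 = 1.0
print(np.dot(phi(x), phi(y)))        # same value via the explicit 3D mapping
```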


10. Mercer's Theorem & Positive Definite Functions

How do we know if a function K(x, y) is a valid kernel? Mercer's Theorem provides the condition.

  • The Theorem: If a function K(x, y) is Positive Definite (P.D.), then there always exists a mapping function \Phi such that K(x, y) = \Phi(x)^T \Phi(y).
  • Implication: We can choose any P.D. function as our kernel and be guaranteed that it corresponds to some high-dimensional space, without needing to derive \Phi explicitly.

Positive Definiteness (Matrix Definition)

To check if a kernel is P.D., we analyze the Kernel Matrix (Gram Matrix) constructed from data points.

  • A matrix M is P.D. if z^T M z > 0 for every non-zero vector z.
  • Eigenvalue Condition: A symmetric matrix is P.D. if and only if all of its eigenvalues are positive.
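
A small numerical sketch of the eigenvalue check (random sample points and the RBF kernel from the next section; sigma is an arbitrary choice): the Gram matrix of a valid kernel should show no negative eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 2))                       # arbitrary sample points

# RBF (Gaussian) kernel matrix: K_ij = exp(-||x_i - x_j||^2 / (2 * sigma^2))
sigma = 1.0
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq_dists / (2 * sigma**2))

# A valid (Mercer) kernel gives a symmetric Gram matrix with no negative eigenvalues.
eigvals = np.linalg.eigvalsh(K)
print("smallest eigenvalue:", eigvals.min())       # >= 0 up to numerical precision
```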

11. Infinite Dimensionality (RBF Kernel)

The lecture briefly touches upon the exponential (Gaussian/RBF) kernel.

  • The exponential function can be expanded using a Taylor Series into an infinite sum.
  • This implies that using an exponential-based kernel is equivalent to mapping the data into an infinite-dimensional space.
  • Even though the dimension is infinite, the calculation K(x, y) remains a simple scalar operation in the original space.
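
Sketch of the argument (a standard expansion, not verbatim from the lecture), using the simpler exponential kernel K(x, y) = \exp(x^T y):

    \exp(x^T y) = \sum_{k=0}^{\infty} \frac{(x^T y)^k}{k!} = 1 + x^T y + \frac{(x^T y)^2}{2!} + \frac{(x^T y)^3}{3!} + \cdots

Each term (x^T y)^k is a polynomial kernel of degree k with its own finite-dimensional mapping (as in Section 9 for k = 2), so the sum corresponds to stacking all of these mappings into one infinite-dimensional \Phi. The Gaussian kernel \exp(-||x - y||^2 / 2\sigma^2) reduces to this form after factoring out terms that depend on x and y separately.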

12. Final SVM Formulation with Kernels

By applying the Kernel Trick, the SVM formulation is generalized to non-linear problems.

  • Dual Objective: Replace x_i^T x_j with K(x_i, x_j):

    \text{Maximize: } \sum \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j K(x_i, x_j)
  • Decision Rule: For a new test point x', the classification is determined by:

    \sum \alpha_i y_i K(x_i, x') - b \ge 0
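
A minimal end-to-end sketch of this decision rule (assuming scikit-learn; the ring-shaped toy data, gamma, and C are arbitrary choices). scikit-learn's SVC stores \alpha_i y_i in dual_coef_ and uses a "+ intercept" convention, so b below is taken as -intercept_ to match the lecture's w^T x - b form.

```python
import numpy as np
from sklearn.svm import SVC

def rbf(a, b_vec, gamma=0.5):
    """RBF kernel K(a, b) = exp(-gamma * ||a - b||^2)."""
    return np.exp(-gamma * np.sum((a - b_vec) ** 2))

# Non-linearly separable toy data: class +1 inside a disk around the origin (made up).
rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(80, 2))
y = np.where(np.linalg.norm(X, axis=1) < 1.5, 1, -1)

clf = SVC(kernel="rbf", gamma=0.5, C=10.0).fit(X, y)

# Lecture's decision rule: sum_i alpha_i y_i K(x_i, x') - b >= 0
sv, coef, b = clf.support_vectors_, clf.dual_coef_[0], -clf.intercept_[0]

x_new = np.array([0.3, -0.4])
score = sum(c * rbf(s, x_new) for c, s in zip(coef, sv)) - b
print("manual score   :", score)
print("sklearn score  :", clf.decision_function([x_new])[0])   # should match
print("predicted class:", 1 if score >= 0 else -1)
```

The manually computed score reproduces clf.decision_function, confirming that prediction only requires kernel evaluations against the support vectors.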

Next Lecture: The course will move on to Generative Methods (probabilistic methods).