Study Guide: Discrete Probability Models & Undirected Graphical Models

Date: 2025.11.24
Topic: Multinomial Distribution, Maximum Likelihood Estimation (MLE), and Markov Random Fields (Undirected Graphical Models).


1. Discrete Probability Distributions

The lecture shifts focus from continuous models (such as the Gaussian) to discrete models, which are essential for tasks like text classification (e.g., Naive Bayes).

Binomial Distribution

  • Scenario: A coin toss (Binary outcome: Head/Tail).
  • Random Variables: m_1 (count of Heads), m_2 (count of Tails).
  • Parameters: Probability of Head (\mu) and Tail (1-\mu).
  • Formula: For N tosses with m_1 + m_2 = N, we count the number of ways to arrange the outcomes: P(m_1, m_2) = \frac{N!}{m_1! \, m_2!} \mu^{m_1} (1-\mu)^{m_2} = \binom{N}{m_1} \mu^{m_1} (1-\mu)^{m_2}
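
A minimal numeric check of this formula; the values of N, m_1, and \mu below are made-up illustrative numbers, not from the lecture:

```python
from math import comb

N, m1, mu = 10, 7, 0.6                     # hypothetical: 10 tosses, 7 heads, P(Head) = 0.6
m2 = N - m1                                # count of Tails
p = comb(N, m1) * mu**m1 * (1 - mu)**m2    # N!/(m1! m2!) * mu^m1 * (1-mu)^m2
print(p)                                   # ≈ 0.215
```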

Multinomial Distribution

  • Scenario: Rolling a die with K faces (e.g., K=6). This generalizes the binomial distribution.
  • Definition:
    • We have N total events (trials).
    • We observe counts m_1, m_2, ..., m_K for each of the K possible outcomes, with \sum_{k=1}^{K} m_k = N.
    • Parameters \mu_1, ..., \mu_K represent the probability of each outcome (and sum to 1).
  • Probability Mass Function (a short numeric sketch follows this list): P(m_1, ..., m_K | \mu) = \frac{N!}{m_1! \cdots m_K!} \prod_{k=1}^{K} \mu_k^{m_k}
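
A minimal sketch evaluating this PMF for a die roll; the counts and parameters below are made-up illustrative values:

```python
from math import factorial, prod

m = [1, 2, 0, 3, 1, 1]                  # hypothetical counts for a 6-sided die (K = 6)
mu = [1/6] * 6                          # fair-die parameters; they sum to 1
N = sum(m)                              # total number of rolls (here N = 8)

coeff = factorial(N) // prod(factorial(mk) for mk in m)   # N! / (m_1! ... m_K!)
p = coeff * prod(mu_k**mk for mu_k, mk in zip(mu, m))
print(p)                                # probability of observing exactly these counts
```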

2. Learning: Maximum Likelihood Estimation (MLE)

How do we estimate the parameters (\mu_k) from data?

  • Goal: Maximize the likelihood of the observed data subject to the constraint that probabilities sum to 1 (\sum \mu_k = 1).

  • Method: Lagrange Multipliers.

    1. Objective: Maximize Log-Likelihood: L = \ln(N!) - \sum \ln(m_k!) + \sum m_k \ln(\mu_k)
    2. Constraint: \sum_{k=1}^{K} \mu_k - 1 = 0.
    3. Lagrangian: L' = \sum_{k=1}^{K} m_k \ln(\mu_k) + \lambda (\sum_{k=1}^{K} \mu_k - 1) (Note: Constant terms like N! vanish during differentiation).
    4. Derivation: Taking the derivative w.r.t. \mu_k and setting it to 0 yields \mu_k = - \frac{m_k}{\lambda}. Substituting this into the constraint, and using \sum_k m_k = N, gives \lambda = -N (the steps are worked out after this list).
  • Result:

    \mu_k = \frac{m_k}{N}
    • The optimal parameter is simply the empirical fraction (count of specific events divided by total events).
    • This provides the theoretical justification for the simple "counting" method used in the Naive Bayes classifier discussed in previous lectures.
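
A worked restatement of steps 3–4 above:

```latex
\frac{\partial L'}{\partial \mu_k} = \frac{m_k}{\mu_k} + \lambda = 0
  \;\Rightarrow\; \mu_k = -\frac{m_k}{\lambda}, \qquad
1 = \sum_{k=1}^{K} \mu_k = -\frac{1}{\lambda} \sum_{k=1}^{K} m_k = -\frac{N}{\lambda}
  \;\Rightarrow\; \lambda = -N
  \;\Rightarrow\; \mu_k = \frac{m_k}{N}
```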

3. Undirected Graphical Models (Markov Random Fields)

When causal relationships (direction) are unclear or interactions are symmetric (e.g., neighboring pixels in an image, social network friends), we use Undirected Graphs instead of Bayesian Networks (Directed Acyclic Graphs).

Comparison

  • Directed (Bayesian Network): Uses conditional probabilities (e.g., P(A|B)). Represents causality or asymmetric relationships.
  • Undirected (Markov Random Field - MRF): Uses "Potential Functions" (\psi). Represents correlation or symmetric constraints.

Conditional Independence in MRF

Determining conditional independence is simpler in undirected graphs than in directed graphs: plain graph separation suffices, and no D-separation rules are needed.

  • Global Markov Property: Two sets of nodes are conditionally independent given a separating set if all paths between them pass through the separating set.
    • Example: If nodes X_1 and X_5 are not directly connected, they are conditionally independent given the intermediate nodes (e.g., X_3) that block the path.
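
A minimal sketch of this separation test using networkx; the small chain graph below is a made-up example, not the lecture's exact figure:

```python
import networkx as nx

# Hypothetical undirected graph: x1 - x2 - x3 - x4 - x5 (a simple chain)
G = nx.Graph([("x1", "x2"), ("x2", "x3"), ("x3", "x4"), ("x4", "x5")])

def separated(G, a, b, sep):
    """True if every path from a to b passes through the separating set `sep`."""
    H = G.copy()
    H.remove_nodes_from(sep)           # block the separating nodes
    return not nx.has_path(H, a, b)

print(separated(G, "x1", "x5", {"x3"}))   # True: x1 is independent of x5 given x3
print(separated(G, "x1", "x5", set()))    # False: an unblocked path remains
```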

4. Factorization in Undirected Graphs

Since edges carry no direction, we cannot factorize the joint distribution with a chain of conditional probabilities as in a Bayesian network (and in general P(A|B) \neq P(B|A)), so we model the joint distribution using Cliques.

Cliques and Maximal Cliques

  • Clique: A subgraph where every pair of nodes is connected (fully connected).
  • Maximal Clique: A clique that cannot be expanded by including any other adjacent node.

The Joint Distribution Formula

We associate a Potential Function (\psi_C) with each maximal clique C.

P(x) = \frac{1}{Z} \prod_{C} \psi_C(x_C)
  • Potential Function (\psi): A non-negative function that scores the compatibility of variables in a clique. It is not a probability (doesn't sum to 1).
  • Partition Function (Z): The normalization constant required to make the total probability sum to 1. Z = \sum_x \prod_{C} \psi_C(x_C)

Example Decomposition

Given a graph with maximal cliques \{x_1, x_2\}, \{x_1, x_3\}, and \{x_3, x_4, x_5\}:

P(x) = \frac{1}{Z} \psi_{12}(x_1, x_2) \psi_{13}(x_1, x_3) \psi_{345}(x_3, x_4, x_5)
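
A brute-force sketch of this factorization for binary variables; the potential tables below are arbitrary made-up numbers chosen only to illustrate the normalization by Z:

```python
from itertools import product

# Hypothetical potentials over binary variables, stored as lookup tables.
# Larger values mean "more compatible" configurations; they need not sum to 1.
psi_12  = {(a, b): 2.0 if a == b else 0.5 for a, b in product([0, 1], repeat=2)}
psi_13  = {(a, c): 1.5 if a == c else 1.0 for a, c in product([0, 1], repeat=2)}
psi_345 = {(c, d, e): 1.0 + c + d * e for c, d, e in product([0, 1], repeat=3)}

def unnorm(x1, x2, x3, x4, x5):
    # Product of maximal-clique potentials, before normalization
    return psi_12[(x1, x2)] * psi_13[(x1, x3)] * psi_345[(x3, x4, x5)]

# Partition function: sum over all 2^5 joint assignments
Z = sum(unnorm(*x) for x in product([0, 1], repeat=5))

def P(x1, x2, x3, x4, x5):
    return unnorm(x1, x2, x3, x4, x5) / Z

print(Z, P(0, 0, 0, 0, 0))   # the normalized probabilities now sum to 1 over all x
```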

Hammersley-Clifford Theorem

This theorem provides the theoretical guarantee linking the two views: a strictly positive distribution satisfies the conditional independence properties of an undirected graph if and only if it factorizes over the graph's (maximal) cliques.