Study Guide: Bayesian Networks & Probabilistic Inference
Date: 2025.12.01 (Final Lecture)
Topic: Bayesian Networks, Probabilistic Inference Examples, Marginalization
1. Recap: Directed vs. Undirected Models
The lecture begins by briefly contrasting the two types of graphical models discussed:
- Undirected Graphs (MRFs): Use potential functions ($\psi$) defined on maximal cliques; a normalization constant (the partition function $Z$) is required to turn the product of potentials into a probability distribution.
- Directed Graphs (Bayesian Networks): Use conditional probability distributions (CPDs). The joint distribution is the product of local conditional probabilities:
P(X) = \prod_{i} P(x_i | \text{parents}(x_i))
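For concreteness, this factorization can be evaluated mechanically as a product of local factors. The sketch below is a minimal Python illustration, not lecture code; the dictionary-based CPD representation and the function name `joint_probability` are assumptions made for the example.

```python
# Minimal sketch (not lecture code): evaluating P(X) = prod_i P(x_i | parents(x_i)).
def joint_probability(assignment, parents, cpds):
    """assignment: dict mapping variable name -> value
    parents:    dict mapping variable name -> tuple of its parent names
    cpds:       dict mapping variable name -> function(value, parent_values) -> probability
    """
    prob = 1.0
    for var, value in assignment.items():
        parent_values = tuple(assignment[p] for p in parents[var])
        prob *= cpds[var](value, parent_values)  # local factor P(x_i | parents(x_i))
    return prob
```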
2. Example 1: The "Alarm" Network (Burglary/Earthquake)
This is a classic example used to demonstrate inference in Bayesian Networks.
Scenario & Structure
- Nodes:
  - B: Burglary (root node, no parents).
  - E: Earthquake (root node, no parents).
  - A: Alarm (triggered by a Burglary or an Earthquake).
  - J: JohnCalls (triggered by the Alarm).
  - M: MaryCalls (triggered by the Alarm).
- Dependencies: $B \rightarrow A \leftarrow E$, $A \rightarrow J$, $A \rightarrow M$.
- Probabilities (Given):
  - $P(B) = 0.05$, $P(E) = 0.1$.
  - $P(A|B, E)$: full table given (e.g., $P(A|B, \neg E) = 0.85$, $P(A|\neg B, \neg E) = 0.05$, etc.).
  - $P(J|A) = 0.7$, $P(M|A) = 0.8$.
Task 1: Calculate a Specific Joint Probability
Calculate the probability of the event: Burglary, No Earthquake, Alarm rings, John calls, Mary does not call.
P(B, \neg E, A, J, \neg M)
- Decomposition: Apply the chain rule based on the graph structure:
  = P(B) \cdot P(\neg E) \cdot P(A | B, \neg E) \cdot P(J | A) \cdot P(\neg M | A)
- Calculation (using $P(\neg E) = 1 - 0.1 = 0.9$ and $P(\neg M|A) = 1 - 0.8 = 0.2$):
  = 0.05 \times 0.9 \times 0.85 \times 0.7 \times 0.2 = 0.005355
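The same arithmetic as a short Python check; the variable names are mine, and every number is one given above (the complements are computed as 1 minus the given value):

```python
# Task 1: P(B, ~E, A, J, ~M) using only the probabilities listed above.
p_B = 0.05                 # P(Burglary)
p_E = 0.10                 # P(Earthquake)
p_A_given_B_notE = 0.85    # P(A | B, ~E), from the given CPT
p_J_given_A = 0.7          # P(JohnCalls | A)
p_M_given_A = 0.8          # P(MaryCalls | A)

# P(B, ~E, A, J, ~M) = P(B) * P(~E) * P(A|B,~E) * P(J|A) * P(~M|A)
joint = p_B * (1 - p_E) * p_A_given_B_notE * p_J_given_A * (1 - p_M_given_A)
print(joint)  # ~0.005355
```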
Task 2: Inference (Conditional Probability)
Calculate the probability that a Burglary occurred, given that John called and Mary did not call.
P(B | J, \neg M)
- Formula (Bayes' Rule):
  P(B | J, \neg M) = \frac{P(B, J, \neg M)}{P(J, \neg M)}
- Numerator Calculation ($P(B, J, \neg M)$): Marginalize out the unknown variables ($A$ and $E$) from the joint distribution:
  P(B, J, \neg M) = \sum_{A \in \{T,F\}} \sum_{E \in \{T,F\}} P(B, E, A, J, \neg M)
  This involves summing 4 terms (the combinations of $A$ and $E$).
- Denominator Calculation ($P(J, \neg M)$): Further marginalize out $B$ from the numerator expression (see the enumeration sketch after this list):
  P(J, \neg M) = P(B, J, \neg M) + P(\neg B, J, \neg M)
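A sketch of the full enumeration in Python. The notes list only part of the alarm CPT and only $P(J|A)$, $P(M|A)$ for $A$ = true, so the entries marked ASSUMED below are illustrative placeholders, not values from the lecture:

```python
from itertools import product

# Enumeration sketch for P(B | J, ~M). Entries marked ASSUMED are illustrative
# placeholders; only the values listed in the notes above are taken as given.
p_B, p_E = 0.05, 0.10
p_A = {  # P(A = True | B, E)
    (True,  True):  0.95,   # ASSUMED (not listed in the notes)
    (True,  False): 0.85,   # given
    (False, True):  0.30,   # ASSUMED (not listed in the notes)
    (False, False): 0.05,   # given
}
p_J = {True: 0.7, False: 0.1}   # P(J = True | A); the A = False entry is ASSUMED
p_M = {True: 0.8, False: 0.1}   # P(M = True | A); the A = False entry is ASSUMED

def joint(b, e, a, j, m):
    """P(B=b, E=e, A=a, J=j, M=m) via the chain-rule factorization."""
    pb = p_B if b else 1 - p_B
    pe = p_E if e else 1 - p_E
    pa = p_A[(b, e)] if a else 1 - p_A[(b, e)]
    pj = p_J[a] if j else 1 - p_J[a]
    pm = p_M[a] if m else 1 - p_M[a]
    return pb * pe * pa * pj * pm

hidden = list(product([True, False], repeat=2))  # values of (A, E)
numerator = sum(joint(True, e, a, True, False) for a, e in hidden)                 # P(B, J, ~M)
denominator = numerator + sum(joint(False, e, a, True, False) for a, e in hidden)  # P(J, ~M)
print(numerator / denominator)  # P(B | J, ~M)
```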
3. Example 2: 4-Node Tree Structure
A simpler example to demonstrate how sums simplify during marginalization.
Scenario & Structure
- Nodes: $X_1, X_2, X_3, X_4 \in \{0, 1\}$ (binary).
- Dependencies: $X_1 \rightarrow X_2$, $X_2 \rightarrow X_3$, $X_2 \rightarrow X_4$.
- Decomposition: $P(X) = P(X_1)P(X_2|X_1)P(X_3|X_2)P(X_4|X_2)$.
- Given Tables: Probabilities for all priors and conditionals are provided.
Task: Calculate Marginal Probability $P(X_3 = 1)$
We need to find the probability that $X_3 = 1$, regardless of the values of the other variables.
- Definition: Sum the joint probability over all other variables ($X_1, X_2, X_4$):
  P(X_3=1) = \sum_{x_1} \sum_{x_2} \sum_{x_4} P(x_1, x_2, X_3=1, x_4)
- Step 1: Expand using the graph structure:
  = \sum_{x_1} \sum_{x_2} \sum_{x_4} P(x_1)P(x_2|x_1)P(X_3=1|x_2)P(x_4|x_2)
- Step 2: Simplify (Key Insight): Push each summation sign as far right as possible. The sum over $x_4$ affects only the last term $P(x_4|x_2)$:
  = \sum_{x_1} \sum_{x_2} P(x_1)P(x_2|x_1)P(X_3=1|x_2) \left[ \sum_{x_4} P(x_4|x_2) \right]
  - Property: $\sum_{x_4} P(x_4|x_2) = 1$ (a conditional distribution always sums to 1 over its own variable).
  - Therefore, the $X_4$ term vanishes. Intuitively, $X_4$ is a leaf node separate from $X_3$, so it contributes nothing to $X_3$'s marginal once $X_2$ is accounted for.
- Step 3: Final Calculation: We are left with a sum over $x_1$ and $x_2$:
  = \sum_{x_1} \sum_{x_2} P(x_1)P(x_2|x_1)P(X_3=1|x_2)
  This expands to 4 terms (the combinations of $x_1 \in \{0,1\}$ and $x_2 \in \{0,1\}$). A small enumeration sketch follows below.
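To see the simplification numerically, here is a small Python sketch. The lecture provides the CPTs, but the notes do not record them, so the table values below are ASSUMED placeholders; the point is only that the brute-force sum and the simplified sum agree:

```python
from itertools import product

# Marginal P(X3 = 1) for the tree X1 -> X2 -> {X3, X4}.
# All table values below are ASSUMED placeholders (the real tables were given in lecture).
p_x1 = {0: 0.6, 1: 0.4}                              # P(X1)        (assumed)
p_x2 = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}    # P(X2 | X1)   (assumed)
p_x3 = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.25, 1: 0.75}}  # P(X3 | X2)   (assumed)
p_x4 = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.4, 1: 0.6}}    # P(X4 | X2)   (assumed)

# Brute force: sum the full joint over x1, x2, x4 with X3 fixed to 1.
brute = sum(p_x1[x1] * p_x2[x1][x2] * p_x3[x2][1] * p_x4[x2][x4]
            for x1, x2, x4 in product([0, 1], repeat=3))

# Simplified: sum_{x4} P(x4 | x2) = 1, so the X4 factor drops out entirely.
simplified = sum(p_x1[x1] * p_x2[x1][x2] * p_x3[x2][1]
                 for x1, x2 in product([0, 1], repeat=2))

print(brute, simplified)  # identical values, confirming the X4 term vanishes
```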
4. Semester Summary & Conclusion
The lecture concludes the semester's material.
- Key Themes Covered:
- Discriminative vs. Generative Methods: The fundamental difference in approach (modeling a decision boundary vs. modeling the data distribution).
- Objective Functions: Designing Loss functions vs. Likelihood functions.
- Optimization: Parameter estimation via derivatives (MLE).
- Graphical Models: Reducing parameter complexity using independence assumptions (Bayes Nets, MRFs).
- Final Exam: Scheduled for Thursday, December 11th. It will cover the concepts discussed, focusing on understanding the fundamentals (e.g., Likelihood, Generative principles) rather than rote memorization.