Probability as Extended Logic

Statistics is not about numbers. It’s about reasoning under uncertainty.

Think about the questions you’ll ask as a researcher:

  • Are these two variables related?
  • Will this intervention work?
  • What caused this observation?
  • What will happen if we change something?

Statistical methods give you formal tools for answering these questions. But here’s the catch: statistics cannot tell you about causes without additional assumptions.

Statistical Models Are Powerful But Mindless

Statistical models are powerful tools, but they have a critical limitation: they do exactly what you tell them to do, without understanding your scientific question.

A regression model doesn’t know whether you’re predicting ice cream sales from temperature or temperature from ice cream sales. It will happily give you an answer either way—even if one interpretation is scientifically meaningless.

Statistical models:

  • Follow instructions literally
  • Have no understanding of causation
  • Can produce absurd results if misapplied
  • Require careful human guidance

Statistical Association vs. Causal Inference

Consider this dataset: as ice cream sales increase, drowning deaths increase.

Statistical question: Are these variables associated? Answer: Yes, they are positively correlated.

Causal question: Does ice cream cause drowning? Answer from data alone: Impossible to say.

Figure 2
Animation showing three scenarios side by side: (1) ice cream sales causing drowning (absurd), (2) drowning causing ice cream sales (equally absurd), (3) a hidden variable (temperature/summer) causing both. The third scenario is highlighted with a common-ancestor node, showing how correlation can arise without direct causation.

The data alone cannot distinguish between:

  1. Ice cream → Drowning (implausible)
  2. Drowning → Ice cream (equally implausible)
  3. Temperature → Ice cream and Temperature → Drowning (plausible!)

This is the fundamental problem: correlation does not imply causation, and data alone cannot reveal causation.

Enter the DAG: Your Scientific Model

A Directed Acyclic Graph (DAG) is a visual representation of your causal assumptions.

Components:

  • Nodes: Variables in your system
  • Directed edges: Causal relationships (arrows point from cause to effect)
  • Acyclic: Following the arrows, you can never return to where you started; no variable causes itself, directly or through a chain of other variables (no feedback loops)

Example: Ice Cream and Drowning

Temperature → Ice cream sales
Temperature → Drowning

This DAG encodes our theory:

  • Temperature affects ice cream sales
  • Temperature affects drowning rates (more people swim when it’s hot)
  • Ice cream sales and drowning are not causally related
  • Their correlation arises from their common cause: temperature

Figure 3
Interactive animation of the ice cream DAG. Start with just ice cream sales and drowning connected by a question mark. Then temperature appears above them, sends arrows to both variables, and the question mark between ice cream and drowning fades away. Equations appear showing how data is generated from the causal structure.
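
To see this arise from a generative process, here is a minimal simulation sketch (the variable names, effect sizes, and noise levels are made-up illustrative assumptions): temperature drives both variables, the two never influence each other, yet they come out correlated.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Common cause: daily temperature (units and effect sizes are illustrative)
temperature = rng.normal(25, 5, n)

# Temperature drives both; neither variable influences the other
ice_cream_sales = 20 * temperature + rng.normal(0, 50, n)
drownings = 0.5 * temperature + rng.normal(0, 3, n)

# The two effects of the common cause end up correlated anyway
print(np.corrcoef(ice_cream_sales, drownings)[0, 1])  # clearly positive
```

Running this prints a correlation well above zero even though neither variable appears in the other's equation.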

The Mantra: “No Causes In; No Causes Out”

Statistical models process data. They don’t generate causal knowledge from nothing.

If you want causal conclusions, you must input causal assumptions.

These assumptions come from:

  • Domain knowledge
  • Theory
  • Previous experiments
  • Physical/biological/social mechanisms
  • Your DAG

Building Your First DAG

Let’s work through a real example: Does education increase income?

Step 1: Identify variables

  • Education (years)
  • Income ($/year)

Step 2: What causes what?

  • Start simple: Education → Income
  • But wait… are there other factors?

Step 3: Consider confounders

Confounders are variables that affect both treatment and outcome:

  • Family wealth might affect both education (can afford college) and income (inheritance, connections)
  • Intelligence might affect both education (academic success) and income (job performance)

Step 4: Draw the DAG

Family wealth → Education → Income
Family wealth → Income

Intelligence → Education → Income
Intelligence → Income
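
To see why the confounder matters for estimation, here is a rough simulation sketch under assumed linear effects (all coefficients are invented for illustration): regressing income on education alone overstates the effect, while also conditioning on family wealth recovers the true coefficient.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5_000

# Simulate the DAG: wealth -> education, wealth -> income, education -> income
wealth = rng.normal(0, 1, n)
education = 0.7 * wealth + rng.normal(0, 1, n)
income = 2.0 * education + 3.0 * wealth + rng.normal(0, 1, n)  # true education effect = 2.0

def slopes(y, *covariates):
    """Ordinary least squares with an intercept; returns the slope coefficients."""
    X = np.column_stack([np.ones(len(y)), *covariates])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[1:]

print("education only:     ", slopes(income, education))          # biased well above 2.0
print("education + wealth: ", slopes(income, education, wealth))  # close to 2.0
```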

DAGs Tell You What to Measure

Your DAG determines your statistical analysis.

Without a DAG: “Let me try every possible model and see what works”

With a DAG: “My causal model says I need to condition on X and Y, so my statistical model should include them”

This is the bridge from science to statistics:

  1. DAG (science): what causes what
  2. Statistical model (tool): how to estimate effects given the causal structure

Common DAG Structures

1. The Pipe (Mediation)

X → Z → Y
  • Z is a mediator: X affects Y through Z
  • Example: Exercise → Cardiovascular health → Longevity

2. The Fork (Confounding)

  Z
 ↙ ↘
X   Y
  • Z is a confounder: Z causes both X and Y
  • Example: Temperature → Ice cream & Drowning

3. The Collider

X → Z ← Y
  • Z is a collider: both X and Y cause Z
  • Example: Talent → Success ← Luck
  • Danger: Conditioning on a collider creates spurious associations!

Collider Bias: If you condition on a collider, you induce a correlation between its causes—even if they’re independent! This is also called “Berkson’s paradox” or “selection bias.”

Example: Among Hollywood actors (success = collider), talent and luck appear negatively correlated—because you need one or the other to succeed. But among the general population, they’re independent.
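
A small simulation sketch of this selection effect, under assumed independent Normal talent and luck and an illustrative "top 5% of talent + luck" success rule:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

# Talent and luck are generated independently of each other
talent = rng.normal(0, 1, n)
luck = rng.normal(0, 1, n)

# Success is a collider: it depends on both (illustrative "top 5%" rule)
score = talent + luck
successful = score > np.quantile(score, 0.95)

print("everyone:        ", np.corrcoef(talent, luck)[0, 1])                          # ~0
print("successful only: ", np.corrcoef(talent[successful], luck[successful])[0, 1])  # clearly negative
```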

The Scientific Workflow

Here’s how you should approach any data analysis problem:

  1. Ask a causal question: Does higher A cause lower B?
  2. Draw a DAG: What are the causal relationships?
  3. Identify the estimand: What do we need to measure/control?
  4. Choose a statistical model: The tool that implements your strategy
  5. Validate assumptions: Does your model make sense?
  6. Interpret carefully: What does your answer actually mean?

Why This Matters

Many data analysis mistakes stem from skipping the DAG:

  • Wrong: “I’ll include all available variables and let the model sort it out”

  • Right: “My DAG says which variables matter and how”

  • Wrong: “Statistical significance means it’s important”

  • Right: “Statistical significance means it’s probably not zero—causal importance comes from the DAG”

  • Wrong: “The data will reveal the truth”

  • Right: “The data + my causal model can test specific hypotheses”

The Language of Probability

Before we go further, we need to establish some notation. You’ll see expressions like these throughout this course, and you need to know what they actually mean.

What Does $P(X)$ Mean?

$P(X)$ is the probability distribution of the variable $X$. It tells you how likely different values of $X$ are.

When you see: $$X \sim \mathcal{N}(0, 1)$$

This reads as: “$X$ is distributed as a Normal distribution with mean 0 and standard deviation 1.” The tilde ($\sim$) means “is distributed as.”

More generally:

  • $X \sim \mathcal{N}(\mu, \sigma)$ means $X$ follows a Normal (Gaussian) distribution with mean $\mu$ and standard deviation $\sigma$
  • $X \sim \text{Uniform}(a, b)$ means $X$ is uniformly distributed between $a$ and $b$
  • $X \sim \text{Poisson}(\lambda)$ means $X$ follows a Poisson distribution with rate $\lambda$
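
In code, sampling from these distributions looks like the following sketch (using NumPy's random generator; the parameter values are arbitrary examples):

```python
import numpy as np

rng = np.random.default_rng(0)

x_normal = rng.normal(loc=0, scale=1, size=5)    # X ~ Normal(mu=0, sigma=1)
x_uniform = rng.uniform(low=0, high=10, size=5)  # X ~ Uniform(a=0, b=10)
x_poisson = rng.poisson(lam=3, size=5)           # X ~ Poisson(lambda=3)

print(x_normal, x_uniform, x_poisson, sep="\n")
```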

The Three Axioms of Probability

All of probability theory rests on three simple axioms. These were formalized by Andrey Kolmogorov in 1933, and they define what probability actually means.

Axiom 1: Non-negativity

Probabilities are never negative: $$P(A) \geq 0$$

for any event $A$. You can’t have a -20% chance of something happening.

Axiom 2: Normalization

The probability that something happens is 1: $$P(\Omega) = 1$$

where $\Omega$ (capital omega) represents all possible outcomes. If you roll a die, you’re guaranteed to get some number between 1 and 6.

Axiom 3: Additivity

If two events $A$ and $B$ cannot both happen (they’re mutually exclusive), then: $$P(A \text{ or } B) = P(A) + P(B)$$

Example: When rolling a die, the probability of getting a 1 or a 6 is: $$P(1 \text{ or } 6) = P(1) + P(6) = \frac{1}{6} + \frac{1}{6} = \frac{1}{3}$$

This works because you can’t roll both a 1 and a 6 at the same time.
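
You can check the additivity axiom numerically with a quick Monte Carlo sketch (one million simulated fair-die rolls; the sample size is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
rolls = rng.integers(1, 7, size=1_000_000)  # fair six-sided die (upper bound is exclusive)

# "Roll a 1" and "roll a 6" are mutually exclusive, so their probabilities add
p1 = np.mean(rolls == 1)
p6 = np.mean(rolls == 6)
p_1_or_6 = np.mean((rolls == 1) | (rolls == 6))

print(p1 + p6, p_1_or_6)  # both close to 1/3
```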

What Follows From These Axioms

From these three simple rules, you can derive all of probability theory. For example:

The sum rule: The total probability across all possible outcomes must equal 1. $$\sum_i P(X = x_i) = 1$$

The complement rule: The probability that $A$ doesn’t happen is: $$P(\text{not } A) = 1 - P(A)$$

Bayes’ theorem (which we’ll explore in detail later): $$P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}$$

These axioms seem simple, but they’re incredibly powerful. They give us a consistent framework for reasoning about uncertainty.

Conditional Probability: $P(X|A)$

The vertical bar $|$ means “given” or “conditional on.”

$P(X|A)$ reads as: “the probability distribution of $X$ given that $A$ is true.”

Example:

  • $P(\text{rain})$ = probability it will rain today
  • $P(\text{rain}|\text{cloudy})$ = probability it will rain given that it’s cloudy

Conditioning on information changes the probability distribution. If you know it’s cloudy, your probability of rain goes up.

Formal Definition

Conditional probability is defined as: $$P(A|B) = \frac{P(A \text{ and } B)}{P(B)}$$

provided that $P(B) > 0$. In words: the probability of $A$ given $B$ equals the probability that both $A$ and $B$ occur, divided by the probability of $B$.

Intuition: We’re restricting our sample space to only those cases where $B$ is true, then asking how often $A$ also occurs within that restricted space.

Concrete Example: Medical Testing

Suppose:

  • 1% of people have a disease: $P(\text{disease}) = 0.01$
  • A test is 95% accurate for sick people: $P(\text{positive}|\text{disease}) = 0.95$
  • The test has a 10% false positive rate: $P(\text{positive}|\text{no disease}) = 0.10$

Question: If you test positive, what’s the probability you have the disease?

Many people guess 95%, but that’s wrong! We need $P(\text{disease}|\text{positive})$, not $P(\text{positive}|\text{disease})$—these are different.

Let’s calculate it properly. First, what’s $P(\text{positive})$?

$$P(\text{positive}) = P(\text{positive}|\text{disease}) \cdot P(\text{disease}) + P(\text{positive}|\text{no disease}) \cdot P(\text{no disease})$$

$$= 0.95 \times 0.01 + 0.10 \times 0.99 = 0.0095 + 0.099 = 0.1085$$

Now, using the definition of conditional probability, and noting that $P(\text{positive and disease}) = P(\text{positive}|\text{disease}) \cdot P(\text{disease}) = 0.0095$:

$$P(\text{disease}|\text{positive}) = \frac{P(\text{positive and disease})}{P(\text{positive})} = \frac{0.0095}{0.1085} \approx 0.088$$

Only 8.8%! Even with a positive test, you probably don’t have the disease, because the disease is rare and false positives are common.

This is a profound result: the order of conditioning matters. $P(A|B) \neq P(B|A)$ in general.
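
The whole calculation fits in a few lines; this sketch simply re-runs the arithmetic above (the 1%, 95%, and 10% figures are the assumed values from the example):

```python
# Assumed values from the example above
p_disease = 0.01
p_pos_given_disease = 0.95
p_pos_given_healthy = 0.10

# Law of total probability for P(positive), then the definition of conditional probability
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos

print(round(p_disease_given_pos, 3))  # 0.088
```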

Multiple Conditions: $P(X|A, B)$

You can condition on multiple things at once:

$P(X|A, B)$ reads as: “the probability distribution of $X$ given both $A$ and $B$.”

Example:

  • $P(\text{rain}|\text{cloudy}, \text{summer})$ = probability of rain given that it’s cloudy AND it’s summer

The comma between $A$ and $B$ means “and” in this context.

The Multiplication Rule (Chain Rule)

From the definition of conditional probability, we can derive the multiplication rule:

$$P(A \text{ and } B) = P(A|B) \cdot P(B) = P(B|A) \cdot P(A)$$

This generalizes to multiple events: $$P(A, B, C) = P(A|B, C) \cdot P(B|C) \cdot P(C)$$

Example: What’s the probability of drawing two aces in a row from a standard deck (without replacement)?

$$P(\text{ace}_1, \text{ace}_2) = P(\text{ace}_2|\text{ace}_1) \cdot P(\text{ace}_1) = \frac{3}{51} \times \frac{4}{52} = \frac{1}{221}$$
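
As a sanity check, the same chain-rule calculation with exact fractions:

```python
from fractions import Fraction

# Chain rule: P(ace1 and ace2) = P(ace2 | ace1) * P(ace1)
p_ace1 = Fraction(4, 52)
p_ace2_given_ace1 = Fraction(3, 51)

print(p_ace2_given_ace1 * p_ace1)  # 1/221
```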

The Law of Total Probability

If $B_1, B_2, \ldots, B_n$ are mutually exclusive events that cover all possibilities (they partition the sample space), then:

$$P(A) = \sum_{i=1}^{n} P(A|B_i) \cdot P(B_i)$$

This is incredibly useful. It says: to find the total probability of $A$, consider all the different ways $A$ can happen (through each $B_i$), and add them up weighted by how likely each scenario is.

Example: The probability it rains tomorrow is: $$P(\text{rain}) = P(\text{rain}|\text{cloudy}) \cdot P(\text{cloudy}) + P(\text{rain}|\text{sunny}) \cdot P(\text{sunny})$$

We used this principle in the medical testing example above.
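
As a small worked sketch with assumed numbers (a 40% chance of clouds, a 60% chance of rain if cloudy, 5% if sunny, and only those two weather states):

```python
# Assumed numbers: only two weather states, cloudy or sunny
p_cloudy = 0.40
p_rain_given_cloudy = 0.60
p_rain_given_sunny = 0.05

# Weight each scenario by how likely it is, then add
p_rain = p_rain_given_cloudy * p_cloudy + p_rain_given_sunny * (1 - p_cloudy)
print(round(p_rain, 2))  # 0.27
```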

Independence and Conditional Independence

Two events $A$ and $B$ are independent if: $$P(A|B) = P(A)$$

In words: knowing $B$ doesn’t change your probability of $A$. Equivalently, $P(A, B) = P(A) \cdot P(B)$.

Example: If you flip two fair coins, the result of the second flip is independent of the first.

But here’s a subtle point: conditional independence is different from independence.

$A$ and $B$ are conditionally independent given $C$ if: $$P(A|B, C) = P(A|C)$$

This means: once you know $C$, learning $B$ tells you nothing additional about $A$.

Example (DAG context): Consider the fork structure from earlier:

Temperature → Ice cream sales
Temperature → Drowning

Ice cream sales and drowning are not independent: $P(\text{drowning}|\text{ice cream sales}) \neq P(\text{drowning})$.

But they are conditionally independent given temperature: $$P(\text{drowning}|\text{ice cream}, \text{temperature}) = P(\text{drowning}|\text{temperature})$$

Once you know the temperature, ice cream sales tell you nothing additional about drowning risk. This is a key principle of DAGs: variables are conditionally independent given their parents in the graph.
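
You can see both facts in a simulation sketch of the fork (same made-up effect sizes as before): the marginal correlation is clearly positive, but within a narrow temperature band it collapses toward zero.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000

# Same fork as before: temperature is the common cause (illustrative effect sizes)
temperature = rng.normal(25, 5, n)
ice_cream = 20 * temperature + rng.normal(0, 50, n)
drownings = 0.5 * temperature + rng.normal(0, 3, n)

print("marginal:                 ", np.corrcoef(ice_cream, drownings)[0, 1])

# "Conditioning" on temperature: look only within a narrow temperature band
band = np.abs(temperature - 25) < 0.5
print("within a fixed-temp band: ", np.corrcoef(ice_cream[band], drownings[band])[0, 1])
```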

Summary

Key Takeaways:

  1. Statistical models are powerful but mindless tools—they need human guidance
  2. Correlation ≠ causation; data alone cannot reveal causal relationships
  3. DAGs formalize your causal assumptions and guide your statistical analysis
  4. “No causes in; no causes out”—causal conclusions require causal assumptions
  5. Different DAG structures (pipes, forks, colliders) require different analytical strategies
  6. Conditional probability $P(A|B)$ restricts attention to the cases where $B$ holds; in general $P(A|B) \neq P(B|A)$
  7. Conditional independence connects probability to DAGs: once you condition on a common cause, its downstream effects carry no extra information about each other

Next lecture: We’ll formalize probability as a tool for reasoning under uncertainty, introducing conditional probability and Bayes’ theorem as extensions of logic.
