Probability as Extended Logic
Statistics is not about numbers. It’s about reasoning under uncertainty.
Think about the questions you’ll ask as a researcher:
- Are these two variables related?
- Will this intervention work?
- What caused this observation?
- What will happen if we change something?
Statistical methods give you formal tools for answering these questions. But here’s the catch: statistics cannot tell you about causes without additional assumptions.
Statistical Models Are Powerful But Mindless
Statistical models are powerful tools, but they have a critical limitation: they do exactly what you tell them to do, without understanding your scientific question.
A regression model doesn’t know whether you’re predicting ice cream sales from temperature or temperature from ice cream sales. It will happily give you an answer either way—even if one interpretation is scientifically meaningless.
Statistical models:
- Follow instructions literally
- Have no understanding of causation
- Can produce absurd results if misapplied
- Require careful human guidance
Statistical Association vs. Causal Inference
Consider this dataset: as ice cream sales increase, drowning deaths increase.
Statistical question: Are these variables associated? Answer: Yes, they are positively correlated.
Causal question: Does ice cream cause drowning? Answer from data alone: Impossible to say.

The data alone cannot distinguish between:
- Ice cream → Drowning (implausible)
- Drowning → Ice cream (equally implausible)
- Temperature → Ice cream and Temperature → Drowning (plausible!)
This is the fundamental problem: correlation does not imply causation, and data alone cannot reveal causation.
Enter the DAG: Your Scientific Model
A Directed Acyclic Graph (DAG) is a visual representation of your causal assumptions.
Components:
- Nodes: Variables in your system
- Directed edges: Causal relationships (arrows point from cause to effect)
- Acyclic: No variable can cause itself (no loops)
Example: Ice Cream and Drowning
Temperature → Ice cream sales
Temperature → Drowning

This DAG encodes our theory:
- Temperature affects ice cream sales
- Temperature affects drowning rates (more people swim when it’s hot)
- Ice cream sales and drowning are not causally related
- Their correlation arises from their common cause: temperature
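To make this concrete, here is a minimal simulation of this structure (a Python/NumPy sketch; all coefficients are invented for illustration). Temperature drives both ice cream sales and drownings, and the two downstream variables come out strongly correlated even though neither causes the other.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000

# Common cause: daily temperature (degrees Celsius)
temperature = rng.normal(loc=20, scale=8, size=n)

# Temperature drives both downstream variables; neither affects the other
ice_cream_sales = 50 + 3.0 * temperature + rng.normal(0, 10, size=n)
drownings = 0.5 + 0.1 * temperature + rng.normal(0, 1, size=n)

# Strong marginal correlation despite no causal link between them
print(np.corrcoef(ice_cream_sales, drownings)[0, 1])  # roughly 0.6
```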

The Mantra: “No Causes In; No Causes Out”
Statistical models process data. They don’t generate causal knowledge from nothing.
If you want causal conclusions, you must input causal assumptions.
These assumptions come from:
- Domain knowledge
- Theory
- Previous experiments
- Physical/biological/social mechanisms
- Your DAG
Building Your First DAG
Let’s work through a real example: Does education increase income?
Step 1: Identify variables
- Education (years)
- Income ($/year)
Step 2: What causes what?
- Start simple: Education → Income
- But wait… are there other factors?
Step 3: Consider confounders
Confounders are variables that affect both treatment and outcome:
- Family wealth might affect both education (can afford college) and income (inheritance, connections)
- Intelligence might affect both education (academic success) and income (job performance)
Step 4: Draw the DAG
Family wealth → Education
Family wealth → Income
Intelligence → Education
Intelligence → Income
Education → Income

DAGs Tell You What to Measure
Your DAG determines your statistical analysis.
Without a DAG: “Let me try every possible model and see what works”
With a DAG: “My causal model says I need to condition on X and Y, so my statistical model should include them”
This is the bridge from science to statistics:
- DAG (science): what causes what
- Statistical model (tool): how to estimate effects given the causal structure
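Here is a sketch of what that bridge looks like in code (Python/NumPy; the data-generating numbers are invented for illustration). In a simulated world where family wealth raises both education and income, regressing income on education alone overstates the effect of education, while conditioning on wealth recovers it.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5_000

# Simulated world matching the DAG: wealth -> education, wealth -> income, education -> income
wealth = rng.normal(size=n)
education = 12 + 2.0 * wealth + rng.normal(size=n)                        # years
income = 20 + 5.0 * education + 10.0 * wealth + rng.normal(0, 5, size=n)  # true education effect = 5

# Naive model: income ~ education (ignores the confounder)
X_naive = np.column_stack([np.ones(n), education])
print(np.linalg.lstsq(X_naive, income, rcond=None)[0][1])  # biased well above 5

# Adjusted model: income ~ education + wealth (conditions on the confounder)
X_adj = np.column_stack([np.ones(n), education, wealth])
print(np.linalg.lstsq(X_adj, income, rcond=None)[0][1])    # close to the true effect of 5
```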
Common DAG Structures
1. The Pipe (Mediation)
X → Z → Y

- Z is a mediator: X affects Y through Z
- Example: Exercise → Cardiovascular health → Longevity
2. The Fork (Confounding)
X ← Z → Y

- Z is a confounder: Z causes both X and Y
- Example: Temperature → Ice cream & Drowning
3. The Collider
X → Z ← Y

- Z is a collider: both X and Y cause Z
- Example: Talent → Success ← Luck
- Danger: Conditioning on a collider creates spurious associations!
Collider Bias: If you condition on a collider, you induce a correlation between its causes—even if they’re independent! This is also known as “Berkson’s paradox” and is one form of selection bias.
Example: Among Hollywood actors (success = collider), talent and luck appear negatively correlated—because you need one or the other to succeed. But among the general population, they’re independent.
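A minimal simulation of the actor example (Python/NumPy; the success threshold is an arbitrary assumption for illustration): talent and luck are generated independently, yet among the “successful” subset they are negatively correlated.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

talent = rng.normal(size=n)
luck = rng.normal(size=n)

# Success is a collider: it depends on both talent and luck
success = (talent + luck) > 2.0  # arbitrary cutoff for "making it"

# Unconditionally, talent and luck are independent (correlation near 0)
print(np.corrcoef(talent, luck)[0, 1])

# Conditioning on the collider induces a clearly negative correlation
print(np.corrcoef(talent[success], luck[success])[0, 1])
```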
The Scientific Workflow
Here’s how you should approach any data analysis problem:
1. Ask a causal question: Does higher A cause lower B?
2. Draw a DAG: What are the causal relationships?
3. Identify the estimand: What do we need to measure/control?
4. Choose a statistical model: The tool that implements your strategy
5. Validate assumptions: Does your model make sense?
6. Interpret carefully: What does your answer actually mean?
Why This Matters
Many data analysis mistakes stem from skipping the DAG:
Wrong: “I’ll include all available variables and let the model sort it out”
Right: “My DAG says which variables matter and how”
Wrong: “Statistical significance means it’s important”
Right: “Statistical significance means it’s probably not zero—causal importance comes from the DAG”
Wrong: “The data will reveal the truth”
Right: “The data + my causal model can test specific hypotheses”
The Language of Probability
Before we go further, we need to establish some notation. You’ll see expressions like these throughout this course, and you need to know what they actually mean.
What Does $P(X)$ Mean?
$P(X)$ is the probability distribution of the variable $X$. It tells you how likely different values of $X$ are.
When you see: $$X \sim \mathcal{N}(0, 1)$$
This reads as: “$X$ is distributed as a Normal distribution with mean 0 and standard deviation 1.” The tilde ($\sim$) means “is distributed as.”
More generally:
- $X \sim \mathcal{N}(\mu, \sigma)$ means $X$ follows a Normal (Gaussian) distribution with mean $\mu$ and standard deviation $\sigma$
- $X \sim \text{Uniform}(a, b)$ means $X$ is uniformly distributed between $a$ and $b$
- $X \sim \text{Poisson}(\lambda)$ means $X$ follows a Poisson distribution with rate $\lambda$
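These statements map directly onto code. Here is a quick sketch using NumPy’s random number generator (the specific parameter values are just examples):

```python
import numpy as np

rng = np.random.default_rng(123)

x_normal = rng.normal(loc=0, scale=1, size=1000)    # X ~ Normal(0, 1)
x_uniform = rng.uniform(low=0, high=10, size=1000)  # X ~ Uniform(0, 10)
x_poisson = rng.poisson(lam=3, size=1000)           # X ~ Poisson(3)

# The samples' summary statistics reflect each distribution's parameters
print(x_normal.mean(), x_normal.std())   # near 0 and 1
print(x_uniform.min(), x_uniform.max())  # within [0, 10]
print(x_poisson.mean())                  # near 3
```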
The Three Axioms of Probability
All of probability theory rests on three simple axioms. These were formalized by Andrey Kolmogorov in 1933, and they define what probability actually means.
Axiom 1: Non-negativity
Probabilities are never negative: $$P(A) \geq 0$$
for any event $A$. You can’t have a -20% chance of something happening.
Axiom 2: Normalization
The probability that something happens is 1: $$P(\Omega) = 1$$
where $\Omega$ (capital omega) represents all possible outcomes. If you roll a die, you’re guaranteed to get some number between 1 and 6.
Axiom 3: Additivity
If two events $A$ and $B$ cannot both happen (they’re mutually exclusive), then: $$P(A \text{ or } B) = P(A) + P(B)$$
Example: When rolling a die, the probability of getting a 1 or a 6 is: $$P(1 \text{ or } 6) = P(1) + P(6) = \frac{1}{6} + \frac{1}{6} = \frac{1}{3}$$
This works because you can’t roll both a 1 and a 6 at the same time.
What Follows From These Axioms
From these three simple rules, you can derive all of probability theory. For example:
The sum rule: The total probability across all possible outcomes must equal 1. $$\sum_i P(X = x_i) = 1$$
The complement rule: The probability that $A$ doesn’t happen is: $$P(\text{not } A) = 1 - P(A)$$
Bayes’ theorem (which we’ll explore in detail later): $$P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}$$
These axioms seem simple, but they’re incredibly powerful. They give us a consistent framework for reasoning about uncertainty.
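As a sanity check, a tiny die-rolling simulation (a Python/NumPy sketch) shows the empirical frequencies obeying normalization, additivity, and the complement rule:

```python
import numpy as np

rng = np.random.default_rng(7)
rolls = rng.integers(low=1, high=7, size=100_000)  # fair six-sided die

# Empirical probability of each face: all non-negative, summing to 1
p = np.array([(rolls == face).mean() for face in range(1, 7)])
print(p.sum())  # 1.0 (normalization / sum rule)

# Additivity: P(1 or 6) = P(1) + P(6), both close to 1/3
print(((rolls == 1) | (rolls == 6)).mean(), p[0] + p[5])

# Complement rule: P(not 6) = 1 - P(6)
print((rolls != 6).mean(), 1 - p[5])
```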
Conditional Probability: $P(X|A)$
The vertical bar $|$ means “given” or “conditional on.”
$P(X|A)$ reads as: “the probability distribution of $X$ given that $A$ is true.”
Example:
- $P(\text{rain})$ = probability it will rain today
- $P(\text{rain}|\text{cloudy})$ = probability it will rain given that it’s cloudy
Conditioning on information changes the probability distribution. If you know it’s cloudy, your probability of rain goes up.
Formal Definition
Conditional probability is defined as: $$P(A|B) = \frac{P(A \text{ and } B)}{P(B)}$$
provided that $P(B) > 0$. In words: the probability of $A$ given $B$ equals the probability that both $A$ and $B$ occur, divided by the probability of $B$.
Intuition: We’re restricting our sample space to only those cases where $B$ is true, then asking how often $A$ also occurs within that restricted space.
Concrete Example: Medical Testing
Suppose:
- 1% of people have a disease: $P(\text{disease}) = 0.01$
- A test is 95% accurate for sick people: $P(\text{positive}|\text{disease}) = 0.95$
- The test has a 10% false positive rate: $P(\text{positive}|\text{no disease}) = 0.10$
Question: If you test positive, what’s the probability you have the disease?
Many people guess 95%, but that’s wrong! We need $P(\text{disease}|\text{positive})$, not $P(\text{positive}|\text{disease})$—these are different.
Let’s calculate it properly. First, what’s $P(\text{positive})$?
$$P(\text{positive}) = P(\text{positive}|\text{disease}) \cdot P(\text{disease}) + P(\text{positive}|\text{no disease}) \cdot P(\text{no disease})$$
$$= 0.95 \times 0.01 + 0.10 \times 0.99 = 0.0095 + 0.099 = 0.1085$$
Now using the definition of conditional probability:
$$P(\text{disease}|\text{positive}) = \frac{P(\text{positive and disease})}{P(\text{positive})} = \frac{0.0095}{0.1085} \approx 0.088$$
Only 8.8%! Even with a positive test, you probably don’t have the disease, because the disease is rare and false positives are common.
This is a profound result: the order of conditioning matters. $P(A|B) \neq P(B|A)$ in general.
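The same calculation in code, using the numbers given above (plain Python):

```python
p_disease = 0.01
p_pos_given_disease = 0.95     # test sensitivity
p_pos_given_no_disease = 0.10  # false positive rate

# Law of total probability: P(positive)
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_no_disease * (1 - p_disease))

# Definition of conditional probability (Bayes' theorem): P(disease | positive)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos

print(p_pos)                # 0.1085
print(p_disease_given_pos)  # about 0.088
```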
Multiple Conditions: $P(X|A, B)$
You can condition on multiple things at once:
$P(X|A, B)$ reads as: “the probability distribution of $X$ given both $A$ and $B$.”
Example:
- $P(\text{rain}|\text{cloudy}, \text{summer})$ = probability of rain given that it’s cloudy AND it’s summer
The comma between $A$ and $B$ means “and” in this context.
The Multiplication Rule (Chain Rule)
From the definition of conditional probability, we can derive the multiplication rule:
$$P(A \text{ and } B) = P(A|B) \cdot P(B) = P(B|A) \cdot P(A)$$
This generalizes to multiple events: $$P(A, B, C) = P(A|B, C) \cdot P(B|C) \cdot P(C)$$
Example: What’s the probability of drawing two aces in a row from a standard deck (without replacement)?
$$P(\text{ace}_1, \text{ace}_2) = P(\text{ace}_2|\text{ace}_1) \cdot P(\text{ace}_1) = \frac{3}{51} \times \frac{4}{52} = \frac{1}{221}$$
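A quick Monte Carlo check of this result (Python/NumPy; encoding the four aces as cards 0–3 is just one convenient way to set up the simulation):

```python
import numpy as np

rng = np.random.default_rng(2024)
n_trials = 100_000

deck = np.arange(52)  # 52 cards; values 0-3 represent the four aces
hits = 0
for _ in range(n_trials):
    draw = rng.choice(deck, size=2, replace=False)  # two cards without replacement
    if (draw < 4).all():                            # both cards are aces
        hits += 1

print(hits / n_trials, 1 / 221)  # both around 0.0045
```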
The Law of Total Probability
If $B_1, B_2, \ldots, B_n$ are mutually exclusive events that cover all possibilities (they partition the sample space), then:
$$P(A) = \sum_{i=1}^{n} P(A|B_i) \cdot P(B_i)$$
This is incredibly useful. It says: to find the total probability of $A$, consider all the different ways $A$ can happen (through each $B_i$), and add them up weighted by how likely each scenario is.
Example: The probability it rains tomorrow is: $$P(\text{rain}) = P(\text{rain}|\text{cloudy}) \cdot P(\text{cloudy}) + P(\text{rain}|\text{sunny}) \cdot P(\text{sunny})$$
We used this principle in the medical testing example above.
Independence and Conditional Independence
Two events $A$ and $B$ are independent if: $$P(A|B) = P(A)$$
In words: knowing $B$ doesn’t change your probability of $A$. Equivalently, $P(A, B) = P(A) \cdot P(B)$.
Example: If you flip two fair coins, the result of the second flip is independent of the first.
But here’s a subtle point: conditional independence is different from independence.
$A$ and $B$ are conditionally independent given $C$ if: $$P(A|B, C) = P(A|C)$$
This means: once you know $C$, learning $B$ tells you nothing additional about $A$.
Example (DAG context): Consider the fork structure from earlier:
Temperature → Ice cream sales
Temperature → Drowning

Ice cream sales and drowning are not independent: $P(\text{drowning}|\text{ice cream sales}) \neq P(\text{drowning})$.
But they are conditionally independent given temperature: $$P(\text{drowning}|\text{ice cream}, \text{temperature}) = P(\text{drowning}|\text{temperature})$$
Once you know the temperature, ice cream sales tell you nothing additional about drowning risk. This is a key principle of DAGs: each variable is conditionally independent of its non-descendants given its parents in the graph.
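We can check this numerically by reusing the fork simulation from earlier (Python/NumPy; effect sizes invented for illustration): marginally, ice cream sales predict drownings, but within a narrow temperature band the association essentially vanishes.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000

temperature = rng.normal(20, 8, size=n)
ice_cream = 50 + 3.0 * temperature + rng.normal(0, 10, size=n)
drownings = 0.5 + 0.1 * temperature + rng.normal(0, 1, size=n)

# Marginal association: clearly positive correlation
print(np.corrcoef(ice_cream, drownings)[0, 1])

# Conditioning on temperature (a narrow band around 20 C): correlation near 0
band = np.abs(temperature - 20) < 0.5
print(np.corrcoef(ice_cream[band], drownings[band])[0, 1])
```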
Summary
Key Takeaways:
- Statistical models are powerful but mindless tools—they need human guidance
- Correlation ≠ causation; data alone cannot reveal causal relationships
- DAGs formalize your causal assumptions and guide your statistical analysis
- “No causes in; no causes out”—causal conclusions require causal assumptions
- Different DAG structures (pipes, forks, colliders) require different analytical strategies
- Conditional probability is the language for reasoning with partial information; in general $P(A|B) \neq P(B|A)$
- Conditional independence connects probability to DAG structure: conditioning on a common cause removes the association it induces
Next lecture: We’ll build on the probability tools introduced here, developing Bayes’ theorem into a framework for reasoning under uncertainty.