5 Statistical Foundations
This section is still under construction
The goal of this section is to build an intuition for what statistical tests accomplish, and what the assumptions are.
A plug for Intuitive Biostatistics by Harvey Motulsky - it is excellent and highly recommended: https://a.co/d/4NCk2bS
By the end of this page, you should have an intuitive sense of what a p-value is (and what it is not) and the role that it plays in the argument of a scientific report.
To forecast the punchline: We seek to make arguments that observed relationships are not explained by confounding, bias, or chance - and therefore must be causation. P-values summarize how surprising it would be to see the observed data, assuming there is nothing going on and the assumptions of the test hold.
5.1 The Disjunctive Syllogism
We’ll start out abstract, and become more concrete.
The root of the problem is that we don’t directly observe cause and effect. Instead, we must make an argument for it.
The structure of the usual scientific argument mirrors the famous Sherlock Holmes quote:
When you have eliminated the impossible, whatever remains, however improbable, must be the truth.
This “argument-by-excluding-alternatives” is termed the disjunctive syllogism. NOTE: this isn’t the only possible style of argument - for example, likelihoodism (and by extension, Bayesianism) argues for a hypothesis by showing that the observed data are much more likely under it than under the alternative hypotheses.
In the language of epidemiology, we’re interested in the relationship between an “exposure” (meant broadly - it could refer to a treatment, an occupational exposure, a characteristic, etc.) and an “outcome” (also meant broadly - it could be the occurrence of an event, a side effect, a health state, etc.).
If an exposure and an outcome are associated, there are 4 possible explanations:
- Confounding (some other factor, the confounder, influences the likelihood of the exposure and the likelihood of the outcome through other mechanisms)
- Bias (some non-random distortion of the measurements in a study)
- Chance
- Or, causation (meaning, a real effect)
Thus, the way we’ll seek evidence for causation is indirect. First, we’ll show there is an association. Then, we’ll make arguments against the possibility that confounding, bias, or chance explains the association. If the reader accepts that there is an association but that the first three explanations are not plausible, they’re left accepting causation as the explanation.
5.2 What’s a P-value good for?
To provide evidence for cause and effect, we need compelling arguments against confounding, bias, and chance. Arguments against confounding are best made through study design (e.g. randomization) but can also be addressed using statistical methods (see Chapter 6). Bias is best addressed by careful choice of instruments/assessments, but can also be mitigated by elements of study design (e.g. blinding).
Whether chance is a plausible explanation is where P-values come in. Repeated for emphasis: P-values ONLY address the plausibility of chance explaining an apparent association. A low p-value tells you nothing about whether confounding or bias could explain an association - you need other arguments for that.
5.2.1 Intuitively, what is a p-value?
If there were no association, how surprising would it be to see the observed data?
If the p-value is large, it would not be very surprising to see the observed data by the play of chance alone.
If the p-value is small, it would be quite surprising to see the observed data if there’s really nothing going on.
Note: the P-value is not comparing one hypothesis to another (i.e. an alternative hypothesis). Nor is it saying how likely it would be to find an effect if there is something going on - it’s just saying: “if there’s actually nothing going on here (i.e. no causal effect, no confounding, no bias), how unlikely would this finding be?”
This is a conditional probability: “IF nothing is going on” (that is, assume nothing is going on), how unlikely would data at least this extreme be? The notation is P(observed data, or more extreme | nothing going on).
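If a quick simulation helps build that intuition, here is a minimal sketch (the group sizes, event counts, and the 24% baseline risk are invented for illustration, not taken from any study): we impose the “nothing going on” world and ask how often chance alone produces a difference at least as large as one we pretend to have observed.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: two groups of 100 patients, and we pretend we
# observed 30 vs 18 events (a difference of 12 events).
n_per_group, observed_diff = 100, 12

# "Nothing going on": both groups share the same true event risk (say 24%).
null_risk = 0.24

# Simulate many repetitions of the study in that null world.
sims = 20_000
events_a = rng.binomial(n_per_group, null_risk, sims)
events_b = rng.binomial(n_per_group, null_risk, sims)

# P(difference at least this extreme | nothing going on) -- a two-sided p-value.
p_value = np.mean(np.abs(events_a - events_b) >= observed_diff)
print(f"Simulated p-value: {p_value:.3f}")
```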
Box: note that p-values are NOT a measure of how strong the evidence is. A finding with p-value 0.03 isn’t necessarily less robust than one with p-value 0.001 - even if there is no bias or confounding. Consider: a huge study can produce a very small p-value (chance is unlikely, but a small amount of bias, confounding, or a trivial effect could still be the explanation), while a smaller study can produce a larger p-value for a bigger effect. This tension is known as “Lindley’s Paradox”.
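To see the sample-size point from the box in action, here is a hedged sketch (the effect sizes and sample sizes are made up): a huge study of a trivial effect will typically produce a far smaller p-value than a small study of a larger effect.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Huge study, trivially small true difference in means (0.05 units).
big_a = rng.normal(0.00, 1.0, 200_000)
big_b = rng.normal(0.05, 1.0, 200_000)

# Small study, ten times larger true difference (0.5 units).
small_a = rng.normal(0.0, 1.0, 30)
small_b = rng.normal(0.5, 1.0, 30)

# The huge study yields an astronomically small p-value despite a trivial
# effect; the small study typically yields a much larger p-value.
print("huge study / trivial effect:  p =", stats.ttest_ind(big_a, big_b).pvalue)
print("small study / larger effect:  p =", stats.ttest_ind(small_a, small_b).pvalue)
```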
5.2.2 How surprising is too surprising?
Arguing against coincidence raises the question: how surprising does something have to be before you say “there must be something going on here!”?
Luckily, you can convert P-values into coin flips to get an intuitive sense (look inside the box for the actual conversion, but feel free to skip it if logarithms make you queasy).
Box: “Shannon Transform: S-value”
In a world where the null is true (i.e. heads and tails are equally likely on every flip), you can characterize how surprising it would be to see a particular sequence. This is calculated as S-value = -log2(P-value).
Here’s the setup… say I start flipping a coin, how many consecutive “Heads” need to occur before you’ll suspect it’s not a fair coin?
| Sequence | Flips (S-value) | P-value |
|---|---|---|
| HH | 2 flips | 0.25 |
| HHH | 3 flips | 0.125 |
| HHHH | 4 flips | 0.0625 |
| (the 0.05 threshold) | 4.32 flips | 0.05 |
| HHHHH | 5 flips | 0.03125 |
More at: https://stat.lesslikely.com/s-values/
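If you want to play with the conversion yourself, here is a tiny sketch of the S-value calculation (the listed p-values are just examples):

```python
import math

# S-value: how many consecutive heads from a fair coin would be
# equally surprising as a given p-value.
def s_value(p: float) -> float:
    return -math.log2(p)

for p in (0.25, 0.0625, 0.05, 0.03125, 0.005):
    print(f"P = {p:<7} -> about {s_value(p):.2f} consecutive heads")
```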
IMPORTANT POINT: the p-value is NOT the probability that the coin is biased. It is the probability of seeing that result (or a more extreme one) ASSUMING the coin is fair.
So, to answer the question from the lede - how surprising is too surprising for chance to be a compelling explanation? Since we, as a community, have agreed that P = 0.05 is the threshold, anything a bit more surprising than 4 consecutive heads from a fair coin is too surprising to chalk up to chance.
Box: Understanding multiplicity: if we all flipped a coin 5 times, what is the chance that at least one of us would get 5 heads in a row?
- With 10 people, the chance that nobody gets HHHHH is (1 - 0.03125)^10 ≈ 0.73, so there is roughly a 27% chance of at least one HHHHH - and we’re back to weak evidence of a real effect (the short sketch after this box walks through the arithmetic).
- This is obviously a problem when multiple tests are reported, but it is less obviously also a problem if you try several analyses and choose the “best one” after seeing the results. Hence, prespecification.
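Here is the arithmetic from the box as a short sketch, extended to a few other group sizes:

```python
# Probability that at least one of k fair-coin "experiments" (5 flips each)
# produces 5 heads in a row -- i.e., at least one false alarm by chance alone.
p_single = 0.5 ** 5  # 0.03125

for k in (1, 10, 50):
    p_at_least_one = 1 - (1 - p_single) ** k
    print(f"{k:>3} people flipping: P(at least one HHHHH) = {p_at_least_one:.3f}")
```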
Box: What, then, do confidence intervals mean?
Note - it’s NOT that there’s a 95% chance that the true value is within this range. Rather: if you executed this method many times, you’d expect the computed interval to miss the true value about 5% of the time.
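A minimal simulation of that “repeat the method many times” interpretation (the true mean, spread, and sample size are invented): most of the intervals cover the truth, and roughly 5% miss it.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical example: the true mean is 10. Repeat a small study many
# times, compute a 95% CI each time, and count how often it misses the truth.
true_mean, n, reps = 10.0, 30, 5_000
misses = 0
for _ in range(reps):
    sample = rng.normal(loc=true_mean, scale=4.0, size=n)
    lo, hi = stats.t.interval(0.95, df=n - 1,
                              loc=sample.mean(),
                              scale=stats.sem(sample))
    if not (lo <= true_mean <= hi):
        misses += 1

print(f"Intervals missing the true mean: {misses / reps:.1%}")  # roughly 5%
```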
5.2.3 How do you choose the right test?
We have to go one step further to understand where the P-value comes from, because of one sneaky clause that I snuck into the definition of the p-value: “P-values summarize how surprising it would be to see the observed data, assuming there is nothing going on and the assumptions of the test hold.”
Like regression models (covered in Chapter 6), there are many statistical tests, each of which requires assumptions. The main characteristics of the data that let you choose an appropriate test are: how many groups there are, the size of the groups, the type of outcome variable, and whether measurements are independent of each other (i.e. knowing one observation tells you nothing about another) or correlated (e.g. observations from the same patient at two different times).
| Level of measurement of outcome variable | Two Independent Groups | Three or more Independent Groups | Two Correlated* Samples | Three or more Correlated* Samples |
|---|---|---|---|---|
| Dichotomous (e.g. yes/no) | chi-square or Fisher’s exact test | chi-square or Fisher-Freeman-Halton test | McNemar test | Cochran Q test |
| Unordered categorical (e.g. red/blue/grey) | chi-square or Fisher-Freeman-Halton test | chi-square or Fisher-Freeman-Halton test | Stuart-Maxwell test | Multiplicity-adjusted Stuart-Maxwell tests |
| Ordered categorical (e.g. bad, neutral, good) | Wilcoxon-Mann-Whitney (WMW) test | Old school: Kruskal-Wallis analysis of variance (ANOVA); New school: multiplicity-adjusted WMW tests | Wilcoxon signed-rank test | Old school: Friedman two-way ANOVA by ranks; New school: multiplicity-adjusted Wilcoxon signed-rank tests |
| Continuous (e.g. BMI) | independent-groups t-test | Old school: one-way ANOVA; New school: multiplicity-adjusted independent-groups t-tests | paired t-test | mixed-effects linear regression |
| Time to event (e.g. survival) | log-rank test | Multiplicity-adjusted log-rank test | Shared-frailty Cox regression | Shared-frailty Cox regression |

* Correlated samples: repeated measurements on the same unit, e.g. observations from the same patient at two different times.
From: Stoddard GJ. Biostatistics and Epidemiology Using Stata: A Course Manual. Salt Lake City, UT: University of Utah School of Medicine.
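As a rough illustration of a few cells in the table, here is a sketch using scipy.stats on made-up data (group sizes, means, and counts are invented; this is not a substitute for matching the test to your actual design):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Made-up continuous outcome (e.g. BMI) in two independent groups.
group_a = rng.normal(loc=27.0, scale=4.0, size=40)
group_b = rng.normal(loc=29.5, scale=4.0, size=40)

# Continuous outcome, two independent groups -> independent-groups t-test.
t_stat, p_t = stats.ttest_ind(group_a, group_b)

# If we only trust the ordering of values (ordered categorical or clearly
# non-normal data), the Wilcoxon-Mann-Whitney test from the table applies.
u_stat, p_u = stats.mannwhitneyu(group_a, group_b)

# Dichotomous outcome, two independent groups -> chi-square or Fisher's exact.
table = np.array([[30, 70],   # exposed: events / non-events (hypothetical counts)
                  [18, 82]])  # unexposed
chi2, p_chi2, dof, _ = stats.chi2_contingency(table)
_, p_fisher = stats.fisher_exact(table)

print(f"t-test p={p_t:.3f}, WMW p={p_u:.3f}, "
      f"chi-square p={p_chi2:.3f}, Fisher p={p_fisher:.3f}")
```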
Side-box: How can you collaborate effectively with a statistician? They will know these assumptions and can tell you when your analysis makes dubious ones (if you communicate the constraints of the problem correctly). Thus, the reason to know about these things, even if you’re enlisting the help of a statistician, is to recognize when a decision bakes assumptions into your analysis - and to help the statistician understand the clinical situation so they can match the analysis to it.
5.3 How do you formulate your scientific question as a statistically testable hypothesis?
Without a doubt, the most common error trainees make when formulating their scientific hypothesis as a (statistically) testable hypothesis is not realizing that you must define a null hypothesis in which there is no effect, and then find evidence against it - so-called Null Hypothesis Significance Testing. It goes something like this:
Scientifically, I suspect that high PaCO2 levels indicate a patient is at higher risk of 30-day readmission
The null hypothesis is that PaCO2 levels are not associated with risk of 30-day readmission
I then evaluate the numbers of readmissions across the range of PaCO2 levels and ask the question: is there a substantial enough excess in readmissions among patients with higher PaCO2 levels that it’s unlikely to be explainable just by random variation?
If it’s more surprising than getting heads 4.32 times in a row, we say that chance alone is not a plausible explanation.
(Note: I’ll need to make separate arguments that neither confounders nor biased assessment of PaCO2 or readmission explains the association, if I want to make a convincing argument that it is the PaCO2 per se - for more on that, see Chapter 6.)
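A minimal sketch of that workflow with invented counts (dichotomizing PaCO2 into “high” vs “normal” purely for illustration - the counts and the cut-point are assumptions, not data):

```python
from scipy import stats

# Hypothetical 2x2 counts: readmitted vs not, by PaCO2 group.
#                 readmitted   not readmitted
high_paco2   = [22, 78]
normal_paco2 = [11, 89]

# Null hypothesis: 30-day readmission risk does not differ by PaCO2 group.
chi2, p_value, dof, expected = stats.chi2_contingency([high_paco2, normal_paco2])

print(f"p = {p_value:.3f}")
if p_value < 0.05:
    print("More surprising than ~4.32 heads in a row: chance alone is a poor explanation.")
else:
    print("Not surprising enough to rule out chance as an explanation.")
```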
5.3.1 Severity
A last topic that I’ve included here - one that would usually be introduced later, but that I think warrants early exposure - is statistical severity. If we zoom out, we see that we’re generating exclusionary evidence against the “null” hypothesis - but we’re not necessarily summarizing the strength of evidence in favor of a hypothesis.
How should we summarize how much evidence we have for a hypothesis? The idea is that we have evidence in favor of a theory in proportion to the strength of the challenges it has survived. For example, if a theory makes a prediction that would be extremely unlikely to hold unless the theory were true - e.g. Einstein predicting the gravity-induced curvature of light - and the evidence does NOT disprove it, that’s pretty strong evidence, because the test was severe.
Similarly, if you prespecify an analysis, stick to the statistical analysis plan, and find an effect, that’s a much stronger challenge than if you had many ways to slice the data or analyze the effect and chose one after the fact. Prespecification is more severe.
Applied to your design - you want to balance power (see Chapter 4) and severity.
5.4 Resources
Visualizations:
- https://sites.google.com/view/ben-prytherch-shiny-apps/shiny-apps