The Method-Question-Data Triangle
2001-03-01
We often learn methods in isolation, during classes.
Then we want to apply those methods to our data.
This is particularly true for people who really enjoy analytical methods!
This is backwards!
The research question should drive everything!
```
        RESEARCH QUESTION
               /\
              /  \
             /    \
            /      \
           /        \
          /          \
DATA TYPE ------------ METHOD CHOICE
```
All three must align
Misalignment leads to invalid inference, misleading results, and wasted effort.
| Question Type | What You’re Looking For | Example Methods |
|---|---|---|
| Is there a difference? | Comparison | t-test, ANOVA, GLM |
| Is there a relationship? | Association | Regression, correlation |
| Can I predict? | Prediction | ML, regression |
| What’s the pattern? | Structure | Clustering, PCA, ordination |
Your question determines what “answer” looks like!
| Response Variable | Distribution | Common Methods |
|---|---|---|
| Continuous, normal | Gaussian | LM, ANOVA |
| Counts (0, 1, 2, …) | Poisson, NegBin | GLM |
| Binary (yes/no) | Binomial | Logistic regression |
| Proportions (0-1) | Binomial, Beta | GLM, Beta regression |
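As a rough sketch of how this table translates into R model calls (the variable names `response`, `predictor`, and the data frame `d` are placeholders, not objects from these examples):

```r
lm(response ~ predictor, data = d)                      # continuous, normal
glm(response ~ predictor, family = poisson,  data = d)  # counts (0, 1, 2, ...)
glm(response ~ predictor, family = binomial, data = d)  # binary (yes/no)
MASS::glm.nb(response ~ predictor, data = d)            # overdispersed counts
betareg::betareg(response ~ predictor, data = d)        # proportions in (0, 1)
```

The `family` argument is what connects your response type to the model's assumed distribution.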
Also consider the structure of your data: grouping, nesting, and repeated measures over time.
Before analyzing, ask yourself:
✅ What exactly is my question?
✅ What type of response variable do I have?
✅ What is my data structure (grouping, nesting, time)?
✅ Does my method handle all of this?
If you can’t answer these → STOP and think!
We’ll look at three examples:
1. Mismatch - method ignores key data structure
2. Good Match - method fits question and data
3. Overcomplicated - method is fancier than needed
Let’s see each one…
Scenario: a fertilizer experiment where control and fertilizer plots sit within each of several fields (plots nested in fields).
```
TRUE EFFECT: 3 units

WRONG MODEL (lm):
  Estimate:  3.38
  Std Error: 2.57
  p-value:   0.2043

CORRECT MODEL (lmer):
  Estimate:  3.38
  Std Error: 0.99
  p-value:   < 0.001
```
In this design (treatment within fields): ignoring the field structure inflates the standard error and hides a real effect.
In other designs (treatment between fields): ignoring grouping means pseudoreplication and false positives.
Lesson: Always check your independence assumption!
Scenario: comparing pollinator visit counts between native and non-native plants (a count response).
```
                 Estimate Std. Error  z value      Pr(>|z|)
(Intercept)     2.0412203  0.0657951 31.023896 2.567154e-211
treatmentnative 0.5761755  0.0822319  7.006715  2.439777e-12
```
Checklist:
✅ Response is a count (visits)
✅ Distribution matches the data type (Poisson)
✅ Observations are independent
Coefficient (log scale): 0.576
Multiplicative effect: exp(0.576) ≈ 1.78
Native plants get 77.9% more visits
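The back-transformation above can be computed directly from the fitted model (a sketch; `poisson_model` is the object fitted in the code at the end of these notes):

```r
b <- coef(poisson_model)["treatmentnative"]  # 0.576, on the log scale
exp(b)                                       # multiplicative effect, ~1.78
100 * (exp(b) - 1)                           # ~77.9% more visits
exp(confint(poisson_model))                  # CIs back on the response scale
```

Exponentiating both the coefficient and its confidence limits keeps the interpretation on the scale your audience cares about: visit counts, not log-counts.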
Lesson: Match your distribution to your data type!
Scenario: comparing mean tree height between deciduous and coniferous forests (a simple two-group difference).
The overkill:
“Let’s use a Bayesian hierarchical model with spatial autocorrelation, weakly informative priors, and MCMC sampling!”
Difference: 4.51 meters
95% CI: [2.20, 6.81]
p-value: 0.000239
Time to run: ~0.001 seconds
A Bayesian spatial model: ~5-10 minutes
Same answer!
Problems with overcomplication:
Principle of parsimony:
Use the simplest method that adequately addresses your question
Save fancy methods for when you need them!
I know this is challenging! When I learn about new methods, I want to use them ALL THE TIME. But resist the urge! Focus on the research question!
If you teach, you get to explore any methods you want in your lessons!
| Example | Problem | Consequence | Lesson |
|---|---|---|---|
| Mismatch | Ignored grouping structure | False positive risk | Check independence |
| Good Match | None - appropriate method | Valid inference | Match distribution |
| Overcomplicated | Unnecessary complexity | Wasted effort | Start simple |
Before you analyze, ask:
1️⃣ What is my question? (difference, relationship, prediction)
2️⃣ What is my response variable? (continuous, count, binary, proportion)
3️⃣ What is my data structure? (independent, grouped, nested, repeated)
4️⃣ Does my method handle all three?
. . .
. . .
Think-Pair-Share (5 min)
Look back at your silent reflection sheet. Based on the three examples:
- What method might be appropriate for YOUR data?
- What’s one thing about your data structure that makes method choice tricky?
Data Detective Stations
Grab your worksheet and let’s go!
```r
# ============================================================================
# EXAMPLE 1: MISMATCH - Ignoring nested structure
# Demonstrates FALSE POSITIVE from pseudoreplication
# ============================================================================
library(lme4)
library(ggplot2)
library(dplyr)

set.seed(42)
n_fields <- 5
n_plots_per_field <- 4

mismatch_data <- expand.grid(
  field = factor(1:n_fields),
  plot = 1:n_plots_per_field
) |>
  mutate(
    treatment = rep(c("control", "control", "fertilizer", "fertilizer"), n_fields),
    treatment = factor(treatment, levels = c("control", "fertilizer"))
  )

# Large field-to-field variation
field_effects <- data.frame(
  field = factor(1:n_fields),
  field_effect = c(-12, -5, 2, 8, 14)
)

true_effect <- 0  # NO TRUE EFFECT!

mismatch_data <- mismatch_data |>
  left_join(field_effects, by = "field") |>
  mutate(
    yield = 50 + field_effect +
      ifelse(treatment == "fertilizer", true_effect, 0) +
      rnorm(n(), mean = 0, sd = 1.5)
  ) |>
  # Create confounding between treatment and field quality
  mutate(
    yield = yield + ifelse(treatment == "fertilizer", field_effect * 0.15, 0)
  )

# Compare wrong vs correct
wrong_model <- lm(yield ~ treatment, data = mismatch_data)
correct_model <- lmer(yield ~ treatment + (1 | field), data = mismatch_data)

# Wrong model shows "significant" effect (p < 0.05)
summary(wrong_model)

# Correct model shows non-significant (as it should be - no true effect!)
summary(correct_model)
```

```r
# ============================================================================
# EXAMPLE 2: GOOD MATCH - Poisson GLM for counts
# ============================================================================
set.seed(2024)
n_per_group <- 30

goodmatch_data <- data.frame(
  plant_id = 1:(2 * n_per_group),
  treatment = factor(rep(c("native", "non_native"), each = n_per_group),
                     levels = c("non_native", "native"))
)

baseline_visits <- 8
native_effect <- 0.5  # Log-scale

goodmatch_data <- goodmatch_data |>
  mutate(
    log_mu = log(baseline_visits) +
      ifelse(treatment == "native", native_effect, 0),
    visits = rpois(n(), lambda = exp(log_mu))
  )

poisson_model <- glm(visits ~ treatment, family = poisson,
                     data = goodmatch_data)
summary(poisson_model)

# Interpretation
exp(coef(poisson_model)["treatmentnative"])  # Multiplicative effect
```

```r
# ============================================================================
# EXAMPLE 3: OVERCOMPLICATED - Simple question, complex method
# ============================================================================
set.seed(2024)
n_trees <- 30

overcomp_data <- data.frame(
  tree_id = 1:(2 * n_trees),
  forest_type = factor(rep(c("deciduous", "coniferous"), each = n_trees))
)

deciduous_mean <- 18
coniferous_mean <- 22
tree_sd <- 4

overcomp_data <- overcomp_data |>
  mutate(
    height = ifelse(forest_type == "deciduous",
                    rnorm(n(), deciduous_mean, tree_sd),
                    rnorm(n(), coniferous_mean, tree_sd))
  )

# Simple and appropriate!
t.test(height ~ forest_type, data = overcomp_data)

# Or equivalently
lm(height ~ forest_type, data = overcomp_data) |> summary()
```