Matching Methods to Research Questions

The Method-Question-Data Triangle

Dr. Molina

2001-03-01

The Problem

  • We often learn methods in isolation, one class at a time:

    • “This is a t-test”
    • “This is ANOVA”
    • “This is regression”
    • “This is Bayesian”
  • Then we go looking for ways to apply those methods to our data.

  • This is particularly true for people who really enjoy analytical methods!

  • This is backwards!

The research question should drive everything!

The Triangle Framework

         RESEARCH QUESTION
                /\
               /  \
              /    \
             /      \
            /        \
           /          \
    DATA TYPE -------- METHOD CHOICE

All three must align

Misalignment leads to:

  • Wrong conclusions
  • Wasted effort
  • Rejected papers
  • Sad grad students! 😦

Research Question Drives Everything

Question Type            | What You’re Looking For | Example Methods
Is there a difference?   | Comparison              | t-test, ANOVA, GLM
Is there a relationship? | Association             | Regression, correlation
Can I predict?           | Prediction              | ML, regression
What’s the pattern?      | Structure               | Clustering, PCA, ordination

Your question determines what “answer” looks like!

Data Type Constrains Your Options

Response Variable   | Distribution    | Common Methods
Continuous, normal  | Gaussian        | LM, ANOVA
Counts (0, 1, 2, …) | Poisson, NegBin | GLM
Binary (yes/no)     | Binomial        | Logistic regression
Proportions (0-1)   | Binomial, Beta  | GLM, Beta regression

Also consider:

  • Independence vs. grouping
  • Repeated measures
  • Nested/hierarchical structure
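
As a minimal sketch of how the response type maps onto a model call, the chunk below simulates three hypothetical responses (the variable names, sample size, and effect sizes are assumptions made for illustration, not course data) and fits each with the matching `glm()` family.

```r
# Hypothetical simulated responses; names and effect sizes are illustrative only
set.seed(1)
n <- 200
x <- rnorm(n)

y_gauss  <- 2 + 0.5 * x + rnorm(n)                               # continuous
y_count  <- rpois(n, lambda = exp(0.5 + 0.3 * x))                # counts
y_binary <- rbinom(n, size = 1, prob = plogis(-0.2 + 0.8 * x))   # yes/no

fit_gauss  <- glm(y_gauss  ~ x, family = gaussian)  # same fit as lm()
fit_count  <- glm(y_count  ~ x, family = poisson)   # log link by default
fit_binary <- glm(y_binary ~ x, family = binomial)  # logit link by default

# Which link function does each family use by default?
links <- sapply(
  list(gaussian = fit_gauss, poisson = fit_count, binomial = fit_binary),
  function(m) m$family$link
)
links
```

Only the `family` argument changes between the three calls. The grouping and nesting questions in the list above are a separate decision that the family does not settle.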

The Alignment Check ✅

Before analyzing, ask yourself:

  1. ✅ What exactly is my question?

  2. ✅ What type of response variable do I have?

  3. ✅ What is my data structure (grouping, nesting, time)?

  4. ✅ Does my method handle all of this?

If you can’t answer these → STOP and think!

Three Common Scenarios

We’ll look at three examples:

  1. Mismatch - Method ignores key data structure

  2. Good Match - Method fits question and data

  3. Overcomplicated - Method is fancier than needed

Let’s see each one…

Example 1: MISMATCH

Scenario:

  • Testing fertilizer effect on crop yield
  • 5 fields, 4 plots per field
  • 2 control plots, 2 fertilized plots per field

The analysis:

lm(yield ~ treatment)
  • Treats all 20 observations as independent!
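
One quick way to spot this kind of structure before modeling is to cross-tabulate the grouping variable against the treatment. The sketch below rebuilds only the design described above (5 fields, 4 plots each); the object names are assumptions made for illustration.

```r
# Rebuild the design only (no yields): 5 fields x 4 plots
design <- expand.grid(plot = 1:4, field = factor(1:5))
design$treatment <- factor(
  rep(c("control", "control", "fertilizer", "fertilizer"), times = 5)
)

# Rows = fields, columns = treatments
design_table <- with(design, table(field, treatment))
design_table
```

Every cell holds 2 plots, so each field contains both treatments. The 20 rows are really 5 clusters of 4, which is exactly what a plain `lm()` ignores.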

Example 1: The Problem

Example 1: Wrong vs. Right

# WRONG: Ignores field structure
wrong_model <- lm(yield ~ treatment, data = mismatch_data)

# RIGHT: Accounts for field
correct_model <- lmer(yield ~ treatment + (1|field), data = mismatch_data)
TRUE EFFECT: 3 units
WRONG MODEL (lm):
  Estimate: 3.38 
  Std Error: 2.57 
  p-value: 0.2043 
CORRECT MODEL (lmer):
  Estimate: 3.38 
  Std Error: 0.99 
  p-value: 0 

Example 1: The Fix

In this design (treatment within fields):

  • The wrong model has inflated standard errors because it treats field variance as residual noise
  • The correct model separates out field variance → a more precise estimate of the treatment effect
  • Ignoring structure here costs you power: the real effect of 3 units goes undetected (p = 0.20)

In other designs (treatment between fields):

  • Ignoring structure would inflate Type I error instead
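
The contrast between the two designs can be sketched with a small base-R simulation. To avoid extra packages, this sketch swaps `lmer()` for two simpler stand-ins: a fixed field term for the within-field design and an analysis of field means for the between-field design. All field effects, sample sizes, and standard deviations are assumed values chosen for illustration.

```r
set.seed(123)

# Design A: both treatments occur in every field (as in Example 1)
within_design <- expand.grid(plot = 1:4, field = factor(1:5))
within_design$treatment <- factor(
  rep(c("control", "control", "fertilizer", "fertilizer"), times = 5)
)
field_shift <- c(-12, -5, 2, 8, 14)  # assumed field-to-field differences
within_design$yield <- 50 + field_shift[as.integer(within_design$field)] +
  3 * (within_design$treatment == "fertilizer") + rnorm(20, sd = 1.5)

naive   <- lm(yield ~ treatment, data = within_design)          # ignores field
blocked <- lm(yield ~ treatment + field, data = within_design)  # accounts for field
se_naive   <- summary(naive)$coefficients["treatmentfertilizer", "Std. Error"]
se_blocked <- summary(blocked)$coefficients["treatmentfertilizer", "Std. Error"]
c(naive = se_naive, blocked = se_blocked)  # same estimate, very different SE

# Design B: whole fields are treated or not, and there is NO true effect
one_null_sim <- function(n_fields = 6, n_plots = 4, field_sd = 3, plot_sd = 1) {
  field   <- rep(seq_len(n_fields), each = n_plots)
  treated <- as.integer(field > n_fields / 2)
  yield   <- rnorm(n_fields, sd = field_sd)[field] +
    rnorm(n_fields * n_plots, sd = plot_sd)
  # Plot-level analysis: treats all plots as independent
  p_plots <- summary(lm(yield ~ treated))$coefficients["treated", "Pr(>|t|)"]
  # Field-level analysis: one mean per field, so n = number of fields
  field_means <- tapply(yield, field, mean)
  field_trt   <- tapply(treated, field, max)
  p_fields <- t.test(field_means[field_trt == 1], field_means[field_trt == 0],
                     var.equal = TRUE)$p.value
  c(plots = p_plots, fields = p_fields)
}

p_values <- replicate(500, one_null_sim())
type1 <- rowMeans(p_values < 0.05)  # share of false positives per analysis
type1
```

In Design A the two models give the same treatment estimate, but the naive standard error is much larger. In Design B the plot-level analysis rejects a true null far more often than the nominal 5%, while the field-level analysis stays close to it.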


# Wrong
lm(yield ~ treatment)

# Right  
lmer(yield ~ treatment + (1|field))

Lesson: Always check your independence assumption!

Example 2: GOOD MATCH

Scenario:

  • Pollinator visits to flowers
  • 2 treatments: native vs. non-native plants
  • 30 plants per treatment
  • Response: count of visits (0 to ~45)

The approach:

glm(visits ~ treatment, family = poisson)

Example 2: The Data

Example 2: Why It Works

poisson_model <- glm(visits ~ treatment, family = poisson, 
                     data = goodmatch_data)
summary(poisson_model)$coefficients
                 Estimate Std. Error   z value      Pr(>|z|)
(Intercept)     2.0412203  0.0657951 31.023896 2.567154e-211
treatmentnative 0.5761755  0.0822319  7.006715  2.439777e-12

Checklist:

  • ✅ Count data → Poisson distribution
  • ✅ No upper bound on counts
  • ✅ Independent observations (different plants)
  • ✅ Simple comparison question
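
One more item worth checking is the Poisson assumption that the variance equals the mean. A common rough diagnostic is the Pearson chi-square divided by the residual degrees of freedom, which should sit near 1. The sketch below uses simulated visit counts with assumed means (8 and 13) and an assumed negative binomial `size`; `dispersion()` is a hypothetical helper, not part of any package.

```r
set.seed(2024)
treatment <- factor(rep(c("non_native", "native"), each = 30),
                    levels = c("non_native", "native"))
mu <- ifelse(treatment == "native", 13, 8)  # assumed mean visits per plant

visits_pois <- rpois(60, lambda = mu)           # variance = mean
visits_nb   <- rnbinom(60, size = 2, mu = mu)   # variance > mean (overdispersed)

# Hypothetical helper: Pearson chi-square / residual df
dispersion <- function(model) {
  sum(residuals(model, type = "pearson")^2) / df.residual(model)
}

disp_pois <- dispersion(glm(visits_pois ~ treatment, family = poisson))
disp_nb   <- dispersion(glm(visits_nb ~ treatment, family = poisson))
c(poisson_data = disp_pois, negbin_data = disp_nb)
```

A ratio well above 1 suggests a negative binomial or quasi-Poisson model would fit the counts better than a plain Poisson GLM.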

Example 2: Interpretation

est <- coef(poisson_model)["treatmentnative"]
cat("Coefficient (log scale):", round(est, 3), "\n")
Coefficient (log scale): 0.576 
cat("Multiplicative effect:", round(exp(est), 2), "\n")
Multiplicative effect: 1.78 
cat("Native plants get", round((exp(est) - 1) * 100, 1), "% more visits\n")
Native plants get 77.9 % more visits
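
The same back-transformation works for the uncertainty. As a sketch, a Wald 95% interval can be built on the log scale from the estimate and standard error printed in the coefficient table, then exponentiated. The normal approximation is an assumption of this sketch.

```r
est <- 0.5761755  # treatmentnative coefficient from the model output
se  <- 0.0822319  # its standard error

ci_log   <- est + c(-1, 1) * qnorm(0.975) * se  # interval on the log scale
ci_ratio <- exp(ci_log)                         # interval for the rate ratio
round(ci_ratio, 2)
```

The interval runs from roughly 1.5 to 2.1, so native plants plausibly receive about 50% to 110% more visits. Always build the interval on the log scale and exponentiate the endpoints; do not exponentiate the standard error.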

Lesson: Match your distribution to your data type!

Example 3: OVERCOMPLICATED

Scenario:

  • Tree height in 2 forest types
  • 30 trees per forest type
  • Normal distribution, no grouping
  • Simple question: “Is there a difference?”

The overkill:

“Let’s use a Bayesian hierarchical model with spatial autocorrelation, weakly informative priors, and MCMC sampling!”

Example 3: The Data

Example 3: Simple vs. Complex

# Simple t-test (appropriate!)
simple_test <- t.test(height ~ forest_type, data = overcomp_data)

# Negate diff() so the sign matches the confidence interval (coniferous - deciduous)
cat("Difference:", round(-diff(simple_test$estimate), 2), "meters\n")
Difference: 4.51 meters
cat("95% CI: [", round(simple_test$conf.int[1], 2), ",", 
    round(simple_test$conf.int[2], 2), "]\n")
95% CI: [ 2.2 , 6.81 ]
cat("p-value:", format(simple_test$p.value, digits = 3), "\n")
p-value: 0.000239 

Time to run: ~0.001 seconds

A Bayesian spatial model: ~5-10 minutes

Same answer!

Example 3: The Lesson

Problems with overcomplication:

  • Takes much longer to fit
  • Harder to interpret
  • Reviewers get confused
  • More things can go wrong
  • Same answer as simple approach!

Principle of parsimony:

Use the simplest method that adequately addresses your question

Save fancy methods for when you need them!

I know this is challenging! When I learn about new methods, I want to use them ALL THE TIME. But resist the urge! Focus on the research question!

If you teach, you get to explore any methods you want in your lessons!

Summary: Three Scenarios

Example         | Problem                    | Consequence                                        | Lesson
Mismatch        | Ignored grouping structure | Lost power or false positives, depending on design | Check independence
Good Match      | None - appropriate method  | Valid inference                                    | Match distribution
Overcomplicated | Unnecessary complexity     | Wasted effort                                      | Start simple

Your Checklist

Before you analyze, ask:

1️⃣ What is my question? (difference, relationship, prediction, pattern)

2️⃣ What is my response variable? (continuous, count, binary, proportion)

3️⃣ What is my data structure? (independent, grouped, nested, repeated)

4️⃣ Does my method handle all three?

The Golden Rule

Start simple.

Add complexity only when needed.

Always justify your choices.

Now It’s Your Turn!

Think-Pair-Share (5 min)

Look back at your silent reflection sheet. Based on the three examples:

  • What method might be appropriate for YOUR data?
  • What’s one thing about your data structure that makes method choice tricky?
  1. Think (<1 min) — you already did this… you can use 30 seconds to update your notes
  2. Pair (3 min) — discuss with a neighbor, help each other troubleshoot
  3. Share (1 min) — 2–3 volunteers share their pairing’s most interesting dilemma

Grab your worksheet and let’s go!

Now It’s Your Turn!

Data Detective Stations

  • 6 scenarios around the room
  • Diagnose: Mismatch? Good match? Overcomplicated?
  • Work in pairs
  • 4 minutes per station

Grab your worksheet and let’s go!

Appendix: Full Simulation Code

# ============================================================================
# EXAMPLE 1: MISMATCH - Ignoring nested structure
# Demonstrates inflated standard errors when field structure is ignored
# ============================================================================

library(lme4)
library(ggplot2)
library(dplyr)

set.seed(42)

n_fields <- 5
n_plots_per_field <- 4

mismatch_data <- expand.grid(
  field = factor(1:n_fields),
  plot = 1:n_plots_per_field
) |>
  mutate(
    treatment = rep(c("control", "control", "fertilizer", "fertilizer"), n_fields),
    treatment = factor(treatment, levels = c("control", "fertilizer"))
  )

# Large field-to-field variation
field_effects <- data.frame(
  field = factor(1:n_fields),
  field_effect = c(-12, -5, 2, 8, 14)
)

true_effect <- 3  # True fertilizer effect, as stated in the "TRUE EFFECT: 3 units" output

mismatch_data <- mismatch_data |>
  left_join(field_effects, by = "field") |>
  mutate(
    yield = 50 + field_effect + 
            ifelse(treatment == "fertilizer", true_effect, 0) +
            rnorm(n(), mean = 0, sd = 1.5)
  ) |>
  # Let the fertilizer response scale with field quality (treatment-by-field interaction)
  mutate(
    yield = yield + ifelse(treatment == "fertilizer", field_effect * 0.15, 0)
  )

# Compare wrong vs correct
wrong_model <- lm(yield ~ treatment, data = mismatch_data)
correct_model <- lmer(yield ~ treatment + (1|field), data = mismatch_data)

# Wrong model: field variance ends up in the residual, inflating the standard error
summary(wrong_model)

# Correct model: the random intercept absorbs field variance, so the standard error shrinks
# (lme4 reports t-values only; load lmerTest before fitting if you want p-values)
summary(correct_model)

Appendix: Full Simulation Code (continued)

# ============================================================================
# EXAMPLE 2: GOOD MATCH - Poisson GLM for counts
# ============================================================================

set.seed(2024)
n_per_group <- 30

goodmatch_data <- data.frame(
  plant_id = 1:(2 * n_per_group),
  treatment = factor(rep(c("native", "non_native"), each = n_per_group),
                     levels = c("non_native", "native"))
)

baseline_visits <- 8
native_effect <- 0.5  # Log-scale

goodmatch_data <- goodmatch_data |>
  mutate(
    log_mu = log(baseline_visits) + 
             ifelse(treatment == "native", native_effect, 0),
    visits = rpois(n(), lambda = exp(log_mu))
  )

poisson_model <- glm(visits ~ treatment, family = poisson, 
                     data = goodmatch_data)
summary(poisson_model)

# Interpretation
exp(coef(poisson_model)["treatmentnative"])  # Multiplicative effect

Appendix: Full Simulation Code (continued)

# ============================================================================
# EXAMPLE 3: OVERCOMPLICATED - Simple question, complex method
# ============================================================================

set.seed(2024)
n_trees <- 30

overcomp_data <- data.frame(
  tree_id = 1:(2 * n_trees),
  forest_type = factor(rep(c("deciduous", "coniferous"), each = n_trees))
)

deciduous_mean <- 18
coniferous_mean <- 22
tree_sd <- 4

overcomp_data <- overcomp_data |>
  mutate(
    height = ifelse(forest_type == "deciduous",
                    rnorm(n(), deciduous_mean, tree_sd),
                    rnorm(n(), coniferous_mean, tree_sd))
  )

# Simple and appropriate!
t.test(height ~ forest_type, data = overcomp_data)

# Or nearly equivalently (lm assumes equal variances; t.test defaults to Welch)
lm(height ~ forest_type, data = overcomp_data) |> summary()
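
A note on that last comparison: `t.test()` defaults to Welch's unequal-variance test, while `lm()` assumes a common variance. The two agree exactly only when `var.equal = TRUE` is set. The sketch below checks this on freshly simulated heights; the seed and data frame name are assumptions, and the numbers will differ from the slides.

```r
set.seed(1)
heights <- data.frame(
  forest_type = factor(rep(c("deciduous", "coniferous"), each = 30)),
  height = c(rnorm(30, mean = 18, sd = 4), rnorm(30, mean = 22, sd = 4))
)

welch   <- t.test(height ~ forest_type, data = heights)                    # default
student <- t.test(height ~ forest_type, data = heights, var.equal = TRUE)  # pooled
lm_fit  <- lm(height ~ forest_type, data = heights)

# Factor levels sort alphabetically, so the coefficient is named for "deciduous"
p_lm <- summary(lm_fit)$coefficients["forest_typedeciduous", "Pr(>|t|)"]

c(welch = welch$p.value, student = student$p.value, lm = p_lm)
```

With equal group sizes and similar variances, the Welch result is very close to the other two, so the conclusion is the same either way.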