# Station 1 (Salamander Survival) solution code — referenced in section 2.1.2
library(tidyverse)
library(lme4)
# -------------------------------------------
# Simulate the data
# -------------------------------------------
set.seed(123)
n_fragments <- 12
n_salamanders_per_fragment <- 20
# Create fragment-level data
fragments <- data.frame(
fragment_id = 1:n_fragments,
canopy_cover = runif(n_fragments, 30, 90)
)
# Add fragment-level random effect
# This represents unmeasured fragment-level variation
fragments$fragment_effect <- rnorm(n_fragments, 0, 0.8)
# Create individual salamander data
salamander_data <- fragments %>%
slice(rep(1:n(), each = n_salamanders_per_fragment)) %>%
mutate(
salamander_id = 1:n(),
# True model: logit(survival) = -2 + 0.04*canopy + fragment_effect
log_odds = -2 + 0.04 * canopy_cover + fragment_effect,
prob_survival = plogis(log_odds),
survival = rbinom(n(), 1, prob_survival)
)
# -------------------------------------------
# WRONG approach (ignoring clustering)
# -------------------------------------------
wrong_model <- glm(survival ~ canopy_cover,
family = binomial,
data = salamander_data)
summary(wrong_model)
# -------------------------------------------
# CORRECT approach (mixed model)
# -------------------------------------------
correct_model <- glmer(survival ~ canopy_cover + (1 | fragment_id),
family = binomial,
data = salamander_data)
summary(correct_model)
# -------------------------------------------
# Compare results
# -------------------------------------------
# Extract coefficients and SEs
wrong_coef <- summary(wrong_model)$coefficients
correct_coef <- summary(correct_model)$coefficients
cat("WRONG model SE for canopy_cover:", wrong_coef["canopy_cover", "Std. Error"], "\n")
cat("CORRECT model SE for canopy_cover:", correct_coef["canopy_cover", "Std. Error"], "\n")

Week 03 Session B — Answer Sheet
🔒 Instructor Access Only
This page contains solutions to the Week 03 coding activities.
Students: Please work through the problems yourself before viewing answers!
If you are a student and don’t have the password,
that means you shouldn’t be here yet! 😊
This document contains solutions and discussion points for the differentiated coding activities. Do not share with students until after the session.
1 🟢 Track A: Beginners
1.1 Option A2: My First Model — Expected Answers
1.1.1 Question: What does the slope for temperature mean?
Expected answer: For every 1-degree increase in temperature, yield increases by approximately 1.3 units (the exact value depends on the data). This assumes a linear relationship and that all other factors are held constant.
1.1.2 Question: Which model explains more variance?
Expected answer: Model 2 (with two predictors) will typically have a higher R-squared than Model 1. However, students should note:
- R-squared never decreases when you add predictors (and almost always increases)
- Adjusted R-squared accounts for the number of predictors
- A higher R-squared doesn’t always mean a “better” model
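The R-squared points above can be demonstrated directly. A minimal sketch with simulated data (the variable names `temperature`, `rainfall`, and `yield`, and all coefficient values, are assumptions for illustration):

```r
# Demonstrate: R-squared never decreases with more predictors,
# while adjusted R-squared penalizes the extra parameter
set.seed(42)
crop <- data.frame(temperature = runif(50, 15, 30))
crop$rainfall <- runif(50, 200, 600)
crop$yield <- 10 + 1.3 * crop$temperature + 0.01 * crop$rainfall + rnorm(50, 0, 2)

m1 <- lm(yield ~ temperature, data = crop)
m2 <- lm(yield ~ temperature + rainfall, data = crop)

summary(m1)$r.squared       # one predictor
summary(m2)$r.squared       # two predictors: guaranteed >= the value above
summary(m1)$adj.r.squared
summary(m2)$adj.r.squared   # can go DOWN if the new predictor adds little
```

Because Model 1 is nested in Model 2, the R-squared comparison always goes one way; the adjusted values are what students should actually compare.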
1.1.3 Question: Interpret the coefficient for rainfall while “controlling for” temperature
Expected answer: Holding temperature constant, for every 1-unit increase in rainfall, yield changes by [X] units. This is the partial effect of rainfall — the effect of rainfall after accounting for the relationship between temperature and yield.
Discussion point: With the provided data, temperature and rainfall are correlated. This can lead to:
- Coefficients that differ from simple correlations
- Larger standard errors (multicollinearity)
- Difficulty interpreting “independent” effects
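The multicollinearity discussion can be made concrete with a short simulation (all names and coefficient values here are assumptions, not the activity's actual data):

```r
# Correlated predictors: simple vs. partial effects of rainfall
set.seed(99)
n <- 50
temperature <- runif(n, 15, 30)
rainfall <- 20 * temperature + rnorm(n, 0, 40)  # strongly correlated with temperature
yield <- 10 + 1.0 * temperature + 0.05 * rainfall + rnorm(n, 0, 3)

# Simple slope absorbs part of temperature's effect (omitted-variable bias)
coef(lm(yield ~ rainfall))["rainfall"]
# Partial slope, "controlling for" temperature, is roughly the true 0.05
coef(lm(yield ~ temperature + rainfall))["rainfall"]

# Standard errors in the joint model are typically inflated by the correlation
summary(lm(yield ~ temperature + rainfall))$coefficients["rainfall", "Std. Error"]
```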
2 🟡 Track B: Intermediate Stations
2.1 Station 1: Salamander Survival — Answer
2.1.1 What’s wrong with the proposed method?
The proposed method glm(survival ~ canopy_cover, family = binomial) treats each salamander as an independent observation. But salamanders within the same forest fragment are not independent: they share the same environment, microclimate, predators, and so on.
The problem: Pseudoreplication. The true sample size is 12 fragments, not 240 salamanders.
2.1.2 The correct approach
Use a generalized linear mixed model (GLMM) with fragment as a random effect: survival ~ canopy_cover + (1 | fragment_id). The full simulation and model-comparison code appears at the top of this sheet.
2.1.3 What students should observe
| Comparison | Wrong Model | Correct Model |
|---|---|---|
| Standard error for canopy_cover | Smaller (artificially precise) | Larger (realistic) |
| p-value | Often significant | May not be significant |
| Conclusion | May claim strong effect | More honest uncertainty |
Key insight: Ignoring the clustering leads to overconfident conclusions because we’re pretending we have 240 independent observations when we really only have 12 independent units.
2.2 Station 4: Soil Microbial Diversity — Answer
2.2.1 Is Kruskal-Wallis wrong, or just overly cautious?
Kruskal-Wallis is not wrong, but it may be unnecessarily conservative for this scenario.
When Kruskal-Wallis is truly needed:
- Severely non-normal data (heavy skew, outliers)
- Ordinal data
- Very small sample sizes where normality can’t be assessed
- Heterogeneous variances that can’t be addressed
When ANOVA is fine:
- Slight skewness (ANOVA is robust to this)
- Moderate sample sizes (n ≥ 10 per group)
- Residuals approximately normal
- Similar variances across groups
2.2.2 The comparison
library(tidyverse)
# -------------------------------------------
# Simulate the data
# -------------------------------------------
set.seed(456)
n_per_treatment <- 10
soil_data <- data.frame(
treatment = rep(c("no_till", "reduced_till", "conventional"), each = n_per_treatment),
field_id = 1:30
)
# Simulate slightly right-skewed Shannon diversity
# True effect: no-till > reduced > conventional
soil_data <- soil_data %>%
mutate(
true_mean = case_when(
treatment == "no_till" ~ 3.8,
treatment == "reduced_till" ~ 3.2,
treatment == "conventional" ~ 2.6
),
# Add slight right skew with exponential noise
diversity = true_mean + rnorm(n(), 0, 0.5) + rexp(n(), rate = 5)
)
# Visualize
ggplot(soil_data, aes(x = treatment, y = diversity)) +
geom_boxplot() +
geom_jitter(width = 0.1, alpha = 0.5)
# -------------------------------------------
# Proposed: Kruskal-Wallis
# -------------------------------------------
kruskal_result <- kruskal.test(diversity ~ treatment, data = soil_data)
print(kruskal_result)
# -------------------------------------------
# Alternative: ANOVA
# -------------------------------------------
anova_model <- aov(diversity ~ treatment, data = soil_data)
summary(anova_model)
# Check assumptions
par(mfrow = c(2, 2))
plot(anova_model)
# Post-hoc: Tukey HSD is standard for ANOVA; Kruskal-Wallis needs Dunn's test instead
TukeyHSD(anova_model)

2.2.3 What students should observe
| Aspect | Kruskal-Wallis | ANOVA |
|---|---|---|
| Tests | Differences in ranks | Differences in means |
| Interpretation | Less intuitive | Effect sizes in original units |
| Post-hoc tests | Dunn’s test (less common) | Tukey HSD (standard) |
| Power | Slightly lower | Slightly higher |
Key insight: For slightly skewed but otherwise well-behaved data, ANOVA is often the better choice because:
- It’s more powerful
- Effect sizes are interpretable
- Post-hoc comparisons are straightforward
- ANOVA is robust to moderate violations of normality
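If the rank-based route is kept, the matching post-hoc is Dunn's test (e.g. FSA::dunnTest). A dependency-free base-R alternative is pairwise Wilcoxon tests with a multiplicity correction; a standalone sketch on toy data (not the session's soil_data):

```r
# Rank-based post-hoc: pairwise Wilcoxon tests with Holm correction
set.seed(1)
toy <- data.frame(
  treatment = rep(c("no_till", "reduced_till", "conventional"), each = 10),
  diversity = c(rnorm(10, 3.8, 0.5), rnorm(10, 3.2, 0.5), rnorm(10, 2.6, 0.5))
)
kruskal.test(diversity ~ treatment, data = toy)
pairwise.wilcox.test(toy$diversity, toy$treatment, p.adjust.method = "holm")
```

This keeps the analysis nonparametric end to end, at the cost of less familiar output than TukeyHSD.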
2.3 Station 6: Seedling Germination — Answer
2.3.1 What’s wrong with lm(proportion ~ stratification_time)?
Three major problems:
1. Bounded response: Proportions are bounded between 0 and 1, but linear regression can predict values outside this range.
2. Non-constant variance: The variance of a proportion depends on its mean and shrinks near 0 and 1, which violates homoscedasticity.
3. Ignores sample size: A proportion of 0.5 from 10 seeds carries more uncertainty than 0.5 from 100 seeds, but linear regression treats them identically.
2.3.2 The correct approach
Use binomial GLM with counts (successes and failures):
library(tidyverse)
# -------------------------------------------
# Simulate the data
# -------------------------------------------
set.seed(789)
n_dishes <- 10
n_seeds <- 50
strat_times <- c(0, 30, 60, 90)
germination_data <- expand.grid(
dish_id = 1:n_dishes,
stratification_time = strat_times
) %>%
mutate(
# True relationship: logit(germination) = -1.5 + 0.025 * time
log_odds = -1.5 + 0.025 * stratification_time,
true_prob = plogis(log_odds),
# Simulate germination (binomial process)
n_germinated = rbinom(n(), n_seeds, true_prob),
n_failed = n_seeds - n_germinated,
proportion = n_germinated / n_seeds
)
# Visualize
ggplot(germination_data, aes(x = factor(stratification_time), y = proportion)) +
geom_boxplot() +
geom_jitter(width = 0.1, alpha = 0.5) +
labs(x = "Stratification Time (days)", y = "Proportion Germinated")
# -------------------------------------------
# WRONG: Linear model on proportions
# -------------------------------------------
wrong_model <- lm(proportion ~ stratification_time, data = germination_data)
summary(wrong_model)
# Check residuals — look for heteroscedasticity
par(mfrow = c(2, 2))
plot(wrong_model)
# -------------------------------------------
# CORRECT: Binomial GLM with counts
# -------------------------------------------
correct_model <- glm(cbind(n_germinated, n_failed) ~ stratification_time,
family = binomial,
data = germination_data)
summary(correct_model)
# Check for overdispersion
# Residual deviance / residual df should be approximately 1
resid_dev <- summary(correct_model)$deviance
resid_df <- summary(correct_model)$df.residual
cat("Dispersion ratio:", resid_dev / resid_df, "\n")
# If >> 1, use quasibinomial or beta-binomial
# -------------------------------------------
# Compare predictions
# -------------------------------------------
new_data <- data.frame(stratification_time = 0:90)
new_data$pred_wrong <- predict(wrong_model, new_data)
new_data$pred_correct <- predict(correct_model, new_data, type = "response")
ggplot() +
geom_point(data = germination_data,
aes(x = stratification_time, y = proportion), alpha = 0.5) +
geom_line(data = new_data,
aes(x = stratification_time, y = pred_wrong, color = "Linear Model")) +
geom_line(data = new_data,
aes(x = stratification_time, y = pred_correct, color = "Binomial GLM")) +
labs(y = "Proportion Germinated", color = "Model") +
ylim(0, 1)

2.3.3 What students should observe
| Issue | Linear Model | Binomial GLM |
|---|---|---|
| Predictions at extremes | Can go below 0 or above 1 | Always between 0 and 1 |
| Residual variance | Heteroscedastic (funnel shape) | Properly modeled |
| Interpretation | Additive effect on proportion | Multiplicative effect on odds |
| Accounts for n_seeds | No | Yes |
Key insight: When you have count data out of a known total (successes/trials), use cbind(successes, failures) with family = binomial. This properly models the binomial nature of the data.
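If the dispersion check above came back well above 1, one remedy already noted in the code is a quasibinomial refit. A standalone sketch on toy data with deliberate dish-level noise (all names and values here are assumptions for illustration):

```r
# Quasibinomial refit for overdispersed proportion data (toy example)
set.seed(101)
n_dishes <- 40
toy <- data.frame(strat_time = rep(c(0, 30, 60, 90), each = 10))
# Extra dish-level noise on the logit scale creates overdispersion
toy$p <- plogis(-1.5 + 0.025 * toy$strat_time + rnorm(n_dishes, 0, 0.8))
toy$germ <- rbinom(n_dishes, 50, toy$p)
toy$fail <- 50 - toy$germ

qb_model <- glm(cbind(germ, fail) ~ strat_time,
                family = quasibinomial,
                data = toy)
# Same point estimates as the binomial fit; SEs are scaled by the
# estimated dispersion, widening the confidence intervals honestly
summary(qb_model)
```

Beta-binomial models (e.g. via glmmTMB) are the other option mentioned above when the overdispersion itself is of interest.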
3 🔴 Track C: Advanced Stations
3.1 Station 2: Wheat Yield Trials — Answer
3.1.1 What’s wrong with aov(yield ~ variety)?
The proposed model ignores the block structure of the RCBD (Randomized Complete Block Design). Blocks account for spatial heterogeneity (some blocks may be more fertile).
Consequences of ignoring blocks:
- Block variation is lumped into residual error
- Residual variance is inflated
- Standard errors are larger
- F-test has less power
- May miss real treatment effects
3.1.2 The correct approach
library(tidyverse)
library(lme4)
library(emmeans)
# -------------------------------------------
# Simulate RCBD data
# -------------------------------------------
set.seed(2024)
varieties <- c("Var_A", "Var_B", "Var_C", "Var_D", "Var_E")
blocks <- c("Block_1", "Block_2", "Block_3", "Block_4")
wheat_data <- expand.grid(
variety = varieties,
block = blocks,
stringsAsFactors = FALSE
) %>%
mutate(
# True variety effects (differences from grand mean)
variety_effect = case_when(
variety == "Var_A" ~ -200,
variety == "Var_B" ~ 0,
variety == "Var_C" ~ 400, # Best variety
variety == "Var_D" ~ -100,
variety == "Var_E" ~ 200
),
# Block effects (some blocks more fertile)
block_effect = case_when(
block == "Block_1" ~ -400,
block == "Block_2" ~ -100,
block == "Block_3" ~ 200,
block == "Block_4" ~ 300
),
# Grand mean + effects + residual error
yield = 5000 + variety_effect + block_effect + rnorm(n(), 0, 120)
)
# Visualize
ggplot(wheat_data, aes(x = variety, y = yield, color = block)) +
geom_point(size = 3) +
theme_minimal() +
labs(title = "Wheat Yield by Variety and Block")
# -------------------------------------------
# WRONG: Ignoring blocks
# -------------------------------------------
wrong_model <- aov(yield ~ variety, data = wheat_data)
summary(wrong_model)
# Note: Residual MS includes block variation!
# -------------------------------------------
# CORRECT Option A: Block as fixed effect
# -------------------------------------------
correct_fixed <- aov(yield ~ block + variety, data = wheat_data)
summary(correct_fixed)
# Compare residual variance
cat("Wrong model residual MS:", summary(wrong_model)[[1]]["Residuals", "Mean Sq"], "\n")
cat("Correct model residual MS:", summary(correct_fixed)[[1]]["Residuals", "Mean Sq"], "\n")
# -------------------------------------------
# CORRECT Option B: Block as random effect
# -------------------------------------------
correct_random <- lmer(yield ~ variety + (1 | block), data = wheat_data)
summary(correct_random)
anova(correct_random)
# -------------------------------------------
# Post-hoc comparisons
# -------------------------------------------
emmeans_result <- emmeans(correct_fixed, pairwise ~ variety)
emmeans_result$contrasts
# Compact letter display
library(multcomp)
cld(emmeans_result$emmeans, Letters = letters)
# -------------------------------------------
# Extension: Publication figure
# -------------------------------------------
# Get estimated marginal means
emm <- as.data.frame(emmeans_result$emmeans)
ggplot(emm, aes(x = reorder(variety, emmean), y = emmean)) +
geom_point(size = 3) +
geom_errorbar(aes(ymin = lower.CL, ymax = upper.CL), width = 0.2) +
geom_jitter(data = wheat_data, aes(x = variety, y = yield),
width = 0.1, alpha = 0.4, color = "gray50") +
labs(x = "Variety", y = "Yield (kg/ha)",
title = "Wheat Variety Comparison",
subtitle = "Points = raw data, Error bars = 95% CI of estimated means") +
theme_minimal()

3.1.3 Fixed vs. Random blocks?
| Use Fixed Block | Use Random Block |
|---|---|
| Blocks are exhaustive (these specific blocks) | Blocks are a sample from a population |
| Interest in specific block differences | Want to generalize beyond these blocks |
| Blocks are part of the experimental design | Blocks are nuisance variation |
| Traditional agronomic trials | Multi-site, multi-year studies |
For this scenario: Either is defensible. Fixed is more traditional in agronomy; random is more generalizable.
3.2 Station 3: Bird Abundance Over Time — Answer
3.2.1 What are the THREE problems with lm(bird_count ~ year)?
Non-normal response: Counts follow Poisson (or negative binomial), not normal distribution
Non-independence within parks: Multiple observations from the same park are correlated (need random effect for park)
Temporal pseudoreplication: Multiple visits within years are not independent
3.2.2 The correct approach
library(tidyverse)
library(lme4)
library(glmmTMB)
# -------------------------------------------
# Simulate repeated measures count data
# -------------------------------------------
set.seed(303)
n_parks <- 8
n_years <- 5
n_visits <- 6
bird_data <- expand.grid(
park_id = 1:n_parks,
year = 1:n_years,
visit = 1:n_visits
) %>%
mutate(
# Park-level random intercepts, indexed by park_id
# (expand.grid varies park_id fastest, so rep(..., each = ...) would
# misalign effects with rows; indexing is correct regardless of row order)
park_effect = rnorm(n_parks, 0, 0.4)[park_id],
# Optional: Park-level random slopes (some parks decline faster)
park_slope = rnorm(n_parks, 0, 0.05)[park_id],
# True model: log(count) = 3.5 - 0.15*year + park_effect + park_slope*year
log_mean = 3.5 - 0.15 * year + park_effect + park_slope * year,
# Simulate Poisson counts
bird_count = rpois(n(), exp(log_mean))
)
# Visualize
ggplot(bird_data, aes(x = year, y = bird_count, color = factor(park_id))) +
geom_jitter(width = 0.1, alpha = 0.5) +
geom_smooth(method = "glm", method.args = list(family = "poisson"), se = FALSE) +
facet_wrap(~ park_id, ncol = 4) +
labs(title = "Bird Counts by Park Over Time")
# -------------------------------------------
# WRONG: Linear model ignoring everything
# -------------------------------------------
wrong_model <- lm(bird_count ~ year, data = bird_data)
summary(wrong_model)
# Problems visible in residuals
par(mfrow = c(2, 2))
plot(wrong_model)
# -------------------------------------------
# BETTER: Poisson GLMM with random intercepts
# -------------------------------------------
better_model <- glmer(bird_count ~ year + (1 | park_id),
family = poisson,
data = bird_data)
summary(better_model)
# Check for overdispersion
overdisp <- sum(residuals(better_model, type = "pearson")^2) / df.residual(better_model)
cat("Overdispersion ratio:", overdisp, "\n")
# If > 1.5, consider negative binomial
# -------------------------------------------
# BEST: Random slopes (parks decline at different rates)
# -------------------------------------------
best_model <- glmer(bird_count ~ year + (1 + year | park_id),
family = poisson,
data = bird_data)
summary(best_model)
# Compare models
anova(better_model, best_model)
# -------------------------------------------
# Interpretation
# -------------------------------------------
# The year coefficient is on the log scale
year_coef <- fixef(best_model)["year"]
cat("Year effect (log scale):", year_coef, "\n")
cat("Multiplicative change per year:", exp(year_coef), "\n")
cat("Percent change per year:", (exp(year_coef) - 1) * 100, "%\n")
# -------------------------------------------
# Extension: Negative binomial for overdispersion
# -------------------------------------------
nb_model <- glmmTMB(bird_count ~ year + (1 + year | park_id),
family = nbinom2,
data = bird_data)
summary(nb_model)

3.2.3 What students should observe
| Model | Issues Addressed |
|---|---|
| lm(count ~ year) | None |
| glm(..., family = poisson) | Count distribution |
| glmer(... + (1 \| park)) | Count + park clustering |
| glmer(... + (1 + year \| park)) | Count + park clustering + varying trends |
| glmmTMB(..., family = nbinom2) | All above + overdispersion |
3.3 Station 5: Pollinator Network Complexity — Answer
3.3.1 Why is the Bayesian spatial model overkill?
The proposed method (Bayesian multilevel model with spatial Gaussian process) would require:
- Spatial coordinates (not provided)
- A grouping structure for random effects (not present)
- Substantial computation time
- Expertise in Bayesian methods
- Complex model diagnostics
For 25 independent meadows with a simple linear relationship, this is massive overkill.
3.3.2 The correct approach: Simple linear regression
library(tidyverse)
# -------------------------------------------
# Simulate simple, well-behaved data
# -------------------------------------------
set.seed(555)
n_meadows <- 25
pollinator_data <- data.frame(
meadow_id = 1:n_meadows,
plant_richness = round(runif(n_meadows, 5, 45))
) %>%
mutate(
# Simple linear relationship with normal errors
network_complexity = 1.0 + 0.15 * plant_richness + rnorm(n_meadows, 0, 0.7)
)
# Visualize
ggplot(pollinator_data, aes(x = plant_richness, y = network_complexity)) +
geom_point(size = 3) +
geom_smooth(method = "lm", se = TRUE, color = "blue") +
labs(x = "Plant Species Richness",
y = "Network Complexity Score",
title = "Simple Linear Relationship")
# -------------------------------------------
# THE CORRECT APPROACH: Simple lm()
# -------------------------------------------
simple_model <- lm(network_complexity ~ plant_richness, data = pollinator_data)
summary(simple_model)
# Check assumptions
par(mfrow = c(2, 2))
plot(simple_model)
# All assumptions met! We're done.
# -------------------------------------------
# Discussion: When WOULD complexity be needed?
# -------------------------------------------
# The complex approach would be appropriate if:
# 1. Meadows were grouped (e.g., multiple meadows per region)
# → Need random intercepts for region
# 2. Repeated measures over time
# → Need random intercepts for meadow, possibly temporal autocorrelation
# 3. Spatial autocorrelation (nearby meadows more similar)
# → Need spatial covariance structure (Gaussian process, CAR, etc.)
# 4. Multiple response variables
# → Need multivariate model
# 5. Prior information from previous studies
# → Bayesian framework with informative priors
# -------------------------------------------
# Costs of over-complication
# -------------------------------------------
# 1. Computation time (hours vs. seconds)
# 2. More parameters to estimate with limited data (n=25)
# 3. Harder to interpret and explain
# 4. More opportunities for errors
# 5. Reviewers may not understand or trust the analysis
# 6. May not converge or give unreliable estimates
# 7. Requires specialized software and expertise

3.3.3 Key discussion points
Occam’s Razor in Statistics: Start with the simplest model that addresses your research question. Add complexity only when:
- The simple model clearly fails (poor fit, violated assumptions)
- The research question requires it (e.g., estimating variance components)
- The data structure demands it (e.g., hierarchical, spatial, temporal)
Red flags for unnecessary complexity:
- “I used a Bayesian spatial model” for simple cross-sectional data
- Random effects with < 5 groups
- More parameters than you can reliably estimate
- Methods chosen to seem sophisticated rather than answer the question
4 General Discussion Points for Wrap-Up
4.1 Why simulate data?
- Forces you to understand the data-generating process — You can’t simulate what you don’t understand
- Creates known “truth” to compare methods against — You know the true effect size
- Reveals how methods fail — See the consequences of wrong choices
- Builds intuition — Understand what different parameters do
- Useful for power analysis — Before collecting real data
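The power-analysis point above is easy to demonstrate in a few lines. A sketch for a simple two-group design (effect size, SD, and sample sizes are all assumptions for illustration):

```r
# Simulation-based power analysis: simulate under a known effect,
# record how often a t-test detects it
set.seed(7)
power_sim <- function(n_per_group, effect = 0.5, sd = 1, n_sims = 500) {
  pvals <- replicate(n_sims, {
    control   <- rnorm(n_per_group, mean = 0, sd = sd)
    treatment <- rnorm(n_per_group, mean = effect, sd = sd)
    t.test(control, treatment)$p.value
  })
  mean(pvals < 0.05)  # proportion of simulations detecting the effect
}
# Estimated power at several sample sizes
sapply(c(10, 20, 40, 80), power_sim)
```

The same skeleton works for any of the station designs: replace the two rnorm() calls with the relevant data-generating process and the t-test with the intended model.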
4.2 Common themes across stations
| Station | Main Issue | Lesson |
|---|---|---|
| 1 (Salamander) | Pseudoreplication | Know your unit of replication |
| 2 (Wheat) | Ignoring design structure | Use the structure you designed |
| 3 (Birds) | Multiple violations | Match method to data structure |
| 4 (Soil) | Overcautious | Don’t be afraid of parametric tests |
| 5 (Pollinator) | Overkill | Simpler is often better |
| 6 (Germination) | Wrong distribution | Match distribution to response type |
4.3 Signs you’ve chosen the wrong method
- Residual plots look terrible
- Predictions outside possible range
- Standard errors that seem too small (pseudoreplication)
- Model won’t converge
- Results don’t make biological sense
- You can’t explain what the model is doing