Week 03 - Statistical Modeling Framework

Week: 3

Topic: Statistical Modeling Framework

0.1 Learning Objectives

Connect research questions, data structure and data analysis

0.2 Background

No background this week. After class, review the following topics:

Inference
Linear Models
Distributions
- Normal
- Binomial
- Poisson
- Negative Binomial
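
As a quick refresher, the four distributions listed above can all be drawn from in base R. This is a minimal sketch: the sample size and every parameter value are arbitrary choices for illustration, not values from the course.

```r
set.seed(1)
n <- 1000

# Normal: continuous, symmetric, unbounded
normal_draws <- rnorm(n, mean = 10, sd = 2)

# Binomial: number of successes out of a fixed number of trials (0 to size)
binomial_draws <- rbinom(n, size = 20, prob = 0.3)

# Poisson: counts whose variance equals the mean
poisson_draws <- rpois(n, lambda = 5)

# Negative binomial: counts with extra variance (var = mu + mu^2 / size)
negbin_draws <- rnbinom(n, mu = 5, size = 1.5)

# Compare mean and variance for the two count distributions
c(mean = mean(poisson_draws), var = var(poisson_draws))
c(mean = mean(negbin_draws), var = var(negbin_draws))
```

Comparing the last two lines of output is one way to see why overdispersed counts often call for a negative binomial rather than a Poisson model.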

0.3 Before class

Think about:

A dataset you’re currently working with (or interested in working with)
The main research question you want to answer
What is your response variable? What is its distribution?

1 Session A:

1.1 Part 1: Silent Reflection (10 min)

On the paper I gave you, write the following:

Reflection Prompts

My research question is (one sentence):

My response variable is: (circle one or write your own)
- Continuous measurements
- Counts
- Binary (yes/no, presence/absence)
- Proportions
- Categories
- Other: ___________

My data structure includes: (check all that apply)
- Repeated measures on the same units
- Multiple sites/plots/locations
- Time series
- Nested structure (e.g., plants within plots within sites)
- Just one observation per unit
- Other: ___________

One worry I have about analyzing my data:

1.2 Part 2: Graffiti discussion (12 min)

Your task:
- Move freely around the room
- Use markers to write on ANY of the questions (words, phrases, questions, drawings)
- Read what others wrote
- Add +1, arrows, or comments to build on others’ ideas
- No names required

The Six Questions

Wall 1: “What’s the hardest thing about data analysis for you?”

Wall 2: “What’s the gap between what you learned in stats class and what you actually need/use?”

Wall 3: “What do you wish your advisor/committee understood about statistical analysis?”

Wall 4: “What statistical concept do you pretend to understand but actually don’t?”

Wall 5: “What do you want to learn in this class?”

Wall 6: “Do you ever feel like you are doing analyses that you do not understand, or that might be wrong, without knowing why? If yes, explain.”

1.3 Post-graffiti (6 min)

Brief discussion of themes and patterns from the walls. This will inform future class topics.

1.4 Part 3: Main concept –> Lecture

1.5 The Method-Question-Data Triangle (15 min)

     RESEARCH QUESTION
            /\
           /  \
          /    \
         /      \
        /        \
       /          \
DATA TYPE -------- METHOD CHOICE

All three need to be aligned. Dr. Molina will give a lecture showing examples of a mismatch, a good match, and an overcomplicated model.

2 Session B:

2.1 Discussion of Diagnosis Stations (10 min)

2.2 Self-Assessment (1 minute)

Before we begin, place yourself into a track based on your current comfort level:

Track	Description	You should choose this if…
🟢 Track A	New to R / Need a refresher	“I’ve never used R” OR “It’s been a while and I need to review some basics”
🟡 Track B	Some experience	“I can run models but need practice choosing the right one”
🔴 Track C	Experienced	“I can fit GLMs, troubleshoot code, and want a bit of a challenge”

2.3 🟢 Track A: Beginners (Choose One Option)

2.3.1 Option A1: Introduction to R, Projects, and Quarto

If you are completely new to R or need a structured introduction, work through the Intro to R, Projects and Quarto assignment.

Your goal: Complete the tutorial and have a working .qmd file that renders by the end of class.

2.3.2 Option A2: My First Model - Guided Template

If you have some R basics but haven’t run statistical models, use this guided template.

Your goal: Fit two models, interpret the output, and check assumptions.

# ===========================================
# SNR 690: Your First Model Template
# ===========================================

# 1. Load packages (install if needed)
library(tidyverse)

# 2. Load YOUR data (replace with your file path)
# my_data <- read_csv("your_data.csv")

# For now, let's use example data:
my_data <- data.frame(
  yield = c(12, 15, 14, 18, 20, 22, 19, 25, 17, 21),
  temperature = c(15, 16, 15, 18, 20, 21, 19, 23, 17, 20),
  rainfall = c(50, 55, 48, 60, 70, 75, 65, 80, 58, 68),
  site = c("A", "A", "B", "B", "C", "C", "D", "D", "E", "E")
)

# 3. Explore your data
head(my_data)
summary(my_data)

# Visualize relationships
ggplot(my_data, aes(x = temperature, y = yield)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Yield vs Temperature")

# -------------------------------------------
# MODEL 1: Single predictor
# -------------------------------------------

# 4. Fit a simple model with ONE predictor
model1 <- lm(yield ~ temperature, data = my_data)

# 5. Look at the output
summary(model1)

# 6. Check assumptions (look at these plots!)
par(mfrow = c(2, 2))
plot(model1)

# QUESTION: What does the slope for temperature mean?
# Write one sentence here:
# 

# -------------------------------------------
# MODEL 2: Two predictors
# -------------------------------------------

# 7. Now fit a model with TWO predictors
model2 <- lm(yield ~ temperature + rainfall, data = my_data)

# 8. Look at the output
summary(model2)

# 9. Check assumptions
par(mfrow = c(2, 2))
plot(model2)

# 10. Compare models
# Which model explains more variance? (Hint: look at R-squared)
# 

# QUESTION: Interpret the coefficient for rainfall. 
# What does it mean while "controlling for" temperature?
# Write one sentence here:
# 




# -------------------------------------------
# YOUR TURN: Adapt for your data
# -------------------------------------------

# First, rerun Model 2 with an interaction term (temperature * rainfall). What changed?
# Then replace the example data with your own data (simulate data if needed)
# and look at your own system:
# What is your response variable?
# What are 1-2 predictor variables you want to test?
# Fit the models and interpret!
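
For the first "Your turn" prompt, an interaction can be written with `*` in the model formula. The sketch below reuses the example data from the template (without the site column); the name `model3` is introduced here for illustration. With only 10 rows and two strongly correlated predictors, the estimates will be unstable, so treat the output as a syntax demonstration rather than a result.

```r
# Example data copied from the template above
my_data <- data.frame(
  yield = c(12, 15, 14, 18, 20, 22, 19, 25, 17, 21),
  temperature = c(15, 16, 15, 18, 20, 21, 19, 23, 17, 20),
  rainfall = c(50, 55, 48, 60, 70, 75, 65, 80, 58, 68)
)

# Additive model (same as model2 in the template)
model2 <- lm(yield ~ temperature + rainfall, data = my_data)

# temperature * rainfall expands to
# temperature + rainfall + temperature:rainfall
model3 <- lm(yield ~ temperature * rainfall, data = my_data)

summary(model3)

# F-test comparing the additive model with the interaction model
anova(model2, model3)
```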

2.4 🟡 Track B: Intermediate - Simulate & Analyze (Choose One Station)

Your task: Take one of the diagnosis stations from Session A (choose from those listed below), simulate data that match the scenario, then fit the analysis two ways: the “proposed” (potentially wrong) way and the “correct” way. Compare the results.

Your Deliverable

Simulate data that matches the scenario description
Fit the proposed model (as written in the station)
Fit the correct model (what SHOULD be used)
Compare the outputs - what’s different? Why does it matter?
Write 2-3 sentences explaining what you learned

2.4.1 Station 1: Salamander Survival

Research Question: Does canopy cover affect salamander survival in forest fragments?

Data Structure:

12 forest fragments
20 salamanders marked in each fragment (240 total)
Survival recorded as binary (alive/dead) after one year
Canopy cover measured as % for each fragment

Proposed Method: glm(survival ~ canopy_cover, family = binomial)

Your Diagnosis from Session A: Needs modification - what’s missing?

Hints:

What is the unit of observation? What is the unit of replication?
Are salamanders within the same fragment independent?
What package might you need for the correct approach?
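
One way to start on this station is sketched below. All effect sizes (intercept, canopy slope, between-fragment standard deviation) are assumptions made up for illustration. The mixed model uses `lme4::glmer`, which is one possible choice and only runs if the lme4 package is installed.

```r
set.seed(42)
n_fragments <- 12
n_per_fragment <- 20

# Fragment-level values (all numbers are assumed for illustration)
fragments <- data.frame(
  fragment = factor(seq_len(n_fragments)),
  canopy_cover = round(runif(n_fragments, min = 20, max = 90)),
  fragment_effect = rnorm(n_fragments, mean = 0, sd = 0.8)  # logit scale
)

# One row per salamander; canopy cover is shared within a fragment
salamanders <- fragments[rep(seq_len(n_fragments), each = n_per_fragment), ]
eta <- -1.5 + 0.03 * salamanders$canopy_cover + salamanders$fragment_effect
salamanders$survival <- rbinom(nrow(salamanders), size = 1, prob = plogis(eta))

# Proposed model: treats all 240 salamanders as independent
m_proposed <- glm(survival ~ canopy_cover, family = binomial,
                  data = salamanders)
summary(m_proposed)$coefficients

# One option that respects the grouping: a random intercept per fragment
if (requireNamespace("lme4", quietly = TRUE)) {
  m_mixed <- lme4::glmer(survival ~ canopy_cover + (1 | fragment),
                         family = binomial, data = salamanders)
  print(summary(m_mixed)$coefficients)
}
```

Compare the standard error of `canopy_cover` between the two fits and relate it to the number of fragments versus the number of salamanders.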

2.4.2 Station 4: Soil Microbial Diversity

Research Question: Does tillage treatment affect soil microbial diversity?

Data Structure:

3 treatments: no-till, reduced till, conventional till
10 fields per treatment (30 fields total)
Shannon diversity index calculated for each field (continuous, 0–5)
Data slightly right-skewed but otherwise well-behaved

Proposed Method: Kruskal-Wallis test (non-parametric)

Your Diagnosis from Session A: Is this wrong, or just overly cautious?

Hints:

When is a non-parametric test truly necessary?
What assumptions does ANOVA actually require?
How robust is ANOVA to slight skewness?
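
A possible starting point for this station, using base R only. The treatment means and the gamma-distributed noise (used to produce a slight right skew) are assumptions chosen for illustration.

```r
set.seed(7)
treatments <- c("no_till", "reduced_till", "conventional_till")
fields <- data.frame(
  treatment = factor(rep(treatments, each = 10), levels = treatments)
)

# Assumed treatment means, in the same order as `treatments`
true_means <- c(3.4, 3.0, 2.6)

# Gamma noise shifted to mean zero: slightly right-skewed, sd about 0.32
skewed_noise <- rgamma(nrow(fields), shape = 10, scale = 0.1) - 1
fields$diversity <- true_means[as.integer(fields$treatment)] + skewed_noise

# Proposed method: Kruskal-Wallis rank test
kw <- kruskal.test(diversity ~ treatment, data = fields)

# Alternative: one-way ANOVA on the raw values
m_anova <- aov(diversity ~ treatment, data = fields)
anova_p <- summary(m_anova)[[1]][["Pr(>F)"]][1]

# Side-by-side p-values, then a check of the ANOVA residuals
c(kruskal_wallis = kw$p.value, anova = anova_p)
shapiro.test(residuals(m_anova))
```

Rerunning the block with different seeds, or with more strongly skewed noise, shows how often the two tests agree.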

2.4.3 Station 6: Seedling Germination

Research Question: Does stratification time (0, 30, 60, 90 days) affect germination rate?

Data Structure:

4 stratification treatments
10 petri dishes per treatment (40 dishes)
50 seeds per dish
Response: proportion germinated (0.0–1.0)

Proposed Method: lm(proportion ~ stratification_time)

Your Diagnosis from Session A: Wrong method - what would you use instead?

Hints:

What’s the difference between a proportion and a count?
Why can’t you just use lm() on proportions?
Look up cbind() with glm(..., family = binomial)
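
The `cbind()` hint can be sketched as follows, using base R only. The intercept and slope on the logit scale are assumed values for illustration, and the column names `germinated` and `not_germinated` are introduced here.

```r
set.seed(11)
seeds_per_dish <- 50
dishes <- data.frame(stratification_time = rep(c(0, 30, 60, 90), each = 10))

# Assumed true relationship on the logit scale
p_true <- plogis(-1.5 + 0.03 * dishes$stratification_time)
dishes$germinated <- rbinom(nrow(dishes), size = seeds_per_dish, prob = p_true)
dishes$not_germinated <- seeds_per_dish - dishes$germinated
dishes$proportion <- dishes$germinated / seeds_per_dish

# Proposed model: linear model on the proportion
m_proposed <- lm(proportion ~ stratification_time, data = dishes)

# Binomial GLM on (successes, failures), which keeps the number of seeds
m_binomial <- glm(cbind(germinated, not_germinated) ~ stratification_time,
                  family = binomial, data = dishes)

# Predictions, including an extrapolation beyond the tested range
new_times <- data.frame(stratification_time = c(0, 90, 180))
pred_lm <- predict(m_proposed, newdata = new_times)
pred_glm <- predict(m_binomial, newdata = new_times, type = "response")
rbind(lm = pred_lm, binomial_glm = pred_glm)
```

The binomial predictions always stay between 0 and 1, while nothing constrains the linear model's predictions to that range.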

2.5 🔴 Track C: Advanced - Simulate, Fix & Extend (Choose One Station)

Your task: Take one of the diagnosis stations, simulate realistic data (including appropriate complexity), then fit the analysis two ways: the “proposed” (wrong) way and the “correct” way. Then extend: add a visualization, check for additional issues, or adapt the code structure for your own research project.

Your Deliverable

Simulate data that realistically reflects the scenario (think about variance structure!)
Fit the proposed model
Fit the correct model
Extend in one of these ways:
- Create a publication-ready visualization
- Check for additional issues (overdispersion, influential points, etc.)
- Adapt the analysis structure for YOUR research data
Be prepared to explain your approach to the class

2.5.1 Station 2: Wheat Yield Trials

Research Question: Which of 5 wheat varieties produces the highest yield?

Data Structure:

5 varieties tested in 4 blocks (randomized complete block design)
One plot per variety per block (20 plots total)
Yield (kg/ha) recorded once per plot
Data look approximately normal

Proposed Method: aov(yield ~ variety) (one-way ANOVA)

Your Diagnosis from Session A: Needs modification - what’s missing?

Challenges:

Simulate data with BOTH variety effects AND block effects
What happens to your standard errors and p-values when you ignore blocks?
Should block be fixed or random? When does it matter?
Extension: Add post-hoc comparisons with emmeans
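
A minimal sketch of the first two challenges, assuming made-up variety effects, block effects, and residual noise. Block is treated as a fixed effect here purely to keep the example in base R; the `emmeans` call runs only if that package is installed.

```r
set.seed(3)

# 5 varieties x 4 blocks, one plot each (expand.grid returns factors)
plots <- expand.grid(variety = paste0("V", 1:5), block = paste0("B", 1:4))

# Assumed effects in kg/ha, for illustration only
variety_effect <- c(V1 = 0, V2 = 150, V3 = 300, V4 = 450, V5 = 600)
block_effect <- c(B1 = -400, B2 = -100, B3 = 150, B4 = 350)

plots$yield <- 4000 +
  unname(variety_effect[as.character(plots$variety)]) +
  unname(block_effect[as.character(plots$block)]) +
  rnorm(nrow(plots), mean = 0, sd = 120)

# Proposed model: ignores blocks
m_proposed <- aov(yield ~ variety, data = plots)

# Model that accounts for the blocking structure
m_blocked <- aov(yield ~ block + variety, data = plots)

summary(m_proposed)
summary(m_blocked)

# Optional extension: pairwise comparisons among varieties
if (requireNamespace("emmeans", quietly = TRUE)) {
  print(emmeans::emmeans(m_blocked, pairwise ~ variety))
}
```

Compare the residual mean square and the F-test for variety between the two summaries.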

2.5.2 Station 3: Bird Abundance Over Time

Research Question: Is bird abundance declining in urban parks?

Data Structure:

8 urban parks surveyed
Each park visited 6 times per year for 5 years (240 total observations)
Response: count of birds per visit (range: 0–89)
Year as continuous predictor

Proposed Method: lm(bird_count ~ year)

Your Diagnosis from Session A: Wrong method - multiple issues!

Challenges:

These data have at least THREE problems with the proposed method. What are they?
Simulate data with park-level random effects and temporal structure
What distribution should counts follow?
Extension: Try adding random slopes - do parks decline at different rates?
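
One possible way to begin, with park-level random intercepts and overdispersed counts. The baseline abundance, rate of decline, park standard deviation, and negative binomial `size` are all assumed values; this sketch does not simulate within-year temporal correlation. The mixed model uses `lme4::glmer.nb` as one option and only runs if lme4 is installed.

```r
set.seed(2024)

# 8 parks x 5 years x 6 visits = 240 rows
visits <- expand.grid(visit = 1:6, year = 1:5,
                      park = factor(paste0("P", 1:8)))

# Assumed park-level deviations on the log scale
park_intercept <- rnorm(8, mean = 0, sd = 0.5)

# Assumed log-linear decline of about 10% per year
mu <- exp(3.2 - 0.10 * visits$year + park_intercept[as.integer(visits$park)])
visits$bird_count <- rnbinom(nrow(visits), mu = mu, size = 3)

# Proposed model: normal errors, all visits treated as independent
m_proposed <- lm(bird_count ~ year, data = visits)

# Quick overdispersion check from a plain Poisson GLM
m_pois <- glm(bird_count ~ year, family = poisson, data = visits)
dispersion <- sum(residuals(m_pois, type = "pearson")^2) / df.residual(m_pois)
dispersion  # values well above 1 suggest overdispersion

# One option: negative binomial GLMM with a random intercept per park
if (requireNamespace("lme4", quietly = TRUE)) {
  m_mixed <- lme4::glmer.nb(bird_count ~ year + (1 | park), data = visits)
  print(summary(m_mixed)$coefficients)
}
```

For the random-slopes extension, the random-effects term could be changed to `(1 + year | park)`, though with only 8 parks the slope variance may be poorly estimated.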

2.5.3 Station 5: Pollinator Network Complexity

Research Question: Do pollinator networks become more complex with plant diversity?

Data Structure:

25 meadows sampled
Plant diversity (species richness) recorded per meadow
Network complexity score calculated (continuous, 1.2–8.7)
Data are approximately normal; a linear relationship looks reasonable

Proposed Method: Bayesian multilevel model with varying intercepts and slopes, spatial Gaussian process

Your Diagnosis from Session A: This is OVERKILL!

Challenges:

Simulate simple, well-behaved data that matches this scenario
Show that a simple lm() is sufficient
Discussion: When WOULD the complex approach be justified? What would the data need to look like?
Extension: What are the costs of over-complicating your analysis? Write a brief argument for simplicity.
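
The first two challenges can be sketched in a few lines of base R. The richness range, intercept, slope, and residual standard deviation are assumptions chosen so that the simulated scores fall roughly in the range given in the scenario.

```r
set.seed(5)

# 25 meadows with assumed plant species richness between 5 and 40
meadows <- data.frame(plant_richness = sample(5:40, size = 25, replace = TRUE))

# Assumed linear relationship with normal, constant-variance noise
meadows$complexity <- 1.5 + 0.15 * meadows$plant_richness +
  rnorm(nrow(meadows), mean = 0, sd = 0.6)

# Simple linear model
m_simple <- lm(complexity ~ plant_richness, data = meadows)

summary(m_simple)
confint(m_simple)

# Residual check: is there evidence against normality?
shapiro.test(residuals(m_simple))
```

If the residuals look fine and there is one observation per meadow, there is no grouping or spatial structure in these simulated data for a multilevel or Gaussian-process model to capture.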

2.6 End of Session Wrap-Up

Deliverables (choose based on your track):

Track	What to submit/share
🟢 Track A	The completed Intro to R assignment (follow its instructions), or your working `.qmd` file with at least one model output
🟡 Track B	Your simulation + comparison of wrong vs. correct model, with 2-3 sentences explaining what you learned.
🔴 Track C	Your extended analysis + one visualization + notes on how this applies to your research

2.6.0.1 Exit Ticket (5 min)

Is thinking about how methods align with your objectives useful?
