# ===========================================
# SNR 690: Your First Model Template
# ===========================================
# 1. Load packages (install if needed)
library(tidyverse)
# 2. Load YOUR data (replace with your file path)
# my_data <- read_csv("your_data.csv")
# For now, let's use example data:
my_data <- data.frame(
  yield = c(12, 15, 14, 18, 20, 22, 19, 25, 17, 21),
  temperature = c(15, 16, 15, 18, 20, 21, 19, 23, 17, 20),
  rainfall = c(50, 55, 48, 60, 70, 75, 65, 80, 58, 68),
  site = c("A", "A", "B", "B", "C", "C", "D", "D", "E", "E")
)
# 3. Explore your data
head(my_data)
summary(my_data)
# Visualize relationships
ggplot(my_data, aes(x = temperature, y = yield)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Yield vs Temperature")
# -------------------------------------------
# MODEL 1: Single predictor
# -------------------------------------------
# 4. Fit a simple model with ONE predictor
model1 <- lm(yield ~ temperature, data = my_data)
# 5. Look at the output
summary(model1)
# 6. Check assumptions (look at these plots!)
par(mfrow = c(2, 2))
plot(model1)
# QUESTION: What does the slope for temperature mean?
# Write one sentence here:
#
# -------------------------------------------
# MODEL 2: Two predictors
# -------------------------------------------
# 7. Now fit a model with TWO predictors
model2 <- lm(yield ~ temperature + rainfall, data = my_data)
# 8. Look at the output
summary(model2)
# 9. Check assumptions
par(mfrow = c(2, 2))
plot(model2)
# 10. Compare models
# Which model explains more variance? (Hint: look at R-squared)
#
# QUESTION: Interpret the coefficient for rainfall.
# What does it mean while "controlling for" temperature?
# Write one sentence here:
#
# -------------------------------------------
# YOUR TURN: Adapt for your data
# -------------------------------------------
# First, refit model2 with an interaction term (temperature * rainfall). What changed?
# Replace the example data with your own data. Simulate data if needed. Let's look at your own system.
# What is your response variable?
# What are 1-2 predictor variables you want to test?
# Fit the models and interpret!

Week 03 - Statistical Modeling Framework
Week: 3
Topic: Statistical Modeling Framework
0.1 Learning Objectives
Connect research questions, data structure and data analysis
0.2 Background
No background this week. After class, review the following topics:
- Inference
- Linear Models
- Distributions
  - Normal
  - Binomial
  - Poisson
  - Negative Binomial
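As a starting point for that review, here is a minimal sketch that draws a sample from each of the four distributions using base R and compares their means and variances. The parameter values and the seed are arbitrary illustrations, not values used in class.

```r
set.seed(690)  # arbitrary seed so the draws are reproducible
n <- 1000

# Normal: continuous and symmetric (e.g. a measurement with error)
x_normal <- rnorm(n, mean = 10, sd = 2)

# Binomial: successes out of a fixed number of trials (e.g. seeds out of 50)
x_binomial <- rbinom(n, size = 50, prob = 0.3)

# Poisson: counts whose variance equals their mean
x_poisson <- rpois(n, lambda = 5)

# Negative binomial: counts with extra variance (variance = mu + mu^2 / size)
x_negbin <- rnbinom(n, mu = 5, size = 1.5)

# Mean and variance of each sample, side by side
dist_summary <- sapply(
  list(normal = x_normal, binomial = x_binomial,
       poisson = x_poisson, negbin = x_negbin),
  function(x) c(mean = mean(x), var = var(x))
)
dist_summary
```

Comparing the Poisson and negative binomial columns shows two count samples with the same target mean but very different spread, which is worth keeping in mind when you think about your own response variable.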
0.3 Before class
Think about:
- A dataset you’re currently working with (or interested in working with)
- The main research question you want to answer
- What is your response variable? What is its distribution?
1 Session A:
1.1 Part 1: Silent Reflection (10 min)
On the paper I gave you, write the following:
- My research question is (one sentence):
- My response variable is: (circle one or write your own)
  - Continuous measurements
  - Counts
  - Binary (yes/no, presence/absence)
  - Proportions
  - Categories
  - Other: ___________
- My data structure includes: (check all that apply)
- One worry I have about analyzing my data:
1.2 Part 2: Graffiti discussion (12 min)
Your task:
- Move freely around the room
- Use markers to write on ANY of the questions (words, phrases, questions, drawings)
- Read what others wrote
- Add +1, arrows, or comments to build on others’ ideas
- No names required
Wall 1: “What’s the hardest thing about data analysis for you?”
Wall 2: “What’s the gap between what you learned in stats class and what you actually need/use?”
Wall 3: “What do you wish your advisor/committee understood about statistical analysis?”
Wall 4: “What statistical concept do you pretend to understand but actually don’t?”
Wall 5: “What do you want to learn in this class?”
Wall 6: “Do you ever feel like you are doing analyses that you do not understand or that might be wrong, but don’t know why? If yes, explain”
1.3 Post-graffiti (6 min)
Brief discussion of themes and patterns from the walls. This will inform future class topics.
1.4 Part 3: Main concept -> Lecture
1.5 The Method-Question-Data Triangle (15 min)
         RESEARCH QUESTION
                /\
               /  \
              /    \
             /      \
            /        \
           /          \
DATA TYPE -------------- METHOD CHOICE
All three need to be aligned. Dr. Molina will give a lecture showing examples of a mismatch, a good match, and an overcomplicated model.
1.7 Part 4: Data Detective: Diagnosis Stations (25 min)
1.7.1 Setup
Six stations around the room, each with a scenario card describing:
- A research question
- A dataset description
- A statistical method that was used (or is being proposed)
Your job: Diagnose whether the method fits the question and data.
Rotation schedule:
- 4 minutes per station
- Work in pairs or trios
- Record your diagnosis on the station worksheet
1.7.1.1 🔍 Station 1: Salamander Survival
Research Question:
Does canopy cover affect salamander survival in forest fragments?
Data:
- 12 forest fragments
- 20 salamanders marked in each fragment (240 total)
- Survival recorded as binary (alive/dead) after one year
- Canopy cover measured as % for each fragment
Proposed Method:
Logistic regression: glm(survival ~ canopy_cover, family = binomial)
Your Diagnosis:
Why?
1.7.1.2 🔍 Station 2: Wheat Yield Trials
Research Question:
Which of 5 wheat varieties produces the highest yield?
Data:
- 5 varieties tested in 4 blocks (randomized complete block design)
- One plot per variety per block (20 plots total)
- Yield (kg/ha) recorded once per plot
- Data look approximately normal
Proposed Method:
One-way ANOVA: aov(yield ~ variety)
Your Diagnosis:
Why?
1.7.1.3 🔍 Station 3: Bird Abundance Over Time
Research Question:
Is bird abundance declining in urban parks?
Data:
- 8 urban parks surveyed
- Each park visited 6 times per year for 5 years (240 total observations)
- Response: count of birds per visit (range: 0–89)
- Year as continuous predictor
Proposed Method:
Linear regression: lm(bird_count ~ year)
Your Diagnosis:
Why?
1.7.1.4 🔍 Station 4: Soil Microbial Diversity
Research Question:
Does tillage treatment affect soil microbial diversity?
Data:
- 3 treatments: no-till, reduced till, conventional till
- 10 fields per treatment (30 fields total)
- Shannon diversity index calculated for each field (continuous, 0–5)
- Data slightly right-skewed but otherwise well-behaved
Proposed Method:
Kruskal-Wallis test (non-parametric)
Your Diagnosis:
Why?
1.7.1.5 🔍 Station 5: Pollinator Network Complexity
Research Question:
Do pollinator networks become more complex with plant diversity?
Data:
- 25 meadows sampled
- Plant diversity (species richness) recorded per meadow
- Network complexity score calculated (continuous, 1.2–8.7)
- Data are normal, linear relationship looks reasonable
Proposed Method:
Bayesian multilevel model with varying intercepts and slopes, spatial Gaussian process
Your Diagnosis:
Why?
1.7.1.6 🔍 Station 6: Seedling Germination Experiment
Research Question:
Does stratification time (0, 30, 60, 90 days) affect germination rate?
Data:
- 4 stratification treatments
- 10 petri dishes per treatment (40 dishes)
- 50 seeds per dish
- Response: proportion germinated (0.0–1.0)
Proposed Method:
Linear regression: lm(proportion ~ stratification_time)
Your Diagnosis:
Why?
2 Session B:
2.1 Discussion of Diagnosis Stations (10 min)
2.2 Self-Assessment (1 minute)
Before we begin, place yourself into a track based on your current comfort level:
| Track | Description | You should choose this if… |
|---|---|---|
| 🟢 Track A | New to R / Need a refresher | “I’ve never used R” OR “It’s been a while and I need to review some basics” |
| 🟡 Track B | Some experience | “I can run models but need practice choosing the right one” |
| 🔴 Track C | Experienced | “I can fit GLMs, troubleshoot code, and want a bit of a challenge” |
2.3 🟢 Track A: Beginners (Choose One Option)
2.3.1 Option A1: Introduction to R, Projects, and Quarto
If you are completely new to R or need a structured introduction, work through the Intro to R, Projects and Quarto assignment.
Your goal: Complete the tutorial and have a working .qmd file that renders by the end of class.
2.3.2 Option A2: My First Model - Guided Template
If you have some R basics but haven’t run statistical models, use this guided template.
Your goal: Fit two models, interpret the output, and check assumptions.
2.4 🟡 Track B: Intermediate - Simulate & Analyze (Choose One Station)
Your task: Take one of the diagnosis stations from Session A (choose from those listed below), simulate data that matches the scenario, then fit the analysis two ways: the “proposed” (potentially wrong) way and the “correct” way. Compare the results.
- Simulate data that matches the scenario description
- Fit the proposed model (as written in the station)
- Fit the correct model (what SHOULD be used)
- Compare the outputs - what’s different? Why does it matter?
- Write 2-3 sentences explaining what you learned
2.4.1 Station 1: Salamander Survival
Research Question: Does canopy cover affect salamander survival in forest fragments?
Data Structure:
- 12 forest fragments
- 20 salamanders marked in each fragment (240 total)
- Survival recorded as binary (alive/dead) after one year
- Canopy cover measured as % for each fragment
Proposed Method: glm(survival ~ canopy_cover, family = binomial)
Your Diagnosis from Session A: Needs modification - what’s missing?
Hints:
- What is the unit of observation? What is the unit of replication?
- Are salamanders within the same fragment independent?
- What package might you need for the correct approach?
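To get started on the simulation, here is a minimal sketch for this station. The effect sizes, the among-fragment standard deviation, and the seed are hypothetical values chosen for illustration. The random-intercept model from `lme4` is one candidate alternative, not the only defensible one, and it only runs if `lme4` is installed.

```r
set.seed(1)    # arbitrary seed
n_frag <- 12   # forest fragments
n_per  <- 20   # salamanders marked per fragment

# One canopy value and one random intercept per fragment (not per salamander)
canopy_cover <- runif(n_frag, min = 20, max = 90)
frag_effect  <- rnorm(n_frag, mean = 0, sd = 1)  # assumed SD on the logit scale

# Hypothetical true effect of canopy on the log-odds of survival
beta0 <- -2
beta1 <- 0.03

sal <- data.frame(
  fragment     = factor(rep(seq_len(n_frag), each = n_per)),
  canopy_cover = rep(canopy_cover, each = n_per)
)
eta <- beta0 + beta1 * sal$canopy_cover + rep(frag_effect, each = n_per)
sal$survival <- rbinom(nrow(sal), size = 1, prob = plogis(eta))

# Proposed model: treats all 240 salamanders as independent observations
m_proposed <- glm(survival ~ canopy_cover, family = binomial, data = sal)
summary(m_proposed)$coefficients

# One candidate alternative: a random intercept for fragment
if (requireNamespace("lme4", quietly = TRUE)) {
  m_mixed <- lme4::glmer(survival ~ canopy_cover + (1 | fragment),
                         family = binomial, data = sal)
  print(summary(m_mixed)$coefficients)
}
```

Compare the standard error of the `canopy_cover` slope between the two fits, then try changing the `sd` of `frag_effect` to see how much the among-fragment variation matters.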
2.4.2 Station 4: Soil Microbial Diversity
Research Question: Does tillage treatment affect soil microbial diversity?
Data Structure:
- 3 treatments: no-till, reduced till, conventional till
- 10 fields per treatment (30 fields total)
- Shannon diversity index calculated for each field (continuous, 0–5)
- Data slightly right-skewed but otherwise well-behaved
Proposed Method: Kruskal-Wallis test (non-parametric)
Your Diagnosis from Session A: Is this wrong, or just overly cautious?
Hints:
- When is a non-parametric test truly necessary?
- What assumptions does ANOVA actually require?
- How robust is ANOVA to slight skewness?
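A minimal sketch for simulating this station and fitting both approaches follows. The treatment means and the gamma shape used to produce a slight right skew are hypothetical choices, and the cap at 5 is only there to respect the stated range of the index.

```r
set.seed(4)  # arbitrary seed
trt_levels <- c("no_till", "reduced_till", "conventional_till")
tillage <- factor(rep(trt_levels, each = 10), levels = trt_levels)

# Hypothetical treatment means for the Shannon index
true_means <- c(no_till = 3.0, reduced_till = 2.7, conventional_till = 2.3)

# Gamma draws give a slight right skew; larger shape means less skew
shape <- 20
shannon <- rgamma(length(tillage), shape = shape,
                  rate = shape / true_means[as.character(tillage)])
shannon <- pmin(shannon, 5)  # keep values within the stated 0 to 5 range

soil <- data.frame(tillage = tillage, shannon = shannon)

# Proposed method: Kruskal-Wallis rank test
kw <- kruskal.test(shannon ~ tillage, data = soil)
kw$p.value

# Parametric alternative: one-way ANOVA, plus a normal Q-Q plot of residuals
m_aov <- aov(shannon ~ tillage, data = soil)
summary(m_aov)
plot(m_aov, which = 2)
```

Rerun the block with a few different seeds, or with a smaller `shape` for stronger skew, and note how often the two tests lead you to a different conclusion.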
2.4.3 Station 6: Seedling Germination
Research Question: Does stratification time (0, 30, 60, 90 days) affect germination rate?
Data Structure:
- 4 stratification treatments
- 10 petri dishes per treatment (40 dishes)
- 50 seeds per dish
- Response: proportion germinated (0.0–1.0)
Proposed Method: lm(proportion ~ stratification_time)
Your Diagnosis from Session A: Wrong method - what would you use instead?
Hints:
- What’s the difference between a proportion and a count?
- Why can’t you just use lm() on proportions?
- Look up cbind() with glm(..., family = binomial)
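The sketch below simulates this station and fits both models, following the `cbind()` hint. The intercept and slope on the logit scale are hypothetical, and the prediction at 150 days is an extrapolation included only to show how the two models behave outside the tested range.

```r
set.seed(6)  # arbitrary seed
n_seeds <- 50
germ <- data.frame(stratification_time = rep(c(0, 30, 60, 90), each = 10))

# Hypothetical true germination probability, rising with stratification time
p_true <- plogis(-2 + 0.04 * germ$stratification_time)
germ$germinated <- rbinom(nrow(germ), size = n_seeds, prob = p_true)
germ$proportion <- germ$germinated / n_seeds

# Proposed model: straight line through the proportions
m_lm <- lm(proportion ~ stratification_time, data = germ)

# Binomial model: successes and failures per dish via cbind()
m_glm <- glm(cbind(germinated, n_seeds - germinated) ~ stratification_time,
             family = binomial, data = germ)

# Predicted germination proportion at 150 days under each model
new_days <- data.frame(stratification_time = 150)
pred_lm  <- predict(m_lm,  newdata = new_days)
pred_glm <- predict(m_glm, newdata = new_days, type = "response")
c(lm = unname(pred_lm), glm = unname(pred_glm))
```

Check whether each prediction is a possible value for a proportion, and look at the residual plots of `m_lm` to see how the variance changes across treatments.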
2.5 🔴 Track C: Advanced - Simulate, Fix & Extend (Choose One Station)
Your task: Take one of the diagnosis stations, simulate realistic data (including appropriate complexity), then fit the analysis two ways: the “proposed” (wrong) way and the “correct” way. Then extend: add a visualization, check for additional issues, or adapt the code structure for your own research project.
- Simulate data that realistically reflects the scenario (think about variance structure!)
- Fit the proposed model
- Fit the correct model
- Extend in one of these ways:
- Create a publication-ready visualization
- Check for additional issues (overdispersion, influential points, etc.)
- Adapt the analysis structure for YOUR research data
- Be prepared to explain your approach to the class
2.5.1 Station 2: Wheat Yield Trials
Research Question: Which of 5 wheat varieties produces the highest yield?
Data Structure:
- 5 varieties tested in 4 blocks (randomized complete block design)
- One plot per variety per block (20 plots total)
- Yield (kg/ha) recorded once per plot
- Data look approximately normal
Proposed Method: aov(yield ~ variety) (one-way ANOVA)
Your Diagnosis from Session A: Needs modification - what’s missing?
Challenges:
- Simulate data with BOTH variety effects AND block effects
- What happens to your standard errors and p-values when you ignore blocks?
- Should block be fixed or random? When does it matter?
- Extension: Add post-hoc comparisons with emmeans
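Here is a minimal sketch for the first two challenges. The variety effects, block effects, and residual standard deviation are hypothetical numbers. Block is treated as a fixed effect here purely to keep the example in base R. The `emmeans` step only runs if that package is installed.

```r
set.seed(2)  # arbitrary seed

# One plot per variety per block: 5 x 4 = 20 plots
wheat <- expand.grid(variety = factor(paste0("V", 1:5)),
                     block   = factor(paste0("B", 1:4)))

# Hypothetical effects in kg/ha
variety_effect <- c(V1 = 0, V2 = 150, V3 = 300, V4 = -100, V5 = 200)
block_effect   <- c(B1 = -400, B2 = -100, B3 = 150, B4 = 350)

wheat$yield <- unname(
  4000 +
    variety_effect[as.character(wheat$variety)] +
    block_effect[as.character(wheat$block)] +
    rnorm(nrow(wheat), mean = 0, sd = 120)
)

# Proposed model: ignores the blocking structure
m_proposed <- aov(yield ~ variety, data = wheat)

# Model that accounts for blocks (fixed block effect in this sketch)
m_block <- aov(yield ~ block + variety, data = wheat)

# Residual mean square under each model
ms_proposed <- deviance(m_proposed) / df.residual(m_proposed)
ms_block    <- deviance(m_block) / df.residual(m_block)
c(proposed = ms_proposed, with_block = ms_block)

summary(m_proposed)
summary(m_block)

# Extension: pairwise variety comparisons
if (requireNamespace("emmeans", quietly = TRUE)) {
  print(emmeans::emmeans(m_block, pairwise ~ variety))
}
```

Shrinking `block_effect` toward zero is a quick way to see when ignoring blocks stops mattering.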
2.5.2 Station 3: Bird Abundance Over Time
Research Question: Is bird abundance declining in urban parks?
Data Structure:
- 8 urban parks surveyed
- Each park visited 6 times per year for 5 years (240 total observations)
- Response: count of birds per visit (range: 0–89)
- Year as continuous predictor
Proposed Method: lm(bird_count ~ year)
Your Diagnosis from Session A: Wrong method - multiple issues!
Challenges:
- This data has at least THREE problems with the proposed method. What are they?
- Simulate data with park-level random effects and temporal structure
- What distribution should counts follow?
- Extension: Try adding random slopes - do parks decline at different rates?
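The sketch below is one way to begin the simulation. The baseline abundance, rate of decline, among-park standard deviation, and negative binomial `size` are hypothetical, so the simulated range only roughly matches the scenario, and no temporal autocorrelation is simulated. The mixed model is a Poisson fit with a park random intercept, shown as one candidate step rather than a final answer, and it only runs if `lme4` is installed.

```r
set.seed(3)  # arbitrary seed

# 8 parks x 5 years x 6 visits = 240 observations
birds <- expand.grid(visit = 1:6, year = 1:5,
                     park = factor(paste0("P", 1:8)))

# Park-level random intercepts on the log scale (assumed SD)
park_effect <- rnorm(8, mean = 0, sd = 0.5)

# Hypothetical decline of about 10% per year from a baseline of 20 birds
log_mu <- log(20) - 0.10 * (birds$year - 1) +
  park_effect[as.integer(birds$park)]

# Overdispersed counts: negative binomial with an assumed size parameter
birds$bird_count <- rnbinom(nrow(birds), mu = exp(log_mu), size = 3)

# Proposed model
m_proposed <- lm(bird_count ~ year, data = birds)
summary(m_proposed)$coefficients

# Poisson GLM and a simple Pearson overdispersion statistic
m_pois <- glm(bird_count ~ year, family = poisson, data = birds)
dispersion <- sum(residuals(m_pois, type = "pearson")^2) / df.residual(m_pois)
dispersion

# Candidate next step: random intercept for park
if (requireNamespace("lme4", quietly = TRUE)) {
  m_mixed <- lme4::glmer(bird_count ~ year + (1 | park),
                         family = poisson, data = birds)
  print(summary(m_mixed)$coefficients)
}
```

A dispersion statistic well above 1 suggests the Poisson variance assumption does not hold, which is a reason to consider a negative binomial family for the final model.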
2.5.3 Station 5: Pollinator Network Complexity
Research Question: Do pollinator networks become more complex with plant diversity?
Data Structure:
- 25 meadows sampled
- Plant diversity (species richness) recorded per meadow
- Network complexity score calculated (continuous, 1.2–8.7)
- Data are normal, linear relationship looks reasonable
Proposed Method: Bayesian multilevel model with varying intercepts and slopes, spatial Gaussian process
Your Diagnosis from Session A: This is OVERKILL!
Challenges:
- Simulate simple, well-behaved data that matches this scenario
- Show that a simple lm() is sufficient
- Discussion: When WOULD the complex approach be justified? What would the data need to look like?
- Extension: What are the costs of over-complicating your analysis? Write a brief argument for simplicity.
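For the first two challenges, a minimal sketch might look like the following. The richness range, slope, intercept, and residual standard deviation are hypothetical values picked so the simulated scores fall near the stated range of the complexity score.

```r
set.seed(5)  # arbitrary seed
n_meadows <- 25

# Plant species richness per meadow (hypothetical range)
richness <- sample(5:40, size = n_meadows, replace = TRUE)

# Simple linear signal with normal, independent errors
complexity <- 1 + 0.18 * richness + rnorm(n_meadows, mean = 0, sd = 0.6)
meadows <- data.frame(richness = richness, complexity = complexity)

# A plain linear model
m_simple <- lm(complexity ~ richness, data = meadows)
summary(m_simple)$coefficients
confint(m_simple)

# Standard diagnostic plots for the simple model
par(mfrow = c(2, 2))
plot(m_simple)
par(mfrow = c(1, 1))
```

If the diagnostics look clean and there is no grouping or spatial structure in how the meadows were sampled, that supports the argument that extra model layers add cost without adding information.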
2.6 End of Session Wrap-Up
Deliverables (choose based on your track):
| Track | What to submit/share |
|---|---|
| 🟢 Track A | Intro to R: follow the assignment instructions. Otherwise, your working .qmd file with at least one model output |
| 🟡 Track B | Your simulation + comparison of wrong vs. correct model, with 2-3 sentences explaining what you learned. |
| 🔴 Track C | Your extended analysis + one visualization + notes on how this applies to your research |
2.6.0.1 Exit Ticket (5 min)
Is thinking about how methods align with your objectives useful?