Week 04 - Version Control, Git & Collaboration

Week 4: Objectives and Background

Topic: Introduction to Git, GitHub, and Collaborative Reproducible Research

0.1 Learning Objectives

  • Explain why version control matters for reproducible, transparent research
  • Connect version control philosophy to themes from Weeks 1–3 (data exploration, model selection, the Method-Question-Data triangle)
  • Use Git and GitHub for basic version control (init, add, commit, push, pull)
  • Collaborate on a shared repository using branches and pull requests
  • Apply version control to a Quarto-based analytical workflow
The Semester So Far - Key Takeaways from Weeks 1–3
  • Week 1 - Introductions & Papers: Data exploration protects inference (Zuur); exploration vs. inference should be independent (Tredennick); always explore your data before modeling
  • Week 2 - Philosophy of Data Analysis: “All models are wrong” (Box); Bayesian vs. frequentist thinking; uncertainty is everywhere; the goal is to find the useful model, not the true one
  • Week 3 - Statistical Modeling Framework: The Method-Question-Data Triangle must align; mismatch → wrong conclusions; overcomplication → wasted effort; start simple, add complexity only when needed
  • Week 4 - Version Control & Collaboration: Your analytical decisions need to be tracked, transparent, and reproducible - Git makes this possible

In Weeks 1–3, we talked about how to think about data analysis:

  • Week 1 emphasized that data exploration and inference are separate activities - but both must be documented. How do you keep track of which analyses were exploratory vs. confirmatory?
  • Week 2 introduced the idea that all models are wrong but some are useful. If you try multiple models, how do you record which ones you tried, why you chose one over another, and what changed?
  • Week 3 showed us the Method-Question-Data Triangle and the dangers of mismatch vs. overcomplication. When you iterate through model choices - trying an lm(), realizing you need a glmer(), adding random effects - how do you track that evolution?

Version control can help us with all of this. And at worst, it is simply a great tool to have!

About Week 3 Simulations

Many of you found the simulation exercises in Week 3 challenging - simulating data that matches a scenario description, then fitting both the “proposed” (wrong) and “correct” models. That’s completely normal. Simulating data is a skill that takes practice.

Session B this week is designed to give you a second pass at those simulations. The first hands-on activity will walk through the Week 3 simulation & analysis exercises together, step by step. Then we’ll layer Git on top, so you’re learning version control with familiar code rather than starting from scratch.

0.2 Background Reading

Before Thursday's class, review the following:

Make sure you have:

Think about:

  • How many versions of your thesis/analysis scripts do you currently have?
  • Have you ever lost work, overwritten a file, or been unable to undo a change?
  • Have you ever emailed a script to a collaborator and gotten confused about which version is “current”?

1 Session A: Why Version Control? (Lecture + Discussion, 75 min)

1.1 Part 1: Retrieval Practice - What Do You Remember? (8 min)

Before we introduce anything new, let’s activate what you already know from Weeks 1–3. No notes, no phones - this is retrieval practice. Struggling to recall is the point; it strengthens memory.

Quick-Fire Recall (write on paper, 3 min)

Answer as many as you can from memory. One or two sentences each is fine.

  1. What did Zuur argue about data exploration? What is the goal of exploring your data before modeling?
  2. What is Box’s famous quote about models? What does it mean in practice?
  3. In Week 3, what were the three scenarios we examined? (Hint: one was a “good match”)
  4. In the examples from Week 3, explain one of the experiments and what went wrong.
  5. What is the Golden Rule from Week 3?

Debrief (5 min): Turn to a neighbor and compare answers. Fill in each other’s gaps.

Why Retrieval Practice?

Research shows that actively recalling information - even when it’s difficult - produces stronger long-term retention than re-reading notes (Roediger & Butler, 2011). This is a “desirable difficulty.” If it felt hard, that’s good!

1.2 Part 2: Predict-Observe-Explain (12 min)

Quick Poll: Raise your hand if you’ve ever:

  1. Lost work because you overwrote a file
  2. Couldn’t remember why you changed something in your code
  3. Had a collaborator edit the same file and you had to manually merge changes
  4. Had a folder full of _v2, _v3, _FINAL files

1.2.1 🔮 PREDICT (3 min - write on paper)

Here is a scenario. Predict what will go wrong:

You’re writing your thesis. You have a file called analysis.R. You make changes over two weeks. Your folder now looks like this:

analysis.R
analysis_v2.R
analysis_FINAL.R
analysis_FINAL_v2.R
analysis_FINAL_ACTUALLY_FINAL.R
analysis_FINAL_ACTUALLY_FINAL_USE_THIS_ONE.R

Your advisor emails and says: “The reviewer wants to see the version where you used the Poisson model instead of the linear model. Can you send that?”

Write down: What happens next? What specifically goes wrong? How long does it take you to find the right version? Do you even have it?

1.2.2 👁️ OBSERVE (2 min)

I will demonstrate how version control handles this exact scenario, and why it is so powerful.

1.2.3 💡 EXPLAIN (5 min - whole class)

As a class, answer:

  1. Why does the file-naming approach fail? (Be specific: what information is lost?)
  2. What does Git preserve that file-naming doesn’t?
  3. Can you connect this to Box’s “all models are wrong” idea? (Hint: if you’re iterating through models, you need a record of which wrong models you tried and why you moved on)

1.3 Part 3: Why Version Control? - Elaborative Interrogation (10 min)

Elaborative interrogation means asking “why?” and “how?” until you reach a deep understanding.

1.3.1 Round 1: Pair Up - “Why” Chain (6 min)

With a partner, take turns asking “why?” about this statement. Go at least four levels deep:

“Researchers should use version control for their analyses.”

1.3.2 Round 2: Connect to a Specific Week (5 min)

Each pair picks two of the statements below and runs the “why?” and “how?” chain for each one.

  1. “Data exploration should protect inference, not create it.” (Zuur, Week 1)

  2. “All models are wrong, but some are useful.” (Box, Week 2)

  3. “Start simple. Add complexity only when needed. Always justify your choices.” (Golden Rule, Week 3)

  4. “Misalignment between method, question, and data leads to wrong conclusions, wasted effort, rejected papers, and sad grad students.” (Week 3)

  5. “Exploration and inference should come from independent studies.” (Tredennick, Week 1)

Share Out (2–3 min): 3–4 pairs share their best sentence.

1.4 Part 4: Lecture - What is Git? (≈20 min)

Dr. Molina will give a traditional lecture covering the following. Take notes - you’ll use this for a concept map right after.

1.4.1 What is version control?

  • A system that records changes to files over time
  • You can recall any previous version at any time
  • Think of it as Track Changes for your entire project - but much more powerful

1.4.2 Why Git specifically?

  • Created by Linus Torvalds (creator of Linux) in 2005
  • Distributed - every collaborator has the full project history
  • The industry standard for software, increasingly for science
  • Integrates with RStudio, Positron, and Quarto

1.4.3 Key Concepts

  • Repository (repo): a project folder tracked by Git. Analogy: your lab notebook for a project.
  • Commit: a snapshot of your files at a point in time. Analogy: an entry in your lab notebook.
  • Staging (git add): choosing which changes to include in the next commit. Analogy: deciding which observations to write up.
  • Branch: a parallel version of your project. Analogy: an exploratory side-analysis.
  • Merge: combining two branches. Analogy: integrating your exploratory findings into your main analysis.
  • Pull Request (PR): a request to merge your branch, plus review. Analogy: asking a collaborator to check your work before including it.
  • Clone: downloading a copy of a remote repository. Analogy: getting a copy of a shared lab notebook.
  • Push / Pull: sending/receiving updates to/from GitHub. Analogy: syncing your local lab notebook with the shared one.

1.4.4 The Git Mental Model

Your Computer (Local)                    GitHub (Remote)
┌─────────────────────┐                ┌──────────────────┐
│  Working Directory   │   git push    │                  │
│  (your files)        │ ──────────►   │   Remote Repo    │
│         │            │               │   (GitHub)       │
│    git add           │   git pull    │                  │
│         ▼            │ ◄──────────   │                  │
│  Staging Area        │               └──────────────────┘
│         │            │
│    git commit        │
│         ▼            │
│  Local Repository    │
│  (commit history)    │
└─────────────────────┘
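The left (local) half of this diagram can be walked through in the Terminal. A minimal sketch, assuming git is installed (the directory name git-demo and the file contents are illustrative):

```shell
# Walk the local half of the diagram: working directory -> staging area -> history
mkdir -p git-demo && cd git-demo
git init                              # create a new local repository
git config user.name "Your Name"      # tell Git who you are (illustrative values)
git config user.email "you@example.com"
echo "# My project" > README.md       # a change in the working directory
git add README.md                     # stage it (git add)
git commit -m "Add README"            # snapshot it into the local history (git commit)
git log --oneline                     # one line per commit in the history
```

With a GitHub remote attached, git push and git pull would then move commits across the right half of the diagram.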

1.4.5 Git is Not Just for Code

Git tracks any text file: .qmd, .R, .csv (small), .bib, .tex, .md

What Git is NOT Good For
  • Large binary files (images, PDFs, Word docs, big datasets)
  • Files that change constantly in unpredictable ways
  • Sensitive data (passwords, API keys, personally identifiable data)

For large data, look into .gitignore and Git Large File Storage (LFS).
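As an illustration, a .gitignore for an R/Quarto project might contain entries like these (the paths are examples, not requirements; adjust to your project):

```
# RStudio and R session files
.Rproj.user/
.Rhistory
.RData

# large raw data kept out of the repo (illustrative path)
data/raw/
```

Each line is a pattern; any file matching a pattern is ignored by Git and never staged or committed.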

1.4.6 GitHub ≠ Git

  • Git is the version control system; GitHub is a website that hosts Git repositories.
  • Git runs on your computer; GitHub runs in the cloud.
  • Git tracks history locally; GitHub lets you share, collaborate, and back up.
  • Git is free and open source; GitHub is free for public repos, and education accounts get extras.

1.5 Part 5: Concept Map - Your Understanding of Git (8 min)

Now that you’ve heard the lecture, build a concept map from memory. This tests whether you actually absorbed it.

1.5.1 Individual (5 min)

On a blank piece of paper, create a concept map using these terms. Draw circles for each term and arrows showing how they relate. Label the arrows with verbs (e.g., “creates”, “contains”, “sends to”).

Terms: Repository, Commit, Branch, Merge, Pull Request, Clone, Push, Pull, Staging Area, Working Directory, Remote (GitHub)

1.6 Part 6: Hands-On - Join GitHub, Create Your First Repo, First Commits (≈15 min)

We’re doing this NOW, together, in class.

The rest of Session A is hands-on.

1.6.1 Step 1: Make sure you’re on GitHub (2 min)

  1. Go to github.com
  2. If you don’t have an account, create one now
  3. If you haven’t applied for GitHub Education, do that after class (it gives you free features)
  4. Confirm you can log in

1.6.2 Step 2: Create a new RStudio Project with Git (3 min)

  1. Open RStudio
  2. Go to File → New Project → New Directory → New Project
  3. Name it: water-growth-sim
  4. ✅ Check “Create a git repository”
  5. Click Create Project

You now have a local Git repository! Notice the Git tab in your RStudio pane (usually top-right).

1.6.3 Step 3: Create a new Quarto file (2 min)

  1. File → New File → Quarto Document
  2. Title: "Water and Plant Growth Simulation"
  3. Save it as simulation.qmd

1.6.4 Step 4: Write the simulation - Step 1 only (data simulation) (3 min)

Type this into your simulation.qmd file. This is a simple simulation: water availability (x) affects plant growth (y) through a basic linear relationship.

# =============================================
# Water & Plant Growth: Simple Linear Simulation
# =============================================
# Research question: Does water availability affect plant growth?
# We KNOW the answer because we're simulating the data!

library(tidyverse)

set.seed(2024)

# Define the truth
n_plants <- 40             # number of plants
true_intercept <- 5        # baseline growth (cm) with no water
true_slope <- 2.5          # for every 1 unit increase in water, growth increases by 2.5 cm
noise_sd <- 3              # how much natural variation there is

# Simulate predictor: water availability (liters per week)
water <- runif(n_plants, min = 0, max = 10)

# Simulate response: plant growth (cm)
growth <- true_intercept + true_slope * water + rnorm(n_plants, mean = 0, sd = noise_sd)

# Put it in a data frame
plant_data <- data.frame(
  water = water,
  growth = growth
)

# Take a look
head(plant_data)
summary(plant_data)

# Visualize
ggplot(plant_data, aes(x = water, y = growth)) +
  geom_point(size = 3, alpha = 0.7) +
  labs(
    title = "Simulated: Water Availability vs. Plant Growth",
    subtitle = paste("True relationship: growth =", true_intercept, "+", true_slope, "× water + noise"),
    x = "Water Availability (L/week)",
    y = "Plant Growth (cm)"
  ) +
  theme_minimal(base_size = 14)

1.6.5 Step 5: Make your FIRST commit (3 min)

  1. Go to the Git tab in RStudio
  2. You’ll see your files listed (.qmd, .Rproj, maybe .gitignore)
  3. Check the boxes next to all the files (this is “staging” - git add)
  4. Click “Commit”
  5. In the commit message box, type: "Simulated water and plant growth data - true slope is 2.5"
  6. Click Commit

🎉 Congratulations - you just made your first Git commit!

Click “History” in the Git pane to see your commit. There’s your first lab notebook entry.

1.6.6 Step 6: Connect to GitHub and push (3 min)

  1. Go to github.com → click “+” (top right) → New repository
  2. Name it: water-growth-sim
  3. Leave it Public (or Private if you prefer)
  4. ⚠️ Do NOT check “Add a README” - your local repo already has files
  5. Click Create repository
  6. GitHub will show you instructions. Copy the commands under “…or push an existing repository from the command line”:
  7. In RStudio, go to Terminal tab (next to Console) and paste those commands
git remote add origin https://github.com/YOUR-USERNAME/water-growth-sim.git
git branch -M main
git push -u origin main
  8. Go back to GitHub in your browser and refresh the page - your files are there! 🎉

Check GitHub!

Go to https://github.com/YOUR-USERNAME/water-growth-sim and confirm you can see your simulation.qmd file. Click on it - you can read your code right in the browser. Click “Commits” to see your commit history (it has one entry so far).

  1. Now, return to RStudio and add a few lines of code to your simulation.qmd: fit the linear model using lm() and inspect it with summary(). After that, commit again, and see if you can push the changes to GitHub.
  2. Commit message: "Fit linear model to simulated data"
  3. Push to GitHub again.
  4. Render your Quarto document to see the output. You can also commit and push the rendered HTML if you want (that’s how I made this website!).
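If you want a starting point for the model-fitting step, the added lines might look like this (the object name growth_model is my choice, not part of the assignment):

```r
# Fit the linear model: does water availability predict growth?
growth_model <- lm(growth ~ water, data = plant_data)
summary(growth_model)  # the water estimate should land near the true slope (2.5)
```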

1.7 Muddiest Point (2 min)

On a piece of paper, write down the one thing from today that is most confusing or unclear to you. We will pick these up at the start of Session B.

2 Session B: Hands-On - Simulations Revisited + Git Workshop (75 min)

Session B

This session has two parts: a refresher on the Week 3 simulations, and a collaborative Git workshop.

This entire session is hands-on. You will be working on your computers the whole time. The goals are:

  1. Practice GitHub — cloning, committing, pushing, pulling, and collaborating on a shared repository. This is your first real deep dive after the intro in Session A.
  2. Revisit Week 3 simulations — specifically the “simulate → analyze correctly → analyze incorrectly” workflow, but this time we start simple (no transformations like log or logit) and work inside a Quarto document with proper source control.
  3. Work collaboratively — you’ll be working in small teams on a shared GitHub repository, experiencing what real collaborative version control feels like.

The rule for today: start simple. Simple linear models. Normal distributions. No transformations. Get comfortable with the simulation → correct analysis → wrong analysis pipeline and with Git before we add complexity.

2.1 Muddiest Points (2 min)

Let’s start the class by taking a couple of minutes to write down the things that have been the most complicated about this class so far.

2.2 Part 1: Week 3 Simulation Refresher (≈35 min)

2.2.1 Why Simulate? (5 min - mini-lecture)

Simulation is one of the most powerful tools in a statistician’s toolkit. It lets you:

  • Understand your model by seeing how data generated from that model behaves
  • Test your analysis pipeline on data where you know the truth
  • Diagnose problems - if your model can’t recover known parameters from simulated data, something is wrong

In Week 3, many of you found simulation challenging. That’s expected! Let’s break it down together.

The Simulation Recipe

Every simulation follows the same four steps:

  1. Define the truth - What are the true parameter values? (e.g., true effect = 3 units)
  2. Generate structure - Create the experimental design (e.g., 5 fields × 4 plots)
  3. Add randomness - Use a distribution to generate data (e.g., rnorm(), rpois(), rbinom())
  4. Fit the model - Analyze the simulated data and see if you recover the truth
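As a minimal sketch of the recipe with a single continuous predictor (all numbers here are illustrative, not from this week's activities):

```r
set.seed(1)

# 1. Define the truth
true_effect <- 3

# 2. Generate structure: 30 experimental units with a predictor
x <- runif(30, min = 0, max = 10)

# 3. Add randomness: normally distributed noise around the true line
y <- 2 + true_effect * x + rnorm(30, mean = 0, sd = 1.5)

# 4. Fit the model and see if you recover the truth
coef(lm(y ~ x))  # the x coefficient should be near 3
```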

2.3 Part 2: Setting Up Your Team Repository (≈10 min)

2.3.1 Form Pairs (1 min)

Get into pairs. You’ll be working on a shared GitHub repository for the rest of this session. Each person will take responsibility for one of the two simulation exercises, so you’ll both contribute a complete analysis to the same repo.

Why Shared Repos?

In real research, you rarely work alone. Shared repositories force you to practice the pull → edit → commit → push cycle and deal with real collaboration challenges (coordinating who’s working on what, writing clear commit messages for your partner). This is the whole point of today.

2.3.2 One Person Creates the Repo (3 min)

One partner does the following:

  1. Go to github.com → click “+” (top right) → New repository
  2. Name it: snr690-sim-team-[your-initials] (e.g., snr690-sim-team-amjk)
  3. ✅ Check “Add a README file”
  4. Set it to Public (so your partner can access it easily)
  5. Click Create repository
  6. Go to Settings → Collaborators → Add people and add your partner by their GitHub username

2.3.3 Both Partners Clone the Repo (3 min)

Both partners (including the creator):

  1. Go to the repository page on GitHub
  2. Click the green “<> Code” button → copy the HTTPS URL
  3. In RStudio: File → New Project → Version Control → Git
  4. Paste the URL, choose a directory, click Create Project

Checkpoint

Both partners should now have the same project open in RStudio, connected to the same GitHub repository. You should see the Git tab in your RStudio pane.

2.3.4 Decide Who Does What (1 min)

  • Partner A will work on Activity 1: Salamander Survival
  • Partner B will work on Activity 2: Wheat Yield Trials

Each person creates their own Quarto file:

  1. File → New File → Quarto Document
  2. Title: "Salamander - [Your Name]" or "Yield - [Your Name]"
  3. Save it as sim-salamander-[yourname].qmd or sim-wheat-yield-[yourname].qmd
  4. Commit with message: "Add Quarto file for [activity name]"
  5. Push to GitHub

Important: Pull Before You Push!

Before pushing, always pull first (git pull in the Terminal, or the ⬇️ Pull button in the Git tab). Your partner may have pushed changes. Get in the habit now: pull → work → commit → pull → push.
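You can rehearse this habit offline. A sketch using a local bare repository as a stand-in for GitHub (all directory names, user names, and file names here are illustrative; with a real remote you would clone your team's GitHub URL instead):

```shell
# Setup: create a stand-in "remote" with one commit (on GitHub, this already exists)
git init -b main seed && cd seed               # -b main needs git >= 2.28
git config user.name "Setup" && git config user.email "setup@example.com"
echo "# Team repo" > README.md
git add README.md && git commit -m "Initial commit"
cd .. && git clone --bare seed shared.git      # the bare repo plays the role of GitHub

# The habit itself: pull -> work -> commit -> pull -> push
git clone shared.git my-copy && cd my-copy
git config user.name "Partner A" && git config user.email "a@example.com"
git pull                                       # 1. pull before you start
echo "notes" >> analysis.md                    # 2. work
git add analysis.md
git commit -m "Add analysis notes"             # 3. commit with a clear message
git pull                                       # 4. pull again: your partner may have pushed
git push                                       # 5. push your commit
```

If your partner pushed while you were working, step 4 is where Git merges their changes into your copy before you push.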


2.4 Part 3: Simulation Exercises — Start Simple, No Transformations (≈45 min)

The Rule: Start Simple

We are deliberately keeping this simple today. I am providing the simulation code… Look at it, and figure out what I did differently than you did in Week 3.

This is what you will be doing here:

  1. Simulate data where you know the truth (you have the parameters, and I provided the code this time)
  2. Comment on the simulation code, explaining what each step is doing and why… or at least try to
  3. Fit the proposed (wrong) model/method from Week 3. You write the code and explain why it might be problematic
  4. Fit the correct model. You propose it, write the code, commit, and push
  5. Write it all up in your Quarto document with narrative explaining what you did and why
  6. Commit after each step with a meaningful commit message

This mirrors the real research process: you try an analysis, realize there’s a problem, and re-analyze correctly. Git tracks every step of that journey.


2.4.1 Activity 1: Salamander Survival (Partner A) — Station 1 from Week 3

Scenario: A researcher wants to know if canopy cover affects salamander survival in forest fragments. They marked 20 salamanders in each of 12 forest fragments and recorded whether each salamander survived (alive/dead) after one year. Canopy cover was measured as a percentage for each fragment.

The truth (what we’re simulating):

  • 12 forest fragments
  • 20 salamanders per fragment (240 total salamanders)
  • Canopy cover ranges from 40% to 90%
  • True effect of canopy cover on survival probability: for every 1% increase in canopy cover, the log-odds of survival increase by 0.09 (so a 10% increase adds 0.9 to the log-odds)
  • We’ll use a logistic relationship (since survival is binary), but don’t worry about the math - just run the code and comment what each line does

2.4.1.1 Step 1: Simulate the data and visualize (20 min)

Copy the code below into your Quarto document. Your job: add a comment above or next to every line explaining what it does. If you don’t understand a line, ask me.

library(tidyverse)
library(lme4)  # for mixed models later

set.seed(42)

# Define the experimental structure
n_fragments <- 12
n_salamanders_per_fragment <- 20
n_total <- n_fragments * n_salamanders_per_fragment

# Define the truth
true_intercept <- -3.5          
true_canopy_effect <- 0.09    

# Create fragment-level data
fragment_data <- data.frame(
  fragment_id = 1:n_fragments,
  canopy_cover = runif(n_fragments, min = 40, max = 90)
)

# Create salamander-level data (each salamander belongs to a fragment)
salamander_data <- data.frame(
  salamander_id = 1:n_total,
  fragment_id = rep(1:n_fragments, each = n_salamanders_per_fragment)
)

# Merge to get canopy cover for each salamander
salamander_data <- salamander_data %>%
  left_join(fragment_data, by = "fragment_id")

# Simulate survival using logistic regression formula
salamander_data <- salamander_data %>%
  mutate(
    log_odds = true_intercept + true_canopy_effect * canopy_cover,
    prob_survival = 1 / (1 + exp(-log_odds)),
    survived = rbinom(n_total, size = 1, prob = prob_survival)
  )

# Look at the data
head(salamander_data)
summary(salamander_data)

# Visualize: proportion surviving in each fragment
fragment_summary <- salamander_data %>%
  group_by(fragment_id, canopy_cover) %>%
  summarise(
    n_survived = sum(survived),
    n_total = n(),
    prop_survived = n_survived / n_total,
    .groups = "drop"
  )

ggplot(fragment_summary, aes(x = canopy_cover, y = prop_survived)) +
  geom_point(size = 4, alpha = 0.7) +
  labs(
    title = "Simulated: Canopy Cover vs. Salamander Survival",
    subtitle = "Each point = one forest fragment (20 salamanders each)",
    x = "Canopy Cover (%)",
    y = "Proportion Survived"
  ) +
  ylim(0, 1) +
  theme_minimal(base_size = 14)

📝 In your Quarto doc: Write a paragraph above the code chunk explaining:

  • What is the research question?
  • What is the experimental design? (How many fragments? How many salamanders per fragment?)
  • What is the response variable? What type of variable is it?
  • Why does this structure matter for the analysis?

📝 Commit message: "Simulate salamander survival data - 12 fragments, 20 sal per fragment"

2.4.1.2 Step 2: Fit the proposed (wrong) model from Week 3 (10 min)

In Week 3 Station 1, the proposed method was:

Logistic regression: glm(survived ~ canopy_cover, family = binomial). Use the salamander_data data frame, which has one row per salamander, with a binary survived column and a canopy_cover column.

This treats each salamander as an independent observation. But is that correct?

Your job:

  1. Fit this model using the code below (or write your own):
  2. In your Quarto document, answer these questions:
    • What does this model assume about the salamanders?
    • Are salamanders within the same fragment truly independent? Why or why not?
    • What is the unit of replication in this experiment — the salamander or the fragment?
    • What problem might arise if we ignore the fragment structure?
    • Key question: If salamanders in the same fragment are more similar to each other than to salamanders in other fragments (e.g., because of fragment-specific conditions), what happens to our standard errors and p-values?
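A sketch of the proposed model, if you want a starting point (the object name proposed_model is my choice):

```r
# Proposed (wrong) model: treats every salamander as an independent observation,
# ignoring that salamanders are grouped within fragments
proposed_model <- glm(survived ~ canopy_cover,
                      family = binomial,
                      data = salamander_data)
summary(proposed_model)
```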

📝 Commit message: "Fit proposed model (ignores fragment structure) - identify problem"

Push to GitHub.

2.4.1.3 Step 3: Fit the correct model (15 min)

The correct approach needs to account for the fact that salamanders are grouped within fragments. Salamanders in the same fragment are not independent — they share the same canopy cover, the same local predators, the same microclimate.

Your job:

  1. Propose a better model. Hints:

    • You need to tell the model that salamanders are nested within fragments
    • Look at the lme4 package (already loaded at the top)
    • The function you want is glmer() (generalized linear mixed-effects model)
    • The syntax is similar to glm(), but you add a random effect for fragment: (1 | fragment_id)
    • Example structure: glmer(response ~ predictor + (1 | grouping_variable), family = binomial, data = ...)
  2. Write the code to fit the correct model

  3. In your Quarto document, answer:

    • How did the estimate of the canopy cover effect change?
    • How did the standard error change? (Bigger? Smaller?)
    • Did the p-value change?
    • Why does accounting for fragment structure matter?
    • Which model do you trust more, and why?
  4. Replace the proposed model in your document — meaning, re-organize your .qmd so that the correct model comes first, and the wrong model is shown as “what not to do” or moved to an appendix section. This simulates the real research process: you realize you made a mistake, so you re-do the analysis correctly.

📝 Commit message: "Fit correct model with random effect for fragment - replace wrong approach"

Push to GitHub.

2.4.1.4 Step 4: Write your conclusions

Add a final section called “What I Learned” summarizing:

  • Why the proposed model was problematic (be specific about pseudoreplication)
  • Why the correct model is better
  • One sentence connecting this to the Method-Question-Data Triangle from Week 3 (hint: the question is about canopy cover, but the data have a nested structure — the method must match both)

📝 Commit message: "Add conclusions and connect to Week 3 concepts"

Push to GitHub.


2.4.2 Activity 2: Wheat Yield Trials (Partner B) — Station 2 from Week 3

Scenario: An agronomist wants to know which of 5 wheat varieties produces the highest yield. They use a randomized complete block design: 4 blocks (fields), with one plot per variety per block (20 plots total). Yield (kg/ha) is recorded once per plot.

The truth (what we’re simulating):

  • 5 wheat varieties
  • 4 blocks (to account for field variability)
  • True variety effects: Variety A is the baseline, B is +200 kg/ha better, C is +350 kg/ha better, D is +100 kg/ha better, and E is 50 kg/ha worse (-50)
  • True block effects: Block 1 is baseline, Block 2 is +150 kg/ha, Block 3 is -100 kg/ha, Block 4 is +200 kg/ha (because soil quality varies)
  • Noise SD = 120 kg/ha

2.4.2.1 Step 1: Simulate the data and visualize (20 min)

Copy the code below into your Quarto document. Your job: add a comment above or next to each line explaining what it does.

library(tidyverse)

set.seed(123)

# Define experimental structure
varieties <- c("A", "B", "C", "D", "E")
blocks <- c("Block1", "Block2", "Block3", "Block4")

# Define the truth
baseline_yield <- 3000  # kg/ha for variety A in block 1

variety_effects <- c(
  A = 0,
  B = 200,
  C = 350,
  D = 100,
  E = -50
)

block_effects <- c(
  Block1 = 0,
  Block2 = 150,
  Block3 = -100,
  Block4 = 200
)

noise_sd <- 120

# Create all combinations of variety and block (full factorial design)
wheat_data <- expand.grid(
  variety = varieties,
  block = blocks
)

# Add true yield based on variety and block effects
wheat_data <- wheat_data %>%
  mutate(
    variety_effect = variety_effects[variety],
    block_effect = block_effects[block],
    true_yield = baseline_yield + variety_effect + block_effect,
    yield = true_yield + rnorm(n(), mean = 0, sd = noise_sd)
  )

# Look at the data
head(wheat_data)
summary(wheat_data)

# Visualize
ggplot(wheat_data, aes(x = variety, y = yield, color = block)) +
  geom_point(size = 4, alpha = 0.7, position = position_dodge(width = 0.3)) +
  stat_summary(fun = mean, geom = "crossbar", width = 0.5, 
               color = "black", size = 0.3) +
  labs(
    title = "Simulated: Wheat Variety Yields Across Blocks",
    subtitle = "Each point = one plot; crossbar = variety mean across blocks",
    x = "Wheat Variety",
    y = "Yield (kg/ha)",
    color = "Block"
  ) +
  theme_minimal(base_size = 14)

📝 In your Quarto doc: Write a paragraph above the code chunk explaining:

  • What is the research question?
  • What is the experimental design? (What is a “block”? Why use blocks?)
  • What is the response variable?
  • Why does the block structure matter?

📝 Commit message: "Simulate wheat yield data - 5 varieties x 4 blocks"

2.4.2.2 Step 2: Fit the proposed (wrong) model from Week 3 (10 min)

In Week 3 Station 2, the proposed method was:

One-way ANOVA: aov(yield ~ variety)

This completely ignores the blocks. But blocks were part of the experimental design!

Your job:

  1. Fit this model: write the code to fit the one-way ANOVA that ignores blocks
  2. In your Quarto narrative, answer these questions:
    • What does this model assume about the blocks?
    • If Block 2 has naturally higher-quality soil (true block effect = +150 kg/ha), and we ignore blocks, what happens to the variety estimates?
    • What problem arises when we ignore a source of variation that we know exists?
    • Key question: The blocks were part of the experimental design. If we don’t account for them in the analysis, are we throwing away information? What does that do to our standard errors and our ability to detect real variety differences?

📝 Commit message: "Fit proposed model (ignores blocks) - identify problem"

Push to GitHub.

2.4.2.3 Step 3: Fit the correct model (15 min)

The correct approach is to include both variety AND block in the model. This is a two-way ANOVA (or a linear model with two categorical predictors).

Your job:

  1. Propose a better model. Hints:

    • Include block in the model alongside variety
    • The structure parallels the proposed model: aov(yield ~ variety + block)

  2. Write the code to fit the correct model

  3. In your Quarto document, answer:

    • How did the F-statistic and p-value for variety change?
    • How did the residual error (residual standard error) change? (Hint: look at summary(proposed_model) vs summary(correct_model))
    • Compare the means for each variety from both models — did they change? If so, why?
  4. Replace the proposed model in your document — re-organize your .qmd so the correct analysis comes first, and the wrong model is shown as “what not to do.”

📝 Commit message: "Fit correct model with block effect - replace wrong approach"

Push to GitHub.

2.4.2.4 Step 4: Write your conclusions

Add a final section called “What I Learned” summarizing:

  • Why the proposed model was problematic (be specific about ignoring the blocking structure)
  • Why the correct model is better
  • One sentence connecting this to the Method-Question-Data Triangle from Week 3 (hint: the question is about varieties, but the data were collected using a blocked design — the method must account for that)

📝 Commit message: "Add conclusions and connect to Week 3 concepts"

Push to GitHub.


What You Just Did

You both just completed the full research cycle that Git is designed to track:

  1. Simulate/collect data → commit
  2. Analyze using a proposed method → commit
  3. Realize the method is wrong → write about the problem, commit
  4. Re-analyze correctly → replace the wrong analysis, commit
  5. Write conclusions → commit

Your Git history now tells the complete story of your analytical decisions. In real research, this is invaluable: when a reviewer asks “did you try X?”, you can look at your commit history and say “yes, here’s what happened when we did X, and here’s why we switched to Y.”


2.5 Part 4: Review Your Partner’s Work + Reflection (≈10 min)

2.5.1 Pull and Read (5 min)

  1. Pull from GitHub to get your partner’s file
  2. Open their .qmd file and read through it
  3. Pay attention to:
    • Did they comment every line of the simulation code? Do the comments make sense?
    • Is their narrative clear — could you understand the analysis without them explaining it to you?
    • Do their conclusions connect back to Week 1–3 concepts?

Metacognitive Monitoring

Writing commit messages forces metacognitive monitoring — you have to pause and articulate what you just did and why (Tanner, 2012). This is the same skill that makes for good scientific writing: being explicit about your analytical choices. Over time, the habit of writing commit messages trains you to be a more reflective analyst.

2.5.2 Git Log Review (2 min)

In the Terminal, type:

git log --oneline

Look at the commit history. Each line is a snapshot of your analytical journey — and your partner’s. This is what version control gives you: a complete, narrated record of every decision both of you made.

2.5.3 Exit Ticket (2 min)

On a piece of paper, answer:

  1. One thing you now feel comfortable doing in Git (be specific — e.g., “I can commit and push from RStudio”)
  2. One thing you still want to practice (e.g., “I’m not sure how branches work”)
  3. One connection between today’s Git workshop and the statistical concepts from Weeks 1–3

Looking Ahead

Now that you have Git fundamentals down, future sessions will build on this. You’ll use version control for every hands-on activity going forward.