Tracking Your Analytical Decisions
2001-04-01
Topic: Version Control, Git & Collaboration
Today we connect how you think about data analysis to how you track your analytical decisions.
“Your analytical decisions need to be tracked, transparent, and reproducible - Git makes this possible.” This is, at its core, not just useful. It follows the iterative scientific process!
Last week was tough. Some things that made it complicated:
Simulations are hard → Generalized Linear Models
Working in large groups
Everyone coding separately, and trying things without a shared system for tracking changes
| Week | Topic | Key Takeaway |
|---|---|---|
| 1 | Introductions & Papers | Data exploration protects inference (Zuur); always explore before modeling |
| 2 | Philosophy of Data Analysis | “All models are wrong” (Box); find the useful model, not the true one. The best model also depends on your objectives (prediction vs. inference vs. exploration) |
| 3 | Statistical Modeling Framework | Method-Question-Data Triangle must align; start simple |
| 4 | Version Control & Collaboration | Track, share, and reproduce your analytical journey |
In Weeks 1–3, we talked about how to think about data analysis…
Version control answers all of these questions.
It can also be a very good tool to have. I have applied for jobs that specifically asked for it.
In the future, as more collaborative projects deal with complex models, version control will be a must-have skill.
AI is expanding the ability to code and do data analysis, which will increase potential collaboration → source control becomes even more important.
More importantly… if you are doing science, it should be reproducible.
By the end of the week, you will be able to:
We will collaborate
This will make it easier
No notes, no phones, no laptops
Struggling to recall is the point — it strengthens memory!
Debrief (5 min): Turn to a neighbor and compare answers.
Research shows that actively recalling information — even when it’s difficult — produces stronger long-term retention than re-reading notes (Roediger & Butler, 2011).
This is a “desirable difficulty.” If it felt hard, that’s good!
Quick Poll: Raise your hand if you’ve ever…
_v2, _v3, _FINAL files
Here is a scenario. Predict what will go wrong:
Your folder looks like this:
analysis.R
analysis_v2.R
analysis_FINAL.R
analysis_FINAL_v2.R
analysis_FINAL_ACTUALLY_FINAL.R
analysis_FINAL_ACTUALLY_FINAL_USE_THIS_ONE.R
Your advisor emails: “The reviewer wants to see the version where you used the Poisson model instead of the linear model.”
Write down: What happens next? Can you even find it?
Here is an alternative scenario. Predict what will go wrong:
You have a single file: analysis.R
Your advisor emails: “The reviewer wants to see the version where you used the Poisson model instead of the linear model.”
Write down: What happens next? Can you even find it?
I will now demonstrate how version control solves this problem.
Watch how Git lets us travel through time in our project…
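A minimal sketch of what this kind of demonstration could look like, run in a throwaway folder. The file contents, commit messages, and the name and email are placeholders made up for illustration, not the actual class demo. It records two versions of analysis.R, then retrieves the Poisson version by searching the commit messages.

```shell
# Sketch only: contents, messages, and identity below are placeholders.
set -e
demo_dir=$(mktemp -d)          # throwaway folder so nothing real is touched
cd "$demo_dir"
git init -q
git config user.name "Example Student"
git config user.email "student@example.com"

# First version of the analysis: a Poisson GLM.
echo 'fit <- glm(count ~ treatment, family = poisson, data = dat)' > analysis.R
git add analysis.R
git commit -q -m "Fit Poisson GLM to count data"

# Later version: the same file now holds a linear model.
echo 'fit <- lm(count ~ treatment, data = dat)' > analysis.R
git add analysis.R
git commit -q -m "Switch to linear model"

# The history lists every recorded version, newest first.
git log --oneline

# Find the commit whose message mentions Poisson, then print that version
# of the file. The working copy of analysis.R is left unchanged.
poisson_commit=$(git log --format=%H --grep="Poisson" -n 1)
git show "$poisson_commit:analysis.R"
```

Searching by message only works if the commit messages say what changed, which is one reason to write them carefully.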
As a class, answer:
If you’re iterating through models, you need a record of which wrong models you tried and why you moved on.
Asking “why?” and “how?” until you reach deep understanding.
With a partner, take turns asking “why?” about this statement. Go at least four levels deep:
“Researchers should use version control for their analyses.”
Each pair picks two statements.
Work through the why? and how? chain for each one.
“Data exploration should protect inference, not create it.” (Zuur, Week 1)
“All models are wrong, but some are useful.” (Box, Week 2)
“Start simple. Add complexity only when needed. Always justify your choices.” (Golden Rule, Week 3)
“Misalignment between method, question, and data leads to wrong conclusions, wasted effort, rejected papers, and sad grad students.” (Week 3)
“Exploration and inference should come from independent studies.” (Tredennick, Week 1)
Share Out (2–3 min): 3–4 pairs share their best sentence.
A system that records changes to files over time
You can recall any previous version at any time
Think of it as Track Changes for your entire project — but much more powerful
| Concept | What It Is | Analogy |
|---|---|---|
| Repository (repo) | A project folder tracked by Git | Your lab notebook for a project |
| Commit | A snapshot of your files at a point in time | An entry in your lab notebook |
| Staging (git add) | Choosing which changes to include in the next commit | Deciding which observations to write up |
| Branch | A parallel version of your project | An exploratory side-analysis |
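The first three concepts in the table can be sketched as a few commands in a throwaway folder. The file names, contents, and identity are hypothetical. The point to notice is that staging lets you choose which changes go into the snapshot: two files exist, but only one is committed.

```shell
# Sketch only: file names, contents, and identity are placeholders.
set -e
demo_dir=$(mktemp -d)
cd "$demo_dir"

# Repository: turn the folder into a project tracked by Git.
git init -q
git config user.name "Example Student"
git config user.email "student@example.com"

# Two files appear in the working directory.
echo 'dat <- read.csv("counts.csv")' > analysis.R
echo 'scratch ideas, not ready to record' > notes.md

# Staging: choose only the change that belongs in the next snapshot.
git add analysis.R
git status --short      # analysis.R is staged, notes.md is still untracked

# Commit: record the staged snapshot with a message saying what was done.
git commit -q -m "Load count data"

git log --oneline       # one entry in the project's "lab notebook"
git status --short      # notes.md is still waiting, unrecorded
```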
| Concept | What It Is | Analogy |
|---|---|---|
| Merge | Combining two branches | Integrating exploratory findings into main analysis |
| Pull Request (PR) | A request to merge your branch + review | Asking a collaborator to check your work |
| Clone | Downloading a copy of a remote repository | Getting a copy of a shared lab notebook |
| Push / Pull | Sending/receiving updates to/from GitHub | Syncing your local notebook with the shared one |
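Branching and merging can be tried locally without GitHub. The sketch below uses a hypothetical branch name (try-poisson) and made-up file contents. Clone, push, and pull are left out because they need a remote repository, and on GitHub the merge step would usually go through a pull request instead of a local command.

```shell
# Sketch only: branch name, contents, and identity are placeholders.
set -e
demo_dir=$(mktemp -d)
cd "$demo_dir"
git init -q
git config user.name "Example Student"
git config user.email "student@example.com"

# Main analysis: a linear model.
echo 'fit <- lm(count ~ treatment, data = dat)' > analysis.R
git add analysis.R
git commit -q -m "Fit linear model"
git branch -M main                 # name the default branch "main"

# Branch: an exploratory side-analysis that does not disturb main.
git checkout -q -b try-poisson
echo 'fit <- glm(count ~ treatment, family = poisson, data = dat)' > analysis.R
git commit -q -am "Try Poisson GLM for count response"

# Back on main, the file still holds the linear model.
git checkout -q main
cat analysis.R

# Merge: bring the exploratory work into the main analysis.
git merge -q try-poisson
cat analysis.R
```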
 Your Computer (Local)                 GitHub (Remote)
┌─────────────────────┐               ┌──────────────────┐
│  Working Directory  │   git push    │                  │
│  (your files)       │  ──────────►  │   Remote Repo    │
│         │           │               │   (GitHub)       │
│      git add        │   git pull    │                  │
│         ▼           │  ◄──────────  │                  │
│  Staging Area       │               └──────────────────┘
│         │           │
│     git commit      │
│         ▼           │
│  Local Repository   │
│  (commit history)   │
└─────────────────────┘
Stage (git add) the changes you want to record
Git can track any file, but it works best with plain-text files:
.qmd, .R, .csv (small), .bib, .tex, .md
For large data → .gitignore and Git Large File Storage (LFS)
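As a small illustration of the .gitignore half of that advice, the sketch below assumes a hypothetical data_raw/ folder holding a large file. Git LFS is not shown because it is a separate extension that has to be installed first.

```shell
# Sketch only: the folder and file names are placeholders.
set -e
demo_dir=$(mktemp -d)
cd "$demo_dir"
git init -q

# A large raw data folder next to a small script.
mkdir data_raw
echo 'stand-in for a large data file' > data_raw/big_survey.csv
echo 'dat <- read.csv("data_raw/big_survey.csv")' > analysis.R

# Tell Git to skip everything inside data_raw/.
echo 'data_raw/' > .gitignore

# Staging "everything" now leaves the ignored folder out.
git add .
git status --short      # lists .gitignore and analysis.R, not the data
```

The .gitignore file itself is committed, so collaborators get the same rule.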
| Git | GitHub |
|---|---|
| The version control system | A website that hosts Git repositories |
| Runs on your computer | Runs in the cloud |
| Tracks history locally | Lets you share, collaborate, and back up |
| Free, open source | Free for public repos; education accounts get extras |
Git is the engine. GitHub is the garage where everyone parks.
Now that you’ve heard the lecture, build a concept map from memory.
On a blank piece of paper, create a concept map using these terms. Draw circles and arrows. Label arrows with verbs.
Terms: Repository, Commit, Branch, Merge, Pull Request, Clone, Push, Pull, Staging Area, Working Directory, Remote (GitHub)
Activity: Hands-On, Your First Repo (15 min)
Once you are done… start working on this
Who here has a GitHub account?
Sign up for an educational account
| What We Covered | Key Takeaway |
|---|---|
| Why version control? | File-naming fails; Git preserves what, when, and why |
| What is Git? | A system that snapshots your project over time |
| Git ≠ GitHub | Git is the engine; GitHub is the cloud platform |
| Key workflow | edit → add → commit → push |
| Connection to research | Track model iterations, exploration vs. inference, analytical decisions |