Tick Phenology & Prescribed Burning: GEE for Repeated-Measures Count Data

SNR 690 - Contemporary Approaches to Quantitative Methods

Initial question

“Why not just use a mixed model?”

GLMMs are usually the default for repeated measures, right?

The main difference between GLMM and GEE is interpretation

  • Use GLMM when you want to make inference about a specific cluster (or individual) (e.g., plot, lake, animal) → the conditional mean.
  • Use GEE when you want to make inference about the population-average effect across all clusters → the marginal mean.

Marginal vs Conditional Inference/Mean

  • Marginal mean: \(E[Y]\) - the average response across the entire population of clusters (plots). This is what GEE estimates.
  • The marginal mean is the overall expected value of a variable by itself, averaging over (i.e., ignoring) other variables.
  • Interpretation: “What is the average value of Y in the whole population/sample?”
  • Example: If Y is income, the marginal mean is the average income across everyone, regardless of age, education, etc.

Marginal vs Conditional Inference/Mean

  • Conditional mean: \(E[Y \mid \mathbf{b}]\) - the expected response for a specific cluster (plot) given its random effect. This is what GLMM estimates.
  • The conditional mean is the expected value of a variable given specific values of other variables (e.g., random effects).
  • Notation: \(E[Y \mid \mathbf{b}]\) means the expected value of Y given the random effects b.
  • Interpretation: “What is the expected value of Y for a specific cluster (plot) given its unique characteristics (random effect)?”
  • Example: \(E[Y \mid \text{education} = \text{college}]\) is the expected income given a college education; in a GLMM, the conditioning variable is the cluster’s random effect \(b_i\).

Marginal means we are averaging over the random effects

It is a weighted average of the conditional means across all clusters, where the weights are determined by the distribution of the random effects:

\[E[Y] = \int E[Y \mid \mathbf{b}]\, f(\mathbf{b})\, d\mathbf{b}\]

::: {.notes}
Expected Y equals the integral of expected Y given b times f(b) db
:::

  • “The overall mean of \(Y\) is the average of the conditional mean of \(Y\), averaged over how likely each \(\mathbf{b}\) value is.”
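To make the conditional/marginal gap concrete, here is a minimal base-R simulation sketch. It assumes a Poisson GLMM with a random intercept \(b \sim N(0, \sigma^2)\) and a log link; the intercept and SD below are made-up values, not estimates from the paper.

```r
# Marginal vs conditional mean under a log link (simulation sketch).
set.seed(1)
beta0 <- 1.0                      # fixed intercept (log scale), made up
sigma <- 0.8                      # SD of random cluster intercepts, made up
b     <- rnorm(1e5, 0, sigma)     # simulate many clusters' random effects

conditional_mean <- exp(beta0)             # E[Y | b = 0]: a "typical" cluster
marginal_mean    <- mean(exp(beta0 + b))   # E[Y]: average over the b distribution

# With a log link, averaging over b inflates the mean:
# E[Y] = exp(beta0 + sigma^2 / 2) > exp(beta0)
c(conditional = conditional_mean,
  marginal    = marginal_mean,
  theory      = exp(beta0 + sigma^2 / 2))
```

The marginal mean exceeds the conditional mean of a typical cluster, which is exactly why GEE and GLMM coefficients differ under nonlinear links.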

Why GEE? Clustered Data & Marginal Inference

The data structure problem:

  • 21 plots, each measured 24 times (monthly, 2010–2011)
  • Tick counts within a plot are correlated (not independent)
  • Ordinary GLM assumes independence → inflated Type I error
  • GLMM models this correlation with random effects → estimates conditional mean
  • GEE models this correlation with a working covariance → estimates marginal mean

Two solutions:

| Approach | What it models | Interpretation |
|----------|----------------|----------------|
| GLMM | Conditional mean \(E[Y \mid \mathbf{b}]\) | Effect within a specific plot |
| GEE | Marginal mean \(E[Y]\) | Effect averaged across all plots |

For policy-relevant questions - “Does burning reduce ticks on average?” - GEE is a better choice.

The GEE Estimating Equation

\[ \sum_{i=1}^{K} \mathbf{D}_i^\top \mathbf{V}_i^{-1}\left(\mathbf{Y}_i - \boldsymbol{\mu}_i\right) = \mathbf{0} \]

| Symbol | Meaning |
|--------|---------|
| \(K = 21\) | Number of plots (clusters) |
| \(\mathbf{Y}_i\) | Observed tick counts for plot \(i\) (length 24) |
| \(\boldsymbol{\mu}_i = E[\mathbf{Y}_i \mid \mathbf{X}_i]\) | Predicted marginal means |
| \(\mathbf{D}_i = \partial \boldsymbol{\mu}_i / \partial \boldsymbol{\beta}\) | Derivative of mean w.r.t. parameters |
| \(\mathbf{V}_i = \phi\,\mathbf{A}_i^{1/2} R(\alpha) \mathbf{A}_i^{1/2}\) | Working covariance |
| \(R(\alpha)\) | Working correlation matrix |
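As a concrete sketch, the working covariance \(\mathbf{V}_i = \phi\,\mathbf{A}_i^{1/2} R(\alpha) \mathbf{A}_i^{1/2}\) can be assembled by hand for one small cluster. The means, scale, and correlation below are invented for illustration (Poisson variance function, exchangeable correlation).

```r
# Assembling V_i for one plot with 3 hypothetical monthly observations.
mu    <- c(2, 5, 9)   # hypothetical monthly means for one plot
phi   <- 1.4          # scale (overdispersion) parameter, made up
alpha <- 0.3          # working exchangeable correlation, made up

A_half <- diag(sqrt(mu))                    # A_i^{1/2}: sqrt of variance function
R      <- matrix(alpha, 3, 3); diag(R) <- 1 # R(alpha): exchangeable correlation
V      <- phi * A_half %*% R %*% A_half     # V_i = phi * A^{1/2} R A^{1/2}
V                                           # diagonal = phi * mu
```

The diagonal of \(\mathbf{V}_i\) is \(\phi \mu_{it}\) (overdispersed Poisson variance); the off-diagonals carry the working correlation.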

Robust standard errors with the sandwich estimator

  • Robust SEs give us good uncertainty estimates even when the working correlation is wrong. This is critical because we rarely know the true correlation structure.

  • We usually assume the variance has a correct structure:

  • Normal: \(\text{Var}(Y_{it}) = \sigma^2\)

  • Poisson: \(\text{Var}(Y_{it}) = \lambda_{it}\)

  • Negative binomial: \(\text{Var}(Y_{it}) = \mu_{it} + \frac{\mu_{it}^2}{k}\)

  • What happens if those variance assumptions are violated?

  • What happens to our estimates?

  • What happens to our standard errors - and to the decisions we base on them?

Sandwich (Robust) Variance

\[ \widehat{\text{Var}}(\hat{\boldsymbol{\beta}}) = \mathbf{B}^{-1} \mathbf{M} \mathbf{B}^{-1} \]

  • \(\mathbf{B} = \displaystyle\sum_i \mathbf{D}_i^\top \mathbf{V}_i^{-1} \mathbf{D}_i\)   (the bread -> model-based information)
  • \(\mathbf{M} = \displaystyle\sum_i \mathbf{D}_i^\top \mathbf{V}_i^{-1} \widehat{\text{Cov}}(\mathbf{Y}_i) \mathbf{V}_i^{-1} \mathbf{D}_i\)   (the meat -> empirical)
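The bread-meat-bread structure can be seen numerically by comparing model-based and cluster-robust SEs for a Poisson GLM on simulated clustered counts. This sketch uses `vcovCL()` from the `sandwich` package; all data here are simulated, not from the paper.

```r
# Model-based vs cluster-robust ("sandwich") SEs on simulated clustered counts.
library(sandwich)
set.seed(2)
dat   <- data.frame(id = rep(1:30, each = 8), x = rnorm(240))
dat$b <- rep(rnorm(30, 0, 0.5), each = 8)   # shared cluster effect -> correlation
dat$y <- rpois(240, exp(0.5 + 0.3 * dat$x + dat$b))

fit <- glm(y ~ x, family = poisson, data = dat)
se_model  <- sqrt(diag(vcov(fit)))                      # bread only: assumes independence
se_robust <- sqrt(diag(vcovCL(fit, cluster = dat$id)))  # bread-meat-bread sandwich
rbind(model_based = se_model, robust = se_robust)       # robust SEs reflect clustering
```

With positive within-cluster correlation, the model-based SEs understate the true uncertainty; the sandwich SEs correct this.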

🥖 Bread (Model-Based)

  • Comes from the assumed model
  • Uses design matrix (X) and weights (W)
  • Represents: > “How precise estimates would be if the model were correct”

🥩 Meat (Data-Based)

  • Comes from observed residual variability
  • Uses cluster-level score contributions \(\mathbf{D}_i^\top \mathbf{V}_i^{-1} (\mathbf{Y}_i - \boldsymbol{\mu}_i)\)
  • Represents: > “How much variability we actually saw in the data”

🧠 Intuition

  • Bread = theory
  • Meat = reality
  • Sandwich = robust standard errors

Key insight: the model gives the structure; the data give the variability. The sandwich estimator combines both to produce robust standard errors.

When to use?

  • The choice depends not only on the type of mean structure (marginal vs conditional) but also on the number of clusters.
  • The sandwich estimator generally needs many clusters to perform well; a common rule of thumb is at least 40. With fewer clusters, it can be biased and produce anti-conservative (too narrow) confidence intervals.

Warning

With only \(K = 21\) clusters, the sandwich estimator may be anti-conservative (too narrow CIs). Small-sample corrections (e.g., std.err = "jack" in geepack) are advisable.
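A minimal geepack sketch of requesting the jackknife correction mentioned above. The data are simulated placeholders (21 clusters of 6 observations) purely to make the call concrete.

```r
# Jackknife SEs in geepack for small K (simulated placeholder data).
library(geepack)
set.seed(3)
dat   <- data.frame(id = rep(1:21, each = 6), x = rnorm(126))
dat$y <- rpois(126, exp(0.4 + 0.2 * dat$x))

fit_jack <- geeglm(
  y ~ x,
  id      = id,
  data    = dat,
  family  = poisson(link = "log"),
  corstr  = "exchangeable",
  std.err = "jack"        # approximate jackknife instead of the plain sandwich
)
summary(fit_jack)
```

geepack also offers `"j1s"` and `"fij"` as alternative small-sample variance estimators.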

The Paper’s Model Specification

Note

Note: Gleim et al. (2014) do not provide an explicit model equation. The following is an inferred, defensible specification based on their Methods section (p. 3).

\[ \log\!\bigl(E[\text{ticks}_{it}]\bigr) = \beta_0 + \beta_1\,\text{BurnRegime}_i + \beta_2\,\text{HostAbundance}_{it} + \beta_3\,\text{Canopy}_{it} + \beta_4\,\text{Season}_{it} + \cdots \]

  • Family: Negative binomial (overdispersed counts; 47,185 total; 99% A. americanum)
  • Link: Log (counts → multiplicative effects on tick abundance)
  • Correlation: Exchangeable (equal correlation among all 24 monthly observations per plot)
  • Clusters: 21 plots (7.2–78.9 ha); ≥10 years of consistent burn management
  • Predictors: Burn regime (BB, BUB, UBB, UBUB), host abundance (cameras), canopy closure, tree density, ground cover, temperature, precipitation
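A hedged sketch of how this inferred specification might be run. geepack has no native negative binomial family, so a common workaround is a Poisson GEE whose estimated scale parameter absorbs overdispersion. All variable names and data below are simulated placeholders, not the paper's data.

```r
# Sketch of the inferred specification (simulated placeholder data).
library(geepack)
set.seed(4)
tick_dat <- data.frame(
  plot        = rep(1:21, each = 24),   # 21 plots x 24 monthly visits
  burn_regime = rep(sample(c("BB", "BUB", "UBB", "UBUB"), 21, replace = TRUE),
                    each = 24),
  canopy      = runif(21 * 24, 20, 95),
  season      = factor(rep(rep(c("winter", "spring", "summer", "fall"),
                               each = 3), 42))
)
tick_dat$ticks <- rpois(nrow(tick_dat), 3)

fit_ticks <- geeglm(
  ticks ~ burn_regime + canopy + season,
  id     = plot,
  data   = tick_dat,
  family = poisson(link = "log"),   # scale parameter soaks up overdispersion
  corstr = "exchangeable"           # as described in the paper's Methods
)
summary(fit_ticks)
```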

Working Correlation Structures

| Structure | Formula | Assumption | Best for… |
|-----------|---------|------------|-----------|
| Independence | \(R = I\) | Zero correlation | Sensitivity check; large \(K\) |
| Exchangeable | \(\text{Corr}(Y_{it}, Y_{is}) = \alpha\) | Equal correlation at all lags | Short series; random-effects analogy |
| AR(1) | \(\text{Corr}(Y_{it}, Y_{is}) = \alpha^{|t-s|}\) | Correlation decays with lag | Time series; seasonal ecological data |

For monthly tick counts over 2 years, AR(1) is more ecologically realistic than exchangeable - but GEE estimates are consistent under either.

Model selection: Use QIC (quasi-likelihood under independence model criterion) to compare working correlation structures.
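A sketch of that comparison using geepack's `QIC()` (lower is better). The data are simulated placeholders, ordered by `id` and then `time` as AR(1) requires.

```r
# Comparing working correlation structures with QIC (simulated data).
library(geepack)
set.seed(5)
dat   <- data.frame(id = rep(1:21, each = 24), time = rep(1:24, 21),
                    x = rnorm(504))
dat$y <- rpois(504, exp(0.5 + 0.2 * dat$x))

fit_ind  <- geeglm(y ~ x + time, id = id, data = dat,
                   family = poisson, corstr = "independence")
fit_exch <- geeglm(y ~ x + time, id = id, data = dat,
                   family = poisson, corstr = "exchangeable")
fit_ar1  <- geeglm(y ~ x + time, id = id, data = dat,
                   family = poisson, corstr = "ar1")

QIC(fit_ind); QIC(fit_exch); QIC(fit_ar1)   # prefer the structure with lowest QIC
```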

How to run it?

In R, you can use the geepack package:

library(geepack)

fit_gee <- geeglm(
  y ~ x1 + x2 + time,
  id      = subject_id,            # cluster / repeated unit
  data    = dat,                   # rows must be ordered by id, then time, for AR(1)
  family  = poisson(link = "log"), # count outcome; use binomial(link = "logit") for 0/1
  corstr  = "ar1"                  # working AR(1) correlation
)

summary(fit_gee)

Group Worksheet Critique - 0:55–1:05 (10 min)

Format: Small groups (2–3 students)

Instructions: 1. Each group writes 3 specific critiques of the paper, as if you were peer reviewers

Before You Leave

On an index card:

1. What is the single most important thing you learned about GEE today?

2. What is one question you still have about GEE, or how might you apply GEE to your own research data?

Be specific. “I learned GEE exists” is not an acceptable answer for question 1.