I'm working on an analysis evaluating environmental and management factors that influence conifer regeneration after removal. The response variable is % tree cover from satellite fractional cover estimates (eventually I'll include other categories of vegetation cover; crossing that bridge later). Explanatory variables include conifer removal method (categorical), years since treatment, aspect (categorical), slope, elevation, pre-treatment tree cover, average annual precipitation, etc. at each pixel (12 possible explanatory variables). I'm also including a random effect of pixel "ID" to account for repeated measures. There are ~650,000 rows of data across years, covering ~250,000 pixels.

Because the response variable is a proportion, it seems like a beta distribution is the way to go? However, there are a lot of zeros (i.e., 0% tree cover). An initial Google search says I should use a zero-inflated beta regression, but the more I dig into it, the more I'm seeing things that suggest that's not necessarily the case (e.g., R: GLMM for unbalanced zero-inflated data (glmmTMB)). If zero-inflation is not the way to go, is it appropriate to "squeeze"/add a small constant to the zero values, or do I need to look into a different distribution?
Choices:
- zero-inflated Beta. You haven't been specific about what you've read that "suggest[s] that's not necessarily the case"; the Stack Exchange post you link to in your question is about zero-inflated Poisson data, which is conceptually very different from the zero-inflated Beta. In the Poisson case, sampling zeros are possible (i.e. the original distribution can give rise to zero outcomes), which may or may not be augmented by structural zeros. For the Beta distribution, the probability density at zero is infinite (if the $\alpha$/`shape1` parameter is < 1) or zero (if $\alpha > 1$), and in either case an *exact* zero occurs with probability zero under a continuous distribution, so it's better to treat any zeros as arising from a separate process.
- "squeezing". The canonical reference for this is Smithson and Verkuilen, but a very thorough blog post by Robert Kubinec points out that their approach ("squeeze" $y$ to $(y \cdot (N-1) + 0.5)/N$) gets weird for large sample sizes. (You could also choose to squeeze by some other ad hoc constant; I believe Smithson and Verkuilen discuss some possible choices.)
- Kubinec proposes the ordered beta distribution, which is implemented in the `ordbetareg` package (uses Stan to do Hamiltonian Monte Carlo sampling, which could be slow for a large data set!) and in the `glmmTMB` package (frequentist/MLE).
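To see concretely why the Smithson–Verkuilen squeeze "gets weird" for large $N$, here's a small numeric sketch (in Python for illustration; the actual model fitting would be in R). The function and the sample sizes are just examples:

```python
def squeeze(y, n):
    """Smithson & Verkuilen (2006) transform: map [0, 1] into (0, 1)."""
    return (y * (n - 1) + 0.5) / n

# With N = 100, zeros map to 0.005 -- comfortably inside (0, 1).
print(squeeze(0.0, 100))      # 0.005
# With N = 650_000 (roughly the size of this data set), zeros map to
# ~7.7e-7 -- essentially back on the boundary, where the Beta
# log-likelihood is extremely sensitive to the exact value chosen.
print(squeeze(0.0, 650_000))
```

In other words, the larger the data set, the closer the "squeezed" zeros sit to the boundary, and the more the fitted Beta parameters are driven by an arbitrary transformation constant rather than by the data.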
I would say the decision between zero-inflation and the ordered-beta distribution would come down to whether you think the zeros represent a separate set of ecological processes, or whether they are essentially a kind of censoring (small cover values end up being estimated as zero).
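To make the point about the Beta density at zero concrete: a continuous Beta assigns probability zero to exact zeros whatever its parameters, and the density at the boundary either blows up or vanishes depending on `shape1`. A quick illustrative check with `scipy.stats.beta` (Python here purely for illustration; parameter values are arbitrary examples):

```python
from scipy.stats import beta

# shape1 (alpha) < 1: density diverges as y -> 0
print(beta.pdf(1e-6, 0.5, 5.0))   # very large
# shape1 (alpha) > 1: density is exactly 0 at y = 0
print(beta.pdf(0.0, 2.0, 5.0))    # 0.0
# Either way, P(Y = 0) = 0 for a continuous Beta, so observed exact
# zeros need their own model component (zero-inflation, a hurdle, or
# the ordered-beta boundary categories).
print(beta.cdf(0.0, 0.5, 5.0))    # 0.0
```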
I'd definitely recommend reading Kubinec's blog post linked above.
Kubinec, Robert. “Ordered Beta Regression: A Parsimonious, Well-Fitting Model for Continuous Data with Lower and Upper Bounds.” Political Analysis, Cambridge University Press, July 27, 2022, 1–18. https://doi.org/10.1017/pan.2022.20.
Smithson, Michael, and Jay Verkuilen. “A Better Lemon Squeezer? Maximum-Likelihood Regression with Beta-Distributed Dependent Variables.” Psychological Methods 11, no. 1 (2006): 54–71. https://doi.org/10.1037/1082-989X.11.1.54.
$\begingroup$ Thanks, this is very helpful. Added a link to another Stack Exchange question that made me realize I needed to do a deeper dive into the literature. $\endgroup$ – TK_montana, 2025-12-06