Delimiting the modelling background for scattered uneven occurrence data

In species distribution modelling and ecological niche modelling (SDM & ENM), the region from where background or pseudoabsence points are picked is key to how well a model turns out. This region should include sufficient localities for the model to assess the species’ (non-)preferences, but it should also be within the species’ reach AND reasonably evenly surveyed for it. Survey bias within the background or the pseudoabsence region can seriously reduce model quality.

Yet, real-life occurrence data (especially at the wide scale of species distributions) often are strongly biased, and it’s rarely straightforward to define an adequate region around the existing points. Very nice criteria exist that work well in theory, but not under most actual scenarios of biodiversity data bias. Until recently, my preferred approach was a buffer around the known occurrences, whose radius was e.g. the mean pairwise distance between them (as a rough measure of their spatial spread, or the spatial reach of the species), or half the width of the area defined by the occurrences (usually better for elongated distributions or surveys):

# download some example occurrence records:

occs <- geodata::sp_occurrence("Rupicapra", "pyrenaica", args = c("year=2024"))


# map occurrence records:

occs <- terra::vect(occs, geom = c("lon", "lat"), crs = "EPSG:4326")
terra::plot(occs, cex = 0.2, ext = terra::ext(occs) + 4)
cntry <- geodata::world(path = tempdir())
terra::plot(cntry, lwd = 0.2, add = TRUE)
# note you should normally clean these records from errors before moving on!


# compute mean distance and width of occurrence records:

occs_mdist <- mean(terra::distance(occs))
occs_width <- terra::width(terra::aggregate(occs))


# compute buffers around occurrences using both metrics:

occs_buff_d <- terra::aggregate(terra::buffer(occs, width = occs_mdist))
occs_buff_w <- terra::aggregate(terra::buffer(occs, width = occs_width / 2))


# plot both buffers and occurrence records:

par(mfrow = c(1, 2))
terra::plot(occs_buff_d, col = "yellow", border = "orange", lwd = 3, ext = terra::ext(occs) + 4, main = "Mean distance")
terra::plot(cntry, lwd = 0.2, add = TRUE)
terra::plot(occs, cex = 0.2, add = TRUE)
terra::plot(occs_buff_w, col = "yellow", border = "orange", lwd = 3, ext = terra::ext(occs) + 4, main = "Half width")
terra::plot(cntry, lwd = 0.2, add = TRUE)
terra::plot(occs, cex = 0.2, add = TRUE)
Image

This may work well for relatively evenly surveyed species or regions, where we can reasonably assume that most of the buffered area has been (roughly evenly) surveyed, and that most places without presence are likely absences. However, many species have isolated records in remote understudied areas, with survey effort hugely imbalanced across the distribution range. So, a buffer with a uniform radius will not work well:

occs <- geodata::sp_occurrence("Lutra", "lutra", args = c("year=2024"))

# everything else same code as above...
Image

You can use smaller multipliers for distance and width, but these uniform-radius buffers will always include large regions with only scarce and isolated occurrence records, where it cannot be assumed that absence of records mostly reflects absence of the species. Normally I would terra::erase() from the buffer the regions that look insufficiently surveyed, but that implies discarding potentially valuable occurrence data.

So, my latest approach consists of finding clusters of occurrence points that are within a given distance of one another; and then defining a varying buffer radius of, e.g., the width or the mean pairwise distance among the points in each cluster, down-weighted by the distance to other points or clusters (as an indicator of isolated observations in under-surveyed areas). This way, the modelling background will avoid areas that were most likely not surveyed, yet without discarding the occurrence records from those areas. I’ve implemented several variations of this and related approaches in the new getRegion() function of R package fuzzySim (version 4.26):

# try regions with two different methods:

reg1 <- fuzzySim::getRegion(occs, type = "inv_dist")
terra::plot(cntry, lwd = 0.2, add = TRUE)
terra::plot(reg1, lwd = 4, border = "orange", add = TRUE)

reg2 <- fuzzySim::getRegion(occs, type = "clust_width", width_mult = 0.5)
terra::plot(cntry, lwd = 0.2, add = TRUE)
terra::plot(reg2, lwd = 4, border = "orange", add = TRUE)
Image

Read the function help file for all the different options available, and try them out to see what feels best for your study system. Feedback and bug reports welcome!

Range change based on continuous (fuzzy) values

This function, included in the fuzzySim package (version 1.6 just released), uses the fuzzyOverlay function in the previous post and then quantifies overall range change (including gain, loss, maintenance/stability and total change) based on the continuous predictions of two species distribution models (see Gutiérrez-Rodríguez et al., in press).

fuzzyRangeChange <- function(pred1, pred2, number = TRUE, prop = TRUE, na.rm = TRUE, round.digits = 2, measures = c("Gain", "Loss", "Stable_presence", "Stable_absence", "Balance"), plot = TRUE, col = colorRampPalette(c("white", "black"))(length(measures)), ...) {
 
  # version 1.4 (23 Mar 2016)
 
  stopifnot(ncol(pred1) == ncol(pred2),
            all(pred1[is.finite(pred1)] >= 0 && pred1[is.finite(pred1)] <= 1),
            all(pred2[is.finite(pred2)] >= 0 && pred2[is.finite(pred2)] <= 1)
  )
 
  if (!number & !prop) stop ("Nothing to calculate if both 'number' and 'prop' are FALSE.")
 
  values <- vector("numeric", length(measures))
  names(values) <- measures
  if ("Gain" %in% measures)  values["Gain"] <- sum(fuzzyOverlay(data.frame(pred1, pred2), op = "expansion", na.rm = na.rm), na.rm = na.rm)
  if ("Loss" %in% measures)  values["Loss"] <- sum(fuzzyOverlay(data.frame(pred1, pred2), op = "contraction", na.rm = na.rm), na.rm = na.rm)
  if ("Stable_presence" %in% measures)  values["Stable_presence"] <- sum(fuzzyOverlay(data.frame(pred1, pred2), op = "maintenance", na.rm = na.rm), na.rm = na.rm)
  if ("Stable_absence" %in% measures)  values["Stable_absence"] <- sum(fuzzyOverlay(1 - data.frame(pred1, pred2), op = "maintenance", na.rm = na.rm), na.rm = na.rm)
  if ("Balance" %in% measures)  values["Balance"] <- sum(fuzzyOverlay(data.frame(pred1, pred2), op = "change", na.rm = na.rm), na.rm = na.rm)
 
  result <- data.frame(Measure = measures, Number = values)
 
  if (prop) {
    if (na.rm) n <- length(na.omit(pred1))
    else n <- length(pred1)
    range.size <- sum(pred1, na.rm = na.rm)
    stable.abs <- result[result$Measure == "Stable_absence", "Number"]
    result$Proportion <- result[ , "Number"] / range.size
    result[result$Measure == "Stable_absence", "Proportion"] <- stable.abs / (n - range.size)
  }
 
  if (!number) {
    result <- result[ , - (ncol(result) - 1)]
  }
 
  if (plot) {
    barplot(result[ , ncol(result)], legend.text = rownames(result), col = col, ...)
    abline(h = 0)
  }
 
  result[ , "Measure"] <- NULL
  if (is.finite(round.digits))  result <- round(result, round.digits)
 
  result
}

[presented with Pretty R]

The function will produce numeric and (since FuzzySim 1.7.2) also graphical results quantifying the fuzzy overall changes:

FuzzyRangeChange

You can install and load fuzzySim (>= 1.6) and then check help(fuzzyRangeChange) for further  info and reproducible usage examples.

 

REFERENCES

Gutiérrez-Rodríguez J., Barbosa A.M. & Martínez-Solano I. (in press) Present and past climatic effects on the current distribution and genetic diversity of the Iberian spadefoot toad (Pelobates cultripes): an integrative approach. Journal of Biogeography.

Select among correlated variables based on a given criterion

Correlations among variables are problematic in multivariate models, as they inflate the variance of coefficients and thus may bias the interpretation of the effects of those variables on the response. One of the strategies to circumvent this problem is to eliminate a priori one from each pair of correlated variables, but it is not always straightforward to pick the right variable among these. The corSelect function selects among correlated variables based either on their (stepwise) variance inflation factor (VIF); or on their relationship with the response variable, by building a bivariate model of each individual variable against the response and choosing, among each of two correlated variables, the one with the best p-value, AIC or BIC.

This function is now included in fuzzySim and depends on other functions in the package, such as FDR and multicol. It is also included as a new option within function multGLM in this package, so that multiple models can be built based on uncorrelated variables chosen appropriately for each species.

corSelect <- function(data, sp.cols = NULL, var.cols, cor.thresh = 0.8, select = "p.value", ...) {

# version 1.6 (6 Jul 2017)

if (length(sp.cols) > 1) stop ("Sorry, 'corSelect' is currently implemented for only one 'sp.col' at a time.")

univar.criteria <- c("VIF")
 bivar.criteria <- c("p.value", "AIC", "BIC")

if (!(select %in% c(univar.criteria, bivar.criteria))) stop ("Invalid 'select' criterion.")

if (!is.null(sp.cols) & select %in% bivar.criteria) {
 n.in <- nrow(data)
 data <- data[is.finite(data[ , sp.cols]), ]
 n.out <- nrow(data)
 if (n.out < n.in) warning (n.in - n.out, " observations removed due to missing data in 'sp.cols'; ", n.out, " observations actually evaluated.")
 }

cor.mat <- cor(data[ , var.cols], ...)
 cor.mat[upper.tri(cor.mat, diag = TRUE)] <- NA
 high.cor.mat <- bivar.mat <- numeric(0)

if (max(abs(cor.mat), na.rm = TRUE) >= cor.thresh) {
 high.cor.rowcol <- as.matrix(which(abs(cor.mat) >= cor.thresh, arr.ind = TRUE))
 high.cor.inds <- sort(unique(as.vector(high.cor.rowcol)))
 high.cor.mat <- data.frame(var1 = rownames(cor.mat)[high.cor.rowcol[ , "row"]], var2 = colnames(cor.mat)[high.cor.rowcol[ , "col"]])
 high.cor.mat$corr <- NULL
 for (r in 1:nrow(high.cor.mat)) high.cor.mat$corr[r] <- cor.mat[high.cor.rowcol[ ,"row"][r], high.cor.rowcol[ ,"col"][r]]
 high.cor.mat <- high.cor.mat[order(abs(high.cor.mat$corr), decreasing = TRUE), ]

if (is.null(sp.cols) & select %in% bivar.criteria) {
 message("Bivariate relationships not assessable without a response variable ('sp.cols'). Returning high pairwise correlations among 'var.cols'.")
 return (high.cor.mat)
 }

high.cor.vars <- unique(rownames(cor.mat[high.cor.inds, high.cor.inds]))

if (select %in% bivar.criteria) {
 bivar.mat <- FDR(data = data, sp.cols = sp.cols, var.cols = match(high.cor.vars, colnames(data)), simplif = TRUE)[ , c("p.value", "AIC", "BIC")]
 if (all.equal(order(bivar.mat[ , c("p.value")]), order(bivar.mat[ , c("AIC")]), order(bivar.mat[ , c("BIC")]))) message("Results identical using whether p-value, AIC or BIC to select variables.\n")
 else message("Results NOT identical using whether p-value, AIC or BIC to select variables.\n")
 } # end if select in bivar

data.remaining <- data[ , var.cols]
 while (max(abs(cor.mat), na.rm = TRUE) >= cor.thresh) {
 max.cor.ind <- which(abs(cor.mat) == max(abs(cor.mat), na.rm = TRUE), arr.ind = TRUE)
 v1 <- rownames(cor.mat)[max.cor.ind[1]]
 v2 <- colnames(cor.mat)[max.cor.ind[2]]
 if (select %in% univar.criteria) {
 vif <- multicol(data.remaining)
 targets <- vif[c(v1, v2), "VIF", drop = FALSE]
 } # end if univar
 else if (select %in% bivar.criteria) {
 targets <- bivar.mat[c(v1, v2), select, drop = FALSE]
 } # end if bivar

exclude <- which.max(as.matrix(targets))
 excl.name <- rownames(targets)[exclude]
 excl.ind <- match(excl.name, colnames(cor.mat))
 data.remaining <- data.remaining[ , -excl.ind, drop = FALSE]
 cor.mat <- cor.mat[-excl.ind, -excl.ind, drop = FALSE]
 if (all(is.na(cor.mat))) cor.mat <- numeric(0)
 if (length(cor.mat) == 0) break
 } # end while max

} # end if max > thresh

selected.vars <- colnames(cor.mat)
 selected.var.cols <- match(colnames(cor.mat), colnames(data))
 excluded.vars <- colnames(data)[var.cols][!(colnames(data)[var.cols] %in% colnames(cor.mat))]

vif2 <- multicol(data[ , selected.var.cols])

list(high.correlations = high.cor.mat,
 bivariate.significance = bivar.mat,
 excluded.vars = excluded.vars,
 selected.vars = selected.vars,
 selected.var.cols = selected.var.cols,
 strongest.remaining.corr = cor.mat[which.max(abs(cor.mat))],
 remaining.multicollinearity = vif2
 )
} # end corSelect function

 

You can install and load fuzzySim (>=version 1.5) and then check help(corSelect) for reproducible usage examples.

Analyse and compare stepwise model predictions

This function builds a generalized linear model with forward stepwise inclusion of variables, using AIC as the selection criterion, and gives the values predicted at each step, as well as their correlation with the final model predictions. This can be useful to see if a model with just the first few variables could already provide predictions very close to the final ones (see e.g. Fig. 3 in Muñoz et al., 2005). This function is now included in the fuzzySim package (Barbosa, 2015; version 1.3 just released). The correlation coefficient (cor.method) is “pearson” by default, but can be set to “spearman” or “kendall” as well.

stepByStep <- function(data, # name of dataset
                       sp.col, # index of species (target variable) column
                       var.cols, # indices of variables columns
                       family = binomial(link = "logit"),
                       Favourability = FALSE, # used only if family=binomial(logit)
                       trace = 0, # argument for 'step'
                       cor.method = "pearson") {
 
# version 1.3 (22 Jul 2015)
 
n.init <- nrow(data)
data <- na.omit(data)
na.loss <- n.init - nrow(data)
if (na.loss > 0) message(na.loss, " cases excluded due to missing values.")
 
response <- colnames(data)[sp.col]
null.model.formula <- as.formula(paste(response, "~", 1))
scope.formula <- as.formula(paste("~", paste(colnames(data)[var.cols], collapse = "+")))
mod <- step(glm(null.model.formula, family = family, data = data), scope = scope.formula, direction = "forward", trace = trace)
pred.final <- mod$fitted.values
 
if (!all(c("binomial", "logit") %in% mod$family)) Favourability <- FALSE
if (Favourability) fav.final <- Fav(model = mod)
 
model.vars.split <- sapply(mod$anova[ , 1], strsplit, split = " ")
model.vars <- lapply(model.vars.split, `[`, 2)
model.vars <- as.character(model.vars)[-1]
n.steps <- length(model.vars)
 
preds <- favs <- as.data.frame(matrix(nrow = nrow(data), ncol = n.steps))
cor.P <- cor.F <- vector("numeric", n.steps)
names(model.vars) <- names(preds) <- names(favs) <- names(cor.P) <- names(cor.F) <- paste0("step", 1:n.steps)
 
for (s in 1:n.steps) {
step.vars <- model.vars[1:s]
mod.step <- glm(as.formula(paste(response, "~", paste(step.vars, collapse = "+"))), family = family, data = data)
if (Favourability) {
favs[ , s] <- Fav(model = mod.step)
cor.F[s] <- cor(favs[ , s], fav.final, method = cor.method)
}
else {
preds[ , s] <- mod.step$fitted.values
cor.P[s] <- cor(preds[ , s], pred.final, method = cor.method)
}
}; rm(s)
 
if (Favourability) result <- list(predictions = favs, correlations = cor.F, variables = model.vars, model = mod)
else result <- list(predictions = preds, correlations = cor.P, variables = model.vars, model = mod)
 
return(result)
}

Usage example:

library(fuzzySim)
data(rotif.env)
 
stepByStep(data = rotif.env, sp.col = 18, var.cols = 5:17, cor.method = "spearman")
 
stepByStep(data = rotif.env, sp.col = 18, var.cols = 5:17, Favourability = TRUE, cor.method = "pearson")
 
stepByStep(data = rotif.env, sp.col = 18, var.cols = 5:17, family = poisson)

[presented with Pretty R]

Only forward selection is implemented for now. I may or may not eventually find the time to implement other selection methods.

REFERENCES

Barbosa A.M. (2015) fuzzySim: applying fuzzy logic to binary similarity indices in ecology. Methods in Ecology and Evolution, 6: 853-858

Muñoz, A.R., Real R., BARBOSA A.M. & Vargas J.M. (2005) Modelling the distribution of Bonelli’s Eagle in Spain: Implications for conservation planning. Diversity and Distributions 11: 477-486