Weighted probability vs. favourability

Posted on July 11, 2023 by A.M. Barbosa

Presence probability, typically obtained with presence-(pseudo)absence modelling methods like GLM, GAM, GBM or Random Forest, is conditional not only on the suitability of the environmental conditions, but also on the general prevalence (proportion of presences) of the species in the study area. So, a species with few presences will generally have low presence probabilities, even in suitable conditions, simply because its presence is indeed rare.

As species distribution modellers often want to remove the efect of unbalanced prevalence from model predictions, a common procedure is to only pick (pseudo)absences in the same number as the presences, for a modelled prevalence of 50%, though this may imply significant loss of data. Alternatively, most modelling functions (e.g. glm() of base R, but also functions that implement GAM, GBM, Random Forests, etc. in a variety of R packages) allow attributing different weights to presences and absences, although they typically do not do this by default. However, some modelling packages that use these functions, like ENMTools and biomod2, use their own defaults and silently apply different weights to presences and absences, to balance their contributions and thus produce prevalence-corrected suitability predictions. Beware: as per these packages’ help files (which users should always read!), ENMTools always does this by default (see e.g. here), while biomod2 does it by default when the pseudoabsences or background points are automatically generated by the package, but not when they are provided by the user (see e.g. here).

A less compromising alternative may be the favourability function (Real et al., 2006; Acevedo & Real, 2012), which removes the effect of species prevalence from predictions of actual presence probability, without the need to restrict the number of (pseudo)absences to the same as the number of presences (i.e. without losing data), and without the need to alter the actual contributions of the data by attributing them different weights. Below is a simple and reproducible comparison in R between favourability, raw probability, and probability based on a model with down-weighted absences so that they balance the number of presences. I’ve used GLM, but this applies to other presence-(pseudo)absence models as well.

library(fuzzySim)

data("rotif.env")
names(rotif.env)

spp_cols <- names(rotif.env)[18:47]
var_cols <- names(rotif.env)[5:17]

nrow(rotif.env)  # 291 sites
sort(sapply(rotif.env[ , spp_cols], sum))  # from 99 to 172 presences
sort(sapply(rotif.env[ , spp_cols], fuzzySim::prevalence))  # from 34 to 59%

species <- spp_cols[8]  # 8 for example; try with others too
species

npres <- sum(rotif.env[ , species])
nabs <- nrow(rotif.env) - npres
npres
nabs

prevalence(rotif.env[ , species])

# set weights as in weights="equal" of ENMTools::enmtools.glm():
weights <- rep(NA, nrow(rotif.env))
weights[rotif.env[ , species] == 1] <- 1
weights[rotif.env[ , species] == 0] <- npres / nabs
weights
sum(weights[rotif.env[ , species] == 1])
sum(weights[rotif.env[ , species] == 0])  # same

formula <- reformulate(termlabels = var_cols, response = species)
formula 

mod <- glm(formula, data = rotif.env, family = binomial)
modw <- glm(formula, data = rotif.env, family = binomial, weights = weights)

pred <- predict(mod, rotif.env, type = "response")
predw <- predict(modw, rotif.env, type = "response")
fav <- Fav(mod) # note favourability is only applicable to unweighted predictons

par(mfrow = c(1, 3))
plot(pred, predw, pch = 20, cex = 0.2)  # curve
plot(pred, fav, pch = 20, cex = 0.2)  # cleaner curve
plot(fav, predw, pch = 20, cex = 0.2)  # ~linear but with noise

par(mfrow = c(1, 1))
plot(pred[order(pred)], pch = 20, cex = 0.5, col = "grey30", ylab = "Prediction")
points(predw[order(pred)], pch = 20, cex = 0.5, col = "blue")  # higher, as expected after down-weighting the unbalancedly numerous absences
points(fav[order(pred)], pch = 20, cex = 0.5, col = "salmon")  # higher like predw, but with less noise (more like the original pred)
legend("topleft", legend = c("probability", "weighted probability", "favourability"), pch = 20, col = c("grey30", "blue", "salmon"), bty = "n")

As you can see, favourability and weighted probability (which serve the same purpose of removing the effect of unbalanced sample prevalence on model predictions) are highly similar. However, favourability does not alter the original data in any way (i.e., it lets the model weigh presences and absences proportionally to the numbers in which they actually occur in the data); and it provides less noisy results that are more aligned with the original (unweighted, non-manipulated) probability.

I’ve checked this for some species already, but further tests and feedback are welcome!

Model overlay operations based on fuzzy logic

Posted on November 6, 2015 by A.M. Barbosa

Logical and set operations are useful for comparative distribution modelling, to assess consensus or mismatch between the predictions of different models, and to quantify differences between models obtained for different time periods. Fuzzy set theory (Zadeh 1965, Barbosa & Real 2012) allows performing such operations without converting the predictions from continuous to binary, with the inherent application of arbitrary thresholds and over-simplification of model predictions. The result is a continuous numerical value quantifying the intersection, union, sum, or other operation among model predictions, whether binary or continuous.

The fuzzyOverlay function, used e.g. by Gutiérrez-Rodríguez et al. (in press) and by Reino et al. (in press) and included in the fuzzySim package (>=1.6), requires a data frame (or a matrix) containing the model prediction columns to compare, an indication of which columns are to be compared (overlay.cols; by default they are all included), and an op indicating the operation to perform between them. Can be ‘consensus‘ for the arithmetic mean of predictions (or the fuzzy equivalent of the proportion of models that agree that the species occurs at each site), ‘fuzzy_and‘ or ‘intersection‘ for fuzzy intersection; ‘fuzzy_or‘ or ‘union‘ for fuzzy union; ‘prob_and‘ or ‘prob_or‘ for probabilistic and/or, respectively (see Details); ‘maintenance‘ for the values where all predictions for the same row (rounded to the number of digits specified in the next argument) are the same. If data has only two columns to compare, you can also calculate ‘xor‘ for exclusive or, ‘AnotB‘ for the the occurrence of the species in column 1 in detriment of that in column 2,’expansion‘ for the prediction increase in rows where column 2 has higher values than column 1, ‘contraction‘ for the prediction decrease in rows where column 2 has lower values than column 1, or ‘change‘ for a mix of the latter two, with positive values where there has been an increase and negative values where there was decrease in favourability from columns 1 to 2.

fuzzyOverlay <- function(data, 
                         overlay.cols = 1:ncol(data),
                         op = "intersection",
                         na.rm = FALSE,
                         round.digits = 2
                         ) {
 
  # version 1.2 (13 Nov 2015)
 
  data <- data[ , overlay.cols]
  stopifnot(all(data[!is.na(data), ] >= 0 && data[!is.na(data), ] <= 1))
 
  if (op == "consensus") rowSums(data, na.rm = na.rm) / ncol(data)
  else if (op %in% c("fuzzy_and", "intersection")) apply(data, MARGIN = 1, FUN = min, na.rm = na.rm)
  else if (op == "prob_and") apply(data, MARGIN = 1, FUN = prod, na.rm = na.rm)
  else if (op %in% c("fuzzy_or", "union")) apply(data, MARGIN = 1, FUN = max, na.rm = na.rm)
  else if (op == "prob_or") apply(data, MARGIN = 1, FUN = sum, na.rm = na.rm) - apply(data, MARGIN = 1, FUN = prod, na.rm = na.rm)
  else if (op == "maintenance") ifelse(round(data[ , 2], digits = round.digits) == round(data[ , 1], digits = round.digits), round(data[ , 1], digits = round.digits), 0)
 
  else if (op %in% c("xor", "AnotB", "expansion", "contraction", "change")) {
    if (ncol(data) != 2) stop ("This 'op' works only for 'data' with 2 columns.")
    if (op == "xor") pmax(pmin(data[,1], 1 - data[,2], na.rm = na.rm), pmin(1 - data[,1], data[,2], na.rm = na.rm), na.rm = na.rm)  # http://www.dmitry-kazakov.de/ada/fuzzy.htm#Fuzzy
    else if (op == "AnotB") pmin(data[,1], 1 - data[,2], na.rm = na.rm)
    else if (op == "expansion") ifelse(data[ , 2] > data[ , 1], 
                                       data[ , 2] - data[ , 1], 
                                       0)
    else if (op == "contraction") ifelse(data[ , 2] < data[ , 1], 
                                         data[ , 2] - data[ , 1], 
                                         0)
    else if (op == "change") data[ , 2] - data[ , 1]
  }
  else stop ("Invalid 'op' name.")
}

[presented with Pretty R]

You can install and load fuzzySim (>= 1.6) and then check help(fuzzyOverlay) for further info and reproducible usage examples.

REFERENCES

Barbosa A.M. & Real R. (2012) Applying fuzzy logic to comparative distribution modelling: a case study with two sympatric amphibians. The Scientific World Journal, 2012, Article ID 428206

Gutiérrez-Rodríguez J., Barbosa A.M. & Martínez-Solano I. (in press) Present and past climatic effects on the current distribution and genetic diversity of the Iberian spadefoot toad (Pelobates cultripes): an integrative approach. Journal of Biogeography

Reino L., Ferreira M., Martínez-Solano I., Segurado P., Xu C. & Barbosa A.M. (in press) Favourable areas for co-occurrence of parapatric species: niche conservatism and niche divergence in Iberian tree frogs and midwife toads. Journal of Biogeography

Zadeh, L.A. (1965) Fuzzy sets. Information and Control, 8: 338-353

Assess model overlap with Schoener’s D, Hellinger distance and Warren’s I

Posted on October 30, 2015 by A.M. Barbosa

Metrics for quantifying the similarity among ecological niche models are important for testing patterns of niche evolution. The modOverlap function, used e.g. by Reino et al. (in press) and now included in the fuzzySim package (>=1.5), calculates three such metrics: Shoener’s D statistic for niche overlap, Hellinger distance between probability distributions, and the I similarity statistic of Warren et al. (2008).

These formulas are also implemented e.g. within the niche.overlap function of R package phyloclim, but there they require input data in complex and software-specific formats. The function below calculates these indices from simply two vectors or columns containing the predictions of the two models to compare.

modOverlap <- function (pred1, pred2, na.rm = TRUE) 
{
    p1 <- pred1/sum(pred1, na.rm = na.rm)
    p2 <- pred2/sum(pred2, na.rm = na.rm)
    SchoenerD <- 1 - 0.5 * sum(abs(p1 - p2), na.rm = na.rm)
    HellingerDist <- sqrt(sum((sqrt(p1) - sqrt(p2))^2, na.rm = na.rm))
    WarrenI <- 1 - ((HellingerDist^2)/2)
    list(SchoenerD = SchoenerD, WarrenI = WarrenI, HellingerDist = HellingerDist)
}

[presented with Pretty R]

See also the fuzSim function in the fuzzySim package, which calculates fuzzy versions of classic binary similarity indices such as Jaccard, Baroni-Urbani & Buser, Sorensen and Simpson (Barbosa 2015).

REFERENCES

Barbosa A.M. (2015) fuzzySim: applying fuzzy logic to binary similarity indices in ecology. Methods in Ecology and Evolution, 6: 853-858

Warren D.L., Glor R.E. & Turelli, M. (2008) Environmental niche equivalency versus conservatism: quantitative approaches to niche evolution. Evolution, 62: 2868-83 [plus ERRATUM]

Analyse and compare stepwise model predictions

Posted on July 3, 2015 by A.M. Barbosa

This function builds a generalized linear model with forward stepwise inclusion of variables, using AIC as the selection criterion, and gives the values predicted at each step, as well as their correlation with the final model predictions. This can be useful to see if a model with just the first few variables could already provide predictions very close to the final ones (see e.g. Fig. 3 in Muñoz et al., 2005). This function is now included in the fuzzySim package (Barbosa, 2015; version 1.3 just released). The correlation coefficient (cor.method) is “pearson” by default, but can be set to “spearman” or “kendall” as well.

stepByStep <- function(data, # name of dataset
                       sp.col, # index of species (target variable) column
                       var.cols, # indices of variables columns
                       family = binomial(link = "logit"),
                       Favourability = FALSE, # used only if family=binomial(logit)
                       trace = 0, # argument for 'step'
                       cor.method = "pearson") {
 
# version 1.3 (22 Jul 2015)
 
n.init <- nrow(data)
data <- na.omit(data)
na.loss <- n.init - nrow(data)
if (na.loss > 0) message(na.loss, " cases excluded due to missing values.")
 
response <- colnames(data)[sp.col]
null.model.formula <- as.formula(paste(response, "~", 1))
scope.formula <- as.formula(paste("~", paste(colnames(data)[var.cols], collapse = "+")))
mod <- step(glm(null.model.formula, family = family, data = data), scope = scope.formula, direction = "forward", trace = trace)
pred.final <- mod$fitted.values
 
if (!all(c("binomial", "logit") %in% mod$family)) Favourability <- FALSE
if (Favourability) fav.final <- Fav(model = mod)
 
model.vars.split <- sapply(mod$anova[ , 1], strsplit, split = " ")
model.vars <- lapply(model.vars.split, `[`, 2)
model.vars <- as.character(model.vars)[-1]
n.steps <- length(model.vars)
 
preds <- favs <- as.data.frame(matrix(nrow = nrow(data), ncol = n.steps))
cor.P <- cor.F <- vector("numeric", n.steps)
names(model.vars) <- names(preds) <- names(favs) <- names(cor.P) <- names(cor.F) <- paste0("step", 1:n.steps)
 
for (s in 1:n.steps) {
step.vars <- model.vars[1:s]
mod.step <- glm(as.formula(paste(response, "~", paste(step.vars, collapse = "+"))), family = family, data = data)
if (Favourability) {
favs[ , s] <- Fav(model = mod.step)
cor.F[s] <- cor(favs[ , s], fav.final, method = cor.method)
}
else {
preds[ , s] <- mod.step$fitted.values
cor.P[s] <- cor(preds[ , s], pred.final, method = cor.method)
}
}; rm(s)
 
if (Favourability) result <- list(predictions = favs, correlations = cor.F, variables = model.vars, model = mod)
else result <- list(predictions = preds, correlations = cor.P, variables = model.vars, model = mod)
 
return(result)
}

Usage example:

library(fuzzySim)
data(rotif.env)
 
stepByStep(data = rotif.env, sp.col = 18, var.cols = 5:17, cor.method = "spearman")
 
stepByStep(data = rotif.env, sp.col = 18, var.cols = 5:17, Favourability = TRUE, cor.method = "pearson")
 
stepByStep(data = rotif.env, sp.col = 18, var.cols = 5:17, family = poisson)

[presented with Pretty R]

Only forward selection is implemented for now. I may or may not eventually find the time to implement other selection methods.

REFERENCES

Barbosa A.M. (2015) fuzzySim: applying fuzzy logic to binary similarity indices in ecology. Methods in Ecology and Evolution, 6: 853-858

Muñoz, A.R., Real R., BARBOSA A.M. & Vargas J.M. (2005) Modelling the distribution of Bonelli’s Eagle in Spain: Implications for conservation planning. Diversity and Distributions 11: 477-486

Calculate the amount of deviance explained by a GLM

Posted on August 14, 2013 by A.M. Barbosa

Linear models come with an R-squared value that measures the proportion of variation that the model accounts for. The R-squared is provided with summary(model) in R. For generalized linear models (GLMs), the equivalent is the amount of deviance accounted for (D-squared; Guisan & Zimmermann 2000), but this value is not normally provided with the model summary. The Dsquared function, now included in the modEvA package (Barbosa et al. 2014), calculates it. There is also an option to calculate the adjusted D-squared, which takes into account the number of observations and the number of model parameters, thus allowing direct comparison among different models (Weisberg 1980, Guisan & Zimmermann 2000). UPDATE: see also further measures in the new RsqGLM function.

Dsquared <- function(model = NULL, 
                     obs = NULL, 
                     pred = NULL, 
                     family = NULL, # needed only when 'model' not provided
                     adjust = FALSE, 
                     npar = NULL) { # needed only when 'model' not provided
  # version 1.4 (31 Aug 2015)
 
  model.provided <- ifelse(is.null(model), FALSE, TRUE)
 
  if (model.provided) {
    if (!("glm" %in% class(model))) stop ("'model' must be of class 'glm'.")
    if (!is.null(pred)) message("Argument 'pred' ignored in favour of 'model'.")
    if (!is.null(obs)) message("Argument 'obs' ignored in favour of 'model'.")
    obs <- model$y
    pred <- model$fitted.values
 
  } else { # if model not provided
    if (is.null(obs) | is.null(pred)) stop ("You must provide either 'obs' and 'pred', or a 'model' object of class 'glm'.")
    if (length(obs) != length(pred)) stop ("'obs' and 'pred' must be of the same length (and in the same order).")
    if (is.null(family)) stop ("With 'obs' and 'pred' arguments (rather than a model object), you must also specify one of two model family options: 'binomial' or 'poisson' (in quotes).")
    else if (!is.character(family)) stop ("Argument 'family' must be provided as character (i.e. in quotes: 'binomial' or 'poisson').")
    else if (length(family) != 1 | !(family %in% c("binomial", "poisson"))) stop ("'family' must be either 'binomial' or 'poisson' (in quotes).")
 
    if (family == "binomial") {
      if (any(!(obs %in% c(0, 1)) | pred < 0 | pred > 1)) stop ("'binomial' family implies that 'obs' data should be binary (with values 0 or 1) and 'pred' data should be bounded between 0 and 1.")
      link <- log(pred / (1 - pred))  # logit
    }  # end if binomial
 
    else if (family == "poisson") {
      if (any(obs %%1 != 0)) stop ("'poisson' family implies that 'obs' data should consist of whole numbers.")
      link <- log(pred)
    }  # end if poisson
 
    model <- glm(obs ~ link, family = family)
  }  # end if model not provided
 
  D2 <- (model$null.deviance - model$deviance) / model$null.deviance
 
  if (adjust) {
    if (model.provided) {
      n <- length(model$y)
      #p <- length(model$coefficients)
      p <- attributes(logLik(model))$df
    } else {
      if (is.null(npar)) stop ("Adjusted D-squared from 'obs' and 'pred' values (rather than a model object) requires specifying the number of parameters in the underlying model ('npar').")
      n <- length(na.omit(obs))
      p <- npar
    }  # end if model.provided else
 
    D2 <- 1 - ((n - 1) / (n - p)) * (1 - D2)
  }  # end if adjust
 
  return (D2)
}

[presented with Pretty R]

Usage examples:

Dsquared(model = mymodel)

Dsquared(model = mymodel, adjust = TRUE)

Dsquared(obs = myobs, pred= mypred, family = “poisson”)

Dsquared(obs = myobs, pred = mypred, family = “poisson”, adjust = TRUE, npar = 5)

References

Barbosa A.M., Brown J.A. & Real R. (2014) modEvA – an R package for model evaluation and analysis. R package, version 0.1.

Guisan, A. & Zimmermann, N.E. (2000). Predictive habitat distribution models in ecology. Ecological Modelling 135: 147-186

Weisberg, S. (1980). Applied Linear Regression. Wiley, New York

Get the equation of a model to apply elsewhere

Posted on August 14, 2013 by A.M. Barbosa

The summary of a model in R gives you a table of the coefficient estimates and other parameters. Sometimes it may be useful to have a string of text with the model’s equation, so that you can present it in an article (e.g. Real et al. 2005) or apply it in a (raster map) calculation, either in R (although here you can usually use the ‘predict’ function) or in a GIS software (e.g. Barbosa et al. 2010). The getModEqn function, now included in the modEvA package (Barbosa et al. 2014), gets this equation for linear or generalized linear models.

By default it prints the “y” linear equation, but for generalized linear models you can also set type = “P” (for the equation of probability) or type = “F” (for favourability, which corrects the intercept to eliminate the effect of prevalence — see Real et al. 2006).

If the variables to which you want to apply the model have a prefix or suffix (e.g. prefix = “raster.stack$” for the R raster package, or prefix = “mydata$” for a data frame, or suffix = “@1” in Quantum GIS, or suffix = “@mapset” in GRASS), you can get these in the equation too, using the prefix and/or the suffix arguments.

getModEqn <-
function(model, type = "Y", digits = NULL, prefix = NULL, suffix = NULL) {
  # version 1.7 (4 Apr 2022)

  stopifnot(class(model) %in% c("lm", "glm"))
  
  if (length(type) != 1 | !(type %in% c("Y", "P", "F"))) stop("'type' must be either 'Y', 'P', or 'F'")
  if(!("glm" %in% class(model)) && type != "Y") {
    message("types 'P' and 'F' are only applicable to models of class 'glm', so type was reset to 'Y'")
    type <- "Y"
  }
  coeffs <- summary(model)$coefficients[ , 1]
  if (type == "F" & !("(Intercept)") %in% names(coeffs)) {
    message("'F' is currently only applicable to models with an intercept, so type was set to 'P'")
    type <- "P"
  }
  if (type == "F") {
    n1 <- sum(model$y == 1)
    n0 <- sum(model$y == 0)
    coeffs["(Intercept)"] <- coeffs["(Intercept)"] - log(n1/n0)
  }
  names(coeffs) <- paste(prefix, names(coeffs), suffix, sep = "")
  if (!is.null(digits)) coeffs <- signif(coeffs, digits)  # was previously 'round'
  coeffs <- ifelse(coeffs < 0, coeffs, paste("+", coeffs, sep = ""))
  multips <- paste(coeffs, names(coeffs), sep = "*")
  multips <- sub(x = multips, pattern = paste(prefix, "*(Intercept)", suffix, sep = ""), replacement = "", fixed = TRUE)
  eqn <- apply(X = as.matrix(multips), MARGIN = 2, FUN = paste, collapse = "")
  if (substring(eqn, 1, 1) == "+")  eqn <- substring(eqn, 2)
  if (type == "Y")  eqn <- paste("Y=", eqn, sep = "")
  if (type == "P")  eqn <- paste("P=1-(1/(1+exp(", eqn, ")))", sep = "")
  if (type == "F")  eqn <- paste("F=1-(1/(1+exp(", eqn, ")))", sep = "")
  cat(eqn)
  return(invisible(eqn))
}

Usage example:

data(rotif.mods)
mod <- rotif.mods$models[[1]]
getModEqn(mod, digits = 2)
#Y=-2.2+6.7e-07*Area+0.36*HabitatDiversity+1.1e-08*HumanPopulation-0.00062*Precipitation-0.012*PrecipitationSeasonality-0.00022*TemperatureSeasonality

References

Barbosa A.M., Brown J.A. & Real R. (2014) modEvA – an R package for model evaluation and analysis. R package, version 0.1.

Barbosa, A.M., Real, R. & Vargas, J.M. (2010). Use of coarse-resolution models of species’ distributions to guide local conservation inferences. Conservation Biology 24: 1378–87

Real, R., Barbosa, A.M., Martínez-Solano, Í. & García-París, M. (2005). Distinguishing the distributions of two cryptic frogs (Anura: Discoglossidae) using molecular data and environmental modeling. Canadian Journal of Zoology 83: 536-545

Real, R., Barbosa, A.M. & Vargas, J.M. (2006). Obtaining environmental favourability functions from logistic regression. Environmental and Ecological Statistics 13: 237-245

Prevalence and evenness in binary data

Posted on June 17, 2013 by A.M. Barbosa

For building and evaluating species distribution models, the porportion of presences of the species and the balance between the number of presences and absences may be issues to take into account (e.g. Jiménez-Valverde & Lobo 2006, Barbosa et al. 2013). The prevalence and the evenness functions, included in the modEvA package (Barbosa et al. 2014), can calculate these measures:

prevalence <- function (obs, event = 1) {
  # version 1.0
  # calculates the prevalence (proportion of occurrences) of a value (event) in a vector
  # obs: a vector of binary observations (e.g. 1 vs. 0, male vs. female, disease vs. no disease, etc.)
  # event: the value whose prevalence we want to calculate (e.g. 1, "present", etc.)
  sum(obs == event) / length(obs)
}  # end prevalence function

evenness <- function (obs) {
  # version 1.3 (18 June 2013)
  # calculates the evenness (equilibrium) of cases in a binary vector; result ranges from 0 when all values are the same, to 1 when there are the same number of cases with each value
  # obs: a vector of binary observations (e.g. 1 vs. 0, male vs. female, disease vs. no disease, etc.)
  values <- unique(obs)
  nvalues <- length(values)
  if (!(nvalues %in% c(1, 2))) stop("Input vector includes ", nvalues, " different values; 'evenness' is only implemented for binary observations (with 1 or 2 different values).")
  proportion <- (sum(obs == values[1])) / length(obs)
  if (proportion > 0.5)  balance <- 1 - proportion  else  balance <- proportion
  return(2 * balance)  # so evenness ranges between 0 and 1
}  # end evenness function

Let’s exemplify with 3 sample binary vectors x, y and z:

(x <- rep(c(0, 1), each = 5))
(y <- c(rep(0, 3), rep(1, 7)))
(z <- c(rep(0, 7), rep(1, 3)))

prevalence(x)
evenness(x)

prevalence(y)
evenness(y)

prevalence(z)
evenness(z)

[presented with Pretty R]

References

Barbosa A.M., Brown J.A. & Real R. (2014) modEvA – an R package for model evaluation and analysis. R package, version 0.1.

Barbosa A.M., Real R., Muñoz A.R. & Brown J.A. (2013) New measures for assessing model equilibrium and prediction mismatch in species distribution models. Diversity and Distributions, 19: 1333-1338

Jiménez-Valverde A. & Lobo J.M. (2006) The ghost of unbalanced species distribution data in geographical model predictions. Diversity and Distributions, 12: 521–524.

Plot a generalized linear model

Posted on February 17, 2013 by A.M. Barbosa

The plotGLM function, now included in the modEvA package (Barbosa et al. 2014), plots the observed (presence/absence) data and the predicted (probability) values of a generalized linear model against the y regression equation (logit) values. For now, it works only for logistic regression (binomial response, logit link). Optionally (if plot.values = TRUE) it can display in the plot the proportion of deviance explained (Weisberg 1980, Guisan & Zimmermann 2000), but this requires that the Dsquared function is loaded, and that a model object is provided (rather than just the observed and predicted values) if you want the adjusted deviance as well.

plotGLM <-
function(obs = NULL, pred = NULL, model = NULL, link = "logit", 
                    plot.values = FALSE, xlab = "Logit (y)", 
                    ylab = "Predicted probability", main = "Model plot", ...) {
  # version 1.7 (28 Oct 2014)
  # obs: presence/absence or other binary (1-0) observed data
  # pred: values predicted by a GLM of the binary observed data
  # model: instead of (and overriding) obs & pred, you can provide a model object of class "glm"
  # link: link function of the GLM; only 'logit' is implemented
  # plot.values: logical, whether to report the values of deviance and adjusted deviance in the plot (requires the 'model' argument and the 'Dsquared' function)
  # ...: arguments to pass to the 'plot' function
 
  model.provided <- ifelse(is.null(model), FALSE, TRUE)
 
  if (model.provided) {
    if (!is.null(pred)) message("Argument 'pred' ignored in favour of 'model'.")
    if (!is.null(obs)) message("Argument 'obs' ignored in favour of 'model'.")
    obs <- model$y
    pred <- model$fitted.values
  }  else { # if model not provided
    if (is.null(obs) | is.null(pred)) stop("You must provide either 'obs' and 'pred', or a 'model' object of class 'glm'")
  }  # end if model
 
  stopifnot(
    length(obs) == length(pred),
    obs %in% c(0, 1),
    pred >= 0,
    pred <= 1)
 
  logit <- log(pred / (1 - pred))
  if (!model.provided) model <- glm(obs ~ logit, family = "binomial")
 
  plot(pred ~ logit, ylim = c(0, 1), type = "n", xlab = xlab, ylab = ylab, 
       main = main, ...)
  points(obs ~ logit, pch = 1, col = "darkgrey")
  abline(v = 0, lty = 5, col = "grey")  # y of P = 0.5
  #abline(h = pred[which.min(abs(logit))], col = "lightgrey", lty = 2)
  points(pred ~ logit, pch = 20, cex = 0.6)
 
  if (plot.values) {
    Dsq <- Dsquared(model = model, adjust = FALSE)
    if (max(logit) > abs(min(logit)))  x.loc <- c(max(logit), 1)
    else x.loc <- c(min(logit), 0)
    text(x = x.loc[1], y = 0.6, adj = x.loc[2], labels = substitute(paste(D^2 == a), list(a = round(Dsq, 3))))
    if (model.provided) {  # adjDsq needs n parameters in original model, not just our model created from obs~logit
      adjDsq <- Dsquared(model = model, adjust = TRUE)
      text(x = x.loc[1], y = 0.4, adj = x.loc[2], labels = substitute(paste('D'['adj']^2) == a, list(a = round(adjDsq, 3))))  # http://stackoverflow.com/questions/10156417/subscripts-in-plots-in-r, plus a mail on substitute(paste) in Silwod-R
 
    }
  }  # end if plot.val
}

Created by Pretty R at inside-R.org

[presented with Pretty R]

For example, if you have a table called mydata with your species’ presence/absence in column 3 and your predicted data in column 20, paste in R the plotGLM function and then this command:

plotGLM(obs = mydata[ , 3], pred = mydata[ , 20])

Instead of obs and pred, you can provide a model object:

# mymodel <- glm(myspecies ~ var1 + var2 ……)

plotGLM(model = mymodel, plot.values = TRUE)

Grey circles are the presences (0) and absences (1); black dots are the corresponding predicted values. The vertical line at y = 0 corresponds to a predicted value of 0.5 (inflection point). You can add a legend and also alternative or additional titles and labels to the plot, e.g with arguments legend, ylab, xlab and main (see ?plot).

References

Barbosa A.M., Brown J.A. & Real R. (2014) modEvA – an R package for model evaluation and analysis. R package, version 0.1.

Guisan, A. & Zimmermann, N.E. (2000). Predictive habitat distribution models in ecology. Ecological Modelling 135: 147-186

Weisberg, S. (1980). Applied Linear Regression. Wiley, New York

Overlap Analysis

Posted on December 2, 2012 by A.M. Barbosa

Overlap Analysis is one of the simplest forms of modelling species’ distributions. It assesses the ranges of values of the given predictor variables at the sites where a species has been recorded present, and predicts where that species should be able to occur based on those presence data (e.g. Brito et al. 1999, Arntzen & Teixeira 2006).

OA, now included in the modEvA package (Barbosa et al. 2014), can also be useful when extrapolating models outside their original scope (geographical area, time period or spatial resolution), as it can identify which localities are within the model’s domain — i.e., within the analysed ranges of values of the variables, outside which the model may not be reliable (e.g. Barbosa et al. 2009). In this case, the sp.cols should contain not species’ presence/absence, but rather indicate (with value 1) all observations that have been included in the model. Cf. modEvA function MESS, which is more informative but also more computationally intensive.

OA <-
function(data, sp.cols, var.cols) {
  # version 2.1 (13 May 2014)
 
  if (length(sp.cols) > 1) stop ("Sorry, OA is currently implemented for only one response variable at a time, so 'sp.cols' must indicate only one column")
 
  input.data <- data
 
  na.rm = TRUE
  if(na.rm) {
    mod.data <- data[ , c(sp.cols, var.cols)]
    data <- data[complete.cases(mod.data), ]
  }
 
  predictors <- data[ , var.cols]
  nvar <- ncol(predictors)
 
  response <- data[ , sp.cols]
  predictors.presence <- predictors[response > 0, ]
  var.presmin <- var.presmax <- vector("numeric", nvar)
  predictors.overlap <- matrix(data = NA, nrow = nrow(predictors), ncol = nvar)
  for (v in 1:nvar) {
    var.presmin[v] <- min(predictors.presence[ , v])
    var.presmax[v] <- max(predictors.presence[ , v])
    predictors.overlap[ , v] <- ifelse((predictors[ , v] >= var.presmin[v] & predictors[ , v] <= var.presmax[v]), 1, 0)
  }  # end for v
  overlap.noNA <- as.integer(rowSums(predictors.overlap) == nvar)
  return(overlap.noNA)
}

[presented with Pretty R]

NOTE that the input data format for this function has changed in mid May 2014, with former parameters response and predictors now replaced with data, sp.cols and var.cols (for coherence with multGLM and other functions in modEvA). Input data for the OA function are now a matrix or data frame containing the response variable (presences vs. absences of a species if we want to model its occurrence, or modelled vs. non-modelled sites if we want to know which non-modelled sites are within the modelled range), and the predictor variables to consider. The output is a binary vector whith 1 where the values of all predictors lie within the ranges observed for the presence records, and 0 otherwise. For example, if you have a table called mydata with your species’ presence data in the first column and your predictor variables in columns 2 to 12, paste in R the OA function followed by this command:

OA(data = mydata, sp.cols = 1, var.cols = 2:12)

References

Arntzen JW, Teixeira J. 2006. History and new developments in the mapping and modelling of the distribution of the golden-striped salamander, Chioglossa lusitanica. Zeitschrift für Feldherpetologie, Supplement: 1-14.

Barbosa A.M., Brown J.A. & Real R. (2014) modEvA – an R package for model evaluation and analysis. R package, version 0.1.

Barbosa, A.M., Real, R. & Vargas, J.M. (2009) Transferability of environmental favourability models in geographic space: the case of the Iberian desman (Galemys pyrenaicus) in Portugal and Spain. Ecological Modelling 220: 747–754.

Brito JC, Crespo EG, Paulo OS 1999. Modelling wildlife distributions: Logistic Multiple Regression vs Overlap Analysis. Ecography 22: 251-260.

Variation partitioning

Posted on November 6, 2012 by A.M. Barbosa

The varPart function, now included in the modEvA package (Barbosa et al. 2014), performs variation partitioning (Borcard et al. 1992) among two factors (e.g. Ribas et al. 2006) or three factors (e.g. Real et al. 2003) for either multiple linear regression models (LM) or generalized linear models (GLM).

If you have linear models, input data for varPart are the coefficients of determination (R-squared values) of the linear regressions of the target variable on all the variables in the model, on the variables related to each particular factor, and (when there are 3 factors) on the variables related to each pair of factors. The outputs are the amounts of variance explained exclusively by each factor, the amounts explained exclusively by the overlapping effects of each pair of factors, and the amount explained by the overlap of the 3 factors if this is the case (e.g. Real et al. 2003). The amount of variation not explained by the complete model is also provided.

If you have generalized linear models (GLMs) such as logistic regression or the favourability function, you have no true R-squared values from these models; inputs can be squared coefficients of (Spearman’s) correlation between the model predictions given by each factor or pair of factors and the predictions of the complete model (e.g. Muñoz & Real 2006), or the R-squared values of the corresponding logit (y) functions (Real et al. 2013), or an adjusted R-squared (De Araújo et al. 2014). Consequently, output values are not the total amounts of variance (of the target variable) explained by factors and overlaps, but rather their proportional contribution to the total variation explained by the model. The amount of unexplained variation is not available for GLMs.

varPart <- function(A, B, C = NA, AB, AC = NA, BC = NA, ABC = NA,
         model.type = NULL, A.name = "Factor A", B.name = "Factor B", 
         C.name = "Factor C", plot = TRUE, plot.digits = 3, cex.names = 1.5, 
         cex.values = 1.2, main = "", cex.main = 2, plot.unexpl = TRUE) {
 
  # version 1.7 (17 Mar 2017)
 
  if (!is.null(model.type)) message ("NOTE: Argument 'model.type' is no longer used.")
 
  partials c(A, B, C, AB, BC, AC, ABC)
  if (is.finite(partials[c(1:2, 4)]) && is.na(partials[c(3, 5:7)]))  
    twofactors TRUE
  else if (all(is.finite(partials)))  
    twofactors FALSE
  else stop ("You must provide numeric values for either A, B and AB (for variation partitioning among two factors) or A, B, C, AB, BC, AC and ABC (for variation partitioning among three factors). See Details.")
 
  if (!all(na.omit(partials) >= 0 & na.omit(partials) <= 1)) stop ("Values must be between 0 and 1.")
 
  totalexpl ifelse(twofactors, AB, ABC)
  unexpl 1 - totalexpl
 
  if (twofactors) {
    Apure c(paste("Pure", A.name), paste("Pure", B.name),
                      paste0("Pure ", A.name, "_", B.name, " overlap"),
                      "Unexplained")
    results data.frame(c(Apure, Bpure, ABoverlap, unexpl),
                          row.names = output.names)
 
  } else { # end if 2 factors
 
    Apure C
    BCoverlap c(paste("Pure", A.name),
                      paste("Pure", B.name),
                      paste("Pure", C.name),
                      paste0("Pure ", A.name, "_", B.name, " overlap"),
                      paste0("Pure ", B.name, "_", C.name, " overlap"),
                      paste0("Pure ", A.name,"_", C.name," overlap"),
                      paste0(A.name,"_",B.name,"_",C.name," overlap"),
                      "Unexplained")
    results data.frame(c(Apure, Bpure, Cpure, ABoverlap, BCoverlap,
                            ACoverlap, ABCoverlap, unexpl),
                          row.names = output.names)
  }  # end else
 
  colnames(results) "Proportion"
 
  if (plot) {  # adapted from Daniel's http://stackoverflow.com/questions/1428946/venn-diagrams-with-r
 
    circle function(x, y, r) {
      ang seq(0, 2*pi, length = 100)
      xx cos(ang)
      yy sin(ang)
      polygon(xx, yy)
    }  # end circle funtion (by Daniel)
 
    Apure round(Apure, plot.digits)  # shorten values for plotting
    Bpure round(Bpure, plot.digits)
    ABoverlap round(ABoverlap, plot.digits)
    if(!twofactors) {
      Cpure round(Cpure, plot.digits)
      BCoverlap round(BCoverlap, plot.digits)
      ACoverlap round(ACoverlap, plot.digits)
      ABCoverlap round(ABCoverlap, plot.digits)
    }
 
    if (twofactors) {
      plot(0, 0, ylim = c(-1, 10), xlim = c(-1, 7), type = "n", axes = FALSE,
           ylab = "", xlab = "", main = main, cex.main = cex.main)
      circle(3,3,3)
      circle(3,6,3)
      text(x = c(3, 3), y = c(9.5, -0.5), labels = c(A.name, B.name),
           cex = cex.names)
      text(x = c(3, 3, 3), y = c(7, 4.75, 2), c(Apure, ABoverlap, Bpure),
           cex = cex.values)
 
    } else { # end if 2 factors
 
      plot(0, 0, ylim = c(-1, 10), xlim = c(-1, 10), type = "n", axes = FALSE,
           ylab = "", xlab = "", main = main, cex.main = cex.main)
      circle(3, 6, 3)
      circle(6, 6, 3)
      circle(4.5, 3,  3)
      text(x = c(2.5, 6.5, 4.5), y = c(9.5, 9.5, -0.5),
           labels = c(A.name, B.name, C.name), cex = cex.names, adj = c(0.5, 0.5, 0))
      text(x = c(1.8, 7.2, 4.5, 4.5, 2.8, 6.2, 4.5), y = c(6.6, 6.6, 2, 7, 4, 4, 5), labels = c(Apure, Bpure, Cpure, ABoverlap, ACoverlap, BCoverlap, ABCoverlap), cex = cex.values)
    } # end if 2 factors else
 
    if (plot.unexpl)  {
      rect(-1, -1, 10, 10)
      text(x = -0.9, y = -0.2, label = paste0("Unexplained\n", round(unexpl, plot.digits)), adj = 0, cex = cex.values)
    }
 
  }  # end if plot
  results
}

[presented with Pretty R]

You just need to copy the function above, paste it into R and press enter. An example of how to use it then is provided below. The results are returned numerically and also plotted in a diagram:

# if you have a linear model (LM), use (non-adjusted) R-squared values 
# for each factor and for their combinations as inputs:
 
varPart(A = 0.456, B = 0.315, C = 0.281, AB = 0.051, BC = 0.444, 
AC = 0.569, ABC = 0.624, A.name = "Spatial", B.name = "Human", 
C.name = "Environmental", main = "Small whale")
 
# if you have a generalized linear model (GLM), 
# you can use squared correlation coefficients 
# of the predictions of each factor with those of the complete model:
 
varPart(A = (-0.005)^2, B = 0.698^2, C = 0.922^2, AB = 0.696^2, 
BC = 0.994^2, AC = 0.953^2, ABC = 1, A.name = "Topographic", 
B.name = "Climatic", C.name = "Geographic", main = "Big bird", 
plot.unexpl = FALSE)

varPart_GLM You can change the number of digits in the plotted values with the argument plot.digits. You can also change the character sizes of the plot’s main title, circle names, and values, using arguments cex.main, cex.names and cex.values. Note that these results derive from arithmetic operations between your input values, and they always sum up to 1; if your input is incorrect, the results will be incorrect as well.

References

Barbosa A.M., Brown J.A., Jiménez-Valverde & Real R. (2014) modEvA – an R package for model evaluation and analysis. R package, version 0.1.

Borcard D, Legendre P, Drapeau P (1992) Partialling out the spatial component of ecological variation. Ecology 73: 1045-1055

De Araújo, C.B., Marcondes-Machado, L.O. & Costa, G.C. (2014) The importance of biotic interactions in species distribution models: a test of the Eltonian noise hypothesis using parrots. Journal of Biogeography 41: 513-523 (DOI: 10.1111/jbi.12234)

Muñoz, A.-R. & Real, R. (2006) Assessing the potential range expansion of the exotic monk parakeet in Spain. Diversity and Distributions 12: 656–665.

Real, R., Barbosa, A.M., Porras, D., Kin, M.S., Márquez, A.L., Guerrero, J.C., Palomo, L.J., Justo, E.R. & Vargas, J.M. (2003) Relative importance of environment, human activity and spatial situation in determining the distribution of terrestrial mammal diversity in Argentina. Journal of Biogeography 30: 939-947.

Real R., Romero D., Olivero J., Estrada A. & Márquez A.L. (2013) Estimating how inflated or obscured effects of climate affect forecasted species distribution. PLoS ONE 8: e53646. doi:10.1371/journal.pone.0053646

Ribas, A., Barbosa, A.M., Casanova, J.C., Real, R., Feliu, C. & Vargas, J.M. (2006) Geographical patterns of the species richness of helminth parasites of moles (Talpa spp.) in Spain: separating the effect of sampling effort from those of other conditioning factors. Vie et Milieu 56: 1-8.

modTools

simple tools for simpler modelling

Tag Archives: model analysis

Weighted probability vs. favourability

Model overlay operations based on fuzzy logic

Assess model overlap with Schoener’s D, Hellinger distance and Warren’s I

Analyse and compare stepwise model predictions

Calculate the amount of deviance explained by a GLM

Get the equation of a model to apply elsewhere

Prevalence and evenness in binary data

Plot a generalized linear model

Overlap Analysis

Variation partitioning