Trim off non-significant variables from a model

Stepwise variable selection is a common procedure for simplifying models. It maximizes predictive efficiency in an objective and reproducible way, and is useful when the individual importance of the predictors is not known a priori (Hosmer & Lemeshow, 2000). The step function in R performs such a procedure using an information criterion (AIC) to select the variables, but it often leaves variables that are not significant in the model. Such variables can be subsequently removed with a manual stepwise procedure (e.g. Crawley 2007, p. 442; Barbosa & Real 2010, 2012; Estrada & Arroyo 2012). The modelTrim function, now included in the fuzzySim package (Barbosa, 2015), performs this removal automatically until all remaining variables are significant. It can also be applied to a full model (i.e., without previous use of the step function), in which case it serves as a backward stepwise selection procedure based on the significance of the coefficients (if method = "summary", the default) or on the significance of the variables themselves (if method = "anova", better when there are categorical variables in the model).

modelTrim <- function(model, method = "summary", alpha = 0.05) {
  # version 1.9 (16 Nov 2018)

  stopifnot(
    class(model) %in% c("glm", "lm"),
    alpha > 0,
    alpha < 1
  )

  dat <- as.data.frame(model$model)  # new
  n.vars.start <- length(model$coefficients) - 1
  names.vars.start <- names(model$coefficients)[-1]

  if (method == "summary") {
    p.values <- expression(summary(model)$coefficients[ , 4])
  } else if (method == "anova") {
    p.values <- expression(as.matrix(anova(model, test = "Chi"))[ , 5])
  } else stop ("'method' must be either 'summary' or 'anova'")

  while (max(eval(p.values)[-1]) > alpha) {  # excludes p-value of intercept
    exclude <- names(which.max(eval(p.values)[-1]))
    model <- update(model, as.formula(paste("~.-", exclude)), data = dat)
    if (length(model$coefficients) == 1) {  # only intercept remains
      message("No significant variables left in the model.")
      break
    }  # end if length
  }  # end while

  n.vars.trim <- length(model$coefficients) - 1
  excluded.vars <- setdiff(names.vars.start, names(model$coefficients)[-1])
  message(n.vars.start - n.vars.trim, " variable(s) excluded by 'modelTrim' function\n ", paste(excluded.vars, collapse = ", "), "\n\n")

  return(model)
}

If you have a model object (resulting e.g. from the glm function) called, for example, mod.sp1, load the modelTrim function and then just type

mod.sp1 <- modelTrim(mod.sp1, method = "summary")

summary(mod.sp1)
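For the full workflow described above (AIC-based selection with step, followed by significance-based trimming), a minimal sketch could look like this, assuming a data frame mydata with a binary response sp and predictors var1 to var4 (all names hypothetical):

mod.full <- glm(sp ~ var1 + var2 + var3 + var4, family = binomial, data = mydata)  # hypothetical full model

mod.step <- step(mod.full)  # AIC-based selection; may retain non-significant variables

mod.step <- modelTrim(mod.step)  # removes them until all remaining variables are significant

summary(mod.step)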

An option to trim the models using this function is now also included in the multGLM function.

References

Barbosa A.M. (2015) fuzzySim: applying fuzzy logic to binary similarity indices in ecology. Methods in Ecology and Evolution, 6: 853-858

Barbosa A.M. & Real R. (2010) Favourable areas for expansion and reintroduction of Iberian lynx accounting for distribution trends and genetic diversity of the European rabbit. Wildlife Biology in Practice 6: 34-47

Barbosa A.M. & Real R. (2012) Applying fuzzy logic to comparative distribution modelling: a case study with two sympatric amphibians. The Scientific World Journal, Article ID 428206

Crawley M.J. (2007) The R Book. John Wiley & Sons, Chichester (UK)

Estrada A. & Arroyo B. (2012) Occurrence vs abundance models: Differences between species with varying aggregation patterns. Biological Conservation, 152: 37-45

Hosmer D. W. & Lemeshow S. (2000) Applied Logistic Regression (2nd ed). John Wiley and Sons, New York

Create species name codes

Often we need to work with data sets that have species in column names. Species names are often longer than the maximum size allowed, for example, in .dbf tables to associate with maps in a GIS. We thus need to create codes for the species names, which should be short but also unambiguous (you should be able to recognize the species from its code) and unique (there must not be two species with the same code). The spCodes function, now included in the fuzzySim package (Barbosa, 2014), combines a specified number of characters of the genus and species (and optionally also the subspecies) names, and checks that the resulting codes are unique:

spCodes <- function(species, nchar.gen = 3, nchar.sp = 3, nchar.ssp = 0, sep.species = " ", sep.spcode = "") {
  # version 1.8 (7 May 2013)

  species <- as.character(species)
  splits <- strsplit(species, split = sep.species)
  gen <- sp <- vector("character", length(splits))
  for (i in 1:length(splits)) {
    gen[i] <- splits[[i]][1]
    sp[i] <- splits[[i]][2]
    if (is.na(sp[i]))  sp[i] <- ""
  }  # end for i

  abbrev <- function (string, nchar) {
    abv <- substr(string, start = 1, stop = nchar)
    return(abv)
  }  # end abbrev function
  gen.code <- abbrev(gen, nchar.gen)
  sp.code <- abbrev(sp, nchar.sp)
  spcode <- paste(gen.code, sp.code, sep = sep.spcode)

  if (nchar.ssp > 0) {
    ssp <- vector("character", length(splits))
    for (i in 1:length(splits)) {
      ssp[i] <- splits[[i]][3]
      if (is.na(ssp[i]))  ssp[i] <- ""
    }
    ssp.code <- abbrev(ssp, nchar.ssp)
    spcode <- paste(spcode, ssp.code, sep = sep.spcode)
  }

  if (length(unique(species)) != length(unique(spcode))) {
    data <- cbind(species, spcode)
    unique.data <- data[!duplicated(data[ , c("species")]), ]
    ordered.data <- unique.data[order(unique.data[, "spcode"]), ]
    duplicates <- duplicated(ordered.data[ , "spcode"])
    for (i in 1:length(duplicates)) {
      if (isTRUE(duplicates[i])) duplicates[i-1] <- TRUE
    }  # to include the original as well as the duplicate
    ordered.data.duplicates <- ordered.data[duplicates, ]
    print(ordered.data.duplicates)
    stop("Resulting species codes are not unique (see above); try a different combination of nchar.gen, nchar.sp (and nchar.ssp), or check that you have specified the correct sep.species.")
  }
  message("OK - no duplicated spcodes found.")
  return(spcode)
}  # end spCodes function


If you have a vector named "species" containing your species names, with the genus and the specific name separated by a white space, load the function above and then type:

spcode <- spCodes(species, nchar.gen = 3, nchar.sp = 3, nchar.ssp = 0, sep.species = " ", sep.spcode = "")

This will create species codes with the first 3 characters of the genus and the first 3 characters of the specific name (although you can set any other numbers of characters, such as 1 + 5). If the resulting spcodes are not unique (i.e., if the number of unique species names is not the same as the number of unique species codes), you will get an error message showing the duplicates and asking you to try another combination of numbers of characters for the genus and specific name.

If your input species binomials are separated by a character other than a space (e.g. a dot or an underscore), you can specify this with the sep.species argument; note that all species names must be separated by the same single character (e.g., ONE white space) for the function to work correctly. By default, there will be no character separating the two parts of the spcode, but you can use the sep.spcode argument to define one (e.g., sep.spcode = "_").
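A quick hypothetical example, using an underscore to separate the two parts of each code:

species <- c("Podarcis hispanica", "Podarcis muralis")  # hypothetical input vector

spCodes(species, nchar.gen = 3, nchar.sp = 3, sep.spcode = "_")

# should return "Pod_his" "Pod_mur", with a message that no duplicated spcodes were found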

If you have a table called mydata with species names in columns 2 to 30, for example, you can turn them into species codes like this:

spcodes <- spCodes(names(mydata)[2:30], nchar.gen = 3, nchar.sp = 3, sep.species = " ", sep.spcode = "")

names(mydata)[2:30] <- spcodes


References

Barbosa A.M. (2014) fuzzySim: Fuzzy similarity in species’ distributions. R package, version 1.0.

Determine the percentage of data to reserve for model testing

Based on the work of Schaafsma & van Vark (1979), Huberty (1994) provided a heuristic (‘rule of thumb’) for determining an adequate proportion of data to set aside for testing species presence/absence models, based on the number of predictor variables that are used (Fielding & Bell 1997). The percentTestData function, now included in the fuzzySim package (Barbosa 2015), calculates this proportion as a percentage:

percentTestData <- function(nvar) {
  # v1.1 (28 Mar 2013)
  # heuristic ('rule of thumb') of Huberty (1994; in Fielding & Bell 1997) to determine the test/training data ratio for presence-absence models
  # nvar: number of predictor variables
  huberty <- 1 / (1 + sqrt(nvar - 1))
  return(round(100 * huberty, 1))
}


For example, if you’re building a model based on 15 variables, load the percentTestData function and then just type:

percentTestData(15)
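This should return 21.1: by Huberty's heuristic, 1 / (1 + sqrt(15 - 1)) ≈ 0.211, so roughly 21% of the observations would be set aside for testing and the remaining 79% used for model training.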

References

Barbosa A.M. (2015) fuzzySim: applying fuzzy logic to binary similarity indices in ecology. Methods in Ecology and Evolution, 6: 853-858

Fielding A.H. & Bell J.F. (1997) A review of methods for the assessment of prediction errors in conservation presence/absence models. Environmental Conservation 24: 38-49

Huberty C.J. (1994) Applied Discriminant Analysis. Wiley, New York, 466 pp.

Schaafsma W. & van Vark G.N. (1979) Classification and discrimination problems with applications. Part IIa. Statistica Neerlandica 33: 91-126

GLMs with multiple species and options

The multGLM function, now included in the fuzzySim package (Barbosa, 2015), automatically calculates binomial GLMs for one or more species (or other binary variables) in a data frame. The function can optionally perform stepwise variable selection (and does so by default) instead of forcing all variables into the models, starting either from the null model (the default, so selection starts forward) or from the full model (so selection starts backward), and using either AIC (Akaike's Information Criterion) or BIC (Bayesian Information Criterion, also known as Schwarz criterion, SBC or SBIC) as the selection criterion. Instead, or subsequently, it can also perform stepwise removal of non-significant variables from the models using the modelTrim function. There is also an optional preliminary selection of variables with a significant/informative bivariate relationship with the response, based on the false discovery rate (with different correction methods available), for which you need the FDR function (but note that some variables may be significant and informative in a multivariate model even if they would not have been selected by FDR). The function can also do a preliminary selection of non-correlated variables, which requires the corSelect function, available in fuzzySim since version 1.5. If you want favourability (F) to be calculated (as it is by default), you need the Fav function as well. By default, all data are used in model training, but you can define an optional test.sample to be reserved for model testing afterwards.

multGLM <- function (data, sp.cols, var.cols, id.col = NULL,
family = "binomial", test.sample = 0, FDR = FALSE, correction = "fdr", corSelect = FALSE, cor.thresh = 0.8, step = TRUE, trace = 0, start = "null.model", direction = "both", select = "AIC", trim = TRUE, Y.prediction = FALSE, P.prediction = TRUE, Favourability = TRUE, group.preds = TRUE, 
TSA = FALSE, coord.cols = NULL, degree = 3, verbosity = 2, ...) 
{
    # version 5.1 (6 Jan 2020)
    start.time <- Sys.time()
    on.exit(timer(start.time))
    input.ncol <- ncol(data)
    input.names <- colnames(data)
    stopifnot(as.vector(na.omit(as.matrix(data[, sp.cols]))) %in% 
        c(0, 1), all(sp.cols %in% (1:input.ncol)) || all(sp.cols %in% 
        input.names), all(var.cols %in% (1:input.ncol)) || all(var.cols %in% 
        input.names), is.null(id.col) || (length(id.col) == 1 && 
        (id.col %in% (1:input.ncol) || id.col %in% input.names)), 
        is.null(coord.cols) || all(coord.cols %in% (1:input.ncol)) || 
            all(coord.cols %in% input.names), family == "binomial", 
        test.sample >= 0 | test.sample == "Huberty", length(test.sample) == 
            1 | (is.integer(test.sample) & test.sample > 0), 
        length(test.sample) < nrow(data), is.logical(FDR), is.logical(step), 
        start %in% c("null.model", "full.model"), direction %in% 
            c("both", "backward", "forward"), select %in% c("AIC", 
            "BIC"), is.logical(Y.prediction), is.logical(P.prediction), 
        is.logical(Favourability), is.logical(group.preds), is.logical(trim), 
        is.logical(TSA), !Favourability | exists("Fav"), !trim | 
            exists("modelTrim"))
    data$sample <- "train"
    n <- nrow(data)
    data.row <- 1:n
    test.sample.input <- test.sample
    if (length(test.sample) == 1) {
        if (test.sample == "Huberty") {
            if (!FDR & !step & !trim) {
                test.sample <- percentTestData(length(var.cols))/100
                n.test <- round(n * test.sample)
                if (verbosity > 0) 
                  message("Following Huberty's rule, ", test.sample * 
                    100, "% of observations\n          (", n.test, 
                    " out of ", n, ") set aside for model testing; ", 
                    n - n.test, " observations used for model training.")
            }
            else stop("Sorry, Huberty's rule cannot be used with 'FDR', 'step' or 'trim',\n                   as these make the number of variables vary.\n                   Set these 3 parameters to FALSE,\n                   or use a different 'test.sample' option.")
        }
        else if (test.sample == 0) {
            if (verbosity > 0) 
                message("All ", n, " observations used for model training;\n              none reserved for model testing.")
            n.test <- 0
        }
        else if (test.sample < 1) {
            n.test <- round(n * test.sample)
            if (verbosity > 0) 
                message(test.sample * 100, "% of observations (", 
                  n.test, " out of ", n, ") set aside for model testing; ", 
                  n - n.test, " observations used for model training.")
        }
        else if (test.sample >= 1) {
            n.test <- test.sample
            if (verbosity > 0) 
                message(n.test, " (out of ", n, ") observations set aside for model testing; ", 
                  n - n.test, " observations used for model training.")
        }
        test.sample <- sample(data.row, size = n.test, replace = FALSE)
    }
    else if (length(test.sample) > 1) {
        n.test <- length(test.sample)
        if (verbosity > 0) 
            message(n.test, " (out of ", n, ") observations set aside for model testing; ", 
                n - n.test, " observations used for model training.")
    }
    data$sample[data.row %in% test.sample] <- "test"
    train.data <- data[data$sample == "train", ]
    if (Favourability) {
        if (family != "binomial") {
            Favourability <- FALSE
            warning("Favourability is only applicable to binomial responses,\n               so it could not be calculated")
        }
    }
    if (!is.numeric(sp.cols)) 
        sp.cols <- which(colnames(data) %in% sp.cols)
    if (!is.numeric(var.cols)) 
        var.cols <- which(colnames(data) %in% var.cols)
    if (!is.null(coord.cols) && !is.numeric(coord.cols)) 
        coord.cols <- which(colnames(data) %in% coord.cols)
    if (!is.null(id.col) && !is.numeric(id.col)) 
        id.col <- which(colnames(data) %in% id.col)
    keeP <- P.prediction
    if (Favourability) 
        P.prediction <- TRUE
    n.models <- length(sp.cols)
    n.preds <- n.models * (Y.prediction + P.prediction + Favourability)
    models <- vector("list", n.models)
    predictions <- matrix(NA, nrow = nrow(data), ncol = n.preds)
    colnames(predictions) <- rep("", n.preds)
    model.count <- 0
    pred.count <- 0
    for (s in sp.cols) {
        model.count <- model.count + 1
        response <- colnames(train.data)[s]
        if (verbosity > 0) 
            cat("\n\n=> Building model ", model.count, " of ", 
                n.models, " (", response, ")...\n\n", sep = "")
        if (verbosity > 1) 
            cat(length(var.cols), "input predictor variable(s)\n\n")
        if (TSA) {
            if (verbosity > 1) 
                cat("...plus the spatial trend variable\n\n")
            tsa <- suppressMessages(multTSA(data = data, sp.cols = s, 
                coord.cols = coord.cols, degree = degree, type = "Y"))
            tsa_name <- paste("sptrend", response, sep = "_")
            data <- data.frame(data, tsa[, ncol(tsa)])
            names(data)[ncol(data)] <- tsa_name
            train.data <- data.frame(train.data, tsa[which(data$sample == 
                "train"), ncol(tsa)])
            names(train.data)[ncol(train.data)] <- tsa_name
            var.cols <- c(var.cols, ncol(train.data))
        }
        if (FDR) {
            fdr <- FDR(data = train.data, sp.cols = s, var.cols = var.cols, 
                correction = correction, verbose = FALSE)
            if (nrow(fdr$select) == 0) {
                message(paste0("No variables passed the FDR test (so no variables included in the model)\n for '", 
                  response, "'. Consider using 'FDR = FALSE' or choosing a less stringent 'correction' procedure."))
            }
            if (verbosity > 1) 
                cat(length(var.cols) - nrow(fdr$select), "variable(s) excluded by 'FDR' function\n", 
                  paste(row.names(fdr$exclude), collapse = ", "), 
                  "\n\n")
            sel.var.cols <- which(colnames(train.data) %in% rownames(fdr$select))
        }
        else sel.var.cols <- var.cols
        if (length(sel.var.cols) > 1 && corSelect == TRUE) {
            corselect <- suppressMessages(corSelect(data = train.data, 
                sp.cols = s, var.cols = sel.var.cols, cor.thresh = cor.thresh, 
                use = "pairwise.complete.obs"))
            corsel.var.cols <- corselect$selected.var.cols
            if (verbosity > 1) 
                cat(length(sel.var.cols) - length(corsel.var.cols), 
                  "variable(s) excluded by 'corSelect' function\n", 
                  corselect$excluded.vars, "\n\n")
            sel.var.cols <- corsel.var.cols
        }
        if (length(sel.var.cols) == 0) 
            model.vars <- 1
        else model.vars <- colnames(train.data)[sel.var.cols]
        model.formula <- with(train.data, as.formula(paste(response, 
            "~", paste(model.vars, collapse = "+"))))
        model.expr <- expression(glm(model.formula, family = binomial))
        if (step && length(sel.var.cols) > 0) {
            n.vars.start <- length(sel.var.cols)
            if (select == "AIC") 
                K <- 2
            else if (select == "BIC") 
                K <- log(n)
            if (start == "full.model") {
                model <- step(eval(model.expr), direction = direction, 
                  trace = trace, k = K)
            }
            else if (start == "null.model") {
                model.scope <- model.formula[-2]
                null.formula <- as.formula(paste(response, "~", 
                  1))
                model <- step(glm(null.formula, family = binomial, 
                  data = train.data), direction = direction, 
                  scope = model.scope, trace = trace, k = K)
            }
            else stop("'start' must be either 'full.model' or 'null.model'")
            n.vars.step <- length(model$coefficients) - 1
            excluded.vars <- setdiff(colnames(data[, sel.var.cols]), 
                names(model$coefficients)[-1])
            if (verbosity > 1) 
                cat(n.vars.start - n.vars.step, "variable(s) excluded by 'step' function\n", 
                  paste(excluded.vars, collapse = ", "), "\n\n")
        }
        else model <- eval(model.expr, envir = train.data)
        if (trim && length(sel.var.cols) > 0) {
            n.vars.start <- length(model$coefficients) - 1
            names.vars.start <- names(model$coefficients)[-1]
            model <- suppressMessages(modelTrim(model, ...))
            n.vars.trim <- length(model$coefficients) - 1
            excluded.vars <- setdiff(names.vars.start, names(model$coefficients)[-1])
            if (verbosity > 1) 
                cat(n.vars.start - n.vars.trim, "variable(s) excluded by 'modelTrim' function\n", 
                  paste(excluded.vars, collapse = ", "), "\n\n")
        }
        if (step || trim) {
            sel.var.names <- names(model$coefficients)[-1]
            if (verbosity > 1) 
                cat(length(sel.var.names), "variable(s) INCLUDED IN THE FINAL MODEL\n", 
                  paste(sel.var.names, collapse = ", "))
        }
        if (verbosity > 1) 
            cat("\n\n")
        models[[model.count]] <- model
        names(models)[[model.count]] <- response
        if (Y.prediction) {
            pred.count <- pred.count + 1
            colnames(predictions)[pred.count] <- paste(response, 
                "Y", sep = "_")
            predictions[, pred.count] <- predict(model, data)
        }
        if (P.prediction) {
            pred.count <- pred.count + 1
            colnames(predictions)[pred.count] <- paste(response, 
                "P", sep = "_")
            predictions[, pred.count] <- predict(model, data, 
                type = "response")
        }
        if (Favourability) {
            n1 <- sum(train.data[, s] == 1, na.rm = TRUE)
            n0 <- sum(train.data[, s] == 0, na.rm = TRUE)
            pred.count <- pred.count + 1
            predictions[, pred.count] <- Fav(n1n0 = c(n1, n0), 
                pred = predictions[, pred.count - 1])
            colnames(predictions)[pred.count] <- paste(response, 
                "F", sep = "_")
        }
        if (TSA) {
            train.data <- train.data[, -ncol(train.data)]
            data <- data[, -ncol(data)]
            var.cols <- var.cols[-length(var.cols)]
        }
    }
    if (P.prediction && !keeP) {
        n.char <- nchar(colnames(predictions))
        pred.suffix <- substr(colnames(predictions), n.char - 
            1, n.char)
        P.cols <- grep("_P", pred.suffix)
        predictions <- predictions[, -P.cols]
    }
    n.pred.types <- sum(Y.prediction, keeP, Favourability)
    if (n.pred.types == 0) {
        predictions <- data.frame()
    }
    else {
        predictions <- data.frame(data[, id.col], sample = data[, 
            "sample"], predictions)
        if (n.pred.types == 1 || length(sp.cols) == 1) 
            group.preds <- FALSE
        if (group.preds) {
            first.pred.col <- ifelse(is.null(id.col), 2, 3)
            pred1.cols <- seq(first.pred.col, ncol(predictions), 
                by = n.pred.types)
            pred2.cols <- seq(first.pred.col + 1, ncol(predictions), 
                by = n.pred.types)
            pred3.cols <- NULL
            if (n.pred.types == 3) {
                pred3.cols <- seq(first.pred.col + 2, ncol(predictions), 
                  by = n.pred.types)
            }
            predictions <- data.frame(data[, id.col], sample = data$sample, 
                predictions[, pred1.cols], predictions[, pred2.cols], 
                predictions[, pred3.cols])
        }
        if (!is.null(id.col)) {
            colnames(predictions)[1] <- input.names[id.col]
        }
    }
    if (test.sample.input == 0) {
        predictions <- predictions[, -match("sample", colnames(predictions))]
    }
    if (!is.null(id.col)) {
        colnames(predictions)[1] <- colnames(data)[id.col]
    }
    selected_variables <- lapply(models, function(x) names(x$model)[-1])
    message("Finished!")
    return(list(predictions = predictions, models = models, variables = selected_variables))
}
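A minimal usage sketch, assuming fuzzySim (or the set of functions mentioned above) is loaded and a data frame mydata has an ID in column 1, species presence-absence data in columns 2 to 4, and predictor variables in columns 5 to 12 (all names and column positions hypothetical):

mods <- multGLM(data = mydata, sp.cols = 2:4, var.cols = 5:12, id.col = 1, FDR = TRUE, step = TRUE, trim = TRUE, Favourability = TRUE)

head(mods$predictions)  # probability (P) and favourability (F) predictions per species

summary(mods$models[[1]])  # the model for the first species

mods$variables  # variables retained in each final model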

References

Barbosa A.M. (2015) fuzzySim: applying fuzzy logic to binary similarity indices in ecology. Methods in Ecology and Evolution, 6: 853-858

Plot a generalized linear model

The plotGLM function, now included in the modEvA package (Barbosa et al. 2014), plots the observed (presence/absence) data and the predicted (probability) values of a generalized linear model against the logit (y) values of the regression equation. For now, it works only for logistic regression (binomial response, logit link). Optionally (if plot.values = TRUE) it can display the proportion of explained deviance in the plot (Weisberg 1980, Guisan & Zimmermann 2000); this requires the Dsquared function to be loaded, and a model object (rather than just the observed and predicted values) if you also want the adjusted deviance.

plotGLM <-
function(obs = NULL, pred = NULL, model = NULL, link = "logit", 
                    plot.values = FALSE, xlab = "Logit (y)", 
                    ylab = "Predicted probability", main = "Model plot", ...) {
  # version 1.7 (28 Oct 2014)
  # obs: presence/absence or other binary (1-0) observed data
  # pred: values predicted by a GLM of the binary observed data
  # model: instead of (and overriding) obs & pred, you can provide a model object of class "glm"
  # link: link function of the GLM; only 'logit' is implemented
  # plot.values: logical, whether to report the values of deviance and adjusted deviance in the plot (requires the 'model' argument and the 'Dsquared' function)
  # ...: arguments to pass to the 'plot' function
 
  model.provided <- ifelse(is.null(model), FALSE, TRUE)
 
  if (model.provided) {
    if (!is.null(pred)) message("Argument 'pred' ignored in favour of 'model'.")
    if (!is.null(obs)) message("Argument 'obs' ignored in favour of 'model'.")
    obs <- model$y
    pred <- model$fitted.values
  }  else { # if model not provided
    if (is.null(obs) | is.null(pred)) stop("You must provide either 'obs' and 'pred', or a 'model' object of class 'glm'")
  }  # end if model
 
  stopifnot(
    length(obs) == length(pred),
    obs %in% c(0, 1),
    pred >= 0,
    pred <= 1)
 
  logit <- log(pred / (1 - pred))
  if (!model.provided) model <- glm(obs ~ logit, family = "binomial")
 
  plot(pred ~ logit, ylim = c(0, 1), type = "n", xlab = xlab, ylab = ylab, 
       main = main, ...)
  points(obs ~ logit, pch = 1, col = "darkgrey")
  abline(v = 0, lty = 5, col = "grey")  # y of P = 0.5
  #abline(h = pred[which.min(abs(logit))], col = "lightgrey", lty = 2)
  points(pred ~ logit, pch = 20, cex = 0.6)
 
  if (plot.values) {
    Dsq <- Dsquared(model = model, adjust = FALSE)
    if (max(logit) > abs(min(logit)))  x.loc <- c(max(logit), 1)
    else x.loc <- c(min(logit), 0)
    text(x = x.loc[1], y = 0.6, adj = x.loc[2], labels = substitute(paste(D^2 == a), list(a = round(Dsq, 3))))
    if (model.provided) {  # adjDsq needs n parameters in original model, not just our model created from obs~logit
      adjDsq <- Dsquared(model = model, adjust = TRUE)
      text(x = x.loc[1], y = 0.4, adj = x.loc[2], labels = substitute(paste('D'['adj']^2) == a, list(a = round(adjDsq, 3))))  # http://stackoverflow.com/questions/10156417/subscripts-in-plots-in-r, plus a mail on substitute(paste) in Silwod-R
 
    }
  }  # end if plot.val
}


For example, if you have a table called mydata with your species’ presence/absence in column 3 and your predicted values in column 20, paste the plotGLM function into R and then run this command:

plotGLM(obs = mydata[ , 3], pred = mydata[ , 20])

Instead of obs and pred, you can provide a model object:

# mymodel <- glm(myspecies ~ var1 + var2 ……)

plotGLM(model = mymodel, plot.values = TRUE)


Grey circles are the observed absences (0) and presences (1); black dots are the corresponding predicted values. The vertical line at y = 0 corresponds to a predicted value of 0.5 (the inflection point). You can add a legend, as well as alternative or additional titles and labels, e.g. with arguments ylab, xlab and main (see ?plot).

References

Barbosa A.M., Brown J.A. & Real R. (2014) modEvA – an R package for model evaluation and analysis. R package, version 0.1.

Guisan, A. & Zimmermann, N.E. (2000). Predictive habitat distribution models in ecology. Ecological Modelling 135: 147-186

Weisberg, S. (1980). Applied Linear Regression. Wiley, New York

Overlap Analysis

Overlap Analysis is one of the simplest forms of modelling species’ distributions. It assesses the ranges of values of the given predictor variables at the sites where a species has been recorded present, and predicts where that species should be able to occur based on those presence data (e.g. Brito et al. 1999, Arntzen & Teixeira 2006).

OA, now included in the modEvA package (Barbosa et al. 2014), can also be useful when extrapolating models outside their original scope (geographical area, time period or spatial resolution), as it can identify which localities are within the model’s domain — i.e., within the analysed ranges of values of the variables, outside which the model may not be reliable (e.g. Barbosa et al. 2009). In this case, sp.cols should not contain the species’ presences/absences, but rather indicate (with value 1) all the observations that were included in the model. Cf. the modEvA function MESS, which is more informative but also more computationally intensive.
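As a sketch of this domain check, assuming mydata has a (hypothetical) column named 'modelled' holding 1 for the observations included in the model and 0 elsewhere, with the predictor variables in columns 2 to 12:

in_domain <- OA(data = mydata, sp.cols = which(names(mydata) == "modelled"), var.cols = 2:12)  # 1 = within the analysed range of every variable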

OA <-
function(data, sp.cols, var.cols) {
  # version 2.1 (13 May 2014)
 
  if (length(sp.cols) > 1) stop ("Sorry, OA is currently implemented for only one response variable at a time, so 'sp.cols' must indicate only one column")
 
  input.data <- data
 
  na.rm <- TRUE  # rows with NA in the response or predictor columns are excluded below
  if(na.rm) {
    mod.data <- data[ , c(sp.cols, var.cols)]
    data <- data[complete.cases(mod.data), ]
  }
 
  predictors <- data[ , var.cols]
  nvar <- ncol(predictors)
 
  response <- data[ , sp.cols]
  predictors.presence <- predictors[response > 0, ]
  var.presmin <- var.presmax <- vector("numeric", nvar)
  predictors.overlap <- matrix(data = NA, nrow = nrow(predictors), ncol = nvar)
  for (v in 1:nvar) {
    var.presmin[v] <- min(predictors.presence[ , v])
    var.presmax[v] <- max(predictors.presence[ , v])
    predictors.overlap[ , v] <- ifelse((predictors[ , v] >= var.presmin[v] & predictors[ , v] <= var.presmax[v]), 1, 0)
  }  # end for v
  overlap.noNA <- as.integer(rowSums(predictors.overlap) == nvar)
  return(overlap.noNA)
}


NOTE that the input data format for this function changed in mid May 2014: the former parameters response and predictors are now replaced with data, sp.cols and var.cols (for coherence with multGLM and other functions in modEvA). Input data for the OA function are now a matrix or data frame containing the response variable (presences vs. absences of a species if we want to model its occurrence, or modelled vs. non-modelled sites if we want to know which non-modelled sites are within the modelled range), and the predictor variables to consider. The output is a binary vector with 1 where the values of all predictors lie within the ranges observed for the presence records, and 0 otherwise. For example, if you have a table called mydata with your species’ presence data in the first column and your predictor variables in columns 2 to 12, paste the OA function into R followed by this command:

OA(data = mydata, sp.cols = 1, var.cols = 2:12)

References

Arntzen JW, Teixeira J. 2006. History and new developments in the mapping and modelling of the distribution of the golden-striped salamander, Chioglossa lusitanica. Zeitschrift für Feldherpetologie, Supplement: 1-14.

Barbosa A.M., Brown J.A. & Real R. (2014) modEvA – an R package for model evaluation and analysis. R package, version 0.1.

Barbosa, A.M., Real, R. & Vargas, J.M. (2009) Transferability of environmental favourability models in geographic space: the case of the Iberian desman (Galemys pyrenaicus) in Portugal and Spain. Ecological Modelling 220: 747–754.

Brito JC, Crespo EG, Paulo OS 1999. Modelling wildlife distributions: Logistic Multiple Regression vs Overlap Analysis. Ecography 22: 251-260.

Variation partitioning

The varPart function, now included in the modEvA package (Barbosa et al. 2014), performs variation partitioning (Borcard et al. 1992) among two factors (e.g. Ribas et al. 2006) or three factors (e.g. Real et al. 2003) for either multiple linear regression models (LM) or generalized linear models (GLM).

If you have linear models, input data for varPart are the coefficients of determination (R-squared values) of the linear regressions of the target variable on all the variables in the model, on the variables related to each particular factor, and (when there are 3 factors) on the variables related to each pair of factors. The outputs are the amounts of variance explained exclusively by each factor, the amounts explained exclusively by the overlapping effects of each pair of factors, and the amount explained by the overlap of the 3 factors if this is the case (e.g. Real et al. 2003). The amount of variation not explained by the complete model is also provided.
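With two factors, for example, the arithmetic is straightforward: in a hypothetical case where R-squared is 0.4 for factor A, 0.3 for factor B and 0.6 for the complete model, pure A = 0.6 - 0.3 = 0.3, pure B = 0.6 - 0.4 = 0.2, their overlap = 0.6 - 0.3 - 0.2 = 0.1, and 1 - 0.6 = 0.4 of the variation remains unexplained:

varPart(A = 0.4, B = 0.3, AB = 0.6, A.name = "Climate", B.name = "Land use")  # hypothetical R-squared values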

If you have generalized linear models (GLMs) such as logistic regression or the favourability function, you have no true R-squared values from these models; inputs can be squared coefficients of (Spearman’s) correlation between the model predictions given by each factor or pair of factors and the predictions of the complete model (e.g. Muñoz & Real 2006), or the R-squared values of the corresponding logit (y) functions (Real et al. 2013), or an adjusted R-squared (De Araújo et al. 2014). Consequently, output values are not the total amounts of variance (of the target variable) explained by factors and overlaps, but rather their proportional contribution to the total variation explained by the model. The amount of unexplained variation is not available for GLMs.

varPart <- function(A, B, C = NA, AB, AC = NA, BC = NA, ABC = NA,
         model.type = NULL, A.name = "Factor A", B.name = "Factor B", 
         C.name = "Factor C", plot = TRUE, plot.digits = 3, cex.names = 1.5, 
         cex.values = 1.2, main = "", cex.main = 2, plot.unexpl = TRUE) {
 
  # version 1.7 (17 Mar 2017)
 
  if (!is.null(model.type)) message ("NOTE: Argument 'model.type' is no longer used.")
 
  partials <- c(A, B, C, AB, BC, AC, ABC)
  if (all(is.finite(partials[c(1:2, 4)])) && all(is.na(partials[c(3, 5:7)])))
    twofactors <- TRUE
  else if (all(is.finite(partials)))
    twofactors <- FALSE
  else stop ("You must provide numeric values for either A, B and AB (for variation partitioning among two factors) or A, B, C, AB, BC, AC and ABC (for variation partitioning among three factors). See Details.")
 
  if (!all(na.omit(partials) >= 0 & na.omit(partials) <= 1)) stop ("Values must be between 0 and 1.")
 
  totalexpl <- ifelse(twofactors, AB, ABC)
  unexpl <- 1 - totalexpl
 
  if (twofactors) {
    Apure <- totalexpl - B  # variation explained exclusively by factor A
    Bpure <- totalexpl - A  # variation explained exclusively by factor B
    ABoverlap <- totalexpl - Apure - Bpure  # overlap between the two factors
    output.names <- c(paste("Pure", A.name), paste("Pure", B.name),
                      paste0("Pure ", A.name, "_", B.name, " overlap"),
                      "Unexplained")
    results <- data.frame(c(Apure, Bpure, ABoverlap, unexpl),
                          row.names = output.names)
 
  } else { # end if 2 factors
 
    Apure <- totalexpl - BC  # variation explained exclusively by each factor
    Bpure <- totalexpl - AC
    Cpure <- totalexpl - AB
    ABoverlap <- totalexpl - Apure - Bpure - C  # pairwise overlaps
    BCoverlap <- totalexpl - Bpure - Cpure - A
    ACoverlap <- totalexpl - Apure - Cpure - B
    ABCoverlap <- totalexpl - Apure - Bpure - Cpure - ABoverlap - BCoverlap - ACoverlap
    output.names <- c(paste("Pure", A.name),
                      paste("Pure", B.name),
                      paste("Pure", C.name),
                      paste0("Pure ", A.name, "_", B.name, " overlap"),
                      paste0("Pure ", B.name, "_", C.name, " overlap"),
                      paste0("Pure ", A.name,"_", C.name," overlap"),
                      paste0(A.name,"_",B.name,"_",C.name," overlap"),
                      "Unexplained")
    results <- data.frame(c(Apure, Bpure, Cpure, ABoverlap, BCoverlap,
                            ACoverlap, ABCoverlap, unexpl),
                          row.names = output.names)
  }  # end else
 
  colnames(results) <- "Proportion"
 
  if (plot) {  # adapted from Daniel's http://stackoverflow.com/questions/1428946/venn-diagrams-with-r
 
    circle <- function(x, y, r) {
      ang <- seq(0, 2*pi, length = 100)
      xx <- x + r * cos(ang)
      yy <- y + r * sin(ang)
      polygon(xx, yy)
    }  # end circle function (by Daniel)
 
    Apure <- round(Apure, plot.digits)  # shorten values for plotting
    Bpure <- round(Bpure, plot.digits)
    ABoverlap <- round(ABoverlap, plot.digits)
    if (!twofactors) {
      Cpure <- round(Cpure, plot.digits)
      BCoverlap <- round(BCoverlap, plot.digits)
      ACoverlap <- round(ACoverlap, plot.digits)
      ABCoverlap <- round(ABCoverlap, plot.digits)
    }
 
    if (twofactors) {
      plot(0, 0, ylim = c(-1, 10), xlim = c(-1, 7), type = "n", axes = FALSE,
           ylab = "", xlab = "", main = main, cex.main = cex.main)
      circle(3,3,3)
      circle(3,6,3)
      text(x = c(3, 3), y = c(9.5, -0.5), labels = c(A.name, B.name),
           cex = cex.names)
      text(x = c(3, 3, 3), y = c(7, 4.75, 2), c(Apure, ABoverlap, Bpure),
           cex = cex.values)
 
    } else { # end if 2 factors
 
      plot(0, 0, ylim = c(-1, 10), xlim = c(-1, 10), type = "n", axes = FALSE,
           ylab = "", xlab = "", main = main, cex.main = cex.main)
      circle(3, 6, 3)
      circle(6, 6, 3)
      circle(4.5, 3,  3)
      text(x = c(2.5, 6.5, 4.5), y = c(9.5, 9.5, -0.5),
           labels = c(A.name, B.name, C.name), cex = cex.names, adj = c(0.5, 0.5, 0))
      text(x = c(1.8, 7.2, 4.5, 4.5, 2.8, 6.2, 4.5), y = c(6.6, 6.6, 2, 7, 4, 4, 5), labels = c(Apure, Bpure, Cpure, ABoverlap, ACoverlap, BCoverlap, ABCoverlap), cex = cex.values)
    } # end if 2 factors else
 
    if (plot.unexpl)  {
      rect(-1, -1, 10, 10)
      text(x = -0.9, y = -0.2, label = paste0("Unexplained\n", round(unexpl, plot.digits)), adj = 0, cex = cex.values)
    }
 
  }  # end if plot
  results
}


You just need to copy the function above, paste it into R and press enter. An example of how to use it then is provided below. The results are returned numerically and also plotted in a diagram:

# if you have a linear model (LM), use (non-adjusted) R-squared values 
# for each factor and for their combinations as inputs:
 
varPart(A = 0.456, B = 0.315, C = 0.281, AB = 0.051, BC = 0.444, 
AC = 0.569, ABC = 0.624, A.name = "Spatial", B.name = "Human", 
C.name = "Environmental", main = "Small whale")
 
# if you have a generalized linear model (GLM), 
# you can use squared correlation coefficients 
# of the predictions of each factor with those of the complete model:
 
varPart(A = (-0.005)^2, B = 0.698^2, C = 0.922^2, AB = 0.696^2, 
BC = 0.994^2, AC = 0.953^2, ABC = 1, A.name = "Topographic", 
B.name = "Climatic", C.name = "Geographic", main = "Big bird", 
plot.unexpl = FALSE)

You can change the number of digits in the plotted values with the argument plot.digits. You can also change the character sizes of the plot’s main title, circle names, and values, using arguments cex.main, cex.names and cex.values. Note that these results derive from arithmetic operations between your input values, and they always sum up to 1; if your input is incorrect, the results will be incorrect as well.

References

Barbosa A.M., Brown J.A., Jiménez-Valverde A. & Real R. (2014) modEvA – an R package for model evaluation and analysis. R package, version 0.1.

Borcard D, Legendre P, Drapeau P (1992) Partialling out the spatial component of ecological variation. Ecology 73: 1045-1055

De Araújo, C.B., Marcondes-Machado, L.O. & Costa, G.C. (2014) The importance of biotic interactions in species distribution models: a test of the Eltonian noise hypothesis using parrots. Journal of Biogeography 41: 513-523 (DOI: 10.1111/jbi.12234)

Muñoz, A.-R. & Real, R. (2006) Assessing the potential range expansion of the exotic monk parakeet in Spain. Diversity and Distributions 12: 656–665.

Real, R., Barbosa, A.M., Porras, D., Kin, M.S., Márquez, A.L., Guerrero, J.C., Palomo, L.J., Justo, E.R. & Vargas, J.M. (2003) Relative importance of environment, human activity and spatial situation in determining the distribution of terrestrial mammal diversity in Argentina. Journal of Biogeography 30: 939-947.

Real R., Romero D., Olivero J., Estrada A. & Márquez A.L. (2013) Estimating how inflated or obscured effects of climate affect forecasted species distribution. PLoS ONE 8: e53646. doi:10.1371/journal.pone.0053646

Ribas, A., Barbosa, A.M., Casanova, J.C., Real, R., Feliu, C. & Vargas, J.M. (2006) Geographical patterns of the species richness of helminth parasites of moles (Talpa spp.) in Spain: separating the effect of sampling effort from those of other conditioning factors. Vie et Milieu 56: 1-8.

Sensitivity-specificity curves

**NOTE: this function is now deprecated; please use the optiPair function instead**

The SensSpecCurve function calculates and plots sensitivity (black circles) and specificity (white circles) for all thresholds at a given interval. The plot marks the threshold and the sensitivity value at which sensitivity and specificity are closest to each other. Optionally, you can also plot the difference (minus signs), sum (plus signs), and/or mean (grey line) of sensitivity and specificity, together with their optimal values and respective thresholds. All these values are also provided in the text output of the function.

SensSpecCurve <- function (obs, pred, interval = 0.01, plot = TRUE, plot.diff = FALSE, plot.sum = FALSE, plot.mean = FALSE, ...) {
# obs: a vector of observed presences (1) and absences (0) or another binary response variable
# pred: a vector with the corresponding predicted values of presence probability, habitat suitability, environmental favourability or alike
# interval: the interval of thresholds at which to calculate sensitivity and specificity
# plot: logical indicating whether or not to plot the curves
# plot.sum and plot.diff will optionally plot the sum (+) and difference (-) between sensitivity and specificity
# ... : arguments to be passed to the "plot" function

  SensSpec <- optiThresh(obs = obs, pred = pred, interval = interval, measures = c("Sensitivity", "Specificity"), plot = FALSE, simplif = TRUE)
  SensSpec$Difference <- abs(SensSpec$Sensitivity - SensSpec$Specificity)
  SensSpec$Sum <- rowSums(SensSpec[,1:2])
  SensSpec$Mean <- rowMeans(SensSpec[ , 1:2])  # mean of sensitivity and specificity only
  SensSpec$Threshold <- as.numeric(rownames(SensSpec))

  MinDiff <- min(SensSpec$Difference)
  MaxSum <- max(SensSpec$Sum)
  MaxMean <- max(SensSpec$Mean)

  ThreshDiff <- with(SensSpec, Threshold[which.min(Difference)])
  ThreshSum <- with(SensSpec, Threshold[which.max(Sum)])
  ThreshMean <- with(SensSpec, Threshold[which.max(Mean)])

  if(plot){
    ymax <- ifelse(plot.sum, max(SensSpec$Sum), 1)

    with(SensSpec, plot(Sensitivity ~ Threshold, pch = 20, ylim = c(0, ymax), ylab = "Sensitivity and specificity", ...))
    with(SensSpec, points(Specificity ~ Threshold, pch = 1))
    abline(h = SensSpec[which.min(SensSpec$Difference), "Sensitivity"], col = "grey", lty = 2)
    abline(v = ThreshDiff, col = "grey", lty = 2)

    if(plot.mean) {
      with(SensSpec, lines(Mean ~ Threshold, col = "grey"))
      abline(v = ThreshMean, col = "grey", lty = 2)
      abline(h = MaxMean, col = "grey", lty = 2)
    }  # end if plot mean

    if(plot.sum) {
      with(SensSpec, points(Sum ~ Threshold, pch = '+'))
      abline(v = ThreshSum, col = "grey", lty = 2)
      abline(h = MaxSum, col = "grey", lty = 2)
    }    # end if plot sum

    if(plot.diff) {
      with(SensSpec, points(Difference ~ Threshold, pch = '-', col = "grey"))
      abline(v = ThreshDiff, col = "grey", lty = 2)
      abline(h = MinDiff, col = "grey", lty = 2)
    }  # end if plot diff
  }  # end if plot

  return(list(SensSpec = SensSpec, MinDiff = MinDiff, ThreshDiff = ThreshDiff, MaxSum = MaxSum, ThreshSum = ThreshSum, MaxMean = MaxMean, ThreshMean = ThreshMean))

}  # end SensSpecCurve function


This function uses the optiThresh function, which in turn uses threshMeasures and evaluate, so all these must be loaded along with SensSpecCurve for it to work. Then just load your observed and predicted data (see the Beginner’s guide for more information) and type or paste:

with(mydata, SensSpecCurve(obs = myspecies, pred = myspecies_P))


Optimised thresholds for model evaluation

The optiThresh function, now included in the modEvA package (Barbosa et al. 2014), calculates optimal thresholds for a number of model evaluation measures (see "Threshold-based measures of model evaluation"). Optimization is performed for each measure, and/or for all measures according to particular criteria (e.g. Jiménez-Valverde & Lobo 2007, Nenzén & Araújo 2011). Results are given numerically and in plots.


The optiThresh function depends on the threshMeasures and evaluate functions provided in the next post ("Threshold-based measures of model evaluation"), so all must be loaded for this one to work.

thresh.criteria <- c("each", "preval", "0.5", "maxKappa", "minSensSpecDiff", "maxSensSpecSum", "maxTSS")  # the last two are equivalent

optiThresh <- function(obs = NULL, pred = NULL, model = NULL, interval = 0.01, measures = threshMeasures.methods[-1], optimize = thresh.criteria, simplif = FALSE,  plot = TRUE, sep.plots = FALSE, xlab = "Threshold", ...) {
  # version 2.6 (20 Jan 2013)
  # obs: a vector of observed presences (1) and absences (0) or another binary response variable
  # pred: a vector with the corresponding predicted values of presence probability, habitat suitability, environmental favourability or alike
  # model: instead of (and overriding) obs and pred, you can provide a model object of class "glm"
  # optimize: the threshold optimization criteria to use; "each" optimizes the threshold for each model evaluation measure, while the remaining options optimize all measures according to the specified criterion
  # interval: the interval of thresholds at which to calculate evaluation measures
  # measures: model evaluation measures for which to calculate the optimal thresholds
  # plot: logical indicating whether or not to plot the values of each evaluation measure at all thresholds
  # sep.plots: logical. If TRUE, each plot is presented separately (you need to be recording R plot history to be able to browse through them all); if FALSE (the default), all plots are presented together in the same window
  # ...: additional arguments to be passed to the "plot" function
  # Note: some measures cannot be calculated for thresholds at which there are zeros in the confusion matrix, hence the eventual 'NaN' or 'Inf' in results. Also, optimization may be deceiving for some measures; use plot = TRUE and inspect the plot(s).

  if (!is.null(model)) {
    if (!is.null(obs)) message("Argument 'obs' ignored in favour of 'model'.")
    if (!is.null(pred)) message("Argument 'pred' ignored in favour of 'model'.")
    obs <- model$y
    pred <- model$fitted.values
    model <- NULL  # so the message is not repeated for each threshold
  }  # end if model

  input.measures <- measures

  if ("minSensSpecDiff" %in% optimize | "maxSensSpecSum" %in% optimize) {
    if (!("Sensitivity" %in% measures)) {
      measures <- c(measures, "Sensitivity")
    }  # end if !Sensitivity
    if (!("Specificity" %in% measures)) {
      measures <- c(measures, "Specificity")
    }  # end if !Specificity
  }  # end if minSensSpecDiff

  if("maxKappa" %in% optimize & !("kappa" %in% measures)) {
    measures <- c(measures, "kappa")
  }

  if("maxTSS" %in% optimize & !("TSS" %in% measures)) {
    measures <- c(measures, "TSS")
  }

  thresholds <- seq(from = 0, to = 1, by = interval)
  Nthresholds <- length(thresholds)
  Nmeasures <- length(measures)
  all.thresholds <- data.frame(matrix(data = NA, nrow = Nthresholds, ncol = Nmeasures), row.names = thresholds)
  colnames(all.thresholds) = measures

  for (t in 1 : Nthresholds) for (m in 1 : Nmeasures) {
    all.thresholds[t, m] <- threshMeasures(obs = obs, pred = pred, thresh = thresholds[t], measures = measures[m], standardize = FALSE, simplif = TRUE)
  }  # end for t for m

  if (simplif) {  # shorter version for use with e.g. the optiPair function
    return(all.thresholds)
  }  # end if simplif
  else {
    results <- list(all.thresholds = all.thresholds)  # start a list of results

    input.optimize <- optimize
    if (plot == TRUE & !("each" %in% optimize)) optimize <- c("each", optimize)

    if ("each" %in% optimize) {
      optimals.each <- data.frame(matrix(data = NA, nrow = Nmeasures, ncol = 4))
      colnames(optimals.each) <- c("measure", "threshold", "value", "type")
      optimals.each[1] <- measures
      goodness.measures <- c("CCR", "Sensitivity", "Specificity", "PPP", "NPP", "kappa", "TSS", "NMI", "OddsRatio")
      badness.measures <- c("Omission", "Commission", "Misclass", "UPR", "OPR")
      change.measures <- c("PPI", "PAI")

      for (m in 1 : Nmeasures) {
        if (measures[m] %in% (goodness.measures)) {  # optimal is maximum
          optimals.each[m, "threshold"] <- as.numeric(rownames(all.thresholds)[which.max(all.thresholds[ , m])])
          optimals.each[m, "value"] <- max(all.thresholds[ , m], na.rm = TRUE)
          optimals.each[m, "type"] <- "maximum"
        }  # end if measure in goodness
        else {
          if (measures[m] %in% (badness.measures)) {  # optimal is minimum
            optimals.each[m, "threshold"] <- as.numeric(rownames(all.thresholds)[which.min(all.thresholds[, m])])
            optimals.each[m, "value"] <- min(all.thresholds[ ,m], na.rm = TRUE)
            optimals.each[m, "type"] <- "minimum"
          }  # end if measure in badness
          else {
            if (measures[m] %in% (change.measures)) {  # optimal is closest to zero
              optimals.each[m, "threshold"] <- as.numeric(rownames(all.thresholds)[which.min(abs(all.thresholds[ , m]))])
              optimals.each[m, "value"] <- min(abs(all.thresholds[ , m]), na.rm = TRUE)
              optimals.each[m, "type"] <- "closest to zero"
            }  # end if measure in change
          }  # end 2nd else
        }  # end 1st else
      }  # end for m
      if ("each" %in% input.optimize)  results <- c(results, optimals.each = list(optimals.each))  # add this to results
    }  # end if each

    criteria <- optimize[optimize != "each"]
    Ncriteria <- length(criteria)

    if (Ncriteria > 0) {

      optimals.criteria <- data.frame(matrix(data = NA, nrow = Nmeasures, ncol = Ncriteria))
      rownames(optimals.criteria) <- measures
      colnames(optimals.criteria) <- criteria

      if ("preval" %in% criteria) {
        for (m in 1 : Nmeasures) {
          optimals.criteria[m, "preval"] <- threshMeasures(obs = obs, pred = pred, thresh = "preval", measures = measures[m], standardize = FALSE, simplif = TRUE)
        }
      }  # end if preval

      if ("minSensSpecDiff" %in% criteria) {
        all.thresholds$SensSpecDiff <- with(all.thresholds, abs(Sensitivity - Specificity))
        minSensSpecDiff <- thresholds[which.min(all.thresholds$SensSpecDiff)]
        for (m in 1:Nmeasures) {
          optimals.criteria[m, "minSensSpecDiff"] <- threshMeasures(obs = obs, pred = pred, thresh = minSensSpecDiff, measures = measures[m], standardize = FALSE, simplif = TRUE)
        }
      }

      if ("maxSensSpecSum" %in% criteria) {
        all.thresholds$SensSpecSum <- with(all.thresholds, Sensitivity + Specificity)
        maxSensSpecSum <- thresholds[which.max(all.thresholds$SensSpecSum)]
        for (m in 1 : Nmeasures) {
          optimals.criteria[m, "maxSensSpecSum"] <- threshMeasures(obs = obs, pred = pred, thresh = maxSensSpecSum, measures = measures[m], standardize = FALSE, simplif = TRUE)
        }
      }

      if ("maxKappa" %in% criteria) {
        if (!("kappa" %in% measures)) {
          for (t in 1 : Nthresholds) {
            all.thresholds$kappa <- threshMeasures(obs = obs, pred = pred, thresh = thresholds[t], measures = "kappa", standardize = FALSE, simplif = TRUE)
          }
        }
        maxKappa <- thresholds[which.max(all.thresholds$kappa)]
        for (m in 1 : Nmeasures) {
          optimals.criteria[m, "maxKappa"] <- threshMeasures(obs = obs, pred = pred, thresh = maxKappa, measures = measures[m], standardize = FALSE, simplif = TRUE)
        }
      }

      if ("maxTSS" %in% criteria) {
        if (!("TSS" %in% measures)) {
          for (t in 1 : Nthresholds) {
            all.thresholds$TSS <- threshMeasures(obs = obs, pred = pred, thresh = thresholds[t], measures = "TSS", standardize = FALSE, simplif = TRUE)
          }
        }
        maxTSS <- thresholds[which.max(all.thresholds$TSS)]
        for (m in 1 : Nmeasures) {
          optimals.criteria[m, "maxTSS"] <- threshMeasures(obs = obs, pred = pred, thresh = maxTSS, measures = measures[m], standardize = FALSE, simplif = TRUE)
        }
      }

      if ("0.5" %in% criteria) {
        for (m in 1 : Nmeasures) {
          optimals.criteria[m, "0.5"] <- all.thresholds[rownames(all.thresholds) == 0.5, m]
        }
      }

      results <- c(results, optimals.criteria = list(optimals.criteria))  # add this to results
    }  # end if Ncriteria > 0

    if (plot) {
      opar <- par(no.readonly = TRUE)
      on.exit(par(opar))
      n.input.measures <- length(input.measures)
      if (sep.plots) {
        par(mfrow = c(1,1))
      } else {
        if (n.input.measures > 4)  par(mar = c(1, 4.5, 1, 1))
        root <- sqrt(n.input.measures)
        par(mfrow = c(ceiling(root), round(root)))
      }  # end if sep.plots else
      for (m in 1 : n.input.measures) {
        plot(all.thresholds[ , m] ~ thresholds, ylab = input.measures[m], ...)
        if ("each" %in% input.optimize) {
          abline(v = optimals.each[m, "threshold"], col = "grey", lty = 2)  # vertical line on optimal threshold
          abline(h = optimals.each[m, "value"], col = "grey", lty = 2)  # horiz line on optimal value
        }
     }  # end for m
    }  # end if plot

    return(results)

  }  # end if simplif else
}  # end optiThresh function
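Once all the required functions are loaded, a usage sketch could look like this (with the same hypothetical column names used in the previous post):

opt <- optiThresh(obs = mydata$myspecies, pred = mydata$myspecies_P, measures = c("Sensitivity", "Specificity", "kappa", "TSS"), optimize = c("each", "maxKappa", "maxTSS"))

opt$optimals.each  # the optimal threshold and value for each measure

opt$optimals.criteria  # all measures computed at each criterion's threshold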


References

Barbosa A.M., Brown J.A. & Real R. (2014) modEvA – an R package for model evaluation and analysis. R package, version 0.1.

Jiménez-Valverde A. & Lobo J.M. (2007) Threshold criteria for conversion of probability of species presence to either–or presence–absence. Acta Oecologica, 31: 361–369.

Nenzén H.K. & Araújo M.B. (2011) Choice of threshold alters projections of species range shifts under climate change. Ecological Modelling, 222: 3346–3354.

Threshold-based measures of model evaluation

A number of measures may be used to evaluate the predictions of species distribution (or ecological niche, or bioclimatic envelope…) models against observed presence-absence data (Fielding & Bell 1997; Liu et al. 2011; Barbosa et al. 2013). The threshMeasures function, now included in the modEvA package (Barbosa et al. 2014), calculates a few of these measures. Input data are either a model object of class "glm", or a vector of the observed presences-absences (1-0) of your species plus another with the corresponding model predictions, together with the threshold value to separate predicted presences from predicted absences. By default this threshold is 0.5, but you may want to change it depending on a number of criteria (see e.g. Jiménez-Valverde & Lobo 2007, Nenzén & Araújo 2011). You can set thresh to "preval" (species’ prevalence, i.e. the proportion of presences in the input obs data) or to any numerical value between 0 and 1. You can calculate optimal threshold values according to several criteria with the optiThresh function provided in the previous post. The results of threshMeasures are given numerically and in a bar plot.
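For example, once the functions below are loaded, a usage sketch could be (with the same hypothetical column names used in the previous posts, and standardize = FALSE to avoid the additional standard01 dependency):

threshMeasures(obs = mydata$myspecies, pred = mydata$myspecies_P, thresh = 0.5, measures = c("Sensitivity", "Specificity", "TSS"), standardize = FALSE)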


# version 2.5 (20 Jan 2013)

threshMeasures.methods <- c("AUC", "CCR", "Misclass", "Sensitivity", "Specificity", "Omission", "Commission", "PPP", "NPP", "UPR", "OPR", "PPI", "PAI", "kappa", "TSS", "NMI", "OddsRatio")

evaluate <- function(a, b, c, d, N = NULL, measure = "CCR"){
  # version 1.1 (24 Jul 2013)
  # internal function to be used by others in the package
  # a, b, c, d: elements of the confusion matrix (TP, FP, FN, TN)
  # N: sample size (total number of observations); calculated automatically if NULL
  # measure: evaluation measure to use
  # note: TSS and NMI are not symmetrical ("obs" vs "pred" != "pred" vs "obs")
  # note that some measures (e.g. NMI, odds ratio) don't work with zeros in (some parts of) the confusion matrix
  if(is.null(N))  N <- a + b + c + d
  stopifnot(N == a + b + c + d)
  if(measure == "CCR") { value <- (a+d)/N
  } else if(measure == "Sensitivity") { value <- a/(a+c)
  } else if(measure == "Specificity") { value <- d/(b+d)
  } else if(measure == "Omission") { value <- c/(a+c)
  } else if(measure == "Commission") { value <- b/(b+d)
  } else if(measure == "PPP") { value <- a/(a+b)  # also called "Precision"
  } else if(measure == "NPP") { value <- d/(c+d)
  } else if(measure == "Misclass") { value <- (b+c)/N
  } else if(measure == "UPR") { value <- c/(c+d)  # this and next 3: Barbosa et al. 2013 Diversity and Distributions
  } else if(measure == "OPR") { value <- b/(a+b)
  } else if(measure == "PPI") { value <- ((a+b)/(a+c))-1
  } else if(measure == "PAI") { value <- ((c+d)/(b+d))-1
  } else if(measure == "kappa") { value <- ((a+d)-(((a+c)*(a+b)+(b+d)*(c+d))/N))/(N-(((a+c)*(a+b)+(b+d)*(c+d))/N))
  } else if(measure == "TSS") { value <- (a*d - b*c) / ((a+c) * (b+d))
  } else if(measure == "NMI") { value <- 1-((-a*log(a)-b*log(b)-c*log(c)-d*log(d)+(a+b)*log(a+b)+(c+d)*log(c+d))/(N*log(N)-((a+c)*log(a+c)+(b+d)*log(b+d))))
# NMI by Forbes (1995); "1-" missing in Fielding & Bell 1997 (and Manel et al 2001) but Fielding confirmed it was a typo
  } else if(measure == "OddsRatio") { value <- (a*d)/(c*b)
# (c*b)/(a*d)  # inverse, would give a (more expectable) unimodal plot of odds against thresholds
  } else stop("Invalid measure; type 'threshMeasures.methods' for available options.")
  return(value)
} # end evaluate function
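As a quick sanity check of the confusion-matrix conventions used above (a = true presences, b = false presences, c = false absences, d = true absences), here is a small worked example; the counts are made up purely for illustration:

evaluate(a = 30, b = 10, c = 5, d = 55, measure = "Sensitivity")  # a/(a+c) = 30/35 = 0.857
evaluate(a = 30, b = 10, c = 5, d = 55, measure = "Specificity")  # d/(b+d) = 55/65 = 0.846
evaluate(a = 30, b = 10, c = 5, d = 55, measure = "TSS")  # 0.857 + 0.846 - 1 = 0.703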

threshMeasures <- function(obs = NULL, pred = NULL, model = NULL, thresh = 0.5, measures = threshMeasures.methods, simplif = FALSE, plot = TRUE, plot.ordered = FALSE, standardize = TRUE, messages = TRUE, ...) {
  # version 2.5 (20 Jan 2013)
  # obs: a vector of observed presences (1) and absences (0) or another binary response variable
  # pred: a vector with the corresponding predicted values of presence probability, habitat suitability, environmental favourability or alike
  # model: instead of (and overriding) obs and pred, you can provide a model object of class "glm"
  # thresh: threshold value to separate predicted presences from predicted absences in 'pred'; may be "preval" or any real number between 0 and 1
  # measures: character vector of the evaluation measures to use
  # plot: logical, whether or not to produce a barplot of the calculated measures
  # plot.ordered: logical, whether to plot the measures in decreasing order rather than in input order
  # standardize: logical, whether to change measures that may range between -1 and +1 (namely kappa and TSS) to their corresponding value on the 0-to-1 scale, so that they can be compared directly with other measures
  # ...: arguments to be passed to the barplot function (if plot = TRUE)

  if(is.null(model)) {
    if (is.null(obs) | is.null(pred)) stop("You must provide either the 'obs' and 'pred' vectors, or a 'model' object of class 'glm'")
  }  # end if null model
  else {
    if (!all(class(model) %in% c("glm", "lm"))) stop("'model' must be a model object of class 'glm'")
    if (messages) {
      if (!is.null(obs)) message("Argument 'obs' ignored in favour of 'model'.")
      if (!is.null(pred)) message("Argument 'pred' ignored in favour of 'model'.")
    }
    obs <- model$y
    pred <- model$fitted.values
  }  # end if !null model

  if (NA %in% obs | NA %in% pred) stop("Please remove (rows with) NA from your data")
  if (length(obs) != length(pred)) stop("'obs' and 'pred' must have the same number of values (and in the same order)")
  if (!all(obs %in% c(0, 1))) stop ("'obs' must consist of binary observations of 0 or 1")
  if (standardize & !exists("standard01")) stop ("'standardize' requires the standard01 function; please load that function or use 'standardize = FALSE'")
  if (!(thresh == "preval" | is.numeric(thresh)))  stop("'thresh' must be either 'preval' or a numeric value between 0 and 1")
  if (thresh == "preval")  thresh <- prevalence(obs)

  # cross-tabulate observations and thresholded predictions:
  obs0 <- obs == 0
  obs1 <- obs == 1
  pred0 <- pred < thresh
  pred1 <- pred >= thresh
  a <- sum(obs1 & pred1)  # true positives
  b <- sum(obs0 & pred1)  # false positives
  c <- sum(obs1 & pred0)  # false negatives
  d <- sum(obs0 & pred0)  # true negatives
  N <- a + b + c + d

  Nmeasures <- length(measures)
  measureValues <- as.vector(rep(NA, Nmeasures), mode = "numeric")

  for (i in 1:Nmeasures) {
    if (measures[i] == "AUC") measureValues[i] <- AUC(obs = obs, pred = pred, simplif = TRUE)
    else if (measures[i] %in% threshMeasures.methods) {
      measureValues[i] <- evaluate(a, b, c, d, N, measure = measures[i])
      if (standardize == TRUE  &  measures[i] %in% c("TSS", "kappa")) {
        measureValues[i] <- standard01(measureValues[i])
        measures[i] <- paste("s", measures[i], sep = "")
        message(measures[i], " standardized to the 0-1 scale for direct comparability with other measures; use 'standardize = FALSE' if this is not what you wish")
      }  # end if standardize
    }  # end if measures[i] in threshMeasures.methods
    else {
      warning("'",measures[i],"' is not a valid measure; type 'threshMeasures.methods' for available options\n\n")
      next
    }  # end else
  }  # end for i

  Measures <- matrix(data = measureValues, nrow = Nmeasures, ncol = 1, dimnames = list(measures, "Value"))
  if (simplif) {  # shorter version for use with e.g. optiThresh function
    return(Measures)
  } else {
    prev <- prevalence(obs)
    conf.matrix <- matrix(c(a, b, c, d), nrow = 2, ncol = 2, byrow = TRUE, dimnames = list(c("pred1", "pred0"), c("obs1", "obs0")))
    if (plot) {
      names(measureValues) <- measures
      measures.plot <- measureValues
      if (plot.ordered) {
        measures.plot <- sort(measures.plot, decreasing = TRUE)
      }
      barplot(measures.plot, las = 3, ...)
    }  # end if plot
    return(list(N = N, Prevalence = prev, Threshold = thresh, ConfusionMatrix = conf.matrix, ThreshMeasures = Measures))
  }  # end else
}  # end threshMeasures function


Note that some of these measures (like NMI, UPR, OPR, PPP, NPP) cannot be calculated for thresholds at which there are zeros in the confusion matrix. Note also that, while most of these measures range from 0 to 1, some of them, such as kappa and TSS, may range from -1 to 1 (Allouche et al. 2006), so their raw scores are not directly comparable. The threshMeasures function includes an option to standardize these measures to the 0-1 scale, for which you need the standard01 function, so that you obtain the standardized versions skappa and sTSS.
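If you don’t have the standard01 function at hand, a minimal stand-in is sketched below; it assumes that the standardization is a simple linear rescaling from the [-1, 1] interval to the [0, 1] interval, which is what this conversion amounts to for kappa and TSS (check the actual standard01 function for additional options):

standard01 <- function(score) {
  # linearly rescale a value from the [-1, 1] to the [0, 1] interval:
  (score + 1) / 2
}

standard01(0.4)  # a TSS of 0.4 corresponds to an sTSS of 0.7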

By default, the threshMeasures function calculates all the measures listed in threshMeasures.methods, but you may specify only a few measures if you wish, as in these examples:

# create some random test data:
myobs <- sample(c(0, 1), 150, replace = TRUE)
mypred <- runif(150, min = 0, max = 1)
# note: you can practice with real sample data too

# calculate some threshold-based measures:
threshMeasures(obs = myobs, pred = mypred, thresh = 0.75, measures = c("Sensitivity", "Specificity", "kappa", "TSS", "UPR", "OPR"), standardize = TRUE, ylim = c(0, 1))

# instead of obs and pred, you can provide a model object of class "glm":
# mymodel = glm(myspecies ~ var1 + var2 ... ...)
threshMeasures(model = mymodel, thresh = "preval", measures = c("Sensitivity", "Specificity", "UPR", "OPR"), ylim = c(0, 1))
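Regarding the “preval” option used above: for a binary 0/1 vector, the prevalence is simply the proportion (i.e. the mean) of presences, so you can check the threshold being applied yourself:

mean(myobs)  # proportion of presences; this is the threshold used by thresh = "preval"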

The threshMeasures function can also be used to calculate the agreement between different presence-absence (or other types of binary) data, as Barbosa et al. (2012) did for comparing mammal distribution data from atlas and range maps. Notice, however, that some of these measures, such as TSS or NMI, are not symmetrical (obs vs pred is different from pred vs obs).
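As an illustration of this use (with made-up data; the object names below are hypothetical), you can feed two binary vectors directly to threshMeasures; with binary ‘pred’ values, the default 0.5 threshold simply separates the zeros from the ones:

atlas <- sample(c(0, 1), 150, replace = TRUE)  # e.g. atlas presence-absence
rangemap <- sample(c(0, 1), 150, replace = TRUE)  # e.g. range-map presence-absence
threshMeasures(obs = atlas, pred = rangemap, measures = c("CCR", "kappa", "TSS"), ylim = c(0, 1))
# swapping the datasets changes the asymmetrical measures (e.g. TSS):
threshMeasures(obs = rangemap, pred = atlas, measures = c("CCR", "kappa", "TSS"), ylim = c(0, 1))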

References:

Allouche O., Tsoar A. & Kadmon R. (2006) Assessing the accuracy of species distribution models: prevalence, kappa and the true skill statistic (TSS). Journal of Applied Ecology, 43: 1223–1232.

Barbosa A.M., Brown J.A. & Real R. (2014) modEvA – an R package for model evaluation and analysis. R package, version 0.1.

Barbosa A.M., Estrada A., Márquez A.L., Purvis A. & Orme C.D.L. (2012) Atlas versus range maps: robustness of chorological relationships to distribution data types in European mammals. Journal of Biogeography, 39: 1391–1400.

Barbosa A.M., Real R., Muñoz A.R. & Brown J.A. (2013) New measures for assessing model equilibrium and prediction mismatch in species distribution models. Diversity and Distributions, 19: 1333–1338 (DOI: 10.1111/ddi.12100).

Fielding A.H. & Bell J.F. (1997) A review of methods for the assessment of prediction errors in conservation presence/absence models. Environmental Conservation, 24: 38–49.

Jiménez-Valverde A. & Lobo J.M. (2007) Threshold criteria for conversion of probability of species presence to either–or presence–absence. Acta Oecologica, 31: 361–369.

Liu C., White M. & Newell G. (2011) Measuring and comparing the accuracy of species distribution models with presence-absence data. Ecography, 34: 232–243.

Nenzén H.K. & Araújo M.B. (2011) Choice of threshold alters projections of species range shifts under climate change. Ecological Modelling, 222: 3346–3354.