Analyse multicollinearity in a dataset, including the variance inflation factor (VIF)

The multicol function, now included in the fuzzySim package (Barbosa, 2015), calculates the degree of multicollinearity in a set of numeric variables, using three closely related measures: R-squared, tolerance and the variance inflation factor (VIF). The latter is increasingly used in ecology and a range of other research areas, and several R packages already calculate the VIF, but their results can be strikingly different (Estrada & Barbosa, submitted). The multicol function below calculates multicollinearity statistics directly on the independent variables, producing the expected results without (so far) clashing with other packages. Only ‘independent’ (predictor, right-hand side) variables should be entered, as the result obtained for each variable depends on all the other variables present in the analysed dataset.

multicol <- function(vars) {
  # version 0.3 (9 Oct 2014)
  result <- matrix(NA, nrow = ncol(vars), ncol = 3)
  rownames(result) <- colnames(vars)
  colnames(result) <- c("Rsquared", "Tolerance", "VIF")
  for (v in 1:ncol(vars)) {
    v.name <- colnames(vars)[v]
    other.v.names <- colnames(vars)[-v]
    mod.formula <- as.formula(paste(v.name, "~", paste(other.v.names, collapse = "+")))
    mod <- lm(mod.formula, data = vars)
    R2 <- summary(mod)$r.squared
    result[v, "Rsquared"] <- R2
    result[v, "Tolerance"] <- 1 - R2
    result[v, "VIF"] <- 1 / (1 - R2)
  }
  return(result)
}


Let’s try some examples with a couple of datasets available in R (trees and OrchardSprays):

multicol(trees)
 
# you'll get a warning and some NA results if any of the variables is not numeric:
multicol(OrchardSprays)
 
# but you can define a subset of columns of 'data' to calculate 'multicol' for:
multicol(OrchardSprays[ , 1:3])
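
To see where these numbers come from, you can reproduce the VIF of one variable by hand, e.g. for Girth in the trees data (a quick check using base R only):

R2 <- summary(lm(Girth ~ Height + Volume, data = trees))$r.squared
1 / (1 - R2)  # should match the VIF that multicol(trees) reports for Girth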

REFERENCES

Barbosa A.M. (2015) fuzzySim: applying fuzzy logic to binary similarity indices in ecology. Methods in Ecology and Evolution, 6: 853-858

Estrada A. & Barbosa A.M. (submitted) Calculating the Variance Inflation Factor in R: cautions and recommendations.

Classify integer columns

The integerCols function below detects which numeric columns in a data frame contain only whole numbers, and converts those columns to the integer class, so that they take up less space. It uses the multConvert function, which must be loaded too (a minimal stand-in is sketched after the function below). Both functions are included in the latest version of the fuzzySim package (Barbosa, 2015).

integerCols <- function(data) {
  is.wholenumber <- function(x, tol = .Machine$double.eps ^ 0.5) {
    abs(x - round(x)) < tol
  }  # from ?is.integer examples
  all.cols <- 1:ncol(data)
  integer.cols <- rep(0, ncol(data))
  for (i in all.cols) {
    x <- na.omit(data[ , i])
    if(!is.numeric(x)) next
    if(!all(is.finite(x))) next
    if (all(is.wholenumber(x))) integer.cols[i] <- 1
  }
  multConvert(data, conversion = as.integer, cols = all.cols[integer.cols == 1])
}
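
The multConvert function is part of fuzzySim; if you just want to try integerCols on its own, a minimal stand-in consistent with the call above (not the package's full version) is:

multConvert <- function(data, conversion, cols = 1:ncol(data)) {
  # apply a conversion function to the chosen columns of a data frame:
  for (i in cols) data[ , i] <- conversion(data[ , i])
  data
}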


Usage example:

dat <- data.frame(
  a = 1:10,
  b = as.numeric(1:10),
  c = seq(0.1, 1, 0.1),
  d = letters[1:10]
)
 
str(dat)  # b is classified as 'numeric' although it contains only whole numbers
 
dat2 <- integerCols(dat)
 
str(dat2)  # b now classified as 'integer'

References:

Barbosa A.M. (2015) fuzzySim: applying fuzzy logic to binary similarity indices in ecology. Methods in Ecology and Evolution, 6: 853-858

Extract a “jumping window” from a vector

Sometimes we need to extract and analyse parts of a dataset (e.g. a vector, or the columns or rows of a data frame) along a moving (aka sliding, rolling, running) window. The jumping.window function, included in the newly released DeadCanMove package (Barbosa et al. 2014), extracts a moving window with (currently) no overlap between windows, and with the possibility of a user-defined gap between windows:

jumping.window <- function(x, window.size, gap.size) {
  window.indices <- integer(0)
  for (i in 1 : window.size) {
    window.indices <- c(window.indices, seq(from = i, to = length(x), by = window.size + gap.size))
  }
  window.indices <- sort(window.indices)
  sampl.windows <- x[window.indices]
  return(sampl.windows)
}  # end jumping.window function


Usage examples:

Paste the whole text of the function above in R, press enter, and then paste the following commands to check out how the function works:

jumping.window(x = 1:20, window.size = 1, gap.size = 0)
 
jumping.window(x = 1:20, window.size = 1, gap.size = 1)
 
jumping.window(x = 6:52, window.size = 5, gap.size = 10)
 
jumping.window(x = c(1, 3, 4, 6, 9, 10, 11), window.size = 3, gap.size = 2)
 
jumping.window(x = c("air", "bird", "cloud", "deep", "elephant", "goat"), window.size = 2, gap.size = 2)
 
 
# to extract jumping-window columns from a table:
 
head(CO2)
CO2jump <- CO2[ , jumping.window(x = 1:ncol(CO2), window.size = 2, gap.size = 1)]
head(CO2jump)

References:

Barbosa A.M., Marques J.T., Santos S.M., Lourenço A., Medinas D., Beja P. & Mira A. (2014) DeadCanMove: Assess how spatial roadkill patterns change with temporal sampling scheme. R package version 0.1 (available at http://deadcanmove.r-forge.r-project.org)

modEvA and fuzzySim packages on R-Forge

Just a quick post to let you know that package modEvA, containing most of the functions posted to date in this blog, and package fuzzySim, containing functions spCodes and splist2presabs as well as fuzzy versions of binarySimilarity and simMatrix, are finally available on R-Forge and should be accessed there from now on, as that’s where they’ll be updated. You can either install the packages directly with an R command, or download the source and/or binary files from R-Forge. Please refer to the packages’ pages for more information, and let me know if you bump into problems.

Check your package’s help files for non-ASCII characters

If you’re building an R package (e.g. following this simple tutorial) and have your package skeleton with the ‘man’ folder containing all your help files, it may be useful to check all those files at once for non-ASCII characters (e.g. tildes, dashes) which may not be correctly exported to other computers. The following function applies the showNonASCIIfile R function to all files in a given folder:

multNonASCIIcheck <- function(path) {
  # version 1.0 (4 Apr 2014)
  # path: complete folder path to the "man" folder of your package skeleton
  odir <- getwd()
  on.exit(setwd(odir))  # restore the working directory even if an error occurs
  setwd(path)
  files <- list.files()
  for (f in files) {
    cat("\n", f, "\n", sep = "")
    if (length(tools::showNonASCIIfile(f)) == 0) cat("OK\n")
  }
}


Load the function and give it your ‘man’ folder path to get the names of all files and the non-ASCII characters found, e.g.:

multNonASCIIcheck("/home/user/Documents/myPackages/supeR/man")

Then you can just open the problematic help files and correct the non-ASCII characters detected.

Probability of occurrence from presence-background data in a data frame

In an interesting recent paper, Royle et al. (2012) presented a critical analysis of Maxent and proposed a method for obtaining probability of occurrence (rather than a vague suitability index) based on maximum likelihood from presence-background (rather than presence-absence) species distribution data. They provided an R package, maxlike, which can apply this method to a set of presence point coordinates and a raster stack of predictor variables. However, this implies not only that you need point coordinates of your species observations, but also that sampling units are square pixels, whereas we often need to model distribution data recorded on larger and more irregular units, such as provinces, river basins, or UTM cells.

The maxlike.df function is a modification of maxlike that can be applied to a data frame instead of spatial objects. I added argument covariates (to use instead of rasters), which requires a matrix or data.frame of the variables (one in each column, with the same names used in argument formula), and argument presences (to use instead of points) for a vector/column with the species data (of the same size and in the same order as the covariates, with 1 for presence and 0 or NA for background or no records — mind that MaxLike, like MaxEnt, needs both presence and background data, so they’re not strictly “presence-only”). Everything else in maxlike.df works the same way as in maxlike (Chandler & Royle 2013), so please read the maxlike documentation for further information on how the method is implemented and interpreted. Note that they recommend that the variables be standardized (e.g. vars = scale(vars)) before modelling, and that you should select which variables to include in each model.

Note also that Royle et al.’s (2012) approach has been supported by Fitzpatrick et al. (2013), but criticized by Hastie & Fithian (2013).

UPDATE: the latest version of the maxlike package (0.1-7) now allows models to be calibrated with data frames, via arguments x (data.frame with environmental conditions at presence point locations) and z (data.frame with environmental conditions at background locations). The function below is therefore deprecated, and you should use the maxlike package instead.
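
Based on that description, a call to the current maxlike should look something like the following sketch (data frame names are illustrative; check ?maxlike in your installed version for the exact argument names):

library(maxlike)
# presence.covs: covariates at the presence localities
# background.covs: the same covariates at the background localities
mod <- maxlike(formula = ~ var1 + var2, x = presence.covs, z = background.covs)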

maxlike.df <- function (formula, rasters, points, presences, covariates, link = c("logit", "cloglog"), starts, hessian = TRUE, fixed, removeDuplicates = FALSE, savedata = FALSE, na.action = "na.omit", ...)
  {
  if (missing(rasters))  rasters <- covariates  # new line
  if (missing(points))  points <- presences  # new line
  if (identical(formula, ~1))
    stop("At least one continuous covariate must be specified in the formula")
  link <- match.arg(link)
  varnames <- all.vars(formula)
  call <- match.call()
  npts <- sum(points == 1)  # was nrow(points)
  cd.class <- class(rasters)[1]
  #if (cd.class != "rasterstack") stop("rasters must be a raster stack")
  #pt.class <- class(points)[1]
  #if (!pt.class %in% c("matrix", "data.frame")) stop("points must be a matrix or a data.frame")
  #if (ncol(points) != 2) stop("points must have 2 columns containing the x- and y- coordinates")
  pt.names <- colnames(points)
  #if (identical(cd.class, "rasterstack")) {
    cd.names <- names(rasters)
    npix <- nrow(rasters) # was  prod(dim(rasters)[1:2])
    id <- 1:nrow(rasters)  # new line
    cellID <- id[points == 1] # was cellFromXY(rasters, points)
    duplicates <- duplicated(cellID)
    if (removeDuplicates) {
      cellID <- unique(cellID)
      npts <- length(cellID)
      points.retained <- points[!duplicates, ]
    }
    x <- as.data.frame(rasters[points == 1, ])  # was as.data.frame(matrix(extract(rasters, cellID), npts))
    z <- rasters  # was as.data.frame(matrix(getValues(rasters), npix))
    names(x) <- names(z) <- cd.names
  #}
  if (!all(varnames %in% cd.names)) stop("at least 1 covariate in the formula is not in names(rasters).")
  X.mf <- model.frame(formula, x, na.action = na.action)
  X.mf.a <- attributes(X.mf)
  pts.removed <- integer(0)
  points.retained <- points
  if ("na.action" %in% names(X.mf.a)) {
    pts.removed <- X.mf.a$na.action
    npts.removed <- length(pts.removed)
    if (npts.removed > 0) {
      warning(paste(npts.removed, "points removed due to missing values"))
      points.retained <- points.retained[-pts.removed]  # I removed the comma before ']', as here 'points' is a vector rather than a two-column matrix/dataframe
    }
  }
  X <- model.matrix(formula, X.mf)
  Z.mf <- model.frame(formula, z, na.action = na.action)
  Z.mf.a <- attributes(Z.mf)
  pix.removed <- integer(0)
  if ("na.action" %in% names(Z.mf.a)) {
    pix.removed <- Z.mf.a$na.action
    npix.removed <- length(pix.removed)
  }
  Z <- model.matrix(formula, Z.mf)
  npars <- ncol(X)
  parnames <- colnames(X)
  if (!"(Intercept)" %in% parnames) 
    stop("The intercept must be estimated or fixed")
  if (missing(starts)) {
    starts <- rep(0, npars)
    names(starts) <- parnames
  }
  else names(starts) <- parnames
  if (identical(link, "logit")) {
    nll <- function(pars) {
      psix <- plogis(drop(X %*% pars))
      psiz <- sum(plogis(drop(Z %*% pars)))
      -1 * sum(log(psix/psiz))
    }
  }
  else if (identical(link, "cloglog")) {
    nll <- function(pars) {
      psix <- 1 - exp(-exp(drop(X %*% pars)))
      psiz <- sum(1 - exp(-exp(drop(Z %*% pars))))
      -1 * sum(log(psix/psiz))
    }
  }
  else stop("link function should be either 'logit' or 'cloglog'")
  is.fixed <- rep(FALSE, npars)
  if (!missing(fixed)) {
    if (length(fixed) != length(starts)) 
      stop("fixed should be a vector with the same length as the number of parameters to be estimated")
    if (sum(is.double(fixed)) < 1) 
      stop("fixed must contain at least one real value")
    is.fixed <- !is.na(fixed)
    if (sum(!is.fixed) < 1) 
      stop("you cannot fix all parameters in the model")
    npars <- sum(!is.fixed)
    nll.fix <- function(p) {
      p[is.fixed] <- fixed[is.fixed]
      do.call("nll", list(pars = p))
    }
    fm <- optim(starts, nll.fix, hessian = hessian, ...)
    fm$par[is.fixed] <- fixed[is.fixed]
  }
  else {
    fm <- optim(starts, nll, hessian = hessian, ...)
  }
  not.fixed <- !is.fixed
  par <- fm$par
  if (hessian) {
    vcTry <- try(solve(fm$hessian[not.fixed, not.fixed]))
    if (!inherits(vcTry, "try-error")) {  # was identical(class(vcTry), "matrix"), which fails in R >= 4.0
      vc <- matrix(0, length(par), length(par))
      vc[not.fixed, not.fixed] <- vcTry
      se <- sqrt(diag(vc))
    }
    else {
      vc <- matrix(NA, npars, npars)
      se <- rep(NA, npars)
    }
  }
  else {
    vc <- matrix(NA, npars, npars)
    se <- rep(NA, npars)
  }
  dimnames(vc) <- list(parnames, parnames)
  aic <- 2 * fm$value + 2 * npars
  fitted <- plogis(Z %*% par)
  fitted.values <- merge(data.frame(id = id), data.frame(id = rownames(fitted), fitted = fitted), all = TRUE)  # new line
  out <- list(Est = cbind(Est = par, SE = se), vcov = vc, AIC = aic, call = call, pts.removed = pts.removed, pix.removed = pix.removed, points.retained = points.retained, optim = fm, not.fixed = not.fixed, link = link, fitted.values = fitted.values$fitted)  # 'fitted.values' added
  if (inherits(rasters, c("matrix", "data.frame")))  savedata = TRUE  # new line, otherwise 'predict' would fail
  if (savedata)
    out$rasters <- rasters
  class(out) <- c("maxlikeFit", "list")
  return(out)
}

# some changes to predict.maxlikeFit were also required to accommodate maxlike.df:
predict.maxlikeFit <- function (object, ...) {
  e <- coef(object)
  rasters <- object$rasters
  if (is.null(rasters)) {
    rasters <- try(get(as.character(object$call$rasters)))
    if (identical(class(rasters)[1], "try-error")) 
      stop("could not find the raster data")
    warning("raster data were not saved with object, using the data found in the workspace instead.")
  }
  link <- object$link
  cd.names <- names(rasters)
  cd.class <- class(rasters)[1]  # new line
  if (cd.class == "RasterStack") {  # new line
    npix <- prod(dim(rasters)[1:2])
    z <- as.data.frame(matrix(getValues(rasters), npix))
  }  # new line
  else if (cd.class %in% c("matrix", "data.frame")) {  # new line
    npix <- nrow(rasters)  # new line
    z <- rasters  # new line
  }  else stop("'rasters' must be either a RasterStack, a matrix or a data.frame")  # new line
  names(z) <- cd.names
  formula <- object$call$formula
  varnames <- all.vars(formula)
  if (!all(varnames %in% cd.names)) 
    stop("at least 1 covariate in the formula is not in rasters.")
  Z.mf <- model.frame(formula, z, na.action = "na.pass")
  Z.terms <- attr(Z.mf, "terms")
  Z <- model.matrix(Z.terms, Z.mf)
  eta <- drop(Z %*% coef(object))
  if (identical(link, "logit")) 
    psi.hat <- plogis(eta)
  else if (identical(link, "cloglog")) 
    psi.hat <- 1 - exp(-exp(eta))
  else stop("link function should be either 'logit' or 'cloglog'")
  if (cd.class == "RasterStack") {  # new line
    psi.mat <- matrix(psi.hat, dim(rasters)[1], dim(rasters)[2], byrow = TRUE)
    psi.raster <- raster(psi.mat)
    extent(psi.raster) <- extent(rasters)
    psi.raster
  }  # new line
  else if (cd.class == "data.frame") return(psi.hat)  # new line
}


Usage example for a data frame with the species presence data in column 2 and the variables (named var1, var2, var3 and var4) in columns 3 to 6:
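
If you have no such table at hand, you can first build a purely illustrative one (names and values are arbitrary):

set.seed(1)
mydata <- data.frame(
  id = 1:500,                       # column 1: record identifier
  myspecies = rbinom(500, 1, 0.2),  # column 2: 1 = presence, 0/NA = background
  var1 = rnorm(500), var2 = rnorm(500), var3 = rnorm(500), var4 = rnorm(500)  # columns 3 to 6
)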

myspecies.maxlike <- maxlike.df(formula = ~ var1 + var2 + var3 + var4, covariates = mydata[ , 3:6], presences = mydata[ , 2], method = "BFGS")

If you then want to explore the model with functions such as summary or predict, you need to install and load the maxlike package (which defines the appropriate class for this type of model). Even without it, the maxlike.df output already gives you some useful information about the model:

# get the model predictions applied to the analysed data:
myspecies.maxlike$fitted.values

# get the coefficient estimates and their standard errors:
myspecies.maxlike$Est

# get the model's AIC:
myspecies.maxlike$AIC

You can then evaluate maxlike.df predictions with functions from the modEvA package, as in the following examples:

library(modEvA)

AUC(obs = mydata$myspecies, pred = myspecies.maxlike$fitted.values)
HLfit(obs = mydata$myspecies, pred = myspecies.maxlike$fitted.values)
plotGLM(obs = mydata$myspecies, pred = myspecies.maxlike$fitted.values)

References

Chandler, R.B. & Royle, J.A. (2013) Package ‘maxlike’: Model species distributions by estimating the probability of occurrence using presence-only data. Version  0.1-4, CRAN

Fitzpatrick, M.C., Gotelli, N.J. & Ellison, A.M. (2013) MaxEnt versus MaxLike: empirical comparisons with ant species distributions. Ecosphere 4, art55.

Hastie, T. & Fithian, W. (2013) Inference from presence-only data; the ongoing controversy. Ecography 36, 864-867.

Royle, J.A., Chandler, R.B., Yackulic, C. & Nichols, J.D. (2012) Likelihood analysis of species occurrence probability from presence-only data for modelling species distributions. Methods in Ecology and Evolution 3: 545–554

Distribution modelling based on presence points

The pres.point.mod function uses the dismo package to run a series of models (currently Bioclim, Domain and Maxent) for a set of presence point coordinates and a raster stack of predictor variables. Optionally, it will also apply these models to an extrapolation raster stack, and write prediction and extrapolation maps to disk. The pres.point.mod function is not included in a package, but it’s in the “Supporting Information” of Zanolla et al. (2018), so please cite this paper if you use the function.

Note that Bioclim and Domain use only the information on the presence points, while Maxent also uses a set of background (pseudo-absence) points that are randomly generated if not provided by the user. Please read Hijmans & Elith’s “Species distribution modeling with R” for more info on these modelling methods and their R implementation.

pres.point.mod <- function(obs.xy, var.stack, extrap.stack = NULL, mod.methods = c("bioclim", "domain", "maxent"), bg.points = NULL, result.rasters.folder = NULL, result.rasters.format = "GTiff", overwrite = FALSE, ...) {
  # version 1.5 (8 Oct 2013)
  # obs.xy: a matrix or data frame containing 2 columns, the first with the x and the second with the y coordinates of the species presence points to be modelled
  # var.stack: a raster stack of the predictor variables to use in modelling, in the same coordinate reference system as obs.xy
  # extrap.stack: optionally, a raster stack of variables (with the same names and coordinate reference system as var.stack) to which to extrapolate the models
  # mod.methods: modelling methods to use; currently "bioclim", "domain" and "maxent" are implemented
  # bg.points: an optional matrix/dataframe of x-y coordinates of the background points or pseudo-absences to be used by maxent (the other methods use only the presence points). If NULL, maxent will randomly generate 10000 points within var.stack
  # result.rasters.folder: optionally, the path to a folder (which will be created if not present) to save the rasters of model predictions to disk
  # result.rasters.format: format for the rasters to save in result.rasters.folder (type '?writeFormats' for options)
  # overwrite: logical, whether to overwrite raster files with the same name within the result.rasters.folder
  # ...: additional arguments for the 'maxent' function (see ?maxent for details); relevant arguments may be 'a' for user-defined background points, 'path' for the folder in which to save maxent output files, and 'removeDuplicates' to specify whether duplicate presence points within the same pixel should be ignored

  start.time <- proc.time()
  if (!is.element("dismo", installed.packages()[,1])) stop("You need to have the 'dismo' package installed to run 'pres.point.mod'.")
  require(dismo)
  if(!is.null(result.rasters.folder)) dir.create(result.rasters.folder, showWarnings = FALSE)
  jar.path <- paste(system.file(package = "dismo"), "/java/", sep = "")
  if ("maxent" %in% mod.methods & !file.exists(paste(jar.path, "maxent.jar", sep = "")))  stop("'maxent.jar' file not found in '", jar.path, "; please put (a copy of) it there, or run 'pres.point.mod' without 'maxent' in 'mod.methods'", sep = "")
  results <- vector("list")

  if("bioclim" %in% mod.methods) {
    message("Building Bioclim model...  >>  ", Sys.time())
    bioc.model.expr <- expression(bioclim(x = var.stack, p = obs.xy))
    bioc.model <- tryCatch(eval(bioc.model.expr), error = function(e) NULL)
    if(!is.null(bioc.model)) {
      results[["bioc.model"]] <- bioc.model
      message("Making Bioclim prediction raster...  >>  ", Sys.time())
      bioc.rast <- predict(bioc.model, var.stack)
      results[["bioc.rast"]] <- bioc.rast
      if(!is.null(result.rasters.folder)) {
        message("Saving in result.rasters.folder...  >>  ", Sys.time())
        writeRaster(x = bioc.rast, filename = paste(result.rasters.folder, "Bioclim", sep = "/"), format = result.rasters.format, overwrite = overwrite)
      }  # end if result.rasters.folder
      if(!is.null(extrap.stack)) {
        message("Making Bioclim extrapolated raster...  >>  ", Sys.time())
        bioc.extrap.rast <- predict(bioc.model, extrap.stack)
        results[["bioc.extrap.rast"]] <- bioc.extrap.rast
        if(!is.null(result.rasters.folder)) {
          message("Saving in result.rasters.folder...  >>  ", Sys.time())
          writeRaster(x = bioc.extrap.rast, filename = paste(result.rasters.folder, "Bioclim_extrapolated", sep = "/"), format = result.rasters.format, overwrite = overwrite)
        }  # end if result.rasters.folder
      }  # end if extrap.stack
    }  # end if !null bioc model
    else message("...not! Bioclim model could not be obtained")
  }  # end if bioclim

  if("domain" %in% mod.methods) {
    message("Building Domain model...  >>  ", Sys.time())
    domain.model.expr <- expression(domain(x = var.stack, p = obs.xy))
    domain.model <- tryCatch(eval(domain.model.expr), error = function(e) NULL)
    if(!is.null(domain.model)) {
      results[["domain.model"]] <- domain.model
      message("Making Domain prediction raster...  >>  ", Sys.time())
      domain.rast <- predict(domain.model, var.stack)
      results[["domain.rast"]] <- domain.rast
      if(!is.null(result.rasters.folder)) {
        message("Saving in result.rasters.folder...  >>  ", Sys.time())
        writeRaster(x = domain.rast, filename = paste(result.rasters.folder, "Domain", sep = "/"), format = result.rasters.format, overwrite = overwrite)
      }  # end if result.rasters.folder
      if(!is.null(extrap.stack)) {
        message("Making Domain extrapolated raster...  >>  ", Sys.time())
        domain.extrap.rast <- predict(domain.model, extrap.stack)
        results[["domain.extrap.rast"]] <- domain.extrap.rast
        if(!is.null(result.rasters.folder)) {
          message("Saving in result.rasters.folder...  >>  ", Sys.time())
          writeRaster(x = domain.extrap.rast, filename = paste(result.rasters.folder, "Domain_extrapolated", sep = "/"), format = result.rasters.format, overwrite = overwrite)
        }  # end if result.rasters.folder
      }  # end if extrap.stack
    }  # end if !null domain model
    else message("...not! Domain model could not be obtained")
  }  # end if domain

  if("maxent" %in% mod.methods) {
    message("Building Maxent model...  >>  ", Sys.time())
    maxent.model.expr <- expression(maxent(x = var.stack, p = obs.xy, a = bg.points, ...))
    maxent.model <- tryCatch(eval(maxent.model.expr), error = function(e) NULL)
    if(!is.null(maxent.model)) {
      results[["maxent.model"]] <- maxent.model
      message("Making Maxent prediction raster...  >>  ", Sys.time())
      maxent.rast <- predict(maxent.model, var.stack)
      results[["maxent.rast"]] <- maxent.rast
      if(!is.null(result.rasters.folder)) {
        message("Saving in result.rasters.folder...  >>  ", Sys.time())
        writeRaster(x = maxent.rast, filename = paste(result.rasters.folder, "Maxent", sep = "/"), format = result.rasters.format, overwrite = overwrite)
      }  # end if result.rasters.folder
      if(!is.null(extrap.stack)) {
        message("Making Maxent extrapolated raster...  >>  ", Sys.time())
        maxent.extrap.rast <- predict(maxent.model, extrap.stack)
        results[["maxent.extrap.rast"]] <- maxent.extrap.rast
        if(!is.null(result.rasters.folder)) {
          message("Saving in result.rasters.folder...  >>  ", Sys.time())
          writeRaster(x = maxent.extrap.rast, filename = paste(result.rasters.folder, "Maxent_extrapolated", sep = "/"), format = result.rasters.format, overwrite = overwrite)
        }  # end if result.rasters.folder
      }  # end if extrap.stack
    }  # end if !null maxent model
    else message("...not! Maxent model could not be obtained")
  }  # end if maxent

  end.time <- proc.time()
  duration <- (end.time - start.time)[3]
  if (duration < 60) {
    units <- " second(s)"
  } else if (duration < 3600) {
    duration <- duration / 60
    units <- " minute(s)"
  } else {
    duration <- duration / 3600
    units <- " hour(s)"
  }

  message("Finished!  >>  ", Sys.time(), " -- it took ", round(duration),  units)
  return(results)
}  # end pres.point.mod function


Please read the first lines of the function above for an explanation of what each parameter means.

Usage example:
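
The inputs might be prepared along these lines (file and folder names are purely illustrative):

library(dismo)  # also loads the 'raster' package
predictor.stack <- stack(list.files("variables_present", pattern = "\\.tif$", full.names = TRUE))
var2050.stack <- stack(list.files("variables_2050", pattern = "\\.tif$", full.names = TRUE))
mypoints <- read.csv("myspecies_records.csv")[ , c("x", "y")]
bg.points <- randomPoints(predictor.stack, n = 10000)  # dismo's background sampler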

mymodels <- pres.point.mod(obs.xy = mypoints, var.stack = predictor.stack, extrap.stack = var2050.stack, mod.methods = c("bioclim", "domain", "maxent"), bg.points = bg.points, result.rasters.folder = "myspecies_models", result.rasters.format = "GTiff", overwrite = FALSE, path = "myspecies_maxent_out")

pairs(mymodels$bioc.model)
response(mymodels$bioc.model)

plot(mymodels$bioc.rast)
plot(mymodels$domain.rast)
plot(mymodels$maxent.rast)

REFERENCES:

Zanolla M., Altamirano M., Carmona R., De La Rosa J., Souza-Egipsy V., Sherwood A., Tsiamis K., Barbosa A.M., Muñoz A.R. & Andreakis N. (2018) Assessing global range expansion in a cryptic species complex: insights from the red seaweed genus Asparagopsis (Florideophyceae). Journal of Phycology, 54: 12-24.

Convert geographical coordinates from degree-minute-second to decimal degrees

If you have latitude and/or longitude data in sexagesimal degrees, in the form degrees° minutes′ seconds″ hemisphere (e.g. 41° 34′ 10.956″ N or 8° 37′ 47.1036″ W, with or without spaces in between), the dms2dec function can convert them to decimal degrees, which are usually required for mapping. This function is not included in a package, but it’s in the “Supporting Information” of Zanolla et al. (2018), so please cite this paper if you use the function.

dms2dec <- function(dms, separators = c("º", "°", "\'", "\"")) {
  # version 1.0 (25 Sep 2013)
  # dms: a vector (or column) of latitude or longitude in degrees-minutes-seconds-hemisphere format, e.g. 41° 34' 10.956" N (with or without spaces)
  # separators: the characters that are separating degrees, minutes and seconds in dms

  dms <- as.character(dms)
  dms <- gsub(pattern = " ", replacement = "", x = dms)
  for (s in separators) dms <- gsub(pattern = s, replacement = "_splitHere_", x = dms)

  splits <- strsplit(dms, split = "_splitHere_")
  n <- length(dms)
  deg <- min <- sec <- hem <- vector("character", n)

  for (i in 1:n) {
    deg[i] <- splits[[i]][1]
    min[i] <- splits[[i]][2]
    sec[i] <- splits[[i]][3]
    hem[i] <- splits[[i]][4]
  }

  dec <- as.numeric(deg) + (as.numeric(min) / 60) + (as.numeric(sec) / 3600)
  sign <- ifelse (hem %in% c("N", "E"), 1, -1)
  dec <- sign * dec
  return(dec)
}  # end dms2dec function 


Usage example, from a table called mydata:

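For instance, mydata could look like this (hypothetical coordinates, just for illustration):

mydata <- data.frame(
  Latitude = c("41° 34' 10.956\" N", "38° 42' 48.8\" N"),
  Longitude = c("8° 37' 47.1036\" W", "9° 8' 21.8\" W")
)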

mydata$latitude.decimal <- dms2dec(mydata$Latitude)

mydata$longitude.decimal <- dms2dec(mydata$Longitude)

REFERENCES:

Zanolla M., Altamirano M., Carmona R., De La Rosa J., Souza-Egipsy V., Sherwood A., Tsiamis K., Barbosa A.M., Muñoz A.R. & Andreakis N. (2018) Assessing global range expansion in a cryptic species complex: insights from the red seaweed genus Asparagopsis (Florideophyceae). Journal of Phycology, 54: 12-24

Arrange plots within a window

If you want to get a number of plots in the same window or figure, the arrangePlots function can help you find the adequate number of rows and columns in which to arrange them. This can be especially useful inside a function that can generate plots whose number you don’t initially know — it is used e.g. in the optiThresh function. Both these functions are included in the modEvA package (Barbosa et al. 2014).

arrangePlots <- function(n.plots, landscape = FALSE) {
  # version 1.0 (16 Sep 2013)
  root <- sqrt(n.plots)
  large <- ceiling(root)
  small <- round(root)
  if (landscape) plots.rc <- c(small, large)
  else plots.rc <- c(large, small)
  return(plots.rc)
}  # end arrangePlots function
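
For example, the row-by-column vectors returned are:

arrangePlots(6)                    # 3 2 (3 rows, 2 columns)
arrangePlots(6, landscape = TRUE)  # 2 3 (2 rows, 3 columns)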


Usage examples (e.g. for 6 plots):

par(mfrow = arrangePlots(6))

for(i in 1:6) plot(i)

par(mfrow = arrangePlots(6, landscape = TRUE))

for(i in 1:6) plot(i)

REFERENCES:

Barbosa A.M., Brown J.A. & Real R. (2014) modEvA – an R package for model evaluation and analysis. R package, version 0.1.

Calculate evaluation measures for multiple models

If you have a list of models created e.g. with the multGLM function, or a data frame with presence-absence data and the corresponding predicted values for a set of species, you can use the multModEv function to get a set of evaluation measures for all models simultaneously. Note that all models must have the same sample size. This function is included in the modEvA package (Barbosa et al. 2014).

multModEv.methods <- c("Prevalence", "AUC", "CCR", "Misclass", "Sensitivity", "Specificity", "Omission", "Commission", "PPP", "NPP", "UPR", "OPR", "PPI", "PAI", "kappa", "TSS", "NMI", "OddsRatio", "HL", "HL.p", "RMSE", "Miller.int", "Miller.slope", "Miller.p")
 
multModEv <- function(models = NULL, obs.data = NULL, pred.data = NULL, measures = multModEv.methods, standardize = FALSE, thresh = 0.5, bin.method = "quantiles", quiet = TRUE) {
  # version 2.0 (24 Jun 2015)
 
  for (m in measures) {
    if (!(m %in% modEvAmethods("multModEv"))) stop(m, " is not a valid measure; type modEvAmethods('multModEv') for available options.")
  }
 
  if (!bin.method %in% modEvAmethods("getBins")) stop("Invalid bin.method; type modEvAmethods('getBins') for available options.")
 
  if (is.null(models)) {
    if (is.null(obs.data) | is.null(pred.data)) stop("You must provide either a list of model object(s) of class 'glm', or a set of obs.data + pred.data with matching dimensions.")
  }  # end if !models
 
  else {
    if (!is.null(obs.data)) message("Argument 'obs.data' ignored in favour of 'models'.")
    if (!is.null(pred.data)) message("Argument 'pred.data' ignored in favour of 'models'.")
 
    n <- sapply(models, function(x) length(resid(x)))
    if(!(all(n == n[1]))) stop("Models must all be based on the same sample size.")
 
    obs.data <- pred.data <- fav.data <- matrix(data = NA, nrow = length(resid(models[[1]])), ncol = length(models), dimnames = list(NULL, names(models)))
    for (m in 1 : length(models)) {
      obs.data[ , m] <- models[[m]]$y
      pred.data[ , m] <- models[[m]]$fitted.values
    }  # end for m
  }  # end if models
 
  n.models <- ncol(obs.data)
  n.measures <- length(measures)
  results <- matrix(NA, nrow = n.models, ncol = n.measures)
  rownames(results) <- colnames(obs.data)
  colnames(results) <- measures
 
  message(n.models, " models to evaluate; ", n.measures,  " measures to calculate for each.")
 
  if (any(measures %in% modEvAmethods("threshMeasures"))) {
    thresh.measures <- measures[measures %in% modEvAmethods("threshMeasures")]
    cat("- using '", thresh, "' as the threshold value to calculate ", paste(thresh.measures, collapse = ", "), "\n", sep = "")  # doesn't quite work with 'message' or 'paste' or with a 'sep'
  }  # this needs to be outside the 'for' loop
 
  bin.measures <- c("HL", "HL.p", "RMSE")  # ABC, unityRsq
  if (any(measures %in% bin.measures)) {
    bin.measures <- measures[measures %in% bin.measures]
    cat("- using '", bin.method, "' as the bin method to calculate ", paste(bin.measures, collapse = ", "), "\n", sep = "")  # doesn't quite work with 'message' or 'paste' or with a 'sep'
  }  # this needs to be outside the 'for' loop
 
  for (m in 1:n.models) {
 
    message("Evaluating model ", m, "...")
 
    if ("Prevalence" %in% measures)
      results[m, "Prevalence"] <- prevalence(obs.data[ , m])
 
    if ("Evenness" %in% measures)
      results[m, "Evenness"] <- evenness(obs.data[ , m])
 
    if ("AUC" %in% measures)
      results[m, "AUC"] <- AUC(obs = obs.data[ , m], pred = pred.data[ , m], simplif = TRUE)
 
    if (any(measures %in% modEvAmethods("threshMeasures"))) {
      for (m in 1:n.models)  for (tm in thresh.measures) {
        results[m, tm] <- threshMeasures(obs = obs.data[ , m], pred = pred.data[ , m], measures = tm, thresh = thresh, standardize = standardize, simplif = TRUE)
      }; rm(m, tm)
    }  # end if measures in modEvAmethods("threshMeasures")
    # thresh.measures <- optiThresh(obs = obs.data[ , m], pred = pred.data[ , m], plot = FALSE, optimize = "each")
 
    if (any(measures %in% c("HL", "HL.p"))) {
      for (m in 1:n.models) {
        HL <- HLfit(obs = obs.data[ , m], pred = pred.data[ , m], bin.method = bin.method, simplif = TRUE)
        if ("HL" %in% measures)  results[m, "HL"] <- HL$chi.sq
        if ("HL.p" %in% measures)  results[m, "HL.p"] <- HL$p.value
        if ("RMSE" %in% measures)  results[m, "RMSE"] <- HL$RMSE
      }; rm(m)
    }  # end if HL
 
    if (any(measures %in% c("Miller.int", "Miller.slope", "Miller.p"))) {
      for (m in 1:n.models) {
        Miller <- MillerCalib(obs = obs.data[ , m], pred = pred.data[ , m], plot = FALSE)
        if ("Miller.int" %in% measures)
          results[m, "Miller.int"] <- Miller$intercept
        if ("Miller.slope" %in% measures)
          results[m, "Miller.slope"] <- Miller$slope
        if ("Miller.p" %in% measures)
          results[m, "Miller.p"] <- Miller$slope.pvalue
      }; rm(m)
    } # end if Miller
  }  # end for m
 
  if (standardize) {
    colnames(results)[colnames(results) == "kappa"] <- "skappa"
    colnames(results)[colnames(results) == "TSS"] <- "sTSS"
  }  # end if standardize
 
  results <- data.frame(Model = rownames(results), results, row.names = NULL)
  message("Finished!")
  return(results)
}

Usage examples:

multModEv(obs.data = mydata[ , 2:8], pred.data = mydata[ , 9:15], thresh = "preval")

multModEv(models = mymodels$models)

multModEv(models = mymodels$models, thresh = 0.5, measures = c("AUC", "CCR", "Sensitivity", "TSS"))
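
If you have no models or data at hand, a toy run on random numbers (purely illustrative) shows the output format:

set.seed(42)
obs <- as.data.frame(matrix(rbinom(100 * 7, 1, 0.3), ncol = 7))
pred <- as.data.frame(matrix(runif(100 * 7), ncol = 7))
multModEv(obs.data = obs, pred.data = pred, measures = c("Prevalence", "AUC", "TSS"))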

REFERENCES:

Barbosa A.M., Brown J.A. & Real R. (2014) modEvA – an R package for model evaluation and analysis. R package, version 0.1.