Many students and workshop participants ask me for a (semi-)automated way to 1) download species occurrence data from GBIF into R, and 2) clean those data of common errors. The script below does both, while drawing the user's attention to the need to properly cite the data sources (not just GBIF, which is a repository for many sources), and to carefully map and inspect the records for additional problems that are not picked up by current tools.
NOTE [April 2023]: This script is a bit outdated given recent developments in R — see e.g. newer posts on GBIF data cleaning and removal of absences.
#######################################
## DOWNLOAD AND CLEAN DATA FROM GBIF ##
#######################################
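# if any of these packages are not installed yet, you can install them first (commented out here; note that 'scrubr' has at times been archived from CRAN, in which case it may need to be installed from an archive or from its GitHub repository):
# install.packages(c("rgbif", "scrubr", "maps"))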
library(rgbif)
library(scrubr)
library(maps)
# IF YOU HAVE ONLY ONE SPECIES OR TAXONOMIC GROUP ----
myspecies <- "Galemys pyrenaicus" # instead of a species it can be another taxonomic group, e.g. "Mammalia" or "Talpidae"
# download GBIF occurrence data for this taxon; this takes time if there are many data points!
gbif_data <- occ_data(scientificName = myspecies, hasCoordinate = TRUE, limit = 20000)
# take a look at the downloaded data:
gbif_data
# if "Records found" is larger than "Records returned", you need to increase the 'limit' argument above -- see help(occ_data) for options and limitations
# if your taxon is widespread but you want to work on a particular region, you can download records within a specified window of coordinates:
gbif_data <- occ_data(scientificName = myspecies, hasCoordinate = TRUE, limit = 20000, decimalLongitude = "-10, 10", decimalLatitude = "35, 55") # note that coordinate ranges must be specified this way: "smaller, larger" (e.g. "-5, -2")
gbif_data
# get the DOIs for citing these data properly:
gbif_citation(gbif_data)
# note: if you need or prefer a single DOI for the entire dataset, download the dataset directly from www.gbif.org and then import the .csv into R. It is very important to properly cite the data sources! GBIF is not a source, but a repository for data from many providers, who put in very hard work to collect these data and make them available. See also the new "derived datasets" tool (https://www.gbif.org/citation-guidelines#derivedDatasets) for obtaining a DOI based on the datasetKey column.
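# a minimal sketch of how to count records per source dataset (assuming the 'datasetKey' column is present in the download, as it normally is), e.g. for use with the "derived datasets" tool:
sort(table(gbif_data$data$datasetKey), decreasing = TRUE)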
# check how the data are organized:
names(gbif_data)
names(gbif_data$meta)
names(gbif_data$data)
# get the columns that matter for mapping and cleaning the occurrence data:
myspecies_coords <- gbif_data$data[ , c("decimalLongitude", "decimalLatitude", "occurrenceStatus", "coordinateUncertaintyInMeters", "institutionCode", "references")]
head(myspecies_coords)
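# a quick look at the reported coordinate uncertainty (in metres), which will be used for cleaning further below:
summary(myspecies_coords$coordinateUncertaintyInMeters)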
# map the occurrence data:
map("world", xlim = range(myspecies_coords$decimalLongitude), ylim = range(myspecies_coords$decimalLatitude)) # if the map doesn't appear right at first, run this command again
points(myspecies_coords[ , c("decimalLongitude", "decimalLatitude")], pch = ".")
# you may notice (especially if you zoom in, e.g. by specifying a smaller range of coordinates under 'xlim' and 'ylim' above) that many points are too regularly spaced to be exact locations of species sightings; rather, such points are likely to be centroids of (relatively large) grid cells on which particular surveys were based, so remember to adjust the spatial resolution of your analysis accordingly!
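# for example, to zoom in on a smaller window (the coordinates below are just an illustration; adjust them to your region of interest):
# map("world", xlim = c(-10, 0), ylim = c(38, 44))
# points(myspecies_coords[ , c("decimalLongitude", "decimalLatitude")], pch = 20, cex = 0.3)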
# also, these data are likely to contain species absences and location errors, so jump to "CLEAN THE DATASET" section below - this is VERY IMPORTANT!!!
# IF YOU HAVE MORE THAN ONE TAXONOMIC GROUP ----
myspecies <- c("Galemys pyrenaicus", "Chioglossa lusitanica")
# download GBIF occurrence data for these species; this may take a long time if there are many data points!
gbif_data <- occ_data(scientificName = myspecies, hasCoordinate = TRUE, limit = 500) # decrease the 'limit' if you just want to see how many records there are, without waiting for the whole dataset to download
# take a look at the downloaded data:
gbif_data
# if, for any species, "Records found" is larger than "Records returned", you need to increase the 'limit' argument above -- see help(occ_data) for options and limitations
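# a sketch of how to compare records found vs. returned for each species (assuming the standard 'occ_data' output structure for multiple taxa, i.e. one list element per species):
sapply(gbif_data, function(x) x$meta$count)  # records found
sapply(gbif_data, function(x) if (is.null(x$data)) 0 else nrow(x$data))  # records returned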
# get the DOI for citing these data properly:
gbif_citation(gbif_data) # unfortunately, obtaining a proper citation with R is more complicated for a dataset with multiple species. To get a DOI for these data, download the dataset directly from www.gbif.org and then import the .csv into R. It is very important to properly cite the data sources! GBIF is not a source, but a repository for data from many providers, who put in very hard work to collect these data and make them available. See also the new "derived datasets" tool (https://www.gbif.org/citation-guidelines#derivedDatasets) for obtaining a DOI based on the datasetKey column.
# if your species are widespread but you want to work on a particular region, you can download records within a specified window of coordinates:
gbif_data <- occ_data(scientificName = myspecies, hasCoordinate = TRUE, limit = 20000, decimalLongitude = "-10, 10", decimalLatitude = "35, 55") # note that coordinate ranges must be specified this way: "smaller, larger" (e.g. "-5, -2")
gbif_data
# check how the data are organized:
names(gbif_data)
names(gbif_data[[myspecies[1]]])
names(gbif_data[[myspecies[1]]]$meta)
names(gbif_data[[myspecies[1]]]$data)
# create and fill a list with only the 'data' section for each species:
myspecies_coords_list <- vector("list", length(myspecies))
names(myspecies_coords_list) <- myspecies
for (s in myspecies) {
  # skip species for which no records were returned (their 'data' element is NULL):
  if (is.null(gbif_data[[s]]$data)) next
  # get the columns that matter for mapping and cleaning the occurrence data:
  coords <- gbif_data[[s]]$data[ , c("decimalLongitude", "decimalLatitude", "occurrenceStatus", "coordinateUncertaintyInMeters", "institutionCode", "references")]
  myspecies_coords_list[[s]] <- data.frame(species = s, coords)
}
lapply(myspecies_coords_list, head)
# collapse the list into a data frame:
myspecies_coords <- do.call(rbind, myspecies_coords_list)
row.names(myspecies_coords) <- NULL  # reset the row names, which 'rbind' constructs from the list element names
head(myspecies_coords)
tail(myspecies_coords)
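# number of records per species in the combined data frame:
table(myspecies_coords$species)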
# map the occurrence data:
map("world", xlim = range(myspecies_coords$decimalLongitude), ylim = range(myspecies_coords$decimalLatitude)) # if the map doesn't appear right at first, run this command again
points(myspecies_coords[ , c("decimalLongitude", "decimalLatitude")], col = as.factor(myspecies_coords$species), pch = ".")
# you may notice (especially if you zoom in, e.g. by specifying a smaller range of coordinates under 'xlim' and 'ylim' above) that many points are too regularly spaced to be exact locations of species sightings; rather, such points are likely to be centroids of (relatively large) grid cells on which particular surveys were based, so remember to adjust the spatial resolution of your analysis accordingly!
# CLEAN THE DATASET! ----
# mind that data often contain errors, so careful inspection and cleaning are necessary!
# here we'll first remove records of absence (if any):
names(myspecies_coords)
sort(unique(myspecies_coords$occurrenceStatus)) # check for different indications of "absent", which could be in different languages! Also remember that R is case-sensitive
absence_rows <- which(myspecies_coords$occurrenceStatus %in% c("absent", "Absent", "ABSENT", "ausente", "Ausente", "AUSENTE"))
length(absence_rows)
if (length(absence_rows) > 0) {
myspecies_coords <- myspecies_coords[-absence_rows, ]
}
# let's do some further data cleaning with functions of the 'scrubr' package (but note this cleaning is not exhaustive!)
nrow(myspecies_coords)
myspecies_coords <- coord_incomplete(coord_imprecise(coord_impossible(coord_unlikely(myspecies_coords))))
nrow(myspecies_coords)
# map the cleaned occurrence data:
map("world", xlim = range(myspecies_coords$decimalLongitude), ylim = range(myspecies_coords$decimalLatitude)) # if the map doesn't appear right at first, run this command again
points(myspecies_coords[ , c("decimalLongitude", "decimalLatitude")], col = as.factor(myspecies_coords$species), pch = ".")
# possible erroneous points e.g. on the Equator (lat and lon = 0) should have disappeared now
# also eliminate presences with reported coordinate uncertainty (location error, spatial resolution) larger than 5 km (5000 m):
myspecies_coords <- coord_uncertain(myspecies_coords, coorduncertainityLimit = 5000) # note: the argument name is spelled this way ("coorduncertainityLimit") in the 'scrubr' package
nrow(myspecies_coords)
# but note that this will only get rid of records where coordinate uncertainty is adequately reported, which may not always be the case! Careful mapping and visual inspection are necessary
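# a quick check (sketch) of how many of the remaining records have no reported coordinate uncertainty at all (with scrubr's default settings these are not removed by the step above, so they still deserve scrutiny):
sum(is.na(myspecies_coords$coordinateUncertaintyInMeters))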
# map the cleaned occurrence records with a different colour on top of the raw ones:
points(myspecies_coords[ , c("decimalLongitude", "decimalLatitude")], pch = 20, cex = 0.5, col = "turquoise")