modTools: simple tools for simpler modelling


This website provides a series of free software tools (CC BY-SA licence), mostly written in R, for various tasks related to the analysis of species' distributions and ecology (although many of them can be applied to other purposes). Most of these tools are now included in R packages — if you use them, please cite the appropriate package (mentioned in the blog post featuring each function). Please note that these tools are experimental (like R, they come with absolutely no warranty) and are occasionally edited with corrections or improvements, so preferably check back for the latest version before using a function.

Leave a comment here if you use these functions in publications — this way I can cite you in the blog, track how people are using the tools, and justify the time and effort dedicated to them. Thanks are due to Arpat Ozgul and François Guilhaumon, who helped me through the steep part of the R learning curve (to me it was actually more like an overhang). Comments and suggestions are very welcome, although a quick response cannot be guaranteed!

Delimiting the modelling background for scattered uneven occurrence data

In species distribution modelling and ecological niche modelling (SDM & ENM), the region from which background or pseudoabsence points are drawn is key to how well a model turns out. This region should include enough localities for the model to assess the species' (non-)preferences, but it should also be within the species' reach AND reasonably evenly surveyed for it. Survey bias within the background or pseudoabsence region can seriously degrade model quality.

Yet real-life occurrence data (especially at the wide scale of species distributions) are often strongly biased, and it's rarely straightforward to define an adequate region around the existing points. Very nice criteria exist that work well in theory, but not under most actual scenarios of biodiversity data bias. Until recently, my preferred approach was a buffer around the known occurrences, with a radius of e.g. the mean pairwise distance between them (as a rough measure of their spatial spread, or of the spatial reach of the species), or half the width of the area defined by the occurrences (usually better for elongated distributions or surveys):

# download some example occurrence records:

occs <- geodata::sp_occurrence("Rupicapra", "pyrenaica", args = c("year=2024"))


# map occurrence records:

occs <- terra::vect(occs, geom = c("lon", "lat"), crs = "EPSG:4326")
terra::plot(occs, cex = 0.2, ext = terra::ext(occs) + 4)
cntry <- geodata::world(path = tempdir())
terra::plot(cntry, lwd = 0.2, add = TRUE)
# note you should normally clean these records of errors before moving on!


# compute mean distance and width of occurrence records:

occs_mdist <- mean(terra::distance(occs))
occs_width <- terra::width(terra::aggregate(occs))


# compute buffers around occurrences using both metrics:

occs_buff_d <- terra::aggregate(terra::buffer(occs, width = occs_mdist))
occs_buff_w <- terra::aggregate(terra::buffer(occs, width = occs_width / 2))


# plot both buffers and occurrence records:

par(mfrow = c(1, 2))
terra::plot(occs_buff_d, col = "yellow", border = "orange", lwd = 3, ext = terra::ext(occs) + 4, main = "Mean distance")
terra::plot(cntry, lwd = 0.2, add = TRUE)
terra::plot(occs, cex = 0.2, add = TRUE)
terra::plot(occs_buff_w, col = "yellow", border = "orange", lwd = 3, ext = terra::ext(occs) + 4, main = "Half width")
terra::plot(cntry, lwd = 0.2, add = TRUE)
terra::plot(occs, cex = 0.2, add = TRUE)
[Figure: occurrence records with the mean-distance and half-width buffers]

This may work well for relatively evenly surveyed species or regions, where we can reasonably assume that most of the buffered area has been (roughly evenly) surveyed, and that most places without presence are likely absences. However, many species have isolated records in remote understudied areas, with survey effort hugely imbalanced across the distribution range. So, a buffer with a uniform radius will not work well:

occs <- geodata::sp_occurrence("Lutra", "lutra", args = c("year=2024"))

# everything else same code as above...
[Figure: uniform-radius buffers around scattered Lutra lutra occurrence records]

You can use smaller multipliers for distance and width, but these uniform-radius buffers will always include large regions with only scarce and isolated occurrence records, where it cannot be assumed that absence of records mostly reflects absence of the species. Normally I would terra::erase() from the buffer the regions that look insufficiently surveyed, but that implies discarding potentially valuable occurrence data.
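For instance, you could delimit the area that looks insufficiently surveyed with a polygon (drawn by hand or loaded from a file) and cut it out of the buffer. A minimal sketch, assuming the 'occs_buff_d' buffer from the code above and a hypothetical polygon for the under-surveyed area:

# hypothetical polygon delimiting an insufficiently surveyed area
# (replace with your own polygon, e.g. one drawn interactively with terra::draw()):
undersurveyed <- terra::vect("POLYGON ((30 40, 60 40, 60 70, 30 70, 30 40))", crs = "EPSG:4326")

# cut that area out of the buffer (note this also discards the occurrence records that fall within it):
occs_buff_cut <- terra::erase(occs_buff_d, undersurveyed)
terra::plot(occs_buff_cut, col = "yellow", border = "orange")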

So, my latest approach consists of finding clusters of occurrence points that are within a given distance of one another; and then defining a varying buffer radius of, e.g., the width or the mean pairwise distance among the points in each cluster, down-weighted by the distance to other points or clusters (as an indicator of isolated observations in under-surveyed areas). This way, the modelling background will avoid areas that were most likely not surveyed, yet without discarding the occurrence records from those areas. I’ve implemented several variations of this and related approaches in the new getRegion() function of R package fuzzySim (version 4.26):

# try regions with two different methods:

reg1 <- fuzzySim::getRegion(occs, type = "inv_dist")
terra::plot(cntry, lwd = 0.2, add = TRUE)
terra::plot(reg1, lwd = 4, border = "orange", add = TRUE)

reg2 <- fuzzySim::getRegion(occs, type = "clust_width", width_mult = 0.5)
terra::plot(cntry, lwd = 0.2, add = TRUE)
terra::plot(reg2, lwd = 4, border = "orange", add = TRUE)
[Figure: modelling regions delimited with getRegion() using the "inv_dist" and "clust_width" methods]

Read the function help file for all the different options available, and try them out to see what feels best for your study system. Feedback and bug reports welcome!

Getting marine polygon maps in R

Another frequent question from my students is how to obtain a polygon map of the seas and oceans, rather than the land polygons (countries, etc.) that are commonly imported with R spatial data packages. In most cases you can simply use the land polygons and perform the opposite operation of what you would do for terrestrial features — e.g., to colour the sea in a map of land polygons, use the 'background' argument instead of 'col'; to mask a raster map to the sea, use the land polygons in terra::mask() with inverse=TRUE; to select or crop features that overlap the sea, use terra::erase() with the land polygons instead. But you may still prefer or need an actual marine polygon.
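Here's a minimal sketch of that "opposite operation" approach, using the geodata countries map and a dummy global raster just for illustration:

library(terra)

# land polygons and a coarse dummy raster covering the globe:
cntry <- geodata::world(path = tempdir())
r <- terra::rast(res = 1)  # 1-degree global raster
terra::values(r) <- 1:terra::ncell(r)

# colour the sea using the 'background' argument:
terra::plot(cntry, col = "tan", background = "lightblue")

# keep raster values only over the sea, by masking out the land:
r_sea <- terra::mask(r, cntry, inverse = TRUE)
terra::plot(r_sea)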

There are marine polygon maps available for download, e.g. at Marineregions.org. But if you don’t need the separation into particular seas or oceans or EEZs, you can easily create a global marine polygon from a countries map in R:

# import a world countries map:
countries <- geodata::world(path = tempdir())
terra::plot(countries, col = "tan")

# make a polygon map delimiting the entire extent of the Earth
# (terra::ext() with no arguments defaults to the global lon-lat extent):
earth <- terra::vect(terra::ext(), crs = "EPSG:4326")
terra::plot(earth, col = "lightblue")
terra::plot(countries, col = "tan", add = TRUE)

# erase the countries (land parts) to get just the marine polygon:
marine <- terra::erase(earth, countries)
terra::plot(marine, col = "lightblue")
[Figure: world countries, full-Earth polygon, and the resulting marine polygon]

That's it! See also terra::symdif(), or terra::mask(inverse=TRUE). You can then crop the marine polygon with another polygon of your own or with a desired extent, e.g. terra::crop(marine, terra::ext(-20, 60, -40, 40)); and/or you can use the marine polygon to crop/mask other maps to the marine regions, e.g.:

# import a global bathymetry map:
bathy_source <- "https://gebco2023.s3.valeria.science/gebco_2023_land_cog.tif" # from https://gist.github.com/mdsumner/aaa6f1d2c1ed107fbdd7e83f509a7cf3
bathy <- terra::rast(bathy_source, vsi = TRUE)
# terra::plot(bathy) # slow

# crop bathymetry to a given extent:
bathy_crop <- terra::crop(bathy, terra::ext(110, 180, -50, 0))
terra::plot(bathy_crop, main = "Pixel values everywhere")
terra::plot(countries, add = TRUE)

# crop and mask bathymetry to keep values only on the marine polygon:
bathy_marine <- terra::crop(bathy_crop, marine, mask = TRUE)
terra::plot(bathy_marine, main = "Marine pixel values only")
[Figure: bathymetry cropped to the extent vs. masked to the marine polygon]

See also this previous post for how to further crop/mask to the near-shore raster values, including for particular continents or islands.

Extract raster values to points with bilinear interpolation

A student recently asked me how exactly the R terra::extract() function works when using method="bilinear" to get raster values for points. The help file rightly says that "With 'bilinear' the returned values are interpolated from the values of the four nearest raster cells", but this wasn't immediately clear without a visual example. So, I put together some code to illustrate it:

library(terra)

# create a simple raster map:
r <- rast(nrows = 3, ncols = 3, xmin = 0, xmax = 3, ymin = 0, ymax = 3)
values(r) <- 1:ncell(r)
plot(r)

# get the pixel centroids:
centr <- as.points(r)
plot(centr, add = TRUE)

# show pixel values on the map:
plot(r, main = "Pixel values")
text(centr, values(r))

[Figure: 3 x 3 raster with pixel values and centroids]

With terra::extract(), when a point falls exactly at the centre of a pixel, “bilinear” gives the same result as the “simple” method, i.e. the value of the pixel where the point exactly falls:

# get the middle pixel centroid:
pnt <- centr[5, ]
plot(pnt, col = "blue", pch = 4, cex = 2, add = TRUE)

extract(r, pnt, ID = FALSE, method = "simple")
extract(r, pnt, ID = FALSE, method = "bilinear")
# both 5

Now here’s a function to check what values are extracted for points at other parts of this raster map:

checkPoint <- function(dx = 0, dy = 0, col = "black") {
  # shift the central point by the given x and y offsets:
  pnt_shift <- shift(pnt, dx = dx, dy = dy)
  plot(pnt_shift, col = col, add = TRUE)
  # extract the (bilinearly interpolated) raster value at the shifted point:
  val <- extract(r, pnt_shift, ID = FALSE, method = "bilinear")
  # label the point with the extracted value:
  text(pnt_shift, val, pos = 3, col = col, cex = 0.8, halo = TRUE)
}

Let’s use this function to see what values we get from this raster (with method="bilinear") for different points:

plot(r)
text(centr, values(r))

checkPoint(dx = 0.2, col = "blue")
checkPoint(dx = 0.5, col = "purple")
checkPoint(dx = 0.8, dy = 0.2, col = "steelblue")
checkPoint(dx = 0.5, dy = -0.5, col = "magenta")
checkPoint(dx = -0.3, dy = 0.3, col = "darkgreen")
checkPoint(dx = -1.3, dy = 0, col = "darkturquoise")
checkPoint(dx = -1.3, dy = 0.3, col = "red3")
checkPoint(dx = -1.3, dy = 1.2, col = "cadetblue")
[Figure: values extracted with method="bilinear" for points at different positions in the raster]

So that's how this works! For example, the point shifted 0.2 cells to the right of the centre of the middle pixel (value 5) gets 0.8 × 5 + 0.2 × 6 = 5.2, as only the horizontally adjacent pixel (value 6) also contributes. Pretty cool, isn't it? Consider using "bilinear" rather than the default "simple" method when extracting raster values to points with 'terra'. And thanks to Robert J. Hijmans for such a great package!

Actual pixel sizes of unprojected raster maps

It is well known, though often dismissed, that the areas of spatial units (cells, pixels) based on unprojected coordinates (longitude-latitude degrees, arc-minutes or arc-seconds) are wildly inconsistent across the globe. Towards the poles, as the longitude meridians approach each other, the actual ground width of the pixels sharply decreases. So, for non-tropical regions, the actual area covered by these pixels can be quite far from the nominal "approximately 1 km at the Equator" (or 10 km, etc.) that is often referred to in biogeography and macroecology studies.
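As a quick back-of-the-envelope check (a rough sketch using the spherical approximation in which one degree of longitude spans about 111.32 km at the Equator and shrinks with the cosine of latitude):

# approximate east-west ground width of a 30 arc-second (1/120 degree) pixel:
res_deg <- 1 / 120
res_deg * 111.32 * cos(0 * pi / 180)   # ~0.93 km at the Equator
res_deg * 111.32 * cos(47 * pi / 180)  # ~0.63 km at ~47 degrees latitude (roughly Switzerland)

The north-south width stays roughly constant (~0.93 km), so a 30 arc-second pixel at that latitude covers about 0.93 × 0.63 ≈ 0.59 km² — consistent with the mean pixel area computed for Switzerland further below.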

The function below combines several tools from the ‘terra’ R package to compute the (mean or centroid) pixel area for a given raster map. By default, it also shows a map illustrating the latitudinal variation in the pixel areas.

pixelArea <- function(rast, # SpatRaster
                      type = "mean", # can also be "centroid"
                      unit = "m",  # can also be "km"
                      mask = TRUE,  # to use only non-NA pixels
                      map = TRUE) {

  # by A. Marcia Barbosa (https://modtools.wordpress.com/)
  # version 1.4 (17 May 2024)

  r <- rast

  stopifnot(inherits(r, "SpatRaster"),
            type %in% c("mean", "centroid"))

  r_size <- terra::cellSize(r, unit = unit, mask = mask)
  if (map) terra::plot(r_size, main = paste0("Pixel area (", unit, "2)"))
  areas <- terra::values(r_size, mat = FALSE, dataframe = FALSE, na.rm = FALSE)  # na.rm must be FALSE for areas[centr_pix] to be correct

  if (type == "mean") {
    out <- mean(areas, na.rm = TRUE)
    cat(paste0("Mean pixel area (", unit, "2):\n", out, "\n\n"))
    return(out)
  }

  if (type == "centroid") {
    r_pol <- terra::as.polygons(r * 0, aggregate = TRUE)
    centr <- terra::centroids(r_pol)
    if (map) {
      if (!mask) terra::plot(r_pol, lwd = 0.2, add = TRUE)
      terra::plot(centr, pch = 4, col = "blue", add = TRUE)
    }

    centr_pix <- terra::cellFromXY(r, terra::crds(centr))
    out <- areas[centr_pix]
    if (!is.finite(out)) message("The centroid of your region may not have a pixel value; consider using mask=FALSE, or type = 'mean'.")
    cat(paste0("Centroid pixel area (", unit, "2):\n", out, "\n\n"))
    return(out)
  }
}

You can source the function directly from GitHub with the first command below, which is followed by a usage example for a central European country:

source("https://raw.githubusercontent.com/AMBarbosa/unpackaged/master/pixelArea.R")
elev <- geodata::elevation_30s("Switzerland", path = tempdir())
terra::plot(elev)

terra::res(elev)
# 0.008333333 degree resolution
# i.e. half arc-minute, or 30 arc-seconds (1/60/2)
# "approx. 1 km at the Equator"

pixelArea(elev, unit = "km")
# Mean pixel area (km2):
# [1] 0.5892033

[Figure: map of pixel areas (km2) for the Switzerland elevation raster]

As you can see, the mean pixel area here is quite different from 1 square km… and Switzerland isn't even one of the northernmost countries! So, when using unprojected rasters (e.g. WorldClim, CHELSA climate, Bio-ORACLE, MARSPEC and so many others) in your analyses, consider choosing the spatial resolution that most closely matches the resolution you actually want in your study region, rather than the nominal resolution at the Equator, which may be far away.
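If you want pixels of roughly a given ground width in your study region, you can invert the same cosine-of-latitude approximation used above to find a suitable nominal (equatorial) resolution — a rough sketch:

# approximate east-west degree resolution needed for ~1-km-wide pixels at ~47 degrees latitude:
res_deg_1km <- 1 / (111.32 * cos(47 * pi / 180))
res_deg_1km           # ~0.0132 degrees
res_deg_1km * 3600    # ~47 arc-seconds (the north-south pixel extent is unaffected by latitude)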

You can read more details about geographic areas in the help file of the terra::expanse() function. Feedback welcome!

Weighted probability vs. favourability

Presence probability, typically obtained with presence-(pseudo)absence modelling methods like GLM, GAM, GBM or Random Forest, is conditional not only on the suitability of the environmental conditions, but also on the general prevalence (proportion of presences) of the species in the study area. So, a species with few presences will generally have low presence probabilities, even in suitable conditions, simply because its presence is indeed rare.
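A minimal illustration of this prevalence effect, using simulated data: an intercept-only binomial GLM (i.e. a model with no predictors) simply returns the species' overall prevalence as the predicted probability for every site:

# simulate a rare species: 10 presences among 200 sites
pres <- c(rep(1, 10), rep(0, 190))

# an intercept-only binomial GLM predicts the prevalence (0.05) everywhere:
mod0 <- glm(pres ~ 1, family = binomial)
unique(round(predict(mod0, type = "response"), 3))  # 0.05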

As species distribution modellers often want to remove the effect of unbalanced prevalence from model predictions, a common procedure is to pick only as many (pseudo)absences as there are presences, for a modelled prevalence of 50%, though this may imply a significant loss of data. Alternatively, most modelling functions (e.g. glm() of base R, but also functions that implement GAM, GBM, Random Forests, etc. in a variety of R packages) allow attributing different weights to presences and absences, although they typically do not do this by default. However, some modelling packages that use these functions, like ENMTools and biomod2, use their own defaults and silently apply different weights to presences and absences, to balance their contributions and thus produce prevalence-corrected suitability predictions. Beware: as per these packages' help files (which users should always read!), ENMTools always does this by default (see e.g. here), while biomod2 does it by default when the pseudoabsences or background points are automatically generated by the package, but not when they are provided by the user (see e.g. here).

A less compromising alternative may be the favourability function (Real et al., 2006; Acevedo & Real, 2012), which removes the effect of species prevalence from predictions of actual presence probability, without the need to restrict the number of (pseudo)absences to the same as the number of presences (i.e. without losing data), and without the need to alter the actual contributions of the data by attributing them different weights. Below is a simple and reproducible comparison in R between favourability, raw probability, and probability based on a model with down-weighted absences so that they balance the number of presences. I’ve used GLM, but this applies to other presence-(pseudo)absence models as well.
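For reference, favourability can be computed from the predicted probability (P) and the numbers of presences (n1) and absences (n0) in the modelled data, following Real et al. (2006). A minimal sketch of the formula in R (in practice, fuzzySim::Fav() implements this, plus several input options):

# favourability from probability: F = (P / (1 - P)) / (n1 / n0 + P / (1 - P))
favourability <- function(P, n1, n0) {
  (P / (1 - P)) / (n1 / n0 + P / (1 - P))
}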

library(fuzzySim)

data("rotif.env")
names(rotif.env)

spp_cols <- names(rotif.env)[18:47]
var_cols <- names(rotif.env)[5:17]

nrow(rotif.env)  # 291 sites
sort(sapply(rotif.env[ , spp_cols], sum))  # from 99 to 172 presences
sort(sapply(rotif.env[ , spp_cols], fuzzySim::prevalence))  # from 34 to 59%

species <- spp_cols[8]  # 8 for example; try with others too
species

npres <- sum(rotif.env[ , species])
nabs <- nrow(rotif.env) - npres
npres
nabs

prevalence(rotif.env[ , species])

# set weights as in weights="equal" of ENMTools::enmtools.glm():
weights <- rep(NA, nrow(rotif.env))
weights[rotif.env[ , species] == 1] <- 1
weights[rotif.env[ , species] == 0] <- npres / nabs
weights
sum(weights[rotif.env[ , species] == 1])
sum(weights[rotif.env[ , species] == 0])  # same

formula <- reformulate(termlabels = var_cols, response = species)
formula 

mod <- glm(formula, data = rotif.env, family = binomial)
modw <- glm(formula, data = rotif.env, family = binomial, weights = weights)

pred <- predict(mod, rotif.env, type = "response")
predw <- predict(modw, rotif.env, type = "response")
fav <- Fav(mod) # note favourability is only applicable to unweighted predictions

par(mfrow = c(1, 3))
plot(pred, predw, pch = 20, cex = 0.2)  # curve
plot(pred, fav, pch = 20, cex = 0.2)  # cleaner curve
plot(fav, predw, pch = 20, cex = 0.2)  # ~linear but with noise
[Figure: pairwise scatterplots of probability, weighted probability and favourability]
par(mfrow = c(1, 1))
plot(pred[order(pred)], pch = 20, cex = 0.5, col = "grey30", ylab = "Prediction")
points(predw[order(pred)], pch = 20, cex = 0.5, col = "blue")  # higher, as expected after down-weighting the unbalancedly numerous absences
points(fav[order(pred)], pch = 20, cex = 0.5, col = "salmon")  # higher like predw, but with less noise (more like the original pred)
legend("topleft", legend = c("probability", "weighted probability", "favourability"), pch = 20, col = c("grey30", "blue", "salmon"), bty = "n")
[Figure: sorted probability values, with the corresponding weighted probability and favourability]

As you can see, favourability and weighted probability (which serve the same purpose of removing the effect of unbalanced sample prevalence on model predictions) are highly similar. However, favourability does not alter the original data in any way (i.e., it lets the model weigh presences and absences proportionally to the numbers in which they actually occur in the data); and it provides less noisy results that are more aligned with the original (unweighted, non-manipulated) probability.

I’ve checked this for some species already, but further tests and feedback are welcome!

Removing absences from GBIF datasets

I often come across GBIF users who are unaware that the records available for a given taxon are not necessarily all presences: there’s a column named “occurrenceStatus” whose value can be “PRESENT” or “ABSENT”! The absence records can, of course, be removed with simple operations in R or even omitted from the download, but many users overlook or accidentally skip this crucial step, and then end up analysing species distributions with some absence records being used as presences.
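Filtering out the absences by hand is a one-liner — a minimal sketch, assuming a data frame 'occ' with the standard GBIF 'occurrenceStatus' column:

# keep only the presence records:
occ_pres <- occ[occ$occurrenceStatus == "PRESENT", ]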

To reduce the chances of this happening, I just added new arguments to the fuzzySim function cleanCoords() (which was explained in a previous post) to allow removing absence records too, when the user wants to filter only the presences. Here’s a usage example using aardvark occurrence records downloaded with the geodata package. Note that this requires installing the latest version of fuzzySim (4.9.9):

occ <- geodata::sp_occurrence(genus = "Orycteropus", species = "afer",
       fixnames = FALSE)
# NOTE: as per the function help file, if you use GBIF data, remember to check the data use agreement and follow guidance on how to properly cite the actual data sources!

names(occ)

occ_clean <- fuzzySim::cleanCoords(occ,
             coord.cols = c("decimalLongitude", "decimalLatitude"),
             uncert.col = "coordinateUncertaintyInMeters",
             uncert.limit = 10000,
             abs.col = "occurrenceStatus")

# 764 rows in input data
# 576 rows after 'rm.dup'
# 575 rows after 'rm.equal'
# 575 rows after 'rm.imposs'
# 575 rows after 'rm.missing.any'
# 575 rows after 'rm.zero.any'
# 570 rows after 'rm.imprec.any'
# 465 rows after 'rm.uncert' (with uncert.limit=10000 and uncert.na.pass=TRUE)
# 461 rows after 'rm.abs'
[Figure: map of the retained and removed aardvark occurrence records]

As you can see, besides some common biodiversity data issues such as duplicated or erroneous coordinates, additional records were removed because they represented absences. Hopefully this can help prevent their incorrect use as species presences. Feedback welcome!

Getting continent, mainland and island maps in R

Maps of continents, mainlands and islands can be useful, for example, for selecting areas — and then cropping or masking variables — for modelling a species’ distribution. Here’s a way to obtain such maps using the ‘geodata’ and ‘terra’ R packages:

# load required packages:
library(terra)
library(geodata)

# import a world countries map:
countries <- world(resolution = 5, path = "maps")  # you may choose a smaller (more detailed) resolution for the polygon borders, and a different folder path to save the imported map
head(countries)

# import a table with country codes and continents:
cntry_codes <- country_codes()
head(cntry_codes)

# add this table to the countries map attributes:
head(countries)
head(cntry_codes[ , 1:4])
countries <- merge(countries, cntry_codes, by.x = "GID_0", by.y = "ISO3", all.x = TRUE)
head(countries)

# plot the countries map coloured according to "continent":
plot(countries, "continent", lwd = 0.2, main = "Countries by continent")
[Figure: world countries coloured by continent]

# dissolve (aggregate) countries into a continents map:
continents <- aggregate(countries, by = "continent")
values(continents)
plot(continents, "continent", lwd = 0.2)
[Figure: continents map]
# note that each continent (colour) is a multi-part polygon including mainland and islands - see also:
plot(continents[1, ])

# disaggregate continent polygons, to then separate islands and mainlands:
continents <- disagg(continents)

# get a map of just the continent mainlands (largest polygons):
unique(continents$continent)
largest <- (order(expanse(continents), decreasing = TRUE))[1:length(unique(continents$continent))]
mainlands <- continents[largest, ]
plot(mainlands, "continent", lwd = 0.2, main = "Continent mainlands")
[Figure: continent mainlands]

# get a map of just the islands (i.e. polygons except mainlands):
islands <- erase(continents, mainlands)
plot(islands, "continent", lwd = 0.2, main = "World islands")
[Figure: world islands]
# you can then crop and mask a raster map to given islands or continents, e.g.:
elevation <- elevation_global(res = 10, path = "maps") # you may choose a smaller (more detailed) resolution or pixel size
afr_mainland <- subset(mainlands, mainlands$continent == "Africa")
elev_afr_mainland <- crop(elevation, afr_mainland, mask = TRUE)
plot(elev_afr_mainland, main = "Elevation in mainland Africa")
[Figure: elevation in mainland Africa]
# you can also compute a buffer around a given continent, and use it to crop marine layers to get only the near-shore waters of that continent:

afr_buff <- terra::buffer(afr_mainland, width = 200000)  # 200 km
plot(afr_buff, col = "darkblue", background = "lightblue")
plot(afr_mainland, col = "tan", add = TRUE)

# import a marine variable, e.g. bathymetry:
bathy_source <- "https://gebco2023.s3.valeria.science/gebco_2023_land_cog.tif" # from https://gist.github.com/mdsumner/aaa6f1d2c1ed107fbdd7e83f509a7cf3
bathy <- terra::rast(bathy_source, vsi = TRUE)

bathy_afr <- terra::crop(bathy, afr_buff, mask = TRUE)
plot(bathy_afr, col = hcl.colors(100, "blues"))
plot(countries, col = "tan", add = TRUE)
[Figure: bathymetry within a 200-km buffer around mainland Africa]

You can use the geodata::gadm() function (instead of world) if you want to download polygons for particular countries, and then apply similar procedures to separate islands from mainlands there.
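For example, a sketch along the same lines (using Indonesia just for illustration):

# download the country outline for a particular country (e.g. Indonesia):
idn <- gadm(country = "IDN", level = 0, path = tempdir())

# disaggregate into individual polygons and separate the largest one from the rest:
idn <- disagg(idn)
idn_largest <- idn[which.max(expanse(idn)), ]
idn_other <- erase(idn, idn_largest)

plot(idn_other, col = "tan", main = "Indonesia minus its largest polygon")
plot(idn_largest, col = "darkred", add = TRUE)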

Safe-and-simple cleaning of species occurrences

In my species distribution modelling courses, for a quick and safe removal of the most common biodiversity database errors, I've so far used functions from the scrubr R package, namely 'coord_incomplete', 'coord_impossible', 'coord_unlikely', 'coord_imprecise' and 'coord_uncertain'. There are other R packages for cleaning species occurrence data (e.g. CoordinateCleaner, biogeo, or the GUI-based bdclean), but they require (somewhat laborious) data formatting, and/or they can eliminate many valid records (e.g. near region centroids, towns or institutions, or records wrongly considered outliers) if their argument values are not mindfully tailored and their results not carefully inspected. As I need students to have at least most of the worst data errors cleaned before modelling, but can't spend much of their time and attention on something that isn't the main focus of the course, scrubr was my go-to option… until it was archived, both on CRAN and on GitHub! While I always advise students to do a more thorough data inspection for real work after the course, I implemented those no-fuss, safe-and-simple essential cleaning procedures in a new cleanCoords function in package fuzzySim. This function removes (any or all of) duplicate, missing, impossible, unlikely, imprecise or overly uncertain spatial coordinates.

Here’s a usage example, which requires having a recent version of fuzzySim and also the geodata package for downloading some GBIF records to clean:

library(fuzzySim)
library(geodata)

# download some species occurrences from GBIF:
occ <- geodata::sp_occurrence(genus = "Orycteropus", species = "afer", fixnames = FALSE)

# clean the occurrences:
names(occ)
occ_clean <- fuzzySim::cleanCoords(occ, coord.cols = c("decimalLongitude", "decimalLatitude"), uncert.col = "coordinateUncertaintyInMeters", uncert.limit = 10000)  # 10 km tolerance

# 761 rows in input data
# 574 rows after 'rm.dup'
# 573 rows after 'rm.equal'
# 573 rows after 'rm.imposs'
# 573 rows after 'rm.missing.any'
# 573 rows after 'rm.zero.any'
# 568 rows after 'rm.imprec.any'
# 462 rows after 'rm.uncert' (with uncert.limit=10000 and uncert.na.pass=TRUE)

Run help(cleanCoords) to see the arguments that you can turn on or off. As I always like my students to look at the results of every step of the analysis before moving on, I added a plot argument which is TRUE by default. So you’ll also get a figure like this, with the selected presences in blue and the excluded ones in red:

[Figure: map with the selected occurrence records in blue and the excluded ones in red]

Feedback welcome!

Lollipop chart

According to modern recommendations in data viz, lollipop charts are generally a better alternative to bar charts, as they reduce the visual distortion caused by the length of the bars, making it easier to compare the values. So, in the next versions of the ‘modEvA‘ and ‘fuzzySim‘ packages, functions that produce bar plots will instead (by default) produce lollipop charts, using the new ‘lollipop’ function which will be included in ‘modEvA‘. I know ‘ggplot2‘ produces great lollipop charts already, but I like to keep my package dependencies to a minimum, or else they become much harder to maintain… So here’s the new function:

#' Lollipop chart
#'
#' @param x a numeric vector.
#' @param names a vector of the same length as 'x' with the names to be plotted below the lollipops. If this argument is left NULL and 'x' has names, then these will be used.
#' @param ymin numeric value for the lower limit of the y axis. The default is zero. If set to NA, the minimum of 'x' will be used.
#' @param sticks logical value indicating whether the sticks of the lollipops should be drawn. The default is TRUE.
#' @param col colour for the lollipops.
#' @param grid logical, whether or not to add a grid to the plot. The default is TRUE.
#' @param cex numeric value indicating the size of the lollipops. Will be passed as 'cex' to 'points' and as 'lwd' to 'arrows' (the lines or lollipop sticks).
#' @param cex.axis numeric value indicating the size of the x and y axis labels.
#' @param las argument to pass to 'plot' indicating the orientation of the axis labels.
#' @param ... additional arguments that can be used for the plot, e.g. 'main'.
#'
#' @return This function produces a lollipop chart of the values in 'x'.
#' @export
#'
#' @examples lollipop(mtcars[,1], names = rownames(mtcars), las = 2, ylab = names(mtcars)[1], cex.axis = 0.6, main = "Lollipop chart")

lollipop <- function(x, names = NULL, ymin = 0, sticks = TRUE, col = "royalblue", grid = TRUE, cex = 1, cex.axis = 1, las = 1, ...) {
  # version 1.0 (5 Jan 2023)

  if (is.na(ymin))  ymin <- min(x, na.rm = TRUE)
  plot(c(ymin, x), axes = FALSE, type = "n", xlab = "", ...)
  if (grid) grid()
  if (is.null(names)) names <- names(x)
  axis(1, at = 1:length(x), labels = names, las = las, cex.axis = cex.axis)
  axis(2, ylim = c(ymin, max(x, na.rm = TRUE)), cex.axis = cex.axis)
  points(x, pch = 20, col = col, cex = cex)
  if (sticks)  arrows(x0 = 1:length(x), x1 = 1:length(x), y0 = 0, y1 = x, length = 0, col = col, lwd = cex)
}


And here's a use case with the 'mtcars' dataset of R, and a bar plot for comparison:

barplot(mtcars[,1], names = rownames(mtcars), las = 2, ylab = names(mtcars)[1], cex.names = 0.6, main = "Bar chart")
lollipop(mtcars[,1], names = rownames(mtcars), las = 2, ylab = names(mtcars)[1], cex.axis = 0.6, main = "Lollipop chart")
[Figure: bar chart vs. lollipop chart of the mtcars data]


'ymin' is zero by default, but you can set it to NA so that the axis starts at the minimum value of 'x':


lollipop(mtcars[,1], names = rownames(mtcars), las = 2, ylab = names(mtcars)[1], main = "Lollipop chart not starting at zero", ymin = NA)
[Figure: lollipop chart not starting at zero]

Feedback welcome before I update the packages!

Model evaluation with presence points and raster predictions

The Boyce index (Boyce et al. 2002) is often described as a presence-only metric for evaluating the predictions of species distribution (or ecological niche, or habitat suitability) models (e.g. Hirzel et al. 2006, Cianfrani et al. 2010, Bellard et al. 2013, Valavi et al. 2022). It measures the predicted-to-expected ratio of presences in each class of predicted values, i.e., whether classes / regions with higher predicted values have indeed higher proportions of presences than expected by chance. But, if there’s a proportion of presences in each class, then the rest of the class (i.e. the complementary proportion) must be… non-presences, which are equally necessary for the computation of this index (as for virtually any other model evaluation metric). We need both pixels with and pixels without presence records, i.e. both presences and (pseudo)absences, or the Boyce index cannot be computed: try using a raster map with values only in pixels with presence (e.g. my_raster <- mask(my_raster, my_presences)) and see what happens. The same applies to presence-background modelling methods like Maxent and ENFA, which use both the presence and the remaining (i.e. “non-presence”) pixels, and can’t produce proper predictions without non-presence pixels — you can also try this for yourself. So, they need the same data that presence-absence methods need. Whether or not the computation requires an explicit identification of the absences, or just presence points and a map of the entire study area, simply depends on how each method or metric is implemented in each particular software package.

The 'modEvA' R package computes the Boyce index and a range of other metrics that evaluate model predictions against presence/absence or presence/background data, including the area under the ROC curve (AUC), the area under the precision-recall curve, sensitivity, specificity, TSS, Cohen's kappa, Hosmer-Lemeshow goodness-of-fit, Miller calibration statistics, and several others. For all these metrics, the input can be either 1) a model object of one of a range of implemented classes, or 2) a pair of numeric vectors with the observed presence/absence and the corresponding predicted values, or 3) a set of presence point coordinates and a raster map of the predictions across the model evaluation region. Here are some usage examples:

install.packages("modEvA", repos = "http://R-Forge.R-project.org")
library(modEvA)
library(terra)

# import a raster map with a predictor variable:
elev <- rast(system.file("ex/elev.tif", package = "terra"))

# import a species' occurrence data for this area:
gbif <- geodata::sp_occurrence(genus = "Dendrocopos", species = "major", ext = elev)
head(gbif)
unique(gbif$occurrenceStatus)  # remove absences if any!
pres_coords <- gbif[ , c("lon", "lat")]
plot(elev)
points(pres_coords, pch = 20, cex = 0.2)

# get a data frame of the pixels with and without presences:
dat <- fuzzySim::gridRecords(elev, pres_coords)
head(dat)
points(dat[dat$presence == 0, c("x", "y")], col = "red", cex = 0.5)
points(dat[dat$presence == 1, c("x", "y")], col = "blue", cex = 0.5)

# compute a species distribution model (e.g. a GLM) from these data:
mod <- glm(presence ~ elevation, family = binomial, data = dat)
summary(mod)

# get a raster map with the predictions from this model:
pred <- predict(elev, mod, type = "response")
plot(pred, main = "Woodpecker presences and predictions")
points(pres_coords, pch = 20, cex = 0.2)
[Figure: woodpecker presences and model predictions]
# compute some model evaluation metrics
# using just these presence coordinates and raster predictions:

par(mfrow = c(2, 2), mar = c(5, 5, 2, 1))

Boyce(obs = pres_coords, pred = pred, main = "Boyce index")

AUC(obs = pres_coords, pred = pred, main = "ROC curve")

threshMeasures(obs = pres_coords, pred = pred, thresh = "maxTSS", measures = c("Sensitivity", "Specificity", "Precision", "TSS", "kappa"), standardize = FALSE, main = "Threshold-based")

AUC(obs = pres_coords, pred = pred, curve = "PR", main = "PR curve")
[Figure: Boyce index, ROC curve, threshold-based measures, and precision-recall curve]

So, any evaluation metric can be computed using “only presence points”, as long as the raster map of predictions includes both pixels that do and pixels that don’t overlap these points, the latter of which are necessarily taken as available unoccupied pixels by all of these metrics (Boyce index included).
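To see for yourself the point made at the top of this post, you can mask the prediction raster so that it keeps values only in the pixels overlapping the presence points, and then try computing the Boyce index again — a sketch reusing the objects created above:

# keep predicted values only in the pixels that contain presence points:
pres_pts <- terra::vect(pres_coords, geom = c("lon", "lat"), crs = "EPSG:4326")
pred_presonly <- terra::mask(pred, pres_pts)
plot(pred_presonly)

# now try the Boyce index without any non-presence pixels, and see what happens:
# Boyce(obs = pres_coords, pred = pred_presonly)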

Note you may get slightly different results with modEvA::Boyce() and with ecospat::ecospat.boyce(), because ecospat.boyce removes duplicate classes by default, though both functions allow controlling this with argument ‘rm.dup.classes‘ or ‘rm.duplicate‘, respectively (see help file of either function for details). Bug reports and other feedback welcome on this version of package ‘modEvA‘!

REFERENCES

Boyce, M.S., Vernier, P.R., Nielsen, S.E. & Schmiegelow, F.K.A. (2002) Evaluating resource selection functions. Ecological Modelling, 157, 281-300.

Bellard, C., Thuiller, W., Leroy, B., Genovesi, P., Bakkenes, M. & Courchamp, F. (2013) Will climate change promote future invasions? Global Change Biology, 19(12), 3740. https://doi.org/10.1111/GCB.12344

Cianfrani, C., Le Lay, G., Hirzel, A.H. & Loy, A. (2010) Do habitat suitability models reliably predict the recovery areas of threatened species? Journal of Applied Ecology, 47(2), 421-430. https://doi.org/10.1111/J.1365-2664.2010.01781.X

Hirzel, A.H., Le Lay, G., Helfer, V., Randin, C. & Guisan, A. (2006) Evaluating the ability of habitat suitability models to predict species presences. Ecological Modelling, 199(2), 142-152. http://www.sciencedirect.com/science/article/pii/S0304380006002468

Valavi, R., Guillera-Arroita, G., Lahoz-Monfort, J.J. & Elith, J. (2022) Predictive performance of presence-only species distribution models: a benchmark study with reproducible code. Ecological Monographs, 92(1), e01486. https://doi.org/10.1002/ECM.1486