Delimiting the modelling background for scattered, unevenly surveyed occurrence data

In species distribution modelling and ecological niche modelling (SDM & ENM), the region from which background or pseudoabsence points are drawn strongly influences model performance. This region should include enough localities for the model to assess the species’ (non-)preferences, but it should also be within the species’ reach AND reasonably evenly surveyed for it. Survey bias within the background or pseudoabsence region can seriously degrade model quality.

Yet real-life occurrence data (especially at the broad scale of species distributions) are often strongly biased, and it is rarely straightforward to define an adequate region around the existing points. Elegant criteria exist that work well in theory, but not under most real-world patterns of biodiversity data bias. Until recently, my preferred approach was a buffer around the known occurrences, with a radius of e.g. the mean pairwise distance between them (a rough measure of their spatial spread, or of the species’ spatial reach), or half the width of the area defined by the occurrences (usually better for elongated distributions or surveys):

# download some example occurrence records:

occs <- geodata::sp_occurrence("Rupicapra", "pyrenaica", args = c("year=2024"))


# map occurrence records:

occs <- terra::vect(occs, geom = c("lon", "lat"), crs = "EPSG:4326")
terra::plot(occs, cex = 0.2, ext = terra::ext(occs) + 4)
cntry <- geodata::world(path = tempdir())
terra::plot(cntry, lwd = 0.2, add = TRUE)
# note you should normally clean these records from errors before moving on!


# compute mean distance and width of occurrence records:

occs_mdist <- mean(terra::distance(occs))
occs_width <- terra::width(terra::aggregate(occs))


# compute buffers around occurrences using both metrics:

occs_buff_d <- terra::aggregate(terra::buffer(occs, width = occs_mdist))
occs_buff_w <- terra::aggregate(terra::buffer(occs, width = occs_width / 2))


# plot both buffers and occurrence records:

par(mfrow = c(1, 2))
terra::plot(occs_buff_d, col = "yellow", border = "orange", lwd = 3, ext = terra::ext(occs) + 4, main = "Mean distance")
terra::plot(cntry, lwd = 0.2, add = TRUE)
terra::plot(occs, cex = 0.2, add = TRUE)
terra::plot(occs_buff_w, col = "yellow", border = "orange", lwd = 3, ext = terra::ext(occs) + 4, main = "Half width")
terra::plot(cntry, lwd = 0.2, add = TRUE)
terra::plot(occs, cex = 0.2, add = TRUE)
[Figure: the occurrence records and the two buffers, one panel using the mean-distance radius ("Mean distance") and one using the half-width radius ("Half width")]

This may work well for relatively evenly surveyed species or regions, where we can reasonably assume that most of the buffered area has been (roughly evenly) surveyed, and that most places without presence records are likely absences. However, many species have isolated records in remote, understudied areas, with survey effort hugely imbalanced across the distribution range, so a buffer with a uniform radius will not work well:

occs <- geodata::sp_occurrence("Lutra", "lutra", args = c("year=2024"))

# everything else same code as above...
[Figure: the two uniform-radius buffers around the Lutra lutra occurrence records]

You can use smaller multipliers for the distance and width, but these uniform-radius buffers will always include large regions with only scarce and isolated occurrence records, where it cannot be assumed that the absence of records mostly reflects absence of the species. Normally I would use terra::erase() to remove from the buffer the regions that look insufficiently surveyed, but that implies discarding potentially valuable occurrence data.
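
For illustration, here is a minimal sketch of that erase approach; the 'undersampled' polygon below is just a hypothetical hand-drawn rectangle standing in for whatever region you judge to be insufficiently surveyed:

# hypothetical polygon delimiting an insufficiently surveyed region:
undersampled <- terra::vect("POLYGON ((60 15, 130 15, 130 60, 60 60, 60 15))", crs = "EPSG:4326")
# remove that region from the uniform-radius buffer computed above:
occs_buff_erased <- terra::erase(occs_buff_d, undersampled)
terra::plot(occs_buff_erased, col = "yellow", border = "orange", ext = terra::ext(occs) + 4)
terra::plot(cntry, lwd = 0.2, add = TRUE)
terra::plot(occs, cex = 0.2, add = TRUE)
# note the records that fall inside 'undersampled' are now left out of the background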

So, my latest approach consists of finding clusters of occurrence points that lie within a given distance of one another, and then defining a varying buffer radius for each cluster, e.g. the width or the mean pairwise distance among its points, down-weighted by the distance to other points or clusters (an indicator of isolated observations in under-surveyed areas). This way, the modelling background avoids areas that were most likely not surveyed, yet without discarding the occurrence records from those areas. I’ve implemented several variations of this and related approaches in the new getRegion() function of the R package fuzzySim (version 4.26):

# try regions with two different methods:

reg1 <- fuzzySim::getRegion(occs, type = "inv_dist")
terra::plot(cntry, lwd = 0.2, add = TRUE)
terra::plot(reg1, lwd = 4, border = "orange", add = TRUE)

reg2 <- fuzzySim::getRegion(occs, type = "clust_width", width_mult = 0.5)
terra::plot(cntry, lwd = 0.2, add = TRUE)
terra::plot(reg2, lwd = 4, border = "orange", add = TRUE)
[Figure: the regions delimited by getRegion() with each of the two methods, plotted over the country borders]

Read the function help file for all the different options available, and try them out to see what feels best for your study system. Feedback and bug reports welcome!

Actual pixel sizes of unprojected raster maps

It is well known, though often overlooked, that the areas of spatial units (cells, pixels) based on unprojected coordinates (longitude-latitude degrees, arc-minutes or arc-seconds) are wildly inconsistent across the globe. Towards the poles, as the meridians converge, the actual ground width of the pixels sharply decreases. So, away from the tropics, the actual area covered by these pixels can be quite far from the nominal “approximately 1 km at the Equator” (or 10 km, etc.) that is often quoted in biogeography and macroecology studies.
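
To get a feel for the magnitude of this effect, here is a quick back-of-the-envelope sketch assuming a spherical Earth of radius ~6371 km (the pixelArea function below relies on terra::cellSize for the actual computation):

# approximate east-west ground width of a 30-arc-second cell at different latitudes;
# the width shrinks roughly with the cosine of latitude
cell_deg <- 1 / 60 / 2                # 30 arc-seconds in degrees
km_per_deg <- 2 * pi * 6371 / 360     # ~111.2 km per degree of longitude at the Equator
cell_width_km <- function(lat) cell_deg * km_per_deg * cos(lat * pi / 180)
cell_width_km(0)    # ~0.93 km at the Equator
cell_width_km(47)   # ~0.63 km at Swiss latitudes
cell_width_km(70)   # ~0.32 km in northern Scandinavia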

The function below combines several tools from the ‘terra’ R package to compute the (mean or centroid) pixel area for a given raster map. By default, it also shows a map illustrating the latitudinal variation in the pixel areas.

pixelArea <- function(rast, # SpatRaster
                      type = "mean", # can also be "centroid"
                      unit = "m",  # can also be "km"
                      mask = TRUE,  # to use only non-NA pixels
                      map = TRUE) {

  # by A. Marcia Barbosa (https://modtools.wordpress.com/)
  # version 1.4 (17 May 2024)

  r <- rast

  stopifnot(inherits(r, "SpatRaster"),
            type %in% c("mean", "centroid"))

  r_size <- terra::cellSize(r, unit = unit, mask = mask)
  if (map) terra::plot(r_size, main = paste0("Pixel area (", unit, "2)"))
  areas <- terra::values(r_size, mat = FALSE, dataframe = FALSE, na.rm = FALSE)  # na.rm must be FALSE for areas[centr_pix] to be correct

  if (type == "mean") {
    out <- mean(areas, na.rm = TRUE)
    cat(paste0("Mean pixel area (", unit, "2):\n", out, "\n\n"))
    return(out)
  }

  if (type == "centroid") {
    r_pol <- terra::as.polygons(r * 0, aggregate = TRUE)
    centr <- terra::centroids(r_pol)
    if (map) {
      if (!mask) terra::plot(r_pol, lwd = 0.2, add = TRUE)
      terra::plot(centr, pch = 4, col = "blue", add = TRUE)
    }

    centr_pix <- terra::cellFromXY(r, terra::crds(centr))
    out <- areas[centr_pix]
    if (!is.finite(out)) message("The centroid of your region may not have a pixel value; consider using mask=FALSE, or type = 'mean'.")
    cat(paste0("Centroid pixel area (", unit, "2):\n", out, "\n\n"))
    return(out)
  }
}

You can source the function directly from GitHub with the first command below, which is followed by a usage example for a central European country:

source("https://raw.githubusercontent.com/AMBarbosa/unpackaged/master/pixelArea.R")
elev <- geodata::elevation_30s("Switzerland", path = tempdir())
terra::plot(elev)

terra::res(elev)
# 0.008333333 degree resolution
# i.e. half arc-minute, or 30 arc-seconds (1/60/2)
# "approx. 1 km at the Equator"

pixelArea(elev, unit = "km")
# Mean pixel area (km2):
# [1] 0.5892033

[Figure: map of pixel areas across the Swiss elevation raster, showing the latitudinal variation]

As you can see, the mean pixel area here is quite different from 1 square km… and Switzerland isn’t even one of the northernmost countries! So, when using unprojected rasters (e.g. WorldClim, CHELSA climate, Bio-ORACLE, MARSPEC and many others) in your analysis, consider choosing the spatial resolution that most closely matches the resolution you want in your study region, rather than the resolution at the Equator when the Equator is actually far away.

You can read more details about geographic areas in the help file of the terra::expanse() function. Feedback welcome!

Calculate zonal statistics from rasters in multiple zip files

This is a wrapper for the zonalFromZip function published in the previous post (reproduced below), for when you have multiple zip files, each containing multiple raster files (as in the WorldClim paleo-climate database), and you want to extract zonal statistics from them all automatically. To use it, you need the zonalFromZip function loaded, as well as the raster R package.

zonalsFromZips <- function(zip.files, zones.map, rast.file.ext = ".tif", aux.file.ext = NULL, verbosity = 1, ...) {
  # v2.0 (2018/07/11)
  results <- vector("list", length(zip.files))
  names(results) <- basename(tools::file_path_sans_ext(zip.files))
  for (f in 1:length(zip.files)) {
    message("\nUNZIPPING FILE ", f, " OF ", length(zip.files), " (", basename(zip.files[f]), ")...")
    results[[f]] <- zonalFromZip(zip.file = zip.files[f], zones.map = zones.map, rast.file.ext = rast.file.ext, aux.file.ext = aux.file.ext, verbosity = verbosity, ...)
  }; rm(f)
  message("\nFINISHED ALL!")
  return(results)
}  # end zonalsFromZips function

The result is a list of dataframes, each containing the zonal stats for one of the .zip files of rasters. Usage example:

LGM.zonals <- zonalsFromZips(zip.files = list.files("/home/joe/LGM", full.names = TRUE), zones.map = provinces)


Calculate zonal statistics from rasters in a zip file

Imagine you have a zipped folder with a bunch of raster maps containing variables (e.g. the ones you can download from WorldClim or from CliMond), and you need to calculate zonal statistics from each of these rasters. The zonalFromZip function, provided below, automates this process without the need to unzip the whole folder. It extracts one raster at a time from the .zip, imports it into R, calculates zonal statistics for your zones map (you specify the summary function, e.g. mean, plus any other argument accepted by the zonal function of the R raster package), and then deletes the unzipped file before unzipping the next one, therefore requiring minimal disk space.

zonalFromZip <- function(zip.file, zones.map, fun, rast.file.ext = ".tif", aux.file.ext = NULL, delete.unzipped = TRUE, verbosity = 2, ...) {
  # version 2.1 (updated 11/12/2018)
  # zip.file: path to the zip containing the raster maps to calculate zonal stats from
  # zones.map: map (in your R workspace) containing the spatial units to calculate zonal stats for; must be of class 'RasterLayer', 'SpatialPolygons' or 'SpatialPolygonsDataFrame' and have the same CRS (and resolution, if raster) as the maps in 'zip.file'
  # fun: function to calculate (e.g. mean)
  # rast.file.ext: file extension of the raster maps in 'zip.file'
  # aux.file.ext: file extension of the auxiliary files (e.g. ".hdr" for .bil raster files, or ".rdc" for Idrisi .rst files)
  # ...: additional arguments to pass to 'raster::zonal' (if 'zones.map' is a raster) or 'raster::extract' (if 'zones.map' is a SpatialPolygons* map), such as na.rm = TRUE

  require(raster)
  rast.files <- unzip(zip.file, list = TRUE)$Name
  var.names <- unique(tools::file_path_sans_ext(rast.files))
  n.var <- length(var.names)
  zonal.stats <- vector("list", length(var.names))
  names(zonal.stats) <- var.names

  for (i in 1:n.var) {
    if (verbosity >= 1) message("Getting variable ", i, " of ", n.var)
    if (verbosity >= 2) message(" - unzipping file...")
    unzip(zip.file, files = paste0(var.names[i], rast.file.ext))
    if (!is.null(aux.file.ext)) unzip(zip.file, files = paste0(var.names[i], aux.file.ext))
    var.rast <- raster(paste0(var.names[i], rast.file.ext))

    # cropping only applies when 'zones.map' is itself a raster:
    if (inherits(zones.map, "RasterLayer") && !compareRaster(var.rast, zones.map, stopiffalse = FALSE)) {
      if (verbosity >= 2) message(" - cropping to zones raster...")
      var.rast <- crop(var.rast, zones.map)
    }

    if (verbosity >= 2) message(" - calculating zonal stats...")
    if (inherits(zones.map, "RasterLayer"))
      zonal.stats[[i]] <- raster::zonal(var.rast, zones.map, fun = fun, ...)
    else if (inherits(zones.map, c("SpatialPolygons", "SpatialPolygonsDataFrame")))
      zonal.stats[[i]] <- raster::extract(var.rast, zones.map, fun = fun, df = TRUE, ...)
    else stop("'zones.map' must be of class 'RasterLayer', 'SpatialPolygons' or 'SpatialPolygonsDataFrame'")

    if (verbosity >= 2) message(" - deleting unzipped file...")
    if (delete.unzipped) {
      unlink(list.files()[grep(pattern = paste0(var.names[i], rast.file.ext), x = list.files())])
      if (!is.null(aux.file.ext)) unlink(list.files()[grep(pattern = paste0(var.names[i], aux.file.ext), x = list.files())])
    }  # end if delete.unzipped
  }  # end for i

  if (verbosity >= 1) message("Converting results to data frame...")
  zonal.stats <- as.data.frame(zonal.stats)
  zonal.stats <- subset(zonal.stats, select = c(1, seq(2, ncol(zonal.stats), 2)))  # keep the first 'zone' column plus each variable's value column
  colnames(zonal.stats)[1] <- "zone"
  colnames(zonal.stats)[-1] <- var.names
  if (verbosity >= 1) message("Finished!")
  return(zonal.stats)
}

Note that you need the raster R package installed for this, plus a map of the spatial units (zones) for which you want to extract the raster variables: either a raster or a SpatialPolygons* object, as described in the function comments. Usage examples:

LGM.CCSM4.utm10 <- zonalFromZip(zip.file = "LGM/CCSM4.zip", zones.map = utm10, rast.file.ext = ".tif", aux.file.ext = NULL)
head(LGM.CCSM4.utm10)
 
WClim.utm10 <- zonalFromZip(zip.file = "bio_2-5m_bil.zip", zones.map = utm10, rast.file.ext = ".bil", aux.file.ext = ".hdr")
head(WClim.utm10)

Example for several .zip files within a folder at once (DEPRECATED – see the zonalsFromZips function above instead):

# "LGM" is the folder containing the zip files to extract zonal stats from:
for (f in list.files("LGM")) {
  name <- paste("LGM", tools::file_path_sans_ext(f), "utm10", sep = ".")
  zonstat <- zonalFromZip(zip.file = paste("LGM", f, sep = "/"), zones.map = utm10, rast.file.ext = ".tif", fun = "mean")
  assign(name, zonstat)
}

Classify integer columns

The integerCols function below detects which numeric columns in a data frame contain only whole numbers, and converts those columns to integer class, so that they take up less space. It uses the multConvert function, which must be loaded too. Both functions are included in the latest version of the fuzzySim package (Barbosa, 2015).
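
In case you don't have fuzzySim loaded, here is a minimal sketch of what multConvert does; this is a simplified stand-in for the package function (which has a few more options), not its actual implementation:

# simplified stand-in for fuzzySim::multConvert:
# applies the 'conversion' function to the specified columns of 'data'
multConvert <- function(data, conversion, cols = 1:ncol(data)) {
  for (i in cols) data[ , i] <- conversion(data[ , i])
  return(data)
}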

integerCols <- function(data) {
  is.wholenumber <- function(x, tol = .Machine$double.eps ^ 0.5) {
    abs(x - round(x)) < tol
  }  # from ?is.integer examples
  all.cols <- 1:ncol(data)
  integer.cols <- rep(0, ncol(data))
  for (i in all.cols) {
    x <- na.omit(data[ , i])
    if (!is.numeric(x)) next
    if (!all(is.finite(x))) next
    if (all(is.wholenumber(x))) integer.cols[i] <- 1
  }
  multConvert(data, conversion = as.integer, cols = all.cols[integer.cols == 1])
}


Usage example:

dat <- data.frame(
  a = 1:10,
  b = as.numeric(1:10),
  c = seq(0.1, 1, 0.1),
  d = letters[1:10]
)
 
str(dat)  # b is classified as 'numeric' although it contains only whole numbers
 
dat2 <- integerCols(dat)
 
str(dat2)  # b now classified as 'integer'

References:

Barbosa A.M. (2015) fuzzySim: applying fuzzy logic to binary similarity indices in ecology. Methods in Ecology and Evolution, 6: 853-858

Extract a “jumping window” from a vector

Sometimes we need to extract and analyse parts of a dataset (e.g. a vector, or the columns or rows of a data frame) along a moving (a.k.a. sliding, rolling, running) window. The jumping.window function, included in the newly released DeadCanMove package (Barbosa et al. 2014), extracts a moving window with (currently) no overlap between windows, and with the possibility of a user-defined gap between windows:

jumping.window <- function(x, window.size, gap.size) {
  window.indices <- integer(0)
  for (i in 1 : window.size) {
    window.indices <- c(window.indices, seq(from = i, to = length(x), by = window.size + gap.size))
  }
  window.indices <- sort(window.indices)
  sampl.windows <- x[window.indices]  # subset the input vector with the window indices
  return(sampl.windows)
}  # end jumping.window function


Usage examples:

Paste the whole text of the function above in R, press enter, and then paste the following commands to check out how the function works:

jumping.window(x = 1:20, window.size = 1, gap.size = 0)
 
jumping.window(x = 1:20, window.size = 1, gap.size = 1)
 
jumping.window(x = 6:52, window.size = 5, gap.size = 10)
 
jumping.window(x = c(1, 3, 4, 6, 9, 10, 11), window.size = 3, gap.size = 2)
 
jumping.window(x = c("air", "bird", "cloud", "deep", "elephant", "goat"), window.size = 2, gap.size = 2)
 
 
# to extract jumping-window columns from a table:
 
head(CO2)
CO2jump <- CO2[ , jumping.window(x = 1:ncol(CO2), window.size = 2, gap.size = 1)]
head(CO2jump)

References:

Barbosa A.M., Marques J.T., Santos S.M., Lourenço A., Medinas D., Beja P. & Mira A. (2014) DeadCanMove: Assess how spatial roadkill patterns change with temporal sampling scheme. R package version 0.1 (available at http://deadcanmove.r-forge.r-project.org)

Convert geographical coordinates from degree-minute-second to decimal degrees

If you have latitude and/or longitude data in sexagesimal degrees, in the form degrees° minutes′ seconds″ hemisphere (e.g. 41° 34′ 10.956″ N  or  8° 37′ 47.1036″ W, with or without spaces in between), the dms2dec function can convert them to decimal degrees, which are usually required for mapping. This function is not included in a package, but it is in the “Supporting Information” of Zanolla et al. (2018), so please cite that paper if you use the function.

dms2dec <- function(dms, separators = c("º", "°", "\'", "\"")) {
  # version 1.0 (25 Sep 2013)
  # dms: a vector (or column) of latitude or longitude in degrees-minutes-seconds-hemisphere, e.g. 41° 34' 10.956" N (with or without spaces)
  # separators: the characters that are separating degrees, minutes and seconds in dms

  dms <- as.character(dms)
  dms <- gsub(pattern = " ", replacement = "", x = dms)
  for (s in separators) dms <- gsub(pattern = s, replacement = "_splitHere_", x = dms)

  splits <- strsplit(dms, split = "_splitHere_")
  n <- length(dms)
  deg <- min <- sec <- hem <- vector("character", n)

  for (i in 1:n) {
    deg[i] <- splits[[i]][1]
    min[i] <- splits[[i]][2]
    sec[i] <- splits[[i]][3]
    hem[i] <- splits[[i]][4]
  }

  dec <- as.numeric(deg) + (as.numeric(min) / 60) + (as.numeric(sec) / 3600)
  sign <- ifelse (hem %in% c("N", "E"), 1, -1)
  dec <- sign * dec
  return(dec)
}  # end dms2dec function 


Usage example, from a table called mydata:

[Table: example 'mydata' with Latitude and Longitude columns in degree-minute-second format]

mydata$latitude.decimal <- dms2dec(mydata$Latitude)

mydata$longitude.decimal <- dms2dec(mydata$Longitude)

REFERENCES:

Zanolla M., Altamirano M., Carmona R., De La Rosa J., Souza-Egipsy V., Sherwood A., Tsiamis K., Barbosa A.M., Muñoz A.R. & Andreakis N. (2018) Assessing global range expansion in a cryptic species complex: insights from the red seaweed genus Asparagopsis (Florideophyceae). Journal of Phycology, 54: 12-24

Turn a list of species by site into a species presence-absence table

Imagine you have a data frame with the species that are present in each locality, such as this:

locality      species
Cuba          Pancake bush
Sumatra       Pancake bush
Greenland     Red dwarf
South America Pancake bush
South America Tree whale
Antarctica    Red dwarf

…but what you need is a data frame with one species per column and their presence (1) or absence (0) in each of the sites (rows), such as this:

    locality Pancake.bush Red.dwarf Tree.whale
     Africa            1         0          0
 Antarctica            1         1          0
       Asia            1         0          0
  Australia            1         1          0
     Baffin            1         1          1
      Banks            1         1          0

You can use R’s table function for this, but it can take a bit of fiddling to get the result in an easily manageable format. The splist2presabs function, now included in the fuzzySim package (Barbosa 2014), does this in one step:

splist2presabs <- function(data, sites.col, sp.col, keep.n = FALSE) {
  # version 1.1 (7 May 2013)
  # data: a matrix or data frame with your localities and species (each in a different column)
  # sites.col: the name or index number of the column containing the localities
  # sp.col: the name or index number of the column containing the species names or codes
  # keep.n: logical, whether to get in the resulting table the number of times each species appears in each locality; if false (the default), only the presence (1) or absence (0) are recorded

  stopifnot(
    length(sites.col) == 1,
    length(sp.col) == 1,
    sites.col != sp.col,
    sites.col %in% 1 : ncol(data) | sites.col %in% names(data),
    sp.col %in% 1 : ncol(data) | sp.col %in% names(data),
    is.logical(keep.n)
  )

  presabs <- table(data[ , c(sites.col, sp.col)])
  presabs <- as.data.frame(unclass(presabs))
  if (!keep.n)  presabs[presabs > 1] <- 1
  presabs <- data.frame(row.names(presabs), presabs)
  names(presabs)[1] <- names(subset(data, select = sites.col))
  rownames(presabs) <- NULL
  return(presabs)
}  # end splist2presabs function

To try an example, load the splist2presabs function (above) and then do the following:

# get a set of localities and some fake species:
loc <- names(islands)
spp <- c("Pancake bush", "Tree whale", "Red dwarf")
# combine them in a data frame:
data <- data.frame(locality = sample(loc, 100, replace = TRUE), species = sample(spp, 100, replace = TRUE))
# take a look at the data:
head(data)
# turn these into a presence-absence data frame and check it out:
atlas <- splist2presabs(data, sites.col = 1, sp.col = 2)
head(atlas)
# if you'd rather have columns with shorter names, load the spCodes function and do:
data$spcode <- spCodes(data$species)
head(data)
atlas <- splist2presabs(data, sites.col = 1, sp.col = 3)
head(atlas)
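
For comparison, here is roughly what the base-R “fiddling” looks like for the same 'data' object created above (splist2presabs essentially wraps and tidies these steps):

tab <- table(data$locality, data$species)  # counts of each species per locality
presabs_base <- as.data.frame.matrix(tab)
presabs_base[presabs_base > 1] <- 1  # convert counts to presence (1) / absence (0)
head(presabs_base)  # note the localities end up as row names rather than a column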

[presented with Pretty R]

REFERENCES:

Barbosa A.M. (2014) fuzzySim: Fuzzy similarity in species’ distributions. R package, version 0.1.

Create species name codes

Often we need to work with data sets that have species in column names. Species names are often longer than the maximum length allowed, for example, in .dbf tables associated with maps in a GIS. We thus need to create codes for the species names, which should be short but also unambiguous (the species should be recognizable from its code) and unique (no two species may share the same code). The spCodes function, now included in the fuzzySim package (Barbosa, 2014), combines a specified number of characters from the genus and species (and optionally also the subspecies) names, and checks that the resulting codes are unique:

spCodes <- function(species, nchar.gen = 3, nchar.sp = 3, nchar.ssp = 0, sep.species = " ", sep.spcode = "") {
  # version 1.8 (7 May 2013)

  species <- as.character(species)
  splits <- strsplit(species, split = sep.species)
  gen <- sp <- vector("character", length(splits))
  for (i in 1:length(splits)) {
    gen[i] <- splits[[i]][1]
    sp[i] <- splits[[i]][2]
    if (is.na(sp[i]))  sp[i] <- ""
  }  # end for i

  abbrev <- function (string, nchar) {
    abv <- substr(string, start = 1, stop = nchar)
    return(abv)
  }  # end abbrev function
  gen.code <- abbrev(gen, nchar.gen)
  sp.code <- abbrev(sp, nchar.sp)
  spcode <- paste(gen.code, sp.code, sep = sep.spcode)

  if (nchar.ssp > 0) {
    ssp <- vector("character", length(splits))
    for (i in 1:length(splits)) {
      ssp[i] <- splits[[i]][3]
      if (is.na(ssp[i]))  ssp[i] <- ""
    }
    ssp.code <- abbrev(ssp, nchar.ssp)
    spcode <- paste(spcode, ssp.code, sep = sep.spcode)
  }

  if (length(unique(species)) != length(unique(spcode))) {
    data <- cbind(species, spcode)
    unique.data <- data[!duplicated(data[ , c("species")]), ]
    ordered.data <- unique.data[order(unique.data[, "spcode"]), ]
    duplicates <- duplicated(ordered.data[ , "spcode"])
    for (i in 1:length(duplicates)) {
      if (isTRUE(duplicates[i])) duplicates[i-1] <- TRUE
    }  # to include the original as well as the duplicate
    ordered.data.duplicates <- ordered.data[duplicates, ]
    print(ordered.data.duplicates)
    stop("Resulting species codes are not unique (see above); try a different combination of nchar.gen, nchar.sp (and nchar.ssp), or check that you have specified the correct sep.species.")
  }
  message("OK - no duplicated spcodes found.")
  return(spcode)
}  # end spCodes function


If you have a vector named “species” with your species names, with the genus and the specific name separated by a white space, load the function above and then type:

spcode <- spCodes(species, nchar.gen = 3, nchar.sp = 3, nchar.ssp = 0, sep.species = " ", sep.spcode = "")

This will create species codes with the first 3 characters from the genus and the first 3 characters from the specific name (although you can set any other numbers of characters, such as 1 + 5). If the resulting spcodes are not unique (i.e., if the number of species names is not the same as the number of resulting species codes), you will get an error message showing the duplicates and asking you to use another combination of numbers of characters for the genus and specific name.

If your input species binomials are separated by a character other than a space (e.g. a dot or an underscore), you can specify this with the sep.species argument; but note that all species names must be separated by the same character (e.g. ONE white space) for the function to work correctly. By default, there will be no character separating the two parts of the spcode, but you can use the sep.spcode argument to define one (e.g. sep.spcode = "_").
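
For instance, here is a small hypothetical example with binomials separated by underscores, coded with 1 genus character plus 5 species characters and an underscore between the two parts of the code:

species <- c("Rupicapra_pyrenaica", "Rupicapra_rupicapra", "Lutra_lutra")
spCodes(species, nchar.gen = 1, nchar.sp = 5, sep.species = "_", sep.spcode = "_")
# returns "R_pyren" "R_rupic" "L_lutra"
# species sharing the first characters of both genus and specific name would
# instead collide and trigger the duplicates error described above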

If you have a table called mydata with species names in columns 2 to 30, for example, you can turn them into species codes like this:

spcodes <- spCodes(names(mydata)[2:30], nchar.gen = 3, nchar.sp = 3, sep.species = " ", sep.spcode = "")

names(mydata)[2:30] <- spcodes


REFERENCES:

Barbosa A.M. (2014) fuzzySim: Fuzzy similarity in species’ distributions. R package, version 1.0.

Determine the percentage of data to reserve for model testing

Building on the work of Schaafsma & van Vark (1979), Huberty (1994) provided a heuristic (‘rule of thumb’) for determining an adequate proportion of data to set aside for testing species presence/absence models, based on the number of predictor variables used (Fielding & Bell 1997). The percentTestData function, now included in the fuzzySim package (Barbosa 2015), calculates this proportion as a percentage:

percentTestData <- function(nvar) {
  # v1.1 (28 Mar 2013)
  # heuristic ('rule of thumb') of Huberty (1994; in Fielding & Bell 1997) to determine the test/training data ratio for presence-absence models
  # nvar: number of predictor variables
  huberty <- 1 / (1 + sqrt(nvar - 1))
  return(round(100 * huberty, 1))
}


For example, if you’re building a model based on 15 variables, load the percentTestData function and then just type:

percentTestData(15)
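
With 15 predictor variables, Huberty’s heuristic gives 1 / (1 + sqrt(15 - 1)) ≈ 0.211, so the call above returns:

# [1] 21.1

i.e. roughly 21% of your data should be set aside for testing, leaving about 79% for model training.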

References

Barbosa A.M. (2015) fuzzySim: applying fuzzy logic to binary similarity indices in ecology. Methods in Ecology and Evolution, 6: 853-858

Huberty C.J. (1994) Applied Discriminant Analysis. Wiley, New York, 466 pp.
Schaafsma W. & van Vark G.N. (1979) Classification and discrimination problems with applications. Part IIa. Statistica Neerlandica 33: 91-126
Fielding A. H. & Bell J. F. (1997) A review of methods for the assessment of prediction errors in conservation presence/absence models. Environmental Conservation 24: 38-49