NHANES Activity using MIMS (Monitor-Independent Movement Summary)

Posted on September 24, 2025 by strictlystat

MIMS was introduced in the NHANES Study

A lot of the accelerometry data from ActiGraph devices is summarized
using ActiLife software into ActiGraph Activity Counts (AC). The NHANES
(National Health and Nutrition Examination Survey) used ActiGraph devices, collecting data on over 14,000 people in a nationally representative survey. Instead of AC, NHANES introduced MIMS, a Monitor-Independent
Movement Summary for activity data (John et al. 2019). The NHANES study released these summaries in
large (over 7Gb) SAS XPORT files (XPTs), found in the Examination data. As this was a
novel measure, Karas et al. (2022)
compared MIMS and other common summary measures, which was possible
since the MIMSunit package in R implements this method. In
2022, NHANES released their raw 80Hz accelerometer data, which was large
(over 1Tb compressed). We can use this raw data to compute MIMS using
the method described in John et al. (2019)
and compare it to the released data. **In short, we found that the
default MIMSunit::mims_unit
does not calculate the same result as released by NHANES; you must use
the MIMSunit::custom_mims_unit
function with allow_truncation = FALSE to get almost exact
results.**

Here we’re showing this fact with one participant. Note, this code
may take some time to download the data, but we will walk through each
step using a reprex approach
and the full code is available at https://github.com/muschellij2/HopStat/blob/gh-pages/NHANES_Activity_using_MIMS_(Monitor-Independent_Movement_Summary)/NHANES_Activity_using_MIMS_(Monitor-Independent_Movement_Summary).Rmd.

Package Loading

We’re going to load packages required for this analysis:

library(MIMSunit)
library(curl)
library(haven)
library(dplyr)
library(readr)
options(digits.secs = 3)

Set up

The code below can do all of this in a temporary directory. I used a
hard-coded directory since some of these files are big and take a while
to download/convert/run. If you want to do this in a temporary way, you
can set tdir = tempdir(). Otherwise, set tdir to be a directory on your machine
for the data to go.

Here we pick a random ID. This ID is technically in the NHANES
National Youth Fitness Survey (NNYFS), but the same processing and
analysis were applied to this data as in NHANES. An ID from this study
was chosen since it has only around 1700 participants, and so the XPT is
smaller to download and load into R.

id = "71917"

Downloading Raw Data

The raw data are given in a series of zipped tarballs, one for each
person, and the code below will download the tarball for the ID above.
The function below is a general function to download any of the tarballs
from the CDC FTP site. The
version argument is the folder on the FTP site
(pax_y, pax_g or pax_h), and
id is the ID for the person, also referred to as the
SEQN in NHANES documentation. The exdir
argument is where to put the downloaded files, and ... are
additional arguments passed to curl::curl_download
such as quiet = TRUE/FALSE.

# Download tarball
download_80hz = function(id, version, exdir = tempdir(), ...) {
  files = id
  tarball_ending = grepl("[.]tar[.]bz2", id)
  files[!tarball_ending] = paste0(files[!tarball_ending], ".tar.bz2")
  urls = paste0("https://ftp.cdc.gov/pub/", version, "/", files)
  outfiles = sapply(urls, function(x) {
    destfile = file.path(exdir, basename(x))
    if (!file.exists(destfile)) {
      curl::curl_download(x, destfile = destfile, ...)
    }
    destfile
  })
  outfiles
}
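
As a quick check of what the function will request, the URL construction can be run on its own without downloading anything. This is only a sketch that repeats the string handling from download_80hz; the build_80hz_url name is ours and is not part of any package.

```r
# Hypothetical helper: mirrors the URL logic in download_80hz, no download
build_80hz_url = function(id, version) {
  files = id
  # Append the extension only when the caller did not already supply it
  tarball_ending = grepl("[.]tar[.]bz2", id)
  files[!tarball_ending] = paste0(files[!tarball_ending], ".tar.bz2")
  paste0("https://ftp.cdc.gov/pub/", version, "/", files)
}

# Both forms of the ID resolve to the same URL
build_80hz_url(c("71917", "71917.tar.bz2"), version = "pax_y")
```

Passing either the bare SEQN or the full file name gives the same address, which is why the function can be called with whichever form is at hand.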

Here we download the tarball for the data (about 130Mb).

raw = download_80hz(id, "pax_y", quiet = FALSE, exdir = tdir)

Reading in the Raw data

Here we read the data and write it out as a CSV, so we do not have to
re-run the extraction from the tarball, which can take a little bit of
time, in future execution. The tarball has a series of CSV files, one
for each hour, and one log file. We exclude the log file and read in the
CSVs using readr.

file_data = file.path(tdir, paste0(id, ".csv"))
if (!file.exists(file_data)) {
  files = untar(raw, verbose = TRUE, exdir = tdir, list = TRUE)
  files = file.path(tdir, files)
  if (!all(file.exists(files))) {
    untar(raw, verbose = TRUE, exdir = tdir)
  }
  log_file = files[grepl("_log", files, ignore.case = TRUE)]
  #' Getting only the data files
  csv_files = files[!grepl("_log", files, ignore.case = TRUE)]

  df = readr::read_csv(
    csv_files,
    col_types = readr::cols(
      HEADER_TIMESTAMP = readr::col_datetime(),
      X = readr::col_double(),
      Y = readr::col_double(),
      Z = readr::col_double()
    )
  )
  readr::write_csv(df, file_data)
} else {
  df = readr::read_csv(
    file_data,
    col_types = readr::cols(
      HEADER_TIMESTAMP = readr::col_datetime(),
      X = readr::col_double(),
      Y = readr::col_double(),
      Z = readr::col_double()
    ),
    progress = FALSE,
    show_col_types = FALSE
  )
}

Raw Data

Here we can print out the data to see what it looks like. We have it
in a tibble so it will not print all records and the
options(digits.secs = 3) set above will show the
milliseconds for the time of the data:

df

# A tibble: 55,343,278 × 4
   HEADER_TIMESTAMP             X     Y     Z
   <dttm>                   <dbl> <dbl> <dbl>
 1 2000-01-07 13:25:00.000 -0.774 0.009 0.604
 2 2000-01-07 13:25:00.013 -0.739 0.018 0.569
 3 2000-01-07 13:25:00.024 -0.704 0.023 0.522
 4 2000-01-07 13:25:00.036 -0.748 0.023 0.449
 5 2000-01-07 13:25:00.049 -0.801 0.029 0.384
 6 2000-01-07 13:25:00.062 -0.83  0.032 0.34 
 7 2000-01-07 13:25:00.075 -0.845 0.038 0.346
 8 2000-01-07 13:25:00.088 -0.874 0.029 0.34 
 9 2000-01-07 13:25:00.100 -0.891 0.026 0.314
10 2000-01-07 13:25:00.113 -0.918 0.018 0.27 
# ℹ 55,343,268 more rows

We see the data is made up of time (HEADER_TIMESTAMP)
and the accelerometry values for the 3 axes, measured in g
(gravitational units, $g = 9.81 m/s^2$). The data is collected at 80Hz, so there are 80 rows
per second.
Note: the time column in the data is named
HEADER_TIMESTAMP, but the MIMSunit package assumes the time
column is called HEADER_TIME_STAMP (note the extra
underscore).

With the lubridate package, we can floor the time to the
second level, count the number of rows with that time value and see that
the majority are 80, showing 80Hz data, with some that do
not have a full second covered.

df %>% 
  dplyr::mutate(time = lubridate::floor_date(HEADER_TIMESTAMP, unit = "second")) %>%
  dplyr::count(time) %>% 
  dplyr::count(n, name = "n_samples")

# A tibble: 2 × 2
      n n_samples
  <int>     <int>
1    78         1
2    80    691790

Calculating MIMS

From this df data, we can calculate MIMS using the
MIMSunit package. This package has a function mims_unit,
which is the default way to calculate MIMS, and a custom_mims_unit
function, which allows more options. We will use both of these functions
to calculate MIMS and compare them to the released MIMS data from
NHANES. Note again the renaming of columns. This process interpolates
the data to 100Hz, runs an extrapolation procedure depending on the
dynamic range of the device ($\pm 6g$) for samples that reach the boundary, then applies a
4th-order Butterworth band-pass filter from 0.2 to 5Hz (limiting to
“typical” human activity), and then integrates the area under the curve
for each axis (MIMS per axis). The sum of these MIMS per axis gives the
MIMS for the epoch. This is done for each epoch, which in NHANES is 1
minute. See John et al. (2019) for more
details.
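
To make the last two steps concrete (area under the curve per axis, then the sum over axes), here is a toy sketch in base R. It is not the MIMSunit implementation: it skips the interpolation, extrapolation, and Butterworth filtering, the signals are made-up sine waves, and taking the absolute value before integrating is an assumption of this sketch so that positive and negative swings do not cancel.

```r
# Toy illustration only: this is NOT the MIMSunit implementation.
# It skips interpolation, extrapolation, and the Butterworth filter.
sr = 100                              # samples per second after interpolation
t = seq(0, 60, by = 1 / sr)           # time points for one 1-minute epoch

# Synthetic, already "filtered" signals in g (made-up amplitudes)
acc = data.frame(
  X = 0.5 * sin(2 * pi * 1 * t),
  Y = 0.2 * sin(2 * pi * 2 * t),
  Z = 0.1 * sin(2 * pi * 0.5 * t)
)

# Trapezoidal area under the absolute value of one axis
auc = function(t, v) {
  v = abs(v)
  sum(diff(t) * (head(v, -1) + tail(v, -1)) / 2)
}

per_axis = sapply(acc, auc, t = t)    # analogue of MIMS per axis
per_axis
total = sum(per_axis)                 # analogue of the minute-level MIMS
total
```

The point is only the shape of the computation: one area per axis per epoch, then a sum. Real MIMS values depend on the full pipeline described above.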

file_mims = file.path(tdir, paste0(id, "_MIMS.csv"))
if (!file.exists(file_mims)) {
  # MIMSunit requires specific naming
  run_df = df %>%
    dplyr::rename(HEADER_TIME_STAMP = HEADER_TIMESTAMP)
  mims = MIMSunit::mims_unit(run_df, epoch = "1 min", dynamic_range = c(-6L, 6L),
                             output_mims_per_axis = TRUE)
  # for printing
  mims = as_tibble(mims)
  readr::write_csv(mims, file_mims)
} else {
  mims = readr::read_csv(
    file_mims,
    col_types = readr::cols(
      HEADER_TIME_STAMP = readr::col_datetime(),
      .default = readr::col_double()
    ),
    progress = FALSE,
    show_col_types = FALSE
  )
}

mims

# A tibble: 11,530 × 5
   HEADER_TIME_STAMP       MIMS_UNIT MIMS_UNIT_X MIMS_UNIT_Y MIMS_UNIT_Z
   <dttm>                      <dbl>       <dbl>       <dbl>       <dbl>
 1 2000-01-07 13:25:00.000     27.3         8.41       11.7         7.13
 2 2000-01-07 13:26:00.000     41.2        14.9        14.8        11.5 
 3 2000-01-07 13:27:00.000     30.6        10.7        10.2         9.69
 4 2000-01-07 13:28:00.000      7.83        2.73        2.84        2.26
 5 2000-01-07 13:29:00.000     35.3        13.4         9.61       12.3 
 6 2000-01-07 13:30:00.000      6.81        2.08        1.71        3.02
 7 2000-01-07 13:31:00.000     27.0         8.09        9.33        9.59
 8 2000-01-07 13:32:00.000     30.2         9.12       11.8         9.31
 9 2000-01-07 13:33:00.000      9.71        2.60        3.61        3.50
10 2000-01-07 13:34:00.000     16.0         4.88        6.43        4.72
# ℹ 11,520 more rows

Calculating MIMS without truncation

Here we will run the same procedure, but with
allow_truncation = FALSE. According to the
documentation, if allow_truncation is TRUE, the algorithm will truncate very small MIMS-unit values to zero.
The threshold is based on
1e-04 * parse_epoch_string(epoch, sr) in the package source
code.

As the data is interpolated to 100Hz (so sr = 100), and
the epoch is 1 min, this threshold is:

1e-04 * MIMSunit::parse_epoch_string("1 min", sr = 100)

[1] 0.6

So any minute-level MIMS value below 0.6 is
truncated to 0 (when using the mims_unit default).
This is a very small value, but it does have an effect on the results,
as we will see below.
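
A small base R sketch of what this truncation does to a handful of values may help. The numbers are made up, and the exact comparison used inside MIMSunit is an assumption here; we simply zero any positive value below the threshold.

```r
# Threshold from the text: 1e-04 * (samples per 1-minute epoch at 100 Hz)
samples_per_epoch = 60 * 100
threshold = 1e-04 * samples_per_epoch

# Made-up MIMS values for five minutes
toy = c(0, 0.25, 0.59, 0.61, 7.83)

# Assumed rule for this sketch: positive values below the threshold become 0
truncated = ifelse(toy > 0 & toy < threshold, 0, toy)

data.frame(toy, truncated, difference = toy - truncated)
```

Only the low-activity minutes change, which is consistent with a very high correlation between the two versions alongside a few visible differences.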

file_mims_notrunc = file.path(tdir, paste0(id, "_MIMS_notrunc.csv"))
if (!file.exists(file_mims_notrunc)) {
  # MIMSunit requires specific naming
  run_df = df %>%
    dplyr::rename(HEADER_TIME_STAMP = HEADER_TIMESTAMP)

  # not allowing truncation
  mims_notrunc = MIMSunit::custom_mims_unit(
    run_df, epoch = "1 min",
    dynamic_range = c(-6L, 6L),
    output_mims_per_axis = TRUE,
    allow_truncation = FALSE)
  # for printing
  mims_notrunc = as_tibble(mims_notrunc)
  readr::write_csv(mims_notrunc, file_mims_notrunc)
} else {
  mims_notrunc = readr::read_csv(
    file_mims_notrunc,
    col_types = readr::cols(
      HEADER_TIME_STAMP = readr::col_datetime(),
      .default = readr::col_double()
    ),
    progress = FALSE,
    show_col_types = FALSE
  )
}

mims_notrunc

# A tibble: 11,530 × 5
   HEADER_TIME_STAMP       MIMS_UNIT MIMS_UNIT_X MIMS_UNIT_Y MIMS_UNIT_Z
   <dttm>                      <dbl>       <dbl>       <dbl>       <dbl>
 1 2000-01-07 13:25:00.000     27.3         8.41       11.7         7.13
 2 2000-01-07 13:26:00.000     41.2        14.9        14.8        11.5 
 3 2000-01-07 13:27:00.000     30.6        10.7        10.2         9.69
 4 2000-01-07 13:28:00.000      7.83        2.73        2.84        2.26
 5 2000-01-07 13:29:00.000     35.3        13.4         9.61       12.3 
 6 2000-01-07 13:30:00.000      6.81        2.08        1.71        3.02
 7 2000-01-07 13:31:00.000     27.0         8.09        9.33        9.59
 8 2000-01-07 13:32:00.000     30.2         9.12       11.8         9.31
 9 2000-01-07 13:33:00.000      9.71        2.60        3.61        3.50
10 2000-01-07 13:34:00.000     16.0         4.88        6.43        4.72
# ℹ 11,520 more rows

We see that the data is highly correlated regardless of options
used:

cor(mims$MIMS_UNIT, mims_notrunc$MIMS_UNIT)

[1] 0.9998119

But we do see some differences, with a maximum absolute difference
greater than 1, which is a meaningful amount on the MIMS scale:

hist(abs(mims$MIMS_UNIT - mims_notrunc$MIMS_UNIT))

MIMS released by NHANES

Here we download the MIMS data released from NHANES from the XPT
(1.6Gb – large).

file_mims_xpt = file.path(tdir, paste0(id, "_MIMS_XPT.csv"))
if (!file.exists(file_mims_xpt)) {
  url = "https://wwwn.cdc.gov/Nchs/Data/Nnyfs/Public/2012/DataFiles/Y_PAXMIN.xpt"
  destfile = file.path(tdir, basename(url))
  if (!file.exists(destfile)) {
    curl::curl_download(url, destfile = destfile, quiet = FALSE)
  }

  #' We can read in the XPT using `haven` and then subset the ID we need.  We chose the first ID
  pax = haven::read_xpt(destfile, n_max = 50000)
  stopifnot(id %in% pax$SEQN)
  #' Keep only the ids we need from above
  paxmims = pax %>%
    filter(SEQN %in% id) %>%
    select(SEQN, PAXDAYM, PAXDAYWM, PAXSSNMP, PAXTSM, PAXAISMM, PAXMTSM, PAXMXM, PAXMYM, PAXMZM)
  readr::write_csv(paxmims, file_mims_xpt)
} else {
  paxmims = readr::read_csv(
    file_mims_xpt,
    col_types = readr::cols(
      SEQN = col_character(),
      PAXDAYM = col_character(),
      PAXDAYWM = col_character(),
      .default = readr::col_double()
    ),
    progress = FALSE,
    show_col_types = FALSE
  )
}

This data doesn’t have any times in it so we can check to make sure
they have the same rows and simply bind them. This would likely need
more checking for other data, but here we see these agree. Aside: A
mapping from this to the date/time data would be helpful.

head(paxmims)

# A tibble: 6 × 10
  SEQN  PAXDAYM PAXDAYWM PAXSSNMP PAXTSM PAXAISMM PAXMTSM PAXMXM PAXMYM PAXMZM
  <chr> <chr>   <chr>       <dbl>  <dbl>    <dbl>   <dbl>  <dbl>  <dbl>  <dbl>
1 71917 1       6               0     60        0   27.3    8.41  11.7    7.13
2 71917 1       6            4800     60        0   41.2   14.9   14.8   11.5 
3 71917 1       6            9600     60        0   30.6   10.7   10.2    9.69
4 71917 1       6           14400     60       80    7.83   2.73   2.84   2.26
5 71917 1       6           19200     60        0   35.3   13.4    9.61  12.3 
6 71917 1       6           24000     60      640    6.81   2.08   1.71   3.02

Note, PAXMTSM is the MIMS value released by NHANES, and
the PAXMXM, PAXMYM, and PAXMZM columns are MIMS per axis (for example, PAXMXM is the X-axis MIMS).
PAXSSNMP is the time in samples from the start of the
measurement, so 0 is the first minute, 4800 is
the second minute (80Hz * 60 seconds), and PAXDAYM is the
day of measurement, so 1 is the first day, 2
is the second day, etc.
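
Since PAXSSNMP counts samples at 80Hz, it can be turned into a timestamp once a start time is known. The sketch below assumes the recording starts at the first raw timestamp shown earlier (2000-01-07 13:25:00 UTC); the variable names are ours.

```r
# Assumption: the recording starts at the first raw timestamp shown earlier
start_time = as.POSIXct("2000-01-07 13:25:00", tz = "UTC")
sample_rate = 80  # Hz

# First few PAXSSNMP values from the table above
paxssnmp = c(0, 4800, 9600, 14400, 19200, 24000)

# Seconds since start = sample index / sample rate
minute_start = start_time + paxssnmp / sample_rate
format(minute_start, "%Y-%m-%d %H:%M:%S")
```

For this participant the resulting times line up with the HEADER_TIME_STAMP values in the calculated MIMS tables, one row per minute starting at 13:25.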

We can see that this participant was observed for a little over 8
days:

max(paxmims$PAXSSNMP)/80/60/60/24

[1] 8.00625

which we can confirm in the raw data

range(df$HEADER_TIMESTAMP)

[1] "2000-01-07 13:25:00.000 UTC" "2000-01-15 13:34:50.963 UTC"

diff(range(df$HEADER_TIMESTAMP), units = "days")

Time difference of 8.00684 days

Compare Calculated MIMS to Released MIMS

Here we will check to make sure they have the same rows and simply
bind them. In practice, this may be off by 1 since calculated MIMS may
have a trailing value of -0.01 for minutes that are not
fully covered, but these will be dropped in the released MIMS:

stopifnot(nrow(paxmims) == nrow(mims))
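
If the row counts do differ by one because of a trailing incomplete epoch, a small helper can drop it before binding. This is a sketch on toy data; drop_trailing_partial is a hypothetical name, and treating any negative last value as the incomplete-epoch marker is an assumption based on the -0.01 value mentioned above.

```r
# Toy data: the calculated series has one extra, incomplete trailing epoch
calc = data.frame(MIMS_UNIT = c(27.3, 41.2, 30.6, -0.01))
released = data.frame(PAXMTSM = c(27.3, 41.2, 30.6))

# Hypothetical helper: drop a single trailing negative epoch if present
drop_trailing_partial = function(x, column = "MIMS_UNIT") {
  n = nrow(x)
  if (n > 0 && x[[column]][n] < 0) {
    x = x[-n, , drop = FALSE]
  }
  x
}

calc = drop_trailing_partial(calc)
stopifnot(nrow(calc) == nrow(released))
```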

We can check the first few rows to see indeed they have very similar
results:

head(paxmims %>% select(PAXMTSM))

# A tibble: 6 × 1
  PAXMTSM
    <dbl>
1   27.3 
2   41.2 
3   30.6 
4    7.83
5   35.3 
6    6.81

head(mims %>% select(HEADER_TIME_STAMP, MIMS_UNIT))

# A tibble: 6 × 2
  HEADER_TIME_STAMP       MIMS_UNIT
  <dttm>                      <dbl>
1 2000-01-07 13:25:00.000     27.3 
2 2000-01-07 13:26:00.000     41.2 
3 2000-01-07 13:27:00.000     30.6 
4 2000-01-07 13:28:00.000      7.83
5 2000-01-07 13:29:00.000     35.3 
6 2000-01-07 13:30:00.000      6.81

head(mims_notrunc %>% select(HEADER_TIME_STAMP, MIMS_UNIT))

# A tibble: 6 × 2
  HEADER_TIME_STAMP       MIMS_UNIT
  <dttm>                      <dbl>
1 2000-01-07 13:25:00.000     27.3 
2 2000-01-07 13:26:00.000     41.2 
3 2000-01-07 13:27:00.000     30.6 
4 2000-01-07 13:28:00.000      7.83
5 2000-01-07 13:29:00.000     35.3 
6 2000-01-07 13:30:00.000      6.81

Comparing to Default (with truncation): Differences!

First we will compare the default mims_unit function
results to the released NHANES MIMS. The maximum absolute
difference is large for MIMS, and the histogram shows many
small differences but also some large ones.

max(abs(mims$MIMS_UNIT - paxmims$PAXMTSM))

[1] 1.652

cor(mims$MIMS_UNIT, paxmims$PAXMTSM)

[1] 0.9998118

hist(abs(mims$MIMS_UNIT - paxmims$PAXMTSM))

Comparing to MIMS without Truncation: Almost Identical

Next we will compare the custom_mims_unit function
results with allow_truncation = FALSE to the released
NHANES MIMS. We see that the maximum absolute difference is very small,
and the histogram shows almost all differences are very small.

max(abs(mims_notrunc$MIMS_UNIT - paxmims$PAXMTSM))

[1] 0.02853675

cor(mims_notrunc$MIMS_UNIT, paxmims$PAXMTSM)

[1] 1

hist(abs(mims_notrunc$MIMS_UNIT - paxmims$PAXMTSM))
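
The same three checks (maximum absolute difference, correlation, and how many minutes disagree) can be wrapped in one helper so both comparisons are reported the same way. This is a sketch on made-up values; compare_mims and the 0.05 tolerance are our own choices, not part of MIMSunit or the NHANES documentation.

```r
# Hypothetical helper: summarise agreement between two MIMS series
compare_mims = function(calculated, released, tol = 0.05) {
  d = abs(calculated - released)
  c(
    max_abs_diff = max(d),
    n_over_tol = sum(d > tol),
    correlation = cor(calculated, released)
  )
}

# Made-up values: the third minute mimics an epoch that was truncated to 0
released = c(27.3, 41.2, 0.45, 7.83)
with_truncation = c(27.3, 41.2, 0, 7.83)

res = compare_mims(with_truncation, released)
res
```

With the real data, the calls would be compare_mims(mims$MIMS_UNIT, paxmims$PAXMTSM) and compare_mims(mims_notrunc$MIMS_UNIT, paxmims$PAXMTSM).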

Conclusions

To get MIMS nearly identical to those released by NHANES, you need
to use the allow_truncation = FALSE argument in the
custom_mims_unit function. Though this will likely not
change results substantially, this is an important reproducibility note.
The authors likely did not implement truncation in their original method
and added it later, which would explain why it is not noted
in the NHANES or MIMSunit documentation. It would be
helpful if the NHANES documentation noted this truncation issue; we
released this code to help researchers understand differences they may
see when trying to reproduce the released data.

References

John, Dinesh, Qu Tang, Fahd Albinali, and Stephen Intille. 2019.
“An Open-Source Monitor-Independent Movement Summary for
Accelerometer Data Processing.” Journal for the Measurement
of Physical Behaviour 2 (4): 268–81.

Karas, Marta, John Muschelli, Andrew Leroux, Jacek K Urbanek, Amal A
Wanigatunga, Jiawei Bai, Ciprian M Crainiceanu, and Jennifer A Schrack.
2022. “Comparison of Accelerometry-Based Measures of Physical
Activity: Retrospective Observational Data Analysis Study.”
JMIR mHealth and uHealth 10 (7): e38077.

Getting the EPA Walkability Index in R

Posted on December 5, 2024 by strictlystat

Objective

In this quick tutorial, we will go over how to get the EPA Walkability Index for an address in the USA. We will show how to map the address to the Federal Information Processing Standard (FIPS) code and then map that code into the Walkability Index.

The Walkability Index

The Walkability Index is derived from the Smart Location Database, which has a number of additional measures for each census block, which may be relevant to research. Websites like Zillow use a different service, called the Walk Score, which is another data resource that has APIs, but is not covered here.

You can see the information for the walkability index at https://www.epa.gov/smartgrowth/national-walkability-index-user-guide-and-methodology. Namely, the National Walkability Index Methodology and User Guide PDF is a good explanation of how the index is created and derived. Specifically, the index is derived by:

And the index is scored and categorized using the following formula:
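
As a sketch of that formula, assuming our reading of the EPA methodology is right, the index is a weighted sum of four ranked variables (each ranked 1 to 20): intersection density (D3B) and proximity to transit (D4A) are each weighted 1/3, and employment mix (D2B) and employment and household mix (D2A) are each weighted 1/6. The nat_walk_index helper is ours; the check uses the ranked values returned by the map service for the two Dallas block groups queried later in this post.

```r
# Hypothetical helper: weighted sum of the four ranked inputs (each 1 to 20)
nat_walk_index = function(d2a_ranked, d2b_ranked, d3b_ranked, d4a_ranked) {
  d3b_ranked / 3 + d4a_ranked / 3 + d2b_ranked / 6 + d2a_ranked / 6
}

# Ranked values copied from the two Dallas block groups queried later in the post
walk = nat_walk_index(
  d2a_ranked = c(6, 3),
  d2b_ranked = c(14, 10),
  d3b_ranked = c(15, 12),
  d4a_ranked = c(17, 14)
)
walk
# Compare with the NatWalkInd column for the same rows: 14.00000 and 10.83333
```

The weights reproduce the published NatWalkInd values for both rows, which gives some reassurance that the weighting is right for this version of the data.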

Getting the Walkability Index

The walkability index is provided via an ArcGIS Map Service/MapServer, located at https://geodata.epa.gov/arcgis/rest/services/OA/WalkabilityIndex/MapServer, where the service documentation can be viewed.

Note that the MapServer has endpoints other than the Walkability Index, but the Walkability Index is indexed as 0, so the URL for the Walkability endpoint is https://geodata.epa.gov/arcgis/rest/services/OA/WalkabilityIndex/MapServer/0. This endpoint lists the fields of the data. We focus on the GEOID10/GEOID20 fields, which are the FIPS codes for different Census years, and NatWalkInd (type: esriFieldTypeDouble, alias: Walkability Index), which is the walkability index.

I chose the 2010 Census FIPS (GEOID10) because the other field was described inconsistently: the documentation said 2019 ACS/Census fields, but the field name said 2018.

Using `arcgis` Package to Access the Walkability Index

The arcgis package allows us to easily interact with ArcGIS map servers and extract information from them in a nice, tidy way. The package can be installed via:

remotes::install_github("https://github.com/R-ArcGIS/arcgis")
# Alternative
install.packages("arcgis", repos = "https://r-arcgis.r-universe.dev")

Once installed, you can load the package and open the connection to the server via arc_open:

library(arcgis)
url <- "https://geodata.epa.gov/arcgis/rest/services/OA/WalkabilityIndex/MapServer/0"
(walk_arc <- arc_open(url))

Name: NationalWalkabilityIndex
Geometry Type: esriGeometryPolygon
CRS: 3857
Capabilities: Map,Query,Data

You can download the entire dataset via arc_select:

res = arc_select(walk_arc)

If you do not need the geometry/polygon/spatial information, you can set geometry = FALSE:

res = arc_select(walk_arc, geometry = FALSE)

In many cases, however, you can send a SQL WHERE statement to arc_select to subset the data. For example, we can filter the data where the FIPS Census 2010 is in a list of a few FIPS codes (the first ones returned by the full data):

(res = arc_select(walk_arc, where = "GEOID10 in ('481130078254', '481130078252')"))

Simple feature collection with 2 features and 182 fields
Geometry type: POLYGON
Dimension:     XY
Bounding box:  xmin: -10769250 ymin: 3880158 xmax: -10767950 ymax: 3881314
Projected CRS: WGS 84 / Pseudo-Mercator
       GEOID10      GEOID20 STATEFP COUNTYFP TRACTCE BLKGRPCE CSA                 CSA_Name  CBSA
1 481130078254 481130078254      48      113  007825        4 206 Dallas-Fort Worth, TX-OK 19100
2 481130078252 481130078252      48      113  007825        2 206 Dallas-Fort Worth, TX-OK 19100
                        CBSA_Name CBSA_POP CBSA_EMP CBSA_WRK  Ac_Total Ac_Water   Ac_Land   Ac_Unpr
1 Dallas-Fort Worth-Arlington, TX  7189384  3545715  3364458  73.59503        0  73.59503  73.59503
2 Dallas-Fort Worth-Arlington, TX  7189384  3545715  3364458 119.82991        0 119.82991 119.21420
  TotPop CountHU  HH P_WrkAge AutoOwn0   Pct_AO0 AutoOwn1    Pct_AO1 AutoOwn2p  Pct_AO2p Workers
1   1202     460 423    0.549       69 0.1631206       39 0.09219858       315 0.7446809     412
2    710     409 409    0.466        0 0.0000000      168 0.41075795       241 0.5892421     395
  R_LowWageWk R_MedWageWk R_HiWageWk R_PCTLOWWAGE TotEmp E5_Ret E5_Off E5_Ind E5_Svc E5_Ent E8_Ret
1          99         122        191    0.2402913     66     20      3      0     19     24     20
2          76         107        212    0.1924051     25      7      0      3     15      0      7
  E8_off E8_Ind E8_Svc E8_Ent E8_Ed E8_Hlth E8_Pub E_LowWageWk E_MedWageWk E_HiWageWk E_PctLowWage
1      3      0     15     24     0       4      0          21          27         18    0.3181818
2      0      3     13      0     0       2      0          10           4         11    0.4000000
       D1A       D1B       D1C   D1C5_RET   D1C5_OFF   D1C5_IND  D1C5_SVC D1C5_ENT   D1C8_RET   D1C8_OFF
1 6.250422 16.332625 0.8967997 0.27175749 0.04076362 0.00000000 0.2581696 0.326109 0.27175749 0.04076362
2 3.430799  5.955666 0.2097066 0.05871784 0.00000000 0.02516479 0.1258239 0.000000 0.05871784 0.00000000
    D1C8_IND  D1C8_SVC D1C8_ENT D1C8_ED  D1C8_HLTH D1C8_PUB      D1D D1_FLAG   D2A_JPHH D2B_E5MIX
1 0.00000000 0.2038181 0.326109       0 0.05435150        0 7.147222       0 0.15602837 0.8862639
2 0.02516479 0.1090474 0.000000       0 0.01677652        0 3.640506       0 0.06112469 0.8350147
  D2B_E5MIXA D2B_E8MIX D2B_E8MIXA D2A_EPHHM D2C_TRPMX1 D2C_TRPMX2 D2C_TRIPEQ D2R_JOBPOP D2R_WRKEMP
1  0.7633862 0.8554418  0.6620914 0.3489116  0.5262958  0.5859160 0.28712831 0.10410095  0.2761506
2  0.5699862 0.8316863  0.5544576 0.1970473  0.2484811  0.2713093 0.00203268 0.06802721  0.1190476
  D2A_WRKEMP   D2C_WREMLX      D3A     D3AAO     D3AMM    D3APO      D3B    D3BAO   D3BMM3   D3BMM4
1   6.242424 5.287423e-03 23.53490 0.0000000 10.655277 12.87962 115.9817 0.000000 60.87368  8.69624
2  15.800000 3.736299e-07 22.89337 0.7551371  2.859482 19.27875  80.1456 5.340904 10.68181 10.68181
    D3BPO3    D3BPO4    D4A D4B025      D4B050  D4C      D4D         D4E   D5AR   D5AE   D5BR  D5BE
1 34.78496 43.481198 362.10      0 0.000000000 4.33 37.65472 0.003602329 433601 303660 135362 53504
2 85.45446  5.340904 718.84      0 0.009516414 4.33 23.12611 0.006098592 386504 272135 236885 90089
         D5CR     D5CRI         D5CE     D5CEI         D5DR     D5DRI         D5DE     D5DEI D2A_Ranked
1 0.000397944 0.7858935 0.0003576452 0.8412986 0.0005250753 0.1846967 0.0004755985 0.1377067          6
2 0.000354720 0.7005311 0.0003205156 0.7539577 0.0009188875 0.3232213 0.0008008035 0.2318678          3
  D2B_Ranked D3B_Ranked D4A_Ranked NatWalkInd                                     Region Households
1         14         15         17   14.00000 Dallas-Fort Worth-Arlington, TX Metro Area        444
2         10         12         14   10.83333 Dallas-Fort Worth-Arlington, TX Metro Area        424
  Workers_1 Residents Drivers Vehicles White Male Lowwage Medwage Highwage W_P_Lowwage W_P_Medwage
1       412      1141  660.88      648   455  687      99     122      191   0.3181818   0.4090909
2       395       792  671.44       NA   662  384      76     107      212   0.4000000   0.1600000
  W_P_Highwage GasPrice  logd1a    logd1c  logd3aao logd3apo d4bo25 d5dei_1 logd4d UPTpercap
1    0.2727273      213 1.98106 0.6401681 0.0000000 2.630422      0       0      4        11
2    0.4400000      213 1.48858 0.1903778 0.5625469 3.009573      0       0      3        11
  B_C_constant  B_C_male B_C_ld1c B_C_drvmveh  B_C_ld1a B_C_ld3apo  B_C_inc1 B_C_gasp B_N_constant
1     0.962706 -0.027509 -0.08353   -0.241761 -0.018226  -0.145805 -0.068564 0.012247     2.113264
2     0.962706 -0.027509 -0.08353   -0.241761 -0.018226  -0.145805 -0.068564 0.012247     2.113264
  B_N_inc2 B_N_inc3 B_N_white B_N_male B_N_drvmveh B_N_gasp  B_N_ld1a  B_N_ld1c B_N_ld3aao B_N_ld3apo
1 0.328856 0.232304  0.030571 0.026792    0.093403 0.000627 -0.091362 -0.091362  -0.034381   -0.33084
2 0.328856 0.232304  0.030571 0.026792    0.093403 0.000627 -0.091362 -0.091362  -0.034381   -0.33084
  B_N_d4bo25 B_N_d5dei B_N_UPTpc C_R_Households C_R_Pop C_R_Workers C_R_Drivers C_R_Vehicles C_R_White
1  -0.648986 -0.328734  0.023496         928341 2606868     1163871     1759309      1394052   0.29134
2  -0.648986 -0.328734  0.023496         928341 2606868     1163871     1759309      1394052   0.29134
   C_R_Male C_R_Lowwage C_R_Medwage C_R_Highwage  C_R_DrmV NonCom_VMT_Per_Worker Com_VMT_Per_Worker
1 0.4930775   0.2137677   0.3395583    0.4466732 0.3934515              5.807667           21.68874
2 0.4930775   0.2137677   0.3395583    0.4466732 0.3934515              5.085922           21.37983
  VMT_per_worker VMT_tot_min VMT_tot_max VMT_tot_avg GHG_per_worker Annual_GHG Shape_Length Shape_Area
1       27.49641    11.44299     82.6363    25.65933       24.49930   6369.817     3703.093   424204.0
2       26.46575    11.44299     82.6363    25.65933       23.58099   6131.057     4183.884   690903.4
  OBJECTID SLC_score                       geometry
1        1  77.45096 POLYGON ((-10769247 3880758...
2        2  78.89864 POLYGON ((-10769134 3880319...
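
When the list of FIPS codes comes from a vector rather than being typed by hand, the WHERE clause can be assembled with paste0. This is a minimal sketch; make_where_in is a hypothetical helper of ours, not a function in the arcgis package.

```r
# Hypothetical helper: build "FIELD in ('a', 'b', ...)" from a character vector
make_where_in = function(field, values) {
  # Require digits only, so a stray quote cannot break the clause
  stopifnot(all(grepl("^[0-9]+$", values)))
  quoted = paste0("'", values, "'", collapse = ", ")
  paste0(field, " in (", quoted, ")")
}

where_clause = make_where_in("GEOID10", c("481130078254", "481130078252"))
where_clause
```

The resulting string is the same clause used in the query above and can be passed as the where argument of arc_select.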

We can now treat the sf object like a data.frame using dplyr verbs to subset the FIPS code and the walkability index:

library(dplyr)
res %>% 
  select(GEOID10, NatWalkInd)

Simple feature collection with 2 features and 2 fields
Geometry type: POLYGON
Dimension:     XY
Bounding box:  xmin: -10769250 ymin: 3880158 xmax: -10767950 ymax: 3881314
Projected CRS: WGS 84 / Pseudo-Mercator
       GEOID10 NatWalkInd                       geometry
1 481130078254   14.00000 POLYGON ((-10769247 3880758...
2 481130078252   10.83333 POLYGON ((-10769134 3880319...

Note, this still has the geometry of the polygons for spatial information/plotting.

If you do not want that, you can simply turn the data into a tibble and drop the geometry column:

res %>% 
  select(GEOID10, NatWalkInd) %>% 
  as_tibble() %>% 
  select(-any_of("geometry"))

# A tibble: 2 × 2
  GEOID10      NatWalkInd
  <chr>             <dbl>
1 481130078254       14
2 481130078252       10.8

Now we have the walkability index at the level of the FIPS code, but we still have only an address and don’t know its FIPS code. Enter the censusxy package.

Geocoding via the `censusxy` Package

The censusxy package can be installed via CRAN or GitHub and interfaces R with the U.S. Census Bureau Geocoding Tools. This allows us to geocode an address with the required information, including lat/lon information using the cxy_geocode function:

library(censusxy)
address = tibble(
  street = "1600 Pennsylvania Avenue NW",
  city = "Washington",
  state = "DC",
  zip = "20500"
)
cxy = address %>%
  cxy_geocode(street = "street", 
              city = "city",
              state = "state", 
              zip = "zip", 
              output = "full",
              return = "geographies",
              benchmark = "Public_AR_Current",
              vintage = "Census2010_Current")

Note we are using the Census2010_Current “vintage” here since we are mapping to the 2010 Census data. You can see the other vintages using cxy_vintages("Public_AR_Current") (which was having issues at the time of publication) or cxy_vintages("4"), which may work but is less robust if the ID of the benchmark changes. You can also view the vintages at https://geocoding.geo.census.gov/geocoder/geographies/onelineaddress?form.

We can see the additional columns added to the data set and specifically subset those from the geocoding:

colnames(cxy)

 [1] "street"              "city"                "state"               "zip"
 [5] "cxy_address"         "cxy_status"          "cxy_quality"         "cxy_matched_address"
 [9] "cxy_tiger_line_id"   "cxy_tiger_side"      "cxy_lon"             "cxy_lat"
[13] "cxy_state_id"        "cxy_county_id"       "cxy_tract_id"        "cxy_block_id"

cxy %>% select(starts_with("cxy"))

                                         cxy_address cxy_status cxy_quality
1 1600 Pennsylvania Avenue NW, Washington, DC, 20500      Match       Exact
                              cxy_matched_address cxy_tiger_line_id cxy_tiger_side   cxy_lon  cxy_lat
1 1600 PENNSYLVANIA AVE NW, WASHINGTON, DC, 20500          76225813              L -77.03654 38.89869
  cxy_state_id cxy_county_id cxy_tract_id cxy_block_id
1           11             1         6202         1031

Note, we do not get a 12-length FIPS code from this but can construct it from this information as follows (as per https://www.census.gov/programs-surveys/geography/guidance/geo-identifiers.html):

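make_fips15 = function(
    state,
    county,
    tract,
    block) {
  # Zero-pad each part: 2-digit state, 3-digit county, 6-digit tract, 4-digit block
  sprintf("%02.0f%03.0f%06.0f%04.0f",
          state,
          county,
          tract,
          block)
}
make_fips12 = function(...) {
  # The 12-character code is the first 12 characters of the 15-character code
  fips15 = make_fips15(...)
  substr(fips15, 1, 12)
}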

Note the use of sprintf to enforce zero padding when needed/appropriate. We created the FIPS length 15, which has the full block ID, but the GEOID10 uses the FIPS 12, which is a subset of the 15.
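
As a quick check of the padding logic, the construction can be run on the IDs the geocoder returned for the example address. The sketch below is a self-contained variant with a name of our own (fips12_from_parts); the as.numeric coercion is an addition so that IDs stored as character also work.

```r
# Same logic as make_fips12, written as one self-contained function
fips12_from_parts = function(state, county, tract, block) {
  # as.numeric lets the helper accept IDs stored as character or integer
  fips15 = sprintf("%02.0f%03.0f%06.0f%04.0f",
                   as.numeric(state),
                   as.numeric(county),
                   as.numeric(tract),
                   as.numeric(block))
  # Keep state (2) + county (3) + tract (6) + first block digit (1)
  substr(fips15, 1, 12)
}

# IDs returned by the geocoder for the example address
fips12_from_parts(state = 11, county = 1, tract = 6202, block = 1031)
# "110010062021"

# Character input with leading zeros gives the same code
fips12_from_parts("11", "001", "006202", "1031")
```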

We can now use this function to make the FIPS code and subset the data to the FIPS code and the walkability index:

(cxy = cxy %>% 
  dplyr::mutate(
    GEOID10 = make_fips12(
      cxy_state_id,
      cxy_county_id,
      cxy_tract_id,
      cxy_block_id)
  ) %>% 
  select(GEOID10, starts_with("cxy")))

       GEOID10                                        cxy_address cxy_status cxy_quality
1 110010062021 1600 Pennsylvania Avenue NW, Washington, DC, 20500      Match       Exact
                              cxy_matched_address cxy_tiger_line_id cxy_tiger_side   cxy_lon  cxy_lat
1 1600 PENNSYLVANIA AVE NW, WASHINGTON, DC, 20500          76225813              L -77.03654 38.89869
  cxy_state_id cxy_county_id cxy_tract_id cxy_block_id
1           11             1         6202         1031

We can get the final result using this filtering if we had not downloaded the full data:

(result = arc_select(walk_arc, where = paste0("GEOID10 = '", cxy$GEOID10, "'")))

Simple feature collection with 1 feature and 182 fields
Geometry type: POLYGON
Dimension:     XY
Bounding box:  xmin: -8578803 ymin: 4700308 xmax: -8571827 ymax: 4707388
Projected CRS: WGS 84 / Pseudo-Mercator
       GEOID10      GEOID20 STATEFP COUNTYFP TRACTCE BLKGRPCE CSA
1 110010062021 110010062021      11      001  006202        1 548
                                        CSA_Name  CBSA                                    CBSA_Name
1 Washington-Baltimore-Arlington, DC-MD-VA-WV-PA 47900 Washington-Arlington-Alexandria, DC-VA-MD-WV
  CBSA_POP CBSA_EMP CBSA_WRK Ac_Total Ac_Water  Ac_Land  Ac_Unpr TotPop CountHU HH P_WrkAge AutoOwn0
1  6151521  2806497  2713860 2844.286 1228.309 1615.977 581.9411     60      39 39        1        0
  Pct_AO0 AutoOwn1 Pct_AO1 AutoOwn2p Pct_AO2p Workers R_LowWageWk R_MedWageWk R_HiWageWk R_PCTLOWWAGE
1       0       39       1         0        0     108          32          33         43    0.2962963
  TotEmp E5_Ret E5_Off E5_Ind E5_Svc E5_Ent E8_Ret E8_off E8_Ind E8_Svc E8_Ent E8_Ed E8_Hlth E8_Pub
1  12322    146   2291    619   7763   1503    146    541    619   7005   1503   264     494   1750
  E_LowWageWk E_MedWageWk E_HiWageWk E_PctLowWage        D1A       D1B      D1C  D1C5_RET D1C5_OFF
1         964        3993       7365   0.07823405 0.06701709 0.1031032 21.17396 0.2508845 3.936824
  D1C5_IND D1C5_SVC D1C5_ENT  D1C8_RET  D1C8_OFF D1C8_IND D1C8_SVC D1C8_ENT   D1C8_ED D1C8_HLTH D1C8_PUB
1 1.063681 13.33984 2.582735 0.2508845 0.9296473 1.063681  12.0373 2.582735 0.4536541 0.8488831 3.007177
       D1D D1_FLAG D2A_JPHH D2B_E5MIX D2B_E5MIXA D2B_E8MIX D2B_E8MIXA D2A_EPHHM D2C_TRPMX1 D2C_TRPMX2
1 21.24098       0 315.9487  0.660679   0.660679 0.6762556  0.6762556 0.6034774  0.3871171  0.4167215
  D2C_TRIPEQ  D2R_JOBPOP D2R_WRKEMP  D2A_WRKEMP D2C_WREMLX      D3A    D3AAO    D3AMM    D3APO      D3B
1  0.3683783 0.009691488 0.01737731 0.008764811   0.371118 43.40688 7.922193 4.806263 30.67842 357.6078
    D3BAO   D3BMM3   D3BMM4   D3BPO3  D3BPO4   D4A    D4B025    D4B050    D4C      D4D      D4E   D5AR
1 33.2678 22.97062 12.67345 310.1034 122.774 77.78 0.4221862 0.8983763 665.33 263.5008 11.08883 344535
    D5AE   D5BR   D5BE         D5CR     D5CRI         D5CE     D5CEI        D5DR     D5DRI        D5DE
1 233310 668922 384773 0.0005755183 0.7895334 0.0004914899 0.7962962 0.001287299 0.7493346 0.001281091
      D5DEI D2A_Ranked D2B_Ranked D3B_Ranked D4A_Ranked NatWalkInd
1 0.7384541         13         15         20         20         18
                                                   Region Households Workers_1 Residents Drivers
1 Washington-Arlington-Alexandria, DC-VA-MD-WV Metro Area         38       108        58   51.04
  Vehicles White Male Lowwage Medwage Highwage W_P_Lowwage W_P_Medwage W_P_Highwage GasPrice     logd1a
1       NA    58   38      32      33       43  0.07823405   0.3240545    0.5977114      260 0.06486699
    logd1c logd3aao logd3apo d4bo25 d5dei_1 logd4d UPTpercap B_C_constant B_C_male  B_C_ld1c B_C_drvmveh
1 3.098919 2.188542 3.455636      0       1      5        67     2.608323 0.050198 -0.420489   -0.483538
  B_C_ld1a B_C_ld3apo  B_C_inc1 B_C_gasp B_N_constant B_N_inc2 B_N_inc3 B_N_white B_N_male B_N_drvmveh
1  0.13074  -0.728447 -1.225332 0.010679     3.281742 0.081456 0.079678 -0.015683 0.106342   -0.175908
   B_N_gasp  B_N_ld1a  B_N_ld1c B_N_ld3aao B_N_ld3apo B_N_d4bo25 B_N_d5dei B_N_UPTpc C_R_Households
1 -0.005019 -0.210591 -0.210591   0.079511  -0.184902  -0.534544  0.025933 -0.000537         284386
  C_R_Pop C_R_Workers C_R_Drivers C_R_Vehicles C_R_White  C_R_Male C_R_Lowwage C_R_Medwage C_R_Highwage
1  692683      284578    504467.9       232754 0.3657849 0.4744508   0.1778634   0.2618579    0.5602787
   C_R_DrmV NonCom_VMT_Per_Worker Com_VMT_Per_Worker VMT_per_worker VMT_tot_min VMT_tot_max VMT_tot_avg
1 0.9554406              2.022151           3.152144       5.174296    1.245001    150.5366    21.37389
  GHG_per_worker Annual_GHG Shape_Length Shape_Area OBJECTID SLC_score                       geometry
1       4.610298   1198.677     34053.64   19018783    61619  97.36804 POLYGON ((-8578803 4706193,...

And we can extract the Walkability score:

result %>% as_tibble() %>% select(NatWalkInd)

# A tibble: 1 × 1
  NatWalkInd
       
1         18

If you have multiple IDs, you can do something like:

ids = unique(cxy$GEOID10)
ids = paste0("'", ids, "'")
ids = paste(ids, collapse = ", ")
ids = paste0("(", ids, ")")
(where = paste0("GEOID10 in ", ids))

[1] "GEOID10 in ('110010062021')"

Or alternatively download the full data using arc_select and use dplyr to join the data.

Note, arc_select can be slow for large datasets, so you may want to subset the data using the where argument or use the geometry = FALSE argument to arc_select. Also, you can use more powerful SQL WHERE commands to subset the data.
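As a rough sketch of the multiple-ID approach, the string building above can be wrapped in a function, and long ID vectors can be split into chunks so each query stays short. Both helper names (make_in_clause, chunk_ids) and the chunk size are assumptions for illustration, and the second ID below is made up; each resulting clause could then be passed as the where argument of arc_select, as in the single-ID call earlier.

```r
# Hypothetical helper: build a SQL "IN" clause for a vector of IDs.
make_in_clause = function(ids, field = "GEOID10") {
  ids = unique(as.character(ids))
  # escape single quotes by doubling them
  ids = gsub("'", "''", ids, fixed = TRUE)
  paste0(field, " in (", paste0("'", ids, "'", collapse = ", "), ")")
}

# Hypothetical helper: split IDs into chunks of at most `size` elements
chunk_ids = function(ids, size = 100) {
  ids = unique(ids)
  split(ids, ceiling(seq_along(ids) / size))
}

# The second ID is invented for illustration; duplicates are dropped
ids = c("110010062021", "110010062021", "240054001001")
where = make_in_clause(ids)
where

# 250 fake IDs split into chunks of at most 100
chunks = chunk_ids(as.character(1:250), size = 100)
lengths(chunks)
```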

Conclusion

If you have access to people’s addresses or latitude/longitude, you can map them directly to the National Walkability Index, which may be a useful piece of neighborhood-level information for research or policy purposes.

More Metrics: Smart Location Database

Although the focus here was on the National Walkability Index, the Smart Location Database has a number of additional fields and data sets available to the public, which can be mapped in the same way. As per https://www.epa.gov/smartgrowth/smart-location-mapping:

The Smart Location Database summarizes more than 90 different indicators associated with the built environment and location efficiency. Indicators include density of development, diversity of land use, street network design, and accessibility to destinations as well as various demographic and employment statistics. Most attributes are available for all U.S. block groups.

The MapServer endpoint (https://geodata.epa.gov/arcgis/rest/services/OA/SmartLocationDatabase/MapServer) has a number of metrics and data sets that may be useful.

You can use the same arcgis package to access these data sets in the same way as the walkability index and geocode using censusxy to get the FIPS code.

PMI: The Loan with an Increasing Interest Rate

Posted on January 18, 2024 by strictlystat

Disclaimer

This is my opinion and not financial advice. Don’t use any of this information to make decisions.

Why post this?

This post is a little different from previous posts, which have focused mainly on data science. I’m trying to get back into blogging a bit in 2024, and I wanted to write about something I don’t really need data or code for. We bought a house a few years ago and I thought PMI was interesting from a financial perspective, so I wanted to share how I thought about it and why I tried to pay it off ASAP.

TL;DR

Separate your loan into 2 parts: amount required to remove PMI (PMI balance) and the rest. Looking at the interest + PMI as the total cost of the PMI balance shows that the effective APR for the PMI balance of the loan is much higher than your APR and you should pay it off over almost all other debts (even credit cards, but that may be a larger decision).

PMI: Mortgage Interest

What is PMI (Private Mortgage Insurance)? When you buy a house, if you don’t have a set % of the purchase price as a down payment (usually 20%), the bank will require you to have PMI, which is almost always a fixed amount that you pay on top of any interest on the loan. You can have PMI for the life of the loan, but in many cases you carry (i.e. pay) PMI until you reach a fixed threshold, usually when the loan balance is at 80% of the home value.

Note that the initial threshold is a down payment of 20% of the value of the home, but you can remove PMI when the loan balance is down to 80% of the value of the home (80% left). If housing prices stay relatively flat or moderately increase, 20% down payment or 80% left are the same or very close. If housing prices increase dramatically, you can potentially get a reappraisal and your loan amount, which hasn’t changed, may be a smaller percentage of the home value. Some lenders do not allow you to remove PMI based on reappraisals, or limit the amount of time from purchase you can remove PMI for, such as within 3 years. This restriction won’t allow you to eject PMI if the housing market goes up dramatically, like in the past few years. Regardless, depending on your restrictions or how long you own your home, it’s important to know when you can get rid of your PMI so you know the target you’re aiming for.
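A quick sketch of the reappraisal idea, using made-up numbers: the balance stays the same, but a higher appraised value lowers the loan-to-value ratio. The 0.80 threshold is the usual one mentioned above, and whether a lender accepts a reappraisal at all is up to the lender.

```r
# Loan-to-value ratio: remaining balance divided by the home value
ltv = function(balance, home_value) balance / home_value

balance = 90000
ltv_at_purchase = ltv(balance, home_value = 100000)
# hypothetical reappraisal after a sharp price increase
ltv_reappraised = ltv(balance, home_value = 115000)

c(purchase = ltv_at_purchase, reappraised = ltv_reappraised)
# TRUE means the balance is at or below 80% of the (re)appraised value
ltv_reappraised <= 0.80
```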

Why is PMI bad?

PMI isn’t bad per se, as it is necessary to get the loan if you do not have a 20% down payment, which can be a lot of money. Also, even if you have 20% to put down, you may want that money in an emergency fund, need it for home improvements, or need that money to furnish your new home. The reason I’m writing about PMI is that it is more of a removable fixed fee than part of a loan, which makes it somewhat unique in how the math works out on paying it off faster or slower. To demonstrate this, let’s look at how interest rates work on the loan and then show how PMI is very different.

Normal APR

For many loans, the relevant number is the annual percentage rate or APR. In finance, building wealth in its simplest terms is taking all your assets (things that generate value or keep value) and your liabilities (things that cost value or depreciate) and trying to maximize the APR for the assets and minimize APR on the liabilities. This largely is due to the wonder of compound interest. Easily understood liabilities are loans, such as car loans, credit card debt, and student loans. Easily understood assets are brokerage/stock market accounts, savings accounts, and homes. We’re only going to talk about things in direct money amounts to simplify things, so we’re not going to go into house appreciation/depreciation and will assume the house is worth the same over time.

Your Home Loan

For simplicity’s sake, let’s say your home costs $100,000 and you have to put $20,000 (20%) down to have a loan without PMI, but you are only going to put down 10% ($10,000). Let’s also assume your rate (APR) for your mortgage is 5%. Let’s say the PMI is $50/month, which is a small/moderate PMI in many cases. In some cases, PMI is > $100/month. Thus, we’re paying $600/year for PMI. We’re assuming a fixed rate loan (commonly 30 years), so the APR on the loan does not change.

We’re going to discuss mortgage payments without any other costs such as homeowners insurance or property taxes.

Amortization

Amortization is the gradual writing off of the initial cost of an asset, or in this case, gradually paying down the whole mortgage. Let’s look at the amortization schedule for this 30-year loan without PMI, and then we can break down how much PMI costs as an APR. Any payment to principal with PMI is inherently going to the PMI part and not the rest.

To calculate the monthly payment, we’ll use the formula:

$$A = P \, \frac{i (1 + i)^n}{(1 + i)^n - 1}$$

where A is the monthly payment, i is our monthly interest rate: 0.4167% (5%/12 months), n is our total number of months (360 months for 30-year), and P is the principal amount. We will look at putting in 3 different amounts for principal: $80K (20% down), $90K (10% down), and $100K (0% down).

Here we can calculate the monthly cost (without PMI) for each scenario:

i = monthly_interest_rate = 0.05/12
p = c(80000, 90000, 100000)
# n = 1:(30*12)
n = 30*12
pmi_cutoff = 80000
get_monthly = function(p) {
  round(p * (i * (1 + i)^n) / ((1 + i)^n - 1), digits = 2)
}
a = get_monthly(p)
names(a) = sprintf("%4.0f", p)
a

 80000  90000 100000 
429.46 483.14 536.82
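As a sanity check on the payment formula, here is a minimal sketch (the function name monthly_payment is mine, not from the code above) that applies the unrounded payment for all 360 months and confirms the balance ends at essentially zero. It also shows the total interest paid over the life of the $100K loan.

```r
# Monthly payment for principal P, monthly rate i, and n monthly payments
monthly_payment = function(P, i, n) {
  P * (i * (1 + i)^n) / ((1 + i)^n - 1)
}

P = 100000
i = 0.05 / 12
n = 30 * 12
A = monthly_payment(P, i, n)

# Each month: add interest on the balance, then subtract the payment
balance = P
for (k in seq_len(n)) {
  balance = balance * (1 + i) - A
}

round(A, 2)            # 536.82, matching the output above
round(balance, 6)      # essentially zero after the final payment
total_interest = A * n - P
round(total_interest, 2)
```

Because the schedule in this post rounds the payment and the interest to cents each month, its final balance will be off by a small amount rather than exactly zero.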

Although we see that there is a $53.68 difference between the $80K and $90K bank notes, we haven’t taken into account the PMI. The reason I bring this up is that I heard “It’s only $50/month, that’s like 1 fewer dinner out a month”, which I think is technically correct, but not the way I’d see this in the lens of assets and liabilities. Even accounting for PMI, it’s about $100 difference in payments, but it may be clear when we separate the mortgage balances below how this thinking can lead to a very expensive liability.

Amortization/Paydown Schedule

Let’s create an amortization schedule. We will take last month’s balance, multiply it by the monthly interest rate to get the interest accrued, add that to the balance, and deduct the monthly payment. Then we look at the last 6 months of year 3 (months 31 through 36):

schedule = matrix(ncol = 3, nrow = n+1)
schedule[1,] = p
colnames(schedule) = paste0("base_", p/1000)
for (irow in 2:(n+1)) {
  schedule[irow,] = schedule[irow-1,] - a + 
    round(schedule[irow-1, ] * monthly_interest_rate, 2)
}
schedule = as.data.frame(schedule)
schedule = schedule %>% 
  mutate(month = 0:(nrow(schedule)-1)) %>% 
  select(month, everything())
gt::gt(schedule %>% filter(month > 30 & month <= 36))

month	base_80	base_90	base_100
31	76826.07	86429.45	96032.79
32	76716.72	86306.43	95896.11
33	76606.91	86182.90	95758.86
34	76496.65	86058.86	95621.04
35	76385.93	85934.30	95482.64
36	76274.74	85809.22	95343.66

After 3 years of payments, we can see that a balance difference of almost $10K or $20K remains for the $10K or $0 down payments, respectively.

Let’s wrap this into a simple function that takes in the principal, the monthly interest rate, the value at which the PMI is removed and the PMI cost per month:

make_schedule = function(p, 
                         monthly_interest_rate = 0.05/12,
                         pmi_cutoff = 80000, 
                         pmi_value = 50) {
  schedule = as.data.frame(matrix(ncol = 3, nrow = n+1))
  colnames(schedule) = c("balance", "payment", "interest")
  schedule$balance[1] = p
  a = get_monthly(p)
  schedule = schedule %>% 
    dplyr::mutate(payment = a,
                  interest = 0L)
  # fill in the amortization schedule
  for (irow in 2:(n+1)) {
    schedule$interest[irow-1] = round(schedule$balance[irow-1] * monthly_interest_rate, 2)
    schedule$balance[irow] = schedule$balance[irow-1] - 
      a +
      schedule$interest[irow-1]
  }
  # add columns for the month, principal paid, indicator of PMI and PMI cost
  schedule = schedule %>% 
    dplyr::mutate(
      month = 0:(dplyr::n()-1),
      principal = payment - interest
    )
  schedule = schedule %>% 
    dplyr::mutate(
      has_pmi = balance > pmi_cutoff,
      pmi = ifelse(has_pmi, pmi_value, 0),
      total_cost = interest + pmi,
      effective_interest_rate = total_cost / balance, n = 1L)
  # schedule$effective_interest_rate[1] = schedule$effective_interest_rate[2]
  # We can see the effective APR (PMI + interest)
  schedule = schedule %>% 
    dplyr::mutate(
      effective_apr = scales::percent(effective_interest_rate * 12, 
                                      accuracy = 0.01),
      effective_interest_rate = scales::percent(effective_interest_rate,
                                                accuracy = 0.001),
    )
  # remove the last which happens due to rounding
  schedule = schedule %>% 
    filter(balance > 0)
  schedule = tibble::as_tibble(schedule)
  schedule
}

To show how the PMI affects the interest rate of the total loan over time, we will use a 10% down payment ($10K), so the principal is $90K:

m = make_schedule(90000, monthly_interest_rate = monthly_interest_rate)
gt::gt(head(m))

balance	payment	interest	month	principal	has_pmi	pmi	total_cost	effective_interest_rate	n	effective_apr
90000.00	483.14	375.00	0	108.14	TRUE	50	425.00	0.472%	1	5.67%
89891.86	483.14	374.55	1	108.59	TRUE	50	424.55	0.472%	1	5.67%
89783.27	483.14	374.10	2	109.04	TRUE	50	424.10	0.472%	1	5.67%
89674.23	483.14	373.64	3	109.50	TRUE	50	423.64	0.472%	1	5.67%
89564.73	483.14	373.19	4	109.95	TRUE	50	423.19	0.472%	1	5.67%
89454.78	483.14	372.73	5	110.41	TRUE	50	422.73	0.473%	1	5.67%

Here we can see the effective APR over time:

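m %>% 
  # effective_apr is stored as a formatted string; convert it back to a number to plot
  mutate(effective_apr = as.numeric(sub("%", "", effective_apr)) / 100) %>% 
  ggplot(aes(x = month, y = effective_apr)) + 
  scale_y_continuous(labels = scales::percent) +
  geom_step() + 
  labs(x = "Month", y = "Effective APR (including PMI)")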

plot of chunk unnamed-chunk-21

We see that the effective APR levels out once the PMI is taken off. While the PMI is in place, the effective APR doesn’t really change much relative to the total balance of the loan, but it does increase in the 3rd decimal of the percentage. This increase is small compared to the total balance of the loan, but I argue it is incorrect to look at things this way. Regardless, an effective APR of 5% vs 5.75% can still change decisions of where to allocate money: people may decide on paying the mortgage more compared to stocking away money in a savings account, bond, or other asset.

The issue with this thinking is that the PMI isn’t really on the whole loan, but only a portion of it, which dramatically changes how to think of it as a liability.

How to (correctly) think of PMI

The trick, at least for me, was to think of the whole mortgage broken up into 2 parts: the amount of the mortgage left that if I paid off it would remove PMI (PMI balance), and the rest of the mortgage (other balance/rest). In our example, the PMI balance would be $10,000 and the other balance would be $80,000. On the other balance of the mortgage, the APR on that $80K is 5%. On the PMI part, however, the APR is 5% + the rate induced by PMI.

When breaking the balances this way, which I believe is more accurate, we see the PMI balance as a considerably worse liability. This fact should not be surprising as the numerator of the APR (the total cost) has a fixed amount/fee (the PMI), but the denominator (PMI balance) shrinks each month.
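Written out as a worked equation, under the assumption that interest is split proportionally between the two balances (as the code below does) and ignoring rounding to cents. The symbols are my own notation: r is the mortgage APR, F the monthly PMI fee, and B_PMI the remaining PMI balance.

```latex
\text{effective APR}_{\text{PMI}}
  = 12 \cdot \frac{B_{\text{PMI}} \cdot \frac{r}{12} + F}{B_{\text{PMI}}}
  = r + \frac{12 F}{B_{\text{PMI}}}
```

With r = 0.05, F = 50, and B_PMI = 10,000, this gives 0.05 + 600/10,000 = 0.11, or 11%. As B_PMI shrinks, the second term grows without bound.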

Let’s look at the first few months and the effective APR on the PMI balance:

m = m %>% 
  mutate(pmi_balance = balance - pmi_cutoff,
         pmi_balance = ifelse(pmi_balance < 0, 0, pmi_balance),
         other_balance = balance - pmi_balance)
m = m %>% 
  mutate(pmi_balance_interest = round(pmi_balance / balance * interest,2),
         other_balance_interest = interest - pmi_balance_interest)
m = m %>% 
  mutate(
    pmi_balance_total_cost = pmi_balance_interest + pmi,
    effective_pmi_balance_apr = pmi_balance_total_cost/pmi_balance * 12,
    effective_other_apr = other_balance_interest/other_balance * 12,
    effective_pmi_balance_apr = scales::percent(effective_pmi_balance_apr),
    effective_other_apr = scales::percent(effective_other_apr))
data = m %>% 
  filter(has_pmi) %>% 
  select(balance, month, pmi_balance,
         effective_pmi_balance_apr, pmi_balance_interest, 
         pmi_balance_total_cost, effective_apr)
gt::gt(head(data))

balance	month	pmi_balance	effective_pmi_balance_apr	pmi_balance_interest	pmi_balance_total_cost	effective_apr
90000.00	0	10000.00	11.000%	41.67	91.67	5.67%
89891.86	1	9891.86	11.066%	41.22	91.22	5.67%
89783.27	2	9783.27	11.132%	40.76	90.76	5.67%
89674.23	3	9674.23	11.202%	40.31	90.31	5.67%
89564.73	4	9564.73	11.273%	39.85	89.85	5.67%
89454.78	5	9454.78	11.347%	39.40	89.40	5.67%

We see on that $10K the APR is effectively over 11%! Now, even with a great APR (such as 3%), with PMI, the rate is likely high enough that if you have the cash to remove it, it’s better to remove the liability than to keep the $10K asset, unless you can beat 11% (and tell me how).

Looking after 3 years of payments can give you a better idea of the effective APR on the remaining PMI balance (spoiler, it’s > 15%!):
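As a rough check, the sketch below computes the effective APR on the PMI balance directly from a loan balance. The function name pmi_balance_apr is hypothetical, and unlike the schedule code it does not round interest to cents, so results may differ slightly in the last decimals. The balance of 85809.22 is the month 36 value for the $90K loan from the schedule above.

```r
# Effective APR on the PMI balance (interest on that balance plus the PMI fee),
# assuming interest is split proportionally between the two balances
pmi_balance_apr = function(balance,
                           pmi_cutoff = 80000,
                           apr = 0.05,
                           pmi = 50) {
  pmi_balance = balance - pmi_cutoff
  stopifnot(all(pmi_balance > 0))
  monthly_interest = pmi_balance * apr / 12
  (monthly_interest + pmi) / pmi_balance * 12
}

apr_month_0  = pmi_balance_apr(90000)     # start of the loan
apr_month_36 = pmi_balance_apr(85809.22)  # after 3 years of payments
round(c(month_0 = apr_month_0, month_36 = apr_month_36), 4)
```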

We can see this effective APR for the PMI balance during the first 6 years:

data %>% 
  filter(month <= 72) %>% 
  mutate(effective_pmi_balance_apr = 
           as.numeric(sub("%", "", effective_pmi_balance_apr))/100) %>% 
  ggplot(aes(x = month, y = effective_pmi_balance_apr)) + 
  scale_y_continuous(label =  scales::percent) +
  geom_step() + 
  labs(x = "Month", y = "PMI Balance Effective APR")

plot of chunk unnamed-chunk-24

This APR increase isn’t shocking if you think of the fixed PMI fee becoming the majority of the numerator as the interest portion shrinks with the balance, but it is still shocking to see your APR climb like that.

And, most drastically, let’s look at the remaining PMI balance after we’ve made almost all of the payments needed to remove PMI:

The effective APR is > 100%. I don’t care what asset you have, you’re never going to beat that. In many cases, when looking at the small balance needed to remove PMI, it’s a clear-cut case to pay additional money to remove it, but it’s hard to see that effect as strongly towards the beginning of the loan.
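To put a number on this, here is a small sketch that inverts the simplified relationship effective APR = APR + 12 × PMI / PMI balance to find the PMI balance below which the effective APR exceeds a target. The function name is hypothetical and rounding to cents is ignored.

```r
# PMI balance below which the effective APR on that balance exceeds target_apr,
# using: effective APR = apr + 12 * pmi / pmi_balance
pmi_balance_threshold = function(target_apr, apr = 0.05, pmi = 50) {
  stopifnot(target_apr > apr)
  12 * pmi / (target_apr - apr)
}

threshold_100 = pmi_balance_threshold(1.00)  # target of 100% APR
round(threshold_100, 2)
```

Under these assumptions, once fewer than about $632 separate you from the PMI cutoff, the effective APR on that remaining amount is above 100%.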

Conclusion

Obviously, if you have enough money to get a mortgage without PMI, then do it. Many can’t, hence why they have PMI in the first place. I think the message of “have more money” is silly and unhelpful.

I think the interesting part of this is seeing the balance left to remove PMI and the additional remaining balance of the mortgage as 2 separate loans. Mentally and practically, I think it makes it clear that PMI is a really bad liability. Most surprisingly, it can be even a higher effective APR compared to credit cards! The big difference compared to credit cards and other APR-based products, however, is that PMI is not interest and does not compound per month. But that same fact that PMI is fixed makes for an interesting interest vehicle and warrants a closer inspection other than “it’s just another $50/month”. Though true, when looking at it through the lens of asset and liability interest rates, it should have a high priority for removal in your debt profile and increased priority over time, especially after you have saved up after the initial home purchase.

Some things I wish I knew about Grad School

Posted on October 1, 2020 by strictlystat

I'd like to thank Elizabeth Sweeney for some pointers on this post. All the thoughts are my own, so don't blame her for anything below.

A little about me

I am from a middle-class family from Southeast PA. We never had any serious money issues when I grew up that I remember, but I would not say we were flush. I went to a Catholic grade school until 6th grade, then I went to public school. I had an IEP (Individualized Education Program) and was involved in “gifted” classes. My high school was a good school with a number of advanced placement (AP) credits. We had guidance counselors and administrators that knew your name.

Both of my parents held full-time jobs my entire life and are tremendously hard working. Neither graduated from college, but my mother did work in admissions at a University, so she had a wealth of knowledge to navigate applying for school, finding scholarships, and also had the benefit of tuition benefits from her job for her institution and a number of reciprocating institutions.

I bring this up for a number of reasons: I was gifted academically, so graduate school was a drastic change and much harder than any education I had previously; I had support throughout my career that not everyone has; and I had someone with know-how of applying for undergraduate and graduate school. So while I'm a first generation college graduate, I had a leg-up on many others like me.

I'm talking about my one, singular graduate school experience. I'm also a white, cis-gender male with a generic first name. There are a number of hurdles I've never encountered. I focus a lot on money and resources throughout this post, but there are a number of things I haven't even considered. Please feel free to tell me some things you wish you knew about grad school on Twitter @StrictlyStat.

The goal

The goal of this is to discuss some of the things that I wish I knew about graduate school, especially the application process and the first year. I only went to one place for grad school, both my Master's and PhD, so these points may not be generalizable to all graduate schools. Also, Johns Hopkins was a tremendously supportive institution with respect to money and people's time in grad school. Blind spots still exist for all institutions. I hope to add comments on what institutions could make this easier or information on how students can help themselves (as many may not point this out at the time).

Many of these blind spots are not intentional, but due to not thinking about it (see Hanlon's razor). You'll realize that when you ask questions (which I promote throughout) and will hear a lot of “Oh, I've never thought of that. No one's ever asked that before, let me check” responses.

What is graduate school?

Graduate school is a time to learn how to do independent research and how to learn independently. You are a student, but really a partner in your own learning. Strike that, you own your own learning. Taking ownership over your own learning is a key difference between graduate school and all other education before it (at least in the US). In many cases, the classes and structure are there to give you a foundation and teach you how to learn. In essence, your program will give you the tools to teach yourself, not teach you everything you need to know. Also, many sets of required courses are not even the bare minimum – many programs expect you to take electives to specialize in your field. So while some classes “didn't teach me all I needed for this topic”, understand that it's partly on you to fill in the gaps. Now, if you believe egregious errors or sets of materials were omitted from a course, you may want to speak to the graduate program director or the chair or send an anonymous email.

Network

It's OK not to have an extensive network of connections starting out. Talking to your undergraduate advisor or another faculty member and leaning on their network is a great strategy. If you don't have those, reach out to faculty/students at other institutions. Most people are happy to spare some time or impart some wisdom (or at least assumed wisdom).

Reaching out to Faculty/Students

Before you apply or enroll somewhere, you should feel free to reach out to current faculty and students. Note, many faculty get a lot of requests like this. Be explicit and as clear as possible. I might say something like:

Dear Dr. XXX,

My name is John Muschelli – I am a prospective student for your department. Would it be possible to have a 30 minute call to discuss your research? I would like to know if YOUR DEPARTMENT is a good fit. I'm happy to converse via email if you do not have time.

Specifically, I'd like to know about YYY research and what students you advise?

Thank you,

John Muschelli
Phone: 555-800-8000

where you can leave off the “Specifically” line. This indicates that you don't want anything other than information, as opposed to a job, a position in the lab, a recommendation, etc. This leaves the door open for them to not respond (which you should consider when applying), or a short call where you have a prepared list of questions (that you should send them), or an email exchange. Don't spam the whole faculty – we talk and will ask “Did you get an email from so-and-so?” which will reduce the chance of discussion. I'll talk more about conferences, but it's also a great idea to do this if you are going to a conference and search the program for people from an institution you are considering. Don't overdo it though – these people will be at other conferences in the future and you don't want to be pushy.

Social Media

It's a digital age. Your social media can be used immensely for networking in a professional setting. If so, make sure it's professional. Use privacy settings in all applications. Reaching out to faculty/students on some platforms can provide a better interaction than email. It can be hit or miss with people, but it is another avenue. Also, make a website and probably a GitHub account. People will Google your name and you control what's on your site. Use the blogdown package, and buy a domain on Google (https://domains.google/) for $12/year (year 2020).

Get to know your classmates

Your classmates are your future network. Get to know them better than mere acquaintances. Some will become friends, but almost all will become future colleagues, collaborators, or contacts. If you go into academia, these contacts will be at places to send your students, either for faculty positions, internships, or jobs. My message (one from a friend) is this: Be generous and share with your classmates to at least get as many perspectives as possible and don't be an island – do group study sessions. You can't always be the best in every subject, so work with others. Graduate school is a time for independent research, but not alone research. Ask anyone, and they will tell you that no one works completely alone (at least in Biostatistics).

Money

The #1 thing I'd say I did not understand starting grad school was Money. And as the great Wu-Tang Clan said: C.R.E.A.M..

I didn't know what my value was with respect to getting paid. I didn't understand how to identify payroll issues. I didn't understand taxes. I also didn't know how to handle money, but that was a moot point because my budget was exactly “I need X dollars to live and let's get X+c dollars”, where c was some number where I could eat out a few times a month if I wanted to, while taking out loans for my Master's. Probably most importantly, I didn't understand professional expenses.

Applying to a program – look into a waiver

But before we get to expenses while in the program, let's start with first things first: applying to programs costs money. This cost includes the application fee, taking the GRE, maybe some GRE prep book, and maybe taking the exams multiple times. For the application fee, look on the website and see if there are any waivers. The waiver will likely be school-wide, less likely department-specific. If you can't find the waivers, ask the graduate program director (Google them), a departmental administrator, or the registrar. Also, ask the program director if you do not qualify, but believe you still should be eligible for one. Even low-cost application fees add up over applying to a number of schools.

How to make this easier – put on every application page where to apply for a waiver (don't bury it), including department website areas for prospective students, including explicit information about the process.

For example, simply adding:

Please apply for our program by December 1, 11:59PM. Here is the link to the application and a link to the application fee waiver. If you don't qualify, but have need that is not foreseen in the application fee waiver, please contact this.person@institution.edu.

You visiting a program – paid or not

You got a visit to a program, great! You should ask the graduate program director if travel and lodging are included. I have found that this commonly happens in PhD programs in Biostatistics/Statistics, but not really Master's programs. Many departments will book travel for you. Understand that a department should have funding for this, as recruitment is an essential function of a department, and not covering these costs will likely gate-keep many lower-income applicants or those without disposable income from their support. If they do not provide funding and you cannot pay for the visit, request an online visit.

Again, ask questions. Ask if this is provided if not clear on their website. Ask if you stay an extra day to see the city, is that on you or the department.

How to make this easier – For example, simply adding the following could clarify many questions before an email is ever drafted:

By Mid-January, you will get an email from us. That is sometimes February. If you don't hear back by February 1, please email this.person@institution.edu. There will be a few phone calls with faculty. We will then get back to you if you are being asked to visit by February 9. If you are asked to visit, we will pay for lodging and travel, scheduled for March 1. You will spend 1.5 days on campus, visiting faculty and students, seeing the campus, and then will have breakfast/brunch the next day followed by a tour of SOME AREA.

Many times these things exist, but behind some firewall, intranet, or are not indexed by search engines. I'm unsure if there are policy issues why sharing this would be an issue, but I think this is helpful for most students.

Professional Expenses

The example of travel/lodging being paid by the institution was something I would have never considered before. I'm not indicating academia is a “business” in the same sense as other corporations, but it is a profession, and therefore has professional expenses. Many departments have recruitment budgets, either for recruitment of faculty or students. Also, many times in your career in graduate school you'll need something for your research to go forward.

These include things like books for class, computers, computing costs, printing costs (it's real), software costs (Adobe Acrobat anyone?), etc. Some of these costs should be explicitly discussed with you. Others may be discussed only after you accept the program, but you can always ask about them.

Probably the most important message for this whole post is:

Asking for things is not only OK, but is expected

This message is for anyone in academia, including faculty. A colleague of mine encapsulated it well: ask yourself “Am I supposed to pay for this or should this be paid for by someone else”, where that someone else may be an advisor, a grant, the department, or the school. You'll never know without asking.

One of the first expenses is usually a new computer (if necessary) and books. Ask more senior students or faculty to borrow their books. You'll eventually want to buy the standard reference texts, but many books can be borrowed. For biostatistics, we are on our computers a lot of the day, and our program provided funds for a computer and books the first year. These funds were essential for first year success.

One of the largest costs is travel, usually for a conference. Ask if your institution pays for a conference a year, and if that includes travel and/or lodging. Does the institution book this for you or do you do it on your own? This distinction is crucial: if the institution has you book the travel and reimburses it, what is the timeline for reimbursement? You will have to float that money (interest free) until you're reimbursed, which can take a long time in some places. I didn't have a credit card coming into graduate school and realized that I needed one because I couldn't loan out these amounts out of pocket. See Get a Credit Card below for more discussion.

Similarly, does the school reimburse conference registration, which can be > $500, even for students, only after the conference has been attended? Again, ask your advisor/graduate program director what funds are available for this. Many student paper awards exist, but may not cover everything – who (if anyone) can fill the gap in costs? Also, does the institution have a per diem pay for food (which is destination-dependent) or do you submit receipts? Again, this fact can change how you do meals with colleagues from other institutions.

Although I recommend applying to departments and working with advisors where the research interests are aligned and personalities allow for good collaboration, also ask: what percentage of students in the department/lab go to 1 conference a year? What percentage go to more than 1 conference throughout their graduate school? Conferences are a huge networking tool depending on your career trajectory; you should know if departments value and support them for students (or if they do not).

How to make this easier – discuss these issues with visiting students and put a simple document (behind a firewall/intranet) for students as a reference:

Here are student award options: link, link, link, etc.

Students get $XXX first year for books/computers. You can/cannot buy tablets with this. If this is not used by the end of year one, it does/does not roll over.

When buying supplies for research, make sure you talk to XXX to be able to use the institution's tax exemption. We cannot reimburse tax.

Conferences are typically paid out for students by advisors. Please ask your advisor their policy.

Reimbursements can take multiple weeks to be paid out. There is a department card you can use; please ask XX before applying.

Reimbursements can take multiple weeks to be paid out. Costs of travel and registration can be burdensome; please determine if a credit card is right for you.

The last one gets sticky because credit card debt can be devastating if it gets out of control.

Get a credit card

I recommend getting a credit card for professional expenses. The reason is that many times you'll have known costs for the future, they will be outside of your budget, and you'll get reimbursed. That mix makes it possible to apply for many fee-free credit cards with sign up bonuses. These bonuses get applied when you pay a certain amount ($1000-$3000) in a certain time, such as 3 months. With conferences and travel (if you're booking), you'll almost surely pass that milestone, especially if you use it for other costs.

If you do not use credit cards or do not think you will use them wisely, I'd recommend only using one for travel. Credit card debt can cause long-term financial issues, and I don't want to say that credit cards are right for everyone. I would only recommend a credit card if you know you can pay it off in full by the end of the month. Thus, you should ask beforehand how long a reimbursement takes to go through. Also, timing when you make the purchase, such as right after the last credit card statement, can maximize your time to get reimbursed.

The perk is that you don't need to float hundreds of dollars for months in advance (conference registration is well in advance). If you cannot withstand this financial strain, talk to someone. I had situations where I said to an advisor or administrator “I don't have the money to front that” and I was accommodated. If accommodations are not available, and you do not want to borrow money (you should not have to), then I personally would withdraw my registration and submission. This scenario is probably the worst case, but if this ever happens, you should reach out to graduate program directors and the chair because small (relative to the department) financial issues are hurting the research and science, which people take very seriously.

There are a number of sites out there that tell you about what credit card is best for you, but I personally like NerdWallet. Again, make sure there's no annual fee for your first card as you're trying to save money here!

Moving Expenses

I haven't heard of any graduate students getting moving costs covered by a department, but those costs can be significant. I would highly recommend discussing this with the program director if you will be struggling to get to the new city/place. I echo this especially for post-docs: ask your advisor. Negotiation is usually possible.
These expenses are also relevant for new faculty (who may be a newly-minted PhD or post-doc). I posted a tweet about this and you can see what the responses were for reference. In many industry positions, moving costs are covered or are bundled into your signing bonus.

How to make this easier – I don't know how to rectify this blind spot without setting aside departmental funds. If you're joining as faculty, negotiate it in your offer.

Dinners with Faculty

Many times you will go out to dinner with faculty, usually with seminar speakers. These dinners should be provided by the department. These are amazing networking opportunities, especially when you're applying for a job on the market, and a way to explore the food scene in your area. I didn't know this, and I remember feeling uncomfortable going to my first seminar dinner because it was expensive and I got a small plate for dinner and said I wasn't hungry. I was relieved to see a faculty pick up the tab, and didn't know until later that students were not expected to pay. There will be situations where faculty ask you out to dinner and usually etiquette will dictate they pay, notably if they choose a place out of your price range. If they don't, my vote is to swallow that bill and then politely decline future invitations.

How to make this easier – Make a one-page document that discusses faculty/professional dinners for students. Is alcohol included (maybe reinforce codes of conduct)? How is transportation organized (usually day of and you may be on your own)? Who typically pays (usually the host)?

Mental Health

Mental health is a huge issue in graduate school. A lot of people in graduate school struggle with mental health issues, some caused by graduate school and transition, and many issues are exacerbated by the stress in grad school. The truth is that graduate school is hard. It's hard because research and trying to push on the limits of a field is hard. Stress, unfortunately, is usually a byproduct of this. I'm not trying to be insensitive and I'm not saying that stress can't be lessened or reduced, but I do not believe a stress-free graduate school experience is possible. A low-stress one is maybe possible, but rare. I do believe, however, that increased organization and structure could mitigate this stress.

Stress can come from a number of places. As someone who came straight from undergrad, the biggest changes for me that caused some stress were that 1) you will feel like an imposter, 2) you don't have a “boss”, and 3) there is no right answer. Now, I had a great grad school experience and never had to deal with harassment or abuse. These issues are serious and should be brought up to a faculty administrator immediately. Institutions take these issues very seriously and provide health (mental health is health) services to students. So please use any resources at your disposal.

You are (not) an imposter

I wrote a full post about imposter syndrome at https://hopstat.wordpress.com/2015/10/14/dealing-with-imposter-syndrome-in-graduate-school/, so this will be brief. Essentially, my argument is that you are likely comparing yourself to your more-advanced peers or, even worse, faculty based on skill and accomplishments. This comparison will make you feel inadequate. Also, let me say this again, graduate school is hard. I felt I was academically-inclined for most of my career and being the one that “didn't get it” was a change. I believe experiencing that drastic change helps people become better educators by having more empathy and being able to put themselves in students' shoes. I had to learn how to ask about 1000 more questions than I was used to. It took a lot of confidence reinforcement for me, but I came to the conclusion that I hated not knowing what I was doing more than any embarrassment from feeling like I was asking obvious questions. One of my life mottoes is “Never feel too stupid to ask a dumb question”.

I ask obvious questions all the time. It's to make sure me and the person presenting (or getting presented to) are on the same first page. If we're not, then there's no point in discussing things more. Also, as someone who teaches, many instructors yearn for engagement. Ask questions.

You don't have a boss

When I say “you don't have a boss” I mean it 2 ways: you don't have a “boss” and you don't have A boss. The first way indicates to me that you will not (usually) have someone asking where you are from 9-5 and checking your office. You may have to log hours, but maybe not. Some weeks you'll get a lot of work done, beating down your inner imposter. Now many advisors do regular check-ins, lab meetings, or are in the same physical space as you (as in a lab). But most times everyone is concerned with their own work and research. This lack of checking in is many times intentional – again, graduate school is a time to learn independent research. At this point in your life, you're an adult and will likely be treated accordingly or as an employee. The flip side of this is that you have immense control over your schedule and flexibility. This flexibility can be a curse if you're disorganized (such as me) or sometimes struggle with discipline in work (also sometimes me). I'm still learning that motivation is fleeting, organization wins championships.

Now, you don't have one person checking in on your work, but you may have several people checking in, or who are your “boss”. For example, you have an advisor, but also a chair. You may be a teaching assistant. Your teachers assign “work”. The graduate program director also oversees your overall progress. You have an advisor you work with, but maybe a number of collaborators with different projects, all asking for their results. Thus, again, graduate school is a time to learn what organizational schemes work for you and what you can do independently. I will say that there isn't any hand-holding here in many cases. I remember the first report I made for a collaborator, that was asked for on short notice, and I got it back with the words on the top “This is shit”. And it was. It was not personal. I am a negative-reinforcement learner and I took that criticism to make a better report. I also essentially said: “I agree it's shit, but you wanted it in a day, so you should probably expect shit”.

Efficient communication and asking a lot of questions are among the things I have found to help. For example, asking “What would you prioritize over X and Y” or emailing “I have another project that is my top priority due to a deadline, could I get you X by Y date?” can go miles. Communication is only half the battle though; sometimes both people say “yes I want it now”. You either need to clear some weekend or night plans and plow through things (WHILE STILL SLEEPING) or you need to have your advisor do some time-protection work if possible. Also, I have found that people can easily say “yes I need that NOW” via email but those things change during a phone, Zoom, or face-to-face discussion of your priorities (including other priorities of their projects). Also, understand that schools of public health, nursing, medicine, and arts and sciences have very different cultures. When you cross these areas, norms and acceptable things change and you should be aware.

Work Schedule

The role of statistics and biostatistics is to quantify uncertainty, and learning to live with uncertain answers is part of the learning. One big issue is that research is never “done” and there is no “right” answer. No one is going to tell you that you were 100% right. That's science. Thus, you can work until you get things done to your satisfaction. That can push projects forward, but also keep you up late into the night or lying awake in bed thinking more could be done.

So although you can usually make your own hours – give yourself a structure. Getting into a solid rhythm is helpful and can provide a solid foundation, especially when times are hard. Coming in by a certain time, leaving by a certain time, working a set number of hours, or setting a number of tasks for the day and then leaving when done are all good strategies for a work-life balance. During courses this will be hard. During independent research, this structure can make things immensely easier.

Lastly, the structure will allow you to finish the last 10-20%. When you get a new project, it's exciting and everything you find may be interesting. When you've reread the manuscript you're submitting for the 20th time and getting another round of edits, that project may physically disgust you. You're over it. You're not motivated to work on it. Without pushing from others, the structure gets you over the line, sometimes like a marathon runner, exhausted, sore, and sweaty.

Random

Dress code

I came to grad school wearing hats every day, sandals, plaid shorts, and t-shirts to class and meet with faculty. While my dress has slowly navigated back towards that, luckily ditching the plaid shorts, that's likely because I'm a faculty member and have a bit of freedom over my dress. I remember taking classes with medical students, who dress in professional garb, including ties, for the first few weeks and feeling immensely under-dressed the first class or two. I'd recommend dressing semi-professional, ties are not necessary, but shoes, pants, dress/skirt, or something of the like would probably be a good first try at first days of class or meetings with faculty. You will get the hang of what you can wear that fits with your style over time, but probably go more professional to start. Also, dress professionally when meeting individually with the chair, dean, or any administrator you don't know personally. But maybe things have changed, that was like > 10 years ago?

Graduation

You need a cap and gown (regalia), just like in undergraduate ceremonies. One key difference is that if you're getting a PhD and going into academia, you may need that regalia for future ceremonies with your students. These are not particularly cheap (https://academicregalia.herffjones.com/category/detail/categoryID/3291), and can be upwards of $1000. I recommend buying them after you are in your position for some time and likely more financially stable. And if you don't go into academia, you won't have that gown just staring at you in your closet, just asking how it's going to turn into an expensive Halloween costume.

The 3 ‘Times’ of a Project

Posted on May 15, 2020 by strictlystat

During a conversation with Sean Kross about projects, particularly data science projects, I tried to explain how things can go right and wrong with a project. I was explaining things with respect to being the data scientist on academic projects, but I think these issues are cross-cutting so figured I’d post them here.

I thought back to when projects did not go well or someone was left frustrated or angry during or at the end of the interaction. To me, the issues usually come down to the 3 “time”s of a project: time, timeline, and timeliness.

Before talking about these “time”s, I think it’s important to note that most of the frustration really comes down to miscommunication. The miscommunication or differing expectations, in my opinion usually fits into one of these time buckets.

Time

Time represents how long you estimate to do something. Particularly, this relates to how many hours a week you can work on a project, or percent effort, also called %FTE (percent full time equivalent). “Time” also means there should be a discussion of whether you have the space in your schedule to commit to something. Many instances you may not have space but you’ve been “strongly urged” to do the work.

Helpful things to do:

Do not say how many hours you have available. Tell them 80% of that or tell them how many you want to work on this. Time is a fluid – it fills the space provided.
Sometimes work out 1-2 “hypotheticals”, such as what if the data is in terrible shape. Even better, wait to give a yes or a no for accepting a project until after you get some of the data, but most people assume you are a “yes” once you get the data.
Estimate (or overestimate) how long the first set of tasks will take. This sets the precedent for the project.

It’s fine to deliver this a bit quicker than projected. It excites people (“That was fast!!”), but you can still lag on sending it exactly when it’s done. This time slack allows you to think if the results are right, but more importantly makes it so that when things go wrong (WTH is that data point!?) the expectation of a quick turnaround is mitigated a bit.

One of the main issues is that novelty is a cruel mistress. New and shiny things are exciting. Most projects sound like they can change the world or practice or our understanding of an area. Some can, not all do. Think of a project you’re on right now and imagine that it was dropped today and someone came back in a week asking what time you could dedicate to it again. Would it be the same? How much less? Think of your good and not-so-good projects; averaging across them might give you an idea of how you’ll feel about this new project in 3 months.

Timeline

I know you’re saying “3 months from now!? I get all my projects done quickly!”. That brings us to timeline. The full timeline of the project is how long the overall goal or set of goals for a project is going to take. This discussion usually is more overarching than the time discussion for a specific task. Is the project one paper? Developing an entire suite of work? Multiple clinical trials?

But let’s focus on one analysis, that (hopefully) results in a paper.

A few questions that could be helpful are:

When do you plan on submitting the paper?
Are all the patients/subjects enrolled followed up?
Is someone (student/intern/visitor) leaving soon and this needs to be done by then?

Many times you’re not privy to the internal workings of a group, including the fact that the data they’re about to give you may have been stopped and started 3 different times with different analyses.

Many people think once the paper gets thrown down the ravine to the wolves of review, it’s out of sight and mind and never thought of again. But then, it crawls up, bloodied and beaten, back from the land of reviews into your line of sight: REVISION!

You need to ask: When will reviews likely get back from this journal, what’s the turnaround time on those usually (2 weeks to 1 month), who will take lead?

Timeliness

Although it may be a bit of an abuse of the term, the last time is timeliness. I consider timeliness similar to responsiveness. Many projects have long or short-term explicit goals, like a paper or book, but many have implicit deliverables along the way, like short presentations. The discussion here is something like “If you send me a question about this project, how fast do you expect me to respond? Same hour? Same day?”. This discussion sets up the ability to use keywords such as URGENT or NON-URGENT. These can be abused, but at least you know what one party believes is important so that they don’t come back later and indicate you shrugged off something for another day that was pertinent.
Also, effective email writing techniques help, such as putting in an estimate of how long you think a task would take (could be way off – again good to know what people think) or putting a TL;DR (too long; didn’t read) synopsis at the beginning of a long-worded email.

We’re all battling the evil dragon of email back daily, trying to rescue the prize of “free time”. These little things allow people to prioritize tasks for a project and not open a 2-page email, be overwhelmed and close it, putting it off until later. A little TL;DR can make things a tad easier. Remember that people use email in very different ways; that long email may be a stream of consciousness mess or a well-itemized TODO list that people should refer back to. Now many, and I mean many, different project management solutions exist for this type of work, but 1) I can’t find anyone who agrees on which one to use, 2) some are unwilling to pay for these solutions, and 3) if you’re a data scientist you’re usually not able to force the use of these. Even if you can force using this solution, the next project may say no.

Although most don’t use “project management” tools per se, there are services that most are amenable to that can help these issues. For example, shared folders such as DropBox, OneDrive, and Box provide a one stop shop where materials should be created. Writing a paper? Use Google Docs, or for the LaTeX crowd, Overleaf. As an aside, Overleaf is a great product, that you can even use knitr in! Once they make a way to use this with Rmarkdown (I’m looking at you RStudio), I will throw down the gauntlet and try to only use this service, as it incorporates LaTeX/PDF, dynamic documents, can output DOCX, PPTX, slide decks. ANNND Back to other tools like GitHub for a shared space for code. At the end of the day, you’re trying to end the torment of an email with an attachment of Manuscript_FINAL_2020May15_JM3_REALFINAL_willThisEverEnd?.docx. Many of these tools are painless replacements for the email song and dance, have version history and track changes. Push for them.

I have had horror stories of timeliness. I have had emails that said WE NEED THIS RIGHT NOW. Long into the night, breaking my back (but probably neck because ergonomics is hard) for this project, I’d send off my finished product. Then I’d wait. And wait. And forget. Then remember and get mad that I hadn’t heard anything. Then ping the email and get nothing. Then I’d look up, 6 months had gone by, and I had realized my beard looked like Tom Hanks in Castaway, and feel the serene closure of letting a dead project die. Then a week later I’d get an email saying Thanks for that! WE NEED THIS OTHER THING RIGHT NOW. Don’t do this for your mental health, the health of your facial hair (or lack thereof), and for the stress balls that may explode otherwise.

Conclusion

Time is a fickle thing that we think we have none of (today), a world of (I’LL NEVER DIE!), or some (let’s have a quick chat). For projects, time discussions and expectations are vital to a good collaboration. Like an awkward first date, sometimes you need to get some of the cards on the table otherwise you end up down the line as a depressed John Cusack as he has played in so many movies. Talk about your 3 times of a project, be happy, and collaborations will hopefully flourish!

Some Thoughts as a Junior Faculty (at JHSPH)

Posted on February 14, 2020 by strictlystat

Being a Junior Faculty member, or considering it, leads to a lot of questions. I hope to answer a few of them here. Some of my statements will be specific to Johns Hopkins Bloomberg School of Public Health (JHSPH) and maybe specific to the Department of Biostatistics. Disclaimer: this is the only department I have been in (for PhD and faculty), so not all of these may generalize or apply. All of these opinions are my own and all of this is knowledge that was not taken in confidence.

Do you want to be a faculty?

First and foremost: do you want to do this? I'm not saying you need to be 100% sure about everything and this has been your lifelong dream and you've never thought about anything else. I'm saying, did you like writing papers and doing research, where for much of the work you needed to be independently motivated? I like going down the rabbit hole and finding out where it leads me. Maybe too much at times. That means finding a bug in my code, figuring out if my hypothesis is off the mark, or whether I can tackle this problem in front of me. The independence is a large draw for me.

Overall, I believe the flexibility/independence to work on what you're passionate about is the main draw of academia. That doesn't mean you'll never have to do things you don't like or aren't passionate about. It means that you'll have the opportunity to explore your own ideas if you want, or work on interesting research that just-so-happens someone else wrote the grant for. A lot of the other perks of academia you can find in other industries. Many jobs today are allowing for flexible time schedules, conference travel, up to 20% independent research time, remote work, and other things that were unheard of 25 years ago. That's not a bad thing for academia, but just that those perks are not only for academic faculty.

That independence/flexibility comes at some cost. For one thing, you may be paid below “market rate” in industry or consulting. The main cost I see, though, is that independence can be hard sometimes, at least for me. I don't like being told what to work on all the time (see rabbit hole above), but I do like some structured work that has deliverables. Trying to reorder your priorities fluidly can be a bit draining.

One of the best analogies I've heard about being a junior faculty is that you're your own startup. You're the CEO of your own career. You're finding funding usually by grants compared to VCs. You're a recruiter, usually of students and other collaborators. You're your own assistant, scheduling meetings, staying on top of your email, booking your own travel (maybe), and running the meetings. And you're the team doing the research, writing the code, and delivering the product (papers/presentations/grants); you're the advertiser of the product (vlogs/blogs/presentations/papers/classes). Over time, these roles change in the percent of time you spend doing each task, but when you start out, you're it. And lastly, you're setting the agenda and vision for your career.

I'm an impostor: I don't have ideas

Many graduating students have the concern that they will not have enough ideas to generate new papers or grants. I'd say that's generally not something you should worry about. No area of research is completely explored, but it may be an issue if you are too narrow in your scope. Almost every paper I have finished has led to at least 3 more questions. Those questions may be about that data set or method or about new data we need collected. Even if your well of ideas dries up temporarily (highly doubtful), if you have energetic collaborators/mentors, they will have enough ideas to lend you. If you're working on something someone else suggested, I recommend that you 1) understand why it's important before starting, 2) make sure you have enough interest/passion in this topic (for those nights where the project has turned into your worst enemy, this passion keeps you from totally throwing it in the garbage), 3) discuss expectations before doing the work with respect to the level of help those suggesting it will provide, and 4) make sure authorship is at least discussed a bit before doing a whole bunch of work. If that doesn't work, go to one conference and see if you don't come back with a handful of ideas.

Soft money vs. Hard money

Soft-money generally refers to salary funding coming from grants or other awards rather than tuition or endowments. Hard money is the opposite and many times the majority of your salary will come from teaching. There are numbers such as “2-1”, “2-2”, “1-1” that refer to the number of classes you teach in a semester for hard money positions. JHSPH is generally a soft money environment. Moreover, we are in a quarter (not semester) system, so the numbers do not mean the same thing. Depending on how much you teach, however, you will be required to cover anywhere from 60-85% of your salary as a tenure-track faculty at a given time on grants or awards. If you're research track, make it 75-100%.

Research Track vs. Tenure-Track

First off, I'm an Assistant Scientist at JHSPH. This means I'm a research-track faculty member. Other institutions have different names for this track and also may have different tracks for research or clinical work, etc. In some departments, research-track faculty members are treated starkly differently than tenure-track members, not just implicitly: some have different voting rights and restrictions on their work and/or mentorship. In Biostatistics at JHSPH, research/scientist-track members have similar voting rights (not completely the same) and are treated very similarly to tenure-track faculty.
For example:
You can teach courses.
You can have discretionary accounts.
You can be the PI on a grant (or co-PI).
You usually get competitive offers and can use the AMStat news to guide your salary.
Skills related to research, teaching, service, and mentorship are extremely useful.

Some differences worth noting are:
You cannot be the primary research advisor to a PhD student. You can be an advisor, not the primary. You can be a primary research advisor for a Master's student.
This has pros and cons. You can't be the primary mentor, but can still work with students, and tend to not have to find funding for them as that is likely the duty of the primary advisor.
You don't have a built-in sabbatical whereas it's more assumed for tenure track. You could potentially negotiate this.
You are usually hired under a project or a direct mentor.
This does not imply that you cannot work on your own work, but that initially you don't have to find all of your funding when starting.
The search, hiring process, and requirements from the dean are not exactly the same as for tenure-track.
Startup packages are not necessarily the same. Again, could potentially be negotiated.
You start working on day 1, compared to some “protected time” with tenure-track faculty.
You don't have a “tenure clock”. This can be a double-edged sword.
On one hand, you don't have the same timeline pressure.
On the other hand, you may need to make a concerted effort to set up meetings with your chair and/or mentor to discuss progress with respect to promotions. Our chair has yearly progress meetings with all faculty, regardless of track.
This can also lead to more variable promotion timelines. This can be mitigated by clear communication from the chair and mentor about expectations and precedent.
You have different expectations for promotion. These can vary wildly from institution to institution. We have similar expectations in many respects at JHSPH, but do not have as many external letters required for the promotion committee.

Mentorship

How do you choose a mentor? Well, find someone you can talk with, that knows stuff about stuff you don't know well, and will agree to make time for you. We have one formal mentor. But most likely, you'll have many mentors. One is likely to be in the department, but you'll likely find mentors that are collaborators. There are some informal setups in our department, which work overall because most people are open to having you schedule a meeting or walk in and ask some questions. If you find a department where that's not the case, try to get something more formal. Generally, someone in a working group you are in may be a good place to start. We also have an informal lunch on the calendar each day where faculty/post-docs may join, which allows you to meet other faculty that you may not directly work with. I have found this immensely helpful to get to know my fellow faculty, or get some advice from senior faculty that have dense schedules I would not feel comfortable sequestering an hour from.

Grants

One of the most asked questions for new junior faculty is about funding. These questions and discussions can be stressful, especially if you have no experience with grants. I had some experience with grants when being a Master's-level statistician, but never from the viewpoint of a PI.

How do Grants work?

Honestly, I'm still not 100% sure. NIH R Grants have different requirements with respect to page limits (https://grants.nih.gov/grants/how-to-apply-application-guide/format-and-write/page-limits.htm), but they are generally between 6 and 12 pages. That seems like a lot but it isn't. Remember, all the aims of the grant, the introduction, the figures, and the novelty of the grant need to go in there. Don't go over the limit, period.

One thing we do in JHSPH Biostatistics is that the faculty share written grant proposals. Some of the grants have been funded, some have not been funded but were discussed, and some were not discussed. This allows junior faculty who have never been on an NIH panel to see an array of grants. I learned to write papers by reading other papers and applying a similar logic structure. I imagine grants are a similar endeavour. Disclaimer: I've never applied for an NIH grant where I was the main PI and the one who did the lion's share of the writing. But when I do, having examples to draw from can help immensely. I have submitted to internal and other grant mechanisms, but not NIH as a PI.

Study sections and that stuff

I will tackle a few simple questions now. At JHSPH, as it is a school of public health, a lot of grants come from the National Institutes of Health (NIH), at many different institutes or centers in the NIH (called ICs). Many of our faculty (not myself though) have received grants from the National Science Foundation (NSF). These tended to be more theoretical, but not always. There are also a number of internal grants at an institution. For example, I have a DELTA grant (https://provost.jhu.edu/about/digital-initiatives/delta/rfp/), which is an internal JHU grant.

Grants have letters and numbers; the letters generally refer to the type of grant (see https://grants.nih.gov/grants/funding/funding_program.htm). Many grants you will apply for will be R grants, where R stands for research. Some applicants, particularly junior faculty, target career development awards (K grants, https://researchtraining.nih.gov/programs/career-development). Many faculty target R01 grants, as they are the most common. Junior faculty may also be more likely to target R21 grants, as they are for research in earlier stages.

If you are a postdoctoral fellow, you can apply for a K99/R00 (sometimes called a “kangaroo” grant, https://researchtraining.nih.gov/programs/career-development/K99-R00), which is a “Pathway to Independence Award”. These are similar to R01s in the funding amount usually. They are highly competitive, but the number of eligible applicants is smaller than the number of faculty.

For many sections, there are requests for applications (RFAs) that go out. These are proposals that call for grants that do a specific type of work, tackle a specific subject area, or require specific infrastructure resources. Make sure you're on the mission of the RFA before going forward. In order to do that, you'll want to talk to a program officer. In many respects, these people are similar to project managers in other settings. They have a portfolio of different divisions; this proposal is not their only one. Most program officers (POs) have extensive backgrounds in science, but not always specific to your field or the niche of the RFA. That can cause struggles when discussing some of the importance of your work, but that's a good thing. It's a good thing because the panel of the grant isn't going to be niche people. If the program officer doesn't see how your proposal fits with the RFA, it's highly unlikely the study section will see it either. Also, the program officers look at a number of RFAs other than this specific one, which allows them to maybe identify other sections or RFAs where your grant may be more appropriate. Don't harass them, but they are your contact to ask questions and you should use them.

Funding: Direct and Indirects

Grants have direct and indirect costs. The direct costs are the monies needed to do the work, such as salary, computing, data collection/analysis, etc. This is generally how you can fund your salary, your work, students, and/or post-docs. The indirects or indirect costs relate to money in the budget that is not directly related to the work (hence indirect), such as money for office space, staff, heating/cooling, electricity, other institutional requirements/support. The “indirect rate” is negotiated by the school and the funding body (see https://www.hopkinsmedicine.org/research/resources/offices-policies/ora/handbook/appendixc.html for some rates).
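As a rough sketch of the arithmetic, here is how directs and indirects add up in R. The dollar amount and the 60% rate are made-up numbers for illustration only; the real rate is whatever your institution negotiated, and in practice it may apply to only part of the direct costs.

```r
# Hypothetical numbers for illustration only (not real JHU figures)
direct_costs = 100000   # salary, computing, students, post-docs, etc.
indirect_rate = 0.60    # assumed rate; the real one is negotiated by the institution

# Simplifying assumption: indirects are charged on all of the direct costs
indirect_costs = direct_costs * indirect_rate
total_award = direct_costs + indirect_costs

c(direct = direct_costs, indirect = indirect_costs, total = total_award)
```

The point is that the money you can actually spend on the work is the direct portion, which is smaller than the total award.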

Write a lot

I recommend the book “How to Write a Lot: A Practical Guide to Productive Academic Writing”. It's not expensive and it's a short book. Note, this will not teach you how to write well or publish. It's specifically on how to write a lot. As an academic, that's the majority of the job. Writing papers, writing grants, writing letters of recommendation (eventually), writing letters of support (“I'd work on this grant for sure”), writing presentations, etc. Writing a lot can help, even if the writing isn't that great to start. The book also recommends a writing accountability group (WAG). We have one with junior faculty in our department, and it has led to grants and papers that would not have existed otherwise. If you don't have one, start one. At JHU, our faculty development office helps create them and facilitate them if you don't have the ability or pull to start one on your own (https://www.hopkinsmedicine.org/fac_development/career_path/wags.html).

How do you recruit students?

First, students need to know who you are. That means attending departmental events and meetings where there are students. We have a tea time every week where students and faculty share tea. We discuss a number of things: life, pets, that week's seminars, other non-statistics human things. We also have a chili cookoff at the beginning of every academic year so that new students can meet the department. We also have a holiday and end of the year celebration. We additionally have joint faculty/student meetings to discuss departmental matters. We have had off-site retreats approximately every 1.5 years to discuss long-term matters of the department and adjustment to our vision and our mission. Our offices are all on the same floor, so they see us in the halls and know where we sit. If you are in a department where that's not the case, try to be somewhere visible some days a week (like a coffee shop in the building the students are in) if possible.

A large resource for recruiting students is teaching the first or second-year courses. These students get to know you, how you work, and you get to know them. They at least know who you are if you are their teacher (hopefully). In some hard money environments, you may have discussions at interviews about “buying out” of teaching. It can be beneficial to discuss this option, but not teaching may put off some departments and may limit your ability to recruit students quickly. That being said, I find teaching incredibly rewarding, but also extremely tiring. I have never taught a lot in one day and not felt like it took a lot out of me. But I've never seen those days as ones where I “got nothing accomplished”, which has happened with strictly research days at times.

Conferences

At JHSPH, I had the tremendous opportunity to attend a lot of conferences. I like to travel and see places, network and meet people, and don't mind public speaking. All of those traits are helpful for going to conferences, but are by no means necessary. Sadly, some programs allow students to go to one conference over the course of their degree. This sometimes conveys the idea that conferences aren't useful or aren't for students. Both are patently false. You don't need to go to 5 conferences a year to be a successful faculty member. Heck, you don't need to go to any. But conferences are great places to meet people in your field, get your name out there (advertise), and make collaborations and connections for future projects. Oh, and students definitely do go to conferences (maybe a future post-doc?). To fund the travel, hopefully there are funds in the grant for travel and conferences. If not, you may have money in a discretionary account that you had from a startup package or other means. Our department also will pay for one conference a year for all faculty. If none of those options exist, try your hardest to get a travel award from the conference. These are highly competitive, may be only open to students or post-docs, and will likely not cover all the costs incurred at the conference.

Staff and Administration

Lastly, respect the hell out of your good staff and administration members. My mother was a secretary at a university. She had stories about professors who were not the nicest to staff and that stuck with me. If you think it's hard getting a meeting with a collaborator, imagine trying to organize 5 senior faculty from different departments to get on thesis defenses or filling a speaker schedule where no one answers emails. The administrative staff may be the gatekeepers to senior faculty calendars or room schedules. They are the glue that keeps things together at times and the oil that keeps the machine running at others.

The administrative team also usually knows the ins and outs of grant submissions and may be the ones submitting the grant. Respect their time. Do not expect them to reply on weekends or after hours unless absolutely necessary. Our admin at JHSPH Biostatistics have made policies about requiring notification that we are submitting a grant a period of time before submission and the faculty agreed. Moreover, most staff and admin have been in the department much longer than you; they know who to talk to, the answers to your questions, and they generally will meet with you and do a Q&A. Most importantly, if you have good people that do not feel respected at their job, they will leave.

Conclusion

Try to find someone who's done well in the environment you're in. That is likely a mentor, but maybe not. Try to have people know who you are. You'll have ideas for research; you'll probably write grants. Ask successful grant writers for copies of their work to use as a starting template. You weren't always an expert on how to analyze data or write papers; it takes practice, help, and usually a template. Like most things, a lot of anxiety and frustration can be mitigated or avoided by having open, frank discussions about expectations, requirements, and getting feedback. Remember, you will likely have to ask for help if you need it, but your department wants you to succeed.

The way people use AI is ruining Reproducible Science Again

Posted on February 4, 2020 by strictlystat

The basic premise of this article is this: “Would you accept a paper that did a logistic regression, but did not publish the weights due to intellectual property?”. If you answer yes, then I do not think you will agree with some of the following statements. If so, I thank you for your reviewing service and will let the authors for which I review know who you are to send to you.

If you answered no, my question to you is, why do we accept this for artificial intelligence (AI) models? Here I'm using AI in the broad sense, including machine learning, deep learning, and neural networks. In many of these cases, the model itself is only useful as an object. For example, for a random forest, the combination of the individual trees is necessary to do prediction. It is extremely difficult (likely impossible) to reduce this to a compact representation that would be useful in a paper to do prediction. In a regression framework, even penalized regression, the model can be shown by a series of weights or beta coefficients. For deep learning models, the number of parameters can explode given the complexity, depth, and representation of the network. When using a convolutional neural network (CNN) to segment or classify images, there can be millions of weights for different areas of an image to get a final result. These weights are impractical to print out in a PDF, text file, or supplemental material as it would take a researcher hours to reconstruct this into the network. Thus, the model weights should be released if the results are to be reproducible or useful on an external data set. I will yield that a CNN can be represented in a figure to some degree and be reproduced, but many times other processing, normalization, augmentation, or other non-shown steps are required for reproducibility.
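To make the contrast concrete, here is a minimal sketch in R, using the built-in mtcars data as a stand-in (the variables and the model are my own choice for illustration, not from any paper). It shows that published coefficients are enough to reproduce a logistic regression's predictions by hand, and that for models without such a compact form, sharing the fitted object plays the same role.

```r
# Fit a small logistic regression on a data set that ships with R
fit = glm(am ~ wt + hp, data = mtcars, family = binomial())

# The "published" model: just the coefficients (intercept, wt, hp)
beta = coef(fit)

# Anyone with the coefficients can rebuild the predictions by hand
X = cbind(1, mtcars$wt, mtcars$hp)
manual = as.numeric(plogis(X %*% beta))
from_fit = as.numeric(predict(fit, type = "response"))
all.equal(manual, from_fit)

# When there is no compact representation (trees, network weights),
# releasing the fitted object itself does the same job
model_file = tempfile(fileext = ".rds")
saveRDS(fit, model_file)
restored = readRDS(model_file)
from_file = as.numeric(predict(restored, newdata = mtcars, type = "response"))
all.equal(from_file, from_fit)
```

The second half is the part that tends to be missing from AI papers: without the saved object (or the weights), there is nothing for a reader to call predict on.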

Why is this Happening?

Frameworks such as TensorFlow, Keras, Theano, and PyTorch make deep learning more usable for all researchers. Fitting these models or predicting output (also called inference) can be done on a number of platforms, including mobile, which makes it highly attractive. Moreover, container solutions such as Docker and Singularity allow the entire system on which the model was used to be preserved. So what's the issue? The growing issue with the use of AI, especially in applications to medical data, is that people are not releasing 1) their data, 2) their code, or 3) the model weights.

Release the Data?

Let us tackle the easiest first: the data. Some data was collected without consent to be released, or has protected health information (PHI) that cannot be released under protections such as HIPAA (Health Insurance Portability and Accountability Act). It is completely reasonable for researchers to not be able to release such data. When authors can release the data, they often state it is “available upon request”, but adherence to this policy is not enforced by many journals as the paper is already published (https://science.sciencemag.org/content/354/6317/1242.1, https://twitter.com/gershbrain/status/1207677210221527045). If authors simply ignore these requests, there are few ramifications. This may be understandable, because there are downsides to the researcher of releasing data: 1) users could find issues (which may be a benefit), 2) it may require maintaining data usage agreements, or 3) many think of the data as “intellectual property”, which I will address now.

Release the Code?

Many people, seeing how well AI is working in their application, think that their method could be turned into a commercial product. This may be valid, but must not be used as a shield against reproducible research. Let's turn to releasing the code. If there is no novelty in the framework they used, such as an off-the-shelf VNET, then the code should be released as nothing is “secret”. Even with slight adaptations, unless large and completely new, the code should be released. Many ask why code would need to be released if it is off-the-shelf. The reason is that although off-the-shelf methods are used, the steps for getting the data into the correct format before running them, including data processing and checks, need to be available. Thus, these “ancillary” scripts are actually crucial for research and reproduction. Even if the architecture is completely novel, it will likely be described in detail in the publication, and thus potentially could be released. Let's assume though that you cannot release the data or the code.

Release the Model?

Lastly, releasing the model. Again, the “model” in this setting can be a complex set of trees or weights, amongst other things. It's uncertain as to whether PHI can be recovered from these models, which is a valid concern given the data cannot be released. After many discussions, I assert that many don't release the model because it is “proprietary” or has potential “intellectual property” that can be commercialized, a possibility I don't disagree with. My objection is that many applications will not fit the requirements for a patent, as slight changes to an algorithm can classify it as a different algorithm. Using these models in a software-as-a-service (SaaS) framework could potentially be profitable, but it's doubtful this will ever happen. Moreover, there is no time limit on these commercializations. If you claim this can be commercialized, but after 5 years no progress is made, is it really going to be commercialized, or is it simply an impediment to reproducible and progressive science? If a model fits in the cloud but never comes down, is it really a model at all?

Any Solution?

So what's the answer? I don't know. But here's some help in reviewing.
Personally, I have been putting in boilerplate concerns with a number of medical imaging AI projects, which hopefully you may be able to use:

Overall, the other main concerns are 1) the data is not available to determine quality, 2) no software is available to test or apply this methodology to another data set, and 3) the segmentation/detection results were not directly compared to any of the methodology for segmentation previously published.
Releasing the code for processing and modeling, including the final model weights, would greatly increase the impact of this paper and is highly encouraged.
Are the data released anywhere? Will it be made public? Will the segmentations/classifications?

I've had authors and editors give the concerns above, which I have yielded to in some cases. I don't think these are 100% necessary for publication, but I would like to know the reasons that I cannot reproduce this analysis or use it to learn how to do better science. Until journals make clearer guidance about these policies (instead of omitting them in many cases), I guess I'll just be ice-skating uphill.

R projects may make large files

Posted on December 17, 2018 by strictlystat

Introduction

I have an “old” MacBook; it's a late 2013 MacBook Pro. I haven't upgraded because I wasn't a fan of the butterfly keyboards and the top row bar. I'm glad to hear you can now get new MacBooks with the “old” keyboards. Also I don't see large advances in the specs of the machine, but I'll stay with Mac because I love the OS and integration.

That being said, one of the downsides to having an old MacBook is that I'm struggling with space at times. I offload a lot of things to the cloud and my external drive, but I like having things locally. Also, I am a huge fan of the RStudio Projects framework. I would say the RStudio IDE is a must for using R nowadays, at least if you're a new user. RStudio Projects alleviate a lot of the problems of working outside of an IDE, such as switching directories (opening an .Rproj file opens to the root directory and here::here uses this) and having multiple unrelated scripts open (each project has its own session/window), and they come with additional build tools for package development.

How the RStudio IDE integrates with Package Development

Using RStudio Projects for package development is great. The tools integrate with devtools, which changed the game with making a package. RStudio additionally wrapped this functionality into keyboard shortcuts and GUI clicks, along with integration with Git. When you are compiling and building a package, the RStudio IDE knows that you should restart the R session because all the packages (and options) you previously loaded should be reset. Now it doesn't want you losing any saved work, so all the objects are cached, the session is restarted, and the cache is restored.

The issue

One of the downsides with this strategy is that I'm impatient. Sometimes, especially with large packages or objects, the RStudio IDE will freeze. I will wait and get annoyed and kill the process. The overall issue is that the cached data is not cleared away. The data is stored in the .Rproj.user folder and can be quite big (100s of Mb) depending on what you had in memory. A lot of other files are located in there that are related to your user state (think the 10 Untitled files you just haven't saved yet, what scripts were open, what was in the Viewer). Most of the time for projects that are packages, I don't need this information so I delete the folder. Don't worry, it'll get regenerated when you open that project again.

What's the point

If you're doing some house cleaning for hard drive space, take a look at the .Rproj.user hidden folders and see how large they are. They shouldn't be much more than 1Mb, and that's even pretty big depending on how much code you have. Either way, hope it gives you some “free” space. I guess I could buy another MacBook but this one works perfectly well still.

Here's a simple script allowing you to see the overall size of the directory. There are some things I couldn't find using file.size or file.info after recursively listing the files, so I just used du.

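# Find every .Rproj.user folder below the working directory
x = list.files(
  pattern = "[.]Rproj[.]user",
  all.files = TRUE, 
  include.dirs = TRUE, 
  recursive = TRUE,
  no.. = TRUE)
# Total size of a directory in kilobytes, using du
dir.size = function(path) {
  # -s gives one summary line per path (plain du lists every subdirectory
  # cumulatively, so summing its lines double counts); -k reports kilobytes
  res = system(paste0("du -sk ", shQuote(path)), intern = TRUE)
  ss = strsplit(res, "\t")
  ss = sapply(ss, function(x) as.numeric(x[1]))
  sum(ss)
}
sizes = sapply(x, dir.size)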
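
If du is not available (on Windows, for example), a pure-R version is possible. This is only a sketch: dir_size_bytes and rproj_user_sizes are helper names I made up for illustration, and the result is the sum of apparent file sizes, which will not exactly match the disk blocks that du reports.

```r
# Sum the apparent sizes (in bytes) of every file under a directory
dir_size_bytes = function(path) {
  files = list.files(path, all.files = TRUE, recursive = TRUE,
                     full.names = TRUE, no.. = TRUE)
  sum(file.size(files), na.rm = TRUE)
}

# Find .Rproj.user folders below `root` and report their sizes in Mb,
# largest first
rproj_user_sizes = function(root = ".") {
  dirs = list.files(root, pattern = "^[.]Rproj[.]user$", all.files = TRUE,
                    include.dirs = TRUE, recursive = TRUE, full.names = TRUE)
  dirs = dirs[dir.exists(dirs)]
  sizes = vapply(dirs, dir_size_bytes, numeric(1))
  sort(sizes / 1024^2, decreasing = TRUE)
}
```

Calling rproj_user_sizes("~/projects") (swap in wherever your projects live) gives a named vector of folder sizes. Anything you decide you don't need can then be removed with unlink(path, recursive = TRUE), and RStudio will regenerate the folder the next time the project is opened.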

Tips for a Job Search (Academic Edition)

Posted on October 5, 2016 by strictlystat

After going to a few interviews last cycle for assistant professor positions, I figured I should write on some of the points that I found relevant and general. Some of these were tips given to me, some of them are my own. All these represent my opinions and mine alone.

This will be at least a 2-part series, so I will have an update in the coming week or so.

Full disclosure, I did not receive a tenure-track position offer, so take these with a grain of salt. Most of the materials I had sent out are located on my GitHub and website. I found that most people ask previous applicants/students of your advisor/fellow graduate students for copies of their statements, but I feel like these should be more open and editable, so I published them online for our (and other) students.

My Packets/Materials

My CV is located here and my research/teaching statements and my cover letter for academia are located here.

Step 0: Academic or Industry

One of the first things you should think about or know is whether you are looking for academic or industry positions. I chose to apply for academic positions as well as biotech/tech jobs. Not all my peers chose this, but I will tell you why I did:

Academic and Industry turnaround time is different.
1. Academic applications are due around November (get them done in October), but you should be applying around early or mid October as November is somewhat late for applications.
2. Although you apply in November for these jobs, you have time as institutions will not get back to you until January through March.
Both academic and industry offer good jobs. They have pros and cons (which I won’t list here), but they both afford a solid lifestyle and usually some variety at work.
There were a lot of positions in academia open (2016).

Let’s say you want to apply for academia. The rest of this post will be discussing an academic job search and future posts will be on industry searches and also aspects of an interview.

Now that you’ve kept the door open for academia, you should know if you are looking more for a teaching gig or research gig. They have vastly different responsibilities and soft/hard money ratios. No one has defined FTE (full time effort) explicitly for me, but I have heard it range from 50-60 hours/week for an assistant professor.

From what I have seen, a “soft” money department requires 70-80% of your salary (FTE) to be covered by grants and the rest from the department (20-30%). These tend to be 12-month appointments. Many biostatistics departments, especially those in a school of public health, fall into this category.

A “hard” money department can range from 50-100% FTE covered by the school, which generally comes from teaching more courses. These are generally 9-month appointments, where the remaining 25% of the salary (in the summer) comes from grants. Statistics departments and biostatistics departments in schools of medicine can fall into these categories.

Each type of department has their own pros and cons and each department is different.

Timeline

You should probably start applying to academic institutions around mid-October to early November. Mid-November is a little late in the game, even though applications will likely not be reviewed for quite some time. A lot of planning for visits goes on, and you want to be invited; it’s hard for a place to invite you if it has already filled its visit slots with “yes” invitations.

Step 1: Your Packet: Academia

Let’s assume you are applying to academia. First thing you need to do is make your packet.

Here’s what you’ll need:

Curriculum Vitae.

This is the most important part. This is your “abstract” or your first impression on a committee most times. It must be updated, formatted well, and have all the relevant information. I remember a professor noting “Someone is going to take 2-3 minutes on a CV. They will go through 20-30 in a session. You need yours to be top-notch.”

Let’s look at the items.

Name, website, email, phone number (optional but recommended). Any blog/social media (Twitter)/GitHub pages. You may want to include preferred communication (phone vs. email).
Education: What school and department you are attending. Include advisor(s), expected graduation date and if you have defended (people will ask), areas of research, dates attended. I have seen people include GPAs and others have not.
Previous education: Master’s/Bachelor’s – same as above but I have seen GPAs more commonly here.
Relevant (Research) Experience – not cutting grass in 11th grade. Also, usually limit to the last 10 years unless you have done a lot before that. Put pointed deliverables in the text about what you did/how you added value/why it matters to the reader.
Teaching Experience. Classes you’ve taught, TA-ed, helped create. Include the professor you worked with (people love talking about mutual acquaintances) and your role. Include short courses/tutorials/workshops you led, created, or participated in as a teacher.
Published Publications. Generally descending/ascending by year. Make sure you highlight your name in the author order – people are looking for it. You may want to number them; mine are grouped by year but not numbered. These include in-press publications without the full citation. Make sure you update the citation when the article does get published.
Submitted Publications – these are submitted or under review but not accepted yet. These give people an idea of what you are currently working on and how many projects you work on at a time, though it may be a bad measure of that.
Talks/Presentations/Posters. Include all the talks you have given, including those at your own institution. Gave a talk at a conference? Include it. Working group? Include it. Journal club? Computing club? If it’s a presentation in front of others and the projector is on, include it. Some separate posters and talks, but I included them together. Include the conference name or event.
Working Groups. Maybe you work with a group or are on a training grant, but you don’t have publications from it – still add it. People many times know your group.
Honors/Awards. Win an award for a comprehensive exam or paper in your department? Put it down! ENAR poster award – umm yeah, that’s awesome to put down. If someone shook your hand and you got anything from a shout out in a room, certificate, to money, put it down.
Software. Do you release software – make that clear! Put links, say what it does. This includes web (Shiny) applications or any type of apps. Was it in undergrad? So what – put it down. I have had discussions with faculty the entire time allotted just about a web application I did one random weekend not related to my main research.
Skills – everyone knows Microsoft Word. List programming languages or other spoken/written languages. Wrote 1-2 scripts in Python? You’re a beginner, not an intermediate. Can you read someone else’s code and know what it does? You’re an intermediate at least (there are other criteria, it’s not set). Do you feel like this is your language and you can speak in it as well as your native verbal language? You’re an expert.
Academic Service – I volunteer and I put that stuff down. Any academic job requires “service” (usually of a different sort), but showing you do service outside of reviewing papers fits in line with many university missions. Moreover, if you start a club or run a club in your department, that is full-blown academic service.
Additional Experience – things that don’t fit above. Do a hackathon? Put it down here. Say the cool project you did and link to it.

There may be other sections for a CV, but those are mine, save for one. I have a “Research Interests” section at the top that says what I’m interested in/want to do research in. This may be good or bad depending on your view. It may become a reason to put you in the no pile before reading, but I think it’s useful.

Remember, academic search committees are looking for someone who can 1) do research, 2) teach, and 3) perform academic service, e.g. mentoring students, serving on thesis committees, serving on other committees (seminar committee, student recruitment, job search committee).

Research Statement

Depending on the position you are interviewing for, the teaching statement or research statement is likely to be the first thing read after your CV. That means you should spend a bit of time on it. Like grants, I hear the best way to write one is to get someone else’s who has been successful. Ask previous post docs, your advisor (though it may be dated), and previous students who have graduated for their statements.
I do not think I have been overwhelmingly successful in getting job offers, but I put my research and teaching statement on GitHub.

I have a few guidelines for what I would include in your research statement:

Your philosophy on research.
What you want to do in the next 5 years.
Why institution X and position Y is the place to do it.

“The Professor is in” has some good points in this and this. Check it out.

Teaching Statement

In an academic tenure-track job (and most research-track), you will teach. Teaching can afford you discretionary money in research track and is expected in tenure-track, as a portion of your salary generally comes from your teaching. If you haven’t taught a course, were you a graduate assistant or did you design something for high school students? Put that down. You should highlight any teaching awards you have had in the past and how they have helped you or what led you to receiving them.

Overall, you should have a philosophy for teaching (at least loosely). As a biostatistician who works with a lot of colleagues who are not statisticians or biostatisticians, I (like to) think that I have the skills to bring the material “down a level” into more understandable terms. I believe there should be transparency in grading and an up-front level of expectation on each side of the classroom. Although I don’t find myself to be the most organized while teaching, I feel that organization is important because without it the goals of the class can become out of reach or unclear. Anecdotes may be OK, but use them only when directly relevant.

It seemed to me that many research institutions “assume” you are a good teacher and they focus more on discussing your research. There are zero places that will say that consistently good teaching is not essential to their program and your success.

If you don’t have any experience teaching, you should 1) consider getting some and 2) consider again that your job is going to require you to do this. Also, although conferences and presentations are not exactly teaching, you can maybe pepper something in there about you feeling comfortable in front of a room of your (1-year-junior) peers.

Cover Letter

Not all places require a cover letter, but some do and it’s a nice touchpoint to start your packet. Some professors have told me they don’t read them (if they don’t require them) and others do.

I think it’s good to include:

Be clear which position you are applying for (most places have multiple)
Where you are graduating from.
Why are you applying there?
How are you qualified.

Letters of Reference/Recommendation

For the people you choose to write you letters of recommendation/reference (I’ll refer to as “letters”), there are no hard and fast guidelines. Except that your advisor should be one of them. This person is the person you (presumably) worked the closest with in the past 4-5 years and they know your strengths and weaknesses best. Moreover, they have likely sat on a committee to hire a new person like you and know how to present your strengths (truthfully) in the best light.

Overall, the goal is to ask people early. If a number of students in your department are graduating, your letter writers may have more requests than they can handle in a reasonable timeframe. Some may ask you which places you plan on applying to so that they can maybe make some specific remarks (or a call or 2). Others may ask to see your CV to talk a little bit more about specifics (or to remember exactly what you did).

I worked closely with a non-biostatistician collaborator and I applied to many departments where I’d be working with non-biostatistician collaborators, so I thought he was crucial for a letter. I chose a previous advisor and professors with whom I did an extensive project. You should know who you’ve done work with. If not, check your defense committee again.

Most places you will need 3-4 letters, but have about 5 people you have asked as some places will ask for 5 and some will “allow up to 5”. Make sure you have a file of their full name, email, address, phone number, position, and relation to you (aka advisor/collaborator/etc.).

Step 2: Figure out where the jobs are

For Biostatistics and Statistics, there are some great places to look for jobs online:

University of Florida
University of Washington
Purdue
AMStat
SIAM
Email from AMStat Sections – sign up for them or ask your advisor/friend/colleague to set up a forwarding filter

If you have a place in mind, check out the website for the department. They will have it advertised. Does your membership organization have a magazine? It sounds dated, but a lot of universities still advertise there.

You can also email any of your previous colleagues to ask if their department is hiring. This person should be someone you would feel comfortable emailing for other reasons.

Check Twitter and social media. Some departments have these and use them to disseminate information. Check them out.

Step 3: Where do you want to live for the next 6 years?

The number one question you should be able to answer is “Why do you want to work here”

You should have a solid answer for that question. Period. Everything else is ancillary to that point.

In many tenure-track positions, it’s 6 years to tenure. If you’re doing well, that is. Leaving a position after 3 years is reasonable, but may not reflect well on you and you will inevitably get asked “Why?”. Moreover, it may seem as though you hadn’t thought the position through thoroughly. While most of these judgments may be ridiculous because people move jobs for a multitude of reasons (such as partners/family/area/weather/…life), the thoughts will exist.

So ask yourself: “Would I be comfortable/happy living in this town for the next 6 years?” Yes, great. Geographic location and a type of living (city vs. suburb vs. rural) are real things in making your decision. It’s also something that goes into offering someone a job. If the applicant seems great on paper and the interview, but seems to hate the surrounding area or “could never see themself living there”, that may be a thing that puts the decision over to a “no”. You’re not a robot and you have preferences, remember that.

After that question is answered, you more importantly need to answer: “Would I be comfortable/happy working in this place for the next 6 years?” – that’s a bit harder to know, but if there is a “No” creeping around there for some reason, that’s not a great sign. That’s not a dealbreaker for not applying, but remember one thing: interviews are draining. You don’t want to put all your eggs in one basket, but you don’t want a big basket of slightly-cracked eggs. Eggs in this metaphor are your “best self” and cracked eggs are OK, but not so great.

Step 4: Filling out an Unholy amount of forms/Sending Emails

Applications are about dotting i’s and crossing t’s. They have some automation, but a lot of it is still very manual in its entry. You will have to write and copy and paste many documents over and over. Some will have optical character recognition (OCR) to determine information from your CV. If you have a “standard” CV, this will work. Otherwise, you’ll likely get a bunch of misformatted text you need to delete.

You will need to have a separate account for each university, as they do not share information with each other, even though most of them use Taleo as a backend. More are using LinkedIn as a resource, which may be a good reason to update your LinkedIn to look like your CV. Many of these systems have places for you to put information about your references, so remember to have that text file open with each reference’s information.

If the university you are applying to doesn’t have an automated system set up, you may have to send your packet to a search committee chair or an administrator who is listed on the posting. So you’ll email them and you’ll likely forget something, format something wrong, or forget to say what position you’re applying for, so you’ll get to answer a lot of emails.

Regardless, after the packet is signed off and in, you should (in like 3 weeks) send an email just confirming that everything is there. This is especially important if you don’t get confirmation when your letters of reference are submitted. Applications do fall through the cracks and emails do get overlooked. Do not trust any system in place and always double check your confirmation.

Conclusions

This is one post in hopefully a few on some of my (hopefully useful) insights on the process of applying and interviewing for academic and industry positions for a quant/data scientist/data analyst/research professor. Overall, there is a lot of prep you need to do (now it’s October 5). Some of it will be out of your hands (like letters of reference), which is why it’s so important to be ahead of schedule. Much of it is writing and revising, writing and revising, which you should be good at now. The one takehome message is:

Don’t sell yourself short. You just finished a long, grueling process which at times you probably thought you’d fail at. But you didn’t. Maybe not all the things you’ve done are glamorous or earth-shattering, but you did interesting things. You did things that mattered. Remember that and now make others see that and believe it.

Tips for First Year Comprehensive Exams

Posted on March 28, 2016 by strictlystat

During our program, like most others, you have to take written comprehensive exams (“comps”) at the end of your first year of coursework. For many students it's a time of stress, which can be mitigated with some long-term planning. I wanted to make some suggestions on how to go about this for our (and other) PhD students.

Start the week after spring break

Again, comps are stressful. You can be tested on anything from the material (ideally) from your first year. Professors can throw in problems that seem from left field that you did not study or prep on. How can you learn or study all the material?

The way to make comps more manageable is to have a long-term studying trajectory. We have 2 weeks after the last exam to study and prep, and that is crunch time. In my opinion, that time should be spent working on the topics you're struggling with, annotating books for crucial theorems (if you're allowed them in the exam), and doing a bunch of problems. Those 2 weeks are not the time to cover everything from day one. That time comes before those 2 weeks.

The week after spring break (the week before this was published) is a good time to start your timeline. That gives you about 10 weeks to study and prep. You can start from the beginning of the year and work to the current time, or work backward. If nothing else in the first week, make a timeline of what topics or terms you will cover over what time frame. This will reduce stress because it breaks the test into discrete chunks of time and discrete courses.

Get Past Exams

What's the best preparation for the comprehensive exam? A comprehensive exam. This may be a bit self-evident, but I know I had the feeling of not knowing where to start. Our department sends us the previous exams from the past 5-7 years. Some may not be comparable with respect to the difficulty or concepts covered, but I believe more questions are always better.

Vanderbilt has some great exams, as does the University of New Mexico, and Villanova. You can go to the reference textbooks (Billingsley, Chung, Casella & Berger, Probability with Martingales (Williams)) to try some problems from the chapters you covered as well.

Work from the back

My strategy is to map each exam (or 2) to a specific week. I worked on the older exams first and saved (i.e. did not look at) the ones from the previous 2 years until the 2 weeks before the test. I also would set out blocks of time (2-3 hours) to try an entire section of an exam, simulating the conditions for that portion of the test. I think these are helpful at gauging how well your studying is going.

Make a study group

How can you study or summarize all the material? Well, it's much easier if you have a team. You can also bounce ideas off each other. Moreover, the exams you have don't have an answer key, they are just the problems. It helps having others that can 1) check your work (swapping), 2) give you their solutions if you can't work out the problem, and 3) discuss different strategies for solving the problem.

We had a group separately for each section of the exam (probability, theory, methods). This separation helps because some students are retaking only parts of the exam and can help in some areas but don't want to be working on the sections they do not have to take. It also helps segment time studying so you don't focus only on one area while leaving another area (likely the one you don't like and are not the best at) neglected.

Delegate Study Areas

We separated different topics (letting people choose first) for each of the sections for that week. Of those not chosen, the rest needs to be assigned. The people/small team that was assigned to a topic needed to make concise (2-3 page) documents outlining the most important areas. They would also do a 5 minute presentation to the group about why these are the most important areas. That is the time to ask questions (and be prepared to get some when you present).

At the end of the school year, you have an organized study document. If you think your notes from the year are organized, you are likely mistaken. Even if you're highly organized (many people are not), there are usually too many superfluous details relevant to the course/homework/etc. and not the material. Split it up and let others weed through different areas while you focus on those you were assigned.

Drop the weight

If someone does not deliver on their delegated task, drop them. If there was an understanding that they would get double next time, fine. But if no discussion was had, they are out of the group. That person is not holding up his/her end of the bargain and is getting help for free while contributing nothing back. All students are busy, and accommodating that is fine, but it must be done before the session and at the time of delegation. Otherwise, that non-delivery will likely become a pattern and hurt the entire group. These are your friends and classmates, and it must be clear that any non-delivery is a direct negative to the group. No excuses excuse that.

Do as many problems as possible

Do problems. Do more. And then do some more. The exam is a set of problems. Knowing the material is essential, but the more comfortable you are with doing these difficult problems in a compressed time frame, the better you are. Many tests up until now may have been collaborative, take home, and shorter. Your comprehensive exam will be a bit different, so you have to prepare yourself. We're talking about practice; it's important (sorry AI).

Conclusions

Overall, the best way to perform well on the comprehensive exams is to learn the material as thoroughly as possible. Ideally, that is done during the course. But topics are forgotten and areas are not always fully understood the first time around. Therefore, a methodical, long-term study plan should be made to tackle the year's worth of material. I think a team is the best format for discussion and delegation, but you MUST do work alone (doing the problems), as the team does not collaboratively take the test. If you follow your plan (while obviously learning the new concepts in class), then you should feel as prepared as you can be. Best of luck. I would like to leave you a quote/clip from the recent Bridge of Spies movie:
“Do you never worry?”
“Would it help?”

Trying to at least Doggie Paddle through the Sea of Data, Contributor to http://bmorebiostat.com
