Package Updates

Several updates. All packages are on CRAN, but please use GitHub for the latest.

  1. dsld (Data Science Looks at Discrimination). New package. Tools for (a) careful investigation of possible discrimination (race, sex, age, etc.) and (b) avoiding/ameliorating bias in machine learning algorithms. Comes with a free Quarto textbook on the methodology. Useful in education, research, litigation, etc. https://github.com/matloff/dsld. A draft paper is at https://arxiv.org/abs/2411.04228
  2. regtools. Varied and powerful tools for regression analysis and related data operations. The latest version has updated and new functions for “fine tuning,” i.e. grid search for good values of tuning parameters/hyperparameters. Includes smoothing and Bonferroni-Dunn confidence intervals for the output. https://github.com/matloff/regtools
  3. qeML (“quick and easy machine learning”). Expanded usability for qePlotCurves (e.g. it now allows both wide and long input), and the new qeMittalGraph function. https://github.com/matloff/qeML

New qeML Plotting Function

I’ve added a new function to qeML 1.2, qeMittalGraph, based on an idea by my student Aditya Mittal. Below is an example that I think is rather compelling.

The basic idea is quite simple (and not necessarily new, just something I had not seen before): Instead of comparing several curves directly, plot their growth from their initial baseline value. So if, for example, X is time, then all curves start from the common point X = 0, Y = 1. Viewing the curves in this manner may make comparison more insightful.
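
To make the scaling concrete, here is the idea in one line of base R. This is my own illustration of the rescaling, not the qeMittalGraph internals; the sample values are taken from the German mark column shown below.

growthFromBaseline <- function(y) y / y[1]   # rescale a curve by its starting value
growthFromBaseline(c(580, 609, 618))         # 1.000 1.050 1.066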

As an example, we’ll use the currency dataset included in qeML, consisting of data on five currencies.

> data(currency)
> head(currency)
  Can..dollar Ger..mark Fr..franc UK.pound J..yen
1          19       580     4.763       29    602
2          18       609     4.818       44    609
3          20       618     4.806       66    613
4          46       635     4.825       79    607
5          42       631     4.796       77    611
6          45       635     4.818       74    610
> curr <- cbind(1:nrow(currency),currency)
> names(curr)[1] <- 'weeknum'

OK, let’s graph the raw values:

z <- reshape2::melt(curr,id.vars='weeknum')
qePlotCurves(z,1,3,2)

[Figure: the five raw currency series, plotted with qePlotCurves]

Now with qeMittalGraph:

qeMittalGraph(curr,'weeknum','rate','country')
[Figure: the same series plotted with qeMittalGraph]

We immediately see two clusters, franc/mark/yen and Canadian dollar/pound, potentially a significant insight. There may be some economic context needed, but clearly this view could be of great interest.

Note that the ‘loess’ smoothing option is the default, which has resulted in one of the curves not passing through (0,1). Setting this option to FALSE would fix this, but at a cost of having jagged curves.

Another class of use cases is graphing the effect of a hyperparameter over several different datasets, say with X = minimum leaf size in random forests and Y = Mean Absolute Prediction Error.
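
Here is a rough sketch of what that might look like, mirroring the qeMittalGraph call format above. The data frames dataA and dataB are placeholders for your own datasets, and minNodeSize is assumed to be the leaf-size argument in qeRFranger; adjust as needed.

# hypothetical sketch: MAPE vs. minimum leaf size, across two datasets
leafSizes <- seq(5,50,5)
mapeCurve <- function(dta) sapply(leafSizes, function(ls)
   qeRFranger(dta,'wageinc',minNodeSize=ls)$testAcc)
curves <- data.frame(leafSize=leafSizes,
   dataA=mapeCurve(dataA), dataB=mapeCurve(dataB))
qeMittalGraph(curves,'leafSize','MAPE','dataset')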

New R Package: Data Science Looks at Discrimination (dsld)

I’m very pleased to announce a new package, dsld, available on CRAN. This is the work of eight talented undergrad students. I provided the concept and some general guidance, but this is their work.

The package is aimed at dealing with discrimination — race, gender, age — in the workplace, education, health care and so on. It consists of analytical, graphical and tabular tools for:

  • detection of discrimination
  • addressing discriminatory predictive models

We hope the package will be useful in a variety of application venues, such as:

  • inspiring motivation in statistics courses
  • discrimination litigation
  • internal HR audits
  • social science research

The usefulness of the package is further enhanced by the availability of a free companion textbook. It uses dsld examples throughout, but its role is to explain the statistical concepts, not to serve as a user manual for the package.

One can acquire a good idea of the nature of the book by reading the example on law school admissions, pp. 34-43.

Needless to say, comments are welcome!

New Paper on Data Privacy

Readers interested in the Data Privacy field may want to see our new paper (Perry, Matloff, Tendick): https://tdp.cat/issues21/tdp.a478a22.pdf

There we introduce a new method that we call RWN, Randomization within Neighborhoods. We present a bit of supporting theory and do some empirical evaluation. We also present a qualitative comparison to other major methods, including Differential Privacy.

An R package is in development (let me know if you wish to use the R code for now).

Knowing Something vs. Knowing the Name of Something: Some Points about Causal Analysis

The famed physicist Richard Feynman once said, “I learned very early the difference between knowing the name of something and knowing something,” a lesson from his father. I think too often we in the statistics/machine learning field are guilty of “only knowing the name of something.” Well, in most cases, we may know a bit more than the name, but not as much as we should to be able to use the concept usefully.

As in most of my stat posts, I hope there is something here for everyone. Some of this will be new and thought provoking for those who are already familiar with causal analysis, but also useful as an introduction for those who don’t have such background.

The post is aimed at developing insight beyond “the name of something” in causal analysis (CA), a data science topic that is not new but has become much more prominent in recent years. As you will see, I am something of a skeptic on CA, and hope to dispel some common misunderstandings regarding it. Make no mistake–I do believe CA is a useful tool–but I hope here to demystify some of the ideas, and bring it down to Earth in terms of the extent of its value.

We will begin by dispelling a common myth about regression analysis (not solely about causal analysis, just ordinary stat, but strongly related). This is such a common myth that some readers will have trouble fully accepting it. How is that for a tantalizing lure?

A quick bit of notation:

Let m(t) = E(Y|X=t) denote the conditional mean of Y given X = t. This is the regression function of Y on X, a function of t in our notation here. X is a vector of predictors/features. If say X is height and Y is weight, then m(70) is the mean (population) weight among people of height 70. If X is height and age, m(70,28) is the mean weight among people of height 70 and age 28. Note that we are not necessarily assuming a linear model for m(t) here, though we often will.
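
As one quick concrete illustration, here is a way to form a sample estimate of m(70,28) from the mlb1 data included in qeML; the linear model is just one convenient choice of estimator.

library(qeML)
data(mlb1)
mlb <- mlb1[,-1]   # Height, Weight, Age
lmout <- lm(Weight ~ Height + Age, data=mlb)
predict(lmout, data.frame(Height=70, Age=28))  # sample estimate of m(70,28)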

Dispelling a very common myth:

It will be absolutely crucial here to have a firm understanding of what a regression model really does.

One often hears statements like, “We’d like to use a regression model to investigate the impact of family income on such-and-such, but our data is skewed toward the middle class or higher, so our estimated coefficients will be skewed as well.”

For instance, is a student’s performance on the SAT related to family income? We might try to investigate that by fitting a linear model in which we predict SAT by high school grades HGPA and family income FI. The coefficient of FI might give us an idea of how much, if any, income affects SAT scores.

But if in our dataset FI is skewed toward the higher incomes, many people would say such an analysis is invalid.

Actually, there is no validity problem. Let’s take a look at some simulated data.

# simulate a simple setting in which the true m(t) = t
m <- matrix(rnorm(10000*2),ncol=2)
x <- m[,1]
y <- m[,1] + m[,2]
coef(lm(y ~ x))
#  (Intercept)           x
#  0.002190729 1.003438820
# now shift the data to the right
y <- y + 2
x <- x + 2
coef(lm(y ~ x))
#  (Intercept)           x
# -0.004686912 1.003438820

What happened here? We simulated a simple example in which m(t) = t, and of course the estimated coefficients came out close to 1 and 0. Then we shifted the data to the right, and sure enough, got essentially the same answer.

How about this?

# restrict the data to the part in which x > 1.5
m <- matrix(rnorm(10000*2),ncol=2)
x <- m[,1]
y <- m[,1] + m[,2]
whichBig <- which(x > 1.5)
x <- x[whichBig]
y <- y[whichBig]
coef(lm(y ~ x))
# (Intercept)         x
#   0.2525674 0.8474718
# try again with a much larger sample
m <- matrix(rnorm(100000*2),ncol=2)
x <- m[,1]
y <- m[,1] + m[,2]
whichBig <- which(x > 1.5)
x <- x[whichBig]
y <- y[whichBig]
coef(lm(y ~ x))
# (Intercept)          x
# -0.06571407 1.04364323

Here we said, let’s look at only part of our data, the part in which X > 1.5. Do the estimated coefficients still come out to be about 1 and 0? At first they did not, but because I knew they “should” be about 1 and 0, I knew it was a sample size issue. I thus tried a much larger sample, and sure enough, we did indeed get approximately 1 and 0.

To be sure, if we fit a linear model, or a logistic or other parametric model, the above points will depend on the model being correct (or nearly so; no parametric model is exact in practice). To eliminate this issue, let’s look at nonparametric methods, which make no such model assumptions. Let’s try k-Nearest Neighbors (kNN).

library(qeML)
data(svcensus)
names(svcensus)
# [1] "age" "educ" "occ" "wageinc" "wkswrkd" "gender"
median(svcensus$age)
# [1] 38.21515
# a young person for our 't'
t0 <- data.frame(age=28, educ='16', occ='102', wkswrkd=52,
                 gender='male')
# estimated m(t), using k-Nearest Neighbors
knnOut <- qeKNN(svcensus,'wageinc',holdout=NULL)
predict(knnOut,t0)
# [,1]
# [1,] 74068
# skew the data toward older
svc1 <- svcensus
which25plus <- which(svc1$age > 25)
svc1 <- svc1[which25plus,]
knnOut1 <- qeKNN(svc1,'wageinc',holdout=NULL)
predict(knnOut1,t0)
# [,1]
# [1,] 74068

No change! The skewing had no effect.

What is going on here? Mathematically, if t is in A, then

E(Y | X = t, X in A) = E(Y | X = t)

since X = t is more restrictive than X in A.
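
A tiny simulated check of this identity, using a discrete X so that we can condition on X = t exactly:

set.seed(222)
x <- sample(1:5, 100000, replace=TRUE)
y <- 2*x + rnorm(100000)
mean(y[x == 3])            # estimate of E(Y | X = 3), full data
mean(y[x == 3 & x >= 2])   # E(Y | X = 3, X in {2,...,5}); the very same cases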

Bottom line: The common thinking that “the distribution of X is skewed, so this distorts our linear regression analysis” is incorrect. True, as is always the case with linear models, one must check for linearity (see below) and pay attention to sample size (via a look at the estimated coefficients’ standard errors), but there is nothing inherently distortionary about a skewed X distribution.

A specifically causal-analytic example of this idea:

Think of a university and its admissions policy, under which a student is admitted if either their SAT score or their high school grade point average HGPA is above a certain threshold, s and h respectively. Suppose we are interested in the individual effects of SAT and HGPA on university grade point average UGPA, and are considering using a linear model.

Causal analysis enthusiasts reading this will immediately recognize this situation as involving a collider. This is taken as a danger signal, for good reason in one particular sense to be described below, but on the other hand it does NOT invalidate our analysis of the individual effects of SAT and HGPA.

We are modeling E(UGPA | SAT=t1 and HGPA=t2), i.e. Y = UGPA and X = (X1,X2) = (SAT,HGPA). The structure of the admissions process is such that our data has been restricted to the set {X1 >= s or X2 >= h}. Due to the considerations discussed above, we see that this restriction does NOT change our estimate of m(t), i.e. our estimated regression coefficients in a linear model. Our ordinary linear model is still valid.

On the other hand, the situation is different if the object of our interest involves the distribution of X itself. For example, we may be interested in the correlation between X1 and X2 here. This should be substantially positive in the general population, but will probably be weaker in the subpopulation of those admitted to the university.

Comparing classical and causal analyses:

Do coaching schools increase a student’s SAT test score? Much has been written on this issue, pro and con, but here we will be interested less in the answer than in the types of analysis that could lead to an answer.

Classical approach: regression analysis

Let’s again consider a linear model, for simplicity, using HGPA as our explanatory variable, in addition to C, an indicator variable for whether the student has had coaching. One concern is that only the more financially well-off families can afford coaching, so let’s use family income FI as another predictor. (We could of course consider some other factors as well, but let’s keep things simple.)

Our linear model then would be

E(S | HGPA, C, FI) = b0 + b1 HGPA + b2 C + b3 FI

where C is 1 or 0, depending on whether the student had coaching.

This would be the classical approach to assessing the effect of coaching. The quantity b2 represents the average extra points on the SAT, for students of fixed grades and family income. Comparing two students who are identical in grades and income, but one having the coaching and the other not, b2 is the expected difference in SAT. It would be the centerpiece of our analysis of the effects of coaching.

This is the traditional notion of ceteris paribus, “everything else being equal.” We might consider adding interaction terms (see below), but traditionally this would be the standard approach.

This model also allows one to investigate whether family income is a factor in SAT scores (directly, rather than through C). Studies on this have gone both ways, and my own analysis has suggested that the effect is small, but in any case, the above model could be used for such analysis.

What about interaction terms? Studies indicate that the higher-income kids are more likely to have been reading for pleasure since an early age. They thus have good reading comprehension and vocabulary, so the SAT coaching is not very useful to them. But the lower-income kids, perhaps reading less for pleasure, might benefit substantially from the coaching.

We would thus add an interaction term:

E(S | HGPA, C, FI) = b0 + b1 HGPA + b2 C + b3 FI + b4 C x FI

The above scenario would be reflected in a negative estimated value for b4: for coached students (C = 1), the coaching benefit b2 + b4 FI shrinks as income rises.

Note that now there is no single effect for the coaching, but rather one effect for each income level. A similar issue would arise if we forego a linear model, and turn to some nonparametric regression method, say kNN. This would involve estimating m(t) for various levels of t = (HGPA,C,FI) deemed of interest, and then comparing those estimates.
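
In R terms, the two linear models above would look something like this; satData is a hypothetical data frame with columns SAT, HGPA, C (the 0/1 coaching indicator) and FI.

fit1 <- lm(SAT ~ HGPA + C + FI, data=satData)         # main-effects model
fit2 <- lm(SAT ~ HGPA + C + FI + C:FI, data=satData)  # with the interaction term
coef(fit2)['C:FI']   # estimate of b4; negative under the scenario described above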

A causal approach: covariate matching

Note first, once again, that in the above linear model, it is NOT a problem if, say, our sample data skews toward higher incomes. As with any regression problem, we do have to consider sample size and assess linearity, but there is nothing inherently problematic about the skewed X distribution.

What saves us here? The answer is that we are taking covariates –HGPA and FI–into account. Raw comparison of SATs of the coached and noncoached groups could be very misleading in this setting, because the coached group might be richer and thus have, e.g., more experience reading for pleasure. It then would misleadingly look like the coaching led to higher SATs, rather than being due to differential backgrounds between the richer and poorer students.

Well, we can account for differences between covariates for the coached and noncoached groups in another way, by reconfiguring the data points in the two groups in such a way that the data is now balanced. Many ways have been devised for this, and a number of R packages have been developed along these lines.

For example, consider the function Matching::Match(y,treat,x). In our SAT example, y would be SAT, treat would be C, and x would be (HGPA,FI). The function forms pairs of students from the coached and noncoached groups, where in each pair the coached and noncoached students have the same or similar values of x. Those students who are selected (some may not be chosen) now give us two balanced groups, one coached and the other noncoached, who are similar in all respects other than coaching. For instance, the mean HGPA will be similar in the coached and noncoached groups, and so on. (After forming the groups, the pairing no longer is used.) We can now evaluate coaching fairly.

One drawback to covariate matching is that the larger the dimension of the X vector, the harder it is to do accurate covariate matching. An alternative is to match on just one summary quantity, the propensity score, which is the probability that the treatment will be “assigned.” In our SAT coaching example, it could be the probability that the student had coaching, given his/her family income. We can compute this using a logistic model, say, or a nonparametric technique such as kNN.

We then match on the propensity score alone; the third argument in Match would be that score, rather than the covariate vector X. The thinking is that by matching people with score 0.62, say, one of them who had coaching and one who didn’t, we’ve accounted for covariates.
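
Here is a sketch of the two matching variants, again using the hypothetical satData; Matching::Match() and glm() are real functions, but the data frame and its column names are made up for illustration.

library(Matching)
# (a) match directly on the covariates HGPA and FI
m1 <- Match(Y=satData$SAT, Tr=satData$C, X=satData[,c('HGPA','FI')])
# (b) match on an estimated propensity score instead
psFit <- glm(C ~ HGPA + FI, data=satData, family=binomial)
pscore <- fitted(psFit)   # estimated P(C = 1 | HGPA, FI)
m2 <- Match(Y=satData$SAT, Tr=satData$C, X=pscore)
summary(m1); summary(m2)  # estimated effects of coaching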

A key point: the regression and matching approaches are fundamentally incomparable

Once again, the regression approach analyzes E(Y | X), a conditional mean, unaffected by the distribution of X. By contrast, covariate matching estimates E(Y), an unconditional mean, which depends very strongly on the distribution of X. If, in the family income example above, the distribution of X is skewed toward the more well-off families, our analysis of the effect of coaching will be fair, but the magnitude of the effect will tell us how well coaching works among well-off families. Note that the same concern arises if the original distribution of X was not skewed but the data points that survive the matching/selection process are skewed.

Directed Acyclic Graphs (DAGs):

Much of the surge in the use of causal analysis in recent years can be attributed to the attractiveness of DAGs; we all like pictures. Yet DAGs are a prime example of the importance of “knowing about something rather than just knowing the name of something.”

There are some extremely important points to keep in mind about DAGs. First, the word causal should not be taken literally; there is nothing in the framework about actual causation.

Second, DAGs cannot be generated by fitting from data; techniques for causal discovery require very strong assumptions. One can assess whether a DAG fits the data well, but not much more.

Third, DAGs are not unique. For any given set of probabilities, many different DAGs might fit those numbers equally well. Carnegie Mellon University statistics professor Cosma Shalizi has commented that it is highly misleading, indeed unethical, for a researcher to present a DAG without presenting other very different DAGs that are nevertheless mathematically equivalent.

Is causal analysis worth the extra trouble? Why not just use a linear regression model?

Linear models are very attractive here, easy to run and understand. One can add meaningful interaction terms, and if linearity is in question, we can try adding polynomial terms.

Another causal technique, structural equation models (SEMs), runs a series of regression models, in which not only is Y expressed in terms of the components of X, but those components are also expressed in terms of each other. Here again, we need not worry about skewing of distributions, provided we verify our linearity assumptions.

On the other hand, SEMs require that we have the DAG relating all the variables, and this cannot be determined from data; it essentially must be set from domain expertise in the field of study.

Why m(t)?

The above discussion regarding the irrelevance of whether X has a skewed distribution in some direction assumes that the quantity of interest is m(t) = E(Y | X=t), the regression function. Let’s take a closer look at this.

From a simple application of calculus, we know that the function f that minimizes the quantity E[(Y - f(X))^2 | X = t] is m. So basically any prediction method that minimizes a sum of squares will take the form of a conditional mean. That includes linear models, of course, and also neural networks.

Note that the predicted value for a new case having X = t is the regression function value, m(t). Thus m is optimal in the sense of minimizing Mean Squared Prediction Error (at the population level).
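
A quick simulated sanity check: any other function of X gives a larger mean squared prediction error than m does.

set.seed(111)
x <- rnorm(100000)
y <- x + rnorm(100000)     # here the true m(t) = t
mean((y - x)^2)            # predict with m(X); about 1, the smallest
mean((y - 1.2*x)^2)        # other functions of X do worse
mean((y - (x + 0.5))^2)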

Some predictive methods directly estimate m(t), such as k-NN and random forests.

What about predicting a dichotomous Y? Code Y to be 1 or 0. Then m(t) reduces to P(Y = 1 | X=t), and one can predict 1 or 0 depending on whether this quantity is greater or less than 0.5. (Or use another threshold for unequal error costs.) It can be shown that in this way, m(t) minimizes Overall Misclassification Error.
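
A small simulated illustration of the dichotomous case; the logistic model and the 0.5 threshold here are just one convenient choice.

set.seed(333)
x <- rnorm(5000)
y <- as.numeric(runif(5000) < plogis(2*x))  # true m(t) = P(Y=1|X=t) = plogis(2t)
fit <- glm(y ~ x, family=binomial)
probs <- fitted(fit)                        # estimated m(x) values
preds <- as.numeric(probs > 0.5)            # predict 1 when estimated m(x) > 0.5
mean(preds != y)                            # overall misclassification rate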

What about L1 models, say quantile regression? Here we minimize

E(|Y – f(X)| | X = t),

which has the solution f(t) = conditional median of Y given X = t. Then the same argument as before goes through:

E[|Y – f(X)| | X = t, X in A] = E[|Y – f(X)| | X = t]

for the same reasons as before.

The same analysis works for any nonnegative loss function.

Assessing linearity:

One way to visually assess linearity (I frown on hypothesis testing in general, especially Goodness of Fit testing) is to fit linear and nonparametric models, then plot the predictions of one against the other. Here’s an example:

library(qeML)
data(mlb1)
mlb <- mlb1[,-1] # height, weight, age
# predict weight from height and age, first using kNN, then a linear model
knnout <- qeKNN(mlb,'Weight',holdout=NULL)
lmout <- qeLin(mlb,'Weight',holdout=NULL)
lmpreds <- predict(lmout,mlb[,-2])
plot(knnout$regests,lmpreds)
[Figure: linear model predictions plotted against kNN predictions]

Looks pretty good overall, only a handful of outliers.

Torch for R Now in the qeML Package

I’ve added a new function, qeNeuralTorch, to the qeML package, as an alternative to the package’s qeNeural. It is experimental at this point, but usable, and I urge everyone to try it out. In this post, I will (a) state why I felt it desirable to add such a function, (b) show a couple of examples, (c) explain how the function works, thereby giving you an introduction to Torch, and finally (d) explain what at this point appears to be a fundamental problem with Torch.

If you need an introduction to neural networks, see the ML overview vignette in qeML.

Why an alternative to qeNeural?

The qeNeural function is a wrapper for regtools::krsFit, which is based on the R keras package. In turn, that calls functions in the tensorflow package, which calls the original Python version of the same name.

Though there has been much progress in interfacing R to Python, the interface is difficult to set up, especially in terms of the environment variables. That makes the “Torch for R” package, torch, highly attractive, as it contains no Python. Torch itself was developed (in Python) as a clearer alternative to Tensorflow, and is now probably more popular than the latter. So, it’s nice that a version for R has been produced, especially one that is Python-free.

Another advantage is easy access to GPUs, notably on Macs, where GPU access is generally problematic. There is a companion R package, luz, that can be used to speed up Torch computation.

Examples

Here we use the svcensus data from qeML.

library(qeML)
data(svcensus)

lyrsReg <- list(
   list('linear',0,100),
   list('relu'),
   list('linear',100,100),
   list('relu'),
   list('linear',100,1))

z <- qeNeuralTorch(svcensus,'wageinc',layers=lyrsReg,
   learnRate=0.05)

The call format is standard qeML. We need to specify the network structure via its layers, hence an argument of that name. (The qeNeural function does this a little differently.) We see a linear, i.e. fully-connected layer of “0” inputs and 100 outputs, then a reLU activation function, then another hidden layer and activation, and finally another linear layer with 1 output, our predicted values. The “0” is filled in at runtime with the number of features.

The above is for regression problems; here is code for classification problems:

lyrsClass <- list(
   list('linear',0,100),
   list('relu'),
   list('linear',100,100),
   list('relu'),
   list('linear',100,1),
   list('sigmoid'))

z <- qeNeuralTorch(svcensus,'gender',yesYVal='male',
   layers=lyrsClass,learnRate=0.003)

That last entry in the layers specification squeezes the result into the interval (0,1), to produce probabilities. Note that this list-of-lists code just defines the network; it does not create it.

How is Torch used in qeNeuralTorch?

One still must work with tensors, which are essentially multidimensional arrays. So there are lines in the function like

xT <- torch_tensor(x)

where the matrix of features is converted to a tensor. The major work, though, is done in first setting up the network, and then running the data through it in an iterative manner. Note that one does not need to be aware of any of this to use qeNeuralTorch, but as usual, understanding the innards of a function sharpens one’s ability to use it well.

The network formation code within qeNeuralTorch works on the above list of lists to form nnSeqArgs, which again simply defines the network for Torch. Then the network is created:

model <- do.call(nn_sequential,nnSeqArgs)

A side note is that torch::nn_sequential has the formal argument ‘…’. That is no problem for the ordinary author of Torch code; if they have, say, 4 arguments in their particular application, they just state them as arguments in a call to nn_sequential. But qeNeuralTorch must allow for a variable number of network layers, hence the use of R’s do.call.
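
For the curious, here is a rough sketch, my own and not the actual qeNeuralTorch code, of how such a list of layers might be translated into arguments for nn_sequential; nn_linear, nn_relu and nn_sigmoid are the real torch constructors, while layersToSeqArgs and the feature matrix x are assumptions for illustration.

library(torch)
layersToSeqArgs <- function(layers, nIn)  # nIn: number of features
   lapply(layers, function(lyr) {
      if (lyr[[1]] == 'linear') {
         nin <- if (lyr[[2]] == 0) nIn else lyr[[2]]  # "0" filled in at runtime
         nn_linear(nin, lyr[[3]])
      } else if (lyr[[1]] == 'relu') nn_relu()
        else nn_sigmoid()
   })
nnSeqArgs <- layersToSeqArgs(lyrsReg, nIn=ncol(x))  # x: the feature matrix
model <- do.call(nn_sequential, nnSeqArgs)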

Here is the code that runs the network:

for (i in 1:nEpochs) {
   preds <- model(xT)                              # forward pass through the network
   loss <- nnf_mse_loss(preds,yT,reduction="sum")  # current L2 loss
   optimizer$zero_grad()                           # clear gradients from the last iteration
   loss$backward()                                 # compute the gradient of the loss
   optimizer$step()                                # update the weights
}

First, our feature data xT is run through the network, which has been initialized with random weights. That produces predictions, preds. The current L2 loss is then computed, the gradient of the loss is determined, and the weights are updated. This goes through the number of iterations specified by the user, nEpochs.

Our implementation is rather primitive; we use that same L2 loss function even in the classification case (actually this can be justified), and for now we are limited to the 2-class case.

Torch for R uses R6 class structure, rather different from the more common S3 and S4. An example above is the line

loss$backward()

Here loss is an object, mainly containing the current loss value but also containing a function backward. The latter is called on the former, as those who’ve used, e.g., Python or Java will recognize.

Again, you need not know this in order to use qeNeuralTorch.

Performance

The package seems to be very sensitive to the learning rate.

Also, it turns out, at least in my implementation, that Torch’s accuracy is generally much weaker than that of other qeML functions in regression cases, but similar in classification cases.

I surmised that this was due to Torch producing overly large weights, and investigated by comparing the L2 norms of its predictions with those of other functions. Sure enough, Torch was producing larger predictions.

Torch optimizers have a regularization parameter, weight_decay, to control this. However, it did not appear to help.

My current guess (maybe someone who reads this will have suggestions) is that since ML applications tend more toward the classification case, the problem of large weights never really arose. Having predictions that are too extreme may not hurt in classification problems, as this simply brings them closer to 0 or 1 when run through a sigmoid function or similar. Since qeNeuralTorch rounds the sigmoid output to 0 or 1 to produce class predictions, it all may work out well in the end.

Note that this also means one should be cautious if one takes the unrounded output of the network to be the actual probabilities.

Quantile Regression with Random Forests

In my December 22 blog, I first introduced the classic parametric quantile regression (QR) concept. I then showed how one could use the qeML package to perform quantile regression nonparametrically, using the package’s qeKNN function for a k-Nearest Neighbors approach. A reader then asked if this could be applied to random forests (RFs). The answer is yes, and this will be the topic of the current post.

My goals in this post, as in the previous one, are to introduce the capabilities of qeML and to point out some general ML issues. The key example of the latter here is the fact that leaves in an RF tree are very similar to neighborhoods in k-NN, which implies that in principle one should be able to do QR in an RF context, just as we did last time with k-NN.

However, as the saying goes, “Easier said than done.” What was key in the kNN case last time was that the qeKNN function argument smoothingFtn gives the user access to the neighborhoods, in that it allows the user to specify a function that performs a user-requested operation in each neighborhood; smoothingFtn offers a local-linear option, for instance, and in the last post I showed how one could achieve QR via a user-written function.

The situation for RFs is not so simple. The problem is that typical RF software does not provide “hooks” directly analogous to smoothingFtn. Some implementations do provide some useful hooks that could play a role, such as randomForest::getTree, but putting them together for the desired result may not be easy, especially given ambiguities in the documentation.

Fortunately, the grf package includes QR functionality. The qeML function qeRFgrf originally wrapped the “ordinary” and local linear options in grf, and I’ve now added QR in v.1.2.

The name ‘grf’ stands for “Generalized Random Forests,” with the main generalization being similar to smoothingFtn, i.e. allowing functions other than the mean to be applied to the data in the leaves. A second aspect of the generalization is to tailor the node-splitting process to the type of smoothing done in the leaves.

In particular, grf includes the function quantile_forest, providing just what our reader inquired about. One specifies the quantiles of interest in an argument quantiles, and later calls the paired predict function to obtain the estimated quantiles of “Y” at requested values of the “X” variables.

The qeML package has an interface to grf, as the function qeRFgrf. To access the QR option (qeML v.1.2), set the qeRFgrf argument quantls to a nonnull value. Here is an example using the North American major league baseball players data (included in qeML with the permission of the UCLA Stat Dept.). We find the 20th, 40th, 60th and 80th percentiles of weight, for each height.

library(qeML)
data(mlb1)

z <- qeRFgrf(mlb1[,2:3],'Weight',quantls=c(0.2,0.4,0.6,0.8),holdout=NULL)
w <- predict(z,mlb1[,2,drop=F])
df1 <- data.frame(x=mlb1[,2,drop=F],y=w[,1],z='0.20')
df2 <- data.frame(x=mlb1[,2,drop=F],y=w[,2],z='0.40')
df3 <- data.frame(x=mlb1[,2,drop=F],y=w[,3],z='0.60')
df4 <- data.frame(x=mlb1[,2,drop=F],y=w[,4],z='0.80')
dfall <- rbind(df1,df2,df3,df4)
qeML:::qePlotCurves(dfall,xlab='ht',ylab='wt')
[Figure: estimated 20th, 40th, 60th and 80th percentile weight curves, plotted against height]

The convenience function qePlotCurves is essentially the code I used in the previous post, now added to v.1.2.

I highly recommend the grf package. My attention was immediately drawn to it when it first came out, as I was pleased to see that I could now do analysis in RFs using non-mean smoothing, as I had been doing with qeKNN. It was written by some top researchers, who also developed the supporting theory.

qeML Example: Nonparametric Quantile Regression

In this post, I will first introduce the concept of quantile regression (QR), a powerful technique that is rarely taught in stat courses. I’ll give an example from the quantreg package, and then will show how qeML can be used to do model-free QR estimation. Along the way, I will also illustrate the use of closures in R.

Notation: We are predicting a scalar Y (including the case of dummy/one-hot variables) from a feature vector X.

In its simplest form, QR estimates the conditional median of Y given X, as opposed to the usual conditional mean, using a linear model. As we all know, the median is less affected by outliers than is the mean, so QR is giving us outlier robustness. As a bonus, we dispense with the homoskedasticity assumption, i.e. constant Var(Y|X).

But it’s more than that. We can model any conditional quantile, e.g. estimate the 80th percentile weight for each human height. Quantile analysis has a variety of applications.

One can conduct QR in R with the quantreg package, written by Prof. Roger Koenker, one of the major names in the QR field. Here is an example, using the qeML dataset mlb:

> data(mlb)
> library(quantreg)
> z <- rq(Weight ~ Height,data=mlb,tau=0.80)
> summary(z)

Call: rq(formula = Weight ~ Height, tau = 0.8, data = mlb)

tau: [1] 0.8

Coefficients:
                 Value      Std. Error  t value    Pr(>|t|)
(Intercept)  -201.66667    17.58797   -11.46617    0.00000
Height          5.66667     0.23856    23.75376    0.00000

As you can see, the call form here is like that of the R linear model function lm, and we could have had multiple predictors, e.g. age in addition to height.

But what if we don’t believe a linear model is appropriate? Of course, as usual we may consider adding polynomial terms, and there is also a package quantreg.nonpar. But we can obtain model-free estimates easily using qeKNN in qeML.

Standard k-Nearest Neighbors estimation is simple. Say to predict the weight of someone 70 inches tall and 28 years old, we find the k closest data points in our training data to the vector (70,28). We then compute the mean weight among those k people, and that is our predicted weight for the new person, who has known height and age but unknown weight.

But qeKNN offers the user more flexibility, via an argument smoothingFtn. Instead of computing mean Y among the neighbors, we can specify the median, or even specify that a small linear model be fit to the neighboring data. The latter may be useful if the new person to be predicted is either very short or very tall, as things tend to be biased near the edges of a dataset. If the new person is 77 inches tall, most or all people in our neighboring data will be shorter than this, thus lighter, so our prediction based on the mean will be biased downward.

But we can also specify our own smoothingFtn, perfect for QR. We simply define a function that gives us the desired Y quantile among the neighbors.

The call form is

smoothingFtn(nearIdxs,x,y,predpt)

Here x and y are our X and Y training data, predpt is the new X value at which we wish to predict (redundant in most cases), and nearIdxs are the indices in x and y of the nearest neighbors to predpt. Note that by the time qeKNN calls smoothingFtn, the indices have already been computed.

Our code is then

sftn <- function(nearIdxs,x,y,predpt)
{
   nearYs <- y[nearIdxs]
   quantile(nearYs,0.80)
}

u <- mlb[c('Height','Age','Weight')]
set.seed(9999)  # qeML ftns do random holdout
z <- qeKNN(u,'Weight',smoothingFtn=sftn)
predict(z,c(70,28))  # prints 200

It would be nice, though, to run this for a general quantile level q, rather than the special case 0.80. But we can’t do that directly, because the smoothingFtn argument to qeKNN must be a function object, no provision there for an argument to smoothingFtn. But we can accomplish what we want via R closures.

makeSmFtn <- function(q) function(newIdxs,x,y,predpt) quantile(y[newIdxs],q)

To understand this, one must first know more about the R reserved word function. Consider this simple example:

f <- function(x) x^2

Here we are saying, “R, please create a function for me. Its formal argument will be named x, and it will compute and return the square of that quantity. After you create that function–an object, just like other R entities–assign it to f.” In other words, function creates functions. As I like to tell my students,

The function of the function named function is to create functions!

Now, going back to makeSmFtn above: it creates a function object (whose body calls quantile) and returns that object, just as with f above, but the key point is that here the value of q will be “baked in” to that object.

So our call to qeKNN for general q would be

z <- qeKNN(u,'Weight',smoothingFtn=makeSmFtn(q))

A Comparison of Several qeML Predictive Methods

Is machine learning overrated, with traditional methods being underrated these days? Yes, ML has had some celebrated successes, but these have come after huge amounts of effort, and it’s possible that similar effort with traditional methods may have produced similar results.

A related issue concerns the type of data. Hard core MLers tend to divide applications into tabular and nontabular data. The former consists of the classical observations in rows, variables in columns format, while the latter means image processing, NLP and the like. The MLers’ prescription: use XGBoost for tabular data, deep learning (in some form) for the nontabular apps.

In this post, I’ll compare the performance of four predictive methods on a year 2000 census dataset svcensus, included with the qeML package. The comparison is just for illustration, not comprehensive by any means. For each method, I vary only one hyperparameter: k for qeKNN; minimum leaf size for qeRFranger; polynomial degree for qePolyLin; and max tree depth for qeXGBoost.

I used only the first 5000 rows of the dataset. There were 5 predictors, becoming 10 after categoricals were converted (internally) to dummies. The variable being predicted is wage income. For each hyperparameter value, 100 runs were done. (Recall: by default, qeML predictive functions form a random holdout set.)

15 hyperparameter values were tried for each method. In the case of qeKNN and qeRFranger, these were seq(5,75,5). For the other two methods, the values were 1:15.
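
The exact code is in the repo mentioned at the end of this post; below is a rough sketch of the kind of loop involved, shown for the kNN case. The k argument and the testAcc component are standard in qeML, but treat this as a sketch rather than the code actually used (and note that the runtimes are substantial).

library(qeML)
data(svcensus)
svc <- svcensus[1:5000,]
kVals <- seq(5,75,5)
# mean absolute prediction error on random holdout sets, 100 runs per k
mapeKNN <- sapply(kVals, function(k)
   mean(replicate(100, qeKNN(svc,'wageinc',k=k)$testAcc)))
cbind(kVals, mapeKNN)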

Here are the results:

[Figure: prediction accuracy vs. hyperparameter value, for the four methods]

The winner here turned out to be good ol’ polynomial regression. Obviously overfitting occurs, but somewhat surprisingly, only after degree 6 or so. Random forests seem to have leveled off at a minimum leaf size of 60 or so. All of the methods might do better with tweaking of other default hyperparameters, especially kNN.

Of course, all the methods here could be considered traditional statistical methods, as all had significant statistician contribution. But only polynomial regression is truly traditional, and it’s interesting to see that it prevailed in this case.

From now on, I plan to make code for my blog posts available in my qeML repo, https://github.com/matloff/qeML in the inst/blogs subdirectory. So, readers can conveniently try their own experiments.
