The “Secret Sauce” Used in Many qeML Functions

In writing an R package, it is often useful to build up some function call in string form, then “execute” the string. To give a really simple example:

> s <- '1+1'
> eval(parse(text=s))
[1] 2

Quite a lot of trouble to go to just to find that 1+1 = 2? Yes, but this trick can be extremely useful, as we’ll see here.

data(svcensus)
z <- qePCA(svcensus,'wageinc','qeKNN',pcaProp=0.5)

This says, “Apply Principal Component Analysis to the ‘svcensus’ data, with enough PCs to get 0.5 of the total variance. Then do k-Nearest Neighbor Analysis, fitting qeKNN to the PCs to predict wage income.”

So we are invoking a qeML function that the user requested, on the user’s requested data. Fine, but the requested qeML function is called using its default values, in this case k = 25, the number of neighbors. What if we want a different value of k, say 50? We can run

z <- qePCA(svcensus,'wageinc','qeKNN',pcaProp=0.5,
   opts=list(k=50))

But how does the internal code of qePCA handle this? Here is where eval comes in. Typing qePCA at the R > prompt shows us the code, two key lines of which are

cmd <- buildQEcall(qeName,"newData",yName, 
   opts=opts,holdout=holdout)
qeOut <- eval(parse(text = cmd))

So qeML:::buildQEcall builds up the full call to the user-requested function, including optional arguments, in a character string. Then we use eval and parse to execute the string. Inserting a call to browser() in qePCA (not shown), we can take a look at that string:

Browse[1]> cmd
[1] "qeKNN(data = newData,yName=\"wageinc\",holdout = 1000,k=50))"

So qeML:::buildQEcall pieced together the call to the user’s requested function, qeKNN, on the user’s requested data–and with the user’s requested optional arguments. In handling the latter, it among other things called names(opts), which yields the argument names (‘k’ here) in string form, exactly what we need.
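Here, for concreteness, is a rough sketch (my own, not the actual qeML:::buildQEcall code) of how such a command string might be pieced together from the opts list:

> opts <- list(k=50)
> optsStr <- paste(names(opts),unlist(opts),sep='=',collapse=',')  # 'k=50'
> cmd <- paste0("qeKNN(data=newData,yName='wageinc',holdout=1000,",optsStr,")")
> cmd
[1] "qeKNN(data=newData,yName='wageinc',holdout=1000,k=50)"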

Note too R’s call function, which similarly creates a function call. (Yes, call produces a call, just like function produces a function.) E.g.,

> sqrt(2)
[1] 1.414214
> sqcall <- call('sqrt',2)
> class(sqcall)
[1] "call"
> eval(sqcall)
[1] 1.414214
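Note that call accepts named arguments as well, so in principle the optional arguments could be attached this way rather than via a string. A small self-contained example:

> mcall <- call('mean',x=c(1,5,9))
> mcall
mean(x = c(1, 5, 9))
> eval(mcall)
[1] 5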

All of this is an example of R’s metaprogramming capabilities–code that produces code–a really cool feature of R that computer science people tend to be ignorant of. Next time you encounter a CS Pythonista who dismisses R as “not a real language,” reply “R is built on functional programming and OOP models, with powerful metaprogramming facilities…”

qeML Example: Issues of Overfitting, Dimension Reduction Etc.

What about variable selection? Which predictor variables/features should we use? No matter what anyone tells you, this is an unsolved problem. But there are lots of useful methods. See the qeML vignettes on feature selection and overfitting for detailed background on the issues involved.

We note at the outset what our concluding statement will be: Even a very simple, very clean-looking dataset like this one may be much more nuanced than it looks. Real life is not like those simplistic textbooks, eh?

Here I’ll discuss qeML::qeLeaveOut1Var. (I usually omit parentheses in referring to function names; see https://tinyurl.com/4hwr2vf.) The idea is simple: For each variable, find prediction accuracy with and without that variable.

Let’s try it on the famous NYC taxi trip data, included (with modification) in qeML. First, note that qeML prediction calls automatically split the data into training and test sets, and compute test accuracy (mean absolute prediction error or overall misclassification error) on the latter.

The call qeLeaveOut1Var(nyctaxi,'tripTime','qeLin',10) predicts trip time using qeML's linear model. (The latter wraps lm, but adds some things and uses the standard qeML call form.) Since the test set is random (as is our data), we’ll do 10 repetitions and average the results. Instead of qeLin, we could have used any other qeML prediction function, e.g. qeKNN for k-Nearest Neighbors.

> qeLeaveOut1Var(nyctaxi,'tripTime','qeLin',10)
         full trip_distance  PULocationID  DOLocationID     DayOfWeek
     238.4611      353.2409      253.2761      246.3186      239.2277
There were 50 or more warnings (use warnings() to see the first 50)

We’ll discuss the warnings shortly, but not surprisingly, trip distance is the most important variable. The pickup and dropoff locations also seem to have predictive value, though day of the week may not.

But let’s take a closer look. There were 224 pickup locations. (Run levels(nyctaxi$PULocationID) to see them.) That’s 223 dummy (“one-hot”) variables; are some more predictive than others? To explore that in qeLeaveOut1Var, we could make the dummies explicit, so each dummy is removed one at a time:

nyct <- factorsToDummies(nyctaxi,omitLast=TRUE)

This function is actually from the regtools package, included in qeML. Then we could try, say,

nyct <- as.data.frame(nyct)
qeLeaveOut1Var(nyct,'tripTime','qeLin',10)

But with so many dummies, this would take a long time to run. We could directly look at mean trip times for each pickup location to get at least some idea of their individual predictive power,

tapply(nyctaxi$tripTime,nyctaxi$PULocationID,mean)
tapply(nyctaxi$tripTime,nyctaxi$PULocationID,length)
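To view the means and counts side by side, a quick sketch:

mns <- tapply(nyctaxi$tripTime,nyctaxi$PULocationID,mean)
cnts <- tapply(nyctaxi$tripTime,nyctaxi$PULocationID,length)
head(cbind(meanTripTime=mns,nTrips=cnts),10)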

Many locations have very little data, so we’d have to deal with that. Note too the possibility of overfitting.

> dim(nyct)
[1] 10000  479

An old rule of thumb is to use under sqrt(n) variables, 100 here. Just a guide, but much less than 479. (Note: Even our analysis using the original factors still converts to dummies internally; nyctaxi has 4 columns, but lm will expand them as in nyct.)

We may wish to delete pickup location entirely. Or, possibly use PCA for dimension reduction,

z <- qePCA(nyctaxi,'tripTime','qeLin',pcaProp=0.75)

This qeML call says, “Compute PCA on the predictors, retaining enough of them for 0.75 of the total variance, and then run qeLin on the resulting PCs.”

But…remember those warning messages? Running warnings() we see messages like “6 rows removed from test set, due to new factor levels.” The problem is that, in dividing the data into training and test sets, some pickup or dropoff locations appeared only in the latter, making those test rows impossible to predict. It also means that the corresponding dummy columns in the training set are all 0s, thus have 0 variance, thus cause problems for PCA. We might then run qeML::constCols to find out which columns have 0 variance, delete those, and try qePCA again.
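A sketch of that cleanup, applied to the dummy-coded nyct from above (I’m assuming here that constCols returns the indices of the constant columns, and that there is at least one such column):

cc <- constCols(nyct)
nyctOK <- nyct[,-cc]               # drop the constant (e.g. all-0) columns
z <- qePCA(nyctOK,'tripTime','qeLin',pcaProp=0.75)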

And we haven’t even mentioned using, say, qeLASSO or qeXGBoost instead of qeLin, etc. But the point is clear: Even a very simple, very clean-looking application like this one may be much more nuanced than it looks.

New Package, New Book!

Sorry I haven’t been very active on this blog lately, but now that I have more time, that will change. I’ve got myriad things to say.

To begin with, then, I’ll announce a major new R package, and my new book.

qeML package (“quick and easy machine learning”)

Featured aspects:

  • Now on CRAN, https://cran.r-project.org/package=qeML.
  • See GitHub README for intro, https://github.com/matloff/qeML.
  • Extremely simple, “one liner” user interface.
  • Ideal for teaching: very simple interface, lots of included datasets, included ML tutorials.
  • Lots for the advanced MLer too: special ML methods, advanced graphics, etc.
  • Lots of related “nuts and bolts” infrastructure functions, e.g. involving R factor conversion.

The Art of Machine Learning, N. Matloff, NSP, 2023

  • For those without prior ML background. (Experienced MLers may learn a thing or two as well.)
  • Anti-“cookbook”: Aims to empower readers to actually use ML in real, non-“sanitized” applications.
  • Really gets into the How? and Why? (and for that matter, the Why Not?).
  • Uses the qeML package.
  • See my writeup on the philosophy of the book, https://github.com/matloff/ArtOfML.

New Statistics Tutorial

I’ve recently completed fastStat, https://github.com/matloff/fastStat, a quick introduction to statistics for those who’ve had a calculus-based probability course. Many such people later need to do statistics, and this will give them quick access. It is modeled after my R tutorial, https://github.com/matloff/fasteR, a quick introduction to R.

It is not just a quick introduction, but a REAL one, a practical one. Even those who do already know statistics will find that they learn from this tutorial.

I write at the top of the tutorial,

…many people know the mechanics of statistics very well, without truly understanding on an intuitive level what those equations are really doing, and this is our focus…For example, consider estimator bias. Students in a math stat course learn the mathematical definition of bias, after which they learn that the sample mean is unbiased and that the sample variance can be adjusted to be unbiased. But that is the last they hear of the issue…most estimators are in fact biased, and lack ‘fixes’ like that of the sample variance. Does it matter? None of that is discussed in textbooks and courses.
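As a tiny illustration of that bias point (my own sketch, not code from the tutorial): the sample standard deviation is a biased estimator of the population standard deviation, even though the sample variance it comes from is unbiased.

set.seed(123)
sds <- replicate(100000,sd(rnorm(10)))  # samples of size 10 from N(0,1), so true sigma = 1
mean(sds)                               # comes out well under 1 (about 0.97), i.e. biased downward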

The tutorial begins with sampling, followed by a realistic view of parametric models, then estimation, standard errors, statistical inference, the Bias-Variance Tradeoff, and multivariate distributions.

It then moves to a major section on prediction, using both classical parametric and machine learning methods. Emphasis is again on the Bias-Variance Tradeoff, with a view toward overfitting. A fresh view of the latter is presented.

Finally, there is an overview of data privacy methods, of major importance today.

Take a look! Comments welcome.

Just How Good Is ChatGPT in Data Science?

Many of you may have heard of ChatGPT, a dazzling new AI tool. We are hearing lots of gushing praise for the tool. Well, how well does it do in data science contexts? I tried a few queries here, and found interesting results.

I first requested, “Write an R function that returns every other element of a vector x, starting with the third.” I won’t show the code here, but suffice it to say that it worked! It did give me correct code.
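For reference, here is one correct solution of my own (not ChatGPT’s actual code), assuming the vector has at least three elements:

> everyOther <- function(x) x[seq(3,length(x),by=2)]
> everyOther(1:10)
[1] 3 5 7 9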

Next, “In R’s built-in dataset CO2, fit a linear regression with ‘uptake’ as the outcome variable, and find the standard error of the intercept term.” Alas, it pleaded ignorance: “…I do not have the ability to perform calculations or access specific datasets…” Not really true; unfortunately, ChatGPT was not able to pick up on the key phrase, “built-in.”

I then gave it what turned out to be an easier question, “In the output of R’s ‘lm’ function, how do I find the standard error of the intercept term?” It returned a correct and useful answer, stating that ‘lm’ returns lots of material, “…including the standard errors of the model coefficients. The standard error of the intercept term can be accessed by calling the summary function on the model object and looking at the output…”
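For the CO2 query above, for instance, this is the sort of code one might write by hand (my own, not ChatGPT output):

fit <- lm(uptake ~ .,data=CO2)                    # CO2 is built into R
coef(summary(fit))["(Intercept)","Std. Error"]    # SE of the intercept term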

Next: “In R, what are the advantages of using the density() function in lieu of hist()?” The reply had some correct ingredients, but had problems:

“…the resulting plot [from density()] is smoother and more continuous, which can make it easier to see the underlying distribution of the data. This can be especially useful when the data is noisy or when the number of data points is small…The density function allows you to specify the bandwidth, or the smoothness, of the curve, which can be useful for highlighting different features of the data. In contrast, the hist function does not allow you to control the smoothness of the resulting plot.”

Putting aside the fact that the phrase “more continuous” is nonsense, there are other issues. It is not necessarily true that density() will do better on small/noisy data. And the app misses the point that hist() does have an analog of the bandwidth, namely the bin width.
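To illustrate that last point, a quick sketch with made-up data:

set.seed(1)
x <- rnorm(500)
hist(x,breaks=40)          # more breaks means narrower bins, the analog of a smaller bandwidth
plot(density(x,bw=0.2))    # density() makes the bandwidth explicit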

Finally, I asked a question that every statistician is asked by non-stat people: “How can public opinion polls be accurate if they are based on only a small fraction of the population?” The response (not shown here) notes some standard practices such as stratified sampling, but basically sidesteps the question.

Say for example 1200 people are sampled, as is common, and we stratify by race. That would mean designing the sample to include about 160 Black people. But anyone who wondered about the small sample overall would be even more puzzled as to why 160 African-Americans is “representative.”

So in this case, ChatGPT would give a very misleading answer to an important, common question.

And we see that machines can fail Statistics, just like college students. 🙂

Use of Differential Privacy in the US Census–All for Nothing?

The field of data privacy has long been of broad interest. In a medical database, for instance, how can administrators enable statistical analysis by medical researchers, while at the same time protecting the privacy of individual patients? Over the years, many methods have been proposed and used. I’ve done some work in the area myself.

But in 2006, an approach known as differential privacy (DP) was proposed by a group of prominent cryptography researchers. With its catchy name and theoretical underpinnings, DP immediately attracted lots of attention. Being more mathematical than many other statistical disclosure control methods, and thus good fodder for theoretical research, it quickly led to a flurry of papers showing how to apply DP in various settings.
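For readers new to DP, here is a toy sketch of the basic idea, the classic Laplace mechanism for releasing a single count. This is only an illustration of the flavor of DP, not the Census Bureau’s actual implementation.

# Laplace noise via inverse-CDF sampling
rlaplace <- function(n,scale) {
   u <- runif(n,-0.5,0.5)
   -scale * sign(u) * log(1 - 2*abs(u))
}
trueCount <- 1234                        # some hypothetical cell count
epsilon <- 0.5                           # privacy budget; the sensitivity of a count is 1
trueCount + rlaplace(1,scale=1/epsilon)  # the noisy value actually released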

DP was also adopted by some firms in industry, notably Apple. But what really gave DP a boost was the decision by the US Census Bureau to use DP for their publicly available data, beginning with the most recent census, 2020. On the other hand, that really intensified the opposition to DP. I have my own concerns about the method.

The Bureau, though, had what it considered a compelling reason to abandon its existing privacy methods: its extensive computer simulations showed that those methods were vulnerable to attack, in such a manner as to exactly reconstruct large portions of the “private” version of the census database. This of course must be avoided at all costs, and so DP was implemented.

But now…it turns out that the Bureau’s reconstruction claim was incorrect, according to a recent paper by Krishna Muralidhar, who writes,

“This study shows that there are a practically infinite number of possible reconstructions, and each reconstruction leads to assigning a different identity to the respondents in the reconstructed data. The results reported by the Census Bureau researchers are based on just one of these infinite possible reconstructions and is easily refuted by an alternate reconstruction.”

This is one of the most startling statements I’ve seen in my many years in academia. It would appear that the Bureau committed a “rush to judgment” on a massive scale, just mind-boggling, and in addition–much less momentous but still very concerning–gave its imprimatur to methodology that many believe has serious flaws.

Base-R and Tidyverse Code, Side-by-Side

I have a new short writeup, showing common R design patterns, implemented side-by-side in base-R and Tidy.

As readers of this blog know, I strongly believe that Tidy is a poor tool for teaching R learners who have no coding background. Relative to learning in a base-R environment, learners using Tidy take longer to become proficient, and once proficient, find that they are only equipped to work in a very narrow range of operations. As a result, we see a flurry of online questions from Tidy users asking “How do I do such-and-such,” when a base-R solution would be simple and straightforward.

I believe the examples here illustrate that base-R solutions tend to be simpler, and thus that base-R is a better vehicle for R learners. However, another use of this document would be as a tutorial for base-R users who want to learn Tidy, and vice versa.
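As a tiny taste of the side-by-side format (the little data frame here is made up for illustration):

d <- data.frame(x=1:3,y=4:6)
# base-R: add a column that is the sum of two others
d$z <- d$x + d$y
# Tidy equivalent
library(dplyr)
d <- mutate(d,z=x+y)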

A New Approach to Fairness in Machine Learning

During the last year or so, I’ve been quite interested in the issue of fairness in machine learning. This area is more personal for me, as it is the confluence of several interests of mine:

  • My lifelong activity in probability theory, math stat and stat methodology (in which I include ML).
  • My lifelong activism aimed at achieving social justice.
  • My extensive service as an expert witness in litigation involving discrimination (including a landmark age discrimination case, Reid v. Google).

(Further details in my bio.) I hope I will be able to make valuable contributions.

My first of two papers in the Fair ML area is now on arXiv. The second should be ready in a couple of weeks.

The present paper, with my former student Wenxi Zhang, is titled, A Novel Regularization Approach to Fair ML. It’s applicable to linear models, random forests and k-NN, and could be adapted to other ML models.

Wenxi and I have a ready-to-use R package for the method, EDFfair. It uses my qeML machine learning library. Both are on GitHub for now, but will go onto CRAN in the next few weeks.

Please try the package out on your favorite fair ML datasets. Feedback, both on the method and the software, would be greatly appreciated.

Base-R Is Alive and Well

As many readers of this blog know, I strongly believe that R learners should be taught base-R, not the tidyverse. Eventually the students may settle on using a mix of the two paradigms, but at the learning stage they will benefit from the fact that base-R is simpler and more powerful. I’ve written my thoughts in a detailed essay.

One of the most powerful tools in base-R is tapply(), a real workhorse. I give several examples in my essay in which it is much simpler and easier to use that function than the tidyverse equivalents.
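A typical example of the kind of comparison the essay makes (my own example here, using the built-in mtcars data):

# base-R: mean mpg for each number of cylinders, in one line
tapply(mtcars$mpg,mtcars$cyl,mean)
# Tidy equivalent
library(dplyr)
mtcars %>% group_by(cyl) %>% summarize(mpg=mean(mpg))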

Yet somehow there is a disdain for tapply() among many who use and teach Tidy. To them, the function is the epitome of “what’s wrong with” base-R. The latest example of this attitude arose on Twitter a few days ago, in which two Tidy supporters were mocking tapply(), treating it as a highly niche function with no value in ordinary daily usage of R. They strongly disagreed with my “workhorse” claim, until I showed them that in the code of ggplot2, Hadley has 7 calls to tapply().

So I did a little investigation of well-known R packages by RStudio and others. The results, which I’ve added as a new section in my essay, are excerpted below.

——————————–

All the breathless claims that Tidy is more modern and clearer, while base-R is old-fashioned and unclear, fly in the face of the fact that RStudio developers, and authors of other prominent R packages, tend to write in base-R, not Tidy. And all of them use some base-R instead of the corresponding Tidy constructs.

package       *apply() calls   mutate() calls
brms                     333                0
broom                     38               58
datapasta                 31                0
forecast                  82                0
future                    71                0
ggplot2                   78                0
glmnet                    92                0
gt                       112               87
knitr                     73                0
naniar                     3               44
parsnip                   45               33
purrr                     10                0
rmarkdown                  0                0
RSQLite                   14                0
tensorflow                32                0
tidymodels                 8                0
tidytext                   5                6
tsibble                    8               19
VIM                      117               19

Striking numbers for those who learned R via a tidyverse course. In particular, mutate() is one of the very first verbs one learns in a Tidy course, yet mutate() is used 0 times in most of the above packages. And even the packages in which this function is called a lot also have plenty of calls to the base-R *apply() functions, which Tidy is supposed to replace.

Now, why do these prominent R developers often use base-R, rather than the allegedly “modern and clearer” Tidy? Because base-R is easier.

And if it’s easier for them, it’s all the more so for R learners. In fact, an article discussed later in this essay, aggressively promoting Tidy, actually accuses students who use base-R instead of Tidy of taking the easy way out. Easier, indeed!
