<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="http://idc9.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="http://idc9.github.io/" rel="alternate" type="text/html" /><updated>2024-11-04T17:45:04+00:00</updated><id>http://idc9.github.io/feed.xml</id><title type="html">Iain Carmichael</title><subtitle>Iain&apos;s personal website.</subtitle><entry><title type="html">Word embedding tutorial in python</title><link href="http://idc9.github.io/nlp/2017/10/09/word-embedding-tutorial.html" rel="alternate" type="text/html" title="Word embedding tutorial in python" /><published>2017-10-09T00:00:00+00:00</published><updated>2017-10-09T00:00:00+00:00</updated><id>http://idc9.github.io/nlp/2017/10/09/word-embedding-tutorial</id><content type="html" xml:base="http://idc9.github.io/nlp/2017/10/09/word-embedding-tutorial.html"><![CDATA[<p>I recently gave a tutorial on getting started with word embeddings in Python to a digital humanities group. The tutorial covers material from chapters <a href="https://web.stanford.edu/~jurafsky/slp3/15.pdf">15 (vector semantics)</a> and <a href="https://web.stanford.edu/~jurafsky/slp3/16.pdf">16 (semantics with dense vectors)</a> of <a href="https://web.stanford.edu/~jurafsky/slp3/">Speech and Language Processing</a>. The data set is ~30,000 Supreme Court opinions provided by <a href="https://www.courtlistener.com/">CourtListener</a>. The repository comes with a small data set loaded and instructions for getting more data from CourtListener.</p>
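<p>To give a flavor of the vector semantics material, here is a toy illustration of cosine similarity, the standard way to compare word vectors. The three-dimensional vectors below are made up for illustration (real embeddings, e.g. from word2vec, typically have 100-300 dimensions), so take this as a sketch rather than output from the tutorial itself.</p>

```python
import math

# Made-up 3-dimensional "embeddings" for illustration only.
embeddings = {
    "court":  [0.9, 0.1, 0.2],
    "judge":  [0.8, 0.2, 0.3],
    "banana": [0.1, 0.9, 0.7],
}

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: u.v / (|u| |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Words used in similar contexts should get similar vectors,
# so the related pair scores higher than the unrelated pair.
sim_related = cosine_similarity(embeddings["court"], embeddings["judge"])
sim_unrelated = cosine_similarity(embeddings["court"], embeddings["banana"])
```

<p>With real embeddings you would load pre-trained vectors (or train your own on the opinions corpus) instead of hand-writing them, but the comparison step is exactly this computation.</p>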

<p>You can find the tutorial/instructions/additional resources at: <a href="https://github.com/idc9/word_embed_tutorial"><strong>https://github.com/idc9/word_embed_tutorial</strong></a></p>]]></content><author><name></name></author><category term="nlp" /><summary type="html"><![CDATA[I recently gave a tutorial on getting started with word embeddings in Python to a digital humanities group. The tutorial covers material from chapters 15 (vector semantics) and 16 (semantics with dense vectors) of Speech and Language Processing. The data set is ~30,000 Supreme Court opinions provided by CourtListener. The repository comes with a small data set loaded and instructions for getting more data from CourtListener.]]></summary></entry><entry><title type="html">Data Science and the Undergraduate Curriculum</title><link href="http://idc9.github.io/data_science/2017/08/12/data-science-undergrad-curriculum-talk.html" rel="alternate" type="text/html" title="Data Science and the Undergraduate Curriculum" /><published>2017-08-12T00:00:00+00:00</published><updated>2017-08-12T00:00:00+00:00</updated><id>http://idc9.github.io/data_science/2017/08/12/data-science-undergrad-curriculum-talk</id><content type="html" xml:base="http://idc9.github.io/data_science/2017/08/12/data-science-undergrad-curriculum-talk.html"><![CDATA[<p>I recently gave <a href="http://stat-or.unc.edu/event/stor-colloquium-iain-davis-unc-chapel-hill">a talk</a> to my department about my experiences and takeaways from developing/teaching a new course: <a href="https://idc9.github.io/stor390/">STOR 390: Introduction to Data Science Course</a>. The talk is about both the new course and more generally some thoughts about the undergraduate statistics curriculum.</p>

<p>You can find the slides here: <a href="https://docs.google.com/presentation/d/1XUaNIybiPD6OpTs-ou5baSUQYiUOJuafXQsvEwrChjc/edit"><strong>https://docs.google.com/presentation/d/1XUaNIybiPD6OpTs-ou5baSUQYiUOJuafXQsvEwrChjc/edit</strong></a>.</p>

<p>There will hopefully be a follow-up article/blog post some day, but I think the slides convey the main messages. Many of the points are based on existing literature, which is linked at the end along with other courses I found helpful in developing the class.</p>]]></content><author><name></name></author><category term="data_science" /><summary type="html"><![CDATA[I recently gave a talk to my department about my experiences and takeaways from developing/teaching a new course: STOR 390: Introduction to Data Science Course. The talk is about both the new course and more generally some thoughts about the undergraduate statistics curriculum.]]></summary></entry><entry><title type="html">Releasing software packages</title><link href="http://idc9.github.io/software/2017/07/15/ajive-lessons.html" rel="alternate" type="text/html" title="Releasing software packages" /><published>2017-07-15T00:00:00+00:00</published><updated>2017-07-15T00:00:00+00:00</updated><id>http://idc9.github.io/software/2017/07/15/ajive-lessons</id><content type="html" xml:base="http://idc9.github.io/software/2017/07/15/ajive-lessons.html"><![CDATA[<p><a href="/software/2017/07/15/ajive-package.html">I recently released</a> my first <a href="https://github.com/idc9/r_jive">R</a> and <a href="https://github.com/idc9/py_jive">Python</a> packages. This post contains some thoughts and advice about releasing software packages – particularly for other graduate students.</p>

<p>The question of “should you release a package?” is highly context dependent (e.g. if you are a probabilist the answer is probably no). There are a number of trade-offs to consider. For example, academia does not seem to value software very much. More importantly, there is a large time cost to developing software packages, time that could have been spent writing papers. This cost includes:</p>

<ul>
  <li>Coding the basic functionality</li>
  <li>Turning your code into a package someone else can download and use</li>
  <li>Documentation for the code</li>
  <li>Providing data analysis examples</li>
  <li>Maintaining and updating the package</li>
  <li>Responding to user feedback</li>
  <li>Surveying the existing literature to make sure your package provides new functionality</li>
</ul>

<p>I think academia is starting to value software more than it used to<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>. I would argue that, in many cases, releasing code is as important as writing a paper. Some of the benefits to you that come from releasing a software package include:</p>

<ul>
  <li>Save future you time. Better code now = less headache in the future.</li>
  <li>Fame/glory/prestige for people using your work.</li>
  <li>Help other people solve their problems. If part of your rationale for doing research/academia is helping to solve problems, then good code might be as impactful as (or more impactful than) a paper.</li>
  <li>Software skills are highly valued in industry.</li>
  <li>You might learn new things out of necessity (e.g. computational linear algebra) and/or better understand your own research.</li>
</ul>

<h1 id="resources">Resources</h1>

<p>Programming is typically a small part of the statistics curriculum (and most other scientific disciplines); we don’t think of ourselves as software engineers even though many of us spend a lot of time writing code. Luckily there are many quality, open-source resources that show you how to write better code and release software. Without these resources (particularly the  <a href="http://r-pkgs.had.co.nz/">R Packages</a> book) it would have taken me 1-2 orders of magnitude more time to build these packages<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup>.</p>

<p>These resources are helpful for <strong>creating R/Python packages</strong>:</p>

<ul>
  <li>Hadley Wickham’s <a href="http://r-pkgs.had.co.nz/"><strong>R Packages book</strong></a> and <a href="https://github.com/hadley/devtools">devtools</a> were incredibly helpful. If you plan on building an R package read this book.</li>
  <li><a href="http://python-packaging.readthedocs.io/en/latest/index.html"><strong>This tutorial on a minimal Python package</strong></a> and <a href="https://github.com/audreyr/cookiecutter">cookiecutter</a> give helpful templates and instructions to create a Python package.</li>
  <li>Tim Hopper’s <a href="https://www.youtube.com/watch?v=uRul8QdYvqQ">talk on releasing code</a> gives a good high level overview of how/why to release code.</li>
  <li>Hosting the package on <a href="https://github.com">GitHub</a> gives you a lot of functionality for free (e.g. users can submit feedback via GitHub issues).</li>
</ul>
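<p>To make the Python side of the resources above concrete, a minimal installable package needs little more than a directory with an <code>__init__.py</code> plus a <code>setup.py</code>. Everything below (the package name, version, dependency) is a placeholder sketch, not taken from any package mentioned in this post.</p>

```python
# setup.py -- a minimal, hypothetical example.
# Assumed directory layout:
#   mypackage/
#     setup.py
#     mypackage/
#       __init__.py
from setuptools import setup, find_packages

setup(
    name="mypackage",            # placeholder name
    version="0.1.0",
    packages=find_packages(),    # finds the inner mypackage/ directory
    install_requires=["numpy"],  # list runtime dependencies here
)
```

<p>With this in place, <code>pip install .</code> from the top-level directory installs the package locally; the tutorial and cookiecutter links above cover the fuller setup (tests, docs, licensing).</p>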

<p>These resources helped me become a <strong>better programmer</strong>:</p>

<ul>
  <li><a href="https://arxiv.org/pdf/1609.00037.pdf">Good Enough Practices in Scientific Computing</a></li>
  <li><a href="http://www.artima.com/weblogs/viewpost.jsp?thread=331531">Some principles of good programming</a></li>
  <li>Jeff Leek’s book on <a href="https://leanpub.com/modernscientist">How to be a Modern Scientist</a> and an uncountable number of <a href="https://simplystatistics.org/">simplystatistics</a> posts.</li>
  <li>Unit testing made the packages a lot less buggy (<a href="http://r-pkgs.had.co.nz/tests.html">testthat</a> for R and <a href="https://github.com/ehmatthes/pcc/releases/download/v1.0.0/beginners_python_cheat_sheet_pcc_testing.pdf">unittest</a> for Python).</li>
  <li>Reading/borrowing from existing, quality code bases. I found the following helpful:
    <ul>
      <li>R: <a href="https://github.com/tidyverse/ggplot2">ggplot2</a>, <a href="https://github.com/juliasilge/tidytext">tidytext</a>.</li>
      <li>Python: <a href="https://github.com/scikit-learn/scikit-learn">sklearn</a>, <a href="https://github.com/scikit-learn-contrib/lightning">lightning</a>.</li>
    </ul>
  </li>
</ul>
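<p>As a small illustration of the unit-testing point, here is the shape of a minimal <code>unittest</code> test case. The <code>center</code> function is a hypothetical helper invented for this example, not code from either package.</p>

```python
import unittest

def center(values):
    """Subtract the mean from each value (a hypothetical helper to test)."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]

class TestCenter(unittest.TestCase):
    def test_centered_mean_is_zero(self):
        # After centering, the values should average to zero.
        centered = center([1.0, 2.0, 3.0])
        self.assertAlmostEqual(sum(centered) / len(centered), 0.0)

    def test_already_centered_unchanged(self):
        # Centering data with mean zero should leave it unchanged.
        self.assertEqual(center([-1.0, 0.0, 1.0]), [-1.0, 0.0, 1.0])
```

<p>A handful of tests like this, run on every change, is what caught most of the bugs for me; testthat plays the same role in R.</p>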

<hr />

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p>For example, some <a href="http://jtleek.com/jobs/">statistics postdoc</a> positions require (or highly encourage) applicants to have released an open source package. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:2" role="doc-endnote">
      <p>The time cost to build a package is obviously very context dependent (e.g. your experience, the complexity of the algorithm, etc.). To give you one data point: these packages took me 1-2 weeks each and I have about 2 years of coding experience. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name></name></author><category term="software" /><summary type="html"><![CDATA[I recently released my first R and Python packages. This post contains some thoughts and advice about releasing software packages – particularly for other graduate students.]]></summary></entry><entry><title type="html">R and Python packages for AJIVE</title><link href="http://idc9.github.io/software/2017/07/15/ajive-package.html" rel="alternate" type="text/html" title="R and Python packages for AJIVE" /><published>2017-07-15T00:00:00+00:00</published><updated>2017-07-15T00:00:00+00:00</updated><id>http://idc9.github.io/software/2017/07/15/ajive-package</id><content type="html" xml:base="http://idc9.github.io/software/2017/07/15/ajive-package.html"><![CDATA[<p>I just released R and Python implementations of <a href="https://arxiv.org/abs/1704.02060">Angle based Joint and Individual Variation Explained</a> (AJIVE). I recently started working on AJIVE for my thesis and releasing an open source package is one of my goals for my PhD. For the code see:</p>

<ul>
  <li><a href="https://github.com/idc9/r_jive"><strong>ajive</strong></a> (R)</li>
  <li><a href="https://github.com/idc9/py_jive"><strong>jive</strong></a> (Python)</li>
</ul>

<p>Both packages are currently a little rough (need more examples, more testing, cleaner code, fewer typos, etc), but they will improve with time and as I/other people use them. If you use one of these packages I encourage you to <strong>send me critical feedback</strong>. Right now the biggest areas in need of improvement are:</p>

<ul>
  <li>More data analysis examples showing how AJIVE can be used.</li>
  <li>More testing to squash bugs I haven’t found.</li>
  <li>Better documentation – both of the code and explaining the AJIVE procedure.</li>
</ul>

<p>It feels wrong putting something out there that is not yet polished, but I figured it’s better to get something that works out there and improve it than to spend the rest of the summer perfecting it instead of writing my thesis (i.e. <em>don’t let the perfect be the enemy of the good</em>).</p>

<p>I learned a lot from building these packages. <a href="/software/2017/07/15/ajive-lessons.html">This next post</a> has some thoughts and advice about releasing software packages – particularly for other graduate students. I will (hopefully soon) put up a few posts discussing how AJIVE works and showing some data analysis examples.</p>]]></content><author><name></name></author><category term="software" /><summary type="html"><![CDATA[I just released R and Python implementations of Angle based Joint and Individual Variation Explained (AJIVE). I recently started working on AJIVE for my thesis and releasing an open source package is one of my goals for my PhD. For the code see:]]></summary></entry><entry><title type="html">Communication in Data Science</title><link href="http://idc9.github.io/communication/2017/06/27/effective-communication.html" rel="alternate" type="text/html" title="Communication in Data Science" /><published>2017-06-27T00:00:00+00:00</published><updated>2017-06-27T00:00:00+00:00</updated><id>http://idc9.github.io/communication/2017/06/27/effective-communication</id><content type="html" xml:base="http://idc9.github.io/communication/2017/06/27/effective-communication.html"><![CDATA[<p>I posted <a href="https://idc9.github.io/stor390/notes/communication/communication.html"><strong>the notes for a lecture on communication</strong></a> in data science that might be interesting/helpful. This lecture provides four general principles<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup> for communication:</p>

<ol>
  <li>adapt to your audience</li>
  <li>maximize the signal to noise ratio</li>
  <li>use effective redundancy</li>
  <li>consider the trade-offs</li>
</ol>

<p>and discusses how these principles apply to various examples in data science (visualization, code structure and literate programming).</p>

<p>Communication skills are important at all levels of technical pursuits, from <a href="http://r-pkgs.had.co.nz/vignettes.html">releasing a software package</a> to conducting <a href="http://distill.pub/2017/research-debt/">research</a>; however, they are underemphasized in STEM education. These notes are from an undergraduate <a href="https://idc9.github.io/stor390/">Introduction to Data Science</a> course I taught last semester and are my best attempt to incorporate communication into the curriculum. Any feedback that might improve this lecture (or help me become a better communicator) is welcome!</p>

<hr />
<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p>The first three of these are from <a href="http://www.treesmapsandtheorems.com/">Trees, Maps and Theorems</a>. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name></name></author><category term="communication" /><summary type="html"><![CDATA[I posted the notes for a lecture on communication in data science that might be interesting/helpful. This lecture provides four general principles for communication.]]></summary></entry><entry><title type="html">Some basic optimization algorithms in Python</title><link href="http://idc9.github.io/optimization/2017/05/17/basic-optimization.html" rel="alternate" type="text/html" title="Some basic optimization algorithms in Python" /><published>2017-05-17T00:00:00+00:00</published><updated>2017-05-17T00:00:00+00:00</updated><id>http://idc9.github.io/optimization/2017/05/17/basic-optimization</id><content type="html" xml:base="http://idc9.github.io/optimization/2017/05/17/basic-optimization.html"><![CDATA[<p>After taking a convex optimization class this past semester I implemented a few basic algorithms for unconstrained optimization (e.g. <a href="https://github.com/idc9/optimization_algos/blob/master/opt_algos/accelerated_gradient_descent.py">Nesterov’s accelerated gradient descent</a>) in Python in this repo: <a href="https://github.com/idc9/optimization_algos"><strong>https://github.com/idc9/optimization_algos</strong></a>.</p>

<p>The purpose of this repo is for me to learn and to have bare-bones implementations of these algorithms sitting around. I tried to make the code as modular and simple as possible so that you (or a future me) can modify it for other purposes (e.g. add bells and whistles, implement other algorithms, etc.). While off-the-shelf solvers such as <a href="http://scikit-learn.org/stable/">sklearn</a> or <a href="http://cvxopt.org/">cvxopt</a> are preferable for many applications, there are times when you want full control over the solver.</p>
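<p>To give a sense of how bare-bones these implementations can be, here is a minimal gradient descent sketch. The function name, step size, and test objective are illustrative choices for this post, not code taken from the repo.</p>

```python
# Plain gradient descent: x_{k+1} = x_k - step * grad(x_k).
def gradient_descent(grad, x0, step_size=0.1, n_iters=100):
    """Minimize a differentiable function given only its gradient."""
    x = x0
    for _ in range(n_iters):
        x = x - step_size * grad(x)
    return x

# Example: f(x) = (x - 3)^2 has gradient 2 * (x - 3) and minimizer x* = 3.
x_star = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
```

<p>The accelerated and stochastic variants in the repo keep this same shape; they just change how the iterate is updated (adding momentum, or replacing the full gradient with a noisy estimate).</p>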

<p>Right now the repo focuses on first order methods (GD, SGD, accelerated GD, etc) for <a href="http://www.cs.cornell.edu/courses/cs4780/2015fa/web/lecturenotes/lecturenote10.html">empirical risk minimization</a> problems. For some useful introductory references see:</p>

<ul>
  <li><a href="http://sebastianruder.com/optimizing-gradient-descent/index.html">An overview of gradient descent optimization algorithms</a> by Sebastian Ruder (good high level overview)</li>
  <li><a href="https://arxiv.org/abs/1606.04838">Optimization Methods for Large-Scale Machine Learning</a> by Léon Bottou, Frank E. Curtis, and Jorge Nocedal</li>
  <li><a href="https://web.stanford.edu/~boyd/cvxbook/bv_cvxbook.pdf">Convex Optimization</a> by Boyd and Vandenberghe (or see <a href="https://www.youtube.com/view_play_list?p=3940DD956CDF0622">video lectures</a>)</li>
</ul>

<p>A few more interesting references:</p>
<ul>
  <li><a href="https://arxiv.org/pdf/1405.4980.pdf">Convex Optimization: Algorithms and Complexity</a> by Sebastien Bubeck</li>
  <li><a href="https://blogs.princeton.edu/imabandit/2014/03/06/nesterovs-accelerated-gradient-descent-for-smooth-and-strongly-convex-optimization/">Nesterov’s Accelerated Gradient Descent for Smooth and Strongly Convex Optimization</a></li>
  <li><a href="http://distill.pub/2017/momentum/">Why Momentum Really Works</a></li>
</ul>]]></content><author><name></name></author><category term="optimization" /><summary type="html"><![CDATA[After taking a convex optimization class this past semester I implemented a few basic algorithms for unconstrained optimization (e.g. Nesterov’s accelerated gradient descent) in Python in this repo: https://github.com/idc9/optimization_algos.]]></summary></entry><entry><title type="html">My favorite resources</title><link href="http://idc9.github.io/resources/2017/05/17/favorite-resources.html" rel="alternate" type="text/html" title="My favorite resources" /><published>2017-05-17T00:00:00+00:00</published><updated>2017-05-17T00:00:00+00:00</updated><id>http://idc9.github.io/resources/2017/05/17/favorite-resources</id><content type="html" xml:base="http://idc9.github.io/resources/2017/05/17/favorite-resources.html"><![CDATA[<p>One of the most underrated parts of modern stats/machine learning is that many of the best resources are available online for free from textbooks to MOOCs to code snippets. Like many people in the area I’ve used these resources to teach myself a lot of what I know. Here is a google doc with some of my favorite resources:</p>

<ul>
  <li><a href="https://docs.google.com/document/d/18gBqIGNyOqzqygRjFAIA7jnieXRVMtBc2_yL7UD-3TM/edit?usp=sharing"><strong>Iain’s favorite stat/ML resources</strong></a></li>
</ul>

<p>Most of these are available for free online (ok you can actually find all of them if you look hard enough). Here are a few worth highlighting:</p>

<ul>
  <li><a href="http://r4ds.had.co.nz/">R for Data Science</a> is the bible for R</li>
  <li><a href="https://www.coursera.org/specializations/jhu-data-science">JHU data science specialization on Coursera</a> (free if you audit it)</li>
  <li><a href="https://statweb.stanford.edu/~tibs/ElemStatLearn/printings/ESLII_print10.pdf">Elements of Statistical Learning</a></li>
  <li><a href="https://chrisalbon.com/">Chris Albon’s website</a> has lots of helpful code snippets (particularly for pandas)</li>
  <li><a href="https://metacademy.org/browse">Metacademy</a> has road-maps to learn many concepts in statistics, math, cs, etc</li>
  <li><a href="http://www.deeplearningbook.org/">Deep Learning</a> is an excellent overview of deep learning (and useful perspective on ML in general)</li>
</ul>]]></content><author><name></name></author><category term="resources" /><summary type="html"><![CDATA[One of the most underrated parts of modern stats/machine learning is that many of the best resources are available online for free from textbooks to MOOCs to code snippets. Like many people in the area I’ve used these resources to teach myself a lot of what I know. Here is a google doc with some of my favorite resources:]]></summary></entry><entry><title type="html">UNC team wins $20,000 and a chance at a job from datathon</title><link href="http://idc9.github.io/datathon/2017/04/28/datathon.html" rel="alternate" type="text/html" title="UNC team wins $20,000 and a chance at a job from datathon" /><published>2017-04-28T00:00:00+00:00</published><updated>2017-04-28T00:00:00+00:00</updated><id>http://idc9.github.io/datathon/2017/04/28/datathon</id><content type="html" xml:base="http://idc9.github.io/datathon/2017/04/28/datathon.html"><![CDATA[]]></content><author><name></name></author><category term="datathon" /><summary type="html"><![CDATA[]]></summary></entry><entry><title type="html">Old posts</title><link href="http://idc9.github.io/old/posts/2017/01/01/old-posts.html" rel="alternate" type="text/html" title="Old posts" /><published>2017-01-01T00:00:00+00:00</published><updated>2017-01-01T00:00:00+00:00</updated><id>http://idc9.github.io/old/posts/2017/01/01/old-posts</id><content type="html" xml:base="http://idc9.github.io/old/posts/2017/01/01/old-posts.html"><![CDATA[]]></content><author><name></name></author><category term="old" /><category term="posts" /><summary type="html"><![CDATA[]]></summary></entry></feed>