<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="http://idc9.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="http://idc9.github.io/" rel="alternate" type="text/html" /><updated>2024-11-04T17:45:04+00:00</updated><id>http://idc9.github.io/feed.xml</id><title type="html">Iain Carmichael</title><subtitle>Iain&apos;s personal website.</subtitle><entry><title type="html">Word embedding tutorial in python</title><link href="http://idc9.github.io/nlp/2017/10/09/word-embedding-tutorial.html" rel="alternate" type="text/html" title="Word embedding tutorial in python" /><published>2017-10-09T00:00:00+00:00</published><updated>2017-10-09T00:00:00+00:00</updated><id>http://idc9.github.io/nlp/2017/10/09/word-embedding-tutorial</id><content type="html" xml:base="http://idc9.github.io/nlp/2017/10/09/word-embedding-tutorial.html"><![CDATA[<p>I recently gave a tutorial on getting started with word embeddings in Python to a digital humanities group. The tutorial covers material from chapters <a href="https://web.stanford.edu/~jurafsky/slp3/15.pdf">15 (vector semantics)</a> and <a href="https://web.stanford.edu/~jurafsky/slp3/16.pdf">16 (semantics with dense vectors)</a> of <a href="https://web.stanford.edu/~jurafsky/slp3/">Speech and Language Processing</a>. The data set is ~30,000 Supreme Court opinions provided by <a href="https://www.courtlistener.com/">CourtListener</a>. The repository comes with a small data set loaded and instructions for getting more data from CourtListener.</p>
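<p>To give a flavor of the vector semantics material, here is a toy illustration of cosine similarity, the standard way to compare word vectors. The three-dimensional vectors below are made up for illustration (real embeddings, e.g. from word2vec, typically have 100-300 dimensions), so take this as a sketch rather than output from the tutorial itself.</p>

```python
import math

# Made-up 3-dimensional "embeddings" for illustration only.
embeddings = {
    "court":  [0.9, 0.1, 0.2],
    "judge":  [0.8, 0.2, 0.3],
    "banana": [0.1, 0.9, 0.7],
}

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: u.v / (|u| |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Words used in similar contexts should get similar vectors,
# so the related pair scores higher than the unrelated pair.
sim_related = cosine_similarity(embeddings["court"], embeddings["judge"])
sim_unrelated = cosine_similarity(embeddings["court"], embeddings["banana"])
```

<p>With real embeddings you would load pre-trained vectors (or train your own on the opinions corpus) instead of hand-writing them, but the comparison step is exactly this computation.</p>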

<p>You can find the tutorial/instructions/additional resources at: <a href="https://github.com/idc9/word_embed_tutorial"><strong>https://github.com/idc9/word_embed_tutorial</strong></a></p>]]></content><author><name></name></author><category term="nlp" /><summary type="html"><![CDATA[I recently gave a tutorial on getting started with word embeddings in Python to a digital humanities group. The tutorial covers material from chapters 15 (vector semantics) and 16 (semantics with dense vectors) of Speech and Language Processing. The data set is ~30,000 Supreme Court opinions provided by CourtListener. The repository comes with a small data set loaded and instructions for getting more data from CourtListener.]]></summary></entry><entry><title type="html">Data Science and the Undergraduate Curriculum</title><link href="http://idc9.github.io/data_science/2017/08/12/data-science-undergrad-curriculum-talk.html" rel="alternate" type="text/html" title="Data Science and the Undergraduate Curriculum" /><published>2017-08-12T00:00:00+00:00</published><updated>2017-08-12T00:00:00+00:00</updated><id>http://idc9.github.io/data_science/2017/08/12/data-science-undergrad-curriculum-talk</id><content type="html" xml:base="http://idc9.github.io/data_science/2017/08/12/data-science-undergrad-curriculum-talk.html"><![CDATA[<p>I recently gave <a href="http://stat-or.unc.edu/event/stor-colloquium-iain-davis-unc-chapel-hill">a talk</a> to my department about my experiences and takeaways from developing/teaching a new course: <a href="https://idc9.github.io/stor390/">STOR 390: Introduction to Data Science Course</a>. The talk is about both the new course and more generally some thoughts about the undergraduate statistics curriculum.</p>

<p>You can find the slides here: <a href="https://docs.google.com/presentation/d/1XUaNIybiPD6OpTs-ou5baSUQYiUOJuafXQsvEwrChjc/edit"><strong>https://docs.google.com/presentation/d/1XUaNIybiPD6OpTs-ou5baSUQYiUOJuafXQsvEwrChjc/edit</strong></a>.</p>

<p>There will hopefully be a follow-up article/blog post some day, but I think the slides convey the main messages. Many of the points are based on existing literature, which is linked at the end along with other courses I found helpful in developing the class.</p>]]></content><author><name></name></author><category term="data_science" /><summary type="html"><![CDATA[I recently gave a talk to my department about my experiences and takeaways from developing/teaching a new course: STOR 390: Introduction to Data Science Course. The talk is about both the new course and more generally some thoughts about the undergraduate statistics curriculum.]]></summary></entry><entry><title type="html">Releasing software packages</title><link href="http://idc9.github.io/software/2017/07/15/ajive-lessons.html" rel="alternate" type="text/html" title="Releasing software packages" /><published>2017-07-15T00:00:00+00:00</published><updated>2017-07-15T00:00:00+00:00</updated><id>http://idc9.github.io/software/2017/07/15/ajive-lessons</id><content type="html" xml:base="http://idc9.github.io/software/2017/07/15/ajive-lessons.html"><![CDATA[<p><a href="/software/2017/07/15/ajive-package.html">I recently released</a> my first <a href="https://github.com/idc9/r_jive">R</a> and <a href="https://github.com/idc9/py_jive">Python</a> packages. This post contains some thoughts and advice about releasing software packages – particularly for other graduate students.</p>

<p>The question of “should you release a package?” is highly context dependent (e.g. if you are a probabilist the answer is probably no). There are a number of trade-offs to consider. For example, academia does not seem to value software very much. More importantly, there is a large time cost to developing software packages, time that could have been spent writing papers. This cost includes:</p>

<ul>
  <li>Coding the basic functionality</li>
  <li>Turning your code into a package someone else can download and use</li>
  <li>Documentation for the code</li>
  <li>Providing data analysis examples</li>
  <li>Maintaining and updating the package</li>
  <li>Responding to user feedback</li>
  <li>Surveying the existing literature to make sure your package provides new functionality</li>
</ul>

<p>I think academia is starting to value software more than it used to<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>. I would argue that, in many cases, releasing code is as important as writing a paper. Some of the benefits to you that come from releasing a software package include:</p>

<ul>
  <li>Save future you time. Better code now = less headache in the future.</li>
  <li>Fame/glory/prestige for people using your work.</li>
  <li>Help other people solve their problems. If part of your rationale for doing research/academia is helping to solve problems, then good code might be as impactful as (or more impactful than) a paper.</li>
  <li>Software skills are highly valued in industry.</li>
  <li>You might learn new things out of necessity (e.g. computational linear algebra) and/or better understand your own research.</li>
</ul>

<h1 id="resources">Resources</h1>

<p>Programming is typically a small part of the statistics curriculum (and most other scientific disciplines); we don’t think of ourselves as software engineers even though many of us spend a lot of time writing code. Luckily there are many quality, open-source resources that show you how to write better code and release software. Without these resources (particularly the  <a href="http://r-pkgs.had.co.nz/">R Packages</a> book) it would have taken me 1-2 orders of magnitude more time to build these packages<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup>.</p>

<p>These resources are helpful for <strong>creating R/Python packages</strong>:</p>

<ul>
  <li>Hadley Wickham’s <a href="http://r-pkgs.had.co.nz/"><strong>R Packages book</strong></a> and <a href="https://github.com/hadley/devtools">devtools</a> were incredibly helpful. If you plan on building an R package read this book.</li>
  <li><a href="http://python-packaging.readthedocs.io/en/latest/index.html"><strong>This tutorial on a minimal Python package</strong></a> and <a href="https://github.com/audreyr/cookiecutter">cookiecutter</a> give helpful templates and instructions to create a Python package.</li>
  <li>Tim Hopper’s <a href="https://www.youtube.com/watch?v=uRul8QdYvqQ">talk on releasing code</a> gives a good high level overview of how/why to release code.</li>
  <li>Hosting the package on <a href="https://github.com">GitHub</a> gives you a lot of functionality for free (e.g. users can submit feedback via GitHub issues).</li>
</ul>
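<p>To make the Python side of the resources above concrete, a minimal installable package needs little more than a directory with an <code>__init__.py</code> plus a <code>setup.py</code>. Everything below (the package name, version, dependency) is a placeholder sketch, not taken from any package mentioned in this post.</p>

```python
# setup.py -- a minimal, hypothetical example.
# Assumed directory layout:
#   mypackage/
#     setup.py
#     mypackage/
#       __init__.py
from setuptools import setup, find_packages

setup(
    name="mypackage",            # placeholder name
    version="0.1.0",
    packages=find_packages(),    # finds the inner mypackage/ directory
    install_requires=["numpy"],  # list runtime dependencies here
)
```

<p>With this in place, <code>pip install .</code> from the top-level directory installs the package locally; the tutorial and cookiecutter links above cover the fuller setup (tests, docs, licensing).</p>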

<p>These resources helped me become a <strong>better programmer</strong>:</p>

<ul>
  <li><a href="https://arxiv.org/pdf/1609.00037.pdf">Good Enough Practices in Scientific Computing</a></li>
  <li><a href="http://www.artima.com/weblogs/viewpost.jsp?thread=331531">Some principles of good programming</a></li>
  <li>Jeff Leek’s book on <a href="https://leanpub.com/modernscientist">How to be a Modern Scientist</a> and an uncountable number of <a href="https://simplystatistics.org/">simplystatistics</a> posts.</li>
  <li>Unit testing made the packages a lot less buggy (<a href="http://r-pkgs.had.co.nz/tests.html">testthat</a> for R and <a href="https://github.com/ehmatthes/pcc/releases/download/v1.0.0/beginners_python_cheat_sheet_pcc_testing.pdf">unittest</a> for Python).</li>
  <li>Reading/borrowing from existing, quality code bases. I found the following helpful:
    <ul>
      <li>R: <a href="https://github.com/tidyverse/ggplot2">ggplot2</a>, <a href="https://github.com/juliasilge/tidytext">tidytext</a>.</li>
      <li>Python: <a href="https://github.com/scikit-learn/scikit-learn">sklearn</a>, <a href="https://github.com/scikit-learn-contrib/lightning">lightning</a>.</li>
    </ul>
  </li>
</ul>
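<p>As a small illustration of the unit-testing point, here is the shape of a minimal <code>unittest</code> test case. The <code>center</code> function is a hypothetical helper invented for this example, not code from either package.</p>

```python
import unittest

def center(values):
    """Subtract the mean from each value (a hypothetical helper to test)."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]

class TestCenter(unittest.TestCase):
    def test_centered_mean_is_zero(self):
        # After centering, the values should average to zero.
        centered = center([1.0, 2.0, 3.0])
        self.assertAlmostEqual(sum(centered) / len(centered), 0.0)

    def test_already_centered_unchanged(self):
        # Centering data with mean zero should leave it unchanged.
        self.assertEqual(center([-1.0, 0.0, 1.0]), [-1.0, 0.0, 1.0])
```

<p>A handful of tests like this, run on every change, is what caught most of the bugs for me; testthat plays the same role in R.</p>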

<hr />

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p>For example, some <a href="http://jtleek.com/jobs/">statistics postdoc</a> positions require (or highly encourage) applicants to have released an open source package. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:2" role="doc-endnote">
      <p>The time cost to build a package is obviously very context dependent (e.g. your experience, the complexity of the algorithm, etc.). To give you one data point: these packages took me 1-2 weeks each and I have about 2 years of coding experience. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name></name></author><category term="software" /><summary type="html"><![CDATA[I recently released my first R and Python packages. This post contains some thoughts and advice about releasing software packages – particularly for other graduate students.]]></summary></entry><entry><title type="html">R and Python packages for AJIVE</title><link href="http://idc9.github.io/software/2017/07/15/ajive-package.html" rel="alternate" type="text/html" title="R and Python packages for AJIVE" /><published>2017-07-15T00:00:00+00:00</published><updated>2017-07-15T00:00:00+00:00</updated><id>http://idc9.github.io/software/2017/07/15/ajive-package</id><content type="html" xml:base="http://idc9.github.io/software/2017/07/15/ajive-package.html"><![CDATA[<p>I just released R and Python implementations of <a href="https://arxiv.org/abs/1704.02060">Angle based Joint and Individual Variation Explained</a> (AJIVE). I recently started working on AJIVE for my thesis and releasing an open source package is one of my goals for my PhD. For the code see:</p>

<ul>
  <li><a href="https://github.com/idc9/r_jive"><strong>ajive</strong></a> (R)</li>
  <li><a href="https://github.com/idc9/py_jive"><strong>jive</strong></a> (Python)</li>
</ul>

<p>Both packages are currently a little rough (need more examples, more testing, cleaner code, fewer typos, etc), but they will improve with time and as I/other people use them. If you use one of these packages I encourage you to <strong>send me critical feedback</strong>. Right now the biggest areas in need of improvement are:</p>

<ul>
  <li>More data analysis examples showing how AJIVE can be used.</li>
  <li>More testing to squash bugs I haven’t found.</li>
  <li>Better documentation – both of the code and explaining the AJIVE procedure.</li>
</ul>

<p>It feels wrong putting something out there that is not yet polished, but I figured it’s better to get something that works out there and improve it than to spend the rest of the summer perfecting it instead of writing my thesis (i.e. <em>don’t let the perfect be the enemy of the good</em>).</p>

<p>I learned a lot from building these packages. <a href="/software/2017/07/15/ajive-lessons.html">This next post</a> has some thoughts and advice about releasing software packages – particularly for other graduate students. I will (hopefully soon) put up a few posts discussing how AJIVE works and showing some data analysis examples.</p>]]></content><author><name></name></author><category term="software" /><summary type="html"><![CDATA[I just released R and Python implementations of Angle based Joint and Individual Variation Explained (AJIVE). I recently started working on AJIVE for my thesis and releasing an open source package is one of my goals for my PhD. For the code see:]]></summary></entry><entry><title type="html">Communication in Data Science</title><link href="http://idc9.github.io/communication/2017/06/27/effective-communication.html" rel="alternate" type="text/html" title="Communication in Data Science" /><published>2017-06-27T00:00:00+00:00</published><updated>2017-06-27T00:00:00+00:00</updated><id>http://idc9.github.io/communication/2017/06/27/effective-communication</id><content type="html" xml:base="http://idc9.github.io/communication/2017/06/27/effective-communication.html"><![CDATA[<p>I posted <a href="https://idc9.github.io/stor390/notes/communication/communication.html"><strong>the notes for a lecture on communication</strong></a> in data science that might be interesting/helpful. This lecture provides four general principles<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup> for communication:</p>

<ol>
  <li>adapt to your audience</li>
  <li>maximize the signal to noise ratio</li>
  <li>use effective redundancy</li>
  <li>consider the trade-offs</li>
</ol>

<p>and discusses how these principles apply to various examples in data science (visualization, code structure and literate programming).</p>

<p>Communication skills are important at all levels of technical pursuits, from <a href="http://r-pkgs.had.co.nz/vignettes.html">releasing a software package</a> to conducting <a href="http://distill.pub/2017/research-debt/">research</a>; however, they are underemphasized in STEM education. These notes are from an undergraduate <a href="https://idc9.github.io/stor390/">Introduction to Data Science</a> course I taught last semester and are my best attempt to incorporate communication into the curriculum. Any feedback that might improve this lecture (or help me become a better communicator) is welcome!</p>

<hr />
<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p>The first three of these are from <a href="http://www.treesmapsandtheorems.com/">Trees, Maps and Theorems</a>. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name></name></author><category term="communication" /><summary type="html"><![CDATA[I posted the notes for a lecture on communication in data science that might be interesting/helpful. This lecture provides four general principles for communication.]]></summary></entry><entry><title type="html">Some basic optimization algorithms in Python</title><link href="http://idc9.github.io/optimization/2017/05/17/basic-optimization.html" rel="alternate" type="text/html" title="Some basic optimization algorithms in Python" /><published>2017-05-17T00:00:00+00:00</published><updated>2017-05-17T00:00:00+00:00</updated><id>http://idc9.github.io/optimization/2017/05/17/basic-optimization</id><content type="html" xml:base="http://idc9.github.io/optimization/2017/05/17/basic-optimization.html"><![CDATA[<p>After taking a convex optimization class this past semester I implemented a few basic algorithms for unconstrained optimization (e.g. <a href="https://github.com/idc9/optimization_algos/blob/master/opt_algos/accelerated_gradient_descent.py">Nesterov’s accelerated gradient descent</a>) in Python in this repo: <a href="https://github.com/idc9/optimization_algos"><strong>https://github.com/idc9/optimization_algos</strong></a>.</p>

<p>The purpose of this repo is for me to learn and to have bare-bones implementations of these algorithms sitting around. I tried to make the code as modular and simple as possible so that you (or a future me) can modify it for other purposes (e.g. add bells and whistles, implement other algorithms, etc.). While off-the-shelf solvers such as <a href="http://scikit-learn.org/stable/">sklearn</a> or <a href="http://cvxopt.org/">cvxopt</a> are preferable for many applications, there are times when you want full control over the solver.</p>
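<p>To give a sense of how bare-bones these implementations can be, here is a minimal gradient descent sketch. The function name, step size, and test objective are illustrative choices for this post, not code taken from the repo.</p>

```python
# Plain gradient descent: x_{k+1} = x_k - step * grad(x_k).
def gradient_descent(grad, x0, step_size=0.1, n_iters=100):
    """Minimize a differentiable function given only its gradient."""
    x = x0
    for _ in range(n_iters):
        x = x - step_size * grad(x)
    return x

# Example: f(x) = (x - 3)^2 has gradient 2 * (x - 3) and minimizer x* = 3.
x_star = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
```

<p>The accelerated and stochastic variants in the repo keep this same shape; they just change how the iterate is updated (adding momentum, or replacing the full gradient with a noisy estimate).</p>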

<p>Right now the repo focuses on first order methods (GD, SGD, accelerated GD, etc) for <a href="http://www.cs.cornell.edu/courses/cs4780/2015fa/web/lecturenotes/lecturenote10.html">empirical risk minimization</a> problems. For some useful introductory references see:</p>

<ul>
  <li><a href="http://sebastianruder.com/optimizing-gradient-descent/index.html">An overview of gradient descent optimization algorithms</a> by Sebastian Ruder (good high level overview)</li>
  <li><a href="https://arxiv.org/abs/1606.04838">Optimization Methods for Large-Scale Machine Learning</a> by Léon Bottou, Frank E. Curtis, and Jorge Nocedal</li>
  <li><a href="https://web.stanford.edu/~boyd/cvxbook/bv_cvxbook.pdf">Convex Optimization</a> by Boyd and Vandenberghe (or see <a href="https://www.youtube.com/view_play_list?p=3940DD956CDF0622">video lectures</a>)</li>
</ul>

<p>A few more interesting references:</p>
<ul>
  <li><a href="https://arxiv.org/pdf/1405.4980.pdf">Convex Optimization: Algorithms and Complexity</a> by Sebastien Bubeck</li>
  <li><a href="https://blogs.princeton.edu/imabandit/2014/03/06/nesterovs-accelerated-gradient-descent-for-smooth-and-strongly-convex-optimization/">Nesterov’s Accelerated Gradient Descent for Smooth and Strongly Convex Optimization</a></li>
  <li><a href="http://distill.pub/2017/momentum/">Why Momentum Really Works</a></li>
</ul>]]></content><author><name></name></author><category term="optimization" /><summary type="html"><![CDATA[After taking a convex optimization class this past semester I implemented a few basic algorithms for unconstrained optimization (e.g. Nesterov’s accelerated gradient descent) in Python in this repo: https://github.com/idc9/optimization_algos.]]></summary></entry><entry><title type="html">My favorite resources</title><link href="http://idc9.github.io/resources/2017/05/17/favorite-resources.html" rel="alternate" type="text/html" title="My favorite resources" /><published>2017-05-17T00:00:00+00:00</published><updated>2017-05-17T00:00:00+00:00</updated><id>http://idc9.github.io/resources/2017/05/17/favorite-resources</id><content type="html" xml:base="http://idc9.github.io/resources/2017/05/17/favorite-resources.html"><![CDATA[<p>One of the most underrated parts of modern stats/machine learning is that many of the best resources are available online for free from textbooks to MOOCs to code snippets. Like many people in the area I’ve used these resources to teach myself a lot of what I know. Here is a google doc with some of my favorite resources:</p>

<ul>
  <li><a href="https://docs.google.com/document/d/18gBqIGNyOqzqygRjFAIA7jnieXRVMtBc2_yL7UD-3TM/edit?usp=sharing"><strong>Iain’s favorite stat/ML resources</strong></a></li>
</ul>

<p>Most of these are available for free online (ok you can actually find all of them if you look hard enough). Here are a few worth highlighting:</p>

<ul>
  <li><a href="http://r4ds.had.co.nz/">R for Data Science</a> is the bible for R</li>
  <li><a href="https://www.coursera.org/specializations/jhu-data-science">JHU data science specialization on Coursera</a> (free if you audit it)</li>
  <li><a href="https://statweb.stanford.edu/~tibs/ElemStatLearn/printings/ESLII_print10.pdf">Elements of Statistical Learning</a></li>
  <li><a href="https://chrisalbon.com/">Chris Albon’s website</a> has lots of helpful code snippets (particularly for pandas)</li>
  <li><a href="https://metacademy.org/browse">Metacademy</a> has road-maps to learn many concepts in statistics, math, cs, etc</li>
  <li><a href="http://www.deeplearningbook.org/">Deep Learning</a> is an excellent overview of deep learning (and useful perspective on ML in general)</li>
</ul>]]></content><author><name></name></author><category term="resources" /><summary type="html"><![CDATA[One of the most underrated parts of modern stats/machine learning is that many of the best resources are available online for free from textbooks to MOOCs to code snippets. Like many people in the area I’ve used these resources to teach myself a lot of what I know. Here is a google doc with some of my favorite resources:]]></summary></entry><entry><title type="html">UNC team wins $20,000 and a chance at a job from datathon</title><link href="http://idc9.github.io/datathon/2017/04/28/datathon.html" rel="alternate" type="text/html" title="UNC team wins $20,000 and a chance at a job from datathon" /><published>2017-04-28T00:00:00+00:00</published><updated>2017-04-28T00:00:00+00:00</updated><id>http://idc9.github.io/datathon/2017/04/28/datathon</id><content type="html" xml:base="http://idc9.github.io/datathon/2017/04/28/datathon.html"><![CDATA[]]></content><author><name></name></author><category term="datathon" /><summary type="html"><![CDATA[]]></summary></entry><entry><title type="html">Old posts</title><link href="http://idc9.github.io/old/posts/2017/01/01/old-posts.html" rel="alternate" type="text/html" title="Old posts" /><published>2017-01-01T00:00:00+00:00</published><updated>2017-01-01T00:00:00+00:00</updated><id>http://idc9.github.io/old/posts/2017/01/01/old-posts</id><content type="html" xml:base="http://idc9.github.io/old/posts/2017/01/01/old-posts.html"><![CDATA[]]></content><author><name></name></author><category term="old" /><category term="posts" /><summary type="html"><![CDATA[]]></summary></entry></feed>