<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title/><link>https://tdhopper.com/</link><description>Recent content on</description><generator>Hugo -- gohugo.io</generator><language>en-us</language><lastBuildDate>Mon, 02 Feb 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://tdhopper.com/index.xml" rel="self" type="application/rss+xml"/><item><title>Getting to Know Tim Hopper with Brenton Mallen</title><link>https://tdhopper.com/blog/getting-to-know-tim-hopper-with-brenton-mallen/</link><pubDate>Mon, 02 Feb 2026 00:00:00 +0000</pubDate><guid>https://tdhopper.com/blog/getting-to-know-tim-hopper-with-brenton-mallen/</guid><description>&lt;img src="https://tdhopper.com/images/robot-assistant.png" alt="Featured image of post Getting to Know Tim Hopper with Brenton Mallen" /&gt;&lt;h2 id="listen"&gt;Listen
&lt;/h2&gt;&lt;iframe src="https://open.spotify.com/embed/episode/3LY6E71ScXr2f889g0YamD" width="100%" height="232" frameBorder="0" allowtransparency="true" allow="encrypted-media"&gt;&lt;/iframe&gt;
&lt;h2 id="links"&gt;Links
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://www.linkedin.com/in/brentonmallen/" target="_blank" rel="noopener"
&gt;Brenton Mallen on LinkedIn&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://tdhopper.com" target="_blank" rel="noopener"
&gt;Tim Hopper&amp;rsquo;s website&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://shouldigetaphd.com" target="_blank" rel="noopener"
&gt;Should I Get a PhD?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://pydevtools.com" target="_blank" rel="noopener"
&gt;Python Developer Tooling Handbook&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://www.goodreads.com/book/show/54493401-project-hail-mary" target="_blank" rel="noopener"
&gt;&lt;em&gt;Project Hail Mary&lt;/em&gt; by Andy Weir&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://www.ea.com/games/it-takes-two" target="_blank" rel="noopener"
&gt;&lt;em&gt;It Takes Two&lt;/em&gt; (video game)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://www.gcc.edu/" target="_blank" rel="noopener"
&gt;Grove City College&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://www.rti.org/" target="_blank" rel="noopener"
&gt;RTI International&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://www.amazonrobotics.com/" target="_blank" rel="noopener"
&gt;Kiva Systems (now Amazon Robotics)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://www.or.ncsu.edu/" target="_blank" rel="noopener"
&gt;NC State Operations Research&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://en.wikipedia.org/wiki/Encarta" target="_blank" rel="noopener"
&gt;Encarta&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://www.goodreads.com/series/49073-the-chronicles-of-narnia" target="_blank" rel="noopener"
&gt;&lt;em&gt;The Chronicles of Narnia&lt;/em&gt; by C.S. Lewis&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://www.goodreads.com/series/66175-the-lord-of-the-rings" target="_blank" rel="noopener"
&gt;&lt;em&gt;The Lord of the Rings&lt;/em&gt; by J.R.R. Tolkien&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="subscribe"&gt;Subscribe
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://tdhopper.com/podcast/feed" target="_blank" rel="noopener"
&gt;RSS Feed&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://podcasts.apple.com/us/podcast/into-the-hopper/id1499693201" target="_blank" rel="noopener"
&gt;Apple Podcasts&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://open.spotify.com/show/63NrgKMVb0VTwkklGboIjy" target="_blank" rel="noopener"
&gt;Spotify&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://overcast.fm/itunes1499693201/into-the-hopper" target="_blank" rel="noopener"
&gt;Overcast&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="summary"&gt;Summary
&lt;/h2&gt;&lt;p&gt;In this special episode of &lt;em&gt;Into the Hopper&lt;/em&gt;, the tables are turned as my friend and former colleague Brenton Mallen interviews me. We worked together at two different companies over the years, and Brenton realized that despite knowing me for over a decade, he didn&amp;rsquo;t really know the full Tim Hopper story.&lt;/p&gt;
&lt;p&gt;We cover a lot of ground: growing up in southern West Virginia, my winding educational path through physics, computer science, and math, my time working at a children&amp;rsquo;s home in Tennessee, two PhD programs I started but didn&amp;rsquo;t finish, how I stumbled into data science through Twitter and grad school, meeting my wife Maggie, becoming a parent to four kids, hobbies like woodworking, photography, and knitting, and our recent move to Indiana to be closer to family. We also discuss the impact of AI on education and what it means for raising kids in this rapidly changing world.&lt;/p&gt;
&lt;h2 id="transcript"&gt;Transcript
&lt;/h2&gt;&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; So we don&amp;rsquo;t waste it—take one! Hi, welcome to Into the Hopper. I&amp;rsquo;m Brenton Mallen, and I&amp;rsquo;m here with my special guest, Tim Hopper. Thanks for having me.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; Thanks for having me.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; Well, you&amp;rsquo;re welcome. How&amp;rsquo;d I get here?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; Beats me. I don&amp;rsquo;t know how we got here. Actually, maybe we&amp;rsquo;ll talk about that. What are we doing today?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; For context, I had messaged you—I don&amp;rsquo;t even remember, it&amp;rsquo;s been a while now—asking when someone was going to interview you for your podcast. And I guess I volunteered by asking that question.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; Very good.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; As we started talking about doing this and I wrote down some stuff to talk about, I realized a couple of things. One, I&amp;rsquo;ve been doing a lot of thinking about friendship: what it means to be a friend and what it means to have a friend. We throw that term around quite a bit. I&amp;rsquo;ve been trying to figure out which of my relationships are real friendships, what that means, and whether I can make those connections a bit deeper. I&amp;rsquo;ve known you, I don&amp;rsquo;t know, 11 years now? Ten? Over a decade.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; Just over 10, because we started working together in October of 2015.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; Right. Was that PyData where we met?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; We were already working together for a few weeks, and then we went to PyData.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; I think I met you for the first time in person at PyData, yeah, definitely.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; In New York.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; PyData 2015.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; At Capital One or no—I don&amp;rsquo;t know—the World Trade Center.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; I realized I&amp;rsquo;ve quote-unquote &amp;ldquo;known&amp;rdquo; you for over a decade, and we talk decently regularly, but I don&amp;rsquo;t really know the Tim Hopper story. So I thought we could have a conversation: I get to know you a little better, and you get a podcast episode out of it. And if not, at least we&amp;rsquo;ll have had a conversation.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; Everyone will know the real Tim Hopper after this.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; We&amp;rsquo;ll see—could be good, could be bad. Who knows? But then again, your data&amp;rsquo;s out there already.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; Usually it&amp;rsquo;s selectively shared, but today whatever you ask, I&amp;rsquo;m obligated to answer.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; I will make note of that. I just wanted to give you an opportunity to tell your story, coming from rural West Virginia to the big Kardashian machine of data science you&amp;rsquo;ve become.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; Maybe we should explain first—we worked together at two different companies, actually, from 2015 to 2017 and then from 2020, during early COVID, until &amp;rsquo;21, right?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; I think so. Something like that. My memory is not what it used to be.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; It&amp;rsquo;s all right. I&amp;rsquo;m pretty sure I&amp;rsquo;m right about that.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; As you look at a resume, sure. But yeah, so let&amp;rsquo;s start there, I guess, at the very beginning. You grew up in West Virginia, right?&lt;/p&gt;
&lt;hr&gt;
&lt;h3 id="growing-up-in-west-virginia"&gt;Growing Up in West Virginia
&lt;/h3&gt;&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; I did, yeah. I was born and raised in southern West Virginia, about as far south as you can be. My dad is a family practice doctor, and he had moved there after medical school to work in family practice. I lived in the same house my whole childhood. My parents moved back to North Carolina—where they lived previously—in 2007 when I was a senior in college, but my whole childhood was West Virginia.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; Did you move from West Virginia to North Carolina when you were at school?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; To give the short story: I grew up in West Virginia, went to college in western Pennsylvania, then lived in Tennessee for a year, Virginia for a year, North Carolina for 15 years, and as of two weeks ago, I live in Indiana.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; So what was West Virginia like?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; It was great. My mom and I were just talking about this the other day. She said when we were younger, people would say, &amp;ldquo;Oh, you shouldn&amp;rsquo;t raise your kids there. You need them in a place where there&amp;rsquo;s opportunity and different things.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;m sure I had some discontentment with various things as a kid, but I really have no complaints about growing up in West Virginia. I tell people that it&amp;rsquo;s only been in the last 10 years or so that I&amp;rsquo;ve started to tell people I grew up in &lt;em&gt;rural&lt;/em&gt; West Virginia, because the town I grew up in was 6,000 people—I think it&amp;rsquo;s the 13th biggest town in West Virginia. We had Walmart. It wasn&amp;rsquo;t rural. We were right off the interstate—the town is Princeton, right off I-77—so we had interstate amenities. They have Starbucks there now.&lt;/p&gt;
&lt;p&gt;But West Virginia gets real rural real fast. In comparison, I didn&amp;rsquo;t grow up in the hollers, way off the beaten path. It&amp;rsquo;s stunningly beautiful. It just has been snowing here in Indiana, and I&amp;rsquo;ve realized I was very spoiled growing up in a valley where you could just sled wherever. My eight-year-old wanted to sled, so we had to drive somewhere to find a hill, which is a new experience for me.&lt;/p&gt;
&lt;p&gt;I think it was a very safe place to grow up. People were very nice and welcoming. I think I had sufficient opportunities. I didn&amp;rsquo;t have the opportunities that a lot of kids have in bigger areas, but I didn&amp;rsquo;t have real complaints. I was satisfied with my education. I&amp;rsquo;m very glad I grew up there.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; You were an only child?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; No, I&amp;rsquo;m the youngest of four. I have three older sisters.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; Did I know that?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; I don&amp;rsquo;t know.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; I feel like we&amp;rsquo;ve talked more about Maggie&amp;rsquo;s siblings.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; Maggie has four sisters. I have three. So we have a lot of aunties in our family, as I like to say.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; What was that like? For context, I&amp;rsquo;m the middle of three boys, which I imagine is just really different.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; Honestly, in going to college, I made a lot of really good male friends. I had good friends growing up, but in college I made friends that just became like brothers to me—and they&amp;rsquo;re still my friends to this day. Now I have two boys, a six-year-old and an eight-year-old, and seeing their bond—it&amp;rsquo;s just different from anything I had with my sisters. But I don&amp;rsquo;t think I had any complaints having sisters. Being the only boy and the youngest was probably kind of lonely at times, but that was fine. They used to paint my nails. My mom says my oldest sister used to like to play school down in the basement, and I would get upset that we didn&amp;rsquo;t have more recess during school, that kind of thing. But I still have a good relationship with my sisters to this day. I&amp;rsquo;m very thankful for my upbringing and for my whole family.&lt;/p&gt;
&lt;hr&gt;
&lt;h3 id="school-sports-and-early-interests"&gt;School, Sports, and Early Interests
&lt;/h3&gt;&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; What about going through school—grade school, friends, hobbies? What do kids do in West Virginia?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; For school, I had kind of a mixed bag. My mom homeschooled me from kindergarten to fourth grade. Then I went to a private Christian school for fifth and sixth grade. Then seventh to 12th, I went to the public school, middle school and high school. So I went to a variety of schools.&lt;/p&gt;
&lt;p&gt;As a kid, I loved to play soccer—not basketball, as some people might be surprised to hear.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; Are you longer in the legs or longer in the torso? I forget.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; I&amp;rsquo;m all torso and no coordination. But I played soccer from when I was pretty young—maybe like six or whenever you can start—until I was in middle school. I played basketball for fifth and sixth grade. I was just kind of a stereotypical boy as a younger kid, very restless and wanted to be outside playing things. We played outside a lot.&lt;/p&gt;
&lt;p&gt;I kind of lost interest in sports as time went on. I&amp;rsquo;m not that competitive.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; Sounds like you were more sporty back then, and now you&amp;rsquo;re less so.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; I was never that good. I&amp;rsquo;m just not that competitive or driven. As a kid, I probably was not that good at working for things. I was always very good at math and school as a whole came pretty easy to me—spelling never did, still doesn&amp;rsquo;t. I didn&amp;rsquo;t really work hard for sports.&lt;/p&gt;
&lt;p&gt;I started to play some instruments—piano with a private teacher and then some instruments in school—but never really liked to practice. Same problem.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;ve always been kind of into reading a lot. My mom says I used to love to read the World Book Encyclopedia when I was a kid.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; We had a set of those.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; Yeah. I played with my sisters. I was in Cub Scouts from right when you can do that, and then I was a Boy Scout all the way through high school, so that was a big part of my life.&lt;/p&gt;
&lt;p&gt;We did family things. My dad was a big golfer, so I picked up golf for a while with him until I was too tall to use my extra-length golf clubs. Then I was like, &amp;ldquo;All right, this isn&amp;rsquo;t worth it for me.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;In a lot of ways, pretty normal. In fifth grade, we got a Windows 95 computer. We had had a DOS computer before that, maybe Windows 3.1. But in fifth grade we got a Windows computer with &lt;strong&gt;Encarta&lt;/strong&gt; on it.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; Okay.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; Encarta 95. That just rocked my world. My mom talks about how, when I was in fifth grade, I did a report on Thomas Edison in school. As soon as we got Encarta, I found Thomas Edison, and you could play a clip from one of his early recordings—he does &amp;ldquo;Mary Had a Little Lamb&amp;rdquo; or something, and there&amp;rsquo;s still a recording of it. I played that in Encarta, and it blew her mind—both that it was a thing and that I found it so fast.&lt;/p&gt;
&lt;p&gt;I got more and more interested in computers and just tinkering with computers as time went on, which in a lot of ways led to what I do professionally. It was always kind of a side interest.&lt;/p&gt;
&lt;hr&gt;
&lt;h3 id="early-programming-and-lord-of-the-rings-websites"&gt;Early Programming and Lord of the Rings Websites
&lt;/h3&gt;&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; You said you ran a website—I think like a GeoCities thing or something.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; Yeah, in high school I started to make websites. My friend and I started making GeoCities and Angelfire pages—Angelfire was one of the GeoCities competitors—probably starting in late middle school. I ran a &lt;em&gt;Lord of the Rings&lt;/em&gt; website. Actually, you can still find my Angelfire website—I probably shouldn&amp;rsquo;t link to it—but I made a &lt;em&gt;Princess Bride&lt;/em&gt; fan site around 1999 or 2000. I was just copying content from other websites. I learned HTML by opening other websites and copying and pasting the HTML—in those days you didn&amp;rsquo;t have all this complex stuff.&lt;/p&gt;
&lt;p&gt;Then I started doing &lt;em&gt;Lord of the Rings&lt;/em&gt; websites. Through that in high school, I started to get curious about programming, but I literally had no way of learning programming really. I wasn&amp;rsquo;t into BBS and forums and things around programming back then. So I graduated from high school in 2004—this is the early 2000s—and I just wasn&amp;rsquo;t in tune with what was available through the internet.&lt;/p&gt;
&lt;p&gt;I was trying to find books to learn programming and bought a book on PHP. That was really where I first started programming, to make &lt;em&gt;Lord of the Rings&lt;/em&gt; websites. I did that through high school. It was like 2001 to 2003 when the &lt;em&gt;Lord of the Rings&lt;/em&gt; movies came out. I was in high school.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; It was the perfect time.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; I had already read the books and was a big fan. The main site was TheOneRing.net or something. I won a big competition on TheOneRing.net, and the prize was a &lt;strong&gt;Weta&lt;/strong&gt; gift worth hundreds of dollars. I was active on the &lt;em&gt;Lord of the Rings&lt;/em&gt; forums and stuff. That kind of thing got me into programming beyond HTML and TI-BASIC, the really basic programming you could do on a TI calculator. That&amp;rsquo;s where I really cut my teeth.&lt;/p&gt;
&lt;hr&gt;
&lt;h3 id="reading-and-sci-fi"&gt;Reading and Sci-Fi
&lt;/h3&gt;&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; When did you first read &lt;em&gt;Lord of the Rings&lt;/em&gt;? And what do you like to read—more fantasy?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; My cousin gave me his copies of &lt;em&gt;Lord of the Rings&lt;/em&gt;. I have two male cousins who are like six years older than me that I idolized—shout out Justin and Joseph if you&amp;rsquo;re listening. Joseph gave me copies of &lt;em&gt;Lord of the Rings&lt;/em&gt;. I was pretty young, maybe like nine or ten, and they sat on my shelf for a long time.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; Did you start with &lt;em&gt;The Hobbit&lt;/em&gt;?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; That&amp;rsquo;s a good question. I don&amp;rsquo;t know the answer. I finally got to reading &lt;em&gt;Lord of the Rings&lt;/em&gt; around late middle school, like 2000-ish. I&amp;rsquo;ve never actually been a huge fantasy reader and I&amp;rsquo;m still not that much.&lt;/p&gt;
&lt;p&gt;My mom read us &lt;em&gt;Narnia&lt;/em&gt; as kids, so those were in my repertoire. These days I read some fantasy stuff to my kids. We&amp;rsquo;ve been reading &lt;em&gt;The Chronicles of Prydain&lt;/em&gt;—Lloyd Alexander—they&amp;rsquo;re really old, from like 50 years ago. We&amp;rsquo;ve read the &lt;em&gt;Narnia&lt;/em&gt; series. Eventually we&amp;rsquo;ll read &lt;em&gt;Lord of the Rings&lt;/em&gt;. We&amp;rsquo;ve read &lt;em&gt;The Hobbit&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;ve not been a big fantasy person. I do enjoy sci-fi, usually either hard sci-fi or kind of fluff sci-fi that&amp;rsquo;s really easy. I like the hard sci-fi for the science part—like Andy Weir stuff.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; That kind of stuff.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; Oh yeah, I love Andy Weir. &lt;em&gt;Project Hail Mary&lt;/em&gt; is one of my favorites.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; I&amp;rsquo;ve read that like seven or eight times now.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; I should read it again.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; The audiobook is really good, especially the way the alien talks.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; Yeah, I think I listened to it. These days I read a lot of history, some theology, biography, and I enjoy novels—I just don&amp;rsquo;t make a lot of time for them. I do listen to, like, Jack Reacher and just kind of dumb mystery thriller type stuff when I need a mental break. But I don&amp;rsquo;t have a lot of time for reading these days.&lt;/p&gt;
&lt;hr&gt;
&lt;h3 id="college-from-computer-engineering-to-math"&gt;College: From Computer Engineering to Math
&lt;/h3&gt;&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; So &lt;em&gt;Lord of the Rings&lt;/em&gt; websites—that&amp;rsquo;s early 2000s, you&amp;rsquo;re graduating high school in 2004?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; 2004, yeah.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; I was 2005.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; I graduated from high school the day Ronald Reagan died.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; Was he giving the commencement speech?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; He wasn&amp;rsquo;t.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; So then school after that—did you know where you wanted to go? Did you know what you wanted to do?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; I thought in high school that I wanted to study computer engineering. I don&amp;rsquo;t think I really knew what computer engineering was, and I didn&amp;rsquo;t really know how to learn what it was. I just thought computers were really cool.&lt;/p&gt;
&lt;p&gt;I don&amp;rsquo;t know that I even knew—I mean, I had heard of computer science, but I don&amp;rsquo;t know that I knew it was necessarily an alternative. We had a small public university nearby where I knew some people studied computing, like two-year degrees for network admin type roles. That didn&amp;rsquo;t seem that interesting to me. I didn&amp;rsquo;t love programming, even though it was something I wanted to learn and do.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; Did your parents try to influence you into a certain direction?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; That&amp;rsquo;s a good question.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; Like, no one asked you to be a doctor?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; No. My dad&amp;rsquo;s a doctor, my mom&amp;rsquo;s a nurse, and they never pressured us to pursue medical things. I think they largely tried not to pressure us, for better or worse. They really wanted us to go to college—that was their big goal for us. But I don&amp;rsquo;t think they necessarily pressured me a lot.&lt;/p&gt;
&lt;p&gt;I just didn&amp;rsquo;t know people around me who were programmers or computer scientists or computer engineers. I just knew that I thought computers were neat and seemed to be kind of the future. So I had computer engineering in mind.&lt;/p&gt;
&lt;p&gt;I was looking at Virginia Tech, which was fairly near where I grew up and obviously a very reputable school. But I&amp;rsquo;m a Christian and grew up in church, and my parents encouraged us to think about looking at Christian colleges. The college I went to was called &lt;strong&gt;Grove City College&lt;/strong&gt; in western Pennsylvania. It&amp;rsquo;s a Christian college that happened to have computer engineering, which is pretty rare for smaller schools.&lt;/p&gt;
&lt;p&gt;I actually ended up visiting there, and without that much perspective, I was like, &amp;ldquo;Oh, this is great. I&amp;rsquo;ll go here.&amp;rdquo; I went to visit Virginia Tech and Grove City, then applied to Grove City for early admission—which is basically like if you get in early, you&amp;rsquo;re supposed to go. So I applied early admission and then didn&amp;rsquo;t apply to any other colleges.&lt;/p&gt;
&lt;p&gt;It was a fairly limited information decision. I had met somebody while I was in high school who had gone there like 20 years previously, and he really recommended it to me. I trusted him, so I decided to go as a computer engineering student.&lt;/p&gt;
&lt;p&gt;But before I matriculated, I changed my mind to pursue physics instead, because I really loved physics in high school. I had a great physics teacher named Mr. Smith who was inspirational to me—I was his teaching assistant. I really loved the math side of physics.&lt;/p&gt;
&lt;p&gt;In hindsight, I enjoyed that because high school math, besides geometry, is so routine—you&amp;rsquo;re just learning the mechanics. In physics, you&amp;rsquo;re actually getting to solve problems.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; It&amp;rsquo;s like applied math to a degree. It frames a problem you&amp;rsquo;re trying to solve or some observation.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; Exactly. I never really enjoyed labs that much—partially my laziness—but I enjoyed the math and problem-solving side. So before I started, I decided to do physics. I was like, &amp;ldquo;I&amp;rsquo;ll become a physics professor, that makes sense.&amp;rdquo; I don&amp;rsquo;t think I knew any physics professors. My dad had one friend from college who had gotten a PhD in physics and then been stuck in really mediocre jobs his whole life. He was always just kind of like, &amp;ldquo;Well, be careful about that.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; Which is the inspiration for &lt;em&gt;Should I Get a PhD?&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; Yeah, I mean, it was good advice. To jump ahead, later I jumped out of two PhD programs and am fairly cynical now about whether PhD programs are a good route for people.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; I eventually ended up in college as a math major, but I started out studying physics—first-year physics and engineering are all pretty similar; you&amp;rsquo;re taking the same basic classes.&lt;/p&gt;
&lt;p&gt;To jump back a little: in high school, I got ahead in math through basically just wanting to go faster. My parents were able to get me to take classes at a faster pace. By the end of junior year, I&amp;rsquo;d finished calculus. My senior year, I took some college math classes at a nearby college.&lt;/p&gt;
&lt;p&gt;So I was actually ahead—I&amp;rsquo;d essentially finished calculus before I got to college. As a result, in college I started to explore some higher math. I was kind of enjoying physics, but I really didn&amp;rsquo;t enjoy the labs and hands-on side of it. More and more, I was like, &amp;ldquo;I just really love math classes.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;But I was also enjoying programming—I&amp;rsquo;d taken a computer programming class my freshman year, a C++ class. Even though C++ was very different from what I&amp;rsquo;d done with PHP in high school, that experience set me up to understand what I was getting into. So that was fun and really scratched the problem-solving itch.&lt;/p&gt;
&lt;p&gt;My sophomore year, I decided to change to computer science.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; So you went in as a math major?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; I applied as computer engineering, went in as physics, and then a year later changed to computer science.&lt;/p&gt;
&lt;p&gt;I continued to take some physics classes on the side and finished a physics minor. But sophomore year in computer science, you take &lt;strong&gt;theory of computing&lt;/strong&gt;, which is basically a math class—you&amp;rsquo;re proving what&amp;rsquo;s computable, doing Turing machines, NP problems, that kind of stuff.&lt;/p&gt;
&lt;p&gt;I really loved that class—I thought it was fascinating. The professor I had for my first programming class and also that theory class was a former mathematician, which is probably relevant. But then I realized all the computer science majors hated that class. They just wanted to be programming. I wasn&amp;rsquo;t enthusiastic about programming for its own sake, necessarily, but I was interested in the problem-solving.&lt;/p&gt;
&lt;p&gt;So as I was taking it and realizing it was a math class, I was like, &amp;ldquo;I should just really be a math major.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; Did you change again?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; Yeah. Freshman year I was a physics major, sophomore year I was a computer science major, and then junior year I changed to math. The math major was a liberal arts degree, so I had to take a foreign language. I ended up starting freshman French my junior year, which meant taking four semesters of it alongside freshmen.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; How&amp;rsquo;s your French now?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; It&amp;rsquo;s horrible, and I still have nightmares about French class. I literally wake up in the night sweating that I&amp;rsquo;m missing a French test or something.&lt;/p&gt;
&lt;hr&gt;
&lt;h3 id="liberal-arts-education-and-history"&gt;Liberal Arts Education and History
&lt;/h3&gt;&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; Did your school have gen ed courses? Did you take something that you might not have thought of before that maybe you gained an appreciation for?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; Yeah. It&amp;rsquo;s a liberal arts school, and we had to do kind of a core liberal arts curriculum—really good stuff like history, art, and some basic philosophy. Being a Christian school, there was a Bible component also.&lt;/p&gt;
&lt;p&gt;It got me interested in philosophy for sure, and it helped develop my interest in history, which is one of my biggest side interests now. Before college, I had zero interest in history.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; I feel like it&amp;rsquo;s something you gain an appreciation for over time.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; Yeah, and I can&amp;rsquo;t necessarily point to one thing particularly. It broadened my world in a lot of ways and got me to think about a lot of things.&lt;/p&gt;
&lt;p&gt;I ended up liking one of the history professors a lot. He also taught philosophy of education, so I took his philosophy of education track that&amp;rsquo;s normally just education majors. It helped me think a lot about the purpose of education.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;His big thing was that the purpose of education is really &amp;ldquo;what&amp;rsquo;s the purpose of people?&amp;rdquo;—and our education should help shape people towards that.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; Do you have anything you walked away from that you carry with you today?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; Yeah, I think that point I just made. As you think about education—I have four children now—what you&amp;rsquo;re trying to get out of it&amp;hellip; is education just trying to make people better workers in a capitalist society? I think there is value in being a productive contributor to society, but it&amp;rsquo;s much bigger than just &amp;ldquo;you go and work.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;It really helped me think of education—and thus parenting—as shaping humans. What&amp;rsquo;s the full picture of what we want our humans to be? For my kids&amp;rsquo; school, it&amp;rsquo;s not just about &amp;ldquo;is what they&amp;rsquo;re doing valuable?&amp;rdquo; The question is not necessarily &amp;ldquo;when are they ever going to use this?&amp;rdquo; but &amp;ldquo;are their minds being opened so they become people who understand the world and are interested in the world and in people?&amp;rdquo; The hope is a more full-orbed perspective on education.&lt;/p&gt;
&lt;p&gt;And recognizing that education doesn&amp;rsquo;t just mean schooling. I don&amp;rsquo;t know if it&amp;rsquo;s a real quote, but there&amp;rsquo;s an often-quoted Mark Twain line: &amp;ldquo;I don&amp;rsquo;t let my schooling interfere with my education.&amp;rdquo; The conversations you have with your kid for 20 minutes before bedtime can be just as much of their education as their math lessons.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; In D&amp;amp;D, it&amp;rsquo;s like the difference between wisdom and intelligence.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; Yeah, I think that&amp;rsquo;s a real thing.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; We&amp;rsquo;ll come back to the kids and parenting—I&amp;rsquo;ve got some questions there I&amp;rsquo;d like to get more of your perspective on.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; Also, after college I started a program in the history of math as a PhD program. I only stayed in for a year, but that was definitely a result of my liberal arts education—being interested not just in things as &amp;ldquo;what labor produces value that makes money&amp;rdquo; but just ideas.&lt;/p&gt;
&lt;p&gt;Some of my professors helped me be interested in history of math, but it was also my own personal development. Being curious and interested in things came through my college experience and made me interested in that as a topic.&lt;/p&gt;
&lt;p&gt;I looked at programs like philosophy of science—a lot of departments are history and philosophy of science together. I actually took a class in history of science or philosophy of science in college as well. It just got me interested in exploring ideas for their own sake and not necessarily because they have a specific use for anyone.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; Do you have a particular period of history that you&amp;rsquo;re most interested in or find yourself studying more?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; People can fast forward like 10 minutes if they want to get back to the interesting part of the podcast.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; I&amp;rsquo;m here to have a conversation with you, so&amp;hellip;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; I&amp;rsquo;m really interested in the history of science and history of math. It&amp;rsquo;s been less of something I&amp;rsquo;ve pursued since my early twenties when I started this PhD program. I ended up leaving after a year for a number of reasons, but partially because I was like, &amp;ldquo;Okay, it&amp;rsquo;s good to be interested in ideas, but I also want to eat for the rest of my life, so I should be thinking about what&amp;rsquo;s going to give me a job.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;In that program, I was interested in applied math in the 20th century. So much happened in that era. A character I was interested in is &lt;strong&gt;John von Neumann&lt;/strong&gt;, who was such a massive figure—influential in the Manhattan Project, but he also shaped so much of modern mathematics. Twentieth-century math is really interesting.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; I think you&amp;rsquo;re a fan of Claude Shannon too?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; Yeah, for sure. I think that kind of stuff is just so remarkable—like Bell Labs and so many players in the Manhattan Project, for better or worse. It&amp;rsquo;s really fascinating. And Feynman—I used to read Feynman&amp;rsquo;s books back in the day. That era really fascinated me.&lt;/p&gt;
&lt;p&gt;But also earlier stuff—all the greats: Gauss, Euler, Fermat, Fourier. All of that 18th, 19th century stuff is really intriguing too.&lt;/p&gt;
&lt;p&gt;These days, my interest is much more in &lt;strong&gt;American religious history&lt;/strong&gt;. I&amp;rsquo;m interested in my own religious tradition, which is Presbyterianism and Presbyterian history in particular. But I&amp;rsquo;m also just interested in the socio-historical questions: how Christianity—primarily, not exclusively—has developed in America, and what the American experiment has enabled in terms of getting so many variants of Christianity. What kind of cultural, social, and theological perspectives caused that to develop?&lt;/p&gt;
&lt;p&gt;My weird nerd habit is I love to drive around—I&amp;rsquo;m going to have to get some new skills here in Indiana—but in North Carolina, I&amp;rsquo;d drive around, see country churches, and think about what immigration and different religious movements led to that particular type of church being in a specific area.&lt;/p&gt;
&lt;p&gt;That&amp;rsquo;s kind of my guilty pleasure hobby: just thinking about those ideas.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; Are you exploring the history of how things are founded and built and grow and change, or different theological perspectives of why there are different denominations?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; It&amp;rsquo;s very intertwined. Both dimensions influence each other. You definitely have the theological thing, but the theological influences aren&amp;rsquo;t necessarily separated from the cultural or immigration patterns.&lt;/p&gt;
&lt;p&gt;If you just look at the US broadly, the places Germans immigrated, you have Lutheran churches. The places Scots-Irish immigrated, you have Presbyterian churches. The places the English immigrated, you have Episcopal churches. And then all of that also morphed into the plethora of Methodist and Baptist churches.&lt;/p&gt;
&lt;p&gt;The American religious landscape in Christianity and its variants—not to mention all other religions—is so wildly diverse in a way that would have been unimaginable 300 years ago. I just think it&amp;rsquo;s really fascinating to think about why and how that happened.&lt;/p&gt;
&lt;p&gt;That&amp;rsquo;s what I lay in bed and think about at night to take my mind off other things. I actually just became the official webmaster of the North Carolina Presbyterian Historical Society.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; Just in time to leave.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; Yeah, well, I was like, &amp;ldquo;I&amp;rsquo;m moving. I can still do it if you want.&amp;rdquo; I&amp;rsquo;ve been spending the last few weeks, aside from moving, trying to get several elderly people to help me find the password to change the DNS to fix the website. It&amp;rsquo;s been my task of late—nobody knows the password to log in.&lt;/p&gt;
&lt;p&gt;My family has been Presbyterian for 500 years. So there&amp;rsquo;s also a personal family history aspect of it for me, which is really how it started—as an adult, learning more about grandparents and great-grandparents I didn&amp;rsquo;t know. Piecing together their stories with American history is fascinating to me.&lt;/p&gt;
&lt;p&gt;Anyway, that&amp;rsquo;s a topic of interest to a limited audience, but there you go.&lt;/p&gt;
&lt;hr&gt;
&lt;h3 id="graduating-and-the-math-prize"&gt;Graduating and the Math Prize
&lt;/h3&gt;&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; All right, well then we&amp;rsquo;ll come back to the mainstream. So you finally decided to switch to being a math major. Did you graduate as a math major?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; I did, yeah. I graduated math with highest honors, which no one has ever cared about since. I worked too hard to be a summa cum laude math major.&lt;/p&gt;
&lt;p&gt;Actually, there was a cash prize for the math major with the highest average grade, and I shared it with another guy. If I had studied for my real analysis final and not gotten a B, I would have gotten the full prize.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; Well, like most things, it&amp;rsquo;s best shared, right?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; Probably.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; Being a math major, do you have opinions on brands of chalk, or chalk versus whiteboard?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; That&amp;rsquo;s an interesting question because I&amp;rsquo;ve always hated handwriting—from when I was a child. Being homeschooled, my mom had a hard time with me hating handwriting. She used to ask, &amp;ldquo;What are you going to do when you need to write a note when you&amp;rsquo;re older?&amp;rdquo; And she says when I was pretty young, I told her I would just type it on the computer. Now she just laughs about that because that&amp;rsquo;s all we ever do.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; That&amp;rsquo;s what we call a thought leader.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; Again, it&amp;rsquo;s really laziness. I still hate handwriting. I just had to sign documents to move, and they&amp;rsquo;re like, &amp;ldquo;Oh, well, it&amp;rsquo;d be good if your signature kind of spelled your name.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;I do like, in theory, writing on a board—I know mathematicians are big about chalk—and I love teaching math. But I&amp;rsquo;ve never been that great at it, and I find it very tiring. I&amp;rsquo;ve never been into ordering chalk from Japan or something.&lt;/p&gt;
&lt;p&gt;When I did teach math as a grad student, I was very poor at it. An undergrad experience that was meaningful for me: the summer after my junior year, I did an &lt;strong&gt;REU program&lt;/strong&gt;—Research Experiences for Undergraduates, an NSF-funded research program.&lt;/p&gt;
&lt;p&gt;That was meaningful partially because I got there wanting to do a math program, and my advisor&amp;rsquo;s first question when I got there was, &amp;ldquo;How good are you at programming?&amp;rdquo;&lt;/p&gt;
&lt;p&gt;All we did all summer was write C programs for computational graph theory—generate all these graphs and try to find the properties you want in the graphs. I went there to do math and did programming all summer, which was an impactful experience for me.&lt;/p&gt;
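&lt;p&gt;That &amp;ldquo;generate all the graphs and test for a property&amp;rdquo; workflow can be sketched in a few lines. This is an illustrative toy in Python (not the C code from the REU): it enumerates every simple graph on a handful of labeled vertices and counts the connected ones.&lt;/p&gt;

```python
import itertools

# Toy version of the computational graph theory work: enumerate every
# simple graph on n labeled vertices and count the connected ones.
# (Illustrative only, in Python rather than the C the REU group used.)

def is_connected(n, edges):
    # Depth-first search from vertex 0 over an adjacency map.
    adj = {v: set() for v in range(n)}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    seen = {0}
    frontier = [0]
    while frontier:
        v = frontier.pop()
        for w in adj[v]:
            if w not in seen:
                seen.add(w)
                frontier.append(w)
    return len(seen) == n

def count_connected(n):
    all_edges = list(itertools.combinations(range(n), 2))
    total = 0
    for k in range(len(all_edges) + 1):
        for edges in itertools.combinations(all_edges, k):
            if is_connected(n, edges):
                total += 1
    return total

print(count_connected(4))  # 38 connected simple graphs on 4 labeled vertices
```

&lt;p&gt;Brute-force enumeration like this blows up fast—already 2&lt;sup&gt;15&lt;/sup&gt; graphs at six vertices—which is why searches like this get written in C and pruned aggressively.&lt;/p&gt;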
&lt;p&gt;Then I came back to college my senior year, and they wanted me to do a research presentation. My roommate and one of my friends who&amp;rsquo;s like a brother to me—Todd—actually came and coached me on writing on a chalkboard because I wanted to do a good job. He stood behind me and critiqued me as I tried to give my presentation so that I would be able to do a better job.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; Just for the way it looked, handwriting-wise and composition?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; Yeah, and just making sure it&amp;rsquo;s not sloping down to the side and that kind of thing. My handwriting&amp;rsquo;s not good, my spelling&amp;rsquo;s not good. It&amp;rsquo;s really not a good thing for me to be doing.&lt;/p&gt;
&lt;hr&gt;
&lt;h3 id="grad-school-history-of-math-at-uva"&gt;Grad School: History of Math at UVA
&lt;/h3&gt;&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; So math major, then grad school. Was that something you wanted to do immediately, or were you convinced to do?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; Ever since I thought &amp;ldquo;I&amp;rsquo;ll study physics in college and be a physics professor,&amp;rdquo; I was like, &amp;ldquo;Well, I&amp;rsquo;ll just go to grad school. I like school, I&amp;rsquo;m good at school.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; Was this still the mindset of going to be a professor?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; Yeah, I think that&amp;rsquo;s really what was in my mind. For undergrads who really enjoy the experience—which I did—it&amp;rsquo;s very tempting to think your professors have the dream. They&amp;rsquo;re the pinnacle. So I was going to go to grad school for math.&lt;/p&gt;
&lt;p&gt;I considered a variety of things: math, logic in the philosophy department. I visited the University of Pittsburgh&amp;rsquo;s History and Philosophy of Science department. But then I applied and surprisingly got accepted to the &lt;strong&gt;University of Virginia&amp;rsquo;s History of Math program&lt;/strong&gt;, which seems very esoteric. It&amp;rsquo;s only one professor and she&amp;rsquo;s very selective. I probably shamed her by dropping out after a year.&lt;/p&gt;
&lt;p&gt;It&amp;rsquo;s part of the math department. Your PhD is basically a math PhD, but you write a dissertation on a history topic. You take the math courses, the math qualifying exams, and then do history stuff on top of it, plus two foreign languages.&lt;/p&gt;
&lt;p&gt;I applied to that and got accepted. But during college, every spring break for four years, I had gone on mission trips—students went on different trips around the world. I went with a group to a &lt;strong&gt;children&amp;rsquo;s home in Tennessee&lt;/strong&gt;, a Christian children&amp;rsquo;s home. We just went and did manual labor projects to help them—building trails and steps, cleaning out barns, that kind of stuff.&lt;/p&gt;
&lt;p&gt;I had a really good experience doing that. It&amp;rsquo;s a small children&amp;rsquo;s home for children in crisis situations, not of their own making—so not kids who have been particularly bad kids, but they&amp;rsquo;re from bad situations. They take students either on summer breaks during college or often right after college to come work for one to three years. You live in the home with the kids and help teach them, cook for them, and just be there with them.&lt;/p&gt;
&lt;p&gt;I asked the University of Virginia if I could defer my admission for a year, which is pretty common in graduate programs. They said I could. So I went to work at this children&amp;rsquo;s home for 12 months, from 2008 to 2009. I lived in a house with nine middle-school-age boys and did all that—they had a kind of hybrid homeschool model. The kids come in from a lot of different situations, and their schooling isn&amp;rsquo;t necessarily at grade level. So you&amp;rsquo;re just teaching them at whatever their ability is—math, history, English, reading.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; That sounds really humbling.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; It was, yeah. My uncle said a very helpful thing to me: &amp;ldquo;Tim, when you go to grad school, you&amp;rsquo;ll learn how to be a mathematician. But if you do this, I think you&amp;rsquo;ll learn how to be a man.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;There&amp;rsquo;s a lot to that. Even now being a parent, it was a really helpful experience—and very hard, very tiring. But I&amp;rsquo;m really glad I did that. It was in the Smoky Mountains in Tennessee, like a half mile from the Smoky Mountain National Park.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; Do you keep in touch with any of them?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; Some, off and on through the years. It&amp;rsquo;s been a while. A few of them got dismissed, not necessarily in good situations—although actually one of those who got kicked out was messaging me on Facebook maybe 18 months ago. We had some exchanges. I actually just found out recently one of them died, which was really sad to hear.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;ve connected with a few of them on Facebook. They&amp;rsquo;re all like in their probably 30s now, which is kind of crazy. You freeze them in your mind as middle schoolers, but they&amp;rsquo;re grown-ups with kids and things.&lt;/p&gt;
&lt;p&gt;So I did that for a year. At the same time, I was practicing my French because for this history of math program, you have to have reading competency in two foreign languages.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; I was thinking like Latin or something.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; Latin could be one of them. I wasn&amp;rsquo;t interested in stuff that old necessarily. But I was practicing French reading—reading is a lot easier than speaking or hearing French. I can&amp;rsquo;t understand any spoken French. I know you&amp;rsquo;re speaking French, and that&amp;rsquo;s all I can tell.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; My grad professor was from France.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; Yeah. So I did that, and then I moved straight to Charlottesville after a year and started this history of math program. I did pretty well.&lt;/p&gt;
&lt;p&gt;My education was really good, but it was tough going up against students who had gone to bigger universities. You find out people had taken graduate classes as undergrads, or sat in on graduate classes—something I didn&amp;rsquo;t have the opportunity to do at all. They just had more exposure to things.&lt;/p&gt;
&lt;p&gt;But my grad school experience really convinced me my undergrad gave me a really solid foundation and taught me how to think—especially one professor, Dr. Thompson, who intentionally made his undergrad classes similar to graduate classes. He would give take-home tests with really open-ended questions where we were learning new material on the take-home test. That was really good preparation for grad school.&lt;/p&gt;
&lt;p&gt;But it was still very hard. While I was there, I was kind of having something of an existential crisis around &amp;ldquo;what am I doing with my life?&amp;rdquo; and realizing I had just been going to grad school thinking &amp;ldquo;this is what I&amp;rsquo;m going to do.&amp;rdquo; I hadn&amp;rsquo;t really thought through the implications of it.&lt;/p&gt;
&lt;p&gt;I started to realize the history of math isn&amp;rsquo;t actually that marketable a degree to have, and you spend half your 20s getting it. I got to meet former students of my advisor—they were great—but I was like, &amp;ldquo;Okay, so I&amp;rsquo;m gonna do this and then maybe, hopefully, get to teach at some podunk university? Maybe that&amp;rsquo;s not the best use of my time.&amp;rdquo;&lt;/p&gt;
&lt;hr&gt;
&lt;h3 id="discovering-operations-research"&gt;Discovering Operations Research
&lt;/h3&gt;&lt;p&gt;Somewhere along—late in college—I discovered &lt;strong&gt;operations research&lt;/strong&gt; as a discipline: using math and optimization models to solve business and industrial problems. I discovered that through Cornell&amp;rsquo;s program. While I was looking at grad schools, I found it and was like, &amp;ldquo;Oh, this sounds really interesting.&amp;rdquo; It matched a lot of how I think about things—optimizing things, efficiency, math, and computation all in one.&lt;/p&gt;
&lt;p&gt;I basically decided to leave halfway through my first year at UVA. I applied to an OR program at UVA, one at North Carolina State University, and a master&amp;rsquo;s program at University of Tennessee—mostly just because I loved that area.&lt;/p&gt;
&lt;p&gt;For a variety of reasons, I decided to go to North Carolina State. My parents had moved from West Virginia to Greensboro, North Carolina—an hour and a half away—when I was in college. Two of my sisters actually lived in Durham, which is one city over from Raleigh where NC State is. For various other reasons—I knew some other people down there—I decided to go to NC State for a PhD program in operations research.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; Did you go straight into a PhD from undergrad, or did you do a master&amp;rsquo;s?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; I didn&amp;rsquo;t do a master&amp;rsquo;s. In the US, a PhD program has you take the master&amp;rsquo;s coursework for the first two years, but I didn&amp;rsquo;t do an independent master&amp;rsquo;s ahead of that.&lt;/p&gt;
&lt;p&gt;So I graduated from college in &amp;lsquo;08, went straight to work at the children&amp;rsquo;s home, one year later started at UVA as a PhD student. But for the first two years, you&amp;rsquo;re doing the coursework.&lt;/p&gt;
&lt;hr&gt;
&lt;h3 id="teaching-calculus"&gt;Teaching Calculus
&lt;/h3&gt;&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; Was your teaching before you started the PhD program, or was that part of your deal there?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; I was a TA my first year, but usually you don&amp;rsquo;t teach your first year—you&amp;rsquo;re an assistant to another grad student who&amp;rsquo;s teaching. I was a TA for calculus one and two, where you&amp;rsquo;re grading and having office hours but not actually lecturing.&lt;/p&gt;
&lt;p&gt;But that summer, I was lecturing, not having ever done it before. Actually, there&amp;rsquo;s a funny story—to me anyway. I&amp;rsquo;d actually never taken calculus two. In high school, I took calculus early, like I said, but I did the AB calculus test because that&amp;rsquo;s all we had—my school didn&amp;rsquo;t offer calc two. My senior year, I signed up for calculus three at that local college. Then I got my undergrad to waive the fact that I&amp;rsquo;d never taken calc two, which is where you learn all those tricks for integration like trig substitution. I didn&amp;rsquo;t know any of that.&lt;/p&gt;
&lt;p&gt;By the time I TA&amp;rsquo;d, I had mostly learned it. But this girl came to my office one day when I was a TA with a question I did not know the answer to at all. I knew nothing about calc two. She actually ended up dropping the class after she came to my office, which I feel horrified about—that I ruined her life because I was doing something I was unqualified for.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; I think calculus did that.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; But when I taught calc two, I finally learned it. Teaching a week&amp;rsquo;s worth of material in a day is not only hard for the students—it&amp;rsquo;s hard for me.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; Did you have to come up with the curriculum and lessons too?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; Yeah, basically the grad students had all these PDFs of notes they would pass around to each other. I would get somebody&amp;rsquo;s notes, but you&amp;rsquo;d go teach for four hours in the morning and then go home and grade and get ready for the next day. It was very intense.&lt;/p&gt;
&lt;p&gt;But I learned so much calculus standing at the board, writing in chalk, explaining it. I&amp;rsquo;m like, &amp;ldquo;Oh, that&amp;rsquo;s what that means&amp;rdquo;—quietly to myself so the kids don&amp;rsquo;t think I don&amp;rsquo;t know.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; Quietly to yourself so the kids don&amp;rsquo;t think, you know&amp;hellip;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; Right. But I discovered a very good life lesson: teaching is such a great way to learn things. As you know, I still love teaching when I work—trying to teach my colleagues, and through various online resources I make. People think I do it because I enjoy other people learning—and I do—but it helps me learn as well.&lt;/p&gt;
&lt;hr&gt;
&lt;h3 id="nc-state-and-operations-research"&gt;NC State and Operations Research
&lt;/h3&gt;&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; So I went to North Carolina State, started doing classes again. They didn&amp;rsquo;t really care that I had been in a PhD program for a year previously. I didn&amp;rsquo;t get a degree or anything from UVA.&lt;/p&gt;
&lt;p&gt;The motivation for doing operations research was: okay, here&amp;rsquo;s interesting problem-solving but also something that seems like it has jobs related to it because it&amp;rsquo;s useful. I don&amp;rsquo;t know if that&amp;rsquo;s actually true for operations research, but at least I was learning useful skills.&lt;/p&gt;
&lt;p&gt;I was just doing my coursework, and then I found an advisor after a year. He asked me if I wanted to work with him. He did healthcare optimization stuff—treatment optimization. You use these mathematical models to figure out, like, what&amp;rsquo;s the optimal schedule to give a chemotherapy drug to treat cancer? He also did healthcare operations problems—like, how do you schedule your patients most efficiently so they&amp;rsquo;re not sitting in the waiting room for a long time?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; Would that be from clinical trial data, trying to build out those models?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; It can be from a variety of things. They use real data, they make stuff up. I ended up doing the operations side—we partnered with the oncology clinic, collected real data on patient wait times, and built a &lt;strong&gt;simulation model&lt;/strong&gt; of the oncology clinic.&lt;/p&gt;
&lt;p&gt;Another part of operations research besides optimization is simulations—discrete event simulations. If you can build a realistic simulation of the situation, then you can modify the simulation and see if you can remove bottlenecks. Similar to Six Sigma and that kind of business optimization stuff.&lt;/p&gt;
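&lt;p&gt;The &amp;ldquo;build a simulation, then tweak it to find bottlenecks&amp;rdquo; idea can be sketched as a toy single-server queue. Everything here is made up for illustration; it is not the actual oncology-clinic model.&lt;/p&gt;

```python
import random

# Toy discrete-event flavor of a clinic model: patients arrive at random,
# one infusion chair serves them in order, and we measure average waiting
# time. All parameters are invented for illustration.
random.seed(0)

def average_wait(n_patients, mean_gap=10.0, mean_service=8.0):
    t = 0.0                # current arrival time
    chair_free_at = 0.0    # when the chair next becomes free
    total_wait = 0.0
    for _ in range(n_patients):
        t += random.expovariate(1.0 / mean_gap)        # next arrival
        start = max(t, chair_free_at)                  # wait if chair is busy
        total_wait += start - t
        chair_free_at = start + random.expovariate(1.0 / mean_service)
    return total_wait / n_patients

# Removing a bottleneck = changing a parameter and re-running:
print(round(average_wait(5000), 1))                    # baseline
print(round(average_wait(5000, mean_service=6.0), 1))  # faster service
```

&lt;p&gt;Re-running with a different parameter, as in the last two lines, is the whole experiment: if the average wait drops sharply when service speeds up, the chair was the bottleneck.&lt;/p&gt;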
&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; Is that what brought about the PhD downfall with Six Sigma?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; No, I ended up working on semi-theoretical problems. My advisor was an expert in something called &lt;strong&gt;stochastic integer linear programming&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;A big part of operations research is optimization problems where you have a mathematical formula you&amp;rsquo;re trying to optimize with mathematical constraints. You can describe business problems that way. But stochastic optimization says, &amp;ldquo;Oh, I&amp;rsquo;m not just optimizing these, but maybe my coefficients are random variables.&amp;rdquo; So I want to minimize the worst outcome or maximize the average or something—which is mathematically significantly more complicated.&lt;/p&gt;
&lt;p&gt;He worked on a problem on patient scheduling. When you call your doctor and make an appointment, they&amp;rsquo;re not operating under certainty—they don&amp;rsquo;t know who else is going to schedule. So trying to quantify that uncertainty and make the best decision. He made this huge theoretical optimization problem that he could only solve for &lt;strong&gt;three patients&lt;/strong&gt;. So if your clinic sees a total of three patients, you can get the optimal solution.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; We call it the three-body problem.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; So I worked on essentially an approximation solution using what&amp;rsquo;s now called &lt;strong&gt;reinforcement learning&lt;/strong&gt;—computer scientists had always called it that, but in operations research they called it approximate dynamic programming. It&amp;rsquo;s essentially the same thing: you simulate possible outcomes and use that to figure out your optimization.&lt;/p&gt;
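&lt;p&gt;In its simplest form, &amp;ldquo;simulate possible outcomes and use that to figure out your optimization&amp;rdquo; reduces to Monte Carlo rollouts: score each candidate decision by its average simulated cost and keep the best one. The booking scenario, numbers, and cost structure below are entirely hypothetical.&lt;/p&gt;

```python
import random

# Hedged sketch of the simulate-then-optimize idea: score each candidate
# action (here, a number of booked appointment slots) by Monte Carlo
# rollout and keep the one with the lowest average simulated cost.
# The cost model is invented, not the actual research problem.
random.seed(1)

def rollout_cost(slots_booked, n_sims=2000):
    # Hypothetical cost: overtime if random walk-in demand exceeds
    # capacity, idle time if it falls short.
    capacity = 20
    total = 0.0
    for _ in range(n_sims):
        walk_ins = random.randint(0, 10)
        load = slots_booked + walk_ins
        overtime = max(0, load - capacity)
        idle = max(0, capacity - load)
        total += 3.0 * overtime + 1.0 * idle   # overtime costs more than idle
    return total / n_sims

best = min(range(10, 21), key=rollout_cost)
print(best)  # the booking level with the lowest average simulated cost
```

&lt;p&gt;Approximate dynamic programming goes further by fitting a model to those simulated scores rather than averaging them directly, but the simulate-and-score core is the same.&lt;/p&gt;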
&lt;p&gt;Which was good. And there was another thread going on at the same time: in late 2009, I had gotten on Twitter. As a lazy, procrastinating grad student, I spent a lot of time on Twitter. &lt;strong&gt;Data science&lt;/strong&gt; in that era was becoming a thing, in no small part because of people blogging about it and talking about it on Twitter. I was like, &amp;ldquo;Oh, this is really interesting—they&amp;rsquo;re using data and computation and math to solve real problems.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;This machine learning stuff sounded really interesting. I didn&amp;rsquo;t have any exposure to that in college or before.&lt;/p&gt;
&lt;p&gt;Because of the way my program was at NC State, I was able to start taking some machine learning coursework. My research was essentially turning optimization problems into machine learning pattern recognition problems—simulating all these possible outcomes and using machine learning models to learn those patterns, then figuring out how to optimize over that.&lt;/p&gt;
&lt;p&gt;This is 2010 to 2012, learning machine learning and data science stuff as it was becoming relevant in a way it had never been before. That was partially luck, but also me trying to shape my curriculum into things that seemed relevant in industry and useful to me. That helped set the stage for a lot of stuff.&lt;/p&gt;
&lt;hr&gt;
&lt;h3 id="kiva-systems-internship"&gt;Kiva Systems Internship
&lt;/h3&gt;&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; That summer—after one year at NC State—I talked my way into an internship at &lt;strong&gt;Kiva Systems&lt;/strong&gt;, which is now Amazon Robotics, doing warehouse automation. I read about it over Christmas break in &lt;em&gt;Wired Magazine&lt;/em&gt;, and found out the head of research at Kiva Systems was Pete Wurman, who had been an NC State professor before going to work for his old friend at Kiva Systems.&lt;/p&gt;
&lt;p&gt;I cold-emailed him and was like, &amp;ldquo;Can I come work for you for the summer?&amp;rdquo; I turned that into an internship—it&amp;rsquo;s wild that he hired me when we had only talked on the phone. I drove up to Boston that summer, met him on my first day of work, and worked on simulation models for warehouse optimization.&lt;/p&gt;
&lt;p&gt;That gave me exposure to real-world applications.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; Applications, yeah.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; Optimization and basically operations research in the real world. A good taste. And it paid better than grad school did, which is another good taste of the future.&lt;/p&gt;
&lt;p&gt;So I did that, went back, and kept going with grad school my second year. All my free time that summer in Boston, I had been studying for my qualifying exams. When I came back, I passed all of them, which is one of the big steps towards the PhD.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; That&amp;rsquo;s an accomplishment in and of itself.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; Yeah. I&amp;rsquo;ve always been good at tests, which is a completely useless skill, but it helps in school.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; I mean, some people say life is a test.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; I don&amp;rsquo;t think taking math tests is a good life skill, but I am good at it.&lt;/p&gt;
&lt;hr&gt;
&lt;h3 id="leaving-the-phd-program"&gt;Leaving the PhD Program
&lt;/h3&gt;&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; I came back and did research for a year. Then the following May—after the end of my second year—my advisor called me into his office and said, &amp;ldquo;I&amp;rsquo;m taking a job at the University of Michigan in Ann Arbor. I would like for you to stay here at NC State and continue working with me on your PhD.&amp;rdquo; He was going to have a guest appointment so he could continue to advise me.&lt;/p&gt;
&lt;p&gt;I was like, &amp;ldquo;Okay, that&amp;rsquo;s fine. I&amp;rsquo;ll do that.&amp;rdquo; I kept working with him through the summer. The summer was a very lonely time in the office—just staring at a computer in this sterile office doing research. He was gone, actually in China for a lot of that summer. There was nothing to break up the day. I wasn&amp;rsquo;t teaching. I would go to the gym—which is something that became a hobby in my adult life—but it was a very lonely and somewhat depressing time.&lt;/p&gt;
&lt;p&gt;By August, I was like, &amp;ldquo;He&amp;rsquo;s not even good at working with me from across the hall. How is he going to work with me from Michigan, which is far away from North Carolina?&amp;rdquo; There were other students who were going to have the same arrangement with him, but I was the newest student, so I was going to be in it the longest—probably at least three more years if I was lucky. That&amp;rsquo;s just not a good situation.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; You&amp;rsquo;re going to be there for all the learning and growing pains of trying to do that remotely.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; For at least three more years. And I was not making much money—the money at NC State was worse as a PhD student than at UVA. I was making like $17,000 a year or something. I didn&amp;rsquo;t have that many expenses, but that&amp;rsquo;s not a lot of money.&lt;/p&gt;
&lt;p&gt;So that August, I started looking for jobs—mostly locally but in the data science-ish realm. I was doing that totally on the side. I didn&amp;rsquo;t tell my advisor because I needed him to keep paying me in the meantime, and he was in Michigan, so it was okay. I was taking one class that last semester—a &lt;strong&gt;Bayesian networks class&lt;/strong&gt;, like a machine learning class.&lt;/p&gt;
&lt;p&gt;Another part of this story: over those two years of getting interested in data science stuff, I started teaching myself &lt;strong&gt;Python&lt;/strong&gt; and &lt;strong&gt;R&lt;/strong&gt; as well—which I learned about from Twitter. I did my research in Python, which my advisor didn&amp;rsquo;t know and didn&amp;rsquo;t want me to use. He wanted me to use C++, which he knew. But I was like, &amp;ldquo;Oh, Python&amp;rsquo;s the up-and-coming thing, and it has this thing called scikit-learn where you can access all these machine learning models&amp;rdquo;—which we could use for our reinforcement learning. It turned out really well that I did that.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; You&amp;rsquo;re very proactive.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; I tried to be. I really think I just got lucky in so many ways.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; Well, that&amp;rsquo;s not without effort, you know.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; My effort was not always—I didn&amp;rsquo;t always know what I was doing. I don&amp;rsquo;t think it&amp;rsquo;s always good advice, like &amp;ldquo;just follow your interests and it&amp;rsquo;ll work out.&amp;rdquo; But for me, it did in a lot of ways.&lt;/p&gt;
&lt;p&gt;I interviewed for a variety of jobs. I got offered one at a local chain of retirement homes in North Carolina. They had hired a grad student to build optimization models for the pricing structure of their nursing homes. She was rolling off the project, and they wanted somebody to do it full time. They offered me a job, but for the amount of money they thought they were going to make, they didn&amp;rsquo;t offer me very much.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; But still more than grad school was paying.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; Yeah. I also got offered a job at &lt;strong&gt;RTI International&lt;/strong&gt;, which is a nonprofit research company in the Raleigh-Durham area. They do government contract stuff—not defense, but kind of humanitarian government stuff: education, economics, health.&lt;/p&gt;
&lt;p&gt;I think some executive had read an article in some business magazine about data science and thought, &amp;ldquo;We should hire a data scientist.&amp;rdquo; They basically hired me out of hype, even though it was a company that had essentially been doing data science for 50 years—statisticians, programmers, doing data analysis, surveys, all this stuff. They didn&amp;rsquo;t really know what they were hiring me for.&lt;/p&gt;
&lt;p&gt;They offered me that job in October of 2012, and I decided to take it because it seemed less depressing than working in the nursing home administrative office.&lt;/p&gt;
&lt;p&gt;So then I called my advisor. I was like, &amp;ldquo;Oh, I&amp;rsquo;m dropping out, by the way.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; How did he take that news?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; I don&amp;rsquo;t think he was totally surprised. I think he was probably disappointed that I hadn&amp;rsquo;t let on about it sooner. But I just had to protect myself at that point.&lt;/p&gt;
&lt;p&gt;So I finished up that semester, finished up the last class, and started working late October of 2012. I&amp;rsquo;ve been in industry ever since then.&lt;/p&gt;
&lt;hr&gt;
&lt;h3 id="meeting-maggie-and-getting-married"&gt;Meeting Maggie and Getting Married
&lt;/h3&gt;&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; What year was this?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; 2012, yeah. October 2012.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; When did you meet Maggie?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; I didn&amp;rsquo;t meet Maggie until a few years later. We got married in August of 2015.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; It was right before Distill then, right?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; Yeah, literally right before Distill. To jump ahead a little bit: three weeks before I got married, I found out I needed a new job—which was a crazy time. And then I came to work with you at Distill.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; I didn&amp;rsquo;t realize those events were that close together.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; We got married in August. I started working there in October, I believe. But three weeks before my wedding, my boss called me and was basically like, &amp;ldquo;The whole project is falling apart&amp;rdquo;—I worked on a DARPA project, government stuff.&lt;/p&gt;
&lt;p&gt;At that point before Maggie, I was single, lived alone, ate a lot of peanut butter, gained weight. Not actually the best time of my life.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; Well, you know, we all go through our ups and downs. So then—okay, Distill—and then I&amp;rsquo;m debating if we want to talk about your career, because&amp;hellip;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; We can, it&amp;rsquo;s pretty public. It&amp;rsquo;s on my resume.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; Sure. I&amp;rsquo;m more interested in terms of—at least for me, selfishly—I&amp;rsquo;d like to know more about you becoming a parent. In terms of hobbies, I know you were doing the workout stuff and strongman competitions at one point. And life outside of work. Let&amp;rsquo;s start there instead of going into career stuff. Life outside of work from that point.&lt;/p&gt;
&lt;hr&gt;
&lt;h3 id="hobbies-woodworking-and-hiking"&gt;Hobbies: Woodworking and Hiking
&lt;/h3&gt;&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; One thing I discovered late in grad school: NC State has a place called the Craft Center. It&amp;rsquo;s a non-academic place for students and alumni to just go do stuff. One thing they had is a wood shop. I actually got into &lt;strong&gt;woodworking&lt;/strong&gt; for a few years there, which I wish I had known about all through grad school. I think it would have been good for my mental health.&lt;/p&gt;
&lt;p&gt;Maybe my last semester I took a wood class. It&amp;rsquo;s an incredible wood shop—it really spoiled me because I could never have the setup they have there. For a little bit, they had a guy who was just a master woodworker teaching. I think that was actually right after I finished grad school—I continued to take some classes there.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; What did you build there?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; In the classes, I made some basic stuff—a Shaker shelf, a little side table. I had been working on a big project to make a top for a standing desk, a big walnut top. I was also making a monitor stand. I never actually totally finished it, which is unfortunate. I was going to make these really cool swing-out drawers—I built nice walnut drawers that were going to be on heavy-duty pivot hinges, just pivot out on one point to open, and then mount it on standard standing desk legs.&lt;/p&gt;
&lt;p&gt;I still have the desktop—my 3D printer sits on it now—but I never finished the monitor hutch part. I learned a lot. Wood is such an interesting, complex medium to work in because it&amp;rsquo;s so precise but also—wood is very alive because of moisture things.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; But it&amp;rsquo;s approachable.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; Yeah. It makes you learn to appreciate the ability to press the undo button, because there&amp;rsquo;s no undo button in woodworking.&lt;/p&gt;
&lt;p&gt;I also got more into &lt;strong&gt;hiking&lt;/strong&gt; in that era. I&amp;rsquo;d been in Boy Scouts, but I didn&amp;rsquo;t really love the outdoors when I was younger.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; What didn&amp;rsquo;t you like about it back then?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; That&amp;rsquo;s a good question. I just—it didn&amp;rsquo;t really&amp;hellip; I think part of it is I just had a very good childhood. When I used to golf with my dad, he golfed a lot to blow off steam, and I just didn&amp;rsquo;t need to blow off steam because I had a very stable home. School was very stable. I don&amp;rsquo;t think I needed to go blow off steam that much.&lt;/p&gt;
&lt;p&gt;But during college, I started to really enjoy the outdoors more. During and right after grad school, I got more into hiking. North Carolina just has some world-class scenery—not world-class height of mountains, but stunning scenery. I started to explore North Carolina more, the mountains more.&lt;/p&gt;
&lt;p&gt;That also became more of an opportunity as I had money to drive myself out to the mountains, stay in a hotel, and hike.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; Speaking of driving, when did you get your Matrix?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; I bought my &lt;strong&gt;Toyota Matrix&lt;/strong&gt; that I still drive in 2010. Right after I moved to Raleigh. I had been driving an old Mercury Mountaineer, which is like a Ford Explorer. It was falling apart—V8, all-wheel drive, horrible gas mileage—and gas had just gotten really expensive. I lived a long way from school and was like, &amp;ldquo;I can&amp;rsquo;t afford to keep driving this.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;So I bought a Toyota Matrix with 107,000 miles on it that I still drive now with &lt;strong&gt;260,000 miles&lt;/strong&gt; on it. It&amp;rsquo;s my pride and joy. I recently joined a Toyota Matrix owner&amp;rsquo;s Facebook group.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; I had a Matrix! I had to get rid of it mainly because I was commuting 90 miles one way and it was a manual. Jolene couldn&amp;rsquo;t drive it, so we just got something else.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; Mine&amp;rsquo;s the automatic, and the transmission has been rock solid.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; It&amp;rsquo;s such a good car. You can do all kinds of stuff—the seats go down and it&amp;rsquo;s got so much room.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; I hauled 28 bags of mulch from Lowe&amp;rsquo;s last summer in the back. We just moved, and I packed an incredible number of boxes flat in there to take to the recycling center. I&amp;rsquo;m going to keep driving it until I can&amp;rsquo;t.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; When I took a fundamentals of engineering course, we did one of those newspaper bridges. We did this massive suspension bridge and had to put down the seats on one side so it would go from the back of the trunk to the front of the car to transport it.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; Yeah. Also right as I was finishing grad school, I became more involved in my church. I became the &lt;strong&gt;treasurer&lt;/strong&gt; for my church, which I did until three weeks ago—for a long time. And I became a deacon at the church. In the Presbyterian tradition, a deacon is someone who helps people who have financial needs or other material needs in the church or outside of the church. So I did that for a while.&lt;/p&gt;
&lt;p&gt;Those early days right out of grad school felt very busy—getting used to having a full-time job. I also started living by myself for the first time with no roommates. So I couldn&amp;rsquo;t blame anyone for the bathroom or kitchen not being cleaned.&lt;/p&gt;
&lt;p&gt;I actually like to cook a lot too. I don&amp;rsquo;t cook much anymore, but I cooked a lot through grad school and those early days. I enjoyed having people in my house—having friends over and cooking for them.&lt;/p&gt;
&lt;p&gt;And getting into reading more—kind of back into reading in that era, being more deliberate about it. Those were my good years of reading, kind of reading in solitude before having kids. Just having a quiet house to read in.&lt;/p&gt;
&lt;hr&gt;
&lt;h3 id="photography-and-wildlife"&gt;Photography and Wildlife
&lt;/h3&gt;&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; Photography also started interesting me more in that era. My dad&amp;rsquo;s always been a big photographer, and I&amp;rsquo;d taken a photography class in high school, but it hadn&amp;rsquo;t really stuck.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; Was it always nature photography in particular, or just photography in general?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; My dad has always been really good at photographing people and events—that had never really been my thing. But I think it started to come back as I started to get more into hiking and being out in nature.&lt;/p&gt;
&lt;p&gt;North Carolina has some world-class waterfalls. I would go to waterfalls and try to take pictures on my iPhone 5 or something—just worthless. Waterfalls are actually really hard to photograph, and maybe not worth photographing that much. But I was like, &amp;ldquo;How can I capture this?&amp;rdquo;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; Neutral density filter?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; Yeah, I started to learn more about the techniques and bought different cameras.&lt;/p&gt;
&lt;p&gt;It was actually 10 years ago last weekend that I got into—after I was married—we had an ice storm come through in January of 2016. I had a Sony a6300 or something, a little crop sensor camera. I was photographing birds on our bird feeder on the ice during the ice storm, and Maggie saw I enjoyed it so much. She said, &amp;ldquo;You should get a nicer camera.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;That kind of kicked off my interest in &lt;strong&gt;wildlife photography&lt;/strong&gt;—which, I don&amp;rsquo;t know, we haven&amp;rsquo;t talked about photography as much in the last few years—but at this point my wildlife is almost exclusively my children. Which is harder than any animal.&lt;/p&gt;
&lt;p&gt;But it&amp;rsquo;s still my love. I would love to do more wildlife stuff—even more than landscapes. I think wildlife is what I really love.&lt;/p&gt;
&lt;hr&gt;
&lt;h3 id="becoming-a-parent"&gt;Becoming a Parent
&lt;/h3&gt;&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; Speaking of wildlife and kids, let&amp;rsquo;s talk about becoming a parent. Did you always want to be a parent?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; Yeah, I&amp;rsquo;ve always loved kids. In elementary school, my parents were foster parents. I didn&amp;rsquo;t have younger siblings, but my mom&amp;rsquo;s a pediatric nurse. So she was specifically a foster parent for babies with medical needs—that was her expertise.&lt;/p&gt;
&lt;p&gt;We had, not for long term—they&amp;rsquo;re mostly short term—but we had &lt;strong&gt;25 different foster kids&lt;/strong&gt; in my elementary school years. Some as short as like a weekend, and some for maybe months at a time—but not years.&lt;/p&gt;
&lt;p&gt;My mom was kind of the revolving door of child care for her friends—her friends&amp;rsquo; kids were all at our house all the time. So I was just around a lot of kids growing up, and that was a good experience for me.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; Do you think that might have influenced or motivated you to do the children&amp;rsquo;s home stuff?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; Yeah, probably. Those were older kids—middle-school age—kind of a different demographic in a lot of ways, but I&amp;rsquo;m sure that was influential.&lt;/p&gt;
&lt;p&gt;I always hoped I would have kids. I was around a lot of big families growing up. I grew up in a family of four, which is a big family—it&amp;rsquo;s above average.&lt;/p&gt;
&lt;p&gt;Maggie and I got married in August of 2015.&lt;/p&gt;
&lt;p&gt;Another kind of crazy story you probably don&amp;rsquo;t know: Maggie&amp;rsquo;s older sister had married one of my best friends from college two years earlier. That&amp;rsquo;s how we met. My brother-in-law is one of my closest friends from college. Maggie had grown up in a big family as well and always wanted to have kids.&lt;/p&gt;
&lt;p&gt;We had our first, James, two years and one day after our wedding. We checked into the hospital on our second anniversary—he was a week late.&lt;/p&gt;
&lt;p&gt;And now we have four: eight, six, almost four, and two. Boy, boy, girl, girl—evenly spaced in this little window. Which is a good distribution. It&amp;rsquo;s crazy—you could not have planned it any better.&lt;/p&gt;
&lt;p&gt;It&amp;rsquo;s really fun to see the kids get to enjoy each other. They all enjoy each other, but the boys really pair off and do a lot of boy stuff, and the girls pair off and giggle and talk about princesses a lot. It&amp;rsquo;s very fun.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; I&amp;rsquo;m curious about your thoughts and philosophies of parenting. You had a pretty happy, healthy childhood. What are some things you take away from your parents that you try to mimic with your kids?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; I think one thing with my parents is I always knew that they loved and supported us and just had our backs. That&amp;rsquo;s a big thing we want our children to see—that we&amp;rsquo;re there for them.&lt;/p&gt;
&lt;p&gt;We also want them to know we&amp;rsquo;re in charge. I&amp;rsquo;m not as much into some of the modern parenting that&amp;rsquo;s more like you&amp;rsquo;re chummy with your children. We really enjoy the time with them, but we are in charge. They hopefully really know that we&amp;rsquo;re supporting them.&lt;/p&gt;
&lt;p&gt;Along the lines of what I was describing about my college situation and liberal arts education—really opening your mind—my kids are homeschooled for now. Maggie teaches them at home, the boys anyway—they&amp;rsquo;re eight and six.&lt;/p&gt;
&lt;p&gt;We&amp;rsquo;re trying to make education not just a really rote thing—&amp;ldquo;here you just have to follow these steps&amp;rdquo;—or just learning things that are useful in the sense of &amp;ldquo;you&amp;rsquo;re going to be able to get a job one day.&amp;rdquo; I think it&amp;rsquo;s good to be able to get a job—we&amp;rsquo;ve talked about that. But at the same time, we want them to really understand and think deeply and understand the world and even just enjoy the world.&lt;/p&gt;
&lt;p&gt;There&amp;rsquo;s teaching your kids to just enjoy things. Liberal arts education—often that&amp;rsquo;s in literature and stuff, really enjoying it. For me, having studied math, I just intrinsically enjoy math. I think it&amp;rsquo;s really beautiful—people hear that and think &amp;ldquo;oh, arithmetic&amp;rdquo;—and arithmetic is beautiful in its own way—but math at a higher level goes well past arithmetic.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;I think math is intrinsically a beautiful thing, and we&amp;rsquo;re trying to teach the kids to embrace that beauty of things.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; The further you go into something, the more nuanced it becomes. The more you appreciate the challenges or perspectives, and it broadens your perspective in general.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; Yeah. Feynman said something like—to paraphrase—nothing&amp;rsquo;s boring if you go deep enough into it. I don&amp;rsquo;t know if that&amp;rsquo;s totally true, but there&amp;rsquo;s a lot of truth to it.&lt;/p&gt;
&lt;p&gt;You and I both got into &amp;ldquo;data science&amp;rdquo; interested in doing more of the interesting machine learning models. Both of us have done a lot more infrastructure stuff. I know you probably do some model stuff, but I exclusively do infrastructure-type engineering things now.&lt;/p&gt;
&lt;p&gt;I gave a talk about this a couple of years ago, but learning to embrace the beauty even of that has been a formative thing for me. I want to instill that in my kids—not about programming necessarily, but just about the things they do.&lt;/p&gt;
&lt;p&gt;With all the hardships and evil in the world, there&amp;rsquo;s such a great world to live in. So many beautiful things.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; There&amp;rsquo;s a Brandon Sanderson book, and there&amp;rsquo;s a quote I really liked. It&amp;rsquo;s something along the lines of: &amp;ldquo;It&amp;rsquo;s a difficult time to live. It doesn&amp;rsquo;t have to be a difficult time to love.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; Yeah. There&amp;rsquo;s a C.S. Lewis quote along the same lines—where he&amp;rsquo;s defending writing things during World War II. He basically talks about, you know, if the bombs fall on us while we do it, we should just be making the most of it while the bombs fall.&lt;/p&gt;
&lt;p&gt;I don&amp;rsquo;t know if that&amp;rsquo;s my political philosophy necessarily, but I do try to look at the world that way. Even in the drudgery of my own work, I try to really embrace the beautiful parts of it.&lt;/p&gt;
&lt;p&gt;We both love our different hobbies. Hobbies are an opportunity to just enjoy good things—even if you&amp;rsquo;re not sharing it, even if nobody knows about it. You do all kinds of things in your own home that people don&amp;rsquo;t know about.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; I think one of my favorite things about different hobbies is learning that things are complicated. Problems are easier when they&amp;rsquo;re not yours, kind of a thing. It helps you appreciate that somebody does something—you can appreciate that it took a lot of effort or some courage to do.&lt;/p&gt;
&lt;p&gt;I used to watch a lot of Bob Ross, and one of the things he used to talk about was—he would paint the whole background and then you want to put the big tree in the front, the foreground. He&amp;rsquo;d say, &amp;ldquo;Okay, now you got to be brave and put this thing in.&amp;rdquo; I was laughing—what is this guy talking about, being brave while you&amp;rsquo;re painting? It&amp;rsquo;s just a painting.&lt;/p&gt;
&lt;p&gt;But I&amp;rsquo;ve painted, and you have to have some courage to mess up the thing you just spent a lot of time doing—to take that risk and put something new in front of it.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; Yeah. In my attempt to not just do things in life that I&amp;rsquo;m good at, I&amp;rsquo;ve learned as an adult to really work hard and push through things and just try things. That&amp;rsquo;s something I&amp;rsquo;m trying to instill in my kids.&lt;/p&gt;
&lt;p&gt;One of them doesn&amp;rsquo;t like to do something if it&amp;rsquo;s not going to turn out well. I tell them about the Edison quote—&amp;ldquo;we&amp;rsquo;ve learned a thousand ways that don&amp;rsquo;t work.&amp;rdquo; That&amp;rsquo;s what I do in my job all day long—try things, see it doesn&amp;rsquo;t work, try something else. I just restarted a failing computation job right before you and I got on. Although now my AI tools are pretty good at doing that for me, so maybe I&amp;rsquo;m gonna get back to being lazy.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; How do you think being a parent has impacted you?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; I&amp;rsquo;d have to give that some reflection. One thing I&amp;rsquo;ve definitely experienced—and a lot of people, and maybe more surprisingly a lot of fathers, talk about this—being a parent makes you a lot more emotional. You can&amp;rsquo;t watch an emotional scene in a movie the same way. Or any movie where a kid gets hurt—it&amp;rsquo;s just impossible to watch. You experience it differently when you don&amp;rsquo;t have kids.&lt;/p&gt;
&lt;p&gt;I think it keeps you young in some ways. Your kids are at the stage they&amp;rsquo;re at, and you can be an adult most of the time, but sometimes you can just sit down with a two-year-old.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; You experience childhood vicariously, I guess.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; Yeah. One of the cool things—all my kids have had this, and I&amp;rsquo;ve seen some people on Twitter mention it, I think it&amp;rsquo;s fairly universal but I don&amp;rsquo;t know why—two-year-olds have a stage where they get obsessed with seeing the moon. Every time they get out of the car, it&amp;rsquo;s like, &amp;ldquo;Look, the moon, the moon!&amp;rdquo; Around two years and three months or so.&lt;/p&gt;
&lt;p&gt;Those kinds of things are such a good reminder—like, oh, this is crazy. There&amp;rsquo;s this huge rock flying around us every day, and we totally take it for granted. There are so many things kids are learning for the first time. It&amp;rsquo;s like, &amp;ldquo;Oh yeah, I should stop and appreciate that the moon is up there, appreciate the stars are up there.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; One of my favorite things about astronomy and astrophotography is it&amp;rsquo;s very humbling, I think.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; Yeah. I was hoping at our new house—you gave me some telescope recommendations a while ago. My eight-year-old keeps asking about it.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; Those have probably changed by now.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; Well, I was hoping our new house would be perfect, but they recently installed some new street lights right behind us that I didn&amp;rsquo;t know were going to be there. It&amp;rsquo;s not as dark as I thought. But we&amp;rsquo;re kind of on the edge of Indianapolis, so I&amp;rsquo;m hoping we can get out a little bit more and start getting into some astronomy together.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; Really fun, especially with the kids—everyone&amp;rsquo;s seen it for the first time.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; Yeah. There are a lot of things that are just neat to—like reading to the kids. That&amp;rsquo;s been one of my favorite things as a parent. Some stuff I&amp;rsquo;ve been reading to them for the first time—that&amp;rsquo;s fun—but like getting to read &lt;em&gt;The Chronicles of Narnia&lt;/em&gt; to them, which my mom read to us. Or reading &lt;em&gt;The Hobbit&lt;/em&gt; to them.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; Are you going to do &lt;em&gt;Lord of the Rings&lt;/em&gt; and &lt;em&gt;The Hobbit&lt;/em&gt;?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; I&amp;rsquo;m a little torn on when to read &lt;em&gt;The Lord of the Rings&lt;/em&gt;. There&amp;rsquo;s a Tolkien quote about how &lt;em&gt;Lord of the Rings&lt;/em&gt; isn&amp;rsquo;t a children&amp;rsquo;s book, and you can only read it for the first time once, so don&amp;rsquo;t do it too soon. I would love to start it now, but I think I want to wait till they&amp;rsquo;re a little older.&lt;/p&gt;
&lt;p&gt;But even then, just getting to think, &amp;ldquo;What other stuff can we read together?&amp;rdquo; My boys and I read most evenings before bed. Recently they&amp;rsquo;ve gotten into the Three Stooges—they sometimes beg me to watch the Three Stooges instead of reading to them. But that&amp;rsquo;s been really fun.&lt;/p&gt;
&lt;hr&gt;
&lt;h3 id="ai-and-the-future"&gt;AI and the Future
&lt;/h3&gt;&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; With the way things are—AI taking over the world for better or worse, the Pandora&amp;rsquo;s box is open—the world is changing very dramatically. How does that impact your hopes and thoughts about your kids and their future, their prospects?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; It&amp;rsquo;s a big question. I&amp;rsquo;ve thought about it some, and I really just don&amp;rsquo;t know the answers.&lt;/p&gt;
&lt;p&gt;The short answer for me is: for them, not much needs to change yet. My boys love to do image generation on ChatGPT or Gemini. It&amp;rsquo;s usually like &amp;ldquo;make a baby riding a motorcycle on Mars&amp;rdquo; or something silly.&lt;/p&gt;
&lt;p&gt;With my eight-year-old—he just has endless questions about things: animals, technology, TV shows. So we&amp;rsquo;ll ask the AI.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; You had a World Book Encyclopedia to flip through.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; Yeah, I should get him some—I try to get him a lot of reference-type books because he&amp;rsquo;s a lot like me and likes that kind of stuff.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;ve been trying to teach my boys some basic programming using Scratch—a visual programming thing—and some iPad stuff. With my eight-year-old, I think we could probably &lt;strong&gt;vibe code&lt;/strong&gt; a game using Claude Code and some kind of speech-to-text thing. He could just talk to the computer and build some kind of game. I&amp;rsquo;m excited to show him something like that.&lt;/p&gt;
&lt;p&gt;But at the same time, it seems to still be really valuable to learn the fundamentals. I don&amp;rsquo;t want to derail him from learning the fundamentals.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; It makes me think of calculus three—that was the time we were allowed to start using a graphing calculator. That is the moment I stopped learning calculus.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; Yeah. I remember when I was teaching calculus, Wolfram Alpha had come out and was showing the steps for integration and differentiation.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;m really glad I was in school in an era where I just had to grind through problems. That professor who would give us long take-homes that you&amp;rsquo;d spend eight, twelve hours on, just really wrestling through problems—probably now, if you took his abstract algebra take-home test, ChatGPT could one-shot solve the entire test.&lt;/p&gt;
&lt;p&gt;Which is awesome—I use these tools all day, every day, and it&amp;rsquo;s fantastic. But I&amp;rsquo;m so glad I was schooled in foundations in a lot of ways.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; Similarly, I&amp;rsquo;m glad we grew up before social media and the internet kind of took over. I don&amp;rsquo;t know what it would be like to grow up with that just being in your face all the time.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; Absolutely, yeah. We&amp;rsquo;re slow to introduce technology things to my kids.&lt;/p&gt;
&lt;p&gt;It seems unquestionable that AI stuff is going to have an impact on whatever they choose to do in their careers—and at some point in their schooling if they&amp;rsquo;re going to college. I don&amp;rsquo;t want to keep them from learning how to use those tools well.&lt;/p&gt;
&lt;p&gt;But at the same time, with programming stuff, I use AI tools literally eight-plus hours a day. Maybe I deceive myself sometimes, but I think I&amp;rsquo;m better at using them because of a lot of experience prior to using them—understanding Python really well and how things work.&lt;/p&gt;
&lt;p&gt;I have a lot of both professional experience and education and fundamentals that are still useful. I can still think about the time complexity of an algorithm that the AI tool generates.&lt;/p&gt;
&lt;p&gt;I honestly don&amp;rsquo;t know what all those answers look like. But it&amp;rsquo;s probably to some degree comparable to not rushing to use the graphing calculator—or the TI-89 that could do symbolic integration. It is worth actually grinding through and learning some stuff first.&lt;/p&gt;
&lt;p&gt;But it&amp;rsquo;s going to be harder and harder to do that. Kids in public school are getting Chromebooks, and once you have Google Docs, you&amp;rsquo;re getting access to Gemini.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; The cognitive offload is pretty real.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; Absolutely. All of us are wrestling with that in our own jobs—what the implications are, what&amp;rsquo;s good and bad about it.&lt;/p&gt;
&lt;p&gt;Even just writing—I don&amp;rsquo;t think I&amp;rsquo;m an amazing writer, but I went through school having to really write and being graded on that and learning from that. I want my kids to learn how to write well because I want them to be able to communicate their ideas well—not just generate essays.&lt;/p&gt;
&lt;p&gt;I honestly don&amp;rsquo;t know all the answers. I&amp;rsquo;ve been asking everybody on my podcasts about the future, and nobody really knows. I don&amp;rsquo;t know either.&lt;/p&gt;
&lt;hr&gt;
&lt;h3 id="moving-to-indiana"&gt;Moving to Indiana
&lt;/h3&gt;&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; Speaking more about the future and maybe wrapping up—you just did a pretty big move. What are your aspirations going from here, from a hobby perspective or family perspective?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; We moved to Indiana to be closer to family—my wife&amp;rsquo;s family specifically. She grew up in the town we&amp;rsquo;re living in now.&lt;/p&gt;
&lt;p&gt;In a lot of ways, we&amp;rsquo;re just trying to develop that. Neither of us have really lived that close to family as adults.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; Especially being as close with family as you guys are.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; With kids, our kids have not grown up spending a lot of time with family. Now they have a cousin who&amp;rsquo;s a mile away, which is a whole new experience. We&amp;rsquo;re hoping to lean into that and really make the most of it.&lt;/p&gt;
&lt;p&gt;Maggie and I are currently—if you have any opinions—looking for a new hobby that she and I can do together in the evenings when the kids are in bed.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; I was going to say board games, but we know where that&amp;rsquo;s going to go.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; I&amp;rsquo;m not a board game player. Our off-and-on activity has been playing Cat Quest on the Switch.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; There&amp;rsquo;s a studio that makes exclusively two-player games—the most recent one was called &lt;em&gt;Split Fiction&lt;/em&gt;, which was really good. And they have &lt;em&gt;It Takes Two&lt;/em&gt;, which is like a relationship-sort-of game. It&amp;rsquo;s really good—I would recommend that for video games. I don&amp;rsquo;t know if it&amp;rsquo;s on the Switch though.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; My buddy&amp;rsquo;s been playing that with his wife.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; The most recent one is &lt;em&gt;Split Fiction&lt;/em&gt;—more like a platformer puzzly kind of thing, but it does things in such a neat way.&lt;/p&gt;
&lt;p&gt;And then painting—I like painting quite a bit. I try to constrain how much I&amp;rsquo;m doing at a given time. I get these little four-by-six canvas blocks and just do something small.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; My sister-in-law who lives a mile from us is a very successful professional painter.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; Well, don&amp;rsquo;t invite them. Or get tips and then don&amp;rsquo;t tell Maggie—then you can show them off.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; I like the idea of foraying into something I don&amp;rsquo;t have a lot of ability in.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; We did pottery for a while—that&amp;rsquo;s not something you can do readily at home without some investment. But we did a pottery class for like eight weeks and then rented studio space and did that for quite a while. It was a lot of fun.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; At the place I did woodworking, we took one pottery class before we had kids. That was really enjoyable.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; Anything like making or going through that creative process together is always fun.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; Yeah. We&amp;rsquo;ve just spent the past eight years having kids—a very intense time of our life. We&amp;rsquo;re not planning to have more. Now we have a sort of clean slate. What kinds of things can we do?&lt;/p&gt;
&lt;p&gt;We also have a big basement now, which is awesome. I have space to do stuff.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; For the board game collection.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; Yeah, you know me. Unfortunately, my kids are enjoying games. I&amp;rsquo;m not going to get into that. All Maggie&amp;rsquo;s family here loves games too, so&amp;hellip; might have to&amp;hellip;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; I&amp;rsquo;ll have the podcast part two—I&amp;rsquo;ll have to get some more understanding of your perspective there.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; Part of it is I&amp;rsquo;m not competitive.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; Okay, that makes sense.&lt;/p&gt;
&lt;hr&gt;
&lt;h3 id="making-things-and-knitting"&gt;Making Things and Knitting
&lt;/h3&gt;&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; Part of it is—something we haven&amp;rsquo;t really talked about—I really do love to make stuff. I loved woodworking. One of my main hobbies now is making websites—putting stuff online. My current big project is the &lt;strong&gt;Python Developer Tooling Handbook&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; What about the basketball thing? Is that just a chain of tweets?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; It was a Twitter and Instagram account. I&amp;rsquo;ve tried to turn it into a website—I need to use Claude Code now to get that going again. My &amp;ldquo;Do You Play Ball&amp;rdquo; site. And I do some religious history-related stuff online too.&lt;/p&gt;
&lt;p&gt;I really love making things. Even loving to read—that&amp;rsquo;s actually a tension for me in my hobbies. Reading is like you&amp;rsquo;re not making anything. When I sit down to do something, I love to just tinker. A website is a common outlet for that.&lt;/p&gt;
&lt;p&gt;Even photography, I think for me, is making something. Photographers talk about &amp;ldquo;making an image&amp;rdquo;—and I really do think that&amp;rsquo;s true. Not just in how you capture it and how you decide to capture it and how you choose to edit it, but even choosing what images you show versus which ones you delete is part of the making process.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; People delete pictures now?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; I delete a lot. Quite a bit.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; It sounds like one thing to add some variety would be the tangible kind of creation—something that&amp;rsquo;s not on the computer, like the woodworking kind of thing.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; I also like to &lt;strong&gt;knit&lt;/strong&gt;. Do you know about me knitting?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; I do know about you knitting. I forgot to bring that up too.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; That&amp;rsquo;s something Maggie and I do sometimes together. She crochets, and I knit as a hobby. We do that and watch movies or something.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; And your rocking chairs.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; Yeah, I wish I had a rocking chair.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; I know someone who&amp;rsquo;s making a rocking chair.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; My boys are actually trying to learn how to crochet right now, which is fun.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; That&amp;rsquo;s cool.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; Knitting is a very slow way to make things, but it&amp;rsquo;s something I started back in college 20 years ago. I picked it up again recently.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; You&amp;rsquo;re gonna need it for all the cold weather.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; It has been unbelievably cold here. It was six below Fahrenheit this morning when I woke up. And now, at 10 PM, it&amp;rsquo;s four above zero. It has been so cold for the two weeks I&amp;rsquo;ve been here.&lt;/p&gt;
&lt;p&gt;Here&amp;rsquo;s a hat that I knit—I never actually really finished it, but&amp;hellip;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; That&amp;rsquo;s really nice.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; I&amp;rsquo;m actually a very good knitter. People get intimidated when they see it.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; You could go to the knitting circle. Jolene&amp;rsquo;s been eyeing one at the local store—try and encourage her to go.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; It&amp;rsquo;s all ladies, but it&amp;rsquo;s fun.&lt;/p&gt;
&lt;p&gt;An interesting thing about knitting is that knitting patterns are written in a really abstract way. If you&amp;rsquo;re used to programming and math, they&amp;rsquo;re very unintimidating to read. They&amp;rsquo;re not even complicated logic—they&amp;rsquo;re just really concise so they can fit on one page. I think that&amp;rsquo;s why the actual mechanics of complicated things usually aren&amp;rsquo;t that hard—it&amp;rsquo;s just a question of whether you&amp;rsquo;re willing to be patient with the abstract stuff. Which is a no-brainer to me.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; There&amp;rsquo;s also the geometry aspect too, in terms of crochet—there&amp;rsquo;s like a&amp;hellip;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; Knitting is less geometric in the sense that when you&amp;rsquo;re knitting, you&amp;rsquo;re basically only working in one dimension at a time. Crochet is more dimensional because it&amp;rsquo;s less constrained. But knitting is so constrained that if you can understand the abstraction, you&amp;rsquo;re only dealing with a very limited scope at any given time. You&amp;rsquo;re in one place, and you can&amp;rsquo;t be in other places—I mean, there are some exceptions.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; The other thing that&amp;rsquo;s neat about crochet is there&amp;rsquo;s no machine that can crochet. So every crochet piece you find somewhere in a store—that&amp;rsquo;s been handmade. There&amp;rsquo;s something special there, I think.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; Yeah. That&amp;rsquo;s what Jolene points out to me all the time. That&amp;rsquo;s cool. And she picked up knitting too, so she&amp;rsquo;s been doing a lot of that.&lt;/p&gt;
&lt;p&gt;You can machine knit too, but really intricate stuff usually can&amp;rsquo;t be machine knit as well.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; You like the Fair Isle stuff where you make the pictures and things?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; Yeah, I enjoy Fair Isle as well.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; Or the pixel art sort of stuff.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; Yeah, it&amp;rsquo;s very fun.&lt;/p&gt;
&lt;hr&gt;
&lt;h3 id="closing"&gt;Closing
&lt;/h3&gt;&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; All right. Hey, buddy. It&amp;rsquo;s been lovely to talk to you. It&amp;rsquo;s been a while.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; Do you feel like you know me better?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; I feel like I know you better. I feel like there are things we can still talk about. I&amp;rsquo;d like to talk more—you know, philosophy, theology, those types of things. I think that would be a fun conversation to have. So if this thing gets like 10,000 likes or something, we can do that part two. Sounds good.&lt;/p&gt;
&lt;p&gt;Where can people find you? Do you have anything you want to plug?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; I mean, this podcast is at &lt;a class="link" href="https://podcast.tdhopper.com" target="_blank" rel="noopener"
&gt;podcast.tdhopper.com&lt;/a&gt;, which is my main website, and that links to most of my stuff. I&amp;rsquo;m &lt;a class="link" href="https://x.com/tdhopper" target="_blank" rel="noopener"
&gt;tdhopper&lt;/a&gt; on Twitter/X and Instagram. &amp;ldquo;tdhopper&amp;rdquo; is kind of my online identity.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Brenton:&lt;/strong&gt; All right, well, send my love to Maggie and the kids. It&amp;rsquo;s been nice spending more time with you.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; Yeah, thanks Brenton. Thank you. Bye.&lt;/p&gt;</description></item><item><title>AI-Assisted SQL Generation with Till Döhmen</title><link>https://tdhopper.com/blog/ai-assisted-sql-generation-with-till-d%C3%B6hmen/</link><pubDate>Fri, 30 Jan 2026 01:05:00 -0500</pubDate><guid>https://tdhopper.com/blog/ai-assisted-sql-generation-with-till-d%C3%B6hmen/</guid><description>&lt;img src="https://tdhopper.com/images/podcast.png" alt="Featured image of post AI-Assisted SQL Generation with Till Döhmen" /&gt;&lt;h2 id="listen"&gt;Listen
&lt;/h2&gt;&lt;iframe src="https://open.spotify.com/embed/episode/3wJY3GJpSUVNa0Uvw7QNwO" width="100%" height="232" frameBorder="0" allowtransparency="true" allow="encrypted-media"&gt;&lt;/iframe&gt;
&lt;h2 id="links"&gt;Links
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://www.linkedin.com/in/tdoehmen/" target="_blank" rel="noopener"
&gt;Till&amp;rsquo;s LinkedIn&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://tdoehmen.github.io/" target="_blank" rel="noopener"
&gt;Till&amp;rsquo;s Website&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://motherduck.com/" target="_blank" rel="noopener"
&gt;MotherDuck&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://duckdb.org/" target="_blank" rel="noopener"
&gt;DuckDB&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://www.hopsworks.ai/" target="_blank" rel="noopener"
&gt;Hopsworks&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://changelog.com/practicalai/292" target="_blank" rel="noopener"
&gt;Till on Practical AI Podcast&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://yale-lily.github.io/spider" target="_blank" rel="noopener"
&gt;Spider Text-to-SQL Benchmark&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://github.com/matsonj/mviz" target="_blank" rel="noopener"
&gt;mviz - Observable Dashboard Skill by Jacob Matson&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="subscribe"&gt;Subscribe
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://tdhopper.com/podcast/feed" target="_blank" rel="noopener"
&gt;RSS Feed&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://podcasts.apple.com/us/podcast/into-the-hopper/id1499693201" target="_blank" rel="noopener"
&gt;Apple Podcasts&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://open.spotify.com/show/63NrgKMVb0VTwkklGboIjy" target="_blank" rel="noopener"
&gt;Spotify&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://overcast.fm/itunes1499693201/into-the-hopper" target="_blank" rel="noopener"
&gt;Overcast&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="summary"&gt;Summary
&lt;/h2&gt;&lt;p&gt;In this episode of &lt;em&gt;Into the Hopper&lt;/em&gt;, I sit down with Till Döhmen, AI Lead at MotherDuck, to explore the evolving landscape of AI-assisted SQL generation. Till brings a unique perspective from his PhD research in databases and his work building AI features at MotherDuck, the serverless data warehouse built on DuckDB.&lt;/p&gt;
&lt;p&gt;We discuss how text-to-SQL has matured from academic benchmarks to practical tools, the importance of documentation and schema context for agents, and the emerging role of MCP servers and skills in customizing agent behavior. Till shares insights on building trust in agent-generated queries, why thinking about what a new hire would need to know helps you build better AI workflows, and how agents are reshaping the analyst&amp;rsquo;s role from SQL writing to question refinement.&lt;/p&gt;
&lt;h2 id="transcript"&gt;Transcript
&lt;/h2&gt;&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; Welcome to Into the Hopper podcast. Today I&amp;rsquo;m sitting down with Till Döhmen, the AI Lead at MotherDuck. With a background spanning PhD research in databases to ML ops engineering at Hopsworks, Till sits at the intersection of database theory and modern AI. Welcome, Till.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Till:&lt;/strong&gt; Thanks for having me.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; We&amp;rsquo;re going to dive into AI-assisted SQL generation, which is something I&amp;rsquo;ve explored a little bit but don&amp;rsquo;t know a lot about. We&amp;rsquo;ll move beyond the simple text-to-query prompts to discuss agentic workflows, optimization, and whether letting AI write our queries changes how we understand our data.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Till:&lt;/strong&gt; Great.&lt;/p&gt;
&lt;hr&gt;
&lt;h3 id="about-motherduck"&gt;About MotherDuck
&lt;/h3&gt;&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; First of all, do you want to tell us about MotherDuck?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Till:&lt;/strong&gt; Yeah, for sure. I joined MotherDuck two and a half years ago, after ChatGPT came out. I joined as the AI guy. Before that I was working at Hopsworks, mostly working on classical machine learning ops. Hopsworks is a feature store platform.&lt;/p&gt;
&lt;p&gt;Coming from my PhD research, I was super interested in how we can automate data engineering for data science pipelines, and that felt like a great place to work on these types of problems. Hopsworks was doing feature processing and data engineering workloads on that platform. They were using Spark, and for small and medium-sized datasets that felt a little bit painful because of startup times and so on. There was a demo we were doing for customers—I think it took seven minutes or so to run end to end.&lt;/p&gt;
&lt;p&gt;My biggest project there ended up being migrating these workflows to a second path where we could do the data engineering with DuckDB. That got me even more interested in working more on DuckDB.&lt;/p&gt;
&lt;p&gt;Research-wise, I was looking into foreign key detection—how can we identify which columns in tables have foreign key references in the database? LLMs turned out to be really useful for that. The second step was realizing that LLMs are actually really good at translating text to SQL. Surprisingly good at that time, though still far away from being super useful in practice. But that&amp;rsquo;s how my journey at MotherDuck started.&lt;/p&gt;
&lt;p&gt;MotherDuck is a data warehouse for analytics based on DuckDB. We run DuckDB instances in the cloud, and users use those instances to process their queries. It&amp;rsquo;s integrated pretty nicely with the local DuckDB client. People typically use DuckDB or Polars to process data locally very efficiently—you can process tens or even hundreds of gigabytes on your local machine.&lt;/p&gt;
&lt;p&gt;When the dataset gets really big, or when you&amp;rsquo;re processing data stored in the cloud—say you want to analyze a hundred gigabytes or a terabyte of partitioned data, or Iceberg tables—the processing power of the local machine isn&amp;rsquo;t optimal anymore. We provide a very simple way to connect your local DuckDB client to the instances we host and expand the processing capabilities seamlessly. You can keep writing the same queries, but the processing happens on the server side when needed.&lt;/p&gt;
&lt;p&gt;The other benefits are data sharing. DuckDB is usually single-player mode—you have a local database file or just data in memory. With MotherDuck, you can work more within teams. You can share data between users, have a shared workspace. You also have a web UI where you can query DuckDB—in that case, DuckDB runs in the browser with WebAssembly.&lt;/p&gt;
&lt;p&gt;That&amp;rsquo;s also pretty interesting from a technology standpoint because you can do a lot of interesting things. Potentially you can save the network overhead—you don&amp;rsquo;t have to go to the server for everything. It&amp;rsquo;s a web application, but still you have an entire database running in your browser that can do query parsing and binding. That also plays a role in the AI features we have at MotherDuck.&lt;/p&gt;
&lt;hr&gt;
&lt;h3 id="building-ai-features-at-motherduck"&gt;Building AI Features at MotherDuck
&lt;/h3&gt;&lt;p&gt;&lt;strong&gt;Till:&lt;/strong&gt; I was the AI guy, so I did everything that involved AI. In the beginning I worked on text-to-SQL features. But they weren&amp;rsquo;t at the point yet where they were useful enough to position prominently—this was end of 2023. We couldn&amp;rsquo;t confidently put that in the primary path and say, &amp;ldquo;Hey, when you come to the UI, you don&amp;rsquo;t have to write SQL anymore. Just give us your question and we&amp;rsquo;ll generate the SQL for you.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;It was more framed as a way to help SQL analysts be a little more efficient. And then it turns out, when you focus on that, there are other problems worth solving beyond text-to-SQL.&lt;/p&gt;
&lt;p&gt;For example, fixing queries. That&amp;rsquo;s very common—you type a query, mix up two letters, and get an error message. DuckDB actually has pretty useful error messages. But still, it&amp;rsquo;s so obvious when the database says, &amp;ldquo;This table does not exist. Did you mean this other table?&amp;rdquo; And 99.9% of the time, yes, that&amp;rsquo;s exactly what I meant. Can you please just use it?&lt;/p&gt;
&lt;p&gt;That&amp;rsquo;s basically what this feature we call &amp;ldquo;Fix It&amp;rdquo; does. It kicks in and asks a language model to do the fix for the user. We added other things like editing queries—if you&amp;rsquo;re not sure how a function signature looks, you can mark that part of the query and describe in natural language what kind of change you want to apply. Those things have been very useful.&lt;/p&gt;
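The auto-fix loop Till describes can be sketched roughly as follows. This is an illustrative stub, not MotherDuck&amp;rsquo;s actual implementation: the table names and the `fix_missing_table` helper are hypothetical, and a real &amp;ldquo;Fix It&amp;rdquo; would hand the failing query plus the database&amp;rsquo;s error message to a language model rather than use fuzzy string matching.

```python
import difflib

# Hypothetical schema; a real implementation would read this from the catalog.
KNOWN_TABLES = ["customers", "orders", "order_items"]

def fix_missing_table(query: str, bad_name: str) -> str:
    """Replace a misspelled table name with the closest known table, if any.

    Mimics the database's "did you mean?" suggestion, but applies it
    automatically instead of just reporting it back to the user.
    """
    matches = difflib.get_close_matches(bad_name, KNOWN_TABLES, n=1, cutoff=0.8)
    if not matches:
        return query  # nothing confidently close; leave the query alone
    return query.replace(bad_name, matches[0])

print(fix_missing_table("SELECT * FROM ordrs", "ordrs"))
```

The `cutoff` keeps the fix conservative: if no table name is confidently close, the query is returned unchanged and the user sees the original error.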
&lt;hr&gt;
&lt;h3 id="the-history-of-text-to-sql"&gt;The History of Text-to-SQL
&lt;/h3&gt;&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; Do you know much about the world pre-2023 of automation in SQL generation?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Till:&lt;/strong&gt; I don&amp;rsquo;t think I&amp;rsquo;ve been around long enough in the data world to really go back to the beginnings. When I hear Jordan talk about this, he refers to things that happened probably when I was still a student—10, 12, 15 years ago. And the history goes back even further.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; Probably to the seventies.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Till:&lt;/strong&gt; I think it&amp;rsquo;s even part of the philosophy of SQL when you read the very first papers. They wanted to design it to be close to natural language. So this vision has always been there.&lt;/p&gt;
&lt;p&gt;But when I started looking into it, there was a ton of research around text-to-SQL. Not so many products, but a lot of papers. It was the time of fine-tuning—everyone was fine-tuning small models for text-to-SQL. Before large language models, it was BERT-style models or very small transformers.&lt;/p&gt;
&lt;p&gt;There were these benchmarks—Spider and others—that&amp;rsquo;s what people were typically competing on. When you look at the benchmarks, they&amp;rsquo;re very simple: small databases with three to five tables, very well-known columns. They&amp;rsquo;re not really a reflection of enterprise database reality. So there was always this limitation because the results didn&amp;rsquo;t translate to the real world.&lt;/p&gt;
&lt;hr&gt;
&lt;h3 id="the-state-of-text-to-sql-today"&gt;The State of Text-to-SQL Today
&lt;/h3&gt;&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; I&amp;rsquo;m interested in what&amp;rsquo;s possible today from two angles. First, if I&amp;rsquo;m an analyst with a business problem I think I can describe, and I don&amp;rsquo;t want to write a bunch of CTEs and conditionals—how far can I get using natural language to generate queries? And how does that differ from other agentic software development?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Till:&lt;/strong&gt; I remember about four years ago, there was a Hacker News discussion about what&amp;rsquo;s exciting about LLMs. Some people talked about text-to-SQL, others about code generation. Code generation felt much more unattainable at that time. SQL has a relatively simple grammar compared to a general-purpose programming language.&lt;/p&gt;
&lt;p&gt;When I think back to that time and where we are now—where people routinely use AI tools for coding—that has changed massively. That makes me say more comfortably that language models are actually really good at doing these types of things in SQL.&lt;/p&gt;
&lt;p&gt;I use LLMs as a coding assistant. I so often use them to debug user queries. When a customer has some issue with a query, I have conceptual ideas in my mind for how to optimize it, but these might be huge queries. Especially for experimentation—could this approach be useful? Could the other one? I just let them do it more efficiently than I could ever do manually.&lt;/p&gt;
&lt;p&gt;But it&amp;rsquo;s in a scenario where I assume the user has some degree of SQL knowledge, and maybe is also opinionated about how the SQL should be written. So I can see some friction, especially when models are used out of the box without additional context or prompting. Which dialect should they even write? DuckDB is not that prevalent in the pre-training data. There are things in the DuckDB documentation that models out of the box aren&amp;rsquo;t aware of.&lt;/p&gt;
&lt;hr&gt;
&lt;h3 id="mcp-servers-and-documentation"&gt;MCP Servers and Documentation
&lt;/h3&gt;&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; Say I&amp;rsquo;m a Postgres developer wanting to do analytics. What&amp;rsquo;s the best toolset out there? Can I just tell Claude Code I have access to psql, say &amp;ldquo;go learn my schemas,&amp;rdquo; and then give it my prompt? Or are there more specific agents doing this better?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Till:&lt;/strong&gt; There are multiple aspects to the answer. The question understanding part and translating that to a query that semantically makes sense—you give the LLM access to the schemas so it knows the tables, the column names. It can potentially also explore the data a bit, because data content is quite important for filters and so on. If the model doesn&amp;rsquo;t have access to data content, it&amp;rsquo;s pretty certain it&amp;rsquo;ll hallucinate filter parameters.&lt;/p&gt;
&lt;p&gt;Providing that access nowadays is super simple. You give an agent the right tools. There&amp;rsquo;s a Postgres MCP server. I would probably just take Cursor or Claude Code or whatever tool I like, plug in a Postgres MCP server, and start kicking the tires. I&amp;rsquo;d expect that to work fairly reliably.&lt;/p&gt;
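One cheap way to give the model the data content Till mentions is to sample the most common distinct values of a column and include them in the agent&amp;rsquo;s context, so a filter gets written against the value that actually appears (say, &amp;ldquo;United States&amp;rdquo;) rather than a hallucinated &amp;ldquo;USA&amp;rdquo;. A minimal stdlib-only sketch, with hypothetical rows standing in for a real table:

```python
from collections import Counter

# Hypothetical rows standing in for a real table's contents.
rows = [
    {"country": "United States"},
    {"country": "Germany"},
    {"country": "United States"},
]

def sample_values(rows, column, k=5):
    """Most common distinct values of a column, for the agent's context."""
    counts = Counter(r[column] for r in rows)
    return [value for value, _ in counts.most_common(k)]

print(sample_values(rows, "country"))
```

An MCP server would typically expose something like this as a tool the agent can call before writing its `WHERE` clause.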
&lt;p&gt;I recently wrote a research paper about this—it&amp;rsquo;s not published yet. I was looking at the DuckDB and Postgres database documentation, every single function, every single statement, and probing language models for their knowledge about these things.&lt;/p&gt;
&lt;p&gt;It was interesting that even Postgres functions that have been around for quite some time—the frontier models have very little knowledge about some function names, parameters, and how to use them. It&amp;rsquo;s a small subset of functionality, but I could imagine as a database admin or data engineer deep into optimizing certain aspects of workloads, I care about those features. At that point you actually need to provide the human-written documentation as context to the LLM.&lt;/p&gt;
&lt;p&gt;Some MCP servers do that. It&amp;rsquo;s basically a RAG pipeline that chunks and indexes the documentation and makes it retrievable through natural language questions. If that&amp;rsquo;s exposed in the MCP server, the agent can call an &amp;ldquo;ask docs question&amp;rdquo; tool when it&amp;rsquo;s in doubt about functionality. It passes a natural language question—&amp;ldquo;How does this feature work?&amp;rdquo;—and gets an answer grounded in the documentation.&lt;/p&gt;
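The shape of such an &amp;ldquo;ask docs question&amp;rdquo; tool can be sketched in a few lines. A real server would use embedding-based retrieval over properly chunked documentation; this keyword-overlap version and the doc snippets are purely illustrative:

```python
def ask_docs(question: str, chunks: list[str]) -> str:
    """Return the documentation chunk that best matches the question.

    Illustrative keyword-overlap retrieval; a production MCP server would
    rank chunks with embeddings instead.
    """
    q_words = set(question.lower().split())

    def score(chunk: str) -> int:
        return len(q_words & set(chunk.lower().split()))

    return max(chunks, key=score)

# Made-up documentation chunks for the sketch.
docs = [
    "list_aggregate applies an aggregate function to the elements of a list.",
    "ATTACH adds a new database file to the catalog.",
]
print(ask_docs("how does list_aggregate work", docs))
```

The agent calls the tool with its question, gets back the grounding text, and writes its answer (or its SQL) against that instead of its pre-training memory.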
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; So your MCP can talk to the documentation as well as your own data?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Till:&lt;/strong&gt; Exactly. We&amp;rsquo;re actually using the exact same documentation chatbot for the MCP that we use for our human-facing documentation.&lt;/p&gt;
&lt;p&gt;The point of the paper I was writing is that these things should actually be built into the database to some degree. There are functions in DuckDB—you can do &lt;code&gt;SELECT * FROM duckdb_functions()&lt;/code&gt; and get a table of all functions with descriptions. But the description field is only populated for half the functions. Database developers assume humans can always go to the documentation website.&lt;/p&gt;
&lt;p&gt;It&amp;rsquo;s similar for Postgres—only 30-40% of functions have descriptions, and the textual descriptions are often uninformative. I believe if database developers invested a little more into built-in documentation, that problem could easily go away, because agents are quite good at searching for information that way.&lt;/p&gt;
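With DuckDB installed, the coverage check Till describes is a single query over `duckdb_functions()`. To keep the sketch dependency-free, the rows below are hypothetical `(function_name, description)` pairs standing in for that catalog table; only the shape of the computation is the point:

```python
# Hypothetical stand-in for:
#   SELECT function_name, description FROM duckdb_functions()
# None and "" model the undocumented half of the catalog Till describes.
rows = [
    ("list_aggregate", "Applies an aggregate function to a list's elements."),
    ("array_slice", None),
    ("regexp_matches", ""),
]

documented = [name for name, desc in rows if desc]
coverage = len(documented) / len(rows)
print(f"{coverage:.0%} of functions have descriptions")
```

The same two-line aggregation over the real catalog is what makes the gap easy to measure—and, as Till argues, easy for database developers to close.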
&lt;hr&gt;
&lt;h3 id="the-challenge-of-undocumented-data"&gt;The Challenge of Undocumented Data
&lt;/h3&gt;&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; What about data documentation? At my previous job at a bank, we had a large data warehouse with data dumps from production tables into AWS data lake storage. It was weakly documented in terms of what the tables meant. Even analysts were looking at the data and trying to infer what columns they were looking at, sometimes even wondering whether a column was a key into another table under a different name. Are agents getting good at discovering those kinds of things? Or is that just a way to shoot yourself in the foot by letting the agent hallucinate what it&amp;rsquo;s seeing?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Till:&lt;/strong&gt; If you give the agent infinite time to do data exploration—you prompt it saying &amp;ldquo;here&amp;rsquo;s my question, please write the SQL to answer it, but before you start, explore the entire database, look at every table, every column, think about what the columns are about, make a plan to explore the database extensively, write a report, and only then start writing the SQL&amp;rdquo;—I think that could work quite effectively.&lt;/p&gt;
&lt;p&gt;But you have to think about it this way: what would happen if you gave a proficient analyst who has never seen your database access to your data warehouse and asked that person the same question? How much of a hard time would this person have? What are the things this person cannot know or cannot infer from the data? The same limitations apply to the LLM.&lt;/p&gt;
&lt;p&gt;If there are things that are really not documented, non-obvious, or not inferable from how the data looks, it definitely needs to be provided in context in some shape or form. But this exploration takes an insane amount of time and it&amp;rsquo;s expensive. You have to run queries and you&amp;rsquo;re wasting tokens. You don&amp;rsquo;t want to do this every single time.&lt;/p&gt;
&lt;p&gt;Another thing is query history, if you have access to that—the queries people have run in the past, sometimes even comments in those queries. That&amp;rsquo;s also very good for the agent to dig into.&lt;/p&gt;
&lt;p&gt;I think this exploration of schema and query history and collection of relevant context shouldn&amp;rsquo;t happen on the user path where the user asks a question. It should be considered part of data engineering work or data maintenance work.&lt;/p&gt;
&lt;p&gt;If I create a new table as a data engineer, I would usually have written up what the data is about in a data catalog or on an internal documentation page. Maybe I&amp;rsquo;ve sent a Slack message or added info to a Linear ticket. But nowadays, it&amp;rsquo;s probably also part of the job to document this in a form that agents can consume easily.&lt;/p&gt;
&lt;p&gt;I wish I could tell you there&amp;rsquo;s an established standard for how to do this. Different companies try different things, and we&amp;rsquo;re one of those companies with a particular approach. DuckDB and other databases have table and column comments in the metadata catalog. You can add natural language comments. It&amp;rsquo;s a bit underused, but we recommend people start with using those to add context.&lt;/p&gt;
&lt;p&gt;There&amp;rsquo;s a gap in our product where I think there&amp;rsquo;s a real need to capture knowledge at a database level and organization level. Our MCP server could surface this information to the agent. It could be as simple as a markdown document.&lt;/p&gt;
&lt;p&gt;Other companies go more toward the semantic layer direction—a much more structured format to represent semantic information about your data. Which columns and tables are related—this goes back to foreign key detection.&lt;/p&gt;
&lt;p&gt;My experience specifically with foreign keys is that LLMs are really good out of the box at finding them. But if the schema is very complex, or you have multiple possible tables—staging or landing tables, different transformation layers side by side—you don&amp;rsquo;t want the join path to use unclean data.&lt;/p&gt;
&lt;hr&gt;
&lt;h3 id="building-trust-in-agent-generated-sql"&gt;Building Trust in Agent-Generated SQL
&lt;/h3&gt;&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; I tweeted three years ago when ChatGPT was fresh that World War III was going to be started by AI-generated SQL. My fear is that this potentially makes it seem too easy. It&amp;rsquo;s somewhat easy to get syntactically correct SQL or something that looks like SQL, and then with an agent, something that runs.&lt;/p&gt;
&lt;p&gt;Two questions. One: what are good practices to build confidence that agent-generated SQL is correct? As anyone who&amp;rsquo;s spent time with SQL knows, that&amp;rsquo;s also a challenge with human-generated SQL—testing SQL is a hard problem.&lt;/p&gt;
&lt;p&gt;Two: do you have a concern that by not writing our SQL, we&amp;rsquo;re not thinking about our data enough?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Till:&lt;/strong&gt; I would say we&amp;rsquo;re thinking about it in different ways now.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; So what practices can help us know that agent-generated queries are correct?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Till:&lt;/strong&gt; It really depends on the type of question you&amp;rsquo;re asking and what the stakes are. If I&amp;rsquo;m doing exploratory data analysis, just interested in digging into things, I always have the option to question certain results. If the agent gives me an answer that feels off, I have an intuition for the data. Building that intuition is a starting point.&lt;/p&gt;
&lt;p&gt;But an agent can be useful for that as well. If I were doing it manually, I&amp;rsquo;d run a &lt;code&gt;SELECT *&lt;/code&gt; on the table, look at summary statistics, and drill down into dimensions. The agent can do the same thing for me. The SQL it generates is typically not that complex. If I want to build trust, I can double-check the SQL. Also, there&amp;rsquo;s no reason for the agent to be intentionally deceptive. On average, the things the agent does are going in the right direction.&lt;/p&gt;
&lt;p&gt;Maybe it&amp;rsquo;s off on a few things because it&amp;rsquo;s making implicit assumptions that are wrong. Part of the issue could be the prompting or not having provided enough context. As I said earlier, it&amp;rsquo;s important to think: how would someone approach this who has never seen that data before? If I have a new hire onboarding today and ask them to calculate this metric, could that person succeed? What would be the typical mistakes? What can I do to help that person or the agent succeed?&lt;/p&gt;
&lt;p&gt;This exploration helps build intuition about the data either way, whether the SQL is hand-written or agent-written. If I have doubts, I can ask the model to explain it.&lt;/p&gt;
&lt;p&gt;I believe in this Socratic questioning approach. If I just keep questioning and asking, and there are no contradictions in the evidence, I can build up certainty and trust. I don&amp;rsquo;t need to know the answers, but by double-checking and poking, I can find out whether what the agent said is correct.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; An interesting thing with SQL is you can do the inverse. Generate the SQL and then in a different session with new context, say &amp;ldquo;given this SQL, tell me what it does&amp;rdquo; and see if that matches what you think.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Till:&lt;/strong&gt; Yeah.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; That&amp;rsquo;s a neat feature of Cursor now—you can run multiple models and have multiple responses.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Till:&lt;/strong&gt; Oh, I didn&amp;rsquo;t know that.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; You click which models you want, each model can run multiple times, and you review the outputs. If reviews are coming in similar from OpenAI, Google, and Anthropic, I feel pretty confident.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Till:&lt;/strong&gt; One of the things I really like about Claude Code is how self-critical it can be at times. It almost feels like it stops and thinks for a second whether the direction it&amp;rsquo;s going is good or not. This is not a quality of the base model itself—if I use Opus in Claude Code versus Cursor versus the Claude web app, it behaves very differently. A lot of the behavior that makes these agents good comes down to the implementation of the agent on top of the model.&lt;/p&gt;
&lt;p&gt;It&amp;rsquo;s the same with the handwriting you mentioned, or with writing itself. It&amp;rsquo;s a way to train the mind, keep the mind sharp, and think more clearly about what I&amp;rsquo;m writing. I think it&amp;rsquo;s true in some sense that at this level of abstraction, you&amp;rsquo;ll probably start to lose a little bit of intuition. But it frees up mental space for other things.&lt;/p&gt;
&lt;p&gt;For analysts—I&amp;rsquo;m not an analyst myself, I have to think about it from the outside—I could imagine this really frees up time. If SQL is not the main part of my work anymore&amp;hellip; I&amp;rsquo;ve written a lot of SQL in my life and never gotten really good at it. I&amp;rsquo;m always dreading having to write a window function. I&amp;rsquo;d rather try to do the analysis in a slightly different way so I don&amp;rsquo;t have to write that stupid window function.&lt;/p&gt;
&lt;p&gt;With the agent, I can just let it do that. All the ideas I have of things that could be interesting, I can play through all of them with very little effort. Once I&amp;rsquo;m through that, I have mental capacity left to think about what other data sources could be interesting to integrate—something I wouldn&amp;rsquo;t have had time for if I&amp;rsquo;d spent all my time just figuring out the SQL syntax.&lt;/p&gt;
&lt;p&gt;It has the potential to reshape the scope of the analyst&amp;rsquo;s work. At the same time, I could start thinking more about the actual business question.&lt;/p&gt;
&lt;p&gt;We were talking to a company working on a text-to-SQL agent. One interesting story: their customers were using the agent to help business users refine their questions. The agent was basically a thinking partner. The user would say, &amp;ldquo;I kind of want to look into that, but I don&amp;rsquo;t even know if we have data for that.&amp;rdquo; The agent would go look. At the end, the output was a very refined question—and then they sent it to the human analyst to write the SQL. But this whole part of refining the question was also taking up a lot of time.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; I&amp;rsquo;ve found in Claude Code you can use the Ask User Question tool to get it to help me think through what problem I&amp;rsquo;m trying to solve. It surfaces things I know I should be thinking about but don&amp;rsquo;t always think about. That&amp;rsquo;s a very helpful way to work.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Till:&lt;/strong&gt; Indeed.&lt;/p&gt;
&lt;hr&gt;
&lt;h3 id="skills-and-agent-customization"&gt;Skills and Agent Customization
&lt;/h3&gt;&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; Let&amp;rsquo;s talk about skills—this idea Anthropic came up with and is now making available to others. What are agent skills providing in this realm? You and I talked a little about being able to encode business-specific things.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Till:&lt;/strong&gt; First of all, with skills, just as with MCP itself, Claude Code, and Sonnet, I have the feeling Anthropic really understands their users well. I really like skills because they fill a gap—shareable prompts and context—that was difficult to address before.&lt;/p&gt;
&lt;p&gt;When I think about agents, MCPs, and skills and how they work together: agents and tools are pretty clear. MCPs give the agent the capability to use tools—which could be a database, with tools like &amp;ldquo;run this query&amp;rdquo; or &amp;ldquo;list all tables.&amp;rdquo; There&amp;rsquo;s only a limited amount of context the MCP server comes with: the set of tools, short descriptions of what they do, and some clients support an initial system prompt.&lt;/p&gt;
&lt;p&gt;In that prompt you can explain things—we explain specifics of DuckDB SQL syntax, for example. But when you go beyond that, especially if you want to customize things, the MCP server doesn&amp;rsquo;t give you that option unless you fork it and change the system prompt.&lt;/p&gt;
&lt;p&gt;Skills are shareable in Claude Code, and that&amp;rsquo;s what I really like. What&amp;rsquo;s missing in the skill ecosystem overall is a general sharing mechanism. There&amp;rsquo;s an effort that Val did, there&amp;rsquo;s this npx skill distribution mechanism, and I also saw there&amp;rsquo;s something coming in the MCP spec, potentially a proposal for surfacing skills through an MCP service.&lt;/p&gt;
&lt;p&gt;Currently the biggest hurdle is distribution of skills. But once that&amp;rsquo;s solved, they really let you customize the behavior of the agent to a useful extent. Skills can be not only prompts but also scripts. Coding agents with skills that contain scripts can be really powerful and provide better guardrails.&lt;/p&gt;
&lt;p&gt;If the agent figures out the user has a specific intent and there&amp;rsquo;s a script for that in the skill, it can use the script instead of writing its own Python or its own solution—which might be wrong. These skills provide better guardrails. And they&amp;rsquo;re plug and play.&lt;/p&gt;
&lt;p&gt;Jacob, one of our DevRel folks, released an Observable dashboard skill. If I load this skill in Claude Code, the agent suddenly has the capability to write very well-designed dashboards following a specific design principle. It also gives the agent knowledge of how to weave in MotherDuck for live-querying dashboards. Skills are pretty cool and useful for customization of agent behavior.&lt;/p&gt;
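In Anthropic's format, a skill is just a folder containing a `SKILL.md` file whose YAML frontmatter tells the agent when to load the body. The example below is invented to show the shape, not MotherDuck's actual skill:

```markdown
---
name: duckdb-sql-helper
description: Guidance for writing DuckDB SQL. Use when the user asks
  about DuckDB syntax or query optimization.
---

# DuckDB SQL Helper

- Prefer `duckdb_functions()` to discover available functions.
- DuckDB supports `SELECT * EXCLUDE (col)` to drop columns from a result.
```

Only the `name` and `description` are loaded up front; the body (and any bundled scripts) is read when the agent decides the skill is relevant, which keeps context usage small.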
&lt;hr&gt;
&lt;h3 id="memory-and-the-future-of-agents"&gt;Memory and the Future of Agents
&lt;/h3&gt;&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; Another thing people are talking about more—we discussed it in my last podcast—is memory as a useful tool.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Till:&lt;/strong&gt; Yeah.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; I&amp;rsquo;m not sure everybody agrees on what memory means.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Till:&lt;/strong&gt; Indeed. I almost think of skills as a very primitive form of memory. If the agent had a way to also author skills—it&amp;rsquo;s basically just markdown—that&amp;rsquo;s a very simple form of memory. Could be a markdown file that the agent writes and can read.&lt;/p&gt;
&lt;p&gt;I think memory is extremely interesting and will help a lot with making agents better at what they&amp;rsquo;re doing. The current frontier of developments in the AI space is not in the model space anymore—it&amp;rsquo;s more on the agent layer. Memory will be a very important part of the agent layer.&lt;/p&gt;
&lt;p&gt;If you have to explain to your agent over and over again how not to make the same stupid mistake, that&amp;rsquo;s very annoying. If I have a skill I can plug in that already explains to the model how not to do that, great. But it&amp;rsquo;s still some manual effort. With memory, there are a lot of things still to be developed and explored.&lt;/p&gt;
&lt;p&gt;It&amp;rsquo;s not only RAG anymore, or just saving a markdown file. I&amp;rsquo;m very curious where this is going. I don&amp;rsquo;t really have a good answer for what memory looks like today beyond markdown files. I&amp;rsquo;m sure there&amp;rsquo;s a lot of interesting research at labs and in industry on memory architectures. We&amp;rsquo;ll see.&lt;/p&gt;
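The markdown-file version of memory Till describes really is this simple: append lessons to a file, and load the file back into the system prompt next session. A sketch with invented names:

```python
from pathlib import Path

MEMORY = Path("agent_memory.md")  # hypothetical memory file

def remember(note: str) -> None:
    """Append a lesson learned so future sessions can load it."""
    with MEMORY.open("a") as f:
        f.write(f"- {note}\n")

def recall() -> str:
    """Return everything remembered so far, for the system prompt."""
    return MEMORY.read_text() if MEMORY.exists() else ""

remember("Use ILIKE, not LIKE, for case-insensitive matching in DuckDB")
print(recall())
```

Everything beyond this—deciding what is worth remembering, deduplicating, retrieving only relevant memories—is the open research territory he alludes to.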
&lt;hr&gt;
&lt;h3 id="looking-forward"&gt;Looking Forward
&lt;/h3&gt;&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; That brings me to a great concluding question. If we look back two years, we didn&amp;rsquo;t really have agent tools. We didn&amp;rsquo;t have Claude Code for sure; maybe Cursor was starting to be around. What do you think the next two years look like? At MotherDuck or in the industry in general?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Till:&lt;/strong&gt; This is almost impossible to answer. Thinking about how the world looked two years ago, and assuming things are maybe exponentially accelerating—where are we going to be? I have no idea.&lt;/p&gt;
&lt;p&gt;As I said, I think short to midterm, these agent architectures and memory are probably a space where developments will happen.&lt;/p&gt;
&lt;p&gt;I wish context windows would become bigger. That could also solve a bit of the memory and retrieval problems.&lt;/p&gt;
&lt;p&gt;I also find things like diffusion models and new architectures for LLMs interesting. And faster inference: if you&amp;rsquo;re working with Claude Code, it&amp;rsquo;s great, but it can be a little bit slow.&lt;/p&gt;
&lt;p&gt;If I get off this call, I&amp;rsquo;ll have 20 more things on my mind, but right now I&amp;rsquo;m blank.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; The less you say now, the better you&amp;rsquo;ll look in two years.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Till:&lt;/strong&gt; I guess so. But thanks so much for having me. It was fun.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; Any places people can find you online besides motherduck.com?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Till:&lt;/strong&gt; People can go to my LinkedIn. I&amp;rsquo;m not active on Twitter or X. LinkedIn is a great place to connect.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; Thanks for joining me on Into the Hopper.&lt;/p&gt;</description></item><item><title>Build for the Model Six Months from Now</title><link>https://tdhopper.com/blog/build-for-the-model-six-months-from-now/</link><pubDate>Fri, 30 Jan 2026 00:00:00 +0000</pubDate><guid>https://tdhopper.com/blog/build-for-the-model-six-months-from-now/</guid><description>&lt;p&gt;Boris Cherny, who built Claude Code at Anthropic, &lt;a class="link" href="https://www.youtube.com/watch?v=PTkE7eaPJvY" target="_blank" rel="noopener"
&gt;recently shared&lt;/a&gt; advice he received from Anthropic co-founder Ben Mann:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Don&amp;rsquo;t build for the model of today, build for the model 6 months from now.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;When Cherny first developed Claude Code, it wasn&amp;rsquo;t a great product. The models weren&amp;rsquo;t capable enough, and he only used the tool for about 10% of his own coding. But Mann pushed him to trust the scaling laws and design for where models would be, not where they were.&lt;/p&gt;
&lt;p&gt;Six months later, with the release of Claude 3.5 Sonnet and Opus, the models caught up to the tool&amp;rsquo;s design. Cherny&amp;rsquo;s usage jumped to 80-90%.&lt;/p&gt;
&lt;p&gt;This is a useful mental model for anyone building AI-powered tools: if you only design for today&amp;rsquo;s capabilities, your product may be obsolete by the time it ships.&lt;/p&gt;</description></item><item><title>The Evolution of AI Agents with Ben Labaschin</title><link>https://tdhopper.com/blog/the-evolution-of-ai-agents-with-ben-labaschin/</link><pubDate>Fri, 09 Jan 2026 16:41:00 -0500</pubDate><guid>https://tdhopper.com/blog/the-evolution-of-ai-agents-with-ben-labaschin/</guid><description>&lt;img src="https://tdhopper.com/images/podcast.png" alt="Featured image of post The Evolution of AI Agents with Ben Labaschin" /&gt;&lt;h2 id="listen"&gt;Listen
&lt;/h2&gt;&lt;iframe src="https://open.spotify.com/embed/episode/2IcWcUKJRdjaXsLKh2J4ca" width="100%" height="232" frameBorder="0" allowtransparency="true" allow="encrypted-media"&gt;&lt;/iframe&gt;
&lt;h2 id="links"&gt;Links
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://twitter.com/EconoBen" target="_blank" rel="noopener"
&gt;Ben&amp;rsquo;s Twitter&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://econoben.dev/" target="_blank" rel="noopener"
&gt;Ben&amp;rsquo;s Website&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://www.workhelix.com" target="_blank" rel="noopener"
&gt;WorkHelix&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://www.oreilly.com/library/view/what-are-ai/9781098159726/" target="_blank" rel="noopener"
&gt;What Are AI Agents - O&amp;rsquo;Reilly Book&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://www.oreilly.com/library/view/managing-memory-for/9798341661257/" target="_blank" rel="noopener"
&gt;Managing Memory for AI Agents - O&amp;rsquo;Reilly Book&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://github.com/steveyegge/beads" target="_blank" rel="noopener"
&gt;Beads Development Tool&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="subscribe"&gt;Subscribe
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://tdhopper.com/podcast/feed" target="_blank" rel="noopener"
&gt;RSS Feed&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://podcasts.apple.com/us/podcast/into-the-hopper/id1499693201" target="_blank" rel="noopener"
&gt;Apple Podcasts&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://open.spotify.com/show/63NrgKMVb0VTwkklGboIjy" target="_blank" rel="noopener"
&gt;Spotify&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://overcast.fm/itunes1499693201/into-the-hopper" target="_blank" rel="noopener"
&gt;Overcast&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="summary"&gt;Summary
&lt;/h2&gt;&lt;p&gt;In this episode of &lt;em&gt;Into the Hopper&lt;/em&gt;, I sit down with Ben Labaschin, a Principal Machine Learning Engineer at WorkHelix and author of the O&amp;rsquo;Reilly books &lt;em&gt;What Are AI Agents&lt;/em&gt; and &lt;em&gt;Managing Memory for AI Agents&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;We dive deep into the current state of AI engineering, moving beyond the hype to discuss the &amp;ldquo;brass tacks&amp;rdquo; of developer workflows. We cover how AI is reshaping personal projects, the shift from &amp;ldquo;coder&amp;rdquo; to &amp;ldquo;conductor,&amp;rdquo; and the specific tool stacks (like Beads and Spec-Driven Development) that Ben uses to manage agent memory effectively.&lt;/p&gt;
&lt;h2 id="transcript"&gt;Transcript
&lt;/h2&gt;&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; Welcome to the Into the Hopper podcast. Today, I&amp;rsquo;m joined by Ben Labaschin, a principal machine learning engineer at WorkHelix, who has been instrumental in the company&amp;rsquo;s growth from seed stage to its $75 million Series A valuation.&lt;/p&gt;
&lt;p&gt;Ben is the author of the O&amp;rsquo;Reilly books &lt;em&gt;What Are AI Agents&lt;/em&gt; and &lt;em&gt;Managing Memory for AI Agents&lt;/em&gt;. His work spans the full stack, from building enterprise causal inference architectures and async-parallelized LLM APIs to delivering over $8 million in savings for global logistics through optimized machine learning systems.&lt;/p&gt;
&lt;p&gt;Beyond the code, Ben is a published researcher focusing on firm-level exposure to LLMs and the economic impact of those technologies on labor. Welcome, Ben.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Ben:&lt;/strong&gt; Hey, man. I appreciate it. Thanks for having me, Tim.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; You and I have known each other a while through the internet, though I think this is really the first time we&amp;rsquo;ve kind of talked one-on-one.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Ben:&lt;/strong&gt; Yeah. You know, the internet can be really easy to communicate, but it&amp;rsquo;s great to be able to actually speak to you in person.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; Yeah. We&amp;rsquo;re both former NormConf speakers. Shout out to Vicki Boykis. We&amp;rsquo;re speaking on our official NormConf Shure microphones, which sound good.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Ben:&lt;/strong&gt; It&amp;rsquo;s one of the best things to come from the conference other than all the awesome talks.&lt;/p&gt;
&lt;hr&gt;
&lt;h3 id="ai-workflows-and-developer-experience"&gt;AI Workflows and Developer Experience
&lt;/h3&gt;&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; This is an interesting podcast for me because I&amp;rsquo;ve gotten more and more interested in AI workflows and how AI is impacting the developer experience. I did three interviews last year. And obviously, it&amp;rsquo;s been a wild ride of 2025 and so many changes from what we were doing.&lt;/p&gt;
&lt;p&gt;I interviewed Ravi Modi and at the time he talked about how his workflow was still copying and pasting code snippets into ChatGPT and that was his iterative process. I believe from talking to him he&amp;rsquo;s given up on that. But that&amp;rsquo;s what a lot of us were doing at least 18 months ago. And now, I think agents are taking over the world. You&amp;rsquo;re an interesting case because not only are you using these tools as a developer, but your company is also looking at these tools from a different angle.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Ben:&lt;/strong&gt; That&amp;rsquo;s right. I mean, I&amp;rsquo;ve gotten the benefit of seeing both sides of things. We kind of started doing that work before AI agents really were a thing. So that&amp;rsquo;s been a fascinating experience to see how companies were anticipating what was coming versus when it actually came.&lt;/p&gt;
&lt;p&gt;The long and short of that is that companies really weren&amp;rsquo;t interested in being told, &amp;ldquo;Hey, there&amp;rsquo;s this new paradigm coming, you should prepare for it.&amp;rdquo; When it actually happened, they weren&amp;rsquo;t prepared for it. And now they&amp;rsquo;re starting to adapt to it. So yeah, I get to use the tools and then I get to see how individuals actually use them at enterprise companies and sort of trying to measure the impact of that usage.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; What&amp;rsquo;s the TLDR of what WorkHelix provides for the companies?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Ben:&lt;/strong&gt; Basically what we do is we do a deep dive into a company&amp;rsquo;s labor system and say, &lt;em&gt;&amp;ldquo;Here&amp;rsquo;s where your workers are. Here are some opportunities to leverage AI. Oh, you&amp;rsquo;re already leveraging AI? Let&amp;rsquo;s take that work that you&amp;rsquo;re doing and measure quantifiably the impact that&amp;rsquo;s having in real dollars on your company.&amp;rdquo;&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Because the philosophy is: &lt;strong&gt;if you can&amp;rsquo;t measure it, then you can&amp;rsquo;t really manage it.&lt;/strong&gt; And we&amp;rsquo;re trying to help companies manage it because, quite frankly, I don&amp;rsquo;t think there&amp;rsquo;s a real grasp on how to manage using these tools, what&amp;rsquo;s effective, and what&amp;rsquo;s not.&lt;/p&gt;
&lt;hr&gt;
&lt;h3 id="measuring-ai-impact-on-productivity"&gt;Measuring AI Impact on Productivity
&lt;/h3&gt;&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; Are AI agents making developers more efficient?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Ben:&lt;/strong&gt; It&amp;rsquo;s a really loaded conversation. Constantly you&amp;rsquo;ll see published papers that seemingly contradict each other. One paper says, &amp;ldquo;Game&amp;rsquo;s over. All jobs are going to be replaced.&amp;rdquo; Another says, &amp;ldquo;We have evidence here that work can&amp;rsquo;t be replaced.&amp;rdquo; You can probably guess that it&amp;rsquo;s somewhere in the middle.&lt;/p&gt;
&lt;p&gt;If you measure it as &lt;strong&gt;more lines of code&lt;/strong&gt;, then the answer is yes. More lines of code are being produced at a rapid speed. But is it effective? By effective, I mean seeing more LLM-generated code deployed into production where end users are actually using it.&lt;/p&gt;
&lt;p&gt;At a personal level, I would say 100%. I am far more effective as a developer with an LLM than not. And the way that I measure that is: &lt;em&gt;How many personal projects am I actually finishing in a given year?&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; Yeah, I think that&amp;rsquo;s something a lot of us are seeing—the ability to work on personal projects.&lt;/p&gt;
&lt;hr&gt;
&lt;h3 id="personal-projects-and-ai-tools"&gt;Personal Projects and AI Tools
&lt;/h3&gt;&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; My output on personal projects has gone down rapidly with my number of children. But two things are being powerful:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Claude Code on the phone:&lt;/strong&gt; The ability to code on the go is so cool.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Codex/Agents grinding:&lt;/strong&gt; I just see on Twitter people talking about how Codex can just grind away on things. The ability to give it a task and just let it go for a few hours has been very cool.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;I just deployed last night. I was thinking, it would be cool if I could write more applications that I just deploy on my local network. We are both home lab nerds. I told Codex: &lt;em&gt;&amp;ldquo;Get me a VPS, install Coolify on it, and put it behind my Tailscale network.&amp;rdquo;&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;It probably took 15 minutes of interacting with it for what would have been hours of work. That to me is so fun on the personal side.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Ben:&lt;/strong&gt; It is. And I think that there&amp;rsquo;s this universal pressure that we all know exists in this field of wanting to stay up to date and wanting to build more. I know the feeling of not being able to take myself away from a problem. And now it&amp;rsquo;s like, I can check my phone and see it&amp;rsquo;s working, and then I can get back to being present.&lt;/p&gt;
&lt;p&gt;However, I am multitasking in ways that I never did before. I have five different terminals open, perhaps more, of different projects. No one is forcing me to do that&amp;hellip; but context switching and having all of these things you&amp;rsquo;re attending to in your mind is probably not super healthy.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; The observation folks have had is that we&amp;rsquo;re becoming more like managers or conductors. I look at my manager&amp;rsquo;s calendar and she&amp;rsquo;s in meetings all day while in three different Slack discussions. That&amp;rsquo;s kind of more how we&amp;rsquo;re all operating now.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Ben:&lt;/strong&gt; I think the expectations are slowly going to change for workers too. I haven&amp;rsquo;t felt it yet, but as a hard worker, if I can do seven things at once, suddenly people start to say, &lt;em&gt;&amp;ldquo;Well, this is just the rate that Ben works.&amp;rdquo;&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;The rubber will meet the road when the quality of work starts going down or things start getting messed up. That&amp;rsquo;s when people start saying, &amp;ldquo;Okay, we got to make sure that there&amp;rsquo;s not too much being worked on simultaneously.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; I agree. One of the things we&amp;rsquo;re still emphasizing a lot is human review. I&amp;rsquo;ve been working on a Claude skill to review code. Using it to look for the things that &lt;em&gt;I&lt;/em&gt; want to look for that are relevant to our team—it&amp;rsquo;s finding bugs that I never would have found.&lt;/p&gt;
&lt;p&gt;I’ve been pondering: How analogous is this to the transition off of punch card machines 40 years ago? You could bemoan that we missed the time when we had to really think through the instructions before putting them into the machine. Now, you get into a terminal and you can iterate fast.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Ben:&lt;/strong&gt; Looking back at economic history is generally very helpful. I look back at the 18th and 19th centuries to understand technological shifts.&lt;/p&gt;
&lt;p&gt;I think the &lt;strong&gt;printing press&lt;/strong&gt; is a good analogy. It was a democratization of information. I feel agents are parallel to that because it&amp;rsquo;s a democratization of being able to &lt;em&gt;build&lt;/em&gt; things. I don&amp;rsquo;t know what the consequences are going to be in 20 years, but I think they&amp;rsquo;ll be seismic.&lt;/p&gt;
&lt;hr&gt;
&lt;h3 id="specialization-in-the-age-of-ai"&gt;Specialization in the Age of AI
&lt;/h3&gt;&lt;p&gt;&lt;strong&gt;Ben:&lt;/strong&gt; One thing I think we wanted to talk about is this idea of specialization. Where does specialization go with the advent of agents? I think it&amp;rsquo;s being flattened. I&amp;rsquo;m not a front-end engineer, but I can now do front-end work because agents are getting better at it.&lt;/p&gt;
&lt;p&gt;However, I would argue that specialization is still going to be valuable. Just because you can generate art with Midjourney doesn&amp;rsquo;t mean the meaning of a human artist painting a picture isn&amp;rsquo;t valuable.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; If you think about the &lt;strong&gt;&amp;ldquo;T-shaped engineer&amp;rdquo;&lt;/strong&gt; idea—breadth and depth—you wonder, does this just enable us to have a bigger T?&lt;/p&gt;
&lt;p&gt;I just merged a front-end task today using the Claude front-end skill on a Next.js project. I don&amp;rsquo;t know JavaScript particularly well, and I definitely don&amp;rsquo;t know Next.js. I implemented something that looks really nice and does what I want. This was previously an infinite-length task for me—go learn all the mechanics of Next.js.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Ben:&lt;/strong&gt; Opus 4.5 has been one of those game-changing moments. It’s more expensive, yes, but it can do more with less, so it starts becoming cheaper because you&amp;rsquo;re not coding as much.&lt;/p&gt;
&lt;p&gt;But there is a detriment. In the past, you would have struggled through AWS documentation or CSS, and by failing, you learned. Junior engineers might leverage these tools to get ahead, but as a consequence, they might not be super knowledgeable about the depth of these things themselves.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; It’s easy to have rose-colored glasses. The joke from 2010 to 2021 was that we just copy and paste things from Stack Overflow. In a lot of ways, we were really bad large language models!&lt;/p&gt;
&lt;p&gt;You have to continue to be &lt;strong&gt;deliberate to learn&lt;/strong&gt;. Yesterday I did some Kubernetes ops work. At the end, I asked my Cloud Code session: &lt;em&gt;&amp;ldquo;Teach me about all the things that we just did. Walk me through the steps.&amp;rdquo;&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Ben:&lt;/strong&gt; I completely agree. The tactics you&amp;rsquo;re using to get that learning might look different, but you have to be deliberate.&lt;/p&gt;
&lt;hr&gt;
&lt;h3 id="toolset-evolution-over-time"&gt;Toolset Evolution Over Time
&lt;/h3&gt;&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; Let&amp;rsquo;s step back and talk about what your toolset has been. How has that changed over the last year?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Ben:&lt;/strong&gt; My toolset has certainly changed. I used to focus on prompts. Now, I have an &lt;code&gt;agents.md&lt;/code&gt; or a &lt;code&gt;claude.md&lt;/code&gt; that I symlink so they are the same for different models.&lt;/p&gt;
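&lt;p&gt;The symlink trick is a one-liner; here is a minimal sketch (the file names follow the common &lt;code&gt;claude.md&lt;/code&gt;/&lt;code&gt;agents.md&lt;/code&gt; conventions, not necessarily Ben&amp;rsquo;s exact setup):&lt;/p&gt;

```shell
# Keep one instruction file and expose it under both conventional names.
cd "$(mktemp -d)"
printf 'Shared agent instructions.\n' > claude.md   # single source of truth
ln -sf claude.md agents.md                          # agents.md points at it
cat agents.md   # prints: Shared agent instructions.
```

&lt;p&gt;Edits to either path then land in the same file, so different tools stay in sync.&lt;/p&gt;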
&lt;p&gt;If there are three things I use, it&amp;rsquo;s:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;GitHub:&lt;/strong&gt; For that local vs. remote tracking of memory.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Beads:&lt;/strong&gt; For consolidating how local memory is being used and breaking those down into atomic tasks.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Spec-Driven Development:&lt;/strong&gt; I think the most important thing you can do to improve the memory and agentic power of your tool is to simply break down the problem into a spec that the tool can always return to.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; I do a lot of spec-driven stuff too. One of the Claude Code developers shared a skill just for building a spec by using the &amp;ldquo;Ask User Question&amp;rdquo; skill. It forces me to think through things.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Ben:&lt;/strong&gt; I have a rule in my &lt;code&gt;claude.md&lt;/code&gt;. I say:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;There should be an &lt;code&gt;architecture.md&lt;/code&gt; (diagrams, high-level goals).&lt;/li&gt;
&lt;li&gt;There should be a log that it returns to.&lt;/li&gt;
&lt;li&gt;Before we do anything, read the architecture, read the readme, read the log.&lt;/li&gt;
&lt;li&gt;Create &lt;strong&gt;Beads&lt;/strong&gt; tasks from our next step in the architecture document.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;It tends to be very effective.&lt;/p&gt;
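&lt;p&gt;A minimal sketch of what such a rule section might look like inside a &lt;code&gt;claude.md&lt;/code&gt;; the file names and wording are illustrative, not Ben&amp;rsquo;s exact rules:&lt;/p&gt;

```markdown
## Project memory rules (illustrative)

- Maintain an `architecture.md` with diagrams and high-level goals.
- Maintain a running `log.md`; append a short entry after each session.
- Before doing anything: read `architecture.md`, the README, and `log.md`.
- Create Beads tasks from the next step in `architecture.md` before writing code.
```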
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; I keep coming back to Fred Brooks in &lt;em&gt;The Mythical Man-Month&lt;/em&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;ldquo;I believe the hard part of building software to be the specification, design, and testing of this conceptual construct, not the labor of representing it.&amp;rdquo;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I’m spending more time articulating what I want than I&amp;rsquo;ve ever done in my entire career.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Ben:&lt;/strong&gt; I just realized&amp;hellip; I would always say, &lt;em&gt;&amp;ldquo;Being in the code helps me figure out what I&amp;rsquo;m going to do.&amp;rdquo;&lt;/em&gt; I&amp;rsquo;m so the opposite now. That&amp;rsquo;s crazy.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; And at the same time, it allows you the freedom to come up with a plan that you can throw away without being wed to it because you didn&amp;rsquo;t spend hours typing it.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Ben:&lt;/strong&gt; That&amp;rsquo;s so astute. My ego is less invested in this code because I put less of myself into the code.&lt;/p&gt;
&lt;hr&gt;
&lt;h3 id="resources--closing"&gt;Resources &amp;amp; Closing
&lt;/h3&gt;&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; This was great. Can you tell us a little bit about your books with O&amp;rsquo;Reilly?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Ben:&lt;/strong&gt; I&amp;rsquo;ve written two publications:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;What Are AI Agents&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Managing Memory for AI Agents&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;My focus for the last year and a half has been agent memory. If we can work with their memory effectively, I believe we can get what we want done faster.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; You can also find more about Ben at &lt;a class="link" href="https://econoben.dev" target="_blank" rel="noopener"&gt;econoben.dev&lt;/a&gt;. Thanks for joining us!&lt;/p&gt;
&gt;all things computer automation&lt;/a&gt;.
Developers are becoming the conductor of the orchestra more than the flute player. I believe 2025 marked a fundamental shift in software development. These tools are changing what our jobs are.&lt;/p&gt;
&lt;p&gt;Here are some practical patterns I&amp;rsquo;ve learned over the last year:&lt;/p&gt;
&lt;h2 id="let-it-drive-your-git"&gt;Let It Drive Your Git
&lt;/h2&gt;&lt;p&gt;I&amp;rsquo;ve embraced letting Claude handle many of my git operations. It&amp;rsquo;s great at finding the right changes to commit and &lt;a class="link" href="https://github.com/tdhopper/dotfiles2.0/blob/master/.claude/skills/creating-pull-requests/SKILL.md" target="_blank" rel="noopener"
&gt;opening PRs&lt;/a&gt;. It&amp;rsquo;s also amazing at more complex tasks like cherry-picks, rebases, and splitting large features into logical PRs.&lt;/p&gt;
&lt;p&gt;I recently had Claude split a large feature into four separate, logically sequenced PRs in a single session. The PRs were much cleaner than I would have done myself, and I &lt;a class="link" href="https://www.johndcook.com/blog/2015/12/22/automate-to-save-mental-energy-not-time/" target="_blank" rel="noopener"
&gt;saved mental energy&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id="start-fresh-when-stuck"&gt;Start Fresh When Stuck
&lt;/h2&gt;&lt;p&gt;When Claude starts giving consistently bad answers, don&amp;rsquo;t keep pushing it. Clear the context and start a new chat. A fixated agent won&amp;rsquo;t suddenly become unfixated. Starting fresh gives you better results faster than trying to course-correct an existing conversation.&lt;/p&gt;
&lt;h2 id="use-it-for-system-automation-not-just-code"&gt;Use It for System Automation, Not Just Code
&lt;/h2&gt;&lt;p&gt;Claude Code is a misnomer; it&amp;rsquo;s a computer automation tool. It&amp;rsquo;s far better at using CLI tools than I&amp;rsquo;ll ever be. I use it for git operations, generating complex shell commands, and even working with cloud CLI tools to quickly learn data patterns from blob storage. Tasks that used to take me time and energy to piece together now happen in seconds.&lt;/p&gt;
&lt;h2 id="build-skills-and-slash-commands-for-repeated-tasks"&gt;Build Skills and Slash Commands for Repeated Tasks
&lt;/h2&gt;&lt;p&gt;I&amp;rsquo;ve developed a &lt;a class="link" href="https://github.com/tdhopper/dotfiles2.0/blob/master/.claude/skills/creating-pull-requests/SKILL.md" target="_blank" rel="noopener"
&gt;skill for opening pull requests&lt;/a&gt; with well-structured, useful messages. I also use a slash command for &lt;a class="link" href="https://github.com/tdhopper/dotfiles2.0/tree/eb26bddb146951c51d94a0ed0cc56cdb5279f8b9/.claude/skills/reviewing-code" target="_blank" rel="noopener"
&gt;code reviews&lt;/a&gt; that focuses on correctness, performance impact, and unnecessary complexity. These tools make repeated workflows consistent and fast.&lt;/p&gt;
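&lt;p&gt;For a sense of the shape, a skill is just a &lt;code&gt;SKILL.md&lt;/code&gt; file with YAML frontmatter that tells Claude when to use it. A hypothetical sketch (the name, description, and steps are illustrative, not my actual files):&lt;/p&gt;

```markdown
---
name: reviewing-code
description: Review a diff for correctness, performance impact, and unnecessary complexity.
---

When asked to review code:

1. Read the full diff, then the surrounding files it touches.
2. Flag correctness bugs first, then performance regressions, then avoidable complexity.
3. Report findings as a prioritized list with file and line references.
```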
&lt;h2 id="fast-feedback-is-essential"&gt;Fast Feedback is Essential
&lt;/h2&gt;&lt;p&gt;The value of fast feedback can&amp;rsquo;t be overstated. Being able to execute code and validate assumptions quickly makes development fundamentally different. Agents benefit from this in the same way as humans; the more they can validate the work, the better their results.&lt;/p&gt;
&lt;p&gt;I think of Claude&amp;rsquo;s capability the way a &lt;a class="link" href="https://www.gcc.edu/Home/Staff-Directory/Staff-Detail/gary-l-thompson" target="_blank" rel="noopener"
&gt;favorite math professor&lt;/a&gt; once described his advantage: &amp;ldquo;I&amp;rsquo;m not smarter than you. I can just recover from mistakes faster than you.&amp;rdquo; Agents recover from errors at speeds humans can&amp;rsquo;t match (and, usually, don&amp;rsquo;t get as frustrated); feedback enables recovery.&lt;/p&gt;
&lt;h2 id="use-it-to-navigate-unfamiliar-code"&gt;Use It to Navigate Unfamiliar Code
&lt;/h2&gt;&lt;p&gt;Agents excel at building understanding of complex codebases. They can grep, search, and trace dependencies far faster than I can manually. When I joined a new project, Claude helped me understand where components fit together and how data flowed through the system.&lt;/p&gt;
&lt;h2 id="define-the-problem-before-starting"&gt;Define the Problem Before Starting
&lt;/h2&gt;&lt;p&gt;The hardest part is still &lt;a class="link" href="https://tdhopper.com/blog/no-silver-bullet/" target="_blank" rel="noopener"
&gt;understanding what problem you&amp;rsquo;re actually solving&lt;/a&gt;. I&amp;rsquo;ve wasted plenty of agent time by not thinking through what I wanted before asking. Spending five minutes clarifying the goal (sometimes &lt;a class="link" href="https://github.com/tdhopper/dotfiles2.0/blob/master/.claude/commands/spec.md" target="_blank" rel="noopener"
&gt;&lt;em&gt;with the agent&lt;/em&gt;&lt;/a&gt;) saves thirty minutes of iterating on the wrong solution.&lt;/p&gt;
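&lt;p&gt;A slash command like that is just a Markdown prompt file under &lt;code&gt;.claude/commands/&lt;/code&gt;. A hypothetical sketch of such a file (the wording is illustrative, not my actual command):&lt;/p&gt;

```markdown
Help me write a spec for the feature I describe.

1. Ask me clarifying questions, one at a time, until the goal, scope, and
   non-goals are unambiguous.
2. Draft a short spec: problem statement, proposed approach, open questions.
3. Wait for my review of the spec before writing any code.
```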
&lt;h2 id="prototype-aggressively-throw-away-freely"&gt;Prototype Aggressively, Throw Away Freely
&lt;/h2&gt;&lt;p&gt;I prototype more now than I ever did before. The cost of trying something dropped dramatically. I&amp;rsquo;ll have Claude generate a quick script or CLI to test an idea, knowing I&amp;rsquo;ll probably throw it away. This changes how I approach problems: I can test assumptions with real implementations.&lt;/p&gt;</description></item><item><title>How I'm Using AI at the End of 2025</title><link>https://tdhopper.com/blog/how-im-using-ai-at-the-end-of-2025/</link><pubDate>Wed, 31 Dec 2025 12:00:00 +0000</pubDate><guid>https://tdhopper.com/blog/how-im-using-ai-at-the-end-of-2025/</guid><description>&lt;img src="https://tdhopper.com/images/puppets.png" alt="Featured image of post How I'm Using AI at the End of 2025" /&gt;&lt;h2 id="chat"&gt;Chat
&lt;/h2&gt;&lt;p&gt;I subscribe to &lt;a class="link" href="https://openai.com/chatgpt/pricing/" target="_blank" rel="noopener"
&gt;ChatGPT Plus&lt;/a&gt; and use 5.2 Thinking for most of my chat sessions. According to my 2025 stats, I had 1,884 chats and 7,661 messages this year. I use it for researching things, asking home and car repair questions, sounding out ideas, generating workout ideas, creating recipes, and answering my 8-year-old&amp;rsquo;s many questions (&amp;ldquo;what was the longest war in history&amp;rdquo;).&lt;/p&gt;
&lt;p&gt;At work, I have access to &lt;a class="link" href="https://openai.com/chatgpt/pricing/" target="_blank" rel="noopener"
&gt;ChatGPT Pro&lt;/a&gt; and occasionally use 5.2 Pro for deeper thinking (e.g., I recently asked &amp;ldquo;What were the best in class ai models at the end of 2025&amp;rdquo;).&lt;/p&gt;
&lt;h2 id="image-generation"&gt;Image Generation
&lt;/h2&gt;&lt;p&gt;&lt;a class="link" href="https://blog.google/technology/ai/nano-banana-pro/" target="_blank" rel="noopener"
&gt;Nano Banana Pro&lt;/a&gt; has become my only tool for image generation. It&amp;rsquo;s just &lt;em&gt;so good&lt;/em&gt;. I don&amp;rsquo;t subscribe to &lt;a class="link" href="https://gemini.google.com" target="_blank" rel="noopener"
&gt;Gemini&lt;/a&gt; personally, but I find the free tier gives me enough for things I want to generate (usually silly).&lt;/p&gt;
&lt;h2 id="code"&gt;Code
&lt;/h2&gt;&lt;p&gt;A year ago, I had not used a coding agent. I mostly relied on copying and pasting code into ChatGPT and autocomplete from &lt;a class="link" href="https://github.com/features/copilot" target="_blank" rel="noopener"
&gt;GitHub Copilot&lt;/a&gt; to write code for me.&lt;/p&gt;
&lt;p&gt;Early in the year, I started using &lt;a class="link" href="https://www.cursor.com" target="_blank" rel="noopener"
&gt;Cursor&lt;/a&gt; at home and at work. By mid-year, I had almost exclusively shifted to &lt;a class="link" href="https://www.anthropic.com/claude/code" target="_blank" rel="noopener"
&gt;Claude Code&lt;/a&gt;, which I use exclusively in &lt;code&gt;--dangerously-skip-permissions&lt;/code&gt; mode. &lt;a class="link" href="https://www.anthropic.com/news/claude-opus-4-5" target="_blank" rel="noopener"
&gt;Opus 4.5&lt;/a&gt; has made that an even better experience: I find it reliably helps me refine what I&amp;rsquo;m asking for and then provides top-tier implementations. I live in Claude Code these days.&lt;/p&gt;
&lt;p&gt;A lesson I&amp;rsquo;ve learned over the past 6 months is that Claude Code isn&amp;rsquo;t just good for code automation but also for computer automation. It&amp;rsquo;s far better at using CLI tools than I will ever be, and its ability to combine them enables automation of lots of work. For example, I rarely use &lt;code&gt;git&lt;/code&gt; directly anymore and let Claude handle it for me (even &lt;a class="link" href="https://github.com/tdhopper/dotfiles2.0/blob/master/.claude/skills/creating-pull-requests/SKILL.md" target="_blank" rel="noopener"
&gt;opening pull requests&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;I subscribe to the $20/month pro plan and have &lt;a class="link" href="https://support.claude.com/en/articles/12429409-extra-usage-for-paid-claude-plans" target="_blank" rel="noopener"
&gt;extra usage&lt;/a&gt; enabled so I can pay for tokens after hitting the Pro limits.&lt;/p&gt;
&lt;h2 id="writing"&gt;Writing
&lt;/h2&gt;&lt;p&gt;As Robert Ghrist &lt;a class="link" href="https://tdhopper.com/blog/writing-a-math-textbook-with-claude-with-professor-ghrist/" target="_blank" rel="noopener"
&gt;told me last year&lt;/a&gt;, Claude has the best writing style. I often rely on it to help me refine things I&amp;rsquo;m saying.&lt;/p&gt;
&lt;h2 id="learning"&gt;Learning
&lt;/h2&gt;&lt;p&gt;More and more, I&amp;rsquo;m using LLMs to help me learn things better, from understanding research to codebases to random topics of interest.&lt;/p&gt;
&lt;p&gt;For random topics, this often starts with a ChatGPT Thinking session with web search enabled.&lt;/p&gt;
&lt;p&gt;For code, I have found Claude Code can do excellent breakdowns and patiently fields my questions in discovering corners I don&amp;rsquo;t understand.&lt;/p&gt;
&lt;p&gt;&lt;a class="link" href="https://notebooklm.google.com" target="_blank" rel="noopener"
&gt;NotebookLM&lt;/a&gt; is amazing at helping me work through a variety of sources. Generating slides based on some documents or sources is a common starting point for my understanding these days and helps me refine the questions I&amp;rsquo;m asking. I also love generating a podcast with a prompt that guides them in the direction I&amp;rsquo;m interested in so I can learn on the go.&lt;/p&gt;
&lt;h2 id="transcription"&gt;Transcription
&lt;/h2&gt;&lt;p&gt;I&amp;rsquo;ve never been a big voice-to-text person because I mumble with a choppy cadence. However, I&amp;rsquo;m learning that the modern tools do not care how annoying I am. I can stumble through my words for 15 minutes and they carefully transcribe and punctuate.&lt;/p&gt;
&lt;p&gt;I have been experimenting with &lt;a class="link" href="https://wisprflow.ai/" target="_blank" rel="noopener"
&gt;Wispr Flow&lt;/a&gt; for personal use; this has largely meant using their iOS keyboard, whose speech-to-text is vastly superior to the built-in Apple option.&lt;/p&gt;
&lt;p&gt;I have &lt;a class="link" href="https://whispermemos.com/" target="_blank" rel="noopener"
&gt;Whisper Memos&lt;/a&gt; connected to my iPhone&amp;rsquo;s Action button. This is a great one-trick-pony tool that starts recording on launch and then can email the transcript somewhere; mine goes to my &lt;a class="link" href="https://getdrafts.com/" target="_blank" rel="noopener"
&gt;Drafts inbox&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I just installed &lt;a class="link" href="https://sotto.to/" target="_blank" rel="noopener"
&gt;Sotto&lt;/a&gt; on my work machine this week; it provides voice transcription with a local model. I&amp;rsquo;m interested in using it for interacting with Slack and Claude Code, but I haven&amp;rsquo;t made much progress yet.&lt;/p&gt;
&lt;p&gt;This is a slide deck version of the key questions from those interviews. If you&amp;rsquo;re wrestling with this decision, I hope it gives you a useful framework.&lt;/p&gt;
&lt;figure&gt;&lt;a href="page-02.png"&gt;&lt;img src="https://tdhopper.com/blog/the-phd-blueprint/page-02.png"
alt="You&amp;#39;re at a Crossroads: You&amp;#39;re academically talented, you enjoy your subject, and you have a funded Ph.D. offer. You also suspect you could get a good job right now. This guide gives you the right questions to ask yourself."&gt;&lt;/a&gt;
&lt;/figure&gt;
&lt;figure&gt;&lt;a href="page-03.png"&gt;&lt;img src="https://tdhopper.com/blog/the-phd-blueprint/page-03.png"
alt="Question 1: Why do you really want a Ph.D.?"&gt;&lt;/a&gt;
&lt;/figure&gt;
&lt;figure&gt;&lt;a href="page-04.png"&gt;&lt;img src="https://tdhopper.com/blog/the-phd-blueprint/page-04.png"
alt="Deconstructing Your Why: For Personal Satisfaction (love your subject), For a Specific Career (become a professor), or As a Default Path (warning against this)."&gt;&lt;/a&gt;
&lt;/figure&gt;
&lt;figure&gt;&lt;a href="page-05.png"&gt;&lt;img src="https://tdhopper.com/blog/the-phd-blueprint/page-05.png"
alt="Question 2: Are you prepared for the reality of the path?"&gt;&lt;/a&gt;
&lt;/figure&gt;
&lt;figure&gt;&lt;a href="page-06.png"&gt;&lt;img src="https://tdhopper.com/blog/the-phd-blueprint/page-06.png"
alt="Specification Sheet: The Ph.D. by the Numbers - Timeline (5-7 years), Career Odds (2-3% tenure track), Financials (10x less than industry), Attrition Rate (~35% drop out)."&gt;&lt;/a&gt;
&lt;/figure&gt;
&lt;figure&gt;&lt;a href="page-07.png"&gt;&lt;img src="https://tdhopper.com/blog/the-phd-blueprint/page-07.png"
alt="A Ph.D. is a fundamentally different experience: Undergrad is learning known information with clear deadlines; Ph.D. is producing new knowledge with little external guidance."&gt;&lt;/a&gt;
&lt;/figure&gt;
&lt;figure&gt;&lt;a href="page-08.png"&gt;&lt;img src="https://tdhopper.com/blog/the-phd-blueprint/page-08.png"
alt="Question 3: Are you built for this?"&gt;&lt;/a&gt;
&lt;/figure&gt;
&lt;figure&gt;&lt;a href="page-09.png"&gt;&lt;img src="https://tdhopper.com/blog/the-phd-blueprint/page-09.png"
alt="The Researcher&amp;#39;s Personality Test: Inquisitive, Disciplined, Obsessive, and Delusional (believing you can uncover something new and important)."&gt;&lt;/a&gt;
&lt;/figure&gt;
&lt;figure&gt;&lt;a href="page-10.png"&gt;&lt;img src="https://tdhopper.com/blog/the-phd-blueprint/page-10.png"
alt="The Skills Checklist: Technical (analytical software, programming), Academic (self-study, reading literature), Communication (structured writing)."&gt;&lt;/a&gt;
&lt;/figure&gt;
&lt;figure&gt;&lt;a href="page-11.png"&gt;&lt;img src="https://tdhopper.com/blog/the-phd-blueprint/page-11.png"
alt="Question 4: What are your alternatives (and have you seriously considered them)?"&gt;&lt;/a&gt;
&lt;/figure&gt;
&lt;figure&gt;&lt;a href="page-12.png"&gt;&lt;img src="https://tdhopper.com/blog/the-phd-blueprint/page-12.png"
alt="Exploring the Alternatives: Go to Industry, Get a Master&amp;#39;s First, or Take Time Off (with pros and cons for each)."&gt;&lt;/a&gt;
&lt;/figure&gt;
&lt;figure&gt;&lt;a href="page-13.png"&gt;&lt;img src="https://tdhopper.com/blog/the-phd-blueprint/page-13.png"
alt="Question 5: If you go, what is your plan to succeed?"&gt;&lt;/a&gt;
&lt;/figure&gt;
&lt;figure&gt;&lt;a href="page-14.png"&gt;&lt;img src="https://tdhopper.com/blog/the-phd-blueprint/page-14.png"
alt="Designing Your Program &amp;amp; Process: Choose Your Advisor Wisely, Investigate the Culture, Set Your Own Agenda."&gt;&lt;/a&gt;
&lt;/figure&gt;
&lt;figure&gt;&lt;a href="page-15.png"&gt;&lt;img src="https://tdhopper.com/blog/the-phd-blueprint/page-15.png"
alt="Your Blueprint is Personal: 1. Why do you really want this? 2. Are you prepared for the reality? 3. Are you built for this? 4. What are your alternatives? 5. What is your plan to succeed?"&gt;&lt;/a&gt;
&lt;/figure&gt;
&lt;hr&gt;
&lt;p&gt;Want to hear directly from Ph.D. holders about their experiences? Check out the full interview series at &lt;a class="link" href="https://tdhopper.com/phd" target="_blank" rel="noopener"
&gt;Should I Get a Ph.D.?&lt;/a&gt;&lt;/p&gt;</description></item><item><title>Streaming Webcam to Browser with MediaMTX</title><link>https://tdhopper.com/blog/streaming-webcam-to-browser-with-mediamtx/</link><pubDate>Mon, 10 Nov 2025 00:00:00 +0000</pubDate><guid>https://tdhopper.com/blog/streaming-webcam-to-browser-with-mediamtx/</guid><description>&lt;img src="https://tdhopper.com/images/til.png" alt="Featured image of post Streaming Webcam to Browser with MediaMTX" /&gt;&lt;p&gt;I wanted to view my webcam feed from a browser on my local network. This turns out to be useful for testing video conferencing setups, monitoring a room, or checking camera angles before a recording. The challenge is that browsers can&amp;rsquo;t directly access webcam RTSP streams, but they do support WebRTC.&lt;/p&gt;
&lt;p&gt;After some tinkering, I found a clean solution: &lt;a class="link" href="https://ffmpeg.org/" target="_blank" rel="noopener"
&gt;ffmpeg&lt;/a&gt; captures the webcam and streams it via RTSP, while &lt;a class="link" href="https://github.com/bluenviron/mediamtx" target="_blank" rel="noopener"
&gt;MediaMTX&lt;/a&gt; converts that RTSP feed to WebRTC that browsers can consume.&lt;/p&gt;
&lt;h2 id="installation"&gt;Installation
&lt;/h2&gt;&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;brew install ffmpeg mediamtx
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;h2 id="setup"&gt;Setup
&lt;/h2&gt;&lt;p&gt;You need to run two processes in parallel.&lt;/p&gt;
&lt;h3 id="1-start-mediamtx-server"&gt;1. Start MediaMTX server
&lt;/h3&gt;&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;mediamtx
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;This starts the MediaMTX server which will handle the protocol conversion and serve the web interface.&lt;/p&gt;
&lt;h3 id="2-stream-webcam-with-ffmpeg"&gt;2. Stream webcam with ffmpeg
&lt;/h3&gt;&lt;p&gt;In a separate terminal:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt;1
&lt;/span&gt;&lt;span class="lnt"&gt;2
&lt;/span&gt;&lt;span class="lnt"&gt;3
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;ffmpeg -f avfoundation -framerate &lt;span class="m"&gt;30&lt;/span&gt; -i &lt;span class="s2"&gt;&amp;#34;0&amp;#34;&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; -vcodec libx264 -preset ultrafast -tune zerolatency &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; -f rtsp rtsp://127.0.0.1:8554/webcam
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;The &lt;code&gt;-f avfoundation&lt;/code&gt; flag is macOS-specific for capturing video devices. On Linux, you&amp;rsquo;d use &lt;code&gt;-f v4l2&lt;/code&gt; instead. The &lt;code&gt;&amp;quot;0&amp;quot;&lt;/code&gt; refers to the first video device (your default webcam). The &lt;code&gt;ultrafast&lt;/code&gt; preset and &lt;code&gt;zerolatency&lt;/code&gt; tuning minimize encoding overhead for real-time streaming.&lt;/p&gt;
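&lt;p&gt;To make the macOS/Linux difference concrete, here is a sketch of a small wrapper that picks the capture flags per OS and prints the resulting command; the device names are the usual defaults, but yours may differ:&lt;/p&gt;

```shell
#!/usr/bin/env bash
# Choose ffmpeg capture flags for the current OS (common default devices assumed).
case "$(uname -s)" in
  Darwin) input="-f avfoundation -framerate 30 -i 0" ;;   # macOS: first video device
  *)      input="-f v4l2 -framerate 30 -i /dev/video0" ;; # Linux: first V4L2 device
esac

# Print the full command; drop the echo to actually run it.
echo ffmpeg $input \
  -vcodec libx264 -preset ultrafast -tune zerolatency \
  -f rtsp rtsp://127.0.0.1:8554/webcam
```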
&lt;h2 id="viewing-the-stream"&gt;Viewing the Stream
&lt;/h2&gt;&lt;p&gt;Once both processes are running, open your browser and navigate to:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-fallback" data-lang="fallback"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;http://127.0.0.1:8889/webcam/
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;The feed should appear in your browser. You can access this URL from any device on your local network by replacing &lt;code&gt;127.0.0.1&lt;/code&gt; with your machine&amp;rsquo;s IP address.&lt;/p&gt;
&lt;p&gt;MediaMTX handles all the WebRTC negotiation and serves a basic web player interface. If you need to customize the player or embed it elsewhere, MediaMTX also exposes a &lt;a class="link" href="https://github.com/bluenviron/mediamtx#webrtc-api" target="_blank" rel="noopener"
&gt;WebRTC API&lt;/a&gt; you can integrate with.&lt;/p&gt;</description></item><item><title>Interview with the Pybites podcast</title><link>https://tdhopper.com/blog/interview-with-the-pybites-podcast/</link><pubDate>Wed, 30 Jul 2025 09:18:00 -0400</pubDate><guid>https://tdhopper.com/blog/interview-with-the-pybites-podcast/</guid><description>&lt;img src="https://tdhopper.com/images/podcast-interview-min.png" alt="Featured image of post Interview with the Pybites podcast" /&gt;&lt;p&gt;&lt;a class="link" href="https://www.linkedin.com/in/bbelderbos/" target="_blank" rel="noopener"
&gt;Bob Belderbos&lt;/a&gt; invited me on the &lt;a class="link" href="https://www.pybitespodcast.com/1501156/episodes/17574426-198-tim-hopper-on-uv-and-smarter-python-development" target="_blank" rel="noopener"
&gt;Pybites podcast&lt;/a&gt; to talk about my career, the &lt;a class="link" href="https://pydevtools.com/handbook/" target="_blank" rel="noopener"
&gt;Python Developer Tooling Handbook&lt;/a&gt;, &lt;a class="link" href="https://pydevtools.com/handbook/reference/uv/" target="_blank" rel="noopener"
&gt;uv&lt;/a&gt;, &lt;a class="link" href="https://photos.tdhopper.com/" target="_blank" rel="noopener"
&gt;photography&lt;/a&gt;, and more. Bob was a great interviewer and I hope you enjoy.&lt;/p&gt;
&lt;iframe data-testid="embed-iframe" style="border-radius:12px" src="https://open.spotify.com/embed/episode/7xyb2HUcqPEpvLYo8qjQrV?utm_source=generator&amp;theme=0&amp;t=0" width="100%" height="352" frameBorder="0" allowfullscreen="" allow="autoplay; clipboard-write; encrypted-media; fullscreen; picture-in-picture" loading="lazy"&gt;&lt;/iframe&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://www.pybitespodcast.com/1501156/episodes/17574426-198-tim-hopper-on-uv-and-smarter-python-development" target="_blank" rel="noopener"
&gt;Listen on pybitespodcast.com&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://open.spotify.com/episode/7xyb2HUcqPEpvLYo8qjQrV?si=b828375cdfc243fa" target="_blank" rel="noopener"
&gt;Listen on Spotify&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://podcasts.apple.com/us/podcast/198-tim-hopper-on-uv-and-smarter-python-development/id1545551340?i=1000719733285" target="_blank" rel="noopener"
&gt;Listen on Apple Podcasts&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://www.youtube.com/watch?v=5DT_zL7SiAI" target="_blank" rel="noopener"
&gt;Watch on YouTube&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</description></item><item><title>How Developers are Using LLMs</title><link>https://tdhopper.com/blog/how-developers-are-using-llms/</link><pubDate>Tue, 11 Mar 2025 12:12:00 +0000</pubDate><guid>https://tdhopper.com/blog/how-developers-are-using-llms/</guid><description>&lt;img src="https://tdhopper.com/images/ai-agent-coder.jpg" alt="Featured image of post How Developers are Using LLMs" /&gt;&lt;p&gt;I&amp;rsquo;m very interested in &lt;em&gt;how&lt;/em&gt; people are using LLMs to augment their work, especially in software development.&lt;/p&gt;
&lt;p&gt;I did three interviews with friends who are using them in different ways.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://tdhopper.com/blog/building-a-startup-with-ai-with-jowanza-joseph/" target="_blank" rel="noopener"
&gt;Jowanza Joseph&lt;/a&gt; is building a startup where &lt;a class="link" href="https://www.cursor.com" target="_blank" rel="noopener"
&gt;Cursor&lt;/a&gt; is doing much of the legwork.&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://tdhopper.com/blog/ai-assisted-programming-with-ravi-mody/" target="_blank" rel="noopener"
&gt;Ravi Mody&lt;/a&gt; primarily interacts with web-based chat clients like &lt;a class="link" href="https://chatgpt.com" target="_blank" rel="noopener"
&gt;ChatGPT&lt;/a&gt; to write his code.&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://tdhopper.com/blog/writing-a-math-textbook-with-claude-with-professor-ghrist/" target="_blank" rel="noopener"
&gt;Robert Ghrist&lt;/a&gt; &amp;ldquo;directed&amp;rdquo; &lt;a class="link" href="https://claude.ai" target="_blank" rel="noopener"
&gt;Claude&lt;/a&gt; on how to write a beautiful linear algebra textbook in his style.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I&amp;rsquo;ve enjoyed several other pieces lately on how others are using LLMs:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://harper.blog/2025/02/16/my-llm-codegen-workflow-atm/" target="_blank" rel="noopener"
&gt;Harper Reed&lt;/a&gt; posted a detailed account of his workflow to robustly generate new code bases. I&amp;rsquo;ve been playing with his provided prompts and am impressed.&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://simonwillison.net/2025/Mar/11/using-llms-for-code/" target="_blank" rel="noopener"
&gt;Simon Willison&lt;/a&gt; wrote about using LLMs to write code, especially enabling him to build things &lt;em&gt;he wouldn&amp;rsquo;t bother with otherwise&lt;/em&gt;. He shows concrete examples with his &lt;a class="link" href="https://tools.simonwillison.net" target="_blank" rel="noopener"
&gt;LLM generated tools page&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://youtu.be/EWvNQjAaOHw?si=mKi9NeISLy0DGEbT" target="_blank" rel="noopener"
&gt;Andrej Karpathy&lt;/a&gt; published a two-hour video that is the best introduction I&amp;rsquo;ve seen to the current state of LLMs and what they can do.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I&amp;rsquo;d love to hear about your use of LLMs. Please drop me a note in the form below.&lt;/p&gt;</description></item><item><title>Writing a Math Textbook with Claude with Professor Ghrist</title><link>https://tdhopper.com/blog/writing-a-math-textbook-with-claude-with-professor-ghrist/</link><pubDate>Sat, 18 Jan 2025 10:03:00 +0000</pubDate><guid>https://tdhopper.com/blog/writing-a-math-textbook-with-claude-with-professor-ghrist/</guid><description>&lt;img src="https://tdhopper.com/images/podcast.png" alt="Featured image of post Writing a Math Textbook with Claude with Professor Ghrist" /&gt;&lt;p&gt;In this episode, Professor Robert Ghrist from the University of Pennsylvania discusses his beautiful new linear algebra book created with the help of the Claude LLM in just 55 days. Professor Ghrist explains how he used Claude to assist with the book’s outline, writing style, formatting, and consistency, emphasizing his role as a director guiding the LLM.&lt;/p&gt;
&lt;h2 id="listen"&gt;Listen
&lt;/h2&gt;&lt;iframe src="https://open.spotify.com/embed/episode/1JMdPOSlmk2oSsiYEfJzIs" width="100%" height="232" frameBorder="0" allowtransparency="true" allow="encrypted-media"&gt;&lt;/iframe&gt;
&lt;h2 id="links"&gt;Links
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://www2.math.upenn.edu/~ghrist/" target="_blank" rel="noopener"
&gt;Professor Robert Ghrist&lt;/a&gt; from the University of Pennsylvania.&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://x.com/robertghrist" target="_blank" rel="noopener"
&gt;Professor Ghrist&amp;rsquo;s Twitter&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Professor G&amp;rsquo;s &lt;a class="link" href="https://x.com/robertghrist/status/1874105560641220830" target="_blank" rel="noopener"
&gt;Twitter thread&lt;/a&gt; about composing a linear algebra book with the assistance of the Claude LLM.&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://www2.math.upenn.edu/~ghrist/FLCT/" target="_blank" rel="noopener"
&gt;Funny Little Calculus Text (FLCT)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://www2.math.upenn.edu/~ghrist/notes.html" target="_blank" rel="noopener"
&gt;Elementary Applied Topology&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://www2.math.upenn.edu/~ghrist/preprints/LAEF.pdf" target="_blank" rel="noopener"
&gt;PDF of the new linear algebra book&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://amzn.to/4apUkbe" target="_blank" rel="noopener"
&gt;Order a print copy of the linear algebra book from Amazon&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="subscribe"&gt;Subscribe
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://tdhopper.com/podcast/feed" target="_blank" rel="noopener"
&gt;RSS Feed&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://podcasts.apple.com/us/podcast/into-the-hopper/id1499693201" target="_blank" rel="noopener"
&gt;Apple Podcasts&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://open.spotify.com/show/63NrgKMVb0VTwkklGboIjy" target="_blank" rel="noopener"
&gt;Spotify&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://overcast.fm/itunes1499693201/into-the-hopper" target="_blank" rel="noopener"
&gt;Overcast&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</description></item><item><title>The Sony A7 iii after 5 years</title><link>https://tdhopper.com/blog/sonya7iii/</link><pubDate>Sun, 12 Jan 2025 00:00:00 +0000</pubDate><guid>https://tdhopper.com/blog/sonya7iii/</guid><description>&lt;img src="https://tdhopper.com/blog/sonya7iii/20210215-DSC02383.jpg" alt="Featured image of post The Sony A7 iii after 5 years" /&gt;&lt;p&gt;Over the last 10 years, I&amp;rsquo;ve become more interested in the art of photography. What
started with me trying to get better photographs of waterfalls than I could with my iPhone
5 has led to an era where I mostly photograph my four children (ages one to seven).&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;ve used a lot of different cameras over the years&lt;sup id="fnref:1"&gt;&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref"&gt;1&lt;/a&gt;&lt;/sup&gt;, but the one that has consistently
brought me the most joy and satisfaction is my Sony a7 III. I&amp;rsquo;ve had the camera for five
years, and it
consistently helps me make delightful and beautiful images of my kids that I will
cherish for the rest of my life.&lt;/p&gt;
&lt;p&gt;When my second child was almost due, some friends gave me money as a gift, and I used
that to move from my a6500 setup to a used Sony a7 III. While it isn&amp;rsquo;t the camera I use
the most since my iPhone du jour is always closest at hand, I&amp;rsquo;m always glad to have
pulled it out. While modern iPhones have incredible cameras that make beautiful images
in the right conditions, the conditions I&amp;rsquo;m photographing my kids in are rarely ideal,
and the photos I take with my iPhone often disappoint me with their lack of sharpness,
motion blur, and lack of subject separation.&lt;/p&gt;
&lt;p&gt;My full-frame Sony mirrorless, on the other hand, rarely disappoints.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://tdhopper.com/blog/sonya7iii/20211021-DSC06619.jpg"
width="2630"
height="4096"
srcset="https://tdhopper.com/blog/sonya7iii/20211021-DSC06619_hu_915ca3d2d94a9704.jpg 480w, https://tdhopper.com/blog/sonya7iii/20211021-DSC06619_hu_1f1605b752421b53.jpg 1024w"
loading="lazy"
class="gallery-image"
data-flex-grow="64"
data-flex-basis="154px"
&gt; &lt;img src="https://tdhopper.com/blog/sonya7iii/20210215-DSC02383.jpg"
width="4096"
height="3566"
srcset="https://tdhopper.com/blog/sonya7iii/20210215-DSC02383_hu_725b2b86f748a360.jpg 480w, https://tdhopper.com/blog/sonya7iii/20210215-DSC02383_hu_3b73ad173852b2ca.jpg 1024w"
loading="lazy"
class="gallery-image"
data-flex-grow="114"
data-flex-basis="275px"
&gt; &lt;img src="https://tdhopper.com/blog/sonya7iii/20210729-DSC05630.jpg"
width="2630"
height="4096"
srcset="https://tdhopper.com/blog/sonya7iii/20210729-DSC05630_hu_f4713bbc0ff6a780.jpg 480w, https://tdhopper.com/blog/sonya7iii/20210729-DSC05630_hu_986fd45520e59888.jpg 1024w"
loading="lazy"
class="gallery-image"
data-flex-grow="64"
data-flex-basis="154px"
&gt;&lt;/p&gt;
&lt;h2 id="fast-and-responsive"&gt;Fast and Responsive
&lt;/h2&gt;&lt;p&gt;When I&amp;rsquo;m not photographing my kids, my other main photography interest is &lt;a class="link" href="https://photos.tdhopper.com/great-blue-herons" target="_blank" rel="noopener"
&gt;wildlife&lt;/a&gt;.
Wildlife photography presents a unique set of challenges - conditions are rarely ideal,
and subjects move unpredictably. As a wildlife photographer, I&amp;rsquo;m constantly pursuing
what Jay Maisel calls &lt;a class="link" href="https://www.jaymaisel.com/products/light-gesture-and-color" target="_blank" rel="noopener"
&gt;Light, Gesture, and Color&lt;/a&gt; in his book. You have to stay ready,
alert to fleeting opportunities.&lt;/p&gt;
&lt;p&gt;Photographing kids is exactly the same.&lt;/p&gt;
&lt;p&gt;Having a fast and responsive camera isn&amp;rsquo;t everything - the tool doesn&amp;rsquo;t make the art -
but it opens doors to capturing those precious moments that might otherwise slip away.
The Sony a7 III excels in this regard in several key ways:&lt;/p&gt;
&lt;p&gt;First, it boots almost instantly. While many cameras need several seconds to wake up, my
Sony (like other prosumer/professional cameras) is ready to shoot the moment I flip the
switch. This responsiveness has saved countless shots that slower cameras would have
missed. The quick recovery from sleep mode is equally valuable.&lt;/p&gt;
&lt;p&gt;Beyond pure speed, the camera&amp;rsquo;s physical controls are thoughtfully designed for rapid
adjustments. When shooting in aperture priority mode, I can quickly dial in aperture and
exposure compensation. If I need to freeze fast action, switching to shutter priority
and adjusting speed happens in seconds. The customizable buttons let me access my
most-used settings without diving into menus.&lt;/p&gt;
&lt;h2 id="amazing-autofocus"&gt;Amazing Autofocus
&lt;/h2&gt;&lt;p&gt;Another aspect of speed is the camera&amp;rsquo;s autofocus. Because my kids are on the move, I
almost always shoot in continuous autofocus. The face detection does a great job keeping
track of my kids, and I consistently get sharp images. Autofocus tends to be great on
modern cameras, but I love having a camera where it excels.&lt;/p&gt;
&lt;p&gt;The camera also has eye-detection autofocus, though unlike more modern cameras it
requires pressing a separate button. I have my AEL button set to eye autofocus, and I
instinctively press it when trying to get a portrait.&lt;/p&gt;
&lt;p&gt;Autofocus is one area where cameras are improving every generation, and I look forward to
benefiting from that next time I upgrade.&lt;/p&gt;
&lt;h2 id="large-sensor"&gt;Large Sensor
&lt;/h2&gt;&lt;p&gt;Unlike all the other cameras I&amp;rsquo;ve used, the Sony a7 III is a full-frame camera. This
large sensor allows the camera to capture more light and produce cleaner images in low
light settings. While my iPhone often disappoints indoors and in the evenings, my
Sony&amp;ndash;especially paired with a fast prime lens&amp;ndash;consistently produces sharp and
beautiful images, even at ISO
12800.&lt;/p&gt;
&lt;p&gt;The large sensor also allows for more subject separation (bokeh) than smaller sensors.
When I&amp;rsquo;m photographing my kids, I often want to blur the background to make the subject
stand out. The Sony a7 III does this beautifully.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://tdhopper.com/blog/sonya7iii/20231109-DSC01340.jpg"
width="2630"
height="4096"
srcset="https://tdhopper.com/blog/sonya7iii/20231109-DSC01340_hu_6101dd49b93fff4b.jpg 480w, https://tdhopper.com/blog/sonya7iii/20231109-DSC01340_hu_b1657b9d9e47b5ef.jpg 1024w"
loading="lazy"
class="gallery-image"
data-flex-grow="64"
data-flex-basis="154px"
&gt; &lt;img src="https://tdhopper.com/blog/sonya7iii/20210528-DSC04315.jpg"
width="4096"
height="2641"
srcset="https://tdhopper.com/blog/sonya7iii/20210528-DSC04315_hu_5e7566f57276ee3b.jpg 480w, https://tdhopper.com/blog/sonya7iii/20210528-DSC04315_hu_d1bf6996e20866af.jpg 1024w"
loading="lazy"
class="gallery-image"
data-flex-grow="155"
data-flex-basis="372px"
&gt; &lt;img src="https://tdhopper.com/blog/sonya7iii/20241223-DSC04982.jpg"
width="2630"
height="4096"
srcset="https://tdhopper.com/blog/sonya7iii/20241223-DSC04982_hu_1c439e872a7fabdc.jpg 480w, https://tdhopper.com/blog/sonya7iii/20241223-DSC04982_hu_1822c1ca75678e87.jpg 1024w"
loading="lazy"
class="gallery-image"
data-flex-grow="64"
data-flex-basis="154px"
&gt;&lt;/p&gt;
&lt;h2 id="durable-and-reliable"&gt;Durable and Reliable
&lt;/h2&gt;&lt;p&gt;The Sony a7 III is a professional-grade camera. It&amp;rsquo;s built to last. While I try not to
abuse it, it&amp;rsquo;s been banged around on hikes, sandy at the beach, and bounced around in my
backpack. More than five years in, you&amp;rsquo;d have no idea other than some surface wear on the
body.&lt;/p&gt;
&lt;p&gt;Because the camera is weather-sealed, I can take it out in the rain or snow without
worrying about damaging it. Last week, I took my five-year-old out in freezing rain and
snow and was able to capture a few beautiful (low-light!) images of him. When I came
home, I just wiped the camera down with a towel and didn&amp;rsquo;t have to worry about it.&lt;/p&gt;
&lt;p&gt;The camera also has excellent battery life, unlike some compact and mirrorless cameras
I&amp;rsquo;ve used. The battery lasts for hundreds of shots, and I can go weeks without charging it.
I almost never have a trip or outing that requires charging. With a USB-C port, the
camera is easy to charge whenever I need to.&lt;/p&gt;
&lt;h2 id="image-quality"&gt;Image Quality
&lt;/h2&gt;&lt;p&gt;The Sony a7 III produces beautiful images. The colors are vibrant, the images are sharp,
and the dynamic range is excellent. The camera produces images that are a joy to look at
and share. Without fail, my favorite photos of my kids are the ones I&amp;rsquo;ve taken with my
Sony.&lt;/p&gt;
&lt;h2 id="silent-shooting"&gt;Silent Shooting
&lt;/h2&gt;&lt;p&gt;A small benefit of the mirrorless design is the ability to shoot totally silently.
This isn&amp;rsquo;t a feature I knew I wanted, but it has been valuable on many occasions.
Sometimes when photographing kids, it&amp;rsquo;s helpful to be able to take a photo without
them knowing.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://tdhopper.com/blog/sonya7iii/20220326-DSC07952.jpg"
width="4096"
height="3566"
srcset="https://tdhopper.com/blog/sonya7iii/20220326-DSC07952_hu_7911cc54c95a9edc.jpg 480w, https://tdhopper.com/blog/sonya7iii/20220326-DSC07952_hu_d7dbf263801e3cf4.jpg 1024w"
loading="lazy"
class="gallery-image"
data-flex-grow="114"
data-flex-basis="275px"
&gt; &lt;img src="https://tdhopper.com/blog/sonya7iii/20240827-DSC04280.jpg"
width="3116"
height="4096"
srcset="https://tdhopper.com/blog/sonya7iii/20240827-DSC04280_hu_2602d1511063c173.jpg 480w, https://tdhopper.com/blog/sonya7iii/20240827-DSC04280_hu_271aaa23cffb9be.jpg 1024w"
loading="lazy"
class="gallery-image"
data-flex-grow="76"
data-flex-basis="182px"
&gt; &lt;img src="https://tdhopper.com/blog/sonya7iii/20240219-DSC02347.jpg"
width="2630"
height="4096"
srcset="https://tdhopper.com/blog/sonya7iii/20240219-DSC02347_hu_7b93dbaca250726.jpg 480w, https://tdhopper.com/blog/sonya7iii/20240219-DSC02347_hu_1e99e0d378f33ca8.jpg 1024w"
loading="lazy"
class="gallery-image"
data-flex-grow="64"
data-flex-basis="154px"
&gt;&lt;/p&gt;
&lt;h2 id="conclusion"&gt;Conclusion
&lt;/h2&gt;&lt;p&gt;Capturing the fleeting moments of my children&amp;rsquo;s lives is a joy and a privilege. Like most
parents of my generation, I mostly use my iPhone to capture these moments. However, almost
without exception,
the photos I take with my Sony a7 III are the ones I cherish the most. The camera is
fast, responsive, and reliable. It produces beautiful images that I will keep looking at
for the rest of my life.&lt;/p&gt;
&lt;div class="footnotes" role="doc-endnotes"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:1"&gt;
&lt;p&gt;Cameras I&amp;rsquo;ve used since 2014: iPhones from 5 up to 16 Pro Max, &lt;a class="link" href="https://www.dpreview.com/products/olympus/compacts/oly_tg3" target="_blank" rel="noopener"
&gt;Olympus Tough TG-3&lt;/a&gt;, &lt;a class="link" href="https://www.dpreview.com/products/sony/slrs/sony_nex5t" target="_blank" rel="noopener"
&gt;Sony NEX-5T&lt;/a&gt;, &lt;a class="link" href="https://www.dpreview.com/reviews/canon-eos-7d-mark-ii" target="_blank" rel="noopener"
&gt;Canon 7D II&lt;/a&gt;, &lt;a class="link" href="https://www.dpreview.com/reviews/sony-alpha-a6500" target="_blank" rel="noopener"
&gt;Sony a6500&lt;/a&gt;, &lt;a class="link" href="https://www.dpreview.com/reviews/sony-cybershot-dsc-rx100-v-review" target="_blank" rel="noopener"
&gt;Sony RX100V&lt;/a&gt;, &lt;a class="link" href="https://www.dpreview.com/reviews/fujifilm-x100v-review" target="_blank" rel="noopener"
&gt;Fuji X100V&lt;/a&gt;, &lt;a class="link" href="https://www.dpreview.com/reviews/sony-a7-iii-review" target="_blank" rel="noopener"
&gt;Sony a7 III&lt;/a&gt;&amp;#160;&lt;a href="#fnref:1" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</description></item><item><title>AI interpretation of tdhopper.com</title><link>https://tdhopper.com/blog/ai-interpretation-of-tdhopper.com/</link><pubDate>Tue, 31 Dec 2024 10:54:00 +0000</pubDate><guid>https://tdhopper.com/blog/ai-interpretation-of-tdhopper.com/</guid><description>&lt;img src="https://tdhopper.com/images/computer-phone.png" alt="Featured image of post AI interpretation of tdhopper.com" /&gt;&lt;p&gt;I dumped all my posts from the last 14 years into &lt;a class="link" href="https://claude.ai" target="_blank" rel="noopener"
&gt;Claude&lt;/a&gt; and asked it to make any conclusions about me from them. These are spot on:&lt;/p&gt;
&lt;blockquote&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Evolution in Technical Identity&lt;/strong&gt;: Your writing shows an ongoing tension between pure mathematics and practical engineering. Early posts focus heavily on mathematical concepts, but over time you seem to embrace a more pragmatic engineering mindset while still maintaining deep appreciation for theoretical foundations.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Teaching as Processing&lt;/strong&gt;: You often write about technical concepts right after learning them yourself, using blogging/teaching as a way to solidify your own understanding. This suggests you process and internalize knowledge best by trying to explain it to others.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Meta-Learning Focus&lt;/strong&gt;: Many of your posts aren&amp;rsquo;t just about technical topics, but about HOW to learn technical topics. You seem particularly interested in the process of skill acquisition and knowledge transfer.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Honesty About Failure&lt;/strong&gt;: Unlike many technical bloggers, you&amp;rsquo;re remarkably open about professional setbacks and rejections. Your post about being rejected from data science jobs stands out as unusually candid for the field.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Long-Form Thinking&lt;/strong&gt;: You maintain ideas and threads over very long periods. For example, you&amp;rsquo;ve written multiple follow-ups and updates to your &amp;ldquo;How I Became a Data Scientist&amp;rdquo; post over several years as your perspective evolved.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Documentation as Service&lt;/strong&gt;: Many of your technical posts seem motivated by wanting to document solutions to problems you struggled with, specifically to help others avoid the same struggles. This suggests a strong service orientation in your technical writing.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cross-Domain Connection Making&lt;/strong&gt;: You often draw interesting parallels between seemingly unrelated domains - connecting operations research to data science, or academic math to practical programming.&lt;/li&gt;
&lt;/ol&gt;
&lt;/blockquote&gt;</description></item><item><title>Accessing my home server around the world with custom domain names</title><link>https://tdhopper.com/blog/accessing-my-home-server-around-the-world-with-custom-domain-names/</link><pubDate>Fri, 20 Dec 2024 11:47:00 +0000</pubDate><guid>https://tdhopper.com/blog/accessing-my-home-server-around-the-world-with-custom-domain-names/</guid><description>&lt;img src="https://tdhopper.com/images/network.png" alt="Featured image of post Accessing my home server around the world with custom domain names" /&gt;&lt;p&gt;Last year, I invested in a &lt;a class="link" href="https://www.synology.com/" target="_blank" rel="noopener"
&gt;Synology NAS&lt;/a&gt; as a home server and backup drive. Like many Synology users, I&amp;rsquo;ve quickly found it has many more uses than I initially expected. Beyond the usual file storage and Time Machine backups, my Synology has become the backbone of my home technology.&lt;/p&gt;
&lt;p&gt;On my NAS, I use &lt;a class="link" href="https://pi-hole.net" target="_blank" rel="noopener"
&gt;Pihole&lt;/a&gt; for network-wide ad blocking. A Webdav server syncs my &lt;a class="link" href="https://devonthink.com" target="_blank" rel="noopener"
&gt;Devonthink&lt;/a&gt; databases to manage my research and documents. I back up my Apple Photos library on &lt;a class="link" href="https://www.synology.com/en-us/dsm/feature/photos" target="_blank" rel="noopener"
&gt;Synology Photos&lt;/a&gt;. I&amp;rsquo;m transitioning my home camera setup to &lt;a class="link" href="https://www.synology.com/en-global/surveillance" target="_blank" rel="noopener"
&gt;Surveillance Station&lt;/a&gt; (so all my footage is now local). Using Synology&amp;rsquo;s Docker support, I run &lt;a class="link" href="https://openaudible.org" target="_blank" rel="noopener"
&gt;OpenAudible&lt;/a&gt; to manage my audiobook collection. I run &lt;a class="link" href="https://www.synology.com/en-us/dsm/feature/mailplus" target="_blank" rel="noopener"
&gt;Synology MailPlus&lt;/a&gt; to back up my email. And there&amp;rsquo;s more!&lt;/p&gt;
&lt;p&gt;Of course, now that I have all these services running, I want to access them from anywhere.
Synology offers the &lt;a class="link" href="https://kb.synology.com/en-us/DSM/help/DSM/AdminCenter/connection_quickconnect?version=7" target="_blank" rel="noopener"
&gt;QuickConnect&lt;/a&gt; service, but I wanted something more secure and independent.&lt;/p&gt;
&lt;p&gt;While browsing Synology&amp;rsquo;s subreddit and Facebook Groups, I discovered &lt;a class="link" href="https://tailscale.com" target="_blank" rel="noopener"
&gt;Tailscale&lt;/a&gt;, a mesh VPN service that solves exactly this problem. After (easily) installing Tailscale on my Synology, personal computers, and phone, my devices are securely connected from anywhere in the world as if they were on the same local network. With Tailscale, my NAS has a private, static IP address through which I can access each service on its dedicated port.&lt;/p&gt;
&lt;p&gt;To make my services more accessible, I purchased a domain name through &lt;a class="link" href="https://cloudflare.com" target="_blank" rel="noopener"
&gt;Cloudflare&lt;/a&gt;. I created a subdomain with an A record pointing to my always-on Mac Mini&amp;rsquo;s Tailscale IP address for each of my services. This IP address is accessible from anywhere but only through my devices authenticated with Tailscale.&lt;/p&gt;
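&lt;p&gt;Concretely, the DNS records look something like this (zone-file notation; the subdomains and the Tailscale IP here are illustrative placeholders, since Tailscale assigns each device an address from the 100.64.0.0/10 range):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;; Each subdomain points at the Mac Mini Tailscale IP (illustrative values)
pihole.mydomain.com.   300  IN  A  100.101.102.103
webdav.mydomain.com.   300  IN  A  100.101.102.103
photos.mydomain.com.   300  IN  A  100.101.102.103
&lt;/code&gt;&lt;/pre&gt;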
&lt;p&gt;I run a &lt;a class="link" href="https://caddyserver.com" target="_blank" rel="noopener"
&gt;Caddy server&lt;/a&gt; on my Mac Mini as a reverse proxy to map the subdomains to my Synology services. The magic of Caddy is that it automatically manages SSL certificates through Let&amp;rsquo;s Encrypt, giving me https access to my services.&lt;/p&gt;
&lt;p&gt;Caddy was painless to install with Homebrew and easy to configure with a Caddyfile. Here&amp;rsquo;s an example of my Caddyfile:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt; 1
&lt;/span&gt;&lt;span class="lnt"&gt; 2
&lt;/span&gt;&lt;span class="lnt"&gt; 3
&lt;/span&gt;&lt;span class="lnt"&gt; 4
&lt;/span&gt;&lt;span class="lnt"&gt; 5
&lt;/span&gt;&lt;span class="lnt"&gt; 6
&lt;/span&gt;&lt;span class="lnt"&gt; 7
&lt;/span&gt;&lt;span class="lnt"&gt; 8
&lt;/span&gt;&lt;span class="lnt"&gt; 9
&lt;/span&gt;&lt;span class="lnt"&gt;10
&lt;/span&gt;&lt;span class="lnt"&gt;11
&lt;/span&gt;&lt;span class="lnt"&gt;12
&lt;/span&gt;&lt;span class="lnt"&gt;13
&lt;/span&gt;&lt;span class="lnt"&gt;14
&lt;/span&gt;&lt;span class="lnt"&gt;15
&lt;/span&gt;&lt;span class="lnt"&gt;16
&lt;/span&gt;&lt;span class="lnt"&gt;17
&lt;/span&gt;&lt;span class="lnt"&gt;18
&lt;/span&gt;&lt;span class="lnt"&gt;19
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-fallback" data-lang="fallback"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;cat Caddyfile
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;{
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; email &amp;lt;MY_EMAIL&amp;gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;}
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;*.mydomain.com mydomain.com {
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; tls {
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; dns cloudflare &amp;lt;CLOUDFLARE_API_TOKEN&amp;gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; }
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;}
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;pihole.mydomain.com {
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; reverse_proxy 192.168.68.21:8765
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;}
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;webdav.mydomain.com {
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; reverse_proxy https://192.168.68.21:5006 {
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; transport http {
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; tls_insecure_skip_verify
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; }
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; }
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;}
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;The last piece of the puzzle involved configuring the Synology to know these subdomains. In the DSM Control Panel, under the Login Portal&amp;rsquo;s Applications tab, I set up the specific subdomains that tell Synology how to launch applications like MailPlus and Synology Photos. With this final step complete, I don&amp;rsquo;t need to remember IP addresses and ports. I can access all my services through simple, memorable domain names from anywhere in the world.&lt;/p&gt;
&lt;p&gt;This setup has transformed how I interact with my home server, making it both more powerful and more convenient to use. The combination of Tailscale&amp;rsquo;s security, Cloudflare&amp;rsquo;s domain management, and Caddy&amp;rsquo;s reverse proxy capabilities has created an easy-to-configure and user-friendly system that serves all my home server needs.&lt;/p&gt;</description></item><item><title>Building a Startup with AI with Jowanza Joseph</title><link>https://tdhopper.com/blog/building-a-startup-with-ai-with-jowanza-joseph/</link><pubDate>Tue, 17 Dec 2024 12:00:00 +0000</pubDate><guid>https://tdhopper.com/blog/building-a-startup-with-ai-with-jowanza-joseph/</guid><description>&lt;img src="https://tdhopper.com/images/podcast.png" alt="Featured image of post Building a Startup with AI with Jowanza Joseph" /&gt;&lt;h2 id="listen"&gt;Listen
&lt;/h2&gt;&lt;iframe src="https://open.spotify.com/embed/episode/5ZmKcu4HgAyAvL5sJFY9mP" width="100%" height="232" frameBorder="0" allowtransparency="true" allow="encrypted-media"&gt;&lt;/iframe&gt;
&lt;h2 id="links"&gt;Links
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://twitter.com/jowanza" target="_blank" rel="noopener"
&gt;Jowanza&amp;rsquo;s Twitter&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://jowanza.com" target="_blank" rel="noopener"
&gt;Jowanza&amp;rsquo;s website &lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://www.parakeetfinancial.com/" target="_blank" rel="noopener"
&gt;Parakeet Financial&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://tdhopper.com/blog/ai-assisted-programming-with-ravi-mody/" target="_blank" rel="noopener"
&gt;AI Assisted Programming with Ravi Mody&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://cursor.sh/" target="_blank" rel="noopener"
&gt;Cursor Editor&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://www.magicpatterns.com/" target="_blank" rel="noopener"
&gt;Magic Patterns&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://v0.dev/" target="_blank" rel="noopener"
&gt;V0 by Vercel&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://www.crewai.com/" target="_blank" rel="noopener"
&gt;Crew AI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://github.com/features/copilot" target="_blank" rel="noopener"
&gt;GitHub Copilot&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="subscribe"&gt;Subscribe
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://tdhopper.com/podcast/feed" target="_blank" rel="noopener"
&gt;RSS Feed&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://podcasts.apple.com/us/podcast/into-the-hopper/id1499693201" target="_blank" rel="noopener"
&gt;Apple Podcasts&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://open.spotify.com/show/63NrgKMVb0VTwkklGboIjy" target="_blank" rel="noopener"
&gt;Spotify&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://overcast.fm/itunes1499693201/into-the-hopper" target="_blank" rel="noopener"
&gt;Overcast&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</description></item><item><title>A builder of systems, a path to unveil</title><link>https://tdhopper.com/blog/a-builder-of-systems-a-path-to-unveil/</link><pubDate>Thu, 05 Dec 2024 10:18:00 +0000</pubDate><guid>https://tdhopper.com/blog/a-builder-of-systems-a-path-to-unveil/</guid><description>&lt;img src="https://tdhopper.com/images/singing.jpeg" alt="Featured image of post A builder of systems, a path to unveil" /&gt;&lt;p&gt;I recently joined the Spotify &lt;a class="link" href="https://research.atspotify.com/audio-intelligence/" target="_blank" rel="noopener"
&gt;Audio Intelligence&lt;/a&gt; team as a machine learning infrastructure engineer. At our team offsite, we did lightning round introductions of ourselves, so I used &lt;a class="link" href="https://www.linkedin.com/company/aichatgpt/" target="_blank" rel="noopener"
&gt;ChatGPT&lt;/a&gt; and &lt;a class="link" href="https://www.linkedin.com/company/tad-ai/" target="_blank" rel="noopener"
&gt;tad.ai&lt;/a&gt; to generate a song about my career history.&lt;/p&gt;
&lt;div class="video-wrapper"&gt;
&lt;iframe loading="lazy"
src="https://www.youtube.com/embed/v8mc-26Q6fQ"
allowfullscreen
title="YouTube Video"
&gt;
&lt;/iframe&gt;
&lt;/div&gt;
&lt;blockquote&gt;
&lt;p&gt;Started at RTI, solving problems with care,&lt;br&gt;
Social media models and GIS in the air. &lt;br&gt;
Then Parsley called, real-time streams to refine, &lt;br&gt;
With Python and Storm, data aligned.&lt;/p&gt;
&lt;p&gt;From research to engineering, skills took flight, &lt;br&gt;
Building platforms through day and night. &lt;br&gt;
Each role a step, each tool a stage, &lt;br&gt;
A career evolving, page by page.&lt;/p&gt;
&lt;p&gt;At Cylance, you scaled machine learning with ease, &lt;br&gt;
Serverless pipelines and malware’s disease. &lt;br&gt;&lt;/p&gt;
&lt;p&gt;D.T.N. brought platforms, Airflow in play, &lt;br&gt;
Cloud tools for data, clearing the way. &lt;br&gt;
Varo’s features took center stage, &lt;br&gt;
With Tecton and Airflow, systems engaged.&lt;/p&gt;
&lt;p&gt;Fraud detection, insights in view, &lt;br&gt;
Scaling platforms, something new. &lt;br&gt;
From data science to ML core, &lt;br&gt;
Building systems that teams adore.&lt;/p&gt;
&lt;p&gt;Now at Spotify, the journey goes on, &lt;br&gt;
Tim’s career—focused, strong. &lt;br&gt;
Each milestone builds the next design, &lt;br&gt;
A career in motion, sharp and aligned.&lt;/p&gt;
&lt;p&gt;Tim’s journey tells a thoughtful tale, &lt;br&gt;
A builder of systems, a path to unveil.&lt;/p&gt;
&lt;/blockquote&gt;</description></item><item><title>Lessons from Developing a Machine Learning Feature Platform</title><link>https://tdhopper.com/blog/lessons-from-developing-a-machine-learning-feature-platform/</link><pubDate>Thu, 14 Nov 2024 12:00:00 -0500</pubDate><guid>https://tdhopper.com/blog/lessons-from-developing-a-machine-learning-feature-platform/</guid><description>&lt;img src="https://tdhopper.com/images/featureplat.png" alt="Featured image of post Lessons from Developing a Machine Learning Feature Platform" /&gt;&lt;p&gt;Much of my career has been focused on helping machine learning researchers get the data they need, where they need it, and when they need it.
Over the past three years, I helped lead the development of a new machine learning feature platform at the bank startup where I worked.
Our platform served a variety of machine learning and analytics applications by letting users define custom data transformations that generate both batch and real-time data for model training and inference.&lt;/p&gt;
&lt;p&gt;Here are some lessons from my experience:&lt;/p&gt;
&lt;h2 id="point-in-time-correctness-is-harder-than-it-looks"&gt;Point-in-time correctness is harder than it looks
&lt;/h2&gt;&lt;p&gt;Training data must reflect what was knowable at prediction time, not what you know now. This seems obvious until you face late-arriving events, slowly changing dimensions, or aggregations that span the temporal boundary. A customer&amp;rsquo;s balance &amp;ldquo;as of yesterday&amp;rdquo; needs yesterday&amp;rsquo;s knowledge, not today&amp;rsquo;s corrected ledger. Build point-in-time correctness into the platform from day one. Retrofitting is expensive and painful.&lt;/p&gt;
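To make this concrete, here is a minimal sketch of a point-in-time join using pandas' `merge_asof` (the customer data and column names are hypothetical, not the platform's actual implementation): each training label is matched to the latest feature value that was knowable at prediction time, never a later correction.

```python
import pandas as pd

# Feature values as they became known (an append-only log with knowledge timestamps).
features = pd.DataFrame({
    "customer_id": [1, 1, 1],
    "knowledge_time": pd.to_datetime(["2024-01-01", "2024-01-05", "2024-01-09"]),
    "balance": [100.0, 250.0, 80.0],
})

# Training examples: predictions were made at these times.
labels = pd.DataFrame({
    "customer_id": [1, 1],
    "prediction_time": pd.to_datetime(["2024-01-04", "2024-01-07"]),
    "label": [0, 1],
})

# merge_asof requires both frames sorted on the join keys; it attaches the most
# recent feature row with knowledge_time <= prediction_time, never a future value.
training = pd.merge_asof(
    labels.sort_values("prediction_time"),
    features.sort_values("knowledge_time"),
    left_on="prediction_time",
    right_on="knowledge_time",
    by="customer_id",
)
print(training["balance"].tolist())  # [100.0, 250.0]
```

The key property: the label at 2024-01-07 picks up the balance recorded on 2024-01-05, not the corrected value from 2024-01-09.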
&lt;h2 id="event-time-vs-processing-time-matters"&gt;Event time vs processing time matters
&lt;/h2&gt;&lt;p&gt;Teams need a common understanding of feature availability and event time versus processing time semantics. When you query for &amp;ldquo;the features as of timestamp T,&amp;rdquo; are you getting the data that was available at time T, or the data whose events occurred at time T? This distinction is fundamental to correct temporal joins and real-time feature computation.&lt;/p&gt;
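A minimal sketch (hypothetical timestamps and amounts) shows how the two clocks give different answers to "as of timestamp T" when an event arrives late:

```python
import pandas as pd

# A late-arriving event: it occurred at 09:00 but reached the system at 11:30.
events = pd.DataFrame({
    "event_time":      pd.to_datetime(["2024-03-01 09:00", "2024-03-01 10:00"]),
    "processing_time": pd.to_datetime(["2024-03-01 11:30", "2024-03-01 10:05"]),
    "amount": [50.0, 20.0],
})

t = pd.Timestamp("2024-03-01 11:00")

# "Events that occurred by T": includes the late event (it occurred at 09:00).
by_event_time = events[events["event_time"] <= t]["amount"].sum()

# "Data available at T": excludes it (it had not yet arrived at 11:00).
by_processing_time = events[events["processing_time"] <= t]["amount"].sum()

print(by_event_time, by_processing_time)  # 70.0 20.0
```

A feature backfilled by event time will not match what the online system could actually have served at T, which is exactly the skew the next section warns about.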
&lt;h2 id="reduce-skew-by-reducing-code-duplication"&gt;Reduce skew by reducing code duplication
&lt;/h2&gt;&lt;p&gt;Many machine learning applications require offline data retrieval in batch to train the models but then up-to-date, real-time features for model inference. Because of technical challenges, these two sources are often served by separate code paths. As soon as feature logic is reimplemented, you open the door to subtle (or not-so-subtle) bugs that can skew the features and the resulting model predictions. If you&amp;rsquo;re doing both online and offline retrieval, do whatever you must to have a single code path for each feature.&lt;/p&gt;
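One way to get a single code path is to define the transformation once as a plain function and call it from both the batch and online paths. A minimal sketch, with a hypothetical transaction-count feature:

```python
def large_txn_count(amounts: list[float], threshold: float = 100.0) -> int:
    """The feature logic, defined exactly once and shared by both paths."""
    return sum(1 for a in amounts if a > threshold)

def batch_features(history: dict[str, list[float]]) -> dict[str, int]:
    """Offline path: applied over historical data to build training sets."""
    return {cust: large_txn_count(amounts) for cust, amounts in history.items()}

def online_feature(payload: list[float]) -> int:
    """Online path: applied to a live request payload at inference time."""
    return large_txn_count(payload)
```

However the two paths are deployed, any change to the threshold or the comparison happens in one place, so training and serving cannot silently diverge.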
&lt;h2 id="unified-batch-and-streaming-requires-operational-maturity"&gt;Unified batch and streaming requires operational maturity
&lt;/h2&gt;&lt;p&gt;The reality is that most unified systems (Flink, Beam, etc.) still require different execution engines and operational expertise. The goal of avoiding multiple code paths for feature logic is achievable, but don&amp;rsquo;t underestimate the operational complexity of running both a batch system and a streaming system in production.&lt;/p&gt;
&lt;h2 id="feature-discoverability-unlocks-reuse"&gt;Feature discoverability unlocks reuse
&lt;/h2&gt;&lt;p&gt;A centralized feature store becomes valuable not just for machine learning &lt;a class="link" href="https://tdhopper.com/talks/streaming-machine-learning/" target="_blank" rel="noopener"
&gt;but in rules engines and analytics&lt;/a&gt; as well. Well documented and discoverable data in the feature store and the upstream data warehouse are essential for teams to find and reuse features instead of recreating them. This multiplies the value of every feature created and reduces the maintenance burden.&lt;/p&gt;
&lt;h2 id="governance-reins-in-complexity"&gt;Governance reins in complexity
&lt;/h2&gt;&lt;p&gt;You need clear rules about who can modify which data, who can change feature names, and who can create new features in specific domains. Without governance, the system quickly becomes unmanageable. The feature store platform team should democratize feature creation while maintaining guardrails that prevent chaos.&lt;/p&gt;
&lt;h2 id="metadata-and-lineage-are-debugging-lifelines"&gt;Metadata and lineage are debugging lifelines
&lt;/h2&gt;&lt;p&gt;Tracking where features come from and how they&amp;rsquo;re built is essential for debugging, compliance, and understanding downstream impacts. When (not &lt;em&gt;if&lt;/em&gt;) something goes wrong, you need to be able to trace the problem back to its source quickly.&lt;/p&gt;
&lt;h2 id="monitoring-and-alerting-are-essential"&gt;Monitoring and alerting are essential
&lt;/h2&gt;&lt;p&gt;As with governance, monitoring ensures the quality and reliability of the platform, which is essential for earning the trust of your internal users and solving your business problems. You need to detect drift, staleness, and missing updates. Without ML-aware monitoring, subtle issues can degrade model performance for weeks before anyone notices.&lt;/p&gt;
&lt;h2 id="ml-platforms-arent-just-data-engineering"&gt;ML platforms aren&amp;rsquo;t just data engineering
&lt;/h2&gt;&lt;p&gt;Feature platforms have specific concerns that distinguish them from general data engineering platforms. The focus on data leakage prevention, point-in-time correctness, and training-serving consistency requires specialized thinking. You can&amp;rsquo;t simply treat ML data like any other data pipeline.&lt;/p&gt;
&lt;h2 id="tooling-enforces-what-platforms-guarantee"&gt;Tooling enforces what platforms guarantee
&lt;/h2&gt;&lt;p&gt;There&amp;rsquo;s enormous value in good developer tooling: precommit hooks, tests, and CI systems that do automated checks. These tools enforce style guidelines and catch basic errors before production. But be clear about responsibilities: the platform team ensures infrastructure correctness (joins, freshness, consistency) while feature owners ensure semantic correctness (business logic, data quality). Confusing this boundary creates a liability gap.&lt;/p&gt;
&lt;h2 id="enforcement-creates-tension"&gt;Enforcement creates tension
&lt;/h2&gt;&lt;p&gt;Requiring teams to use the platform (not deploying models that bypass it) creates tension with customer-centricity. The way to resolve this: make the platform so valuable, reliable, and easy to use that teams want to adopt it, then enforce adoption to prevent the chaos of fragmented tooling. Build carrots first, then use the stick.&lt;/p&gt;
&lt;h2 id="cost-management-means-thinking-about-serving"&gt;Cost management means thinking about serving
&lt;/h2&gt;&lt;p&gt;In batch feature creation, your users will need to learn to define their features efficiently to avoid wasting money on expensive compute. On the other hand, the storage and compute for real-time lookups typically dwarf batch join costs. Without careful attention to serving costs and careful selection of what needs to be real-time versus what can be precomputed, the computational demands can blow your budget.&lt;/p&gt;
&lt;h2 id="customer-centricity-is-non-negotiable"&gt;Customer-centricity is non-negotiable
&lt;/h2&gt;&lt;p&gt;Platform teams must stay close to their end users. Build a tool that actually solves the problems they have, not the problems you think they should have. Listen to their pain points, understand their workflows, and iterate based on real usage patterns. A technically perfect platform that nobody wants to use is a failure.&lt;/p&gt;</description></item><item><title>My summer doing math (?) research</title><link>https://tdhopper.com/blog/my-summer-doing-math-research/</link><pubDate>Thu, 19 Sep 2024 13:20:00 +0000</pubDate><guid>https://tdhopper.com/blog/my-summer-doing-math-research/</guid><description>&lt;img src="https://tdhopper.com/images/graphtheory.jpg" alt="Featured image of post My summer doing math (?) research" /&gt;&lt;p&gt;I started college as a physics major, but I switched to computer science my sophomore year when I realized I didn&amp;rsquo;t want to take more lab classes. After a year as a CS major, I realized I loved the theory classes but didn&amp;rsquo;t care as much about the practical and programming portions of the curriculum. Going into my junior year, I switched—for the last time—to major in math.&lt;/p&gt;
&lt;p&gt;During the winter of my junior year, I applied to a bunch of math &lt;a class="link" href="https://new.nsf.gov/funding/opportunities/reu-research-experiences-undergraduates" target="_blank" rel="noopener"
&gt;Research Experience for Undergraduates&lt;/a&gt; programs at schools around the country. I was thrilled to be accepted to the program at Rochester Institute of Technology organized by &lt;a class="link" href="https://www.rit.edu/science/directory/dansma-darren-narayan" target="_blank" rel="noopener"
&gt;Darren Narayan&lt;/a&gt; for the summer of 2007.&lt;/p&gt;
&lt;p&gt;I headed to Rochester starry eyed about spending the summer doing math research. On my first day, I found out I&amp;rsquo;d be working with the inimitable mathematician &lt;a class="link" href="https://en.wikipedia.org/wiki/Stanis%C5%82aw_Radziszowski" target="_blank" rel="noopener"
&gt;Stanisław P. Radziszowski&lt;/a&gt; on computational combinatorics with &lt;a class="link" href="https://www.linkedin.com/in/evanheidtmann/" target="_blank" rel="noopener"
&gt;another student researcher&lt;/a&gt;, Evan Heidtmann. The first question Dr. Radziszowski had for us was &amp;ldquo;How are your programming skills?&amp;rdquo; This was not what I expected&amp;ndash;or wanted&amp;ndash;to hear at a math research program.&lt;/p&gt;
&lt;p&gt;We spent the summer investigating the &lt;a class="link" href="https://en.wikipedia.org/wiki/Ramsey%27s_theorem" target="_blank" rel="noopener"
&gt;Ramsey number&lt;/a&gt; $R_4(C_4)$. This is the smallest number $n$ such that a &lt;a class="link" href="https://en.wikipedia.org/wiki/Complete_graph" target="_blank" rel="noopener"
&gt;complete graph&lt;/a&gt; on $n$ vertices whose edges are colored with four colors is &lt;em&gt;guaranteed&lt;/em&gt; to contain a monochromatic cycle on 4 vertices.&lt;/p&gt;
&lt;p&gt;At the time, the number was known to be either 18 or 19. Due to combinatorial explosion, it&amp;rsquo;s not possible to enumerate every four-coloring of a graph this size: a complete graph with 18 vertices has 153 edges, which could be colored in $4^{153}$ ways (ignoring isomorphisms). This is a very, very big number.&lt;/p&gt;
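The arithmetic is easy to check: 153 is just $\binom{18}{2}$, and the resulting count of colorings is astronomically large.

```python
from math import comb

edges = comb(18, 2)        # edges in the complete graph K_18
colorings = 4 ** edges     # raw four-colorings, ignoring isomorphisms

print(edges)               # 153
print(len(str(colorings))) # 93 -- a 93-digit number
```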
&lt;p&gt;Since the number was one of two options, one way to prove it was 19 would be to find a four-coloring of the edges of the complete graph on 18 vertices containing no monochromatic 4-cycle. Evan and I spent the summer hunting for this mythical graph coloring by writing C code to generate and check graphs.&lt;/p&gt;
&lt;p&gt;Before showing up in Rochester, I had no experience writing C code. Fortunately, my brilliant and patient research partner Evan did, and he quickly got me up to speed enough to manipulate and generate graphs using &lt;a class="link" href="https://users.cecs.anu.edu.au/bdm/nauty/" target="_blank" rel="noopener"
&gt;nauty&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I had also never used version control, but Evan was running a version control system (I don&amp;rsquo;t recall which one) on his home server, so we were able to easily version and collaborate on our code.&lt;/p&gt;
&lt;p&gt;We also had access to fifty Unix machines in the RIT computing lab, so this summer introduced me to the Unix command line, shell scripting, and distributed computing. I wrote bash scripts that would send our processing jobs out to these machines and aggregate results back to a host node.&lt;/p&gt;
&lt;p&gt;Well, we failed to solve our problem. It turns out the solution was actually &lt;a class="link" href="https://utilitasmathematica.com/index.php/Index/article/view/481" target="_blank" rel="noopener"
&gt;published by Chinese mathematicians&lt;/a&gt; while we were working on it. $R_4(C_4)=18$, so the graph we looked for didn&amp;rsquo;t even exist.&lt;/p&gt;
&lt;p&gt;Nonetheless, that experience ended up being one of the most important of my higher education. I learned some of the most fundamental skills of my career (version control, working in terminal, writing fast code, and persevering through computer challenges), and I learned to enjoy writing code.&lt;/p&gt;
&lt;p&gt;I didn&amp;rsquo;t write much code between 2007 and starting my operations research master&amp;rsquo;s degree research in 2011, but when I started again, the skills and tenacity I developed at that REU set me up for success in grad school and joining industry as a data scientist in 2012. I wouldn&amp;rsquo;t have picked that research project had I been given a choice; I&amp;rsquo;m so glad I wasn&amp;rsquo;t asked.&lt;/p&gt;</description></item><item><title>AI Assisted Programming with Ravi Mody</title><link>https://tdhopper.com/blog/ai-assisted-programming-with-ravi-mody/</link><pubDate>Wed, 18 Sep 2024 00:00:00 +0000</pubDate><guid>https://tdhopper.com/blog/ai-assisted-programming-with-ravi-mody/</guid><description>&lt;img src="https://tdhopper.com/images/robot-assistant.png" alt="Featured image of post AI Assisted Programming with Ravi Mody" /&gt;&lt;h2 id="listen"&gt;Listen
&lt;/h2&gt;&lt;iframe src="https://open.spotify.com/embed/episode/2B9nyjupOdPUBHhMyEVOmB" width="100%" height="232" frameBorder="0" allowtransparency="true" allow="encrypted-media"&gt;&lt;/iframe&gt;
&lt;h2 id="links"&gt;Links
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://www.linkedin.com/in/ravi-mody/" target="_blank" rel="noopener"
&gt;Ravi Mody on LinkedIn&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://chatgpt.com/" target="_blank" rel="noopener"
&gt;ChatGPT&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://claude.ai/" target="_blank" rel="noopener"
&gt;Claude.ai&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://www.barebones.com/products/bbedit/" target="_blank" rel="noopener"
&gt;BBEdit&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://www.cursor.so/" target="_blank" rel="noopener"
&gt;Cursor&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://www.drewconway.com/zia/2013/3/26/the-data-science-venn-diagram" target="_blank" rel="noopener"
&gt;Drew Conway&amp;rsquo;s Data Science Venn Diagram&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="subscribe"&gt;Subscribe
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://tdhopper.com/podcast/feed" target="_blank" rel="noopener"
&gt;RSS Feed&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://podcasts.apple.com/us/podcast/into-the-hopper/id1499693201" target="_blank" rel="noopener"
&gt;Apple Podcasts&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://open.spotify.com/show/63NrgKMVb0VTwkklGboIjy" target="_blank" rel="noopener"
&gt;Spotify&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://overcast.fm/itunes1499693201/into-the-hopper" target="_blank" rel="noopener"
&gt;Overcast&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="summary"&gt;Summary
&lt;/h2&gt;&lt;p&gt;In this episode of &lt;em&gt;Into the Hopper&lt;/em&gt;, I sit down with Ravi Mody, a data science leader with 17 years of experience across companies like Spotify and Daily Harvest. We dig into how large language models—tools like ChatGPT, Claude, and Cursor—are changing the day-to-day reality of programming.&lt;/p&gt;
&lt;p&gt;Ravi shares his unique perspective as someone who took a four-year break from hands-on coding while in management at Spotify, only to return right as LLMs were taking off. We discuss moving from syntax-level assistance to function-level code generation, the dangers and opportunities for junior developers, and why we&amp;rsquo;re both spending less time on Stack Overflow.&lt;/p&gt;
&lt;h2 id="transcript"&gt;Transcript
&lt;/h2&gt;&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; Welcome to episode nine of the Into the Hopper podcast, back after maybe an 18-month hiatus. My name is Tim Hopper. I&amp;rsquo;m joined today by my friend Ravi Mody. Ravi has 17 years of data science and leadership experience across a variety of companies like Spotify and Daily Harvest.&lt;/p&gt;
&lt;p&gt;Ravi and I have been discussing how large language models are impacting how we program. We&amp;rsquo;re seeing more and more tools like Cursor, and Copilot has actually been out for three years now, which is wild to me. Copilot in particular has become a valuable part of my daily workflow. I&amp;rsquo;m interested in how others are using these tools too. Sometimes it&amp;rsquo;s hard to know how broad the impact is, or if it&amp;rsquo;s just people on social media talking about it. Ravi&amp;rsquo;s own development has been really impacted, and I asked him to come chat about what that looks like day to day.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Ravi:&lt;/strong&gt; Tim, thanks so much for having me. This is definitely a topic I think about a lot.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; Do you want to expand on your intro anymore? I summarized your very long resume in one sentence.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Ravi:&lt;/strong&gt; Yeah, I mean, when I talk about programming, I&amp;rsquo;m largely coming at it from the perspective of building machine learning systems. That&amp;rsquo;s my specialty—usually recommender systems specifically. So some of my use of LLM coding is probably very domain-specific.&lt;/p&gt;
&lt;p&gt;Another piece of context: I&amp;rsquo;ve been programming since I was about 10, but I took a four-year break when I was at Spotify. I was there until February of 2024, and for those four years, I didn&amp;rsquo;t really program. I was a people manager. It was an amazing job, but when I joined Raya, my current company, I was rusty at programming. So the timing of this new generation of LLMs was really interesting.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; That&amp;rsquo;s actually really helpful context. The timing is good. At the most basic layer, these tools are just being that mental assist and not requiring you to remember all the little things about syntax—things that might have been easy to forget over four years.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Ravi:&lt;/strong&gt; For sure. On the syntax side, I save a ton of time compared to Stack Overflow or even going through docs. I still think it&amp;rsquo;s valuable to have some muscle memory around common functions, but Cursor and these other tools make it so easy to just say what you want and get functional code out of it.&lt;/p&gt;
&lt;hr&gt;
&lt;h3 id="current-toolset"&gt;Current Toolset
&lt;/h3&gt;&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; Why don&amp;rsquo;t we start with an overview of your current tools? Obviously for all of us, these tool chains are changing quickly. People are jumping back and forth between Anthropic and OpenAI as their chat tools improve. What&amp;rsquo;s your current setup?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Ravi:&lt;/strong&gt; I have access to ChatGPT, Claude 3.5, and Cursor. I go back and forth between them. The honest truth is I&amp;rsquo;m actually very comfortable with a basic text editor. I&amp;rsquo;m not a backend engineer—I&amp;rsquo;m really building ML models. So I&amp;rsquo;ve always been most comfortable in almost a notepad-style of programming. I use BBEdit.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; BBEdit&amp;rsquo;s your primary editor?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Ravi:&lt;/strong&gt; Yeah, it has been for most of my career.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; I don&amp;rsquo;t think you&amp;rsquo;ve ever mentioned that. That&amp;rsquo;s interesting.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Ravi:&lt;/strong&gt; It&amp;rsquo;s kind of embarrassing, honestly. I don&amp;rsquo;t like to tell people. I think it makes you a different kind of programmer if you&amp;rsquo;re not using a real IDE.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; A lot of people have really relied on it. So BBEdit doesn&amp;rsquo;t currently have any copilot or cursor-type integration in the editor—you&amp;rsquo;re going to the chat interfaces directly?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Ravi:&lt;/strong&gt; Yeah. I&amp;rsquo;m honestly not even sure if it does. I&amp;rsquo;m not a power user of BBEdit. I literally just use it as a notepad with syntax highlighting.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; That&amp;rsquo;s wild. I&amp;rsquo;m very interested to hear that. So on a day to day, what are you turning to AI tools to assist with?&lt;/p&gt;
&lt;hr&gt;
&lt;h3 id="thinking-at-the-function-level"&gt;Thinking at the Function Level
&lt;/h3&gt;&lt;p&gt;&lt;strong&gt;Ravi:&lt;/strong&gt; One thing this means is that I&amp;rsquo;m not really getting code completion. Sometimes I&amp;rsquo;ll use Cursor because there is a little bit of friction otherwise. But I&amp;rsquo;ve never found that the speed of my typing is what really slows me down on projects. It&amp;rsquo;s usually one level of abstraction higher—taking a more complex idea and working at the level of functions and classes.&lt;/p&gt;
&lt;p&gt;For example, if I need a function that transforms a dataset, normally I would have cleared an hour of my calendar, designed it out, whiteboarded it, and then typed it. The typing is like 10% of that time. Where I&amp;rsquo;m most comfortable is co-programming with the chat program. I say what I need, look at the code it generates, and essentially do a code review on it. I&amp;rsquo;m usually generating at the function level.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Hours become five to ten minutes now.&lt;/strong&gt; That&amp;rsquo;s where I&amp;rsquo;ve been finding my productivity multiples. It&amp;rsquo;s not completing a little piece of syntax—it&amp;rsquo;s thinking at the function and class level.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; That&amp;rsquo;s interesting. I do a lot of that, especially for utilities—trying to transform a file from one format to another, or restructuring a CSV. Just &amp;ldquo;give me a DuckDB command that&amp;rsquo;s going to restructure this the way I want.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Ravi:&lt;/strong&gt; One of my favorite things now is exactly that. My least favorite programming was always changing formats or saving objects. I love watching it churn out the code I hated writing—&amp;ldquo;save this JSON file to this folder and then load it from that folder.&amp;rdquo; I&amp;rsquo;m trying to automate away the parts I just did not enjoy about the job.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; It&amp;rsquo;s interesting historically how the ability to do that has been such a superpower. Going back to Drew Conway&amp;rsquo;s data science Venn diagram from 13 years ago—that hacking skill, being able to manipulate data. So many people just don&amp;rsquo;t have that paradigm. It&amp;rsquo;ll be interesting to see if this opens the door for more people to do that kind of thing.&lt;/p&gt;
&lt;p&gt;I was just doing something similar for personal reasons. My county lets you download all the real estate records in one massive 400-megabyte Excel file. It&amp;rsquo;s impossible to work with, but DuckDB is perfect for this. In the past, I&amp;rsquo;ve spent so much time hacking around with stuff like that. I kind of enjoy it in some ways, but now I just describe what I want and get those DuckDB commands out. It&amp;rsquo;s reinvigorating those little side trails that I did a lot of in the past but now—with a busy job and four kids—I just don&amp;rsquo;t want to spend time on anymore.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Ravi:&lt;/strong&gt; Exactly. If I&amp;rsquo;m going to take two hours on a data project, how do I want to spend that time? I used to love being in the low-level code, but one of the things I&amp;rsquo;ve realized in my career is &lt;strong&gt;the higher level of abstraction you can think, the more productive you&amp;rsquo;ll be&lt;/strong&gt;. Forcing myself to assume it can take care of the basic code, and then thinking one level of abstraction higher about how all the pieces connect—that&amp;rsquo;s been the biggest game changer for me.&lt;/p&gt;
&lt;hr&gt;
&lt;h3 id="code-completion-and-reduced-mental-overhead"&gt;Code Completion and Reduced Mental Overhead
&lt;/h3&gt;&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; I&amp;rsquo;m actually surprised you&amp;rsquo;re not using it more at the code completion level. That&amp;rsquo;s been the biggest change for me. I&amp;rsquo;ve always relied on some level of code completion in an editor, even just really dumb things like completing variable names. But I find it&amp;rsquo;s so good at recognizing patterns.&lt;/p&gt;
&lt;p&gt;In Python, for example, there&amp;rsquo;s no great auto-importer. You use some new method from another package, and when I type it in, VS Code&amp;rsquo;s &amp;ldquo;fix this missing import&amp;rdquo; suggestion is really bad at figuring out what I need. But with Copilot, I jump up to my imports and Copilot almost always knows exactly what import I was trying to add. I don&amp;rsquo;t have to type out &amp;ldquo;from pandas import something.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Ravi:&lt;/strong&gt; There&amp;rsquo;s also something almost magical about when it types exactly what you were going to type. I do pull up Cursor every now and then when I need to get lower level, and at that point the code completion is really valuable.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; The other thing is if you&amp;rsquo;re drawing some kind of structure. I often have something written out in JSON or YAML and I want to write that as Python objects. I&amp;rsquo;ll paste it into my editor, comment it out, start typing the Python, and it figures out really quickly how to translate that structure. I&amp;rsquo;ve spent many hours writing regex to do that kind of thing. Now it just does it.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Ravi:&lt;/strong&gt; One thing I&amp;rsquo;ve been really enjoying is—and this is partially because of the chat interface—I&amp;rsquo;ll write a simple function and then, you know, in Python the typing and doc strings take quite a bit of time to type out. So what I&amp;rsquo;ve sometimes done is have a skeleton of a function and throw it into Claude, saying &amp;ldquo;This is what I&amp;rsquo;m trying to get out of this. Give me any advice, but also just write all of this boilerplate stuff.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;I think of that as code completion at the function level. I&amp;rsquo;m asking it to think cohesively about that function: &amp;ldquo;This is what the function needs to do, this is the context of why I&amp;rsquo;m doing it.&amp;rdquo; Something that would have taken me half an hour takes a minute now.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; I do that a lot with doc strings. I like to add them in theory, but I actually hate writing them. You can go to the top of a function and Copilot can usually figure out fairly accurately a doc string and even generate examples. I&amp;rsquo;m not writing big libraries used by lots of people—it&amp;rsquo;s little internal things. How much is it worth spending time on? But when it can do it for you, it&amp;rsquo;s pretty nice.&lt;/p&gt;
&lt;hr&gt;
&lt;h3 id="implications-for-junior-developers"&gt;Implications for Junior Developers
&lt;/h3&gt;&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; You&amp;rsquo;re in a senior position in your current role but still doing heavy IC work. How are you thinking through these tools in the hands of someone straight out of college? I know that&amp;rsquo;s been a concern people have had.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Ravi:&lt;/strong&gt; That&amp;rsquo;s a great question. This AI stuff, at least at this point where it&amp;rsquo;s still error-prone and can&amp;rsquo;t think about large systems cohesively—it&amp;rsquo;s a huge double-edged sword. Clearly a lot of us are finding huge speed-ups in our programming, but it also makes it really easy to not look at your code at all, to just generate something, run it, and if it works, move on.&lt;/p&gt;
&lt;p&gt;This is particularly dangerous for people who don&amp;rsquo;t have the experience of understanding edge cases where things may fail, or understanding inefficiencies in the code.&lt;/p&gt;
&lt;p&gt;I mentioned earlier that I always code review anything it generates. I do a little mini PR. I go through it. Even with Claude, which is one of the more advanced LLMs right now, almost always it&amp;rsquo;s going to do something where I&amp;rsquo;m like, &amp;ldquo;Hey, take a look at that.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;With more junior employees or people straight out of college who don&amp;rsquo;t have that experience, code review will still be important. As I start to hire people for whom this is a bigger risk, I&amp;rsquo;m probably going to try to set up something where they make it clear when something was LLM-generated. During code review we can pay more attention to that.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;At this moment in time, we cannot 100% trust LLMs, and we need to embrace that it&amp;rsquo;s not completely doing our job right now.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; At the same time, there&amp;rsquo;s really no going back. We can&amp;rsquo;t pretend people aren&amp;rsquo;t going to use these tools. Being explicit about it is going to be essential—being open about it and trying to give people good guidance.&lt;/p&gt;
&lt;p&gt;We&amp;rsquo;ve had the same concerns over the years about people copying and pasting code snippets from Stack Overflow that they didn&amp;rsquo;t understand. I&amp;rsquo;ve done it many times. In some ways, we&amp;rsquo;re just speeding up the ability to do that, with some possible other risks. Teams and leaders are going to have to be very open about it. That idea of indicating it in code review is good—teaching people to look over their own code, not just take the generated code for granted.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Ravi:&lt;/strong&gt; That&amp;rsquo;s a great point—we&amp;rsquo;ve been doing this forever with Stack Overflow. We copy and paste code in. It&amp;rsquo;s always been best practice to understand the code, to even cite where you got it from.&lt;/p&gt;
&lt;hr&gt;
&lt;h3 id="shadow-llm-use"&gt;Shadow LLM Use
&lt;/h3&gt;&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; There&amp;rsquo;s an interesting issue of companies being hesitant about this. There are legitimate legal concerns and related issues, but at every company now there&amp;rsquo;s going to be this shadow LLM use where you can&amp;rsquo;t talk about it because maybe you&amp;rsquo;re not allowed to use it. That has to be an incentive for companies to figure it out. They can do everything they want to lock things down, but if people can&amp;rsquo;t do it on their work computers, they&amp;rsquo;re pulling up their phones and using ChatGPT.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Ravi:&lt;/strong&gt; Absolutely. It&amp;rsquo;s so interesting to see how different companies are approaching this. Some are fully embracing it—internal tools that make it easy to use LLMs and ensure the code&amp;rsquo;s not getting shared back inappropriately. But definitely there are companies just trying to ban it, and I think that&amp;rsquo;s a huge mistake. You&amp;rsquo;re going to make your employees less productive, plus they&amp;rsquo;re just going to go off and do it on the side anyway.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; I know for a fact that&amp;rsquo;s happening. I talk to a lot of people doing that kind of thing. There&amp;rsquo;s basically no way to stop it unless you&amp;rsquo;re putting people in an air-gapped lab. If they have their phones, they have access to these tools. People are using Slack to copy and paste something and then open it up on their phone.&lt;/p&gt;
&lt;p&gt;I work at a bank where we have particular data privacy risks, but I&amp;rsquo;ve been very impressed—they&amp;rsquo;ve been making an effort. We explored Amazon&amp;rsquo;s CodeWhisperer offering, did a company-blessed trial, but it ended up not being very good. Now we&amp;rsquo;re doing a trial with Copilot, and I&amp;rsquo;ve been able to be on it. I love having access to it at work. I get a lot of speed-up from it, and just the ability to free up brain cycles is really nice.&lt;/p&gt;
&lt;p&gt;Imagining going to another company with total prohibition just feels crazy to me. I assume companies are going to start seeing pushback in hiring—people are going to ask, &amp;ldquo;Can we use these tools?&amp;rdquo;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Ravi:&lt;/strong&gt; As you said, there&amp;rsquo;s no going back. Increasingly it&amp;rsquo;s just going to become a normal part of the dev experience. It would be the equivalent of saying, &amp;ldquo;Hey, you can&amp;rsquo;t use an IDE to program.&amp;rdquo;&lt;/p&gt;
&lt;hr&gt;
&lt;h3 id="where-llms-fall-short"&gt;Where LLMs Fall Short
&lt;/h3&gt;&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; Have you come across patterns or particular problems where the tools have fallen short? Things you know you aren&amp;rsquo;t going to be able to get answers for?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Ravi:&lt;/strong&gt; All the time. I don&amp;rsquo;t think the current best-in-class LLMs are capable of reasoning at a system level very well. As the codebase I share gets larger and more complex, as I try to fill the context with that, it starts losing track of the bigger picture. If you think about what an LLM is doing, they&amp;rsquo;re not really designed to think at that system level. They&amp;rsquo;re still really good at thinking in terms of paragraphs instead of books, for example.&lt;/p&gt;
&lt;p&gt;I think this is changing quickly. I&amp;rsquo;m really curious about how chain of thought changes this. I&amp;rsquo;m sure there&amp;rsquo;s a near-term future where LLMs can actually run code and see what happens. But one big limitation right now is that system level. I&amp;rsquo;d like to think in terms of abstractions and go up levels. I&amp;rsquo;m really excited for the day where I can think in terms of classes working together instead of going into the class and working on that.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; We probably should have opened this by explaining what models we&amp;rsquo;re on now for people listening in the future.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Ravi:&lt;/strong&gt; Yeah: Claude 3.5, GPT-4, and o1.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; The thing that&amp;rsquo;s been interesting to me recently—I don&amp;rsquo;t get to use Cursor in a work context, only in personal projects—but Cursor has this Composer element where you&amp;rsquo;re giving it free rein to understand the whole repository and make changes across multiple files.&lt;/p&gt;
&lt;p&gt;I have personal website projects I&amp;rsquo;d like to do, but I don&amp;rsquo;t have front-end knowledge and very limited JavaScript knowledge. I&amp;rsquo;ve been trying to hack around with what people are doing—bootstrapping an entire project in Cursor and having it generate and update stuff. Every time I&amp;rsquo;ve done it, I&amp;rsquo;ve hit a wall where it starts cycling through wrong solutions to something.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Ravi:&lt;/strong&gt; Yeah.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; I haven&amp;rsquo;t spent much time on it—literally hours. But I&amp;rsquo;ve gotten way further than I ever would in my free time trying to do it by hand. I spent many Christmas breaks over the years trying to teach myself enough JavaScript to get something running. I&amp;rsquo;ve gotten to where I actually have a functional React app, for example. But then interacting with data storage layers—Supabase, SQLite cloud, these pseudo-serverless free-tier cloud storage things—it ends up falling on its face.&lt;/p&gt;
&lt;p&gt;The point is that it&amp;rsquo;s trying to look at the bigger picture in a way that Copilot has been able to do for a bit, but Cursor is trying to take that to another level. Even if it isn&amp;rsquo;t perfect at writing new code, just the ability to open a repository and have an LLM understand it and explain it to you—&lt;strong&gt;how many of us have spent so many hours coming into a company, trying to understand the codebase?&lt;/strong&gt; I think it&amp;rsquo;s going to be really valuable to just open it up and ask your robot, &amp;ldquo;What is this codebase? What parts are important? How do I even run this?&amp;rdquo;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Ravi:&lt;/strong&gt; I love it for that. One thing I&amp;rsquo;ll often do in a Claude session is load it up with multiple parts of my code and give a brief description: &amp;ldquo;I&amp;rsquo;m building a recommender system, here&amp;rsquo;s some context about it.&amp;rdquo; It can often understand the individual pieces and how they roughly go together. I can ask it to give me code to run something or explain how a functionality works.&lt;/p&gt;
&lt;p&gt;But there is a major danger zone in current LLMs, and that&amp;rsquo;s the context limit. I call it &lt;strong&gt;the LLM going off the rails&lt;/strong&gt;: at some point its error rate climbs and its apparent intelligence drops. It&amp;rsquo;s important to understand that limitation. LLMs struggle with larger and larger contexts; if you load too much code, the model starts losing the thread.&lt;/p&gt;
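&lt;p&gt;A rough way to stay on the safe side of that limit is to estimate token counts before pasting. The four-characters-per-token ratio below is a common heuristic, not any model&amp;rsquo;s real tokenizer, and the default window size is just an illustrative assumption:&lt;/p&gt;

```python
# Rough pre-flight check before pasting a large codebase into an LLM chat.
# Assumes ~4 characters per token, a common heuristic; real tokenizers
# vary by model and by content (code often tokenizes differently).

def estimate_tokens(text: str) -> int:
    """Estimate how many tokens a blob of text or code will consume."""
    return max(1, len(text) // 4)

def fits_in_context(text: str, context_tokens: int = 128_000,
                    reserve_for_reply: int = 4_000) -> bool:
    """True if the text likely fits, leaving room for the model's reply."""
    return context_tokens - reserve_for_reply >= estimate_tokens(text)
```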
&lt;p&gt;Also, as you talk to it more and iterate more, it starts struggling. I ran an experiment a couple of days ago where I had a function I was trying to speed up. I could already see three places I could optimize it—almost like an interview question. I asked, &amp;ldquo;Tell me some ways I can speed it up.&amp;rdquo; It started rewriting it. I could run the code and see the time—maybe it starts at 200 milliseconds, gets to 180, 150. It&amp;rsquo;s actually doing something.&lt;/p&gt;
&lt;p&gt;Then I keep telling it, &amp;ldquo;Can you make it faster?&amp;rdquo; And it&amp;rsquo;s always like, &amp;ldquo;Of course!&amp;rdquo; But then you see the time actually start going up—250, 300 milliseconds. There&amp;rsquo;s a certain overconfidence in these LLMs. It&amp;rsquo;s like, &amp;ldquo;Yeah, I can do whatever you ask me.&amp;rdquo;&lt;/p&gt;
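&lt;p&gt;The saving grace in that experiment is the measurement itself. A tiny harness like the following (the functions are illustrative stand-ins, not code from the conversation) makes &amp;ldquo;can you make it faster?&amp;rdquo; falsifiable: keep an LLM rewrite only if it is both correct and measurably quicker than the baseline:&lt;/p&gt;

```python
import timeit

def baseline(n):
    # Deliberately slow: repeated membership tests on a list.
    seen = []
    for i in range(n):
        if i % 3 not in seen:
            seen.append(i % 3)
    return seen

def rewrite(n):
    # Candidate optimization (the kind an LLM might suggest): use a set.
    return sorted({i % 3 for i in range(n)})

def time_ms(fn, n=10_000, repeats=5):
    """Best-of-N wall-clock time for fn(n), in milliseconds."""
    return min(timeit.repeat(lambda: fn(n), number=1, repeat=repeats)) * 1000

# Accept the rewrite only if it matches the baseline's output AND the
# measured time actually went down -- not just because the LLM said so.
```

&lt;p&gt;Re-running the same measurement after every &amp;ldquo;make it faster&amp;rdquo; round is what exposes the moment the numbers start creeping back up.&lt;/p&gt;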
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; That&amp;rsquo;s an area where you can&amp;rsquo;t check out; your brain has to keep working. It&amp;rsquo;s going to be interesting, and we&amp;rsquo;re at such an early stage.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Ravi:&lt;/strong&gt; It&amp;rsquo;ll be interesting to listen to this in a few years and just be like, &amp;ldquo;Wow, I can&amp;rsquo;t believe how basic it was.&amp;rdquo;&lt;/p&gt;
&lt;hr&gt;
&lt;h3 id="concerns-about-skill-atrophy"&gt;Concerns About Skill Atrophy
&lt;/h3&gt;&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; Do you have any concerns about losing particular skills? You&amp;rsquo;re thinking about generating code at the function level—are you worried it&amp;rsquo;s making you more rusty? Or are you comfortable with the knowledge and experience you have being able to dive in when you need to?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Ravi:&lt;/strong&gt; That&amp;rsquo;s a good question. I feel like I should be worried about it. But honestly, because I don&amp;rsquo;t think algorithms are getting worse, there are certain things that I&amp;rsquo;m slowly deciding I no longer need to know. I now trust it to abstract some things away from me.&lt;/p&gt;
&lt;p&gt;A concrete example is data visualization. When I moved from R to Python, ggplot was torn away from me. I loved ggplot—it was a grammar of graphics. Once you know the language, everything is intuitive. I never found that with Python. With Python, data visualization was always &amp;ldquo;look at the docs, read the docs, understand it&amp;rdquo; with any of the popular libraries.&lt;/p&gt;
&lt;p&gt;Now I find LLMs are perfectly capable of taking a description of what I want—something I would have mapped into ggplot language—and building the code. I&amp;rsquo;m okay with being rusty at putting a visualization into code. Increasingly, these small three or four line pieces of code, I&amp;rsquo;m going to trust them. I can go back and read it and understand what I was doing, but I&amp;rsquo;m okay not knowing that syntax. I&amp;rsquo;m okay just looking at the visualization and saying &amp;ldquo;cool, it did what I wanted.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; The flip side is that you also have the ability to ask the LLM to go back and explain what you did.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Ravi:&lt;/strong&gt; That&amp;rsquo;s a great point.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; I&amp;rsquo;ve done that with SQL queries. It doesn&amp;rsquo;t always get it perfectly, but I deal with a lot of SQL in my current role and it can be hard to understand sometimes. Just getting the assistant to paste something in and say &amp;ldquo;can you add comments to this, help me explain it&amp;rdquo;—without changing anything—is really powerful.&lt;/p&gt;
&lt;p&gt;Again, it&amp;rsquo;s really hard to know how this is going to impact us in the long run. We have the great benefit of having learned the foundations. To younger listeners, I would encourage you to build your foundations and don&amp;rsquo;t punt too much on what you know. But at the same time, we&amp;rsquo;ve got these assets that are huge, and don&amp;rsquo;t ignore that either.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Ravi:&lt;/strong&gt; I completely agree. I don&amp;rsquo;t want to sound like an old dude lecturing young people, but there&amp;rsquo;s a real danger that you get mid-level in your career and it turns out you don&amp;rsquo;t really know what you&amp;rsquo;re doing. I think that wasn&amp;rsquo;t really possible when we were starting off. That&amp;rsquo;s becoming increasingly possible. You can ship error-prone code. You can not actually be able to explain your code. There are so many danger points.&lt;/p&gt;
&lt;hr&gt;
&lt;h3 id="the-foundation-of-math-and-learning"&gt;The Foundation of Math and Learning
&lt;/h3&gt;&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; My background was in math. There&amp;rsquo;s been a lot of discussion recently about LLMs for math, and Terry Tao, one of the greatest living mathematicians, has actually been digging into LLMs and is very optimistic about their benefits for math.&lt;/p&gt;
&lt;p&gt;I imagine having studied math—it&amp;rsquo;s in many ways an extension of your seventh-grade teacher telling you not to use the calculator for everything. Think of those hours spent toiling away in the library struggling with proofs; today, an LLM could probably solve almost every proof set before an undergrad. Learning how to reason through those problems was what made me successful in math, and I see it as a huge benefit to my whole career—just the way I learned how to solve math problems.&lt;/p&gt;
&lt;p&gt;I don&amp;rsquo;t really know what the solution is, but I&amp;rsquo;m glad I didn&amp;rsquo;t have what could be a crutch. At the same time, if I were a professional mathematician today, I would absolutely be looking at LLMs—if nothing else, to help with my LaTeX typesetting, but also for problem solving itself. It&amp;rsquo;s an interesting balance we&amp;rsquo;re going to have to figure out.&lt;/p&gt;
&lt;p&gt;Have you looked at any of the AI code review tools yet?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Ravi:&lt;/strong&gt; I probably should. What&amp;rsquo;s out there?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; I haven&amp;rsquo;t used them either. I&amp;rsquo;ve only heard people talk about them. There are integrations with GitHub that go in and add comments. I don&amp;rsquo;t know how you tune those, but I think it&amp;rsquo;ll be interesting in the future.&lt;/p&gt;
&lt;p&gt;One thing I&amp;rsquo;m really excited about—hopefully going to have another episode soon with a mutual friend of ours who&amp;rsquo;s running a small company and has been able to do way more because of LLM-assisted development. I think he&amp;rsquo;s only himself and one other engineer.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;m really interested in how this is going to enable entrepreneurs and small teams to develop things. Code review is really valuable, and if an LLM is good enough to provide useful feedback as an individual developer—you don&amp;rsquo;t always need code review, but it&amp;rsquo;s easy to miss things.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Ravi:&lt;/strong&gt; One of the nice things about code review is that with code generation, if it generates wrong code with edge cases, that can be really dangerous. But with code review, the false positives are not a big deal. If it says &amp;ldquo;this function doesn&amp;rsquo;t look right&amp;rdquo; but it&amp;rsquo;s actually right, maybe you wasted 30 seconds. I can imagine it will be much more thorough than the typical PR review I&amp;rsquo;m used to—&amp;ldquo;looks good to me&amp;rdquo; or &amp;ldquo;hey, why&amp;rsquo;d you do it this way?&amp;rdquo;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; And the possibility of being more thorough. People used to put in things like McCabe complexity checks—very formulaic. The ability for something to think with more context and say &amp;ldquo;maybe this is too complex and we could split it up&amp;rdquo; would be really cool. The next step beyond that is maybe automated suggested changes—open a PR into your PR with recommended changes.&lt;/p&gt;
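&lt;p&gt;Those formulaic checks are simple enough to sketch: McCabe&amp;rsquo;s metric is essentially one plus the number of branch points in a function. The version below is a simplified approximation using Python&amp;rsquo;s standard-library &lt;code&gt;ast&lt;/code&gt; module, not the exact rule set real linters implement:&lt;/p&gt;

```python
import ast

# Node types treated as branch points -- a simplified subset of what
# real McCabe checkers (e.g., the mccabe plugin for flake8) count.
_BRANCHES = (ast.If, ast.For, ast.While, ast.Try, ast.BoolOp, ast.IfExp)

def complexity(source: str) -> dict:
    """Map each function name in `source` to an approximate McCabe score."""
    scores = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef):
            branches = sum(isinstance(child, _BRANCHES)
                           for child in ast.walk(node))
            scores[node.name] = 1 + branches
    return scores
```

&lt;p&gt;The contrast with an LLM reviewer is exactly that this counts syntax nodes with no idea what the code means.&lt;/p&gt;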
&lt;p&gt;&lt;strong&gt;Ravi:&lt;/strong&gt; Just click through the changes. It&amp;rsquo;s interesting to see how we&amp;rsquo;re slowly iterating towards more of our code being written and reasoned about by AI. We&amp;rsquo;re clearly not there, but I think we&amp;rsquo;ll keep seeing iteratively the LLMs being able to take over more of it. I&amp;rsquo;m very curious if there&amp;rsquo;s an endpoint, or maybe 10 years from now they&amp;rsquo;re just doing our jobs for us. But we&amp;rsquo;re just in the early stages where it&amp;rsquo;s a huge boost to our productivity, and we&amp;rsquo;re still ultimately in charge of our code.&lt;/p&gt;
&lt;hr&gt;
&lt;h3 id="looking-forward"&gt;Looking Forward
&lt;/h3&gt;&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; I was going to conclude by asking if you had any forecasts. Any hope of 10 years seems like an insane time window, but two-year or five-year predictions?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Ravi:&lt;/strong&gt; I think there&amp;rsquo;s probably a ton of low-hanging fruit specifically in our space. One thing I mentioned earlier is LLMs actually running code. I&amp;rsquo;m sure that&amp;rsquo;s being done. But I think as we treat LLMs more as an agent in a codebase, even the models that exist now can probably do more.&lt;/p&gt;
&lt;p&gt;Clearly the models are getting better. We&amp;rsquo;re throwing a ton of money into making these models reason better. It seems like the more computation we throw at it, the better it gets. I&amp;rsquo;m sure we&amp;rsquo;ll hit limits with the current technology—I&amp;rsquo;m deeply skeptical that we&amp;rsquo;re going to hit general intelligence with what we have now. But for the last two years, we&amp;rsquo;ve seen consistent, huge jumps in performance. I believe very much in momentum. I think we&amp;rsquo;re on the path towards LLMs owning more of a codebase. We&amp;rsquo;re not there yet, but I think we&amp;rsquo;re moving in that direction.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; It&amp;rsquo;ll be fun to see. It&amp;rsquo;s very interesting, and very hard to make predictions. There are a lot of crazy predictions out there.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Ravi:&lt;/strong&gt; Again, I don&amp;rsquo;t want to come back and listen to this and be like, &amp;ldquo;Wow, I was so wrong.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; Could we have even imagined this? You&amp;rsquo;ve been doing machine learning a few years longer than me, but in 2010, could we have imagined this? It seems wild. A &amp;ldquo;language model&amp;rdquo; back then was a Markov chain model. They were so dumb. You&amp;rsquo;re thinking about language in terms of bag of words. Even 10 years ago, I was working on really advanced Bayesian models for language, and they had such a simplistic understanding of the language.&lt;/p&gt;
&lt;p&gt;Just the ability to generate text that even looks sensible, much less actually is sensible, is wild. It&amp;rsquo;s going to be really cool to see what that turns into.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Ravi:&lt;/strong&gt; GPT-2 was released almost exactly five years ago, and everybody thought it was kind of a toy or a joke. That was actually very much the same technology. It&amp;rsquo;s wild to see what five years of just pushing on the same concepts has gotten us, let alone ten.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim:&lt;/strong&gt; Very cool. All right, well if you don&amp;rsquo;t have anything else to add, we&amp;rsquo;ll wrap up there. This has been a great discussion. I appreciate your perspective.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Ravi:&lt;/strong&gt; Likewise. Thanks for having me.&lt;/p&gt;</description></item><item><title>How to use streaming machine learning for real-time data</title><link>https://tdhopper.com/talks/streaming-machine-learning/</link><pubDate>Thu, 08 Aug 2024 00:00:00 +0000</pubDate><guid>https://tdhopper.com/talks/streaming-machine-learning/</guid><description>&lt;img src="https://tdhopper.com/preso.png" alt="Featured image of post How to use streaming machine learning for real-time data" /&gt;&lt;div class="video-wrapper"&gt;
&lt;iframe loading="lazy"
src="https://www.youtube.com/embed/Y1-Ls5zt1h0"
allowfullscreen
title="YouTube Video"
&gt;
&lt;/iframe&gt;
&lt;/div&gt;</description></item><item><title>How I am Productive</title><link>https://tdhopper.com/blog/how-i-am-productive/</link><pubDate>Mon, 13 May 2024 14:14:00 +0000</pubDate><guid>https://tdhopper.com/blog/how-i-am-productive/</guid><description>&lt;img src="https://tdhopper.com/images/todo.png" alt="Featured image of post How I am Productive" /&gt;&lt;p&gt;I consider myself a productive and organized person. At a minimum, people come to me for things because they know I get things done. This doesn&amp;rsquo;t come naturally to me, and it&amp;rsquo;s a skill I&amp;rsquo;ve developed over the past 15 years.&lt;/p&gt;
&lt;p&gt;Here are some of the things I do to be productive:&lt;/p&gt;
&lt;h2 id="write-things-down"&gt;Write things down
&lt;/h2&gt;&lt;p&gt;I loosely follow the &lt;a class="link" href="https://gettingthingsdone.com" target="_blank" rel="noopener"
&gt;Getting Things Done&lt;/a&gt; method developed by David Allen. The most important aspect to me has been capturing tasks into a single inbox. Capturing tasks means I don&amp;rsquo;t try to keep them in my head (which is unreliable and burns a lot of brain power trying to remember things). Capturing to a single inbox means tasks won&amp;rsquo;t get lost in piles of Post-Its and napkins.&lt;/p&gt;
&lt;p&gt;I use the &lt;a class="link" href="https://getdrafts.com" target="_blank" rel="noopener"
&gt;Drafts&lt;/a&gt; app on iOS and macOS, which lets me capture notes with zero friction. If it&amp;rsquo;s a task I&amp;rsquo;m writing down, I use a Drafts action to send it to my &lt;a class="link" href="https://www.omnigroup.com/omnifocus" target="_blank" rel="noopener"
&gt;Omnifocus&lt;/a&gt; inbox. I regularly triage my inbox to organize my tasks into relevant projects and contexts (e.g., tasks that can be done on the computer and tasks to do at home).&lt;/p&gt;
&lt;h2 id="using-nagging-reminders"&gt;Using nagging reminders
&lt;/h2&gt;&lt;p&gt;Tasks that &lt;em&gt;must&lt;/em&gt; be done at a certain time aren&amp;rsquo;t a good fit for my Omnifocus task management system. For example, my trash must go out on Monday nights, and I need to be reminded until it happens. I love the &lt;a class="link" href="https://www.dueapp.com" target="_blank" rel="noopener"
&gt;Due app&lt;/a&gt; on iOS, which will remind me repeatedly until the task is complete.&lt;/p&gt;
&lt;h2 id="tame-email"&gt;Tame email
&lt;/h2&gt;&lt;p&gt;I keep my email tidy. A big part of this is using &lt;a class="link" href="https://www.sanebox.com" target="_blank" rel="noopener"
&gt;Sanebox&lt;/a&gt; to keep unimportant emails out of my inbox; Sanebox automatically detects, labels, and archives inbox clutter like receipts, marketing, and newsletters. These typically don&amp;rsquo;t require any action and can be triaged as necessary. I aggressively hit the &amp;ldquo;Unsubscribe&amp;rdquo; button on things I don&amp;rsquo;t need to be bothered with (including most marketing).&lt;/p&gt;
&lt;p&gt;When an email can be handled quickly (within a minute or two), I try to handle it as soon as I receive it. If an email requires work to follow up on, I forward it to my &lt;a class="link" href="https://www.omnigroup.com/omnifocus" target="_blank" rel="noopener"&gt;Omnifocus&lt;/a&gt; inbox and make it a task.&lt;/p&gt;
&lt;p&gt;When this works well, my email inbox is normally empty.&lt;/p&gt;
&lt;h2 id="forget-passwords"&gt;Forget passwords
&lt;/h2&gt;&lt;p&gt;I have hundreds (thousands?) of logins to websites and apps. Reusing one password across all of them would be a notorious security risk.&lt;/p&gt;
&lt;p&gt;Every password I use lives in my &lt;a class="link" href="https://1password.com" target="_blank" rel="noopener"
&gt;1Password&lt;/a&gt; account. I also use it to manage 2-factor authentication pins, credit cards, WiFi passwords, and personal information (social security numbers, driver&amp;rsquo;s license scans, etc.). Using the 1Password keyboard shortcuts, I can quickly log in to any of my accounts.&lt;/p&gt;
&lt;p&gt;I know my 1Password master passwords and almost no other passwords to my accounts. They&amp;rsquo;re typically long, randomly generated mixtures of characters.&lt;/p&gt;
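&lt;p&gt;Generating that kind of password yourself is a few lines with Python&amp;rsquo;s &lt;code&gt;secrets&lt;/code&gt; module, the standard library&amp;rsquo;s cryptographically secure random source (the character set and length here are just example choices):&lt;/p&gt;

```python
import secrets
import string

def make_password(length: int = 24) -> str:
    """Generate a long, random, mixed-character password."""
    # Use `secrets`, not `random`: the latter is predictable and
    # unsuitable for anything security-sensitive.
    alphabet = string.ascii_letters + string.digits + "-_!@#%^*"
    return "".join(secrets.choice(alphabet) for _ in range(length))
```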
&lt;p&gt;1Password also lets me have a shared vault with my wife, which is invaluable for our shared services.&lt;/p&gt;
&lt;h2 id="keep-slack-under-control"&gt;Keep Slack under control
&lt;/h2&gt;&lt;p&gt;Like many tech workers, much of my day is spent in Slack. A few years ago, I stumbled upon &lt;a class="link" href="https://randsinrepose.com/archives/how-i-slack-2022/" target="_blank" rel="noopener"
&gt;Michael Lopp&amp;rsquo;s guide to optimizing Slack&lt;/a&gt;. His approach greatly improved my experience with Slack and kept it from being a constant distraction. &lt;a class="link" href="https://randsinrepose.com/archives/how-i-slack-2022/" target="_blank" rel="noopener"
&gt;Go read his post.&lt;/a&gt;&lt;/p&gt;
&lt;h2 id="dont-waste-15-minutes"&gt;Don&amp;rsquo;t waste 15 minutes
&lt;/h2&gt;&lt;p&gt;Like the next guy, I&amp;rsquo;m often inclined to spend 15 minutes of downtime scrolling Instagram. However, I once heard advice from a productive person that a key to productivity is not wasting those short intervals.&lt;/p&gt;
&lt;p&gt;When I have free minutes (sitting in a waiting room or during a short interval between meetings), I look for a task I can knock off my to-do list. I&amp;rsquo;m regularly surprised that daunting tasks can be finished in little time.&lt;/p&gt;
&lt;h2 id="dont-waste-idle-brain-cycles"&gt;Don&amp;rsquo;t waste idle brain cycles
&lt;/h2&gt;&lt;p&gt;Some tasks (like writing this blog post) are hard to complete in one sitting. I have difficulty developing new ideas in front of my computer with all its distractions. On the other hand, I&amp;rsquo;m a &lt;a class="link" href="https://tdhopper.com/blog/on-inspiration/" target="_blank" rel="noopener"
&gt;strong believer in the power of noodling on ideas&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;When developing ideas, I prime my brain by dumping my thoughts on the topic into a Drafts app note and then thinking about it as I go about my day. When I have things to add, I pull up the note and jot down my ideas.&lt;/p&gt;</description></item><item><title>Good programming languages are nice but not vital</title><link>https://tdhopper.com/blog/good-programming-languages-are-nice-but-not-vital/</link><pubDate>Wed, 03 Apr 2024 11:46:00 +0000</pubDate><guid>https://tdhopper.com/blog/good-programming-languages-are-nice-but-not-vital/</guid><description>&lt;img src="https://tdhopper.com/images/complaining.png" alt="Featured image of post Good programming languages are nice but not vital" /&gt;&lt;blockquote&gt;
&lt;p&gt;When someone says, “I want a programming language in which I need only say what I wish done,” give him a lollipop. (Alan Perlis, &lt;a class="link" href="https://cpsc.yale.edu/epigrams-programming" target="_blank" rel="noopener"
&gt;Epigrams in Programming&lt;/a&gt;)&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;People love to talk trash about programming languages on Twitter. Every day there&amp;rsquo;s a new viral tweet &lt;a class="link" href="https://x.com/holdenmatt/status/1774866242282672365?s=20" target="_blank" rel="noopener"
&gt;about&lt;/a&gt; &lt;a class="link" href="https://twitter.com/markopolojarvi/status/1753668713671475288?s=20" target="_blank" rel="noopener"
&gt;the&lt;/a&gt; &lt;a class="link" href="https://twitter.com/nominalthoughts/status/1750431818883694935?s=20" target="_blank" rel="noopener"
&gt;inadequacies&lt;/a&gt; &lt;a class="link" href="https://twitter.com/andrewwhite01/status/1734707815145422943?s=20" target="_blank" rel="noopener"
&gt;of&lt;/a&gt; &lt;a class="link" href="https://twitter.com/Josh_Merfeld/status/1701563560315547807?s=20" target="_blank" rel="noopener"
&gt;Python&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I have worked with many programming languages and studied others in school, but I&amp;rsquo;ve spent most of the last ten years deep into Python. Of course, Python has warts and wrinkles; I&amp;rsquo;m intimately familiar with many of them! At the same time, people worldwide effectively use Python to solve all kinds of problems. As it turns out, many common frustrations can be set aside with a little effort (e.g., by integrating modern tooling like &lt;a class="link" href="https://github.com/astral-sh/ruff" target="_blank" rel="noopener"
&gt;Ruff&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;m regularly convinced that Brian Kernighan knew everything there is to know about software engineering practice, and he wrote it down before I was born in 1986. He doesn&amp;rsquo;t miss the mark in his &lt;a class="link" href="https://archive.org/details/programming-style" target="_blank" rel="noopener"
&gt;1979 paper with Plauger&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;hellip;many people try to excuse badly written programs by blaming inadequacies of the language that must be used. We have seen repeatedly that even Fortran can be tamed with proper discipline. The presence of bad features is not an invitation to use them, nor is the absence of good features an excuse to avoid simulating them as cleanly as possible. Good languages are nice, but not vital.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;People forget that two of the top three most visited websites (YouTube and Facebook) were originally implemented in PHP, a language rarely considered a Platonic ideal.&lt;/p&gt;
&lt;p&gt;Bjarne Stroustrup, creator of C++, says, &amp;ldquo;There are only two kinds of languages: the ones people complain about and the ones nobody uses.&amp;rdquo; He&amp;rsquo;s right, and I&amp;rsquo;m guessing we&amp;rsquo;ll never move entirely beyond that (even as language improvement continues). I, for one, am willing to embrace this and learn to do the best with the tools we have.&lt;/p&gt;</description></item><item><title>Announcing the Python Developer Tooling Handbook</title><link>https://tdhopper.com/blog/announcing-the-python-developer-tooling-handbook/</link><pubDate>Sat, 10 Feb 2024 16:31:16 +0000</pubDate><guid>https://tdhopper.com/blog/announcing-the-python-developer-tooling-handbook/</guid><description>&lt;img src="https://tdhopper.com/images/pydevtools-announcement.png" alt="Featured image of post Announcing the Python Developer Tooling Handbook" /&gt;&lt;p&gt;I’ve been a professional Python developer for over a decade. Like many Python developers, I’ve faced the many challenges of Python packaging and dependency management. I’ve also come to love helping developers find the right tools to make their work easier and more productive.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;m excited to announce that I&amp;rsquo;m writing the &lt;a class="link" href="https://pydevtools.com/" target="_blank" rel="noopener"
&gt;Python Developer Tooling Handbook&lt;/a&gt;, a free
ebook on Python developer tooling. This handbook covers a wide range of topics,
including build tools, linting, formatting, dependency management, virtual environments,
and more.&lt;/p&gt;
&lt;p&gt;The book is currently a work in progress, and I&amp;rsquo;m excited to share it with you when it&amp;rsquo;s
ready. If you want to be notified as soon as it&amp;rsquo;s released, please consider signing up
for my mailing list &lt;a class="link" href="https://buttondown.email/pdth?tag=github" target="_blank" rel="noopener"
&gt;here&lt;/a&gt;.&lt;/p&gt;</description></item><item><title>On Inspiration</title><link>https://tdhopper.com/blog/on-inspiration/</link><pubDate>Mon, 05 Feb 2024 23:48:54 +0000</pubDate><guid>https://tdhopper.com/blog/on-inspiration/</guid><description>&lt;img src="https://tdhopper.com/images/inspiration.png" alt="Featured image of post On Inspiration" /&gt;&lt;p&gt;Good ideas are hard to come by, but I find I&amp;rsquo;m often looking in the wrong places.&lt;/p&gt;
&lt;p&gt;When I&amp;rsquo;m trying to develop new ideas, sitting at my desk staring at my computer (and all the world&amp;rsquo;s distractions) seems to be the worst place. Instead, I like to prime my mind by writing down whatever thoughts I have on the topic and then trying to ruminate as I go about life.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;m inspired by stories of brilliant people coming up with ideas at surprising times:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="http://garfield.library.upenn.edu/classics1981/A1981MS54100001.pdf" target="_blank" rel="noopener"
&gt;Hirotugu Akaike said&lt;/a&gt; the idea for his eponymous &lt;a class="link" href="https://en.wikipedia.org/wiki/Akaike_information_criterion" target="_blank" rel="noopener"
&gt;Akaike Information Criterion&lt;/a&gt; came to him while taking his seat on the bus.&lt;/li&gt;
&lt;li&gt;Grant Wood, most famous for painting American Gothic, once said &amp;ldquo;All the really good ideas I&amp;rsquo;ve ever had came to me while I was milking a cow.&amp;rdquo;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://permalink.lanl.gov/object/tr?what=info:lanl-repo/lareport/LA-UR-88-9068" target="_blank" rel="noopener"
&gt;Stanisław Ulam, a key physicist in the Manhattan Project, recounted how he thought&lt;/a&gt; of Monte Carlo methods while home sick and playing solitaire. (Monte Carlo methods are essential in modern artificial intelligence model training!)&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://www.worldscientific.com/doi/10.1142/9789814434430_0002" target="_blank" rel="noopener"
&gt;Paul Dirac, a founder of quantum mechanics, said,&lt;/a&gt; &amp;ldquo;I would like to mention that I found the best ideas usually came, not when one was actively striving for them, but when one was in a more relaxed state.&amp;rdquo;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Of course, this method can be abused. I had a classmate in grad school who used to use &amp;ldquo;you can&amp;rsquo;t force inspiration&amp;rdquo; as an excuse to never do any work. I try to balance these anecdotes with Stephen King&amp;rsquo;s proverb: &amp;ldquo;Amateurs sit and wait for inspiration, the rest of us just get up and go to work.&amp;rdquo;&lt;/p&gt;</description></item><item><title>Annual Accomplishments</title><link>https://tdhopper.com/annual-accomplishments/</link><pubDate>Tue, 10 Jan 2023 20:47:17 +0000</pubDate><guid>https://tdhopper.com/annual-accomplishments/</guid><description>&lt;img src="https://tdhopper.com/images/zen.png" alt="Featured image of post Annual Accomplishments" /&gt;&lt;h2 id="2025"&gt;2025
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Launched &lt;a class="link" href="https://pydevtools.com" target="_blank" rel="noopener"
&gt;pydevtools.com&lt;/a&gt;, the first comprehensive resource on Python developer tooling; now over 43k words and growing!&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Made it &lt;a class="link" href="https://news.ycombinator.com/item?id=44977645" target="_blank" rel="noopener"
&gt;to the front page of Hacker News&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Led adoption of AI-assisted software development on my team&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Interviewed Prof. Robert Ghrist on &lt;a class="link" href="https://tdhopper.com/blog/writing-a-math-textbook-with-claude-with-professor-ghrist/" target="_blank" rel="noopener"
&gt;Writing a Math Textbook with Claude&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Interviewed by &lt;a class="link" href="https://tdhopper.com/blog/interview-with-the-pybites-podcast/" target="_blank" rel="noopener"
&gt;Pybites Podcast&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Hit 380 lbs for 5 reps on the trap bar (highest weight post-back surgery)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Celebrated 10th wedding anniversary&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Open sourced &lt;a class="link" href="https://github.com/tdhopper/wt" target="_blank" rel="noopener"
&gt;wt&lt;/a&gt; and &lt;a class="link" href="https://github.com/python-developer-tooling-handbook/uvhow" target="_blank" rel="noopener"
&gt;uvhow&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Traveled to Stockholm and Berlin for work&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Received encouraging peer feedback highlighting:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;eagerness to jump in and solve problems&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;deep Python expertise and ability to make complex topics accessible&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;pragmatic, non–yak-shaving engineering approach&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;initiative in connecting with researchers and identifying challenges&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="2024"&gt;2024
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Survived a painful herniated disc and went through successful surgery&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Got back to lifting weights after a 5-month break due to a herniated disc&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Down to the fourth belt hole on a belt I couldn’t wear a year ago&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Saw huge growth and adoption of feature store platform at Varo&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Landed amazing job at Spotify&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a class="link" href="https://tdhopper.com/talks" target="_blank" rel="noopener"
&gt;Spoke at fintech_devcon, my first tech conference in 5 years&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Saw my first &lt;a class="link" href="https://www.instagram.com/p/DEP8ZzFuXiF/" target="_blank" rel="noopener"
&gt;total eclipse&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a class="link" href="https://podcast.tdhopper.com" target="_blank" rel="noopener"
&gt;Recorded two interesting interviews on AI-assisted software development&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Back to blogging at &lt;a class="link" href="https://tdhopper.com" target="_blank" rel="noopener"
&gt;tdhopper.com&lt;/a&gt; a bit&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Redesigned &lt;a class="link" href="https://tdhopper.com" target="_blank" rel="noopener"
&gt;tdhopper.com&lt;/a&gt; with &lt;a class="link" href="https://gohugo.io" target="_blank" rel="noopener"
&gt;Hugo&lt;/a&gt; and &lt;a class="link" href="https://decapcms.org" target="_blank" rel="noopener"
&gt;Decap CMS&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="2023"&gt;2023
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Had 4th kid and watched them all thrive (this was most of my year…)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;285 lbs bench press, 425 lbs deadlift, 405 lbs squat&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Most impactful year of work yet&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Started handbook on Python developer tooling (more soon, hopefully)&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="2022"&gt;2022
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Had 3rd kid and kept them all alive&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Moved with three small children&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Entered my first strongman competition (with a 10-day-old baby)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Led integration of Tecton&amp;rsquo;s feature store for my team (and learned a ton on the way)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a class="link" href="https://www.youtube.com/watch?v=6flt_3yMNb0" target="_blank" rel="noopener"
&gt;Spoke at normconf&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="2021"&gt;2021
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;a class="link" href="https://photos.tdhopper.com/2021-best-of" target="_blank" rel="noopener"
&gt;Took pictures I&amp;rsquo;m proud of&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a class="link" href="https://podcast.tdhopper.com" target="_blank" rel="noopener"
&gt;Recorded 3 great episodes of my podcast&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Survived Covid-19 in January&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Took over massive new responsibilities on my team after the departure of our infrastructure engineers&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Accepted new job as a staff software engineer&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a class="link" href="https://tdhopper.com/blog/organizing-research-data-with-intake/" target="_blank" rel="noopener"
&gt;Developed transformative tool for my team&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Helped underpaid colleague get a new job w/ a big raise&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a class="link" href="https://pypi.org/project/intake-pattern-catalog/" target="_blank" rel="noopener"
&gt;Released my company&amp;rsquo;s first ever open source library&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Deadlifted 405 lbs&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="2020"&gt;2020
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Survived pandemic&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Left a disappointing job and found a great new one&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Completed challenging 1st year of 2nd child&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a class="link" href="https://podcast.tdhopper.com" target="_blank" rel="noopener"
&gt;Started a podcast&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Recruited my former coworker Brenton to work with me&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a class="link" href="https://photos.tdhopper.com/2020-best-of" target="_blank" rel="noopener"
&gt;Took pictures I&amp;rsquo;m proud of&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a class="link" href="https://resume.tdhopper.com" target="_blank" rel="noopener"
&gt;Built a new website&lt;/a&gt; for my resume (with automatically generated PDF version!)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Learned some new things (Go, Python dashboarding, xarray library, weather forecast data)&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="2019"&gt;2019
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Had a second son&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Moved my team completely off Python 2&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Set a new record for length of time at a company&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Made &lt;a class="link" href="https://freespace.tdhopper.com" target="_blank" rel="noopener"
&gt;freespace.tdhopper.com&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a class="link" href="https://github.com/python/cpython/pull/11847" target="_blank" rel="noopener"
&gt;Had a PR merged into CPython&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Bench pressed 235 and deadlifted 375&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Took a long-talked-about trip to Yellowstone with my dad&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a class="link" href="https://www.nature.com/articles/d41586-019-02046-0" target="_blank" rel="noopener"
&gt;Had a tweet quoted in a Nature article&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a class="link" href="https://photos.tdhopper.com/2019-best-of" target="_blank" rel="noopener"
&gt;Took some photos I&amp;rsquo;m proud of&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="2018"&gt;2018
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Made &lt;a class="link" href="http://photos.tdhopper.com" target="_blank" rel="noopener"
&gt;photos.tdhopper.com&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Spoke at SciPy, BSides Asheville, PyData DC, Triangle Jupyter Day, and the Southern Data Science Conference&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Bench-pressed 200 lbs and deadlifted 300 lbs for the first time&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Moved my static sites from AWS to Netlify&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Presented at 6 of my team&amp;rsquo;s bi-weekly knowledge shares&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Sold my team on Pytest&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Learned a bunch about malware&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Learned more about network communications&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Started &lt;a class="link" href="http://instagram.com/doyouplayball" target="_blank" rel="noopener"
&gt;instagram.com/doyouplayball&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Published &lt;a class="link" href="https://tdhopper.com/blog/faq/" &gt;A Subjective and Anecdotal FAQ on Becoming a Data Scientist&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Bought my first house&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="2017"&gt;2017
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Made &lt;a class="link" href="https://pythonplot.com" target="_blank" rel="noopener"
&gt;pythonplot.com&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;First baby&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Better understood the Linux user account model&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Wrote &lt;a class="link" href="https://tdhopper.com/blog/some-reflections-on-being-turned-down-for-a-lot-of-data-science-jobs/" &gt;Some Reflections on Being Turned Down for a Lot of Data Science Jobs&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Second time on &lt;a class="link" href="https://adversariallearning.com" target="_blank" rel="noopener"
&gt;Adversarial Learning&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Used Ansible effectively&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Two trips to Grayson Highlands State Park in Virginia&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;</description></item><item><title>Five semesters of linear algebra and all I do is solve Python dependency problems</title><link>https://tdhopper.com/talks/linear-algebra-python-dependencies/</link><pubDate>Thu, 15 Dec 2022 00:00:00 +0000</pubDate><guid>https://tdhopper.com/talks/linear-algebra-python-dependencies/</guid><description>&lt;img src="https://tdhopper.com/images/thinkido.png" alt="Featured image of post Five semesters of linear algebra and all I do is solve Python dependency problems" /&gt;&lt;div class="video-wrapper"&gt;
&lt;iframe loading="lazy"
src="https://www.youtube.com/embed/6flt_3yMNb0"
allowfullscreen
title="YouTube Video"
&gt;
&lt;/iframe&gt;
&lt;/div&gt;
&lt;p&gt;This is a reflection I&amp;rsquo;ve carried around for years: the evolution of my career interests and how they relate (or don&amp;rsquo;t relate) to my core passion, which has always been math. It&amp;rsquo;s an overly introspective look at how my love for mathematics led me toward data science and machine learning, only to find myself 10 years later doing basically no pure math in my job.&lt;/p&gt;
&lt;p&gt;If you&amp;rsquo;re like me, and I know there are others out there, this one&amp;rsquo;s for you.&lt;/p&gt;
&lt;h3 id="the-gap-between-expectation-and-reality"&gt;The Gap Between Expectation and Reality
&lt;/h3&gt;&lt;p&gt;When I was a grad student studying machine learning 10 or 12 years ago, I thought my time would be spent choosing kernels for Support Vector Machines (SVMs). Today, I can confidently say I have &lt;em&gt;never&lt;/em&gt; needed to choose an SVM kernel.&lt;/p&gt;
&lt;p&gt;Of course, what I &lt;em&gt;want&lt;/em&gt; people on Twitter to think I do is work on deep learning and large language models. What I actually do as a Staff Software Engineer at Varo Bank is often search Stack Overflow for how to solve Python dependency issues, like installing TensorFlow on an M1 Mac. My biggest contributions, in fact, are often figuring out how to get pip to work the way we need it to.&lt;/p&gt;
&lt;p&gt;I help build and maintain our feature platform for the data science team at Varo Bank, which involves making machine learning features available for model training and inference. This system design work is great, but a large part of my career has been about trying to master the Python language and its ecosystem and using that knowledge to assist my teams. That means a lot of debugging people&amp;rsquo;s pip issues, Googling for exceptions, and encouraging better software engineering practices. This work is satisfying, but it&amp;rsquo;s certainly not where I expected to end up.&lt;/p&gt;
&lt;h3 id="following-the-math"&gt;Following the Math
&lt;/h3&gt;&lt;p&gt;I really do love math. I loved linear algebra from the get-go; it came pretty naturally to me. I’ve always loved algorithms, algorithmic thinking, and mathematical models. This goes back a long way. When I was eight years old, my mom wrote in a Christmas letter that &amp;ldquo;Math is entertainment for him&amp;rdquo;.&lt;/p&gt;
&lt;p&gt;My passion for math led me to study it as an undergrad at Grove City College. I initially started as a Physics major, dreaming of becoming a professor, but quickly switched to Computer Science when I realized I enjoyed the math side of physics more than the lab work.&lt;/p&gt;
&lt;p&gt;However, even as a computer science student, I realized I preferred the abstract, theoretical side (like theory of computation class) over programming. Programming felt like a tool, not the main interest. By the end of my sophomore year, I realized math was truly for me and changed my major. I graduated in 2008 with highest honors in math, having taken my first two semesters of linear algebra plus numerical linear algebra.&lt;/p&gt;
&lt;p&gt;Around 2006, while an undergrad, I found a &lt;a class="link" href="https://www.bloomberg.com/news/articles/2006-01-22/math-will-rock-your-world" target="_blank" rel="noopener"
&gt;compelling article in &lt;em&gt;Business Week&lt;/em&gt;&lt;/a&gt;. I remember this quote from James R. Schatz, chief of the mathematics research group at the NSA: &amp;ldquo;There has never been a better time to be a mathematician&amp;rdquo;. The article went on to describe what would later be called data science, discussing how companies needed skilled mathematicians and computer scientists to sort through their &amp;ldquo;swelling oceans of data&amp;rdquo;. To me, that sounded perfect: a job using the math I loved.&lt;/p&gt;
&lt;h3 id="the-academic-detours-and-the-realization"&gt;The Academic Detours and the Realization
&lt;/h3&gt;&lt;p&gt;My academic path continued. I entered the University of Virginia as a PhD student in the history of mathematics (a third semester of linear algebra was included in that program). But I quickly understood that the program was setting me up only for an increasingly unattractive career in academia.&lt;/p&gt;
&lt;p&gt;More critically, I started to realize that my math skills alone probably weren&amp;rsquo;t going to be valuable to anybody outside of instruction. Despite having a math degree and being good at math, I was effectively unskilled, because almost no one was hiring for math ability directly.&lt;/p&gt;
&lt;p&gt;I found Operations Research (OR) next, which seemed like a perfect opportunity to merge math, real-world problems, and computation. I became a PhD student at North Carolina State University. This was 2010–2012, the era when data science was blowing up on Twitter, offering a clear career path using math and computation to solve real problems.&lt;/p&gt;
&lt;p&gt;During grad school, I interned at Kiva Systems (now Amazon Robotics) in the algorithms research group. I thought I had hit &amp;ldquo;the glory land&amp;rdquo; where I could focus on pure algorithms and optimization problems. However, I was deeply discouraged to find my boss—a PhD who studied vehicle routing problems—spending a lot of his time on the &lt;em&gt;lowly work&lt;/em&gt; of getting Java code ready for production. Right then, I committed myself: I would never fall down the route of just becoming a software developer; I was going to do math in my career.&lt;/p&gt;
&lt;h3 id="the-shift-to-ml-engineering-the-dirty-work"&gt;The Shift to ML Engineering (The Dirty Work)
&lt;/h3&gt;&lt;p&gt;After leaving grad school, I took a &amp;ldquo;perfect job&amp;rdquo; in 2015, working on a DARPA project implementing inference for Bayesian non-parametric models. It involved deriving equations, learning interesting statistics and math, and reading complex papers. But after nine months, the project ended abruptly.&lt;/p&gt;
&lt;p&gt;Back on the job market, I had a realization: there isn&amp;rsquo;t an enormous market for algorithm implementers who get to do only the pure work. A great implementation, like those in scikit-learn, PyTorch, or even decades-old Fortran implementations, can be used over and over again.&lt;/p&gt;
&lt;p&gt;I shifted back toward traditional data science roles, deciding that if I couldn&amp;rsquo;t be responsible for the underlying algorithms, I could focus on the math modeling side. But in many companies, especially smaller ones where I worked, the engineering barriers to using data science models were often momentous. The math modeling became one of the least important parts because of challenges like political tensions, organizational issues, and massive engineering deficits in getting anything into production.&lt;/p&gt;
&lt;p&gt;I got so frustrated with the engineering deficits that I decided to solve those problems first. I became what we now call a Machine Learning Engineer. My goal was to be the change I wanted to see: a data scientist who didn&amp;rsquo;t have to fight with Python dependency issues, who could instead spend time thinking about deep learning architectures or SVM kernels. I decided to do the engineering, the Dirty Work, to enable data scientists to focus on what I thought was the interesting work.&lt;/p&gt;
&lt;h3 id="finding-joy-in-logic-and-software"&gt;Finding Joy in Logic and Software
&lt;/h3&gt;&lt;p&gt;Here’s the unexpected twist: I started to learn that there is something quite enjoyable about writing great software and getting it ready for production. As a software developer, I still get to originate ideas in my head and write them out with my fingertips. Unlike pure math work, I get to see that work scale to millions of people, providing value for a company and its customers. It&amp;rsquo;s satisfying to do something useful.&lt;/p&gt;
&lt;p&gt;I don&amp;rsquo;t regret studying math at all. I followed my passions, even though I didn&amp;rsquo;t really have a plan. But reflecting on the 2006 claim that there had &amp;ldquo;never been a better time to be a mathematician,&amp;rdquo; I think that was misleading.&lt;/p&gt;
&lt;p&gt;There has never been a better time to be competent and good at math, but being a mathematician on its own is often a poor career path. Today, I tell most people that math skills without programming skills in 2022 are going to be all but worthless.&lt;/p&gt;
&lt;p&gt;While I still cherish the ideas from the three linear algebra texts I studied over the years, I rarely reference them in my day-to-day work. However, I&amp;rsquo;ve learned that software provides many of the same joys we find in mathematics:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Problem solving&lt;/li&gt;
&lt;li&gt;Abstractions&lt;/li&gt;
&lt;li&gt;Worlds generated just by logic&lt;/li&gt;
&lt;li&gt;Reproducibility&lt;/li&gt;
&lt;li&gt;Endless areas of study&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For me, the beauty of software was often hidden behind the frustrations of things like pip resolution and impossible exceptions. But I’ve learned to uncover the joy of creating things with software, just as in math you can create things with ideas.&lt;/p&gt;
&lt;p&gt;My career has been a crazy, unpredictable road, and I suspect it will continue to be that way. I’ve spent far less time thinking about Support Vector Machine kernels and far more time solving Python dependency issues, but it’s turned out all right.&lt;/p&gt;</description></item><item><title>No Silver Bullet</title><link>https://tdhopper.com/blog/no-silver-bullet/</link><pubDate>Fri, 18 Nov 2022 20:15:17 +0000</pubDate><guid>https://tdhopper.com/blog/no-silver-bullet/</guid><description>&lt;img src="https://tdhopper.com/silver_bullet.png" alt="Featured image of post No Silver Bullet" /&gt;&lt;p&gt;In 1986, Fred Brooks published &amp;ldquo;No Silver Bullet—Essence and Accidents of Software Engineering&amp;rdquo; where he argues that there is no silver bullet &amp;ldquo;to make software costs drop as rapidly as computer hardware costs do&amp;rdquo;.&lt;/p&gt;
&lt;p&gt;Software progress doesn&amp;rsquo;t follow Moore&amp;rsquo;s Law.&lt;/p&gt;
&lt;p&gt;Managers characterize software projects as &amp;ldquo;usually innocent and straightforward, but&amp;hellip; capable of becoming a monster of missed schedules, blown budgets, and flawed products.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;Ouch.&lt;/p&gt;
&lt;p&gt;The beginning of hope [in creating software], however, is realizing &amp;ldquo;that progress would be made stepwise, at great effort, and that a persistent, unremitting care would have to be paid to a discipline of cleanliness.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;Software&amp;rsquo;s difficulty has two aspects: &amp;ldquo;essence, the difficulties inherent in the nature of software, and accidents, those difficulties that today attend its production but are not inherent.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;⚠️Brooks&amp;rsquo; central thesis:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;I ʙᴇʟɪᴇᴠᴇ ᴛʜᴇ ʜᴀʀᴅ ᴘᴀʀᴛ ᴏғ ʙᴜɪʟᴅɪɴɢ sᴏғᴛᴡᴀʀᴇ to be the specification, design, and testing of this conceptual construct, not the labor of representing it and testing the fidelity of the representation.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;▶ Essential challenges of software have 4 aspects: complexity, conformity, changeability, and invisibility.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Complexity: &amp;ldquo;Software entities are more complex for their size than perhaps any other human construct because no two parts are alike.&amp;rdquo; Because complexity is essential to software, in trying to abstract away complexity, you &amp;ldquo;often abstract away its essence.&amp;rdquo;&lt;/li&gt;
&lt;li&gt;Conformity: Our software is often forced to conform to other interfaces (perhaps because it comes after hardware, or perhaps simply because it&amp;rsquo;s &amp;ldquo;perceived as the most conformable&amp;rdquo;). &amp;ldquo;This complexity cannot be simplified out by any redesign of the software alone.&amp;rdquo;&lt;/li&gt;
&lt;li&gt;Changeability: &amp;ldquo;Aʟʟ sᴜᴄᴄᴇssғᴜʟ sᴏғᴛᴡᴀʀᴇ ɢᴇᴛs ᴄʜᴀɴɢᴇᴅ.&amp;rdquo; When software succeeds, people push the boundaries of what it was designed for. And the physical world around software changes, so software must adapt (e.g. when a new model of hardware is released).&lt;/li&gt;
&lt;li&gt;Invisibility: &amp;ldquo;The reality of software is not inherently embedded in space.&amp;rdquo; Representations like directed graphs generally prove inadequate for fully representing software.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Progress has been made in removing 𝑎𝑐𝑐𝑖𝑑𝑒𝑛𝑡𝑎𝑙 difficulties of software:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;High-level languages&lt;/li&gt;
&lt;li&gt;Time sharing systems&lt;/li&gt;
&lt;li&gt;Unified programming environments&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Brooks saw other things coming that could improve software building, but he
was skeptical about how monumental they would be: artificial intelligence,
graphical programming, better code editors (e.g. language specific features),
faster workstations.&lt;/p&gt;
&lt;p&gt;Some means of attacking the 𝑒𝑠𝑠𝑒𝑛𝑡𝑖𝑎𝑙 complexity:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Buy vs. build: &amp;ldquo;The cost of software has always been development cost, not
replication cost.&amp;rdquo; Thus, if we 𝑐𝑎𝑛 buy it, we probably should.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Requirements refinement and rapid prototyping: &amp;ldquo;Tʜᴇ ʜᴀʀᴅᴇsᴛ sɪɴɢʟᴇ ᴘᴀʀᴛ ᴏғ
ʙᴜɪʟᴅɪɴɢ ᴀ sᴏғᴛᴡᴀʀᴇ sʏsᴛᴇᴍ ɪs ᴅᴇᴄɪᴅɪɴɢ ᴘʀᴇᴄɪsᴇʟʏ ᴡʜᴀᴛ ᴛᴏ ʙᴜɪʟᴅ.&amp;rdquo; &amp;ldquo;Therefore,
the most important function that the software builder performs for the client
is the 𝑖𝑡𝑒𝑟𝑎𝑡𝑖𝑣𝑒 𝑒𝑥𝑡𝑟𝑎𝑐𝑡𝑖𝑜𝑛 𝑎𝑛𝑑 𝑟𝑒𝑓𝑖𝑛𝑒𝑚𝑒𝑛𝑡 𝑜𝑓 𝑡ℎ𝑒 𝑝𝑟𝑜𝑑𝑢𝑐𝑡 𝑟𝑒𝑞𝑢𝑖𝑟𝑒𝑚𝑒𝑛𝑡𝑠.&amp;rdquo;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Brooks again:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;I would go a step further and assert that it is really impossible for a
client, even working with a software engineer, to specify completely,
precisely, and correctly the exact requirements of a modern software product
before trying some versions of the product.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;To serve iterative requirements gathering, a goal should be getting something
working as quickly as possible.&lt;/p&gt;
&lt;p&gt;(A side benefit of this is it improves morale for developers. &amp;ldquo;Enthusiasm
jumps when there is a running system, even a simple one.&amp;rdquo;)&lt;/p&gt;
&lt;p&gt;Great software design comes from great designers. Software is a creative
process. &amp;ldquo;Sound methodology can empower and liberate the creative mind; it
cannot inflame or inspire the drudge.&amp;rdquo; Organizations would do well to focus on
identifying great software designers and working hard to &amp;ldquo;grow great
designers&amp;rdquo;.&lt;/p&gt;
&lt;p&gt;Brooks:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Skepticism is not pessimism, however. &amp;hellip; A disciplined, consistent effort to
develop, propagate, and exploit these innovations should indeed yield an
order-of-magnitude improvement. There is no royal road, but there is a road.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;a class="link" href="http://worrydream.com/refs/Brooks-NoSilverBullet.pdf" target="_blank" rel="noopener"
&gt;Read all of Brooks&amp;rsquo; &amp;ldquo;No Silver Bullet&amp;rdquo;
here&lt;/a&gt;.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;&lt;small&gt;&lt;a class="link" href="https://flickr.com/photos/eschipul/4160817135" title="silver bullet"
target="_blank" rel="noopener"
&gt;silver bullet&lt;/a&gt; flickr photo by &lt;a class="link" href="https://flickr.com/people/eschipul" target="_blank" rel="noopener"
&gt;eschipul&lt;/a&gt; shared under a &lt;a class="link" href="https://creativecommons.org/licenses/by-sa/2.0/" target="_blank" rel="noopener"
&gt;Creative Commons (BY-SA) license&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;</description></item><item><title>My first strongman competition</title><link>https://tdhopper.com/blog/my-first-strongman-competition/</link><pubDate>Fri, 20 May 2022 12:30:00 +0000</pubDate><guid>https://tdhopper.com/blog/my-first-strongman-competition/</guid><description>&lt;img src="https://tdhopper.com/images/strongman.png" alt="Featured image of post My first strongman competition" /&gt;&lt;p&gt;I started lifting weights for the first time in my life four years ago, as a 32-year-old. Over the last few years, I&amp;rsquo;ve gotten interested in Strongman as a sport through YouTube. Yesterday, I entered my first strongman competition.&lt;/p&gt;
&lt;p&gt;I competed not having touched a weight in 2.5 weeks and generally not having trained as I had hoped due to the event being two weeks after my third child&amp;rsquo;s birth. Nevertheless, I had a lot of fun and learned much about competing.&lt;/p&gt;
&lt;h2 id="events"&gt;Events
&lt;/h2&gt;&lt;h3 id="farmers-carry-medley"&gt;Farmer’s Carry Medley
&lt;/h3&gt;&lt;p&gt;There were three farmer’s implements, and each had to be carried 40 feet. They weighed 210, 230, and 260 lbs per hand. I finished in 33.93 seconds and took 6th place! Before the competition, I’d never picked up anything close to 520 lbs.&lt;/p&gt;
&lt;h3 id="axle-deadlift-ladder"&gt;Axle Deadlift Ladder
&lt;/h3&gt;&lt;p&gt;There were three axle bars on wagon wheels (an 18&amp;quot; deadlift) weighing 435, 485, and 535 lbs. The first two bars had to be lifted once, and the third bar could be lifted for reps within the 60-second window.&lt;/p&gt;
&lt;p&gt;I wasn’t able to lift the lightest bar. I had intended to do a lot of deadlift training in January and February, but it didn’t happen. After the farmer’s carry, my hamstrings were so tight that I could barely get down to the bar. I knew from trying to warm up I wouldn’t be able to budge the bar. However, I did manage to get it an inch or two off the ground (which is something).&lt;/p&gt;
&lt;h3 id="sandbag-race"&gt;Sandbag Race
&lt;/h3&gt;&lt;p&gt;There were two sandbags (200 and 225 lbs). Each had to be carried around a cone 40’ away and back to the start line.&lt;/p&gt;
&lt;p&gt;I’d never touched a bag weighing more than 150 lbs. I finished in 46.54 seconds and was got 10th place. I was pleased with my result, but I could have trimmed time off if I had more practice picking up big bags.&lt;/p&gt;
&lt;h3 id="viking-press"&gt;Viking Press
&lt;/h3&gt;&lt;p&gt;The Viking press was a natural log mounted on a pivot. It was supposed to be equivalent to a 225-lb overhead press.&lt;/p&gt;
&lt;p&gt;I was apprehensive because I had hurt my shoulder training for it, and I rarely do heavy overhead pressing. However, I was able to lift the trunk without shoulder pain.&lt;/p&gt;
&lt;p&gt;I got eight reps and was pleased with that. My final rep felt like my spine was about to pop, so I gave up.&lt;/p&gt;
&lt;h3 id="sandbag-battle"&gt;Sandbag Battle
&lt;/h3&gt;&lt;p&gt;The final event was loading increasingly heavy sandbags over a 4’ rail. After each competitor had a turn, the bag weight was increased by 25 lbs.&lt;/p&gt;
&lt;p&gt;It started with a 200-lb bag, which I had no trouble lifting. I also lifted the 225-lb bag fine. With the 250-lb bag, I struggled to get it to my lap, but I got it over the bar at the last second.&lt;/p&gt;
&lt;p&gt;I gave the 275-lb bag a good shot, but I couldn’t get it into my lap. It is not only much heavier than the 150-lb bag I trained with but also a lot bigger, and I didn’t know how to get a grip on it. With some practice, I should be able to get to a 325-lb bag.&lt;/p&gt;
&lt;p&gt;&lt;figure&gt;&lt;img src="https://tdhopper.com/images/1237244fd93e655ef52ce1dba00aa2ff.jpeg"&gt;&lt;figcaption&gt;
&lt;h4&gt;Sandbag race&lt;/h4&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;figure&gt;&lt;img src="https://tdhopper.com/images/59f7ec02d3275ba63adf1b8dd06ed554.jpeg"&gt;&lt;figcaption&gt;
&lt;h4&gt;Farmer&amp;#39;s carry&lt;/h4&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;/p&gt;</description></item><item><title>Archives</title><link>https://tdhopper.com/archives/</link><pubDate>Sun, 06 Mar 2022 00:00:00 +0000</pubDate><guid>https://tdhopper.com/archives/</guid><description/></item><item><title>Apache Pulsar with Jowanza Joseph</title><link>https://tdhopper.com/blog/apache-pulsar-with-jowanza-joseph/</link><pubDate>Mon, 07 Feb 2022 00:00:00 +0000</pubDate><guid>https://tdhopper.com/blog/apache-pulsar-with-jowanza-joseph/</guid><description>&lt;img src="https://tdhopper.com/images/podcast.png" alt="Featured image of post Apache Pulsar with Jowanza Joseph" /&gt;&lt;h2 id="listen"&gt;Listen
&lt;/h2&gt;&lt;iframe src="https://open.spotify.com/embed/episode/1bS5PPymcYjpTv5cDHYAMJ" width="100%" height="232" frameBorder="0" allowtransparency="true" allow="encrypted-media"&gt;&lt;/iframe&gt;
&lt;h2 id="links"&gt;Links
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://www.youtube.com/results?search_query=jowanza&amp;#43;joseph&amp;#43;pulsar" target="_blank" rel="noopener"
&gt;Jowanza&amp;rsquo;s Pulsar talks&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://twitter.com/jowanza" target="_blank" rel="noopener"
&gt;Jowanza&amp;rsquo;s Twitter&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://jowanza.com" target="_blank" rel="noopener"
&gt;Jowanza&amp;rsquo;s website &lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://pulsar.apache.org/" target="_blank" rel="noopener"
&gt;Pulsar homepage&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying" target="_blank" rel="noopener"
&gt;The Log: What every software engineer should know about real-time data&amp;rsquo;s unifying abstraction&lt;/a&gt; by Jay Kreps&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://streamnative.io/" target="_blank" rel="noopener"
&gt;StreamNative&lt;/a&gt; - Hosted Pulsar&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://www.datastax.com/products/astra-streaming" target="_blank" rel="noopener"
&gt;Datastax Astra Streaming&lt;/a&gt; - Hosted Pulsar&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://www.cncf.io/" target="_blank" rel="noopener"
&gt;Cloud Native Computing Foundation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://pulsar.apache.org/docs/en/functions-overview/" target="_blank" rel="noopener"
&gt;Pulsar Functions&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://pulsar.apache.org/docs/en/2.3.0/io-overview/" target="_blank" rel="noopener"
&gt;Pulsar IO&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://pulsar.apache.org/docs/en/sql-overview/" target="_blank" rel="noopener"
&gt;Pulsar SQL&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://zookeeper.apache.org/" target="_blank" rel="noopener"
&gt;Apache ZooKeeper&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://bookkeeper.apache.org/" target="_blank" rel="noopener"
&gt;Apache BookKeeper&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="subscribe"&gt;Subscribe
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://tdhopper.com/podcast/feed" target="_blank" rel="noopener"
&gt;RSS Feed&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://podcasts.apple.com/us/podcast/into-the-hopper/id1499693201" target="_blank" rel="noopener"
&gt;Apple Podcasts&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://open.spotify.com/show/63NrgKMVb0VTwkklGboIjy" target="_blank" rel="noopener"
&gt;Spotify&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://overcast.fm/itunes1499693201/into-the-hopper" target="_blank" rel="noopener"
&gt;Overcast&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</description></item><item><title>Code Review Guidelines for Data Science Teams</title><link>https://tdhopper.com/blog/code-review-guidelines/</link><pubDate>Fri, 03 Dec 2021 15:37:00 +0000</pubDate><guid>https://tdhopper.com/blog/code-review-guidelines/</guid><description>&lt;img src="https://tdhopper.com/images/code-review.png" alt="Featured image of post Code Review Guidelines for Data Science Teams" /&gt;&lt;p&gt;Over the last 6 years, I&amp;rsquo;ve been able to help the teams I have been part of
develop guidelines for code review. Many teams require &amp;ldquo;code review&amp;rdquo; without
putting any effort into establishing a common understanding of what that
means. This post is adapted from proposed guidelines I prepared for one of my
teams. Of course, you don&amp;rsquo;t have to adopt &lt;em&gt;my&lt;/em&gt; guidelines for review, but I
would encourage your team to set aside some time to make sure you all mean the
same thing by &amp;ldquo;code review&amp;rdquo;.&lt;/p&gt;
&lt;h2 id="what-is-a-code-review-for"&gt;What is a code review for?
&lt;/h2&gt;&lt;p&gt;Code review has multiple benefits and objectives including:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Code correctness&lt;/strong&gt;: someone seeing your code with fresh eyes may help uncover bugs.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Code familiarity&lt;/strong&gt;: reading one another&amp;rsquo;s code keeps everyone familiar with the codebase.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Design feedback&lt;/strong&gt;: a constantly evolving code base is a fight against complexity; reviewers can guide one another on keeping the codebase coherent and maintainable.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Mutual learning&lt;/strong&gt;: the reviewer and author will inevitably learn from one another.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Regression protection&lt;/strong&gt;: future contributors to the code base have checks against breaking essential functionality; importantly, &lt;em&gt;this reduces fear of making necessary improvements to the code&lt;/em&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="what-code-reviews-arent"&gt;What code reviews aren&amp;rsquo;t
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;An opportunity for the reviewer to impose their idiosyncrasies.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;An opportunity for the developer to push off responsibility (e.g. correctness) for their code to the reviewer.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;An opportunity to demand perfection (Per &lt;a class="link" href="https://google.github.io/eng-practices/review/reviewer/standard.html" target="_blank" rel="noopener"
&gt;Google’s Code Review Guidelines&lt;/a&gt;: &lt;em&gt;A key point here is that there is no such thing as &amp;ldquo;perfect&amp;rdquo; code—there is only better code&lt;/em&gt;).&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="opening-pull-requests"&gt;Opening Pull Requests
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Take care to write informative commit messages. This helps your reviewer understand the decisions you made.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Consider your contribution in the broader context of the code base. Do you need to take extra steps to make the code healthier and manage complexity?&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Keep pull requests short whenever possible. &lt;code&gt;git diff --shortstat origin/main&lt;/code&gt; will show you the size of your branch&amp;rsquo;s diff from &lt;code&gt;main&lt;/code&gt;; under 400 lines changed is a great goal.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Write a pull request description that sets your reviewer up for success by helping them understand what the PR intends to accomplish.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;If you have a particularly complex PR, consider doing a code walk-through with a reviewer first.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;New code should ordinarily come with new tests.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="reviewing-pull-requests"&gt;Reviewing Pull Requests
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Have a positive, constructive, helpful attitude.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Wait for continuous integration tasks to complete. Let the author resolve any test failures before beginning your review.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;As much as possible, configure your continuous integration to enforce your team&amp;rsquo;s style guidelines and look for line-level bugs. In Python, this might include running &lt;a class="link" href="https://mypy-lang.org/" target="_blank" rel="noopener"
&gt;mypy&lt;/a&gt; and &lt;a class="link" href="https://docs.astral.sh/ruff/" target="_blank" rel="noopener"
&gt;ruff&lt;/a&gt;. Automated checks like these have multiple benefits: they&amp;rsquo;re often better than humans at these tasks, they reduce cognitive load on the reviewer, and they reduce the interpersonal tension that results from reviewers nitpicking code.&lt;/li&gt;
&lt;/ul&gt;
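&lt;p&gt;As a concrete sketch, a CI job might run something like the following (this assumes a project with a &lt;code&gt;src/&lt;/code&gt; layout; the exact invocations will depend on your setup):&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt;1
&lt;/span&gt;&lt;span class="lnt"&gt;2
&lt;/span&gt;&lt;span class="lnt"&gt;3
&lt;/span&gt;&lt;span class="lnt"&gt;4
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# Fail the build on style or type errors before a human reviews the PR&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;ruff check .
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;ruff format --check .
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;mypy src/
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;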
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Things to evaluate:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Does the code appear to do what it claims to do? (This requires that you understand what the code claims to do; you may need to ask the author to write a better description.)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Was the new code put in the right place?&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Is the new code unnecessarily complex—or unnecessarily clever?&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&amp;ldquo;Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it.&amp;rdquo; – Brian Kernighan&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Does the new code do all it can to avoid adding to the overall complexity of our codebase?&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Did the author write tests for the new code?&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Clarify when a comment is minor or not essential for merging (for example, preface with &amp;ldquo;Nit:&amp;rdquo;).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;If a PR is too large for you to reasonably review, you can ask the author to split it into multiple PRs.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="additional-reading"&gt;Additional Reading
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;a class="link" href="https://google.github.io/eng-practices/review/reviewer/standard.html" target="_blank" rel="noopener"
&gt;Google&amp;rsquo;s excellent guide to code review&lt;/a&gt; (note: a CL, or &amp;ldquo;changelist,&amp;rdquo; is Google&amp;rsquo;s equivalent of a pull request)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a class="link" href="https://docs.gitlab.com/ee/development/code_review.html" target="_blank" rel="noopener"
&gt;Gitlab&amp;rsquo;s Code Review Guidelines&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a class="link" href="https://www.michaelagreiler.com/code-review-best-practices/" target="_blank" rel="noopener"
&gt;Proven Code Review Best Practices from Microsoft&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a class="link" href="https://github.com/thoughtbot/guides/tree/master/code-review" target="_blank" rel="noopener"
&gt;Thoughtbot Code Review Guidelines&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a class="link" href="https://slack.engineering/how-about-code-reviews/" target="_blank" rel="noopener"
&gt;Code reviews at Slack&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a class="link" href="https://github.com/joho/awesome-code-review/blob/master/readme.md" target="_blank" rel="noopener"
&gt;Curated list of articles about code review from John Barton&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;</description></item><item><title>Organizing research data with Intake</title><link>https://tdhopper.com/blog/organizing-research-data-with-intake/</link><pubDate>Tue, 02 Nov 2021 12:20:00 +0000</pubDate><guid>https://tdhopper.com/blog/organizing-research-data-with-intake/</guid><description>&lt;img src="https://tdhopper.com/images/shelves.png" alt="Featured image of post Organizing research data with Intake" /&gt;&lt;p&gt;As someone providing engineering and infrastructure support for a small data science team, I&amp;rsquo;ve seen firsthand how the right tool can dramatically improve team productivity. Our team recently discovered &lt;a class="link" href="https://intake.readthedocs.io/en/latest/" target="_blank" rel="noopener"
&gt;Intake&lt;/a&gt;, a data cataloging tool by Anaconda that transformed how we work with Python-based data science projects.&lt;/p&gt;
&lt;p&gt;Like many data science teams, we deal with data arriving through various channels, e.g., batch data in AWS S3, email-attached spreadsheets, and transformations of existing datasets. Before implementing &lt;a class="link" href="https://intake.readthedocs.io/en/latest/" target="_blank" rel="noopener"
&gt;Intake&lt;/a&gt;, our data scientists faced challenges in accessing datasets. They needed to:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Know the dataset existed&lt;/li&gt;
&lt;li&gt;Locate its specific storage location&lt;/li&gt;
&lt;li&gt;Understand the download process&lt;/li&gt;
&lt;li&gt;Determine which Python library and arguments were needed to open it&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This cognitive overhead was taking valuable time away from actual data science work.&lt;/p&gt;
&lt;p&gt;Intake offered us an elegant solution through its lightweight interface and plugin system. We packaged it into a pip-installable library containing our data catalog, allowing data scientists to access datasets easily. Instead of juggling multiple parameters and commands, they could now load data with a clean, intuitive syntax:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;catalog&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;The beauty of this approach is that data scientists don&amp;rsquo;t need to know the underlying S3 infrastructure; they can focus purely on their analytical work.&lt;/p&gt;
&lt;p&gt;To make this system reliable, we implemented several key components:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A YAML-based catalog definition system for organizing our data assets&lt;/li&gt;
&lt;li&gt;An internal Python package distributed through our private repository&lt;/li&gt;
&lt;li&gt;Nightly automated tests ensuring dataset accessibility&lt;/li&gt;
&lt;li&gt;A custom Intake plugin (which we open-sourced as &lt;a class="link" href="https://github.com/DTN-Public/intake-pattern-catalog" target="_blank" rel="noopener"
&gt;intake-pattern-catalog&lt;/a&gt;) for handling parameterized file paths and versioned files&lt;/li&gt;
&lt;/ul&gt;
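&lt;p&gt;To give a flavor of the catalog format, a minimal entry looks roughly like this (the dataset name, bucket, and path are invented for illustration):&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt;1
&lt;/span&gt;&lt;span class="lnt"&gt;2
&lt;/span&gt;&lt;span class="lnt"&gt;3
&lt;/span&gt;&lt;span class="lnt"&gt;4
&lt;/span&gt;&lt;span class="lnt"&gt;5
&lt;/span&gt;&lt;span class="lnt"&gt;6
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-yaml" data-lang="yaml"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;sources:
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;  example_dataset:
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;    description: &amp;quot;Illustrative CSV dataset&amp;quot;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;    driver: csv
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;    args:
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;      urlpath: &amp;quot;s3://example-bucket/example/*.csv&amp;quot;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;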
&lt;p&gt;This seemingly simple change had profound effects on our workflow. It eliminated:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Friction in data access&lt;/li&gt;
&lt;li&gt;Redundant boilerplate code&lt;/li&gt;
&lt;li&gt;Reliance on unwritten institutional knowledge&lt;/li&gt;
&lt;li&gt;The constant need to search StackOverflow for boto3 solutions&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Most importantly, it shortened feedback loops - a critical factor in data science productivity - allowing our team to focus on model development rather than data access logistics.&lt;/p&gt;</description></item><item><title>Ocean Engineering to Data Science</title><link>https://tdhopper.com/blog/ocean-engineering-to-data-science/</link><pubDate>Wed, 15 Sep 2021 00:00:00 +0000</pubDate><guid>https://tdhopper.com/blog/ocean-engineering-to-data-science/</guid><description>&lt;img src="https://tdhopper.com/images/podcast.png" alt="Featured image of post Ocean Engineering to Data Science" /&gt;&lt;h2 id="listen"&gt;Listen
&lt;/h2&gt;&lt;iframe src="https://open.spotify.com/embed/episode/39siLfJWH3VTJkyUhilVEd" width="100%" height="232" frameBorder="0" allowtransparency="true" allow="encrypted-media"&gt;&lt;/iframe&gt;
&lt;h2 id="links"&gt;Links
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://twitter.com/brentonmallen?lang=en" target="_blank" rel="noopener"
&gt;Brenton&amp;rsquo;s Twitter&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://instagram.com/brentonmallen?lang=en" target="_blank" rel="noopener"
&gt;Brenton&amp;rsquo;s Instagram&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://brentonmallen.com" target="_blank" rel="noopener"
&gt;Brenton&amp;rsquo;s Website&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="http://www.ome.fau.edu/undergraduate/ocean/" target="_blank" rel="noopener"
&gt;Ocean Engineering at Florida Atlantic University&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="subscribe"&gt;Subscribe
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://tdhopper.com/podcast/feed" target="_blank" rel="noopener"
&gt;RSS Feed&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://podcasts.apple.com/us/podcast/into-the-hopper/id1499693201" target="_blank" rel="noopener"
&gt;Apple Podcasts&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://open.spotify.com/show/63NrgKMVb0VTwkklGboIjy" target="_blank" rel="noopener"
&gt;Spotify&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://overcast.fm/itunes1499693201/into-the-hopper" target="_blank" rel="noopener"
&gt;Overcast&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</description></item><item><title>Hiring Data Scientists with Roy Keyes</title><link>https://tdhopper.com/blog/hiring-data-scientists-with-roy-keyes/</link><pubDate>Wed, 30 Jun 2021 00:00:00 +0000</pubDate><guid>https://tdhopper.com/blog/hiring-data-scientists-with-roy-keyes/</guid><description>&lt;img src="https://tdhopper.com/images/podcast.png" alt="Featured image of post Hiring Data Scientists with Roy Keyes" /&gt;&lt;h2 id="listen"&gt;Listen
&lt;/h2&gt;&lt;iframe src="https://open.spotify.com/embed/episode/2sKIYqYafX3xOAEeSMEezj" width="100%" height="232" frameBorder="0" allowtransparency="true" allow="encrypted-media"&gt;&lt;/iframe&gt;
&lt;h2 id="links"&gt;Links
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://twitter.com/roycoding?lang=en" target="_blank" rel="noopener"
&gt;Roy&amp;rsquo;s Twitter&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://leanpub.com/dshiring" target="_blank" rel="noopener"
&gt;Roy&amp;rsquo;s Book&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://www.amazon.com/Data-Scientists-Work-Sebastian-Gutierrez/dp/1430265981" target="_blank" rel="noopener"
&gt;Data Scientists at Work&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century" target="_blank" rel="noopener"
&gt;Data Scientist: The Sexiest Job of the 21st Century&lt;/a&gt; by Thomas H. Davenport and D.J. Patil&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://www.oreilly.com/library/view/building-data-science/BLDNGDST0001/" target="_blank" rel="noopener"
&gt;Building Data Science Teams&lt;/a&gt; by DJ Patil (2011)&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://adversariallearning.com/episode-10-stories-of-degradation-and-humiliation.html" target="_blank" rel="noopener"
&gt;Adversarial Learning: Stories of Degradation and Humiliation&lt;/a&gt; — Podcast about bad interview experiences&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="subscribe"&gt;Subscribe
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://tdhopper.com/podcast/feed" target="_blank" rel="noopener"
&gt;RSS Feed&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://podcasts.apple.com/us/podcast/into-the-hopper/id1499693201" target="_blank" rel="noopener"
&gt;Apple Podcasts&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://open.spotify.com/show/63NrgKMVb0VTwkklGboIjy" target="_blank" rel="noopener"
&gt;Spotify&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://overcast.fm/itunes1499693201/into-the-hopper" target="_blank" rel="noopener"
&gt;Overcast&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</description></item><item><title>Weather Forecast Data Formats and Tools</title><link>https://tdhopper.com/blog/weather-forecast-data-formats-and-tools/</link><pubDate>Mon, 21 Jun 2021 00:00:00 +0000</pubDate><guid>https://tdhopper.com/blog/weather-forecast-data-formats-and-tools/</guid><description>&lt;p&gt;Weather forecasts (like WRF, HRRR, etc.) are high-dimensional gridded data. They include projection information and other metadata, and there are several file formats used to store them. Here&amp;rsquo;s what I&amp;rsquo;ve learned about these formats and the tools for working with them.&lt;/p&gt;
&lt;h2 id="grib-gridded-binary-data"&gt;GRIB: GRIdded BInary Data
&lt;/h2&gt;&lt;p&gt;The World Meteorological Organization (WMO) Commission for Basic Systems (CBS) created the GRIB format in 1985. GRIB1 was released in 1994, and GRIB2 was released in 2003. Most GRIB data you&amp;rsquo;ll encounter today is GRIB2.&lt;/p&gt;
&lt;h2 id="netcdf"&gt;netCDF
&lt;/h2&gt;&lt;p&gt;NetCDF is a set of software libraries and self-describing, machine-independent data formats that support the creation, access, and sharing of array-oriented scientific data. It was developed and is maintained at Unidata.&lt;/p&gt;
&lt;p&gt;NetCDF 1.0 was released in 1998, and it&amp;rsquo;s now up to version 4.8. The format is strictly backwards compatible and almost entirely cross-compatible between language APIs. The canonical implementations of the spec are C and Java libraries.&lt;/p&gt;
&lt;p&gt;You&amp;rsquo;ll often hear that netCDF files are much bigger than GRIB2 files. NetCDF4 does support zlib compression, though it may not compress as well as GRIB2. NetCDF4 files can be (and typically are) stored in HDF5 format, so you can inspect them with either &lt;code&gt;ncinfo&lt;/code&gt; or &lt;code&gt;h5dump&lt;/code&gt;.&lt;/p&gt;
&lt;h3 id="the-data-model"&gt;The Data Model
&lt;/h3&gt;&lt;p&gt;A netCDF file has three key components:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Dimensions&lt;/strong&gt;: the shape of variables&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Variables&lt;/strong&gt;: the actual data&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Attributes&lt;/strong&gt;: metadata for the dataset or individual variables&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;NetCDF-4 also adds groups with named subgroups.&lt;/p&gt;
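&lt;p&gt;All three components show up in CDL, netCDF&amp;rsquo;s text notation; here&amp;rsquo;s a minimal sketch with an invented variable:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt;1
&lt;/span&gt;&lt;span class="lnt"&gt;2
&lt;/span&gt;&lt;span class="lnt"&gt;3
&lt;/span&gt;&lt;span class="lnt"&gt;4
&lt;/span&gt;&lt;span class="lnt"&gt;5
&lt;/span&gt;&lt;span class="lnt"&gt;6
&lt;/span&gt;&lt;span class="lnt"&gt;7
&lt;/span&gt;&lt;span class="lnt"&gt;8
&lt;/span&gt;&lt;span class="lnt"&gt;9
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-text" data-lang="text"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;netcdf example {
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;dimensions:
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;    time = 2 ;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;variables:
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;    float temp(time) ;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;        temp:units = &amp;quot;K&amp;quot; ;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;data:
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;    temp = 280.5, 281.0 ;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;}
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;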
&lt;h3 id="ncdump-cli"&gt;&lt;code&gt;ncdump&lt;/code&gt; CLI
&lt;/h3&gt;&lt;p&gt;&lt;code&gt;ncdump&lt;/code&gt; dumps a netCDF file as plain text. Useful flags:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;-v&lt;/code&gt;: select a variable&lt;/li&gt;
&lt;li&gt;&lt;code&gt;-h&lt;/code&gt;: header only&lt;/li&gt;
&lt;li&gt;&lt;code&gt;-x&lt;/code&gt;: XML output&lt;/li&gt;
&lt;/ul&gt;
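&lt;p&gt;For example, to inspect just the structure of a file (file name illustrative):&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt;1
&lt;/span&gt;&lt;span class="lnt"&gt;2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# Show dimensions, variables, and attributes without any data values&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;ncdump -h example.nc
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;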
&lt;h3 id="ncgen"&gt;&lt;code&gt;ncgen&lt;/code&gt;
&lt;/h3&gt;&lt;p&gt;&lt;code&gt;ncgen&lt;/code&gt; turns CDL (network Common Data form Language) text into a netCDF binary file:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;ncgen -b example.cdl
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Both &lt;code&gt;ncgen&lt;/code&gt; and &lt;code&gt;ncdump&lt;/code&gt; are part of the core netCDF library.&lt;/p&gt;
&lt;h3 id="xarray"&gt;xarray
&lt;/h3&gt;&lt;p&gt;xarray is a Python library built on the netCDF data model. Data is always loaded lazily from netCDF files: you can manipulate, slice, and subset Dataset and DataArray objects without loading array values into memory until you perform actual computation.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt;1
&lt;/span&gt;&lt;span class="lnt"&gt;2
&lt;/span&gt;&lt;span class="lnt"&gt;3
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;xarray&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;xr&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;ds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;xr&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;open_dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;example.nc&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;ds&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;to_dataframe&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;NetCDF is the standard way to serialize data from xarray.&lt;/p&gt;
&lt;h3 id="nco-netcdf-operators"&gt;NCO (NetCDF Operators)
&lt;/h3&gt;&lt;p&gt;NCO is a collection of command-line tools for working with netCDF files. The tools take netCDF, HDF, and/or DAP files as input and can derive new data, compute statistics, print content, hyperslab, manipulate metadata, and output results in text, binary, or netCDF formats.&lt;/p&gt;
&lt;h4 id="ncks-kitchen-sink"&gt;&lt;code&gt;ncks&lt;/code&gt; (Kitchen Sink)
&lt;/h4&gt;&lt;p&gt;&lt;code&gt;ncks&lt;/code&gt; combines most features of &lt;code&gt;ncdump&lt;/code&gt; and &lt;code&gt;nccopy&lt;/code&gt; with extra capabilities for extraction, hyperslabbing, subsetting, and translation:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt; 1
&lt;/span&gt;&lt;span class="lnt"&gt; 2
&lt;/span&gt;&lt;span class="lnt"&gt; 3
&lt;/span&gt;&lt;span class="lnt"&gt; 4
&lt;/span&gt;&lt;span class="lnt"&gt; 5
&lt;/span&gt;&lt;span class="lnt"&gt; 6
&lt;/span&gt;&lt;span class="lnt"&gt; 7
&lt;/span&gt;&lt;span class="lnt"&gt; 8
&lt;/span&gt;&lt;span class="lnt"&gt; 9
&lt;/span&gt;&lt;span class="lnt"&gt;10
&lt;/span&gt;&lt;span class="lnt"&gt;11
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# Copy a file (the copy includes history)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;ncks example.nc copy.nc
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# Select a variable&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;ncks -v WIND_6000maboveground wrf.nc
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# Print as CDL (like ncdump)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;ncks --cdl example.nc
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# Compress with level 5 compression (-4 required for netCDF4)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;ncks wrf.nc -4 -L5 wrf_compressed.nc
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;There&amp;rsquo;s also extensive functionality for concatenation, appending, summary statistics, and comparing files.&lt;/p&gt;
&lt;h4 id="other-nco-tools"&gt;Other NCO Tools
&lt;/h4&gt;&lt;ul&gt;
&lt;li&gt;&lt;code&gt;ncrcat&lt;/code&gt;/&lt;code&gt;ncecat&lt;/code&gt;: concatenation (which one to use depends on whether you already have a record dimension like time)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;ncra&lt;/code&gt;: average over time&lt;/li&gt;
&lt;li&gt;&lt;code&gt;ncwa&lt;/code&gt;: weighted averages&lt;/li&gt;
&lt;/ul&gt;
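&lt;p&gt;For instance, to average a set of files along the record (time) dimension (file names invented):&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt;1
&lt;/span&gt;&lt;span class="lnt"&gt;2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# Average hourly files into a single mean along the record dimension&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;ncra hour_*.nc daily_mean.nc
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;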
&lt;h3 id="converting-netcdf-to-grib"&gt;Converting netCDF to GRIB
&lt;/h3&gt;&lt;p&gt;The CDO (Climate Data Operators) library can convert between formats:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;cdo -f grb2 copy wrf.nc wrf_from_nc.gr2
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Fair warning: this doesn&amp;rsquo;t always work cleanly.&lt;/p&gt;
&lt;h2 id="takeaways"&gt;Takeaways
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;Complicated datasets are complicated!&lt;/li&gt;
&lt;li&gt;There is more than one way to do everything.&lt;/li&gt;
&lt;li&gt;With some basic commands, you can inspect any of these file types.&lt;/li&gt;
&lt;li&gt;Converting between formats is often straightforward.&lt;/li&gt;
&lt;li&gt;All of these tools are available through Conda and conda-forge.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="resources"&gt;Resources
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://code.mpimet.mpg.de/projects/cdo" target="_blank" rel="noopener"
&gt;CDO (Climate Data Operators)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://confluence.ecmwf.int/display/ECC/GRIB&amp;#43;tools&amp;#43;examples" target="_blank" rel="noopener"
&gt;GRIB tools examples&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://rda.ucar.edu/datasets/ds083.2/software/README_Formats.pdf" target="_blank" rel="noopener"
&gt;README on Formats (UCAR)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://www.ftp.cpc.ncep.noaa.gov/wd51we/wgrib2/intro_grib2.pdf" target="_blank" rel="noopener"
&gt;Intro to GRIB2 (PDF)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://www.nco.ncep.noaa.gov/pmb/docs/grib2/grib2_doc/" target="_blank" rel="noopener"
&gt;GRIB2 Spec&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://www.unidata.ucar.edu/presentations/Rew/netcdf-intro-2014.pdf" target="_blank" rel="noopener"
&gt;NetCDF Intro (2014 Presentation)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="http://research.atmos.ucla.edu/csi/GROUP/tips/NCO_basics_N.Berg2013.pdf" target="_blank" rel="noopener"
&gt;NCO Basics (PDF)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</description></item><item><title>The Programming Book That Made My Career</title><link>https://tdhopper.com/blog/learning-python/</link><pubDate>Wed, 20 Jan 2021 13:58:16 +0000</pubDate><guid>https://tdhopper.com/blog/learning-python/</guid><description>&lt;img src="https://tdhopper.com/images/minimalist-landscape-with-mouse-and-snake.png" alt="Featured image of post The Programming Book That Made My Career" /&gt;&lt;p&gt;On January 7, 2011, &lt;a class="link" href="https://twitter.com/tdhopper/status/23515755127508993" target="_blank" rel="noopener"
&gt;I
tweeted&lt;/a&gt;, &amp;ldquo;Trying to
learn Python. We&amp;rsquo;ll see if this keeps up once classes start.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;On January 20, 2011, ten years ago today, I bought Mark Lutz&amp;rsquo;s &lt;a class="link" href="https://amzn.to/3sHDwXa" target="_blank" rel="noopener"
&gt;Learning Python&lt;/a&gt; from O&amp;rsquo;Reilly Books. Over the next month, I
read it on my Kindle on the Stairmaster at the school gym. It changed the
course of my adult life: since reading that book, I&amp;rsquo;ve written Python on more
days than I haven&amp;rsquo;t.&lt;/p&gt;
&lt;p&gt;Before discovering the data science Twitter community in late 2010, I&amp;rsquo;d only
heard of Python through another student researcher in my 2007
&lt;a class="link" href="https://www.nsf.gov/crssprgm/reu/" target="_blank" rel="noopener"
&gt;REU&lt;/a&gt;, who had used Python for some graphics
programming research. On Twitter, people like &lt;a class="link" href="https://twitter.com/hmason" target="_blank" rel="noopener"
&gt;Hilary
Mason&lt;/a&gt;, &lt;a class="link" href="https://twitter.com/JohnDCook" target="_blank" rel="noopener"
&gt;John
Cook&lt;/a&gt;, and &lt;a class="link" href="https://twitter.com/fonnesbeck" target="_blank" rel="noopener"
&gt;Chris
Fonnesbeck&lt;/a&gt; talked warmly about using Python
in their scientific work.&lt;/p&gt;
&lt;p&gt;In 2011, I was a first-year operations research student at North Carolina
State University, and I realized that I should pick up programming again to
improve my career prospects. I&amp;rsquo;d taught myself some PHP in high school (circa
2003). In undergrad, I did a computer science minor and learned a good bit of
C++ and did a lot of Mathematica scripting in my math coursework, but from
2008 to 2011, I basically didn&amp;rsquo;t program.&lt;/p&gt;
&lt;p&gt;&lt;a class="link" href="https://twitter.com/tdhopper/status/132924915526533120" target="_blank" rel="noopener"
&gt;I struggled to use
Python&lt;/a&gt; in practice
for ML/scientific computing in those days before wheels (binary installs);
libraries like Scipy and Numpy required brittle compilation of C++ and Fortran
dependencies. During the summer of 2011, I gave up using Python (&lt;a class="link" href="https://twitter.com/tdhopper/status/73798291648811008" target="_blank" rel="noopener"
&gt;after
segfaulting&lt;/a&gt;
&lt;a class="link" href="https://orangedatamining.com/" target="_blank" rel="noopener"
&gt;Orange&lt;/a&gt;) for my internship at &lt;a class="link" href="https://en.wikipedia.org/wiki/Amazon_Robotics" target="_blank" rel="noopener"
&gt;Kiva Systems&lt;/a&gt; and dove into R, where
I could install packages more reliably and use
&lt;a class="link" href="https://ggplot2.tidyverse.org/" target="_blank" rel="noopener"
&gt;ggplot2&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;For some reason (probably because of Twitter), I returned to Python a year
later and used it to write my research code in grad school (despite my
advisor&amp;rsquo;s wishes that I use C++). &lt;a class="link" href="https://github.com/fonnesbeck/ScipySuperpack" target="_blank" rel="noopener"
&gt;Scipy Superpack&lt;/a&gt; for installing the
Scipy stack was invaluable (thankfully now replaced by wheels and Conda).&lt;/p&gt;
&lt;p&gt;My experience with Python was a big reason I got hired at &lt;a class="link" href="https://www.rti.org/" target="_blank" rel="noopener"
&gt;RTI International&lt;/a&gt; when I left my PhD program in October. They were looking to reduce their SAS dependency (and costs) and wanted
people experienced with open-source tools. I taught my colleagues a &amp;ldquo;Python for
statisticians&amp;rdquo; seminar soon after joining RTI.&lt;/p&gt;
&lt;p&gt;From RTI, I joined Parse.ly, which was like a Python boot camp: I worked with &lt;a class="link" href="https://amontalenti.com/" target="_blank" rel="noopener"
&gt;Andrew Montalenti&lt;/a&gt; and others. Parse.ly uses Python across
its full stack, and it was an eye-opening and educational year for me.&lt;/p&gt;
&lt;p&gt;Since then, I&amp;rsquo;ve worked at a variety of companies where I&amp;rsquo;ve been able to use
Python for training machine learning models, building machine learning
platforms, &lt;a class="link" href="https://dp.tdhopper.com" target="_blank" rel="noopener"
&gt;writing Gibbs samplers for nonparametric Bayes&lt;/a&gt;, building data engineering pipelines, software
testing, etc.&lt;/p&gt;
&lt;p&gt;I use Python almost every single day for work and a lot of personal projects.
I&amp;rsquo;ve been able &lt;a class="link" href="https://tdhopper.com/talks" &gt;to speak&lt;/a&gt; at 3 PyData conferences,
a Scipy conference, and a number of &lt;a class="link" href="https://www.meetup.com/tripython/" target="_blank" rel="noopener"
&gt;Triangle Python Users Groups&lt;/a&gt;. I had a &lt;a class="link" href="https://github.com/python/cpython/pull/11847" target="_blank" rel="noopener"
&gt;contribution merged into CPython in 2019&lt;/a&gt; and have
contributed to many other open-source projects.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;m grateful to the countless people who have taught me (through tweets, code
reviews, conference talks, etc.) about Python and the many who have built the
wonderful language with its incredible ecosystem of tools and packages that
enable me and others to do so many things.&lt;/p&gt;</description></item><item><title>Feature Stores with Willem Pienaar</title><link>https://tdhopper.com/blog/feature-stores-with-willem-pienaar/</link><pubDate>Sat, 16 Jan 2021 00:00:00 +0000</pubDate><guid>https://tdhopper.com/blog/feature-stores-with-willem-pienaar/</guid><description>&lt;img src="https://tdhopper.com/images/podcast.png" alt="Featured image of post Feature Stores with Willem Pienaar" /&gt;&lt;h2 id="listen"&gt;Listen
&lt;/h2&gt;&lt;iframe src="https://open.spotify.com/embed/episode/2tVjErd1RD0t3BZbNXcKYH" width="100%" height="232" frameBorder="0" allowtransparency="true" allow="encrypted-media"&gt;&lt;/iframe&gt;
&lt;h2 id="links"&gt;Links
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://twitter.com/willpienaar?lang=en" target="_blank" rel="noopener"
&gt;Willem&amp;rsquo;s Twitter&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://www.twitter.com/adamlaiacano" target="_blank" rel="noopener"
&gt;Adam&amp;rsquo;s Twitter&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://www.youtube.com/watch?v=DaNv-Wf1MBA" target="_blank" rel="noopener"
&gt;Feast: feature store for Machine Learning (2020 talk)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://feast.dev/" target="_blank" rel="noopener"
&gt;Feast&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="http://tecton.ai/" target="_blank" rel="noopener"
&gt;Tecton&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://en.wikipedia.org/wiki/Gojek" target="_blank" rel="noopener"
&gt;GoJek on Wikipedia&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://www.featurestore.org/" target="_blank" rel="noopener"
&gt;featurestore.org&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="subscribe"&gt;Subscribe
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://tdhopper.com/podcast/feed" target="_blank" rel="noopener"
&gt;RSS Feed&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://podcasts.apple.com/us/podcast/into-the-hopper/id1499693201" target="_blank" rel="noopener"
&gt;Apple Podcasts&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://open.spotify.com/show/63NrgKMVb0VTwkklGboIjy" target="_blank" rel="noopener"
&gt;Spotify&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://overcast.fm/itunes1499693201/into-the-hopper" target="_blank" rel="noopener"
&gt;Overcast&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</description></item><item><title>Evolution of a Data Scientist with Adam Laiacano</title><link>https://tdhopper.com/blog/evolution-of-a-data-scientist-with-adam-laiacano/</link><pubDate>Sat, 10 Oct 2020 00:00:00 +0000</pubDate><guid>https://tdhopper.com/blog/evolution-of-a-data-scientist-with-adam-laiacano/</guid><description>&lt;img src="https://tdhopper.com/images/podcast.png" alt="Featured image of post Evolution of a Data Scientist with Adam Laiacano" /&gt;&lt;h2 id="listen"&gt;Listen
&lt;/h2&gt;&lt;iframe src="https://open.spotify.com/embed/episode/32c5iOpFgFXKey2i2HYVFa" width="100%" height="232" frameBorder="0" allowtransparency="true" allow="encrypted-media"&gt;&lt;/iframe&gt;
&lt;h2 id="links"&gt;Links
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://www.twitter.com/adamlaiacano" target="_blank" rel="noopener"
&gt;Adam&amp;rsquo;s Twitter&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="subscribe"&gt;Subscribe
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://tdhopper.com/podcast/feed" target="_blank" rel="noopener"
&gt;RSS Feed&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://podcasts.apple.com/us/podcast/into-the-hopper/id1499693201" target="_blank" rel="noopener"
&gt;Apple Podcasts&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://open.spotify.com/show/63NrgKMVb0VTwkklGboIjy" target="_blank" rel="noopener"
&gt;Spotify&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://overcast.fm/itunes1499693201/into-the-hopper" target="_blank" rel="noopener"
&gt;Overcast&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</description></item><item><title>git check-ignore</title><link>https://tdhopper.com/blog/git-check-ignore/</link><pubDate>Fri, 18 Sep 2020 13:10:00 +0000</pubDate><guid>https://tdhopper.com/blog/git-check-ignore/</guid><description>&lt;img src="https://tdhopper.com/images/til.png" alt="Featured image of post git check-ignore" /&gt;&lt;p&gt;&lt;a class="link" href="https://git-scm.com/docs/git-check-ignore" target="_blank" rel="noopener"
&gt;Git has&lt;/a&gt; a &lt;code&gt;git check-ignore -v &amp;lt;pathname&amp;gt;&lt;/code&gt; option which will explain to you why a given file is ignored by the repository. This is helpful for debugging &lt;code&gt;.gitignore&lt;/code&gt; and other exclude options.&lt;/p&gt;
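For instance, here is a minimal sketch of it in action (the repo layout, the `.gitignore` contents, and the file names are all made up for illustration):

```shell
# Create a throwaway repo with a simple .gitignore.
git init -q demo
cd demo
printf 'build/\n*.log\n' > .gitignore
mkdir build
touch build/output.o app.log

# Ask git to explain why each path is ignored.
git check-ignore -v build/output.o app.log
```

Each line of output should name the source file, line number, and pattern responsible, e.g. `.gitignore:2:*.log` for `app.log`.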
&lt;p&gt;I hope you never need this, but you&amp;rsquo;ll be glad if you do.&lt;/p&gt;</description></item><item><title>Customer Focused Data Science with Mike Rogers</title><link>https://tdhopper.com/blog/customer-focused-data-science-with-mike-rogers/</link><pubDate>Thu, 23 Jul 2020 00:00:00 +0000</pubDate><guid>https://tdhopper.com/blog/customer-focused-data-science-with-mike-rogers/</guid><description>&lt;img src="https://tdhopper.com/images/podcast.png" alt="Featured image of post Customer Focused Data Science with Mike Rogers" /&gt;&lt;h2 id="listen"&gt;Listen
&lt;/h2&gt;&lt;iframe src="https://open.spotify.com/embed/episode/6j7ixJgziQhF9U3zVmQoXf" width="100%" height="232" frameBorder="0" allowtransparency="true" allow="encrypted-media"&gt;&lt;/iframe&gt;
&lt;h2 id="links"&gt;Links
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://www.talasecurity.io/" target="_blank" rel="noopener"
&gt;Tala Security&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://twitter.com/complicatedbull" target="_blank" rel="noopener"
&gt;Mike&amp;rsquo;s Twitter&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://www.linkedin.com/in/kfbr392/" target="_blank" rel="noopener"
&gt;Mike&amp;rsquo;s LinkedIn&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://tdhopper.com/blog/how-i-became-a-data-scientist" target="_blank" rel="noopener"
&gt;How I Became a Data Scientist Despite Being a Math Major&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="subscribe"&gt;Subscribe
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://tdhopper.com/podcast/feed" target="_blank" rel="noopener"
&gt;RSS Feed&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://podcasts.apple.com/us/podcast/into-the-hopper/id1499693201" target="_blank" rel="noopener"
&gt;Apple Podcasts&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://open.spotify.com/show/63NrgKMVb0VTwkklGboIjy" target="_blank" rel="noopener"
&gt;Spotify&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://overcast.fm/itunes1499693201/into-the-hopper" target="_blank" rel="noopener"
&gt;Overcast&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</description></item><item><title>Go from a Pythonista's Perspective</title><link>https://tdhopper.com/blog/go-from-a-pythonistas-perspective/</link><pubDate>Mon, 20 Jul 2020 00:00:00 +0000</pubDate><guid>https://tdhopper.com/blog/go-from-a-pythonistas-perspective/</guid><description>&lt;p&gt;Recently, I&amp;rsquo;ve had my first opportunity to dive into the &lt;a class="link" href="https://golang.org" target="_blank" rel="noopener"
&gt;Go programming language&lt;/a&gt;. Most of my career has been as a Python developer, but given how frequently they are &lt;a class="link" href="https://getstream.io/blog/switched-python-go/" target="_blank" rel="noopener"
&gt;compared&lt;/a&gt;, Go seemed like a natural language to try.&lt;/p&gt;
&lt;p&gt;Before I move on to other things, I wanted to capture my reflections on Go, particularly as it compares to my experience with Python.&lt;/p&gt;
&lt;h2 id="the-good"&gt;The good
&lt;/h2&gt;&lt;h3 id="go-runs-really-quickly"&gt;Go runs really quickly
&lt;/h3&gt;&lt;p&gt;Core to the promises of the Go language is its ability to compile and run quickly. It lives up to these promises! It felt like the only time I ever spent waiting for something to happen was when Go was downloading dependencies.&lt;/p&gt;
&lt;h3 id="go-build-just-works"&gt;&lt;code&gt;go build&lt;/code&gt; just works
&lt;/h3&gt;&lt;p&gt;One of the notorious challenges of Python is packaging and deployment. With Go, &lt;code&gt;go build&lt;/code&gt; downloads dependencies and generates a single self-contained binary that runs on the target platform with no runtime dependencies to install.&lt;/p&gt;
&lt;p&gt;Even more remarkably, Go allows you to &lt;a class="link" href="https://www.digitalocean.com/community/tutorials/building-go-applications-for-different-operating-systems-and-architectures" target="_blank" rel="noopener"
&gt;build binaries for multiple system architectures&lt;/a&gt; without running that architecture. For example, a Mac user can build a Go binary that will run natively on Windows. Python wheels, on the other hand, require access to a given architecture to build (non-universal) binaries.&lt;/p&gt;
&lt;h3 id="go-modules-is-a-big-improvement"&gt;Go Modules is a big improvement
&lt;/h3&gt;&lt;p&gt;When I first tried Go several years ago, I didn&amp;rsquo;t last long beyond being told that all my source code and dependencies had to live in my GOPATH directory.&lt;/p&gt;
&lt;p&gt;With the recent introduction of &lt;a class="link" href="https://blog.golang.org/using-go-modules" target="_blank" rel="noopener"
&gt;Go Modules&lt;/a&gt;, this absurd restriction has been lifted. Go projects can live anywhere on your computer, and the Go compiler automatically downloads and installs your dependencies.&lt;/p&gt;
&lt;h3 id="go-bans-unused-variables-and-imports"&gt;Go bans unused variables and imports
&lt;/h3&gt;&lt;p&gt;The Go compiler fails to build code with unused imports or variables. I replicate this behavior in Python projects by running flake8 as a pre-commit or continuous integration check. Having it built into the compiler itself was a nice surprise.&lt;/p&gt;
&lt;h3 id="vs-code-wins-again"&gt;VS Code wins again
&lt;/h3&gt;&lt;p&gt;The Go extension for VS Code, like the Python extension, is powerful, helpful, and free.&lt;/p&gt;
&lt;h3 id="garbage-collection-is-nice"&gt;Garbage collection is nice
&lt;/h3&gt;&lt;p&gt;I can&amp;rsquo;t comment on the performance of the Go garbage collector, but I&amp;rsquo;m glad it has one. Go is heavily inspired by C, yet unlike C it doesn&amp;rsquo;t make us manage our own memory, and almost everyone is better off for it.&lt;/p&gt;
&lt;h3 id="json-marshaling-is-neat"&gt;JSON marshaling is neat
&lt;/h3&gt;&lt;p&gt;For better or worse, JSON data makes the world go &amp;lsquo;round. A delightful aspect of Go is its ability to automatically convert JSON to Go structs and Go structs to JSON. You decorate your struct fields with tags that map to JSON keys, and the standard library handles the rest. Coming from Python, where you&amp;rsquo;re often writing serialization boilerplate or reaching for a third-party library like marshmallow, this felt refreshingly clean.&lt;/p&gt;
&lt;h2 id="the-bad"&gt;The bad
&lt;/h2&gt;&lt;h3 id="tooling-is-rough-around-the-edges"&gt;Tooling is rough around the edges
&lt;/h3&gt;&lt;p&gt;Go&amp;rsquo;s tooling &lt;em&gt;mostly&lt;/em&gt; works well, but I found some rough edges. The &lt;code&gt;go get&lt;/code&gt; command conflates downloading a dependency with installing a binary, which is confusing. And while &lt;code&gt;gofmt&lt;/code&gt; is great in principle (I wish Python had a single blessed formatter&lt;sup id="fnref:1"&gt;&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref"&gt;1&lt;/a&gt;&lt;/sup&gt;), the broader tooling ecosystem felt immature compared to what I&amp;rsquo;m used to in Python.&lt;/p&gt;
&lt;h3 id="simple-things-are-hard"&gt;Simple things are hard
&lt;/h3&gt;&lt;p&gt;In Python, if I want to filter a list, I write a list comprehension:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;evens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;numbers&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;In Go, this requires a manual for loop with an append:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt;1
&lt;/span&gt;&lt;span class="lnt"&gt;2
&lt;/span&gt;&lt;span class="lnt"&gt;3
&lt;/span&gt;&lt;span class="lnt"&gt;4
&lt;/span&gt;&lt;span class="lnt"&gt;5
&lt;/span&gt;&lt;span class="lnt"&gt;6
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-go" data-lang="go"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;evens&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;x&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;:=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;range&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;numbers&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;x&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;evens&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;evens&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;This kind of thing comes up constantly. Operations that are one-liners in Python &amp;mdash; filtering, mapping, flattening &amp;mdash; all become multi-line loops in Go. I know Go values explicitness, but there&amp;rsquo;s a point where explicitness just means more code to read and more opportunities for off-by-one errors.&lt;/p&gt;
&lt;h3 id="standard-data-structures-are-limited"&gt;Standard data structures are limited
&lt;/h3&gt;&lt;p&gt;Go gives you slices and maps, and that&amp;rsquo;s about it. There&amp;rsquo;s no built-in set type, which surprised me. In Python, I use sets all the time for membership tests and deduplication. In Go, the idiomatic workaround is &lt;code&gt;map[string]bool&lt;/code&gt; or &lt;code&gt;map[string]struct{}&lt;/code&gt;, which works but reads like a hack.&lt;/p&gt;
&lt;p&gt;More broadly, the standard library doesn&amp;rsquo;t provide many of the collection types and utilities I take for granted in Python: no ordered dict, no default dict, no counter. You end up writing the same boilerplate data structure code from project to project.&lt;/p&gt;
&lt;h3 id="is-the-type-system-all-its-chalked-up-to-be"&gt;Is the type system all it&amp;rsquo;s chalked up to be?
&lt;/h3&gt;&lt;p&gt;Go&amp;rsquo;s &lt;a class="link" href="https://golang.org/doc" target="_blank" rel="noopener"
&gt;own documentation&lt;/a&gt; describes it as &amp;ldquo;a fast, statically typed, compiled language that feels like a dynamically typed, interpreted language.&amp;rdquo; I think they&amp;rsquo;re right that it &lt;em&gt;feels&lt;/em&gt; lightweight, but I&amp;rsquo;m not convinced Go&amp;rsquo;s type system catches substantially more bugs than Python with &lt;a class="link" href="http://mypy-lang.org/" target="_blank" rel="noopener"
&gt;mypy&lt;/a&gt; and a good linter.&lt;/p&gt;
&lt;p&gt;Go&amp;rsquo;s type system is simple, which is both its strength and weakness. You get basic type safety, but you don&amp;rsquo;t get sum types, pattern matching, or many of the features that make type systems in languages like Rust or Haskell genuinely powerful at preventing bugs. Meanwhile, mypy has gotten good enough that most of my Python type errors get caught before runtime anyway.&lt;/p&gt;
&lt;h3 id="lack-of-generics"&gt;Lack of generics
&lt;/h3&gt;&lt;p&gt;At the time of writing, Go has no generics.&lt;sup id="fnref:2"&gt;&lt;a href="#fn:2" class="footnote-ref" role="doc-noteref"&gt;2&lt;/a&gt;&lt;/sup&gt; This means you can&amp;rsquo;t write a function that, say, finds the maximum element of any ordered type. Instead, you write the same function for &lt;code&gt;int&lt;/code&gt;, &lt;code&gt;float64&lt;/code&gt;, &lt;code&gt;string&lt;/code&gt;, and so on &amp;mdash; or you fall back on &lt;code&gt;interface{}&lt;/code&gt; and lose the type safety that Go is supposed to provide.&lt;/p&gt;
&lt;p&gt;The lack of generics is the root cause of the &amp;ldquo;simple things are hard&amp;rdquo; problem. You can&amp;rsquo;t write a generic &lt;code&gt;filter&lt;/code&gt; or &lt;code&gt;map&lt;/code&gt; function, so you end up writing loops. It&amp;rsquo;s the most commonly cited frustration with Go, and for good reason.&lt;/p&gt;
&lt;h3 id="testing-feels-painful"&gt;Testing feels painful
&lt;/h3&gt;&lt;p&gt;Go&amp;rsquo;s built-in testing package is minimal to a fault. There&amp;rsquo;s no built-in assertion library, so instead of writing something like Python&amp;rsquo;s &lt;code&gt;self.assertEqual(got, expected)&lt;/code&gt;, you write:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt;1
&lt;/span&gt;&lt;span class="lnt"&gt;2
&lt;/span&gt;&lt;span class="lnt"&gt;3
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-go" data-lang="go"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;got&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;!=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;expected&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Errorf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;#34;got %v, want %v&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;got&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;expected&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;For every. Single. Test. Case. It&amp;rsquo;s tedious and error-prone. Python&amp;rsquo;s pytest, with its simple &lt;code&gt;assert&lt;/code&gt; statements and automatic diffs, is a world apart. I know third-party assertion libraries exist for Go, but the community seems to view them with suspicion.&lt;/p&gt;
&lt;p&gt;Table-driven tests are the idiomatic approach in Go, and they&amp;rsquo;re fine for simple cases. But the overall testing experience felt like a step backward from the Python ecosystem.&lt;/p&gt;
&lt;h3 id="no-exception-handling"&gt;No exception handling
&lt;/h3&gt;&lt;p&gt;Go replaces exceptions with multiple return values: functions return both a result and an error, and you check the error after every call. The result is code like this:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt;1
&lt;/span&gt;&lt;span class="lnt"&gt;2
&lt;/span&gt;&lt;span class="lnt"&gt;3
&lt;/span&gt;&lt;span class="lnt"&gt;4
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-go" data-lang="go"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;:=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;doSomething&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;!=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;nil&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;You see this pattern &lt;em&gt;everywhere&lt;/em&gt;. It clutters the code and obscures the main logic. In Python, I can write the happy path clearly and handle exceptions where it makes sense. In Go, error handling is interleaved with every step of the business logic.&lt;/p&gt;
&lt;p&gt;I understand the argument for explicit error handling &amp;mdash; unhandled exceptions are a real source of bugs. But Go&amp;rsquo;s approach trades one problem for another: instead of accidentally swallowing exceptions, you accidentally ignore returned errors. And the visual noise of &lt;code&gt;if err != nil&lt;/code&gt; on every other line is real.&lt;/p&gt;
&lt;h3 id="havent-gotten-into-concurrency"&gt;Haven&amp;rsquo;t gotten into concurrency
&lt;/h3&gt;&lt;p&gt;Go&amp;rsquo;s goroutines and channels are supposedly the crown jewel of the language, and I believe it. Concurrency is the use case where Go&amp;rsquo;s design makes the most sense. Unfortunately, I haven&amp;rsquo;t had the chance to work on anything concurrency-heavy yet, so I can&amp;rsquo;t speak to this from experience. I suspect this is where Go would really shine compared to Python&amp;rsquo;s GIL-constrained threading.&lt;/p&gt;
&lt;h2 id="conclusion"&gt;Conclusion
&lt;/h2&gt;&lt;p&gt;Go is a fine language, and I can see why it&amp;rsquo;s popular for building networked services and infrastructure tools. The compilation speed, the deployment story, and the concurrency model are genuine strengths that Python can&amp;rsquo;t match.&lt;/p&gt;
&lt;p&gt;But for the kind of work I do &amp;mdash; data processing, scripting, exploratory analysis &amp;mdash; Python is still a better fit. The expressiveness of the language, the richness of the standard library, and the depth of the ecosystem (numpy, pandas, scikit-learn) make it hard to leave. Go felt like trading expressiveness for performance, and for many of my use cases, that&amp;rsquo;s not a trade I need to make.&lt;/p&gt;
&lt;p&gt;If you&amp;rsquo;re a Pythonista considering Go, I&amp;rsquo;d say: try it. You&amp;rsquo;ll appreciate some things about it, and it&amp;rsquo;ll give you a new perspective on your Python code. But don&amp;rsquo;t feel like you need to switch.&lt;/p&gt;
&lt;div class="footnotes" role="doc-endnotes"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:1"&gt;
&lt;p&gt;&lt;a class="link" href="https://github.com/psf/black" target="_blank" rel="noopener"
&gt;Black&lt;/a&gt; is close to this, but it&amp;rsquo;s not universally adopted the way &lt;code&gt;gofmt&lt;/code&gt; is for Go.&amp;#160;&lt;a href="#fnref:1" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:2"&gt;
&lt;p&gt;Go generics were &lt;a class="link" href="https://blog.golang.org/generics-next-step" target="_blank" rel="noopener"
&gt;proposed&lt;/a&gt; and have been a long time coming. I look forward to seeing how they change the language.&amp;#160;&lt;a href="#fnref:2" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</description></item><item><title>Fizz Buzz Book with Joel Grus</title><link>https://tdhopper.com/blog/fizz-buzz-book-with-joel-grus/</link><pubDate>Wed, 15 Jul 2020 00:00:00 +0000</pubDate><guid>https://tdhopper.com/blog/fizz-buzz-book-with-joel-grus/</guid><description>&lt;img src="https://tdhopper.com/images/podcast.png" alt="Featured image of post Fizz Buzz Book with Joel Grus" /&gt;&lt;h2 id="listen"&gt;Listen
&lt;/h2&gt;&lt;iframe src="https://open.spotify.com/embed/episode/70eixpjzwzrccQlul6xAlL" width="100%" height="232" frameBorder="0" allowtransparency="true" allow="encrypted-media"&gt;&lt;/iframe&gt;
&lt;h2 id="links"&gt;Links
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://joelgrus.com/2016/05/23/fizz-buzz-in-tensorflow/" target="_blank" rel="noopener"
&gt;Fizz Buzz in Tensorflow&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://www.youtube.com/watch?v=o64FV-ez6Gw" target="_blank" rel="noopener"
&gt;Livecoding Madness - Let&amp;rsquo;s Build a Deep Learning Library&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://www.youtube.com/watch?v=7jiPeIFXb6U" target="_blank" rel="noopener"
&gt;I don&amp;rsquo;t like notebooks&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://www.youtube.com/watch?v=VPXXthwAdg8" target="_blank" rel="noopener"
&gt;Live Coding the Advent of Code&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://amzn.to/3j9SS1G" target="_blank" rel="noopener"
&gt;Data Science from Scratch Second Edition&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="http://fizzbuzzbook.com/" target="_blank" rel="noopener"
&gt;Fizz Buzz Book: Meditations on Python, mathematics, science, engineering, and design&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://twitter.com/joelgrus" target="_blank" rel="noopener"
&gt;Joel&amp;rsquo;s Twitter&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="subscribe"&gt;Subscribe
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://tdhopper.com/podcast/feed" target="_blank" rel="noopener"
&gt;RSS Feed&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://podcasts.apple.com/us/podcast/into-the-hopper/id1499693201" target="_blank" rel="noopener"
&gt;Apple Podcasts&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://open.spotify.com/show/63NrgKMVb0VTwkklGboIjy" target="_blank" rel="noopener"
&gt;Spotify&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://overcast.fm/itunes1499693201/into-the-hopper" target="_blank" rel="noopener"
&gt;Overcast&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</description></item><item><title>Goodnight Zoom</title><link>https://tdhopper.com/blog/goodnight-zoom/</link><pubDate>Tue, 23 Jun 2020 11:30:43 +0000</pubDate><guid>https://tdhopper.com/blog/goodnight-zoom/</guid><description>&lt;img src="https://tdhopper.com/goodnight-zoom.png" alt="Featured image of post Goodnight Zoom" /&gt;&lt;p&gt;In the great #general room
There was idle chatter
And a red build
And a gif of&amp;mdash;
The cow jumping over the moon&lt;/p&gt;
&lt;p&gt;And there were three little interns sitting with mentors
And two little commits
And a backlog of tickets
And a little blocker
And a fresh Docker
And a PM and a DM and a glass full of beer
And a loud young salesman who was typing @here&lt;/p&gt;
&lt;p&gt;Goodnight Slack room
Goodnight Zoom
Goodnight manager hopping on Zoom
Goodnight Jira
And the red build
Goodnight interns
Goodnight mentors
Goodnight AWS
And goodnight Bash&lt;/p&gt;
&lt;p&gt;Goodnight PMs
And goodnight DMs
Goodnight little blocker
And goodnight Docker
Goodnight commits
And goodnight tickets
Goodnight TypeError: null is not an object
Goodnight project
And a loud young salesman who was typing “@here”&lt;/p&gt;
&lt;p&gt;Goodnight Github stars
Goodnight asserts
Goodnight PagerDuty alerts&lt;/p&gt;</description></item><item><title>Tips for Working from Home</title><link>https://tdhopper.com/blog/tips-for-working-from-home/</link><pubDate>Mon, 16 Mar 2020 08:04:00 +0000</pubDate><guid>https://tdhopper.com/blog/tips-for-working-from-home/</guid><description>&lt;img src="https://tdhopper.com/images/office_ping_pong_game_illustration.png" alt="Featured image of post Tips for Working from Home" /&gt;&lt;p&gt;I&amp;rsquo;ve been working remotely for tech startups for six years, so I am sharing some tips for those working from home for the first time.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Start Working Early&lt;/strong&gt;&lt;br&gt;
If you can&amp;rsquo;t find someone to chat with for 30 minutes while making coffee, consider starting your workday instead.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Simulate Office Noise&lt;/strong&gt;&lt;br&gt;
If you miss the background noise of your office, try listening to 8 hours of power drill noises &lt;a class="link" href="https://www.youtube.com/watch?v=mNu4cNgJ-Hs" target="_blank" rel="noopener"
&gt;here&lt;/a&gt;. For a more realistic ambiance, you can overlay it with the most annoying laugh ever &lt;a class="link" href="https://youtube.com/watch?v=9WcSaIAa5Ac" target="_blank" rel="noopener"
&gt;here&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Afternoon Activities&lt;/strong&gt;&lt;br&gt;
Without a ping-pong table at home, you might find yourself with nothing to do but work in the afternoons.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Familiarize Yourself with Remote Tools&lt;/strong&gt;&lt;br&gt;
Programmers, take some time to get used to the essential tools for remote work, such as Jira, GitHub, Slack, and Zoom.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Engage in Local Politics&lt;/strong&gt;&lt;br&gt;
If the quarantine lasts longer than expected and you start to miss petty workplace politics, consider joining your local HOA.&lt;/li&gt;
&lt;/ol&gt;</description></item><item><title>Ten Years of Data Science with Josh Wills and Oscar Boykin</title><link>https://tdhopper.com/blog/ten-years-of-data-science-with-josh-wills-and-oscar-boykin/</link><pubDate>Wed, 19 Feb 2020 00:00:00 +0000</pubDate><guid>https://tdhopper.com/blog/ten-years-of-data-science-with-josh-wills-and-oscar-boykin/</guid><description>&lt;img src="https://tdhopper.com/images/podcast.png" alt="Featured image of post Ten Years of Data Science with Josh Wills and Oscar Boykin" /&gt;&lt;h2 id="listen"&gt;Listen
&lt;/h2&gt;&lt;iframe src="https://open.spotify.com/embed/episode/5ef9td3z5bvG6ydGfE0RE3" width="100%" height="232" frameBorder="0" allowtransparency="true" allow="encrypted-media"&gt;&lt;/iframe&gt;
&lt;h2 id="links"&gt;Links
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://twitter.com/josh_wills/" target="_blank" rel="noopener"
&gt;Josh&amp;rsquo;s twitter&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://twitter.com/posco/" target="_blank" rel="noopener"
&gt;Oscar&amp;rsquo;s twitter&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://softwareengineeringdaily.com/2020/01/10/slack-data-platform-with-josh-wills/" target="_blank" rel="noopener"
&gt;Josh on Software Engineering Daily&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://github.com/twitter/scalding" target="_blank" rel="noopener"
&gt;Scalding&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://github.com/twitter/summingbird" target="_blank" rel="noopener"
&gt;Summingbird&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://github.com/apache/crunch" target="_blank" rel="noopener"
&gt;Apache Crunch&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;blockquote class="twitter-tweet"&gt;&lt;p lang="en" dir="ltr"&gt;Data Scientist (n.): Person who is better at statistics than any software engineer and better at software engineering than any statistician.&lt;/p&gt;&amp;mdash; Josh Wills (@josh_wills) &lt;a href="https://twitter.com/josh_wills/status/198093512149958656?ref_src=twsrc%5Etfw"&gt;May 3, 2012&lt;/a&gt;&lt;/blockquote&gt; &lt;script async src="https://platform.twitter.com/widgets.js" charset="utf-8"&gt;&lt;/script&gt;
&lt;h2 id="subscribe"&gt;Subscribe
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://tdhopper.com/podcast/feed" target="_blank" rel="noopener"
&gt;RSS Feed&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://podcasts.apple.com/us/podcast/into-the-hopper/id1499693201" target="_blank" rel="noopener"
&gt;Apple Podcasts&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://open.spotify.com/show/63NrgKMVb0VTwkklGboIjy" target="_blank" rel="noopener"
&gt;Spotify&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://overcast.fm/itunes1499693201/into-the-hopper" target="_blank" rel="noopener"
&gt;Overcast&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</description></item><item><title>Safety first?</title><link>https://tdhopper.com/blog/safety-first/</link><pubDate>Wed, 09 Oct 2019 09:13:00 +0000</pubDate><guid>https://tdhopper.com/blog/safety-first/</guid><description>&lt;img src="https://tdhopper.com/images/safetyship.png" alt="Featured image of post Safety first?" /&gt;&lt;p&gt;I&amp;rsquo;m always annoyed when the pre-flight instructions inform us that the airline&amp;rsquo;s main priority is our safety.&lt;/p&gt;
&lt;p&gt;Turns out Thomas Aquinas observed this 700 years ago:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;If the highest aim of a captain were to preserve his ship, he would keep it in port for ever.&lt;/p&gt;
&lt;/blockquote&gt;</description></item><item><title>My Tweet in BBC News</title><link>https://tdhopper.com/blog/tweets-in-the-news/</link><pubDate>Mon, 01 Jul 2019 13:40:00 +0000</pubDate><guid>https://tdhopper.com/blog/tweets-in-the-news/</guid><description>&lt;p&gt;A few weeks back, I was browsing Google Maps and noticed the word &amp;ldquo;DEMO&amp;rdquo;
written in faint green letters in rural South Sudan and tweeted about it. A journalist came across
the tweet and &lt;a class="link" href="https://www.bbc.com/news/technology-48444703" target="_blank" rel="noopener"
&gt;wrote about it for
bbc.com&lt;/a&gt;.&lt;/p&gt;
&lt;blockquote class="twitter-tweet"&gt;&lt;p lang="qme" dir="ltr"&gt;🤔 &lt;a href="https://t.co/TOJvHsxdD9"&gt;https://t.co/TOJvHsxdD9&lt;/a&gt; &lt;a href="https://t.co/UtHG6Jpgxz"&gt;pic.twitter.com/UtHG6Jpgxz&lt;/a&gt;&lt;/p&gt;&amp;mdash; Tim Hopper (@tdhopper) &lt;a href="https://twitter.com/tdhopper/status/1133413170909982720?ref_src=twsrc%5Etfw"&gt;May 28, 2019&lt;/a&gt;&lt;/blockquote&gt; &lt;script async src="https://platform.twitter.com/widgets.js" charset="utf-8"&gt;&lt;/script&gt;</description></item><item><title>My Tweet in Nature</title><link>https://tdhopper.com/blog/nature-article-open-source/</link><pubDate>Mon, 01 Jul 2019 12:00:00 +0000</pubDate><guid>https://tdhopper.com/blog/nature-article-open-source/</guid><description>&lt;img src="https://tdhopper.com/images/computer-fishing.png" alt="Featured image of post My Tweet in Nature" /&gt;&lt;p&gt;Way back in 2015, I tweeted:&lt;/p&gt;
&lt;blockquote class="twitter-tweet"&gt;&lt;p lang="en" dir="ltr"&gt;Give a man a fish and you feed him for a day.&lt;br&gt;&lt;br&gt;Write a program to fish for him and you maintain it for a lifetime.&lt;/p&gt;&amp;mdash; Tim Hopper (@tdhopper) &lt;a href="https://twitter.com/tdhopper/status/661551195291574272?ref_src=twsrc%5Etfw"&gt;November 3, 2015&lt;/a&gt;&lt;/blockquote&gt; &lt;script async src="https://platform.twitter.com/widgets.js" charset="utf-8"&gt;&lt;/script&gt;
&lt;p&gt;Today, that tweet appeared in &lt;a class="link" href="https://www.nature.com/articles/d41586-019-02046-0" target="_blank" rel="noopener"
&gt;a Nature article about the challenges of releasing and maintaining open source software&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The article opens with the story of how the first-ever image of a black hole was made possible by open-source software like Matplotlib, yet just five days after that historic announcement, the NSF rejected a grant proposal to support that very ecosystem, &amp;ldquo;saying that the software lacked sufficient impact.&amp;rdquo;&lt;/p&gt;</description></item><item><title>Installing Python on Mojave with pyenv</title><link>https://tdhopper.com/blog/installing-python/</link><pubDate>Thu, 30 May 2019 00:00:00 +0000</pubDate><guid>https://tdhopper.com/blog/installing-python/</guid><description>&lt;p&gt;I use &lt;a class="link" href="https://github.com/pyenv/pyenv" target="_blank" rel="noopener"
&gt;pyenv&lt;/a&gt; to manage Python versions on my Mac. I&amp;rsquo;ve recently gotten errors like&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-fallback" data-lang="fallback"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;WARNING: The Python sqlite3 extension was not compiled. Missing the SQLite3 lib?
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;and&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-fallback" data-lang="fallback"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;zipimport.ZipImportError: can&amp;#39;t decompress data; zlib not available
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;The solution seems to be setting &lt;code&gt;LDFLAGS&lt;/code&gt; and &lt;code&gt;CPPFLAGS&lt;/code&gt; to point to the sqlite3 and zlib libraries, e.g.:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt;1
&lt;/span&gt;&lt;span class="lnt"&gt;2
&lt;/span&gt;&lt;span class="lnt"&gt;3
&lt;/span&gt;&lt;span class="lnt"&gt;4
&lt;/span&gt;&lt;span class="lnt"&gt;5
&lt;/span&gt;&lt;span class="lnt"&gt;6
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;brew install sqlite3
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;brew install zlib
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nb"&gt;export&lt;/span&gt; &lt;span class="nv"&gt;LDFLAGS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;-L/usr/local/opt/zlib/lib -L/usr/local/opt/sqlite/lib&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nb"&gt;export&lt;/span&gt; &lt;span class="nv"&gt;CPPFLAGS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;-I/usr/local/opt/zlib/include -I/usr/local/opt/sqlite/include&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;pyenv install 3.7.0
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;</description></item><item><title>Access an IPython repl from pdb</title><link>https://tdhopper.com/blog/ipython-pdb/</link><pubDate>Wed, 13 Feb 2019 00:00:00 +0000</pubDate><guid>https://tdhopper.com/blog/ipython-pdb/</guid><description>&lt;p&gt;I&amp;rsquo;m a big fan of using &lt;a class="link" href="https://docs.python.org/3/library/pdb.html" target="_blank" rel="noopener"
&gt;pdb&lt;/a&gt;, the Python interactive debugger, in conjunction with Pytest as I&amp;rsquo;m writing code.&lt;sup id="fnref:1"&gt;&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref"&gt;1&lt;/a&gt;&lt;/sup&gt; With the &lt;code&gt;--pdb&lt;/code&gt; flag, you can have Pytest drop into pdb when a test fails. With &lt;code&gt;pytest.set_trace()&lt;/code&gt;, you can selectively enter pdb while running your tests. (I use this where I might&amp;rsquo;ve just added print statements in the past.)&lt;/p&gt;
&lt;p&gt;A colleague uses IPython for most of his interactive development and asked me for help with some of the friction in his workflow. I recommended pdb, but he wanted to stick with the familiar IPython repl for much of his work.&lt;sup id="fnref:2"&gt;&lt;a href="#fn:2" class="footnote-ref" role="doc-noteref"&gt;2&lt;/a&gt;&lt;/sup&gt; He figured out how to &lt;a class="link" href="https://ipython.readthedocs.io/en/stable/interactive/reference.html#embedding" target="_blank" rel="noopener"
&gt;drop into an IPython repl&lt;/a&gt; from pdb with &lt;code&gt;from IPython import embed; embed()&lt;/code&gt;. You can&amp;rsquo;t move up and down the call stack until you exit the repl, but you have access to all the local state you have in pdb.&lt;sup id="fnref1:2"&gt;&lt;a href="#fn:2" class="footnote-ref" role="doc-noteref"&gt;2&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;p&gt;To simplify this, I recommended leveraging two little-known pdb features: &lt;code&gt;alias&lt;/code&gt; and &lt;code&gt;.pdbrc&lt;/code&gt;. &lt;a class="link" href="https://docs.python.org/3/library/pdb.html#debugger-aliases" target="_blank" rel="noopener"
&gt;&lt;code&gt;alias&lt;/code&gt;&lt;/a&gt; allows you to set aliases for statements in pdb; in this case, we define &lt;code&gt;ipy&lt;/code&gt; as an alias that drops into IPython:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-fallback" data-lang="fallback"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;alias ipy from IPython import embed; embed()
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;pdb also has the ability to load a config file from the user&amp;rsquo;s home directory or the current working directory. Each line in the file is just a pdb statement. Thus, if you create &lt;code&gt;~/.pdbrc&lt;/code&gt; and add the alias statement above, the &lt;code&gt;ipy&lt;/code&gt; command becomes available in every pdb session for your user.&lt;/p&gt;
&lt;div class="footnotes" role="doc-endnotes"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:1"&gt;
&lt;p&gt;I highly recommend pip installing &lt;a class="link" href="https://pypi.org/project/pdbpp/" target="_blank" rel="noopener"
&gt;pdbpp&lt;/a&gt;, which replaces the default pdb with an enhanced debugger with syntax highlighting, tab completion, and more.&amp;#160;&lt;a href="#fnref:1" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:2"&gt;
&lt;p&gt;There&amp;rsquo;s also an alternative debugger for Python based on IPython: &lt;a class="link" href="https://pypi.org/project/ipdb/" target="_blank" rel="noopener"
&gt;ipdb&lt;/a&gt;.&amp;#160;&lt;a href="#fnref:2" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&amp;#160;&lt;a href="#fnref1:2" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</description></item><item><title>Right Code, Right Place, Right Time</title><link>https://tdhopper.com/talks/right-code/</link><pubDate>Tue, 29 Jan 2019 00:00:00 +0000</pubDate><guid>https://tdhopper.com/talks/right-code/</guid><description>&lt;img src="https://tdhopper.com/images/code-review.png" alt="Featured image of post Right Code, Right Place, Right Time" /&gt;&lt;p&gt;I gave a talk at PyData DC 2018 where I tried to articulate some reasons why companies building machine learning products under-invest in engineering and architecture.
I&amp;rsquo;m very interested in feedback, pointers to other resources on this topic, and a general discussion about how to make more effective ML products.&lt;/p&gt;
&lt;iframe src="https://docs.google.com/presentation/d/e/2PACX-1vShM7kGP4WIWtr545IRMv2viQ_xpsn2ouAdYvo1yWuP5EmIWk3wYapKBWMjqua4bRwdeKHFHsz0qPSe/embed?start=false&amp;loop=false&amp;delayms=10000" frameborder="0" width="100%" height="400" allowfullscreen="true" mozallowfullscreen="true" webkitallowfullscreen="true"&gt;&lt;/iframe&gt;
&lt;p&gt;&lt;small&gt;&lt;a class="link" href="https://docs.google.com/presentation/d/1CUfrWaET5cepR9KUXHMe9ChbenEaRAS-9KNyTxAb_HM/edit?usp=sharing" target="_blank" rel="noopener"
&gt;Open in Google Docs&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;
&lt;p&gt;And the video:&lt;/p&gt;
&lt;div class="video-wrapper"&gt;
&lt;iframe loading="lazy"
src="https://www.youtube.com/embed/BvqLgeNVg0Y"
allowfullscreen
title="YouTube Video"
&gt;
&lt;/iframe&gt;
&lt;/div&gt;</description></item><item><title>Problem of Programming</title><link>https://tdhopper.com/blog/problem-of-programming/</link><pubDate>Thu, 20 Dec 2018 12:30:00 +0000</pubDate><guid>https://tdhopper.com/blog/problem-of-programming/</guid><description>&lt;img src="https://tdhopper.com/images/computerweights.png" alt="Featured image of post Problem of Programming" /&gt;&lt;p&gt;Edsger W. Dijkstra in 1972:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;As long as there were no machines, programming was no problem at all; when we had a few weak computers, programming became a mild problem, and now we have gigantic computers, programming has become an equally gigantic problem.&lt;/p&gt;
&lt;/blockquote&gt;</description></item><item><title>"I Like Notebooks": a response to Joel Grus</title><link>https://tdhopper.com/blog/i-like-notebooks-a-response-to-joel-grus/</link><pubDate>Tue, 13 Nov 2018 12:30:00 +0000</pubDate><guid>https://tdhopper.com/blog/i-like-notebooks-a-response-to-joel-grus/</guid><description>&lt;img src="https://tdhopper.com/images/notebook.png" alt="Featured image of post "I Like Notebooks": a response to Joel Grus" /&gt;&lt;p&gt;At JupyterDay in the Triangle, I gave a response to &lt;a class="link" href="https://www.youtube.com/watch?v=7jiPeIFXb6U" target="_blank" rel="noopener"
&gt;Joel Grus&amp;rsquo;s memorable &amp;ldquo;I Don&amp;rsquo;t Like Notebooks&amp;rdquo;&lt;/a&gt; presentation.&lt;/p&gt;
&lt;p&gt;It wasn&amp;rsquo;t recorded, but &lt;a class="link" href="https://docs.google.com/presentation/d/1XmbeH_sdOKqhi05_FbH2EdRw948i8IvBz1PdfJGbhf4/edit?usp=sharing" target="_blank" rel="noopener"
&gt;here are my slides&lt;/a&gt;:&lt;/p&gt;
&lt;iframe src="https://docs.google.com/presentation/d/e/2PACX-1vT-rTqOaHhXN5ztzOzstVrRDRyf-wNhEfSyaCVYVJ277HrzTcM_qXReQNIyTEU9p_xffmNrEamtmNtd/embed?start=false&amp;loop=false&amp;delayms=3000" frameborder="0" width="100%" height="400" allowfullscreen="true" mozallowfullscreen="true" webkitallowfullscreen="true"&gt;&lt;/iframe&gt;</description></item><item><title>Devops Empowered Data Science with Ansible</title><link>https://tdhopper.com/talks/ansible-talk/</link><pubDate>Thu, 12 Jul 2018 00:00:00 +0000</pubDate><guid>https://tdhopper.com/talks/ansible-talk/</guid><description>&lt;img src="https://tdhopper.com/preso.png" alt="Featured image of post Devops Empowered Data Science with Ansible" /&gt;&lt;p&gt;I gave a talk at Scipy 2018 loosely based on my &lt;a class="link" href="https://tdhopper.com/blog/automating-python-with-ansible/" &gt;Ansible tutorial&lt;/a&gt;. Here are my slides:&lt;/p&gt;
&lt;iframe src="https://docs.google.com/presentation/d/e/2PACX-1vSHtiQKbpNvd902FzCqhzjZCbt4ldbPnolgSQfQbaPoQk0eweUng7Wkxjb5uDi942Ul0trl3s7C0yQ5/embed?start=false&amp;loop=false&amp;delayms=5000" frameborder="0" width="480" height="389" allowfullscreen="true" mozallowfullscreen="true" webkitallowfullscreen="true"&gt;&lt;/iframe&gt;
&lt;p&gt;&lt;small&gt;&lt;a class="link" href="https://drive.google.com/open?id=1nsgG2YjfE5yWGdsxfzgqDj4SS1isyOIzBUjNa-r8ly0" target="_blank" rel="noopener"
&gt;Open in Google Docs&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;
&lt;p&gt;And the video:&lt;/p&gt;
&lt;div class="video-wrapper"&gt;
&lt;iframe loading="lazy"
src="https://www.youtube.com/embed/LKVtmGtT6qg"
allowfullscreen
title="YouTube Video"
&gt;
&lt;/iframe&gt;
&lt;/div&gt;</description></item><item><title>Challenges in Applying Machine Learning to Cybersecurity</title><link>https://tdhopper.com/talks/ml-for-security/</link><pubDate>Sat, 23 Jun 2018 00:00:00 +0000</pubDate><guid>https://tdhopper.com/talks/ml-for-security/</guid><description>&lt;img src="https://tdhopper.com/preso.png" alt="Featured image of post Challenges in Applying Machine Learning to Cybersecurity" /&gt;&lt;p&gt;I gave a talk called &lt;em&gt;Challenges in Applying Machine Learning to Cybersecurity&lt;/em&gt; at &lt;a class="link" href="http://bsidesasheville.com" target="_blank" rel="noopener"
&gt;BSides Asheville&lt;/a&gt;. Here are my slides:&lt;/p&gt;
&lt;div class="embed-responsive embed-responsive-16by9"&gt;
&lt;iframe class="embed-responsive-item" src="https://docs.google.com/presentation/d/e/2PACX-1vSxsBcas-WSaIchfsPU5uJI5mEiWW9oHxsGuW59H61P_XD-VGCNTaPro92fsWa1ovjyKCu2S4CluBi7/embed?start=false&amp;loop=false&amp;delayms=3000" frameborder="0" width="960" height="569" allowfullscreen="true" mozallowfullscreen="true" webkitallowfullscreen="true"&gt;&lt;/iframe&gt;
&lt;/div&gt;
&lt;p&gt;And the video:&lt;/p&gt;
&lt;div class="video-wrapper"&gt;
&lt;iframe loading="lazy"
src="https://www.youtube.com/embed/OS6j1Y2tJLY"
allowfullscreen
title="YouTube Video"
&gt;
&lt;/iframe&gt;
&lt;/div&gt;</description></item><item><title>Wrong Figures, Right Answers?</title><link>https://tdhopper.com/blog/wrong-figures/</link><pubDate>Tue, 29 May 2018 00:00:00 +0000</pubDate><guid>https://tdhopper.com/blog/wrong-figures/</guid><description>&lt;blockquote&gt;
&lt;p&gt;On two occasions I have been asked, — &amp;lsquo;Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?&amp;rsquo; In one case a member of the Upper, and in the other a member of the Lower, House [of British Parliament] put this question.&lt;/p&gt;
&lt;p&gt;I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;— Charles Babbage, &lt;em&gt;Passages from the Life of a Philosopher&lt;/em&gt; (1864)&lt;/p&gt;</description></item><item><title>I Basically Can't Hire People Who Don't Know Git</title><link>https://tdhopper.com/blog/data-scientists-at-work/</link><pubDate>Tue, 22 May 2018 00:00:00 +0000</pubDate><guid>https://tdhopper.com/blog/data-scientists-at-work/</guid><description>&lt;p&gt;In 2014, &lt;a class="link" href="https://twitter.com/seb_g" target="_blank" rel="noopener"
&gt;Sebastian Gutierrez&lt;/a&gt; published a collection of interviews entitled &lt;a class="link" href="https://amzn.to/2Lo0A7s" target="_blank" rel="noopener"
&gt;Data Scientists at Work&lt;/a&gt;. My friend and former boss Eric Jonas posted &lt;a class="link" href="http://ericjonas.com/datascientistsatwork.html" target="_blank" rel="noopener"
&gt;his interview on his website&lt;/a&gt;. It&amp;rsquo;s full of gems.&lt;/p&gt;
&lt;p&gt;On engineering skills required for data science work, Eric says,&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;On the industry side, I think that the ability to do software engineering is something that is very important, but isn’t really taught. You don’t actually learn it as a computer science undergraduate, and you certainly don’t learn it as a graduate student. So for me it’s very important that someone has learned it somehow—either by themselves or from someone else. &lt;strong&gt;I basically can’t hire people who don’t know Git.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;On someone trained in pure mathematics learning to analyze real-world data, Eric says:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;hellip;data analysis is so much messier than actual math. I have friends who work on these topology-based approaches, and I’m like, “You realize these manifolds totally evaporate when you actually throw noise into the system. How do you think this is really going to play out here?” So I would much rather someone be computationally skilled. I’m willing to trade off what their Putnam score was for how many open source GitHub projects they’ve committed to in the past.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I tried to argue this same point in &lt;a class="link" href="https://tdhopper.com/how/" &gt;an earlier post&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;On applying academic research, Eric observes:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;For example, when I evaluate machine learning papers, what I am looking to find out is whether the technique worked or not. &lt;strong&gt;This is something that the world needs to know—most papers don’t actually tell you whether the thing worked.&lt;/strong&gt; It’s really infuriating because most papers will show five dataset examples and then show that they’re slightly better on two different metrics when comparing against something from 20 years ago. In academia, it’s fine. In industry, it’s infuriating, because you need to know what actually works and what doesn’t.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I have suggested before that we need a good website for sharing implementations of academic algorithms and providing a forum for discussion of whether or not the algorithm &lt;em&gt;actually works&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;I highly recommend reading Eric&amp;rsquo;s &lt;a class="link" href="http://ericjonas.com/datascientistsatwork.html" target="_blank" rel="noopener"
&gt;full interview&lt;/a&gt;.&lt;/p&gt;</description></item><item><title>George Washington's Struggle for Inbox Zero</title><link>https://tdhopper.com/blog/george-washington-inbox-zero/</link><pubDate>Tue, 06 Mar 2018 13:34:00 +0000</pubDate><guid>https://tdhopper.com/blog/george-washington-inbox-zero/</guid><description>&lt;img src="https://tdhopper.com/images/george-washington.png" alt="Featured image of post George Washington's Struggle for Inbox Zero" /&gt;&lt;blockquote&gt;
&lt;p&gt;Washington devoted far more time to the onerous task of drafting letters than
leading men into battle. Running an embryonic government, he protested to
Congress that he and his aides &amp;ldquo;are confined from morn till eve, hearing and
answering the applications and letters of one and another,&amp;rdquo; leaving him with
&amp;ldquo;no hours for recreation.&amp;rdquo; He groaned at the huge stacks of correspondence
and felt besieged by supplicants for various favors. At times the enormous
quantity of paperwork must have seemed more daunting than British arms.&lt;/p&gt;&lt;span class="cite"&gt;&lt;span&gt;― &lt;/span&gt;&lt;span&gt;Ron Chernow, &lt;/span&gt;&lt;a href="https://amzn.to/3e3NppS"&gt;&lt;cite&gt;Washington: A Life&lt;/cite&gt;&lt;/a&gt;&lt;/span&gt;&lt;/blockquote&gt;</description></item><item><title>A Subjective and Anecdotal FAQ on Becoming a Data Scientist</title><link>https://tdhopper.com/blog/faq/</link><pubDate>Mon, 05 Mar 2018 16:59:00 +0000</pubDate><guid>https://tdhopper.com/blog/faq/</guid><description>&lt;img src="https://tdhopper.com/becoming.png" alt="Featured image of post A Subjective and Anecdotal FAQ on Becoming a Data Scientist" /&gt;&lt;p&gt;&lt;a class="link" href="https://tdhopper.com/blog/how-i-became-a-data-scientist/" &gt;Three years ago&lt;/a&gt;, I
wrote a post called &amp;ldquo;How I Became a Data Scientist Despite Being a Math
Major&amp;rdquo;.&lt;/p&gt;
&lt;p&gt;When I wrote the post, I thought I was explaining that it contained about all I
know about how someone can become a data scientist: that is, I shared my
subjective experience. I intended to communicate my uncertainty about the path
others should take. But, given the number of people who have read the post and
emailed asking for my advice on becoming a data scientist, that message wasn&amp;rsquo;t
clear. With most of these emails, I have felt bad because I simply don&amp;rsquo;t know
the answers to the questions people have. But because so many people seem to
have these questions, I decided to consolidate my uncertainty here. I hope you
find it helpful.&lt;/p&gt;
&lt;p&gt;A note before the Q&amp;amp;A: one reason it is so hard to advise someone on becoming
a &amp;ldquo;data scientist&amp;rdquo; is that &amp;ldquo;data scientist&amp;rdquo; is an ill-defined job title. At
some companies, data scientists are building machine learning models and
running them in a high-performance production system. At other companies,
data scientists are business analysts running SQL queries and visualizing the
results with Tableau. Some data scientists are doing complex experimental
design while others are mostly moving files around S3 buckets. Some data
science roles require deep domain expertise, others only need some programming
skill and knowledge of Scikit-Learn. As you shape your own path to data
science, take some time to think specifically about the kind of work you are
interested in doing; this will shape your preparation.&lt;/p&gt;
&lt;h2 id="should-i-get-a-masters-degree"&gt;Should I get a masters degree?
&lt;/h2&gt;&lt;p&gt;My sense is that a good masters degree in a technical field is a valuable
degree. It takes a fraction of the time a Ph.D. takes, yet you can learn a lot
in 3 or 4 semesters of coursework, and the degree is viewed favorably by
employers. The quality and curriculum of masters programs varies wildly, so
you&amp;rsquo;ll want to do your due diligence before diving in.&lt;/p&gt;
&lt;p&gt;That said, I&amp;rsquo;m wary of people going into deep debt for a masters degree. If
you can get a teaching or research position that waives your tuition, do it.
Consider going to a &lt;em&gt;good&lt;/em&gt; school on in-state tuition instead of a &lt;em&gt;great&lt;/em&gt; school
for $100,000. If you have access to publicly subsidized graduate education (as in many
European countries), go for it!&lt;/p&gt;
&lt;h2 id="will-a-masters-degree-help-me-becomebe-a-data-scientist"&gt;Will a masters degree help me become/be a data scientist?
&lt;/h2&gt;&lt;p&gt;I don&amp;rsquo;t know, but it has helped me. As I said in &lt;a class="link" href="https://tdhopper.com/blog/how-i-became-a-data-scientist/" &gt;my earlier
post&lt;/a&gt;, I learned a
lot about algorithms, probability models, math, and machine learning that has
been invaluable. Grad school also gave me the time and inspiration to learn R
and Python, which have played important roles in my career.&lt;/p&gt;
&lt;h2 id="should-i-get-a-masters-degree-in-computer-science-will-it-help-me-become-a-data-scientist"&gt;Should I get a masters degree in computer science? Will it help me become a data scientist?
&lt;/h2&gt;&lt;p&gt;I think a masters degree in computer science will be likely to pay off in the
long run. It may not help you get a job as a data scientist, but it would
undoubtedly help you in a data science job.&lt;/p&gt;
&lt;p&gt;I have half of a computer science masters, and I sometimes wish I had finished
it.&lt;/p&gt;
&lt;h2 id="should-i-get-a-masters-degree-in-operations-research"&gt;Should I get a masters degree in operations research?
&lt;/h2&gt;&lt;p&gt;I&amp;rsquo;m ambivalent about operations research. As a discipline, operations research
has been in an identity crisis. As a curriculum, many operations research
programs are full of content valuable for data scientists. As a signal to
potential employers, operations research is relatively unknown and won&amp;rsquo;t mean
as much as a degree in &amp;ldquo;machine learning&amp;rdquo; or similar.&lt;/p&gt;
&lt;h2 id="did-you-feel-that-operations-research-was-too-theoretical-and-that-you-had-to-hustle-outside-the-classroom-and-build-statml-skills-separately"&gt;Did you feel that operations research was too theoretical and that you had to hustle outside the classroom and build stat/ML skills separately?
&lt;/h2&gt;&lt;p&gt;Operations research programs have wildly different curriculums. Mine didn&amp;rsquo;t do
a great job preparing me for real-world applications, but that might be better
learned on the job anyway. My program allowed me to build statistics and ML
expertise only because I had flexibility in the courses I could select; I was
able to take a handful of stats/ML related classes. Other programs might not
offer that.&lt;/p&gt;
&lt;h2 id="should-i-get-a-masters-degree-in-statistics"&gt;Should I get a masters degree in statistics?
&lt;/h2&gt;&lt;p&gt;If you have strong programming/software engineering skills (or have another
means of building them), a statistics degree could be valuable; as with
anything, I&amp;rsquo;m sure the quality of statistics masters degrees varies greatly,
and it&amp;rsquo;s worth trying to find a good one.&lt;/p&gt;
&lt;h2 id="should-i-get-a-phd"&gt;Should I get a Ph.D.?
&lt;/h2&gt;&lt;p&gt;I don&amp;rsquo;t think it&amp;rsquo;s worth it for most
people. It&amp;rsquo;s also not necessary
for the vast majority of data science jobs. &lt;a class="link" href="https://tdhopper.com/phd" target="_blank" rel="noopener"
&gt;I have a free ebook to
help you answer this question.&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;There are a lot of people with Ph.D.&amp;rsquo;s in data science roles. It&amp;rsquo;s possible
that is more a result of the large difference between the number of Ph.D.&amp;rsquo;s and the
small number of permanent faculty positions in American universities.
with Ph.D.&amp;rsquo;s need jobs; many have skills overlapping with data science; they
make their way from academia to data science.&lt;/p&gt;
&lt;h2 id="can-i-become-a-data-scientist-with-only-an-undergraduate-degree"&gt;Can I become a data scientist with only an undergraduate degree?
&lt;/h2&gt;&lt;p&gt;Many others have. I have known several who don&amp;rsquo;t even have college degrees.&lt;/p&gt;
&lt;h2 id="im-in-school-should-i-take-x-class"&gt;I&amp;rsquo;m in school should I take X class?
&lt;/h2&gt;&lt;p&gt;If it&amp;rsquo;s linear algebra, definitely. Otherwise, I&amp;rsquo;m not sure. Among other
reasons, I&amp;rsquo;ve often been surprised by how classes I never expected to apply have
helped me years later; it&amp;rsquo;s hard for me to know what other classes would&amp;rsquo;ve
helped me had I taken them.&lt;/p&gt;
&lt;p&gt;Take the best professors you can (note I didn&amp;rsquo;t say &lt;em&gt;easiest&lt;/em&gt;). Talk to older
students you admire about different classes and professors. Don&amp;rsquo;t let your
schooling interfere with your education.&lt;/p&gt;
&lt;h2 id="can-you-evaluate-my-qualifications-for-being-a-data-scientist"&gt;Can you evaluate my qualifications for being a data scientist?
&lt;/h2&gt;&lt;p&gt;Not very well. In fact, &lt;a class="link" href="https://tdhopper.com/blog/some-reflections-on-being-turned-down-for-a-lot-of-data-science-jobs/" &gt;I think it&amp;rsquo;s pretty challenging&lt;/a&gt; even for people who interview data science
candidates. &lt;a class="link" href="https://tdhopper.com/blog/how-i-became-a-data-scientist/" &gt;I have tried to share the things that helped qualify
me&lt;/a&gt;, and I imagine
those things would be valuable for you as well.&lt;/p&gt;
&lt;p&gt;I would suggest trying to evaluate your qualifications against specific jobs
(or job descriptions) you are interested in. Are you more interested in
&lt;a class="link" href="https://medium.com/@rchang/my-two-year-journey-as-a-data-scientist-at-twitter-f0c13298aee6" target="_blank" rel="noopener"
&gt;analysis or production systems&lt;/a&gt;? Are you interested in
cybersecurity applications? Ad markets? Social good? Journalism? Finance?
Healthcare? Self-driving cars? Find job postings for roles and look at the
qualifications. Find people in these roles on Linkedin and look at their
qualifications and job history.&lt;/p&gt;
&lt;p&gt;One other note: just because you&amp;rsquo;re qualified doesn&amp;rsquo;t mean you will get job
offers. Not getting an offer after interviewing &lt;em&gt;might&lt;/em&gt; reflect more on poor
interviewers than on your being a poor candidate.&lt;/p&gt;
&lt;h2 id="how-do-i-show-on-my-resumedemonstrate-to-employers-that-i-am-qualified-to-be-a-data-scientist"&gt;How do I show-on-my-resume/demonstrate-to-employers that I am qualified to be a data scientist?
&lt;/h2&gt;&lt;p&gt;My best advice is to work on interesting and relevant things and &lt;a class="link" href="https://youtu.be/uRul8QdYvqQ" target="_blank" rel="noopener"
&gt;tell people
about them&lt;/a&gt;. I don&amp;rsquo;t know how to be more
specific.&lt;/p&gt;
&lt;h2 id="how-can-i-get-a-job-where-i-can-do-more-applied-math"&gt;How can I get a job where I can do more applied math?
&lt;/h2&gt;&lt;p&gt;Even in data science jobs, a lot of the work is far removed from interesting
math. My hypothesis is that relatively few people get to spend a substantial
part of their job thinking about interesting math. I&amp;rsquo;ve
&lt;a class="link" href="https://github.com/datamicroscopes/lda/graphs/contributors" target="_blank" rel="noopener"
&gt;almost&lt;/a&gt; never
gotten to spend as much time doing math as &lt;a class="link" href="https://twitter.com/tdhopper/status/684380622639333376" target="_blank" rel="noopener"
&gt;I would
like&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id="what-skills-would-you-recommend-i-develop-if-i-hope-to-become-a-data-scientist"&gt;What skills would you recommend I develop if I hope to become a data scientist?
&lt;/h2&gt;&lt;p&gt;You can never be a good enough writer, communicator, software engineer, linear
algebraist, or applied statistician. Tenacity is important too, though I&amp;rsquo;m not
sure how you develop it.&lt;/p&gt;
&lt;h2 id="am-i-doing-the-right-things-to-build-a-successful-career-in-data-science"&gt;Am I doing the right things to build a successful career in data science?
&lt;/h2&gt;&lt;p&gt;I don&amp;rsquo;t know the answer to that question. I have tried to share the things
that have been valuable for me in the preceding answer and &lt;a class="link" href="https://tdhopper.com/blog/how-i-became-a-data-scientist/" &gt;in my blog
post&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id="im-in-a-career-as-a-teacherdeveloperanalystetc-can-you-advise-me-on-how-i-can-transition-to-be-a-data-scientist"&gt;I&amp;rsquo;m in a career as a teacher/developer/analyst/etc. Can you advise me on how I can transition to be a data scientist?
&lt;/h2&gt;&lt;p&gt;I have tried &lt;a class="link" href="https://tdhopper.com/blog/how-i-became-a-data-scientist/" &gt;to share the things that worked for
me&lt;/a&gt;; you might be
able to emulate them, but I can&amp;rsquo;t guarantee they&amp;rsquo;ll work for you. I would
encourage you to stay curious, keep learning, network (via the internet and
face to face), and keep applying for jobs.&lt;/p&gt;
&lt;h2 id="can-we-find-a-time-to-talk-on-the-phone-about-this"&gt;Can we find a time to talk on the phone about this?
&lt;/h2&gt;&lt;p&gt;Unfortunately, I don&amp;rsquo;t have the time and energy to do this.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;Thanks to &lt;a class="link" href="https://mobile.twitter.com/roycoding" target="_blank" rel="noopener"
&gt;Roy Keyes&lt;/a&gt;, &lt;a class="link" href="http://www.vickiboykis.com" target="_blank" rel="noopener"
&gt;Vicki Boykis&lt;/a&gt;, and &lt;a class="link" href="https://twitter.com/databozo" target="_blank" rel="noopener"
&gt;Justin Bozonier&lt;/a&gt; for helpful feedback on a draft of
this post.&lt;/p&gt;</description></item><item><title>Optimism is essential to scientific success</title><link>https://tdhopper.com/blog/optimism/</link><pubDate>Tue, 13 Feb 2018 00:00:00 +0000</pubDate><guid>https://tdhopper.com/blog/optimism/</guid><description>&lt;blockquote&gt;
&lt;p&gt;I have always believed that scientific research is another domain where a form of optimism is essential to success: I have yet to meet a successful scientist who lacks the ability to exaggerate the importance of what he or she is doing, and I believe that someone who lacks a delusional sense of significance will wilt in the face of repeated experiences of multiple small failures and rare successes, the fate of most researchers.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;— Daniel Kahneman, &lt;em&gt;Thinking, Fast and Slow&lt;/em&gt;&lt;/p&gt;</description></item><item><title>dothopper photography</title><link>https://tdhopper.com/blog/dothopper-photography/</link><pubDate>Fri, 09 Feb 2018 00:00:00 +0000</pubDate><guid>https://tdhopper.com/blog/dothopper-photography/</guid><description>&lt;img src="https://tdhopper.com/images/camera.png" alt="Featured image of post dothopper photography" /&gt;&lt;p&gt;&lt;a class="link" href="https://photos.tdhopper.com" target="_blank" rel="noopener"
&gt;I created a new website&lt;/a&gt; devoted to my photography.&lt;/p&gt;
&lt;p&gt;I named it &lt;em&gt;dothopper photography&lt;/em&gt; in honor of my late grandmother, Dot Hopper, who fell in love with painting watercolors in retirement.&lt;/p&gt;</description></item><item><title>John Cook’s Ten Year Blogging Endeavour</title><link>https://tdhopper.com/blog/ten-year-endeavor/</link><pubDate>Wed, 10 Jan 2018 01:28:00 +0000</pubDate><guid>https://tdhopper.com/blog/ten-year-endeavor/</guid><description>&lt;p&gt;Ten years ago today, &lt;a class="link" href="https://www.johndcook.com" target="_blank" rel="noopener"
&gt;John Cook&lt;/a&gt; published his
first blog post entitled &lt;a class="link" href="https://www.johndcook.com/blog/2008/01/09/moores-law-and-software-bloat/" target="_blank" rel="noopener"
&gt;Moore’s law and software bloat&lt;/a&gt;, a brief observation on how &amp;ldquo;Software bloat has increased at roughly
the same rate as Moore’s law&amp;rdquo;.&lt;/p&gt;
&lt;p&gt;Since then, he’s written over 2,700 posts (nearly 1 per day) on math,
computing, software development, statistics, science, and more. His posts are
rarely long, but they always give me something to think about. In the
six years since I discovered his blog, John has encouraged me to&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://www.johndcook.com/blog/2011/11/30/dead-authors/" target="_blank" rel="noopener"
&gt;engage with dead writers&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://www.johndcook.com/blog/2010/11/24/fairy-dust-on-the-diploma/" target="_blank" rel="noopener"
&gt;not look to diplomas for magic&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://www.johndcook.com/blog/2012/06/13/matrix-condition-number/" target="_blank" rel="noopener"
&gt;take care in inverting matrices&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://www.johndcook.com/blog/2013/08/20/why-are-differentiable-complex-functions-infinitely-differentiable/" target="_blank" rel="noopener"
&gt;cherish analytic functions&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://www.johndcook.com/blog/2009/11/07/creativity-and-criticism/" target="_blank" rel="noopener"
&gt;be slow to say &amp;ldquo;they have too much time on their hands&amp;rdquo;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://www.johndcook.com/blog/2011/02/28/programmers-without-computers/" target="_blank" rel="noopener"
&gt;second guess the need to always be in front of a screen to be a programmer&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://www.johndcook.com/blog/2011/09/27/bayesian-amazon/" target="_blank" rel="noopener"
&gt;use Bayesian methods to better order search results of customer-rated items&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://www.johndcook.com/blog/2011/11/21/career-advice-regarding-tools/" target="_blank" rel="noopener"
&gt;consider the benefits of portable, open-source tools&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://www.johndcook.com/blog/2009/02/18/the-data-may-not-contain-the-answer/" target="_blank" rel="noopener"
&gt;have humility when analyzing data&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://www.johndcook.com/blog/2013/01/12/elementary-statistics-book/" target="_blank" rel="noopener"
&gt;be skeptical of “elementary” statistical books&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://www.johndcook.com/blog/2008/02/23/everything-begins-with-p/" target="_blank" rel="noopener"
&gt;not feel dumb for being frustrated with statistics notation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://www.johndcook.com/blog/2009/11/10/oftware-archeology/" target="_blank" rel="noopener"
&gt;think carefully about how to &lt;em&gt;read&lt;/em&gt; code&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://www.johndcook.com/blog/2009/10/08/nasa-buggy-software/" target="_blank" rel="noopener"
&gt;reevaluate what it would take for software to be bug free&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://www.johndcook.com/blog/2015/06/18/most-important-skill-in-software/" target="_blank" rel="noopener"
&gt;appreciate the massive importance of organization in software projects&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://www.johndcook.com/blog/2016/05/15/bring-out-your-equations/" target="_blank" rel="noopener"
&gt;understand the real challenges of effectively applying mathematics&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://www.johndcook.com/blog/2012/08/27/work-or-rest/" target="_blank" rel="noopener"
&gt;wonder if time-spent-at-desk is a good proxy for getting work done&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://www.johndcook.com/blog/2009/06/04/software-challenges/" target="_blank" rel="noopener"
&gt;better understand why software projects fail&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://www.johndcook.com/blog/2009/12/23/why-programmers-are-not-paid-in-proportion-to-their-productivity/" target="_blank" rel="noopener"
&gt;realize that it’s sometimes better not to produce more code&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://www.johndcook.com/blog/2015/12/22/automate-to-save-mental-energy-not-time/" target="_blank" rel="noopener"
&gt;realize automating might help me more by saving &lt;em&gt;mental energy&lt;/em&gt; than by saving &lt;em&gt;time&lt;/em&gt;&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;John earned his Ph.D. in applied math at UT Austin in 1992 with a dissertation
on partial differential equations entitled &lt;a class="link" href="http://catalog.lib.utexas.edu/search~S2?/acook/acook/1%2C109%2C119%2CB/frameset&amp;amp;FF=acook&amp;#43;john&amp;#43;douglas&amp;#43;&amp;#43;&amp;#43;&amp;#43;&amp;#43;1966&amp;amp;1%2C1%2C" target="_blank" rel="noopener"
&gt;Diffusion models with
microstructure and secondary
flux&lt;/a&gt;.
In the years since, he has worked as a math professor at Vanderbilt, a
software developer, a research statistician at M.D. Anderson Cancer Center,
and, since 2013, an independent consultant in statistics, mathematics, and
&lt;a class="link" href="https://www.johndcook.com/blog/expert-hipaa-deidentification/" target="_blank" rel="noopener"
&gt;HIPAA expert determination&lt;/a&gt;. His breadth of career experience reflects the breadth of
interests that makes his blog what it is.&lt;/p&gt;
&lt;p&gt;Though John has (wisely) avoided labeling himself a data scientist (preferring
&lt;a class="link" href="https://www.johndcook.com/veryappliedmath.html" target="_blank" rel="noopener"
&gt;very applied mathematician&lt;/a&gt;),
his writings have shaped every aspect of my trajectory from being an operations
research grad student to a practicing data scientist. John is an &lt;a class="link" href="https://www.johndcook.com/blog/bayesian-consulting/" target="_blank" rel="noopener"
&gt;expert in
Bayesian statistics&lt;/a&gt;, yet
has &lt;a class="link" href="https://www.johndcook.com/blog/2018/01/01/making-sense-of-a-probability-problem-in-the-wsj/" target="_blank" rel="noopener"
&gt;taught me much humility&lt;/a&gt; about the power of statistics. John is an &lt;a class="link" href="https://www.johndcook.com/blog/stand_alone_code/" target="_blank" rel="noopener"
&gt;expert programmer&lt;/a&gt;, yet
&lt;a class="link" href="https://www.johndcook.com/blog/2014/10/02/the-mistakes-of-others/" target="_blank" rel="noopener"
&gt;points me back to the importance of soft
skills&lt;/a&gt; for
the success of software projects. John &lt;a class="link" href="https://www.johndcook.com/blog/2017/12/12/efficiency-is-not-associative-for-matrix-multiplication/" target="_blank" rel="noopener"
&gt;loves applications of mathematics&lt;/a&gt;, yet reminds me to step back and see
the &lt;a class="link" href="https://www.johndcook.com/blog/2018/01/04/new-prime-number-record-50th-mersenne-prime/" target="_blank" rel="noopener"
&gt;intrinsic beauty of mathematics&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The most common tags on his posts are
&lt;a class="link" href="https://www.johndcook.com/blog/tag/bayesian/" target="_blank" rel="noopener"
&gt;bayesian&lt;/a&gt;,
&lt;a class="link" href="https://www.johndcook.com/blog/tag/creativity/" target="_blank" rel="noopener"
&gt;creativity&lt;/a&gt;, &lt;a class="link" href="https://www.johndcook.com/blog/tag/differential-equations/" target="_blank" rel="noopener"
&gt;differential
equations&lt;/a&gt;,
&lt;a class="link" href="https://www.johndcook.com/blog/tag/education/" target="_blank" rel="noopener"
&gt;education&lt;/a&gt;,
&lt;a class="link" href="https://www.johndcook.com/blog/tag/history/" target="_blank" rel="noopener"
&gt;history&lt;/a&gt;,
&lt;a class="link" href="https://www.johndcook.com/blog/tag/latex/" target="_blank" rel="noopener"
&gt;latex&lt;/a&gt;,
&lt;a class="link" href="https://www.johndcook.com/blog/tag/math/" target="_blank" rel="noopener"
&gt;math&lt;/a&gt;,
&lt;a class="link" href="https://www.johndcook.com/blog/tag/networks/" target="_blank" rel="noopener"
&gt;networks&lt;/a&gt;, &lt;a class="link" href="https://www.johndcook.com/blog/tag/number-theory/" target="_blank" rel="noopener"
&gt;number
theory&lt;/a&gt;, &lt;a class="link" href="https://www.johndcook.com/blog/tag/probability-and-statistics/" target="_blank" rel="noopener"
&gt;probability and
statistics&lt;/a&gt;,
&lt;a class="link" href="https://www.johndcook.com/blog/tag/productivity/" target="_blank" rel="noopener"
&gt;productivity&lt;/a&gt;,
&lt;a class="link" href="https://www.johndcook.com/blog/tag/programming/" target="_blank" rel="noopener"
&gt;programming&lt;/a&gt;,
&lt;a class="link" href="https://www.johndcook.com/blog/tag/python/" target="_blank" rel="noopener"
&gt;python&lt;/a&gt;,
&lt;a class="link" href="https://www.johndcook.com/blog/tag/quotes/" target="_blank" rel="noopener"
&gt;quotes&lt;/a&gt;,
&lt;a class="link" href="https://www.johndcook.com/blog/tag/scipy/" target="_blank" rel="noopener"
&gt;scipy&lt;/a&gt;, and &lt;a class="link" href="https://www.johndcook.com/blog/tag/special-functions/" target="_blank" rel="noopener"
&gt;special
functions&lt;/a&gt;. He also
brings a unique approach in his posts on
&lt;a class="link" href="https://www.johndcook.com/blog/tag/music/" target="_blank" rel="noopener"
&gt;music&lt;/a&gt; and has a neat collection
of &lt;a class="link" href="https://www.johndcook.com/blog/tag/interview/" target="_blank" rel="noopener"
&gt;interviews&lt;/a&gt; he&amp;rsquo;s done with
&lt;a class="link" href="https://www.johndcook.com/blog/2010/05/26/fred-brooks-interview/" target="_blank" rel="noopener"
&gt;Fred Brooks&lt;/a&gt;, &lt;a class="link" href="https://www.johndcook.com/blog/2013/09/24/interview-with-sir-michael-atiyah/" target="_blank" rel="noopener"
&gt;Michael Atiyah&lt;/a&gt;, and others.&lt;/p&gt;
&lt;p&gt;John’s blog is not his only educational outreach. He maintains &lt;a class="link" href="https://www.johndcook.com/blog/twitter_page/" target="_blank" rel="noopener"
&gt;17 (!) Twitter
accounts&lt;/a&gt; where he teaches about
technical topics in &lt;del&gt;140&lt;/del&gt; 280 characters or less. I&amp;rsquo;m always learning little
tidbits from &lt;a class="link" href="https://twitter.com/UnixToolTip" target="_blank" rel="noopener"
&gt;@UnixToolTip&lt;/a&gt; and find my
schooling refreshed by &lt;a class="link" href="https://twitter.com/AlgebraFact" target="_blank" rel="noopener"
&gt;@AlgebraFact&lt;/a&gt; and
&lt;a class="link" href="https://twitter.com/AnalysisFact" target="_blank" rel="noopener"
&gt;@AnalysisFact&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;One of John’s areas of expertise is implementing numerical algorithms in code.
On his website, he has a &lt;a class="link" href="https://www.johndcook.com/blog/notes/" target="_blank" rel="noopener"
&gt;handful of technical
notes&lt;/a&gt; on this that many people have
found invaluable. Two years ago, I did &lt;a class="link" href="https://tdhopper.com/blog/mentions-of-john-cook-on-github/" &gt;a quick
survey&lt;/a&gt; of John’s
articles mentioned in Github repos. Hundreds of repos cited his &lt;a class="link" href="https://www.johndcook.com/blog/standard_deviation/" target="_blank" rel="noopener"
&gt;Accurately
computing running
variance&lt;/a&gt; article. No
doubt countless more have used his resources without citing him!&lt;/p&gt;
&lt;p&gt;If you don&amp;rsquo;t follow John&amp;rsquo;s blog, you should &lt;a class="link" href="https://www.johndcook.com/blog/ways-to-subscribe/" target="_blank" rel="noopener"
&gt;subscribe
today&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Here&amp;rsquo;s to 10 more years, John!&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Congrats to &lt;a class="link" href="https://twitter.com/JohnDCook?ref_src=twsrc%5Etfw" target="_blank" rel="noopener"
&gt;@JohnDCook&lt;/a&gt;
on 10 years of blogging!&lt;a class="link" href="https://t.co/DdMBjI89eV" target="_blank" rel="noopener"
&gt;https://t.co/DdMBjI89eV&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;— Tim Hopper 🆘 (@tdhopper) &lt;a class="link" href="https://twitter.com/tdhopper/status/950556877246615552?ref_src=twsrc%5Etfw" target="_blank" rel="noopener"
&gt;January 9,
2018&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;</description></item><item><title>An Infosec Limerick</title><link>https://tdhopper.com/blog/an-infosec-limerick/</link><pubDate>Thu, 21 Dec 2017 12:00:00 +0000</pubDate><guid>https://tdhopper.com/blog/an-infosec-limerick/</guid><description>&lt;img src="https://tdhopper.com/images/leakybucket.png" alt="Featured image of post An Infosec Limerick" /&gt;&lt;p&gt;There once was a man from Nantucket&lt;br&gt;
Who was careful with his data, where he stuck it&lt;br&gt;
His passwords always impress&lt;br&gt;
He checked for HTTPS&lt;br&gt;
But somebody still put his PII in a public S3 bucket&lt;/p&gt;</description></item><item><title>Neal Ford on Companies Ignoring Developer Costs</title><link>https://tdhopper.com/blog/neal-ford-on-companies-ignoring-developer-costs/</link><pubDate>Mon, 16 Oct 2017 00:00:00 +0000</pubDate><guid>https://tdhopper.com/blog/neal-ford-on-companies-ignoring-developer-costs/</guid><description>&lt;blockquote&gt;
&lt;p&gt;The downside of this is you have to have two production stacks&amp;hellip; which is kind of expensive, but it&amp;rsquo;s dirt cheap compared to the amount of money people spend on developers hacking away at things that are making them crazy. This is one of those things that drives me crazy about companies&amp;hellip; We will gladly spend a million dollars in overtime in developers to sit and work on our antiquated infrastructure, but we won&amp;rsquo;t spend a thousand dollars to buy a new machine to make them not have to do that.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;— Neal Ford, &lt;a class="link" href="http://shop.oreilly.com/product/0636920020271.do" target="_blank" rel="noopener"
&gt;&amp;ldquo;Automation&amp;rdquo;, &lt;em&gt;Neal Ford on Agile Engineering Practices&lt;/em&gt;&lt;/a&gt;&lt;/p&gt;</description></item><item><title>How I Became a Data Scientist Despite Having Been a Math Major: The Talk</title><link>https://tdhopper.com/talks/how-i-became-a-data-scientist-the-talk/</link><pubDate>Fri, 06 Oct 2017 00:00:00 +0000</pubDate><guid>https://tdhopper.com/talks/how-i-became-a-data-scientist-the-talk/</guid><description>&lt;img src="https://tdhopper.com/preso.png" alt="Featured image of post How I Became a Data Scientist Despite Having Been a Math Major: The Talk" /&gt;&lt;p&gt;A few years ago, I wrote a post called &lt;a class="link" href="https://tdhopper.com/how/" &gt;How I Became a Data Scientist Despite Having Been a Math Major&lt;/a&gt;. I recently gave a talk on the same topic at the &lt;a class="link" href="https://www.thisismetis.com/demystifying-data-science" target="_blank" rel="noopener"
&gt;Demystify Data Science Conference&lt;/a&gt;.&lt;/p&gt;
&lt;div class="video-wrapper"&gt;
&lt;iframe loading="lazy"
src="https://www.youtube.com/embed/Q_UrZ4pmKmI"
allowfullscreen
title="YouTube Video"
&gt;
&lt;/iframe&gt;
&lt;/div&gt;</description></item><item><title>Zen of Git</title><link>https://tdhopper.com/blog/zen-of-git/</link><pubDate>Sat, 30 Sep 2017 00:26:00 +0000</pubDate><guid>https://tdhopper.com/blog/zen-of-git/</guid><description>&lt;img src="https://tdhopper.com/images/zen.png" alt="Featured image of post Zen of Git" /&gt;&lt;pre&gt;&lt;code&gt;$ git checkout this
Zen of Git
Ugly is better than beautiful.
Explicit is better than implicit.
Complex is better than simple.
Complicated is better than complex.
Flat is better than nested.
Readability is meaningless.
Special cases are everything.
Errors should never be comprehensible.
In the face of ambiguity, copy and paste from Stack Overflow.
There should be no obvious way to do it.
Although there may be endless non obvious ways to do it.
If the documentation is hard to understand, it's a great idea.
If the documentation is easy to understand, it's probably for another tool.
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;(With apologies to &lt;a class="link" href="https://www.python.org/dev/peps/pep-0020/" target="_blank" rel="noopener"
&gt;Tim Peters&lt;/a&gt;.)&lt;/p&gt;</description></item><item><title>Python Plotting for Exploratory Data Analysis</title><link>https://tdhopper.com/blog/python-plotting-for-exploratory-data-analysis/</link><pubDate>Mon, 26 Jun 2017 00:00:00 +0000</pubDate><guid>https://tdhopper.com/blog/python-plotting-for-exploratory-data-analysis/</guid><description>&lt;p&gt;Plotting is an essential component of data analysis. As a data scientist, I spend a significant amount of my time making simple plots to understand complex data sets (exploratory data analysis) and help others understand them (presentations).&lt;/p&gt;
&lt;p&gt;In particular, I make a lot of bar charts (including histograms), line plots (including time series), scatter plots, and density plots from data in Pandas data frames. I often want to facet these on various categorical variables and layer them on a common grid.&lt;/p&gt;
&lt;p&gt;To that end, I made &lt;a class="link" href="http://pythonplot.com/" target="_blank" rel="noopener"
&gt;pythonplot.com&lt;/a&gt;, a brief introduction to Python plotting libraries and a &amp;ldquo;rosetta stone&amp;rdquo; comparing how to use them. I also included a comparison to &lt;a class="link" href="https://en.wikipedia.org/wiki/Ggplot2" title="Wikipedia Entry: ggplot2 - Wikipedia"
target="_blank" rel="noopener"
&gt;ggplot2&lt;/a&gt;, the R plotting library that I and many others consider a gold standard.&lt;/p&gt;</description></item><item><title>Parallelizing a Python Function for the Extremely Lazy</title><link>https://tdhopper.com/blog/parallelizing-a-python-function-for-the-extremely-lazy/</link><pubDate>Wed, 07 Jun 2017 00:00:00 +0000</pubDate><guid>https://tdhopper.com/blog/parallelizing-a-python-function-for-the-extremely-lazy/</guid><description>&lt;p&gt;Do you ever want to be able to run a Python function in parallel on a set of inputs? Have you ever gotten frustrated with the GIL, the multiprocessing library, or joblib?&lt;/p&gt;
&lt;p&gt;Try this:&lt;/p&gt;
&lt;h2 id="install-python-fire-to-run-your-command-from-the-command-line"&gt;Install Python Fire to run your command from the command line
&lt;/h2&gt;&lt;p&gt;Install &lt;a class="link" href="https://github.com/google/python-fire" target="_blank" rel="noopener"
&gt;Python Fire&lt;/a&gt; with &lt;code&gt;$ pip install fire&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Add this snippet to the bottom of your file:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt;1
&lt;/span&gt;&lt;span class="lnt"&gt;2
&lt;/span&gt;&lt;span class="lnt"&gt;3
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-fallback" data-lang="fallback"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;if __name__ == &amp;#39;__main__&amp;#39;:
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; import fire
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; fire.Fire()
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;h2 id="install-gnu-parallel"&gt;Install GNU Parallel
&lt;/h2&gt;&lt;p&gt;&lt;code&gt;$ brew install parallel&lt;/code&gt; or &lt;code&gt;$ sudo apt-get install parallel&lt;/code&gt; may work for you. Otherwise, see &lt;a class="link" href="https://www.gnu.org/software/parallel/" target="_blank" rel="noopener"
&gt;the GNU Parallel site&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id="run-your-function-from-the-command-line"&gt;Run your function from the command line
&lt;/h2&gt;&lt;p&gt;&lt;code&gt;$ parallel -j3 &amp;quot;python python_file.py function_name {1} &amp;quot; ::: input1 input2 input3 input4 input5&lt;/code&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;parallel&lt;/code&gt; is the command for GNU Parallel.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;-j3&lt;/code&gt; tells Parallel to run at most 3 processes at once.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;{1}&lt;/code&gt; fills in each item after the &lt;code&gt;:::&lt;/code&gt; as an argument to &lt;code&gt;function_name&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
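&lt;p&gt;(As an aside, going by GNU Parallel&amp;rsquo;s documented defaults rather than anything specific to this example: &lt;code&gt;{}&lt;/code&gt; is the default placeholder, and omitting &lt;code&gt;-j&lt;/code&gt; runs one job per CPU core, so a variant that uses every core might look like this.)&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ parallel &amp;quot;python python_file.py function_name {}&amp;quot; ::: input1 input2 input3 input4 input5
&lt;/code&gt;&lt;/pre&gt;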
&lt;h2 id="for-example"&gt;For example
&lt;/h2&gt;&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt; 1
&lt;/span&gt;&lt;span class="lnt"&gt; 2
&lt;/span&gt;&lt;span class="lnt"&gt; 3
&lt;/span&gt;&lt;span class="lnt"&gt; 4
&lt;/span&gt;&lt;span class="lnt"&gt; 5
&lt;/span&gt;&lt;span class="lnt"&gt; 6
&lt;/span&gt;&lt;span class="lnt"&gt; 7
&lt;/span&gt;&lt;span class="lnt"&gt; 8
&lt;/span&gt;&lt;span class="lnt"&gt; 9
&lt;/span&gt;&lt;span class="lnt"&gt;10
&lt;/span&gt;&lt;span class="lnt"&gt;11
&lt;/span&gt;&lt;span class="lnt"&gt;12
&lt;/span&gt;&lt;span class="lnt"&gt;13
&lt;/span&gt;&lt;span class="lnt"&gt;14
&lt;/span&gt;&lt;span class="lnt"&gt;15
&lt;/span&gt;&lt;span class="lnt"&gt;16
&lt;/span&gt;&lt;span class="lnt"&gt;17
&lt;/span&gt;&lt;span class="lnt"&gt;18
&lt;/span&gt;&lt;span class="lnt"&gt;19
&lt;/span&gt;&lt;span class="lnt"&gt;20
&lt;/span&gt;&lt;span class="lnt"&gt;21
&lt;/span&gt;&lt;span class="lnt"&gt;22
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-text" data-lang="text"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;(lazy) ~ $ cat python_file.py
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;from time import sleep
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;def function_name(arg1):
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; print(&amp;#34;Starting to run with&amp;#34;, arg1)
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; sleep(2)
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; print(&amp;#34;Finishing to run with&amp;#34;, arg1)
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;if __name__ == &amp;#39;__main__&amp;#39;:
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; import fire
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; fire.Fire()
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;(lazy) ~ $ parallel -j3 --lb &amp;#34;python -u python_file.py function_name {1} &amp;#34; ::: input1 input2 input3 input4 input5
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;Starting to run with input2
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;Starting to run with input1
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;Starting to run with input3
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;Finishing to run with input2
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;Finishing to run with input1
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;Finishing to run with input3
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;Starting to run with input4
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;Starting to run with input5
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;Finishing to run with input4
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;Finishing to run with input5
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;I added &lt;code&gt;--lb&lt;/code&gt; and &lt;code&gt;-u&lt;/code&gt; to keep Python and Parallel from buffering the output so you can see it being run in parallel.&lt;/p&gt;</description></item><item><title>CondaHTTPError: HTTP 401 UNAUTHORIZED for url</title><link>https://tdhopper.com/blog/condahttperror-http-401-unauthorized-for-url/</link><pubDate>Tue, 06 Jun 2017 00:00:00 +0000</pubDate><guid>https://tdhopper.com/blog/condahttperror-http-401-unauthorized-for-url/</guid><description>&lt;p&gt;I was getting this message when I tried to install packages from &lt;a class="link" href="https://conda-forge.github.io/" title="conda-forge | community driven packaging for conda"
target="_blank" rel="noopener"
&gt;conda-forge&lt;/a&gt; with &lt;a class="link" href="https://conda.io/docs/intro.html" title="Intro to conda &amp;amp;mdash; Conda documentation"
target="_blank" rel="noopener"
&gt;Conda&lt;/a&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt; 1
&lt;/span&gt;&lt;span class="lnt"&gt; 2
&lt;/span&gt;&lt;span class="lnt"&gt; 3
&lt;/span&gt;&lt;span class="lnt"&gt; 4
&lt;/span&gt;&lt;span class="lnt"&gt; 5
&lt;/span&gt;&lt;span class="lnt"&gt; 6
&lt;/span&gt;&lt;span class="lnt"&gt; 7
&lt;/span&gt;&lt;span class="lnt"&gt; 8
&lt;/span&gt;&lt;span class="lnt"&gt; 9
&lt;/span&gt;&lt;span class="lnt"&gt;10
&lt;/span&gt;&lt;span class="lnt"&gt;11
&lt;/span&gt;&lt;span class="lnt"&gt;12
&lt;/span&gt;&lt;span class="lnt"&gt;13
&lt;/span&gt;&lt;span class="lnt"&gt;14
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-go" data-lang="go"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nx"&gt;Fetching&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;package&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;metadata&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nx"&gt;CondaHTTPError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;HTTP&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;401&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;UNAUTHORIZED&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;https&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="cp"&gt;//conda.anaconda.org/conda-forge/osx-64/repodata.json&amp;gt;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nx"&gt;Elapsed&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mo"&gt;00&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;00.920954&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nx"&gt;CF&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nx"&gt;RAY&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;36&lt;/span&gt;&lt;span class="nx"&gt;ad7cbd5d1c23d8&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nx"&gt;IAD&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nx"&gt;The&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;remote&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;server&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;has&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;indicated&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;you&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;are&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;using&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;invalid&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;credentials&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;this&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;channel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nx"&gt;If&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;the&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;remote&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;site&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;is&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;anaconda&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;org&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;or&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;follows&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;the&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;Anaconda&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;Server&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;API&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;you&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nx"&gt;will&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;need&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;to&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;a&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;remove&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;the&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;invalid&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;token&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;your&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;system&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;with&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;`anaconda logout`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;optionally&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;followed&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;by&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;collecting&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;a&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;new&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;token&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;with&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;`anaconda login`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;or&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;provide&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;conda&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;with&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;a&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;valid&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;token&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;directly&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nx"&gt;Further&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;configuration&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;help&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;can&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;be&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;found&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;at&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;https&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="cp"&gt;//conda.io/docs/config.html&amp;gt;.&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;I tried to do &lt;code&gt;$ anaconda logout&lt;/code&gt; but didn&amp;rsquo;t have a program called &lt;code&gt;anaconda&lt;/code&gt; installed.&lt;/p&gt;
&lt;p&gt;You can install the &lt;a class="link" href="https://docs.continuum.io/anaconda-cloud/user-guide/getting-started#cloud-install-client" target="_blank" rel="noopener"
&gt;Anaconda Cloud Client&lt;/a&gt; with &lt;code&gt;$ conda install anaconda-client&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;After that, I was able to do &lt;code&gt;$ anaconda logout&lt;/code&gt; followed by &lt;code&gt;$ anaconda login&lt;/code&gt; where I used my old Binstar credentials (now anaconda.org).&lt;/p&gt;
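&lt;p&gt;(In short, the whole fix was three commands, assuming, as in my case, that a stale token was the problem:)&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ conda install anaconda-client
$ anaconda logout
$ anaconda login
&lt;/code&gt;&lt;/pre&gt;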
&lt;p&gt;&lt;a class="link" href="https://github.com/conda-forge/ulmo-feedstock/issues/5" target="_blank" rel="noopener"
&gt;I&amp;rsquo;m not the only one having this problem.&lt;/a&gt;&lt;/p&gt;</description></item><item><title>Choice of the Name Dynamic Programming</title><link>https://tdhopper.com/blog/choice-of-the-name-dynamic-programming/</link><pubDate>Fri, 26 May 2017 00:00:00 +0000</pubDate><guid>https://tdhopper.com/blog/choice-of-the-name-dynamic-programming/</guid><description>&lt;p&gt;&lt;a class="link" href="http://smo.sogang.ac.kr/doc/dy_birth.pdf" target="_blank" rel="noopener"
&gt;Richard Bellman quoted by Stuart Dreyfus&lt;/a&gt; via &lt;a class="link" href="http://economics.gmu.edu/people/gjonesb" target="_blank" rel="noopener"
&gt;Garrett Jones&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;I spent the Fall quarter (of 1950) at RAND. My first task was to find a name for multistage decision processes. “An interesting question is, ‘Where did the name, dynamic programming, come from?’ The 1950s were not good years for mathematical research.&lt;/p&gt;
&lt;p&gt;We had a very interesting gentleman in Washington named Wilson. He was Secretary of Defense, and he actually had a pathological fear and hatred of the word, research. I’m not using the term lightly; I’m using it precisely. His face would suffuse, he would turn red, and he would get violent if people used the term, research, in his presence. You can imagine how he felt, then, about the term, mathematical.&lt;/p&gt;
&lt;p&gt;The RAND Corporation was employed by the Air Force, and the Air Force had Wilson as its boss, essentially. Hence, I felt I had to do something to shield Wilson and the Air Force from the fact that I was really doing mathematics inside the RAND Corporation. What title, what name, could I choose? In the first place I was interested in planning, in decision making, in thinking. But planning, is not a good word for various reasons.&lt;/p&gt;
&lt;p&gt;I decided therefore to use the word, ‘programming.’ I wanted to get across the idea that this was dynamic, this was multistage, this was time-varying—I thought, let’s kill two birds with one stone. Let’s take a word that has an absolutely precise meaning, namely dynamic, in the classical physical sense. It also has a very interesting property as an adjective, and that is it’s impossible to use the word, dynamic, in a pejorative sense. Try thinking of some combination that will possibly give it a pejorative meaning. It’s impossible.
Thus, I thought dynamic programming was a good name. It was something not even a Congressman could object to. So I used it as an umbrella for my activities.&lt;/p&gt;
&lt;/blockquote&gt;</description></item><item><title>Adversarial Learning: Stories of Degradation and Humiliation</title><link>https://tdhopper.com/blog/stories-of-degradation-and-humiliation/</link><pubDate>Mon, 22 May 2017 00:00:00 +0000</pubDate><guid>https://tdhopper.com/blog/stories-of-degradation-and-humiliation/</guid><description>&lt;img src="https://tdhopper.com/images/interview.png" alt="Featured image of post Adversarial Learning: Stories of Degradation and Humiliation" /&gt;&lt;p&gt;My friends &lt;a class="link" href="http://www.twitter.com/akm" target="_blank" rel="noopener"
&gt;Andrew&lt;/a&gt; and &lt;a class="link" href="http://www.twitter.com/joelgrus" target="_blank" rel="noopener"
&gt;Joel&lt;/a&gt; were kind enough to have me back on their podcast &lt;a class="link" href="http://adversariallearning.com/episode-10-stories-of-degradation-and-humiliation.html" target="_blank" rel="noopener"
&gt;Adversarial Learning&lt;/a&gt;. We shared our tales of bad data science interviews. Enjoy!&lt;/p&gt;
&lt;iframe style="border: none" src="//html5-player.libsyn.com/embed/episode/id/5378015/height/90/width/640/theme/custom/autonext/no/thumbnail/yes/autoplay/no/preload/no/no_addthis/no/direction/backward/render-playlist/no/custom-color/87A93A/" height="90" width="640" scrolling="no" allowfullscreen webkitallowfullscreen mozallowfullscreen oallowfullscreen msallowfullscreen&gt;&lt;/iframe&gt;</description></item><item><title>Build a Real Time Machine Learning System</title><link>https://tdhopper.com/talks/realtime-machine-learning/</link><pubDate>Wed, 17 May 2017 00:00:00 +0000</pubDate><guid>https://tdhopper.com/talks/realtime-machine-learning/</guid><description>&lt;img src="https://tdhopper.com/preso.png" alt="Featured image of post Build a Real Time Machine Learning System" /&gt;&lt;p&gt;I gave a talk at the Data Science Conference on on building a realtime machine learning system with Kafka, Streamparse, and Storm. You can see the video on &lt;a class="link" href="https://www.youtube.com/watch?v=inB33qWugMw" target="_blank" rel="noopener"
&gt;YouTube&lt;/a&gt;.&lt;/p&gt;
&lt;div class="video-wrapper"&gt;
&lt;iframe loading="lazy"
src="https://www.youtube.com/embed/inB33qWugMw"
allowfullscreen
title="YouTube Video"
&gt;
&lt;/iframe&gt;
&lt;/div&gt;</description></item><item><title>Like most great mathematicians, he expects universal precision</title><link>https://tdhopper.com/blog/like-most-great-mathematicians-he-expects-universal-precision/</link><pubDate>Wed, 26 Apr 2017 00:00:00 +0000</pubDate><guid>https://tdhopper.com/blog/like-most-great-mathematicians-he-expects-universal-precision/</guid><description>&lt;p&gt;From the &lt;a class="link" href="http://amzn.to/2qe0hRp" target="_blank" rel="noopener"
&gt;Autobiography of Benjamin Franklin&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;a class="link" href="https://en.wikipedia.org/wiki/Thomas_Godfrey_%28inventor%29" target="_blank" rel="noopener"
&gt;Thomas Godfrey&lt;/a&gt;, a self-taught mathematician, great in his way, and afterward inventor of what is now called Hadley&amp;rsquo;s Quadrant. But he knew little out of his way, and was not a pleasing companion; as, like most great mathematicians I have met with, he expected universal precision in every-thing said, or was for ever denying or distinguishing upon trifles, to the disturbance of all conversation.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I&amp;rsquo;m a recovering Godfrey Precisionist.&lt;/p&gt;</description></item><item><title>Metawork is more interesting than work</title><link>https://tdhopper.com/blog/metawork-is-more-interesting-than-work/</link><pubDate>Fri, 14 Apr 2017 00:00:00 +0000</pubDate><guid>https://tdhopper.com/blog/metawork-is-more-interesting-than-work/</guid><description>&lt;p&gt;&lt;a class="link" href="http://www.se-radio.net/2017/04/se-radio-episode-287-success-skills-for-architects-with-neil-ford/" target="_blank" rel="noopener"
&gt;This Software Engineering Radio interview&lt;/a&gt; with Neal Ford on Success Skills for Architects is full of gems about building effective software.&lt;/p&gt;
&lt;p&gt;He talks a lot about how coders love to solve problems, and that love can lead them to invent interesting, but unnecessary, problems to solve. This is true.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Metawork is more interesting than work. It&amp;rsquo;s so hard to get back to simplicity, because we love complicated little puzzles to solve, so we keep overengineering everything.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Anyone who&amp;rsquo;s developing software would benefit from listening.&lt;/p&gt;</description></item><item><title>Towards Reducing Distractions while Working</title><link>https://tdhopper.com/blog/towards-reducing-distractions-while-working/</link><pubDate>Thu, 13 Apr 2017 00:00:00 +0000</pubDate><guid>https://tdhopper.com/blog/towards-reducing-distractions-while-working/</guid><description>&lt;p&gt;Staying focused while working in front of a computer and within reach of a smartphone is hard.&lt;/p&gt;
&lt;p&gt;In 2017, teaching people to focus is becoming an &lt;a class="link" href="https://thefocuscourse.com/" target="_blank" rel="noopener"
&gt;industry&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;ve been trying to rethink distractions in my own life, particularly in my work environment. Here are some things that have helped:&lt;/p&gt;
&lt;h2 id="working-from-home"&gt;Working from Home
&lt;/h2&gt;&lt;p&gt;Working in an office, especially an open-floor-plan office, is disastrous for staying focused. DeMarco and Lister wrote about this in &lt;a class="link" href="http://amzn.to/2px5PGj" target="_blank" rel="noopener"
&gt;Peopleware&lt;/a&gt; 30 years ago, and yet open offices are the norm for startups today.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;m much more productive working from home in my quiet office or on my back patio. I&amp;rsquo;m finally able to spend my time thinking about hard problems rather than ways of silencing Constant Throat Clearer or Perpetual Annoying Laugher.&lt;/p&gt;
&lt;h2 id="notifications"&gt;Notifications
&lt;/h2&gt;&lt;p&gt;Every app and website these days wants to send you notifications. I&amp;rsquo;m aggressive about &lt;a class="link" href="http://www.makeuseof.com/tag/take-control-ios-notifications-iphone-ipad/" target="_blank" rel="noopener"
&gt;reducing notifications&lt;/a&gt; down to those that I &lt;em&gt;need&lt;/em&gt; to see, and I let almost nothing notify me with sound. I use Do Not Disturb mode on my phone and &lt;a class="link" href="https://support.apple.com/kb/PH18740?locale=en_US" target="_blank" rel="noopener"
&gt;Mac&lt;/a&gt; whenever I need to stop notifications altogether.&lt;/p&gt;
&lt;h2 id="slack"&gt;Slack
&lt;/h2&gt;&lt;p&gt;&lt;a class="link" href="https://slack.com/" target="_blank" rel="noopener"
&gt;Slack&lt;/a&gt; has become the new normal for company communication. Some would say Slack itself is ruining our focus, but having it regularly available has been essential for my own work.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;ve come up with a few ways to take control of Slack:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Only show &amp;ldquo;My unread, along with everything I&amp;rsquo;ve starred&amp;rdquo; in the sidebar. See Michael Lopp&amp;rsquo;s &lt;a class="link" href="http://randsinrepose.com/archives/how-i-slack/" target="_blank" rel="noopener"
&gt;excellent post on Slack&lt;/a&gt; for more here.&lt;/li&gt;
&lt;li&gt;Enable notifications selectively.&lt;/li&gt;
&lt;li&gt;Sign out of distracting avocational Slacks.&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id="social-media"&gt;Social Media
&lt;/h2&gt;&lt;p&gt;I&amp;rsquo;ve started using an app called &lt;a class="link" href="https://heyfocus.com/" target="_blank" rel="noopener"
&gt;Focus&lt;/a&gt; to block distracting websites (including Facebook and Twitter.com) and apps on my work computer from 9 AM to 5:30 PM. I use Focus&amp;rsquo;s scheduling feature so blocking isn&amp;rsquo;t optional for me.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;ve decided &lt;em&gt;not&lt;/em&gt; to block &lt;a class="link" href="https://tapbots.com/tweetbot/" target="_blank" rel="noopener"
&gt;Tweetbot&lt;/a&gt;. Though it can be distracting, Twitter is an invaluable way for me to learn from my professional colleagues, bounce ideas off of them, and have a good laugh.&lt;/p&gt;
&lt;p&gt;On my iPhone, iPad, and personal Laptop, I&amp;rsquo;ve started using &lt;a class="link" href="https://freedom.refersion.com/c/c6bad" target="_blank" rel="noopener"
&gt;Freedom&lt;/a&gt; to block all social media during the day. This has stopped me from instinctively checking Instagram every time I walk to the bathroom or get stuck on a hard problem. I highly recommend it.&lt;sup id="fnref:1"&gt;&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref"&gt;1&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;p&gt;I also use Freedom to block social media for the first hour I&amp;rsquo;m up in the morning and before I go to bed.&lt;/p&gt;
&lt;h2 id="email"&gt;Email
&lt;/h2&gt;&lt;p&gt;I have two main tactics to keep email from being distracting.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;I aggressively unsubscribe from mailing lists and ads.&lt;/li&gt;
&lt;li&gt;I use &lt;a class="link" href="https://sanebox.com/t/49nyh" target="_blank" rel="noopener"
&gt;Sanebox&lt;/a&gt; to filter low priority messages out of my inbox.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;When emails only need a brief reply, I tend to write responses as soon as possible. At the moment, I&amp;rsquo;m trying to break people of the expectation that I&amp;rsquo;ll respond quickly. Using services like &lt;a class="link" href="http://www.boomeranggmail.com/" target="_blank" rel="noopener"
&gt;Boomerang&lt;/a&gt;, which lets me write emails now and have them sent later, helps here.&lt;/p&gt;
&lt;h2 id="reading"&gt;Reading
&lt;/h2&gt;&lt;p&gt;Long-form reading at the computer is terrible for comprehension. As &lt;a class="link" href="http://www.econtalk.org/archives/2016/11/doug_lemov_on_r.html" target="_blank" rel="noopener"
&gt;Doug Lemov has argued&lt;/a&gt;, you have to get away from your computer &lt;strong&gt;and&lt;/strong&gt; other devices to read deeply. I do this by &lt;a class="link" href="http://amzn.to/2oDATHj" target="_blank" rel="noopener"
&gt;printing&lt;/a&gt; articles or reading on my iPad with &lt;a class="link" href="https://freedom.refersion.com/c/c6bad" target="_blank" rel="noopener"
&gt;Freedom blocking enabled&lt;/a&gt;. I take my printouts or iPad and walk away from my desk to read.&lt;/p&gt;
&lt;h2 id="todo-items"&gt;Todo Items
&lt;/h2&gt;&lt;p&gt;I&amp;rsquo;m a firm believer in the &lt;a class="link" href="http://amzn.to/2ovAgit" target="_blank" rel="noopener"
&gt;Getting Things Done&lt;/a&gt; principle of reducing the cognitive overhead of tracking to-do items in my head. I use Omnifocus for task management. &lt;a class="link" href="https://support.omnigroup.com/omnifocus-mail-drop/" target="_blank" rel="noopener"
&gt;Mail Drop&lt;/a&gt; and &lt;a class="link" href="http://bachyaproductions.com/better-omnifocus-task-entry/" target="_blank" rel="noopener"
&gt;this&lt;/a&gt; Alfred workflow help me to quickly add tasks to my Omnifocus inbox. When I think of something I need to take care of outside of work, I drop that thought into Omnifocus; this keeps those personal to-do items from distracting me while I&amp;rsquo;m working.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;Staying focused is hard. I&amp;rsquo;m still learning how to do it well, and I&amp;rsquo;m sure I&amp;rsquo;m not the only one struggling to improve here. If you have any tips to share, I&amp;rsquo;d love to hear them!&lt;/p&gt;
&lt;div class="footnotes" role="doc-endnotes"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:1"&gt;
&lt;p&gt;I can&amp;rsquo;t use Freedom on my work computer, because it acts as a VPN which conflicts with my work VPN.&amp;#160;&lt;a href="#fnref:1" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</description></item><item><title>Web Development and Design for the Backend Developer</title><link>https://tdhopper.com/blog/web-development-and-design-for-the-backend-developer/</link><pubDate>Fri, 31 Mar 2017 18:42:00 +0000</pubDate><guid>https://tdhopper.com/blog/web-development-and-design-for-the-backend-developer/</guid><description>&lt;p&gt;I&amp;rsquo;ve been tinkering with websites for nearly 20 years. My friend Hunter and I
were big into making terrible Angelfire sites as pre-teens. In high school, my
dad paid me to make him a webpage for his doctor&amp;rsquo;s office (I used Frontpage).
A year or two after that, I read Kevin Yank&amp;rsquo;s &amp;ldquo;Build Your Own Database Driven
Website Using PHP &amp;amp; MySQL&amp;rdquo; and hacked together a PHP back-end for a Lord of
the Rings fan site.&lt;/p&gt;
&lt;p&gt;In recent years, I&amp;rsquo;ve put together this blog,
&lt;a class="link" href="http://shouldigetaphd.com" target="_blank" rel="noopener"
&gt;shouldigetaphd.com&lt;/a&gt;, and a few other simple &lt;a class="link" href="https://shouldigetaphd.com/" target="_blank" rel="noopener"
&gt;web-based&lt;/a&gt; &lt;a class="link" href="https://pythonplot.com" target="_blank" rel="noopener"
&gt;side&lt;/a&gt;
&lt;a class="link" href="https://dp.tdhopper.com" target="_blank" rel="noopener"
&gt;projects&lt;/a&gt;. However, I haven&amp;rsquo;t kept up with modern
web development, and my projects have been hacked together from boilerplate or
templates. Though I&amp;rsquo;ve programmed professionally since 2011, I&amp;rsquo;ve spent
very little of that time writing anything close to graphical user interfaces.&lt;/p&gt;
&lt;p&gt;I have a number of other side projects that I&amp;rsquo;d like to do at some point, and
most of them would require some sort of graphical interface. While I could
work on app development, I think web-based implementations would be a great
starting place.&lt;/p&gt;
&lt;p&gt;A few months back, I decided to stop watching Netflix on the treadmill and
instead use those 45 minutes each morning to learn; in particular, I&amp;rsquo;ve been
trying to learn more about modern(ish) web design and development. My work has
a subscription to &lt;a class="link" href="http://safaribooksonline.com" target="_blank" rel="noopener"
&gt;Safari Books Online&lt;/a&gt; which
gives me access to copious technical books and video tutorials.&lt;/p&gt;
&lt;p&gt;The number of resources available on Safari (along with YouTube, blog posts,
etc) is astounding. I started many video tutorials on Safari that I quickly
realized weren&amp;rsquo;t going to be useful. Yet there are many gems to be found,
which I share here with you.&lt;/p&gt;
&lt;p&gt;What follows is an overview of the technologies I&amp;rsquo;ve realized I need to learn
more about and links to the resources I&amp;rsquo;ve found valuable in learning about
them. If you think there are gaps I haven&amp;rsquo;t yet filled or better resources
than I&amp;rsquo;ve listed below, I&amp;rsquo;d love your feedback.&lt;/p&gt;
&lt;h2 id="what-i-knew-going-in"&gt;What I Knew Going In
&lt;/h2&gt;&lt;p&gt;I&amp;rsquo;ve been a professional software developer and data scientist since 2012. I
mostly write Python, but I&amp;rsquo;ve programmed in a number of different languages.&lt;/p&gt;
&lt;p&gt;I have a pretty good grasp on how HTML and CSS work. I&amp;rsquo;ve used enough
Javascript over the years to be dangerous; I understood how it runs in the
browsers. I understand what a DOM is and how it relates to the page source.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;ve used the &lt;a class="link" href="http://flask.pocoo.org/" target="_blank" rel="noopener"
&gt;Python Flask web framework&lt;/a&gt; for
several projects. I understand how to respond to HTTP requests with
server-generated content. I had some idea of how to run my own web server on AWS.&lt;/p&gt;
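&lt;p&gt;To make &amp;ldquo;responding to HTTP requests with server-generated content&amp;rdquo; concrete without pulling in Flask, here is a minimal sketch using only Python&amp;rsquo;s standard library; the path and message are arbitrary choices for the example.&lt;/p&gt;

```python
# Minimal server-generated content, stdlib only (illustrative; Flask
# provides routing, templating, etc. on top of this same idea).
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Generate the response body on the server at request time.
        body = "Hello from path {}".format(self.path).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        pass  # keep the demo quiet

server = HTTPServer(("127.0.0.1", 0), Handler)  # port 0: pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()

url = "http://127.0.0.1:{}/demo".format(server.server_port)
response = urllib.request.urlopen(url).read().decode("utf-8")
server.shutdown()
print(response)  # Hello from path /demo
```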
&lt;p&gt;I&amp;rsquo;ve used &lt;a class="link" href="https://jekyllrb.com/" target="_blank" rel="noopener"
&gt;Jekyll&lt;/a&gt;, &lt;a class="link" href="http://gohugo.io/" target="_blank" rel="noopener"
&gt;Hugo&lt;/a&gt;, and
&lt;a class="link" href="http://docs.getpelican.com/" target="_blank" rel="noopener"
&gt;Pelican&lt;/a&gt; to create statically generated sites.&lt;/p&gt;
&lt;p&gt;I understood DNS at a high level, but never really learned what all the
different DNS record types were, and I didn&amp;rsquo;t understand why name server changes take
so long to propagate.&lt;/p&gt;
&lt;p&gt;I had some idea of what &lt;a class="link" href="https://nodejs.org/en/" title="Node.js"
target="_blank" rel="noopener"
&gt;node.js&lt;/a&gt; and
&lt;a class="link" href="https://www.npmjs.com/" title="npm"
target="_blank" rel="noopener"
&gt;npm&lt;/a&gt; are.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;m a committed Sublime Text user.&lt;/p&gt;
&lt;h2 id="a-meta-tutorial-on-web-development"&gt;A Meta Tutorial on Web Development
&lt;/h2&gt;&lt;p&gt;A great place to start is Andrew Montalenti&amp;rsquo;s lengthy tutorial on using
Python, Flask, Bootstrap, and Mongo to rapidly prototype a website. The
tutorial is out of date, but the principles still stand.&lt;/p&gt;
&lt;p&gt;Another great resource is Cody Lindley&amp;rsquo;s free &lt;a class="link" href="https://frontendmasters.com/books/front-end-handbook/" target="_blank" rel="noopener"
&gt;Front-End Developer&amp;rsquo;s
Handbook&lt;/a&gt;. This is a
substantial list meta-resource that organizes links for learning all angles of
front-end development. &amp;ldquo;It is specifically written with the intention of being
a professional resource for potential and currently practicing front-end
developers to equip themselves with learning materials and development tools.&amp;rdquo;&lt;/p&gt;
&lt;h3 id="chrome-developer-tools"&gt;Chrome Developer Tools
&lt;/h3&gt;&lt;p&gt;One of the most important tools for me in learning more about web development
has been the &lt;a class="link" href="https://developer.chrome.com/devtools" target="_blank" rel="noopener"
&gt;Chrome Developer Tools&lt;/a&gt;.
You can live edit the DOM elements and style sheets and watch how a website
changes. I&amp;rsquo;ve mostly learned Developer Tools through exploring it myself, but
there are lots of tutorials for it on Youtube.&lt;/p&gt;
&lt;h3 id="html-css-and-bootstrap"&gt;HTML, CSS, and Bootstrap
&lt;/h3&gt;&lt;p&gt;Many modern websites are
&lt;a class="link" href="https://en.wikipedia.org/wiki/Responsive_web_design" target="_blank" rel="noopener"
&gt;responsive&lt;/a&gt;: they
automatically adapt to various size screens and devices, from phones to
desktops. Writing responsive websites from scratch requires deep knowledge of
HTML, CSS, Javascript, and browsers. Unless you&amp;rsquo;re doing this professionally,
you probably don&amp;rsquo;t want to write a responsive site from scratch.&lt;/p&gt;
&lt;p&gt;For several projects, I&amp;rsquo;ve used the lightweight
&lt;a class="link" href="http://getskeleton.com/" target="_blank" rel="noopener"
&gt;Skeleton&lt;/a&gt; project to create simple, responsive
pages.&lt;/p&gt;
&lt;p&gt;Recently, I decided to dive deep into the more robust &lt;a class="link" href="http://getbootstrap.com/" target="_blank" rel="noopener"
&gt;Bootstrap
framework&lt;/a&gt; originally developed at Twitter.&lt;/p&gt;
&lt;p&gt;I watched Brock Nunn&amp;rsquo;s &lt;a class="link" href="http://shop.oreilly.com/product/110000475.do" target="_blank" rel="noopener"
&gt;Building a Responsive Website with Bootstrap
&lt;/a&gt;
(&lt;a class="link" href="https://www.safaribooksonline.com/library/view/building-a-responsive/9781782164982/" target="_blank" rel="noopener"
&gt;Safari&lt;/a&gt;), a two hour tutorial on getting started with
Bootstrap. The &lt;a class="link" href="http://getbootstrap.com/getting-started/" target="_blank" rel="noopener"
&gt;documentation for Bootstrap&lt;/a&gt; is clear (if terse) and worth reading through.&lt;/p&gt;
&lt;p&gt;Once you have a basic idea of how Bootstrap works, the best thing you can do
is start playing with it. Since I was familiar with the Pelican static site
generator, I decided to switch this blog to a Bootstrap theme, starting with
&lt;a class="link" href="https://github.com/getpelican/pelican-themes/tree/master/pelican-bootstrap3" target="_blank" rel="noopener"
&gt;pelican-bootstrap3&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;ve worked with Bootstrap 3 until now. Bootstrap 4 is about to come out.
&lt;a class="link" href="https://v4-alpha.getbootstrap.com/migration/" target="_blank" rel="noopener"
&gt;Bootstrap 4 moves the style sheets from LESS to SASS and adds Flexbox
functionality&lt;/a&gt;. Unless you
understand what those mean (more below), you&amp;rsquo;d be fine using version 3.&lt;/p&gt;
&lt;p&gt;I wanted to get a better grasp on CSS Selectors, so I read Eric Meyer&amp;rsquo;s brief
&lt;a class="link" href="https://www.amazon.com/Selectors-Specificity-Cascade-Applying-Documents/dp/1449342493/ref=as_li_ss_il?ie=UTF8&amp;amp;qid=1490966101&amp;amp;sr=8-3&amp;amp;linkCode=li2&amp;amp;tag=dothopper-20&amp;amp;linkId=260c71eee0317cf531ad099db4c8a816" target="_blank" rel="noopener"
&gt;Selectors, Specificity, and the Cascade: Applying CSS3 to Documents&lt;/a&gt;
(&lt;a class="link" href="https://www.safaribooksonline.com/library/view/selectors-specificity-and/9781449358334/" target="_blank" rel="noopener"
&gt;Safari&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;I watched Marty Hall&amp;rsquo;s &lt;a class="link" href="http://www.coreservlets.com/javascript-jquery-tutorial/" target="_blank" rel="noopener"
&gt;JavaScript, jQuery, and jQuery
UI&lt;/a&gt; tutorial. I was
able to skip big chunks where I already understood certain parts, but it
helped me fill in lots of gaps.&lt;/p&gt;
&lt;h3 id="advanced-stylesheets-less-sass-and-flexbox"&gt;Advanced Stylesheets (LESS, SASS, and Flexbox)
&lt;/h3&gt;&lt;p&gt;There are several alternatives to writing raw CSS. Two popular ones are
&lt;a class="link" href="http://lesscss.org/" target="_blank" rel="noopener"
&gt;Less&lt;/a&gt; and &lt;a class="link" href="http://sass-lang.com/" target="_blank" rel="noopener"
&gt;SASS&lt;/a&gt;. These
&amp;ldquo;preprocessors&amp;rdquo; allow you to write CSS-like stylesheets but with constructs
such as variables, nesting, inheritance, and mathematical operators.&lt;/p&gt;
&lt;p&gt;I found this &lt;a class="link" href="http://shop.oreilly.com/product/0636920051213.do" target="_blank" rel="noopener"
&gt;brief tutorial on
Less&lt;/a&gt;
(&lt;a class="link" href="https://www.safaribooksonline.com/library/view/learning-less/9781783989867/" target="_blank" rel="noopener"
&gt;Safari&lt;/a&gt;) helpful, and I&amp;rsquo;ve enjoyed Less a lot. I haven&amp;rsquo;t used
SASS yet, but it&amp;rsquo;s very similar. I&amp;rsquo;ll probably switch to SASS when I start
using Bootstrap 4.&lt;/p&gt;
&lt;p&gt;Another modern innovation is the &lt;a class="link" href="https://en.wikipedia.org/wiki/CSS_Flex_Box_Layout" target="_blank" rel="noopener"
&gt;Flexbox layout
model&lt;/a&gt; for CSS. Stone River
Learning has a great &lt;a class="link" href="http://stoneriverelearning.com/p/css-flexbox" target="_blank" rel="noopener"
&gt;tutorial&lt;/a&gt;
on Flexbox (&lt;a class="link" href="https://www.safaribooksonline.com/library/view/css-flexbox/100000006A0444/" target="_blank" rel="noopener"
&gt;Safari&lt;/a&gt;). It seems that Flexbox is the future of CSS-based
layouts, and it&amp;rsquo;s worth learning about.&lt;/p&gt;
&lt;h3 id="advanced-javascript-elm-react-angular-backbone-ember"&gt;Advanced JavaScript (Elm, React, Angular, Backbone, Ember)
&lt;/h3&gt;&lt;p&gt;The JavaScript web framework space has exploded. Many of these are
implementations of the &lt;a class="link" href="https://en.wikipedia.org/wiki/Model%E2%80%93view%E2%80%93controller" title="Model-view-controller - Wikipedia"
target="_blank" rel="noopener"
&gt;Model, View, Controller
pattern&lt;/a&gt;, including React, Angular, and Ember.
These tools allow the creation of complex web apps (as well as mobile apps).&lt;/p&gt;
&lt;h3 id="web-server-operations-and-dns"&gt;Web Server Operations and DNS
&lt;/h3&gt;&lt;p&gt;I learned a ton from &lt;a class="link" href="http://shop.oreilly.com/product/0636920006066.do" target="_blank" rel="noopener"
&gt;Linux Web
Operations&lt;/a&gt;
(&lt;a class="link" href="https://www.safaribooksonline.com/library/view/linux-web-operations/9780133927757/" target="_blank" rel="noopener"
&gt;Safari&lt;/a&gt;) by Ben Whaley. &amp;ldquo;The videos discuss the
relationship between web and application servers, load balancers, and
databases and introduce configuration management, monitoring, containers,
cryptography, and DNS.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;ve struggled with DNS configuration over the years, so I watched Cricket
Liu&amp;rsquo;s &lt;a class="link" href="http://shop.oreilly.com/product/0636920040088.do" target="_blank" rel="noopener"
&gt;Learning DNS&lt;/a&gt; series
(&lt;a class="link" href="https://www.safaribooksonline.com/library/view/learning-dns/9781771373692/" target="_blank" rel="noopener"
&gt;Safari&lt;/a&gt;). I still wouldn&amp;rsquo;t want to be responsible for a company&amp;rsquo;s
complex DNS infrastructure, but I can now configure my own site&amp;rsquo;s DNS with a
little more understanding.&lt;/p&gt;
&lt;h3 id="development-automation"&gt;Development Automation
&lt;/h3&gt;&lt;h4 id="package-managers"&gt;Package Managers
&lt;/h4&gt;&lt;p&gt;It&amp;rsquo;s likely that any modern web project will have some external Javascript
dependencies. Package managers (analogous to PyPI or Anaconda.org in Python)
have emerged to help support this. Node.js comes with the npm package manager,
but &lt;a class="link" href="https://bower.io/" target="_blank" rel="noopener"
&gt;Bower&lt;/a&gt; seems to make more sense for front-end
development. &lt;a class="link" href="http://codylindley.com/techpro/2013_04_12__package-managers-an-introducto/" target="_blank" rel="noopener"
&gt;Cody Lindley has a nice introduction to npm and Bower&lt;/a&gt;. Bower is &lt;a class="link" href="https://bower.io" target="_blank" rel="noopener"
&gt;well documented&lt;/a&gt; and easy to start
using. There is a nice &lt;a class="link" href="https://pypi.python.org/pypi/Flask-Bower/" target="_blank" rel="noopener"
&gt;Flask extension&lt;/a&gt; to help you integrate Bower with your Python project. (Update: since
writing this, &lt;a class="link" href="https://yarnpkg.com/en/" target="_blank" rel="noopener"
&gt;Yarn&lt;/a&gt; has come to dominate this
scene.)&lt;/p&gt;
&lt;h4 id="task-automation"&gt;Task Automation
&lt;/h4&gt;&lt;p&gt;Web development comes with lots of build-style tasks that have to happen
repeatedly. For example, before you can render a webpage in the browser, you
might need to convert the Less to CSS and start a local web server. Before
deploying to production, you might want to also run tests and minify your
Javascript.&lt;/p&gt;
&lt;p&gt;There&amp;rsquo;s a GUI application called &lt;a class="link" href="https://codekitapp.com/" target="_blank" rel="noopener"
&gt;Codekit&lt;/a&gt; that can
do a lot of these tasks. You can also do it through a Node.js program called
&lt;a class="link" href="https://gruntjs.com/" target="_blank" rel="noopener"
&gt;Grunt&lt;/a&gt;. I haven&amp;rsquo;t used it yet, but it looks like
following the documentation would be the best way to get started.&lt;/p&gt;
&lt;p&gt;&lt;a class="link" href="http://gulpjs.com/" target="_blank" rel="noopener"
&gt;Gulp&lt;/a&gt; is a popular alternative to Grunt.&lt;/p&gt;
&lt;h3 id="design"&gt;Design
&lt;/h3&gt;&lt;h4 id="visual-design"&gt;Visual Design
&lt;/h4&gt;&lt;p&gt;Design has never been my strong point. One way to compensate for that is to
rely on the work of others. There are copious &lt;a class="link" href="https://www.google.com/search?q=bootstrap&amp;#43;theme" target="_blank" rel="noopener"
&gt;Bootstrap themes
available&lt;/a&gt;, and some are even
&lt;a class="link" href="https://bootswatch.com/" target="_blank" rel="noopener"
&gt;free&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I enjoyed &lt;a class="link" href="https://softwareengineeringdaily.com/2016/02/18/design-non-designers-tracy-osborn/" target="_blank" rel="noopener"
&gt;Software Engineering Daily&amp;rsquo;s interview with Tracy Osborn on Design for Non-designers&lt;/a&gt;. She has some &lt;a class="link" href="https://medium.com/@limedaring/design-for-non-designers-part-1-6559ed93ff91" target="_blank" rel="noopener"
&gt;blog posts&lt;/a&gt; on the topic. Tracy recommends
&lt;a class="link" href="http://www.colourlovers.com/" target="_blank" rel="noopener"
&gt;COLOURLovers&lt;/a&gt; for color ideas and &lt;a class="link" href="http://fontpair.co/" target="_blank" rel="noopener"
&gt;Font Pair&lt;/a&gt; for selecting fonts from &lt;a class="link" href="https://fonts.google.com/about" target="_blank" rel="noopener"
&gt;Google Fonts&lt;/a&gt;.&lt;/p&gt;
&lt;h4 id="user-experience-design"&gt;User Experience Design
&lt;/h4&gt;&lt;p&gt;On the topic of UX, I finally read Steve Krug&amp;rsquo;s classic &lt;a class="link" href="http://amzn.to/2opg3ZG" target="_blank" rel="noopener"
&gt;Don&amp;rsquo;t Make Me Think&lt;/a&gt;
(&lt;a class="link" href="https://www.safaribooksonline.com/library/view/dont-make-me/0321344758/" target="_blank" rel="noopener"
&gt;Safari&lt;/a&gt;); it&amp;rsquo;s great. Ginny Redish&amp;rsquo;s &lt;a class="link" href="http://amzn.to/2oDl0NY" target="_blank" rel="noopener"
&gt;Letting Go of
Words&lt;/a&gt;
(&lt;a class="link" href="https://www.safaribooksonline.com/library/view/letting-go-of/9780123859303/" target="_blank" rel="noopener"
&gt;Safari&lt;/a&gt;) is similarly excellent.&lt;/p&gt;
&lt;h2 id="conclusion"&gt;Conclusion
&lt;/h2&gt;&lt;p&gt;I&amp;rsquo;ve learned a lot in the past few months. I&amp;rsquo;ve filled in some gaps about how
CSS works. I&amp;rsquo;ve gotten a better grasp on the Javascript prototype model. I&amp;rsquo;ve
learned that I can start with higher level tools (e.g. Bootstrap and JQuery)
to rapidly build my side projects with some amount of visual appeal. I&amp;rsquo;m
learning how to use available tools to reduce the boilerplate I have to write,
automate tedious tasks, and reduce my personal technical debt.&lt;/p&gt;
&lt;p&gt;I still have a lot of learning and a lot of practicing ahead of me, but I&amp;rsquo;m
starting to feel confident that I could make headway on some of my projects.
The modern &lt;a class="link" href="https://medium.com/@withinsight1/the-front-end-spectrum-c0f30998c9f0" target="_blank" rel="noopener"
&gt;frontend development landscape&lt;/a&gt; is massive, varied, and ever changing, but that
shouldn&amp;rsquo;t prohibit you from diving in if you want to.&lt;/p&gt;</description></item><item><title>Automating Python with Ansible</title><link>https://tdhopper.com/blog/automating-python-with-ansible/</link><pubDate>Thu, 23 Mar 2017 15:11:00 +0000</pubDate><guid>https://tdhopper.com/blog/automating-python-with-ansible/</guid><description>&lt;p&gt;I wrote &lt;a class="link" href="https://tdhopper.com/blog/data-scientists-need-more-automation/" &gt;a few months back&lt;/a&gt; about how data scientists need more automation. In
particular, I suggested that data scientists would be wise to learn more about
automated system configuration and automated deployments.&lt;/p&gt;
&lt;p&gt;In an attempt to take my own advice, I&amp;rsquo;ve finally been making myself learn
&lt;a class="link" href="https://www.ansible.com/" target="_blank" rel="noopener"
&gt;Ansible&lt;/a&gt;. It turns out that a great way to learn it
is to sit down and read through the docs, front to back; I commend that tactic
to you. I also put together this tutorial to walk through a practical example
of how a working data scientist might use this powerful tool.&lt;/p&gt;
&lt;p&gt;What follows is an Ansible guide that will take you from installing Ansible to
automatically deploying a long-running Python process to a remote machine and running
it in a &lt;a class="link" href="https://conda.io/docs/using/envs.html" target="_blank" rel="noopener"
&gt;Conda environment&lt;/a&gt; using
&lt;a class="link" href="http://supervisord.org/" target="_blank" rel="noopener"
&gt;supervisord&lt;/a&gt;. It presumes your development machine
is on OS X and the remote machine is Debian-like; however, it shouldn&amp;rsquo;t
require too many changes to run it on other systems.&lt;/p&gt;
&lt;p&gt;I wrote this post in a Jupyter notebook with a Bash kernel. You can find the
notebook, Ansible files, and installation directions on &lt;a class="link" href="https://github.com/tdhopper/automating_python" target="_blank" rel="noopener"
&gt;my
Github&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id="ansible"&gt;Ansible
&lt;/h2&gt;&lt;p&gt;Ansible provides &amp;ldquo;human readable automation&amp;rdquo; for &amp;ldquo;app deployment&amp;rdquo; and
&amp;ldquo;configuration management&amp;rdquo;. Unlike tools like Chef, it doesn&amp;rsquo;t require an
agent to be running on remote machines. In short, it translates declarative
YAML files into shell commands and runs them on your machines over SSH.&lt;/p&gt;
&lt;p&gt;Ansible is backed by Red Hat and has a great
&lt;a class="link" href="https://www.ansible.com/" target="_blank" rel="noopener"
&gt;website&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id="installing-ansible-with-homebrew"&gt;Installing Ansible with Homebrew
&lt;/h2&gt;&lt;p&gt;First, you&amp;rsquo;ll need to &lt;a class="link" href="http://docs.ansible.com/ansible/intro_installation.html" target="_blank" rel="noopener"
&gt;install
Ansible&lt;/a&gt;. On a Mac, I
recommend doing this with &lt;a class="link" href="https://brew.sh/" target="_blank" rel="noopener"
&gt;Homebrew&lt;/a&gt;.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;brew install ansible
Warning: ansible-2.1.0.0 already installed
Warning: You are using OS X 10.12.
We do not provide support for this pre-release version.
You may encounter build failures or other breakages.
&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id="quickstart"&gt;Quickstart
&lt;/h2&gt;&lt;p&gt;Soon, I&amp;rsquo;ll show you how to write an Ansible YAML file. However, Ansible
also allows you to specify tasks from the command line.&lt;/p&gt;
&lt;p&gt;Here&amp;rsquo;s how we could use Ansible to ping our local host:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;ansible -i 'localhost,' -c local -m ping all
localhost | SUCCESS =&amp;gt; {
    &amp;quot;changed&amp;quot;: false,
    &amp;quot;ping&amp;quot;: &amp;quot;pong&amp;quot;
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This command calls ansible and tells it:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;To use &lt;code&gt;localhost&lt;/code&gt; as its inventory (&lt;code&gt;-i&lt;/code&gt;). Inventory is Ansible-speak for the machine or machines you want to be able to run commands on.&lt;/li&gt;
&lt;li&gt;To connect (&lt;code&gt;-c&lt;/code&gt;) locally (&lt;code&gt;local&lt;/code&gt;) instead of over SSH.&lt;/li&gt;
&lt;li&gt;To run the &lt;a class="link" href="http://docs.ansible.com/ansible/ping_module.html" target="_blank" rel="noopener"
&gt;&lt;code&gt;ping&lt;/code&gt; module&lt;/a&gt; (&lt;code&gt;-m&lt;/code&gt;) to test the connection.&lt;/li&gt;
&lt;li&gt;To run the command on &lt;code&gt;all&lt;/code&gt; hosts in the inventory (in this case, our inventory is just the &lt;code&gt;localhost&lt;/code&gt;).&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;a class="link" href="https://web.archive.org/web/20170330023127/http://www.mechanicalfish.net/start-learning-ansible-with-one-line-and-no-files/" target="_blank" rel="noopener"
&gt;Michael Booth&lt;/a&gt; has a
&lt;a class="link" href="https://web.archive.org/web/20170330023127/http://www.mechanicalfish.net/start-learning-ansible-with-one-line-and-no-files/" target="_blank" rel="noopener"
&gt;post&lt;/a&gt; that goes into more detail about
this command.&lt;/p&gt;
&lt;p&gt;Behind the scenes, Ansible is turning this &lt;code&gt;-m ping&lt;/code&gt; command into shell
commands. (Try running with the &lt;code&gt;-vvv&lt;/code&gt; flag to see what&amp;rsquo;s happening behind the
scenes.) It can also execute arbitrary commands; by default, it&amp;rsquo;ll use the
Bourne shell &lt;code&gt;sh&lt;/code&gt;.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;ansible all -i 'localhost, ' -c local -a &amp;quot;/bin/echo hello&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id="setting-up-an-ansible-inventory"&gt;Setting up an Ansible Inventory
&lt;/h2&gt;&lt;p&gt;Instead of specifying our inventory with the &lt;code&gt;-i&lt;/code&gt; flag each time, we should
specify an Ansible inventory file. This file is a text file specifying
machines you have SSH access to; you can also group machines under bracketed
headings. For example:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;mail.example.com
[webservers]
foo.example.com
bar.example.com
[dbservers]
one.example.com
two.example.com
three.example.com
&lt;/code&gt;&lt;/pre&gt;
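&lt;p&gt;To make the inventory format concrete, here is a small illustrative Python function (not part of Ansible) that parses the bracketed-group layout above; the &amp;ldquo;ungrouped&amp;rdquo; bucket name is my own choice for this sketch.&lt;/p&gt;

```python
# Illustrative only: a tiny parser for Ansible's INI-style inventory layout,
# where hosts are grouped under bracketed headings. Hosts listed before any
# heading land in an implicit "ungrouped" bucket (a naming choice here).
def parse_inventory(text):
    groups = {"ungrouped": []}
    current = "ungrouped"
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        if line.startswith("[") and line.endswith("]"):
            current = line[1:-1]           # start a new group
            groups.setdefault(current, [])
        else:
            groups[current].append(line)   # host under the current group
    return groups

inventory = """mail.example.com
[webservers]
foo.example.com
bar.example.com
[dbservers]
one.example.com
two.example.com
three.example.com
"""
groups = parse_inventory(inventory)
print(groups["webservers"])  # ['foo.example.com', 'bar.example.com']
```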
&lt;p&gt;Ansible has to be able to connect to these machines over SSH, so you will
likely need to have relevant entries in your &lt;a class="link" href="http://nerderati.com/2011/03/17/simplify-your-life-with-an-ssh-config-file/" target="_blank" rel="noopener"
&gt;&lt;code&gt;.ssh/config&lt;/code&gt; file&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;By default, the Ansible CLI will look for a system-wide Ansible inventory file
in &lt;code&gt;/etc/ansible/hosts&lt;/code&gt;. You can also specify an alternative path for an
inventory file with the &lt;code&gt;-i&lt;/code&gt; flag.&lt;/p&gt;
&lt;p&gt;For this tutorial, I&amp;rsquo;d like to have an inventory file specific to the project
directory without having to specify it each time we call Ansible. We can do
this by creating a file called &lt;code&gt;./ansible.cfg&lt;/code&gt; and setting the name of our local
inventory file:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;cat ./ansible.cfg
[defaults]
inventory = ./hosts
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You can check that Ansible is picking up your config file by running &lt;code&gt;ansible --version&lt;/code&gt;.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;ansible --version
ansible 2.1.0.0
  config file = /Users/tdhopper/repos/automating_python/ansible.cfg
  configured module search path = Default w/o overrides
&lt;/code&gt;&lt;/pre&gt;
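&lt;p&gt;Since &lt;code&gt;ansible.cfg&lt;/code&gt; is ordinary INI, you can sanity-check its contents with Python&amp;rsquo;s &lt;code&gt;configparser&lt;/code&gt;; this is purely illustrative and not something Ansible requires.&lt;/p&gt;

```python
# Sanity-check an ansible.cfg-style file with the stdlib INI parser.
# The config text mirrors the ansible.cfg shown above.
import configparser

config_text = """[defaults]
inventory = ./hosts
"""

parser = configparser.ConfigParser()
parser.read_string(config_text)
print(parser["defaults"]["inventory"])  # ./hosts
```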
&lt;p&gt;For this example, I just have one host, a &lt;a class="link" href="https://www.digitalocean.com/" target="_blank" rel="noopener"
&gt;Digital Ocean
VPS&lt;/a&gt;. To run the examples below, you should
create a VPS instance on Digital Ocean, &lt;a class="link" href="https://amazonlightsail.com" target="_blank" rel="noopener"
&gt;Amazon&lt;/a&gt;,
or elsewhere; you&amp;rsquo;ll want to configure it for &lt;a class="link" href="https://www.digitalocean.com/community/tutorials/how-to-set-up-ssh-keys--2" target="_blank" rel="noopener"
&gt;passwordless authentication&lt;/a&gt;. I have an entry like this in my &lt;code&gt;~/.ssh/config&lt;/code&gt; file:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Host digitalocean
    HostName 45.55.395.23
    User root
    Port 22
    IdentityFile /Users/tdhopper/.ssh/id_rsa
    ForwardAgent yes
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;and my inventory file (&lt;code&gt;./hosts&lt;/code&gt;) is just&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;digitalocean
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Before trying ansible, you should ensure that you can connect to this host:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;ssh digitalocean echo 1
1
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now I can verify that Ansible can connect to my machine by running the ping
command.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;ansible all -m ping
digitalocean | SUCCESS =&amp;gt; {
    &amp;quot;changed&amp;quot;: false,
    &amp;quot;ping&amp;quot;: &amp;quot;pong&amp;quot;
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We told Ansible to run this command on &lt;code&gt;all&lt;/code&gt; specified hosts in the inventory.
It found our inventory by loading the &lt;code&gt;ansible.cfg&lt;/code&gt; which specified &lt;code&gt;./hosts&lt;/code&gt;
as the inventory file.&lt;/p&gt;
&lt;p&gt;It&amp;rsquo;s possible that this will fail for you even if you can SSH into the
machine. If the error is something like &lt;code&gt;/bin/sh: 1: /usr/bin/python: not found&lt;/code&gt;, this is because your VPS doesn&amp;rsquo;t have Python installed on it. You can
&lt;a class="link" href="http://stackoverflow.com/questions/32429259/ansible-fails-with-bin-sh-1-usr-bin-python-not-found" target="_blank" rel="noopener"
&gt;install it with Ansible&lt;/a&gt;, but you may just want to
manually run &lt;code&gt;sudo apt-get -y install python&lt;/code&gt; on the VPS to get started.&lt;/p&gt;
&lt;h2 id="writing-our-first-playbook"&gt;Writing our first Playbook
&lt;/h2&gt;&lt;p&gt;While ad-hoc commands will often be useful, the real power of Ansible comes
from creating repeatable sets of instructions called
&lt;a class="link" href="http://docs.ansible.com/ansible/playbooks.html" target="_blank" rel="noopener"
&gt;Playbooks&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;A playbook contains a list of &amp;ldquo;plays&amp;rdquo;. Each play specifies a set of tasks to
be run and which hosts to run them on. A &amp;ldquo;task&amp;rdquo; is a call to an Ansible
module, like the &amp;ldquo;ping&amp;rdquo; module we&amp;rsquo;ve already seen. Ansible &lt;a class="link" href="http://docs.ansible.com/ansible/list_of_all_modules.html" target="_blank" rel="noopener"
&gt;comes packaged
with about 1000
modules&lt;/a&gt; for all
sorts of use cases. You can also extend it with your own
&lt;a class="link" href="http://docs.ansible.com/ansible/dev_guide/developing_modules.html" target="_blank" rel="noopener"
&gt;modules&lt;/a&gt;
and &lt;a class="link" href="http://docs.ansible.com/ansible/playbooks_roles.html#roles" target="_blank" rel="noopener"
&gt;roles&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Our first playbook will just execute the ping module on all our hosts. It&amp;rsquo;s a
playbook with a single play comprised of a single task.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;cat ping.yml
---
- hosts: all
  tasks:
    - name: ping all hosts
      ping:
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We can run our playbook with the &lt;code&gt;ansible-playbook&lt;/code&gt; command.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;ansible-playbook ping.yml
 ____________
&amp;lt; PLAY [all] &amp;gt;
 ------------
        \   ^__^
         \  (oo)\_______
            (__)\       )\/\
                ||----w |
                ||     ||

 ______________
&amp;lt; TASK [setup] &amp;gt;
 --------------
        \   ^__^
         \  (oo)\_______
            (__)\       )\/\
                ||----w |
                ||     ||
ok: [digitalocean]

 _______________________
&amp;lt; TASK [ping all hosts] &amp;gt;
 -----------------------
        \   ^__^
         \  (oo)\_______
            (__)\       )\/\
                ||----w |
                ||     ||
ok: [digitalocean]

 ____________
&amp;lt; PLAY RECAP &amp;gt;
 ------------
        \   ^__^
         \  (oo)\_______
            (__)\       )\/\
                ||----w |
                ||     ||
digitalocean               : ok=2    changed=0    unreachable=0    failed=0
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You might wonder why there are cows on your screen. You can find out
&lt;a class="link" href="https://michaelheap.com/cowsay-and-ansible/" target="_blank" rel="noopener"
&gt;here&lt;/a&gt;. However, the important
thing is that our task was executed and returned successfully.&lt;/p&gt;
&lt;p&gt;We can override the hosts list for the play with the &lt;code&gt;-i&lt;/code&gt; flag to see what the
output looks like when Ansible fails to run the play because it can&amp;rsquo;t find the
host.&lt;/p&gt;
&lt;p&gt;Let&amp;rsquo;s work now on installing the dependencies for our Python project.&lt;/p&gt;
&lt;h2 id="installing-supervisord"&gt;Installing supervisord
&lt;/h2&gt;&lt;p&gt;&amp;ldquo;Supervisor is a client/server system that allows its users to monitor and
control a number of processes on UNIX-like operating systems.&amp;rdquo; We&amp;rsquo;ll use it to
run and monitor our Python process.&lt;/p&gt;
&lt;p&gt;On a Debian-like system, we can install it with APT. In the Ansible DSL that&amp;rsquo;s
just:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;- name: Install supervisord
  sudo: yes
  apt:
    name: supervisor
    state: present
    update_cache: yes
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You can read more about the &lt;a class="link" href="http://docs.ansible.com/ansible/apt_module.html" target="_blank" rel="noopener"
&gt;apt module
here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Once we have it installed, we can start it with this task:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;- name: Start supervisord
  sudo: yes
  service:
    name: &amp;quot;supervisor&amp;quot;
    state: running
    enabled: yes
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This uses the &lt;a class="link" href="http://docs.ansible.com/ansible/service_module.html" target="_blank" rel="noopener"
&gt;service&lt;/a&gt;
module.&lt;/p&gt;
&lt;p&gt;We could add these tasks to a playbook file (like ping.yml), but what if we
want to share them among multiple playbooks? For this, Ansible has a
construct called
&lt;a class="link" href="http://docs.ansible.com/ansible/playbooks_roles.html" target="_blank" rel="noopener"
&gt;Roles&lt;/a&gt;. A role is a
collection of &amp;ldquo;variable values, certain tasks, and certain handlers – or just
one or more of these things&amp;rdquo;. (You can learn more about variables and handlers
in the Ansible docs.)&lt;/p&gt;
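&lt;p&gt;As a quick taste of handlers: a handler is a task that runs at the end of a
play only if another task reported a change and notified it. Here&amp;rsquo;s a
hypothetical sketch (not one of this tutorial&amp;rsquo;s playbooks) that restarts
supervisor whenever the package install changes something:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;tasks:
  - name: Install supervisord
    become: true
    apt:
      name: supervisor
      state: present
    notify: restart supervisor
handlers:
  - name: restart supervisor
    become: true
    service:
      name: supervisor
      state: restarted
&lt;/code&gt;&lt;/pre&gt;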
&lt;p&gt;Roles are organized as subfolders of a folder called &amp;ldquo;roles&amp;rdquo; in the working
directory. The rapid proliferation of folders in Ansible organization can be
overwhelming, but a very simple role is just a file called &lt;code&gt;main.yml&lt;/code&gt; nestled
several folders deep. In our case, it&amp;rsquo;s in
&lt;code&gt;./roles/supervisor/tasks/main.yml&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Check out &lt;a class="link" href="http://docs.ansible.com/ansible/playbooks_roles.html#roles" target="_blank" rel="noopener"
&gt;the
docs&lt;/a&gt; to learn
more about role organization.&lt;/p&gt;
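&lt;p&gt;Concretely, the minimal layout for our supervisor role looks like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;roles/
└── supervisor/
    └── tasks/
        └── main.yml
&lt;/code&gt;&lt;/pre&gt;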
&lt;p&gt;Here&amp;rsquo;s what our role looks like:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;cat ./roles/supervisor/tasks/main.yml
---
- name: Install supervisord
  become: true
  apt:
    name: supervisor
    state: present
    update_cache: yes
  tags:
    supervisor
- name: Start supervisord
  become: true
  service:
    name: &amp;quot;supervisor&amp;quot;
    state: running
    enabled: yes
  tags:
    supervisor
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Note that I added &lt;code&gt;tags:&lt;/code&gt; to the task definitions.
&lt;a class="link" href="http://docs.ansible.com/ansible/playbooks_tags.html" target="_blank" rel="noopener"
&gt;Tags&lt;/a&gt; allow you to
run just a portion of a playbook, rather than the whole thing, with the &lt;code&gt;--tags&lt;/code&gt; flag
of &lt;code&gt;ansible-playbook&lt;/code&gt;.&lt;/p&gt;
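&lt;p&gt;For example, to run only the tasks tagged &lt;code&gt;supervisor&lt;/code&gt; from a playbook
(the playbook name here is illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;ansible-playbook playbook.yml --tags supervisor
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;There is also a &lt;code&gt;--skip-tags&lt;/code&gt; flag for the inverse: run everything &lt;em&gt;except&lt;/em&gt;
the named tags.&lt;/p&gt;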
&lt;p&gt;Now that we have the supervisor install encapsulated in a role, we can write a
simple playbook to run the role.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;cat supervisor.yml
---
- hosts: digitalocean
  roles:
    - role: supervisor
ansible-playbook supervisor.yml
 _____________________
&amp;lt; PLAY [digitalocean] &amp;gt;
 ---------------------
        \   ^__^
         \  (oo)\_______
            (__)\       )\/\
                ||----w |
                ||     ||
 ______________
&amp;lt; TASK [setup] &amp;gt;
 --------------
        \   ^__^
         \  (oo)\_______
            (__)\       )\/\
                ||----w |
                ||     ||
ok: [digitalocean]
 _________________________________________
&amp;lt; TASK [supervisor : Install supervisord] &amp;gt;
 -----------------------------------------
        \   ^__^
         \  (oo)\_______
            (__)\       )\/\
                ||----w |
                ||     ||
changed: [digitalocean]
 _______________________________________
&amp;lt; TASK [supervisor : Start supervisord] &amp;gt;
 ---------------------------------------
        \   ^__^
         \  (oo)\_______
            (__)\       )\/\
                ||----w |
                ||     ||
changed: [digitalocean]
 ____________
&amp;lt; PLAY RECAP &amp;gt;
 ------------
        \   ^__^
         \  (oo)\_______
            (__)\       )\/\
                ||----w |
                ||     ||
digitalocean : ok=3 changed=2 unreachable=0 failed=0
&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id="installing-conda-with-ansible-galaxy"&gt;Installing Conda with Ansible Galaxy
&lt;/h2&gt;&lt;p&gt;Next we want to ensure that Conda is installed on our system. We could write our
own role to follow the &lt;a class="link" href="https://www.continuum.io/downloads" target="_blank" rel="noopener"
&gt;recommended
process&lt;/a&gt;. However, Ansible has a
tool that helps us avoid reinventing the wheel by allowing users to share roles;
this is called &lt;a class="link" href="https://galaxy.ansible.com/" target="_blank" rel="noopener"
&gt;Ansible Galaxy&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;You can search the Galaxy website for
&lt;a class="link" href="https://galaxy.ansible.com/list#/roles?page=1&amp;amp;page_size=10&amp;amp;autocomplete=miniconda" target="_blank" rel="noopener"
&gt;miniconda&lt;/a&gt;
and see that a handful of roles for installing Miniconda exist. I liked &lt;a class="link" href="https://galaxy.ansible.com/andrewrothstein/miniconda/" target="_blank" rel="noopener"
&gt;this
one&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;We can install the role locally using the &lt;code&gt;ansible-galaxy&lt;/code&gt; command line tool.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;ansible-galaxy install -f andrewrothstein.miniconda
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You can have roles installed wherever you want (run &lt;code&gt;ansible-galaxy install --help&lt;/code&gt; to see how), but by default they&amp;rsquo;ll go to
&lt;code&gt;/usr/local/etc/ansible/roles/&lt;/code&gt;.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;ls -lh /usr/local/etc/ansible/roles/andrewrothstein.miniconda
total 32
-rw-rw-r-- 1 tdhopper admin 1.1K Jan 16 16:52 LICENSE
-rw-rw-r-- 1 tdhopper admin 666B Jan 16 16:52 README.md
-rw-rw-r-- 1 tdhopper admin 973B Jan 16 16:52 circle.yml
drwxrwxr-x 3 tdhopper admin 102B Mar 21 11:33 defaults
drwxrwxr-x 3 tdhopper admin 102B Mar 21 11:33 handlers
drwxrwxr-x 4 tdhopper admin 136B Mar 21 11:33 meta
drwxrwxr-x 3 tdhopper admin 102B Mar 21 11:33 tasks
drwxrwxr-x 3 tdhopper admin 102B Mar 21 11:33 templates
-rw-rw-r-- 1 tdhopper admin 57B Jan 16 16:52 test.yml
drwxrwxr-x 3 tdhopper admin 102B Mar 21 11:33 vars
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You can look at the &lt;code&gt;tasks/main.yml&lt;/code&gt; to see the core logic of installing
Miniconda. It has tasks to download the installer, run the installer, delete
the installer, run &lt;code&gt;conda update conda&lt;/code&gt;, and make &lt;code&gt;conda&lt;/code&gt; the default system
Python.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;cat /usr/local/etc/ansible/roles/andrewrothstein.miniconda/tasks/main.yml
---
# tasks file for miniconda
- name: download installer...
  become: yes
  become_user: root
  get_url:
    url: '{{miniconda_installer_url}}'
    dest: /tmp/{{miniconda_installer_sh}}
    timeout: '{{miniconda_timeout_seconds}}'
    checksum: '{{miniconda_checksum}}'
    mode: '0755'
- name: installing....
  become: yes
  become_user: root
  command: /tmp/{{miniconda_installer_sh}} -b -p {{miniconda_parent_dir}}/{{miniconda_name}}
  args:
    creates: '{{miniconda_parent_dir}}/{{miniconda_name}}'
- name: deleting installer...
  become: yes
  become_user: root
  when: miniconda_cleanup
  file:
    path: /tmp/{{miniconda_installer_sh}}
    state: absent
- name: link miniconda...
  become: yes
  become_user: root
  file:
    dest: '{{miniconda_parent_dir}}/miniconda'
    src: '{{miniconda_parent_dir}}/{{miniconda_name}}'
    state: link
- name: conda updates
  become: yes
  become_user: root
  command: '{{miniconda_parent_dir}}/miniconda/bin/conda update -y --all'
- name: make system default python etc...
  when: miniconda_make_sys_default
  become: yes
  become_user: root
  with_items:
    - etc/profile.d/miniconda.sh
  template:
    src: '{{item}}.j2'
    dest: /{{item}}
    mode: 0644
&lt;/code&gt;&lt;/pre&gt;
&lt;h3 id="overriding-ansible-variables"&gt;Overriding Ansible Variables
&lt;/h3&gt;&lt;p&gt;Once a role is installed locally, you can add it to a play just like you can
with roles you wrote. Installing Miniconda is now as simple as:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;  roles:
    - role: andrewrothstein.miniconda
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Before we add that to a playbook, I want to customize &lt;em&gt;where&lt;/em&gt; miniconda is
installed. If you look back at the &lt;code&gt;main.yml&lt;/code&gt; file above, you see a bunch of
things wrapped in double curly braces. These are variables (in the &lt;a class="link" href="http://jinja.pocoo.org/docs/2.9/" target="_blank" rel="noopener"
&gt;Jinja2
template language&lt;/a&gt;). From the play, we can
see that Miniconda will be installed at
&lt;code&gt;{{miniconda_parent_dir}}/{{miniconda_name}}&lt;/code&gt;. The role defines these
variables in &lt;code&gt;/andrewrothstein.miniconda/defaults/main.yml&lt;/code&gt;. We can override
the default variables by specifying them in our play.&lt;/p&gt;
&lt;p&gt;A play to install miniconda could look like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;---
- hosts: digitalocean
  vars:
    conda_folder_name: miniconda
    conda_root: /root
  roles:
    - role: andrewrothstein.miniconda
      miniconda_parent_dir: &amp;quot;{{ conda_root }}&amp;quot;
      miniconda_name: &amp;quot;{{ conda_folder_name }}&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I added this to
&lt;a class="link" href="https://github.com/tdhopper/automating_python/blob/master/playbook.yml" target="_blank" rel="noopener"
&gt;&lt;code&gt;playbook.yml&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;We now know how to use Ansible to start and run supervisord and to install
Miniconda. Let&amp;rsquo;s see how to use it to deploy and start our application.&lt;/p&gt;
&lt;h2 id="deploy-python-application"&gt;Deploy Python Application
&lt;/h2&gt;&lt;p&gt;There are countless ways to deploy a Python application. We&amp;rsquo;re going to see
how to use Ansible to deploy from Github.&lt;/p&gt;
&lt;p&gt;I created a little project called
&lt;a class="link" href="https://github.com/tdhopper/long_running_python_process" target="_blank" rel="noopener"
&gt;long_running_python_process&lt;/a&gt;.
It has a
&lt;a class="link" href="https://github.com/tdhopper/long_running_python_process/blob/master/main.py" target="_blank" rel="noopener"
&gt;&lt;code&gt;main.py&lt;/code&gt;&lt;/a&gt;
that writes a log line to stdout every 30 seconds; that&amp;rsquo;s it. It also includes
a &lt;a class="link" href="https://github.com/tdhopper/long_running_python_process/blob/master/environment.yml" target="_blank" rel="noopener"
&gt;Conda environment
file&lt;/a&gt;
specifying the dependencies and &lt;a class="link" href="https://github.com/tdhopper/long_running_python_process/blob/master/run.sh" target="_blank" rel="noopener"
&gt;a shell
script&lt;/a&gt;
that activates the environment and runs the program.&lt;/p&gt;
&lt;p&gt;We&amp;rsquo;re going to use Ansible to&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Clone the repository into our remote machine.&lt;/li&gt;
&lt;li&gt;Create a Conda environment based on the environment.yml file.&lt;/li&gt;
&lt;li&gt;Create a supervisord file for running the program.&lt;/li&gt;
&lt;li&gt;Start the supervisord job.&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id="clone-the-repository"&gt;Clone the repository
&lt;/h3&gt;&lt;p&gt;Cloning a repository with Ansible is easy. We just use the &lt;a class="link" href="http://docs.ansible.com/ansible/git_module.html" target="_blank" rel="noopener"
&gt;&lt;code&gt;git&lt;/code&gt;
module&lt;/a&gt;. This play will clone
the repo into the specified directory. The &lt;code&gt;update: yes&lt;/code&gt; flag tells Ansible to
update the repository from the remote if it has already been cloned.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;---
- hosts: digitalocean
  vars:
    project_repo: git://github.com/tdhopper/long_running_python_process.git
    project_location: /srv/long_running_python_process
  tasks:
    - name: Clone project code.
      git:
        repo: &amp;quot;{{ project_repo }}&amp;quot;
        dest: &amp;quot;{{ project_location }}&amp;quot;
        update: yes
&lt;/code&gt;&lt;/pre&gt;
&lt;h3 id="creating-the-conda-environment"&gt;Creating the Conda Environment
&lt;/h3&gt;&lt;p&gt;Since we&amp;rsquo;ve now installed conda and cloned the repository with an
&lt;code&gt;environment.yml&lt;/code&gt; file, we just need to run &lt;code&gt;conda env update&lt;/code&gt; from the
directory containing the environment spec. Here&amp;rsquo;s a play to do that:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;---
- hosts: digitalocean
  vars:
    project_location: /srv/long_running_python_process
  tasks:
    - name: Create Conda environment from project environment file.
      command: &amp;quot;{{conda_root}}/{{conda_folder_name}}/bin/conda env update&amp;quot;
      args:
        chdir: &amp;quot;{{ project_location }}&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;It uses the &lt;a class="link" href="http://docs.ansible.com/ansible/command_module.html" target="_blank" rel="noopener"
&gt;&lt;code&gt;command&lt;/code&gt;
module&lt;/a&gt; which just
executes a shell command in the desired directory.&lt;/p&gt;
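&lt;p&gt;One caveat: the &lt;code&gt;command&lt;/code&gt; module isn&amp;rsquo;t idempotent by itself, so Ansible
will report this task as changed on every run. If you only need the environment
created once, a sketch like this skips the command when the environment
directory already exists via the &lt;code&gt;creates&lt;/code&gt; argument (the environment path
here is a hypothetical example; it depends on the name in your
&lt;code&gt;environment.yml&lt;/code&gt;):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;- name: Create Conda environment from project environment file.
  command: &amp;quot;{{conda_root}}/{{conda_folder_name}}/bin/conda env update&amp;quot;
  args:
    chdir: &amp;quot;{{ project_location }}&amp;quot;
    creates: &amp;quot;{{conda_root}}/{{conda_folder_name}}/envs/long_running_python_process&amp;quot;
&lt;/code&gt;&lt;/pre&gt;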
&lt;h3 id="create-a-supervisord-file"&gt;Create a Supervisord File
&lt;/h3&gt;&lt;p&gt;By default, supervisord will &lt;a class="link" href="http://supervisord.org/configuration.html" target="_blank" rel="noopener"
&gt;look in &lt;code&gt;/etc/supervisor/conf.d/&lt;/code&gt; for
configuration&lt;/a&gt; on which programs to
run.&lt;/p&gt;
&lt;p&gt;We need to put a file in there that tells supervisord to run our &lt;code&gt;run.sh&lt;/code&gt;
script. Ansible has an integrated way of setting up templates which can be
placed on remote machines.&lt;/p&gt;
&lt;p&gt;I put a supervisord job template in the &lt;code&gt;./templates&lt;/code&gt; folder.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;cat ./templates/run_process.j2
[program:{{ program_name }}]
command=sh run.sh
autostart=true
directory={{ project_location }}
stderr_logfile=/var/log/{{ program_name }}.err.log
stdout_logfile=/var/log/{{ program_name }}.out.log
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This is a normal INI-style config file, except it includes Jinja2
variables. We can use the Ansible &lt;a class="link" href="http://docs.ansible.com/ansible/template_module.html" target="_blank" rel="noopener"
&gt;&lt;code&gt;template&lt;/code&gt;
module&lt;/a&gt; to create a task
which fills in the variables with information about our program and copies it
into the &lt;code&gt;conf.d&lt;/code&gt; folder on the remote machine.&lt;/p&gt;
&lt;p&gt;The play for this would look like:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;- hosts: digitalocean
  vars:
    project_location: /srv/long_running_python_process
    program_name: long_running_process
    supervisord_configs_path: /etc/supervisor/conf.d
  tasks:
    - name: Copy supervisord job file to remote
      template:
        src: ./templates/run_process.j2
        dest: &amp;quot;{{ supervisord_configs_path }}/run_process.conf&amp;quot;
        owner: root
&lt;/code&gt;&lt;/pre&gt;
&lt;h3 id="start-the-supevisord-job"&gt;Start the supevisord job
&lt;/h3&gt;&lt;p&gt;Finally, we just need to tell supervisord on our remote machine to start the
job described by &lt;code&gt;run_process.conf&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Instead of issuing our own shell commands via Ansible, we can use the
&lt;a class="link" href="http://docs.ansible.com/ansible/supervisorctl_module.html" target="_blank" rel="noopener"
&gt;&lt;code&gt;supervisorctl&lt;/code&gt;
module&lt;/a&gt;. The task
is just:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;- name: Start job
  supervisorctl:
    name: &amp;quot;{{ program_name }}&amp;quot;
    state: present
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;state: present&lt;/code&gt; ensures that Ansible calls &lt;code&gt;supervisorctl reread&lt;/code&gt; to load a
new config. Because our config has &lt;code&gt;autostart=true&lt;/code&gt;, supervisor will start it
as soon as the task is added.&lt;/p&gt;
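&lt;p&gt;The &lt;code&gt;supervisorctl&lt;/code&gt; module supports other states as well, including
&lt;code&gt;started&lt;/code&gt;, &lt;code&gt;stopped&lt;/code&gt;, &lt;code&gt;restarted&lt;/code&gt;, and &lt;code&gt;absent&lt;/code&gt;.
For instance, a sketch of a task you might add to restart the program after
deploying new code:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;- name: Restart job
  supervisorctl:
    name: &amp;quot;{{ program_name }}&amp;quot;
    state: restarted
&lt;/code&gt;&lt;/pre&gt;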
&lt;h2 id="the-big-playbook"&gt;The Big Playbook!
&lt;/h2&gt;&lt;p&gt;We can take everything we&amp;rsquo;ve described above and put it in one playbook.&lt;/p&gt;
&lt;p&gt;This playbook will:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Install Miniconda using the role from Ansible Galaxy.&lt;/li&gt;
&lt;li&gt;Install and start Supervisor using the role we created.&lt;/li&gt;
&lt;li&gt;Clone the Github project we want to run.&lt;/li&gt;
&lt;li&gt;Create a Conda environment based on the environment.yml file.&lt;/li&gt;
&lt;li&gt;Create a supervisord file for running the program.&lt;/li&gt;
&lt;li&gt;Start the supervisord job.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;All of this will be done on the host we specify (&lt;code&gt;digitalocean&lt;/code&gt;).&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;cat playbook.yml
---
- hosts: digitalocean
  vars:
    project_repo: git://github.com/tdhopper/long_running_python_process.git
    project_location: /srv/long_running_python_process
    program_name: long_running_process
    conda_folder_name: miniconda
    conda_root: /root
    supervisord_configs_path: /etc/supervisor/conf.d
  roles:
    - role: andrewrothstein.miniconda
      miniconda_parent_dir: &amp;quot;{{ conda_root }}&amp;quot;
      miniconda_name: &amp;quot;{{ conda_folder_name }}&amp;quot;
      tags:
        miniconda
    - role: supervisor
  tasks:
    - name: Clone project code.
      git:
        repo: &amp;quot;{{ project_repo }}&amp;quot;
        dest: &amp;quot;{{ project_location }}&amp;quot;
        update: yes
      tags:
        git
    - name: Create Conda environment from project environment file.
      command: &amp;quot;{{conda_root}}/{{conda_folder_name}}/bin/conda env update&amp;quot;
      args:
        chdir: &amp;quot;{{ project_location }}&amp;quot;
      tags:
        conda
    - name: Copy supervisord job file to remote
      template:
        src: ./templates/run_process.j2
        dest: &amp;quot;{{ supervisord_configs_path }}/run_process.conf&amp;quot;
        owner: root
      tags:
        conf
    - name: Start job
      supervisorctl:
        name: &amp;quot;{{ program_name }}&amp;quot;
        state: present
      tags:
        conf
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To configure our machine, we just have to run &lt;code&gt;ansible-playbook playbook.yml&lt;/code&gt;.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;ANSIBLE_NOCOWS=1 ansible-playbook playbook.yml
PLAY [digitalocean] ************************************************************
TASK [setup] *******************************************************************
ok: [digitalocean]
TASK [andrewrothstein.unarchive-deps : resolve platform specific vars] *********
TASK [andrewrothstein.unarchive-deps : install common pkgs...] *****************
changed: [digitalocean] =&amp;gt; (item=[u'tar', u'unzip', u'gzip', u'bzip2'])
TASK [andrewrothstein.bash : install bash] *************************************
ok: [digitalocean]
TASK [andrewrothstein.alpine-glibc-shim : fix alpine] **************************
skipping: [digitalocean]
TASK [andrewrothstein.miniconda : download installer...] ***********************
changed: [digitalocean]
TASK [andrewrothstein.miniconda : installing....] ******************************
changed: [digitalocean]
TASK [andrewrothstein.miniconda : deleting installer...] ***********************
skipping: [digitalocean]
TASK [andrewrothstein.miniconda : link miniconda...] ***************************
changed: [digitalocean]
TASK [andrewrothstein.miniconda : conda updates] *******************************
changed: [digitalocean]
TASK [andrewrothstein.miniconda : make system default python etc...] ***********
skipping: [digitalocean] =&amp;gt; (item=etc/profile.d/miniconda.sh)
TASK [supervisor : Install supervisord] ****************************************
ok: [digitalocean]
TASK [supervisor : Start supervisord] ******************************************
ok: [digitalocean]
TASK [Clone project code.] *****************************************************
changed: [digitalocean]
TASK [Create Conda environment from project environment file.] *****************
changed: [digitalocean]
TASK [Copy supervisord job file to remote] *************************************
changed: [digitalocean]
TASK [Start job] ***************************************************************
changed: [digitalocean]
PLAY RECAP *********************************************************************
digitalocean : ok=13 changed=9 unreachable=0 failed=0
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;See that the &lt;code&gt;PLAY RECAP&lt;/code&gt; shows that everything was OK, no systems were
unreachable, and no tasks failed.&lt;/p&gt;
&lt;p&gt;We can verify that the program is running without error:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;ssh digitalocean sudo supervisorctl status
long_running_process RUNNING pid 4618, uptime 0:01:34
ssh digitalocean cat /var/log/long_running_process.out.log
INFO:root:Process ran for the 1th time
INFO:root:Process ran for the 2th time
INFO:root:Process ran for the 3th time
INFO:root:Process ran for the 4th time
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If you&amp;rsquo;re lucky (i.e. your systems and networks are set up sufficiently
similarly to mine), you can run this exact same command to configure and start
a process on your own system. Moreover, you could use this exact same command
to start this program on an arbitrary number of machines by simply adding more
hosts to your inventory and play spec!&lt;/p&gt;
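&lt;p&gt;For example, a hypothetical inventory with a group of hosts might look like
this (the second host is made up), and the play&amp;rsquo;s &lt;code&gt;hosts:&lt;/code&gt; line
would name the group instead of &lt;code&gt;digitalocean&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;[pythonhosts]
digitalocean
second-server.example.com
&lt;/code&gt;&lt;/pre&gt;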
&lt;h2 id="conclusion"&gt;Conclusion
&lt;/h2&gt;&lt;p&gt;Ansible is a powerful, customizable tool. Unlike some similar tools, it
requires very little setup to start using it. As I&amp;rsquo;ve learned more about it,
I&amp;rsquo;ve seen more and more ways in which I could&amp;rsquo;ve used it in numerous past
projects; I intend to make it a regular part of my toolkit. (Historically
I&amp;rsquo;ve done this kind of thing with hacky combinations of shell scripts and
&lt;a class="link" href="http://www.fabfile.org/" target="_blank" rel="noopener"
&gt;Fabric&lt;/a&gt;; Ansible would often be better.)&lt;/p&gt;
&lt;p&gt;This tutorial just scratches the surface of Ansible&amp;rsquo;s functionality. If you
want to learn more, I again recommend reading through the
&lt;a class="link" href="http://docs.ansible.com/ansible/index.html" target="_blank" rel="noopener"
&gt;docs&lt;/a&gt;; they&amp;rsquo;re very good. Of
course, you should start writing and running your own playbooks as soon as
possible! I also liked &lt;a class="link" href="https://serversforhackers.com/an-ansible-tutorial" target="_blank" rel="noopener"
&gt;this tutorial from Server Admin for
Programmers&lt;/a&gt;. If you want
to compare Ansible to alternatives, the &lt;a class="link" href="https://valdhaus.co/books/taste-test-puppet-chef-salt-stack-ansible.html" target="_blank" rel="noopener"
&gt;Taste Test book&lt;/a&gt; by Matt Jaynes looks promising. For more on Supervisor,
&lt;a class="link" href="https://serversforhackers.com/monitoring-processes-with-supervisord" target="_blank" rel="noopener"
&gt;serversforhackers.com&lt;/a&gt; has a nice tutorial, and &lt;a class="link" href="http://supervisord.org/" target="_blank" rel="noopener"
&gt;its docs are
thorough&lt;/a&gt;.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;I wrote a tutorial on using &lt;a class="link" href="https://twitter.com/ansible" target="_blank" rel="noopener"
&gt;@ansible&lt;/a&gt; and
supervisor to deploy a long running Python process to a
&lt;a class="link" href="https://twitter.com/digitalocean" target="_blank" rel="noopener"
&gt;@digitalocean&lt;/a&gt;
VPS.&lt;a class="link" href="https://t.co/uPC8bY5haD" target="_blank" rel="noopener"
&gt;https://t.co/uPC8bY5haD&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;— Tim Hopper 🔭 (@tdhopper) &lt;a class="link" href="https://twitter.com/tdhopper/status/845256769429483520" target="_blank" rel="noopener"
&gt;March 24,
2017&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;</description></item><item><title>Naive Bayes Meme</title><link>https://tdhopper.com/blog/naive-bayes/</link><pubDate>Tue, 14 Mar 2017 21:29:00 +0000</pubDate><guid>https://tdhopper.com/blog/naive-bayes/</guid><description>&lt;figure&gt;&lt;img src="https://tdhopper.com/uploads/naivebayes/nb1.jpg"&gt;
&lt;/figure&gt;
&lt;figure&gt;&lt;img src="https://tdhopper.com/uploads/naivebayes/nb2.jpg"&gt;
&lt;/figure&gt;
&lt;figure&gt;&lt;img src="https://tdhopper.com/uploads/naivebayes/nb3.jpg"&gt;
&lt;/figure&gt;
&lt;figure&gt;&lt;img src="https://tdhopper.com/uploads/naivebayes/nb4.jpg"&gt;
&lt;/figure&gt;
&lt;figure&gt;&lt;img src="https://tdhopper.com/uploads/naivebayes/nb5.jpg"&gt;
&lt;/figure&gt;
&lt;figure&gt;&lt;img src="https://tdhopper.com/uploads/naivebayes/nb6.jpg"&gt;
&lt;/figure&gt;</description></item><item><title>Some Reflections on Being Turned Down for a Lot of Data Science Jobs</title><link>https://tdhopper.com/blog/some-reflections-on-being-turned-down-for-a-lot-of-data-science-jobs/</link><pubDate>Mon, 06 Mar 2017 17:02:00 +0000</pubDate><guid>https://tdhopper.com/blog/some-reflections-on-being-turned-down-for-a-lot-of-data-science-jobs/</guid><description>&lt;img src="https://tdhopper.com/rejected.png" alt="Featured image of post Some Reflections on Being Turned Down for a Lot of Data Science Jobs" /&gt;&lt;blockquote&gt;
&lt;p&gt;👉 The decision was close, but the team has decided to keep looking for
someone who might have more direct neural net experience.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;👉 Honestly, I think the way you communicated your thought process and
results was confusing for some people in the room.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;👉 He&amp;rsquo;s needing someone with an image analysis background for data scientist
we&amp;rsquo;re hiring now.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;👉 Quite honestly given your questions [about vacation policy] and the fact
that you are considering other options, [we] may not be the best choice for
you.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;These quotes above are some of the reasons I&amp;rsquo;ve been given for why I wasn&amp;rsquo;t
offered a data science job after interviewing. I&amp;rsquo;ve been told a variety of
other reasons as well: company decided against hiring remotes after
interviewing (I&amp;rsquo;ve heard this at least 3 times), company thought I changed
jobs too frequently, company decided it didn&amp;rsquo;t have necessary data
infrastructure in place for data science work. Multiple companies gave no
particular reason; some of these were at least kind enough to &lt;em&gt;notify me&lt;/em&gt; they
weren&amp;rsquo;t interested. One company hired someone with a Ph.D. from MIT soon after
turning me down.&lt;/p&gt;
&lt;p&gt;In the last five years, I&amp;rsquo;ve clearly interviewed for a lot of data science
jobs, and I&amp;rsquo;ve also been turned down for a lot of data science
jobs. I&amp;rsquo;ve spent a good bit of time reflecting
on why I wasn&amp;rsquo;t offered this job or that. Several folks have asked me if I had
any advice to share on the experience, and I hope to offer that here.&lt;/p&gt;
&lt;h2 id="you-never-really-know"&gt;You never really know
&lt;/h2&gt;&lt;p&gt;I learned with graduate school applications years ago: you rarely truly know
why you were turned down. Maybe my GRE scores weren&amp;rsquo;t high enough, or maybe
the reviewer rushed through my application in the 5 minutes before lunch.
Maybe my statement of interest was too weak, or maybe the department needed to
accept an alumnus&amp;rsquo;s child.&lt;/p&gt;
&lt;p&gt;The same goes for companies. I&amp;rsquo;m fairly skeptical that the reasons I have been
given for why I was passed over are the full story, and I suspect you will
rarely (if ever) know the real reasons why you weren&amp;rsquo;t offered a job. I try to
use the reasons I hear as a way to help me refine my skills and better present
myself, but I don&amp;rsquo;t put too much weight in them.&lt;/p&gt;
&lt;h2 id="some-advice-anyway"&gt;Some advice anyway
&lt;/h2&gt;&lt;p&gt;That said, here are a few takeaways from interviewing for probably 20 data
science jobs since 2012.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Companies often use interviews as a time to figure out what they&amp;rsquo;re really looking for.&lt;/strong&gt; I suspect this is rarely intentional. But actually interviewing candidates forces a team to talk through what they&amp;rsquo;re actually looking for, and they often realize they had differing perspectives prior to the interview.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Companies where &amp;ldquo;data science&amp;rdquo; is a new addition need your help in understanding what data science can do for them.&lt;/strong&gt; As much as possible, use the interview to sell your vision for what data science can offer at the company, how you&amp;rsquo;ll get it off the ground, and what the ROI might be.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Being the wrong fit for what a company needs is not ideal.&lt;/strong&gt; I&amp;rsquo;ve come to appreciate a company trying to ensure my abilities align with their needs. You&amp;rsquo;d hope this was &lt;em&gt;always&lt;/em&gt; the case, but I&amp;rsquo;ve been hired when it wasn&amp;rsquo;t. That said, I hesitate to say you should always look for this: if you need a job, and someone offers you a job, you should feel free to take it!&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Data infrastructure is important and many companies are lacking it.&lt;/strong&gt; Many data scientists can attest to being hired at a company only to discover the data they needed wasn&amp;rsquo;t available, and they spent months or years building the tools required for them to start their analysis. Many companies are naive about how much engineering effort is required for effective data science. Don&amp;rsquo;t assume that a company with a grand vision for data science necessarily knows what it will take to accomplish that vision.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Many companies are still uneasy about data science being done remotely.&lt;/strong&gt; I think this is silly, but I&amp;rsquo;m biased.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;There&amp;rsquo;s little consistency as to what you might be asked in a data science interview.&lt;/strong&gt; I&amp;rsquo;ve been asked about Java design patterns, how to solve combinatorics problems, to describe my favorite machine learning model, to explain the &lt;a class="link" href="https://en.wikipedia.org/wiki/Sequential_minimal_optimization" target="_blank" rel="noopener"
&gt;SMO algorithm&lt;/a&gt;, my opinions about the TensorFlow API, how I do software testing, to analyze a never-before-seen dataset and prepare a presentation in a 4 hour window, the list goes on. I spent a flight to the west coast reading up on the statistics of A/B testing only to be asked largely soft-skills type questions for an entire interview. I&amp;rsquo;ve largely given up attempting any special preparation for interviews.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Networking is still king.&lt;/strong&gt; Hiring is hard, and interviewing is hard; having a prior relationship with an applicant is attractive and reduces hiring uncertainty. In my own experience, my friendships and connections with the data science community on Twitter have &lt;a class="link" href="https://twitter.com/tdhopper/status/760109403383144448" target="_blank" rel="noopener"
&gt;shaped my career&lt;/a&gt;. Don&amp;rsquo;t downplay the benefits of networking.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="conclusion"&gt;Conclusion
&lt;/h2&gt;&lt;p&gt;So how do you get a data science job? I don&amp;rsquo;t know.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;ve been unbelievably fortunate to be continuously employed since college,
but I&amp;rsquo;m not sure how to tell you to repeat that. The best I have to offer is
to reiterate the conclusion of &lt;a class="link" href="https://tdhopper.com/blog/how-i-quit-my-ph.d.-and-learned-to-love-data-science/" &gt;my recent talk about data science as a career&lt;/a&gt;. Learn and know the hard stuff: linear algebra, probability,
statistics, machine learning, math modeling, data structures, algorithms,
distributed systems, etc. You probably won&amp;rsquo;t use this knowledge every day in
your job, but interviewers love to ask about it anyway.&lt;/p&gt;
&lt;p&gt;At the same time, don&amp;rsquo;t forget about the even harder skills: communication,
careful thought, prose writing skill, software writing skill, software
engineering, tenacity, Stack Overflow. You will use these every day in your
job, and they&amp;rsquo;ll help you present yourself well in an interview. (With the
exception of Stack Overflow. Using Stack Overflow in an interview is strangely
taboo.)&lt;/p&gt;
&lt;h2 id="further-reading"&gt;Further Reading
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="http://treycausey.com/data_science_interviews.html" target="_blank" rel="noopener"
&gt;Trey Causey: What it&amp;rsquo;s like to be on the data science job market&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="http://treycausey.com/hiring_data_scientists.html" target="_blank" rel="noopener"
&gt;Trey Causey: Hiring data scientists&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="http://www.erinshellman.com/crushed-it-landing-a-data-science-job/" target="_blank" rel="noopener"
&gt;Erin Shellman: Crushed it! Landing a data science job &lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="http://joelgrus.com/2016/05/23/fizz-buzz-in-tensorflow/" target="_blank" rel="noopener"
&gt;Joel Grus: Fizz Buzz in Tensorflow&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</description></item><item><title>Logistic Regression Rules Everything Around Me</title><link>https://tdhopper.com/blog/logistic-regression-rules-everything-around-me/</link><pubDate>Tue, 21 Feb 2017 00:00:00 +0000</pubDate><guid>https://tdhopper.com/blog/logistic-regression-rules-everything-around-me/</guid><description>&lt;p&gt;&lt;a class="link" href="https://twitter.com/fredbenenson" target="_blank" rel="noopener"
&gt;Fred Benenson&lt;/a&gt; spent 6 years doing data science at Kickstarter. When he left last year, he wrote a &lt;a class="link" href="https://hackernoon.com/on-to-the-next-2-271-days-309d6ba672d7" target="_blank" rel="noopener"
&gt;fantastic recap&lt;/a&gt; of his experience.&lt;/p&gt;
&lt;p&gt;His &amp;ldquo;list of things I&amp;rsquo;ve discovered over the years&amp;rdquo; is particularly good. Here are a few of the things that resonated with me:&lt;/p&gt;
&lt;blockquote&gt;
&lt;li&gt; The more you can work with someone to help refine their question the easier it will be to answer
&lt;li&gt; Conducting a randomized controlled experiment via an A/B test is always better than analyzing historical data
&lt;li&gt; Metrics are crucial to the story a company tells itself; it is essential to honestly and rigorously define them
&lt;li&gt; Good experimental design is difficult; don't allow a great testing framework to let you get lazy with it
&lt;li&gt; Data science (A/B testing, etc.) can help you optimize for a particular outcome, but it will never tell you which particular outcome to optimize for
&lt;li&gt; Always seek to record and attain data in its rawest form, whether you're instrumenting something yourself or retrieving it from an API
&lt;/blockquote&gt;
&lt;p&gt;I highly recommend reading &lt;a class="link" href="https://hackernoon.com/on-to-the-next-2-271-days-309d6ba672d7" target="_blank" rel="noopener"
&gt;the whole post&lt;/a&gt;.&lt;/p&gt;</description></item><item><title>D̶r̶. Hopper or: How I Quit My Ph.D. and Learned to Love Data Science</title><link>https://tdhopper.com/talks/dr-hopper-quit-phd-data-science/</link><pubDate>Thu, 02 Feb 2017 00:00:00 +0000</pubDate><guid>https://tdhopper.com/talks/dr-hopper-quit-phd-data-science/</guid><description>&lt;img src="https://tdhopper.com/preso.png" alt="Featured image of post D̶r̶. Hopper or: How I Quit My Ph.D. and Learned to Love Data Science" /&gt;&lt;p&gt;I recently gave a talk to the &lt;a class="link" href="http://bigdata.duke.edu/" target="_blank" rel="noopener"
&gt;Duke Big Data Initiative&lt;/a&gt; entitled &lt;em&gt;&lt;s&gt;Dr.&lt;/s&gt; Hopper, or How I Quit My Ph.D. and Learned to Love Data Science&lt;/em&gt;. The talk was well received, and &lt;a class="link" href="https://twitter.com/tdhopper/status/827239362404433922/photo/1" target="_blank" rel="noopener"
&gt;my slides&lt;/a&gt; seemed to resonate in the Twitter data science community.&lt;/p&gt;
&lt;div class="embed-responsive embed-responsive-16by9"&gt;
&lt;iframe src="https://docs.google.com/presentation/d/1_wdSh2PFxiqBegt5PcatbEiQaganlgdb5bH7V2jHXZI/embed?start=false&amp;loop=false&amp;delayms=10000" frameborder="0" width="480" height="299" allowfullscreen="true" mozallowfullscreen="true" webkitallowfullscreen="true"&gt;&lt;/iframe&gt;
&lt;/div&gt;
&lt;p&gt;Here&amp;rsquo;s my favorite section from the talk:&lt;/p&gt;
&lt;figure&gt;&lt;img src="https://tdhopper.com/talks/dr-hopper-quit-phd-data-science/dd1.png"
alt="What my wife thinks I do vs. what my mom thinks I do"&gt;
&lt;/figure&gt;
&lt;figure&gt;&lt;img src="https://tdhopper.com/talks/dr-hopper-quit-phd-data-science/dd2.png"
alt="What my CEO thinks I do vs. what my manager thinks I do"&gt;
&lt;/figure&gt;
&lt;figure&gt;&lt;img src="https://tdhopper.com/talks/dr-hopper-quit-phd-data-science/dd3.png"
alt="What I thought I would do vs. what I tell people on Twitter I do"&gt;
&lt;/figure&gt;
&lt;figure&gt;&lt;img src="https://tdhopper.com/talks/dr-hopper-quit-phd-data-science/dd4.png"
alt="What I actually do"&gt;
&lt;/figure&gt;
&lt;p&gt;I conclude by listing things I learned in school but don&amp;rsquo;t use every day:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Linear Algebra&lt;/li&gt;
&lt;li&gt;Probability&lt;/li&gt;
&lt;li&gt;Statistics&lt;/li&gt;
&lt;li&gt;Machine Learning&lt;/li&gt;
&lt;li&gt;Math Modeling&lt;/li&gt;
&lt;li&gt;Data Structures&lt;/li&gt;
&lt;li&gt;Algorithms&lt;/li&gt;
&lt;li&gt;Distributed Systems&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;And things I wasn&amp;rsquo;t taught in school but use daily:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Communication&lt;/li&gt;
&lt;li&gt;Careful Thought&lt;/li&gt;
&lt;li&gt;Prose Writing Skill&lt;/li&gt;
&lt;li&gt;Software Writing Skill&lt;/li&gt;
&lt;li&gt;Software Engineering&lt;/li&gt;
&lt;li&gt;Tenacity&lt;/li&gt;
&lt;li&gt;Stack Overflow&lt;/li&gt;
&lt;/ul&gt;</description></item><item><title>Your Old Tweets from This Day</title><link>https://tdhopper.com/blog/your-old-tweets-from-this-day/</link><pubDate>Tue, 24 Jan 2017 00:00:00 +0000</pubDate><guid>https://tdhopper.com/blog/your-old-tweets-from-this-day/</guid><description>&lt;p&gt;A while ago, I &lt;a class="link" href="https://gist.github.com/tdhopper/fccc4ab00a07cb290f1812d8ac5cc5f3" target="_blank" rel="noopener"
&gt;published a Bash script&lt;/a&gt; that will open a Twitter search page to show your old tweets from this day of the year. I have enjoyed using it to &lt;a class="link" href="https://twitter.com/tdhopper/status/559005293808746497" target="_blank" rel="noopener"
&gt;see&lt;/a&gt; &lt;a class="link" href="https://twitter.com/tdhopper/status/426718234461872130" target="_blank" rel="noopener"
&gt;what&lt;/a&gt; I was &lt;a class="link" href="https://twitter.com/tdhopper/status/691088342499381250" target="_blank" rel="noopener"
&gt;thinking&lt;/a&gt; &lt;a class="link" href="https://twitter.com/tdhopper/status/426702495361269760" target="_blank" rel="noopener"
&gt;about&lt;/a&gt; in &lt;a class="link" href="https://twitter.com/tdhopper/status/294300532770738176" target="_blank" rel="noopener"
&gt;days&lt;/a&gt; &lt;a class="link" href="https://twitter.com/tdhopper/status/29641415810088961" target="_blank" rel="noopener"
&gt;gone&lt;/a&gt; by.&lt;/p&gt;
&lt;p&gt;So I turned this into a Twitter account.&lt;/p&gt;
&lt;blockquote class="twitter-tweet" data-lang="en"&gt;&lt;p lang="en" dir="ltr"&gt;Follow me to have a link tweeted @ you each day that will show your old tweets from that day of the year. &lt;a href="https://twitter.com/hashtag/experimental?src=hash"&gt;#experimental&lt;/a&gt;&lt;/p&gt;&amp;mdash; On This Day (@your_old_tweets) &lt;a href="https://twitter.com/your_old_tweets/status/819196698044014592"&gt;January 11, 2017&lt;/a&gt;&lt;/blockquote&gt;
&lt;script async src="//platform.twitter.com/widgets.js" charset="utf-8"&gt;&lt;/script&gt;
&lt;p&gt;If you follow &lt;a class="link" href="https://twitter.com/your_old_tweets" target="_blank" rel="noopener"
&gt;@your_old_tweets&lt;/a&gt;, it&amp;rsquo;ll tweet a link at you each day that will show you your old tweets from that day of the year. It attempts to send it in the morning (assuming you have your &lt;a class="link" href="https://twitter.com/settings/account" target="_blank" rel="noopener"
&gt;timezone set&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;This runs on Amazon Lambda. The code is &lt;a class="link" href="https://github.com/tdhopper/your_old_tweets" target="_blank" rel="noopener"
&gt;here&lt;/a&gt;.&lt;/p&gt;</description></item><item><title>Compare RSA Key with Fingerprint in Github</title><link>https://tdhopper.com/blog/compare-rsa-key-with-fingerprint-in-github/</link><pubDate>Fri, 06 Jan 2017 00:00:00 +0000</pubDate><guid>https://tdhopper.com/blog/compare-rsa-key-with-fingerprint-in-github/</guid><description>&lt;img src="https://tdhopper.com/images/til.png" alt="Featured image of post Compare RSA Key with Fingerprint in Github" /&gt;&lt;p&gt;When you add an SSH key to your Github account, Github shows you the hexadecimal form of the MD5 hash of your public key.&lt;/p&gt;
&lt;p&gt;If you ever need to compare that against a key file on your computer, you can run:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-fallback" data-lang="fallback"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;ssh-keygen -E md5 -lf ~/.ssh/id_rsa.pub
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;I learned this from &lt;a class="link" href="http://stackoverflow.com/a/32130465/982745" target="_blank" rel="noopener"
&gt;StackOverflow&lt;/a&gt;.&lt;/p&gt;</description></item><item><title>The perils of thinkism</title><link>https://tdhopper.com/blog/the-perils-of-thinkism/</link><pubDate>Sat, 24 Dec 2016 15:01:00 +0000</pubDate><guid>https://tdhopper.com/blog/the-perils-of-thinkism/</guid><description>&lt;img src="https://tdhopper.com/images/reading.png" alt="Featured image of post The perils of thinkism" /&gt;&lt;p&gt;From &lt;a class="link" href="https://lemire.me/blog/2016/12/20/what-is-a-useful-theory/" target="_blank" rel="noopener"
&gt;Daniel Lemire&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;I was an adept, as a teenager and a young adult, of thinkism. Thinkism is the idea that intelligence alone can solve problems. I thought I was smart so that I could just sit down and solve important problems. One after the other. Whatever contributions I ended up making had little to do with sitting down and thinking hard… and much more to do with getting dirty.&lt;/p&gt;
&lt;/blockquote&gt;</description></item><item><title>Tallest Data Scientist</title><link>https://tdhopper.com/blog/tallest-data-scientist/</link><pubDate>Thu, 08 Dec 2016 00:00:00 +0000</pubDate><guid>https://tdhopper.com/blog/tallest-data-scientist/</guid><description>&lt;img src="https://tdhopper.com/images/datascientist.png" alt="Featured image of post Tallest Data Scientist" /&gt;&lt;p&gt;I was honored to &lt;a class="link" href="http://adversariallearning.com/episode-2-the-tallest-data-scientist.html" target="_blank" rel="noopener"
&gt;join my friends Joel and Andrew on the Adversarial Learning&lt;/a&gt; podcast to talk about my career in data science (and what it&amp;rsquo;s like to be the tallest one).&lt;/p&gt;
&lt;iframe src="https://podcasters.spotify.com/pod/show/adversarial-learning/embed/episodes/Episode-Two-The-Tallest-Data-Scientist-eooh8a/a-a4acm9p" height="102px" width="100%" frameborder="0" scrolling="no"&gt;&lt;/iframe&gt;</description></item><item><title>Get Pykafka to work with rdkafka on Linux</title><link>https://tdhopper.com/blog/get-pykafka-to-work-with-rdkafka-on-linux/</link><pubDate>Fri, 18 Nov 2016 00:00:00 +0000</pubDate><guid>https://tdhopper.com/blog/get-pykafka-to-work-with-rdkafka-on-linux/</guid><description>&lt;img src="https://tdhopper.com/images/til.png" alt="Featured image of post Get Pykafka to work with rdkafka on Linux" /&gt;&lt;p&gt;My former colleague&amp;rsquo;s from Parse.ly wrote the fantastic &lt;a class="link" href="https://github.com/Parsely/pykafka" target="_blank" rel="noopener"
&gt;pykafka&lt;/a&gt; library with an optional C backend using &lt;a class="link" href="https://github.com/edenhill/librdkafka" target="_blank" rel="noopener"
&gt;rdkafka&lt;/a&gt;. I&amp;rsquo;ve had trouble getting it to work, and here are a few things I&amp;rsquo;ve learned:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The version of rdkafka installable with apt-get was out of date, and pykafka couldn&amp;rsquo;t find the headers it needed. I instead used the simple &lt;a class="link" href="https://github.com/edenhill/librdkafka#building" target="_blank" rel="noopener"
&gt;build instructions&lt;/a&gt; in the rdkafka README to build it from head.&lt;/li&gt;
&lt;li&gt;I was getting the error &lt;code&gt;ImportError: librdkafka.so.1: cannot open shared object file: No such file or directory&lt;/code&gt; when trying to use rdkafka from Pykafka. It could be worked around in the short term by setting &lt;code&gt;LD_LIBRARY_PATH=/usr/local/lib&lt;/code&gt;. However, I fixed it permanently by running &lt;a class="link" href="https://linux.die.net/man/8/ldconfig" target="_blank" rel="noopener"
&gt;&lt;code&gt;sudo ldconfig&lt;/code&gt;&lt;/a&gt; after building rdkafka.&lt;/li&gt;
&lt;li&gt;Pykafka has to be installed &lt;em&gt;after&lt;/em&gt; building rdkafka. At the moment, Pykafka tries to build a C-extension to connect to rdkafka, and if that fails, it will install without offering the rdkafka backend. Check the output of &lt;code&gt;pip install pykafka&lt;/code&gt; &lt;a class="link" href="https://github.com/Parsely/pykafka/blob/1b946fa288dd5eb278f397e858e887146e048e47/setup.py#L164-L172" target="_blank" rel="noopener"
&gt;to see if the rdkafka extension built&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;</description></item><item><title>Data Scientists Need More Automation</title><link>https://tdhopper.com/blog/data-scientists-need-more-automation/</link><pubDate>Thu, 17 Nov 2016 02:00:00 +0000</pubDate><guid>https://tdhopper.com/blog/data-scientists-need-more-automation/</guid><description>&lt;p&gt;Many data scientists aren&amp;rsquo;t lazy enough.&lt;/p&gt;
&lt;p&gt;Whether we are managing production services or running computations on AWS
machines, many data scientists are working on computers besides their laptops.&lt;/p&gt;
&lt;p&gt;For me, this often takes the form of SSH-ing into remote boxes, manually
configuring the system with a combination of apt installs, &lt;a class="link" href="http://conda.pydata.org/docs/using/envs.html" title="Managing
environments — Conda documentation"
target="_blank" rel="noopener"
&gt;Conda
environments&lt;/a&gt;, and bash scripts.&lt;/p&gt;
&lt;p&gt;To run my service or scripts, I open a &lt;a class="link" href="https://tmux.github.io/" title="tmux"
target="_blank" rel="noopener"
&gt;tmux&lt;/a&gt;
window, activate my virtual environment, and start the process. (When I have
to do this on multiple machines, I&amp;rsquo;m occasionally clever enough to use tmux to
broadcast the commands to multiple terminal windows.)&lt;/p&gt;
&lt;p&gt;When I need to check my logs or see the output, I SSH back into each box,
reconnect to tmux (after I remember the name of my session), and tail my logs.
When running on multiple boxes, I repeat this process N times. If I need to
restart a process, I flip through my tmux tabs until I find the correct
process, kill it with a Ctrl-C, and use the up arrow to reload the last run
command.&lt;/p&gt;
&lt;p&gt;All of this works, of course. &lt;a class="link" href="https://xkcd.com/1319/" target="_blank" rel="noopener"
&gt;And as we all know&lt;/a&gt;, &lt;a class="link" href="http://xkcd.com/974/" target="_blank" rel="noopener"
&gt;a
simple solution that works&lt;/a&gt; &lt;a class="link" href="https://xkcd.com/1445/" target="_blank" rel="noopener"
&gt;can be
preferable&lt;/a&gt; to a fragile solution that requires
constant maintenance. That said, I suspect many of us aren&amp;rsquo;t lazy enough. We
don&amp;rsquo;t spend enough time automating tasks and processes. Even when we don&amp;rsquo;t
save time by doing it, we may save &lt;a class="link" href="http://www.johndcook.com/blog/2015/12/22/automate-to-save-mental-energy-not-time/" target="_blank" rel="noopener"
&gt;mental overhead&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I recently introduced several colleagues to some Python-based tools that can
help. &lt;a class="link" href="http://www.fabfile.org/" target="_blank" rel="noopener"
&gt;Fabric&lt;/a&gt; is a &amp;ldquo;library and command-line tool
for streamlining the use of SSH for application deployment or systems
administration tasks.&amp;rdquo; Fabric allows you to encapsulate sequences of commands
as you might with a Makefile. Its killer feature is the ease with which it
lets you execute those commands on remote machines over SSH. With Fabric, you
could tail all the logs on all your nodes with a single command executed in
your local terminal. There are &lt;a class="link" href="https://www.youtube.com/results?search_query=python&amp;#43;fabric" target="_blank" rel="noopener"
&gt;a number of talks about Fabric on
Youtube&lt;/a&gt; if you
want to learn more. One of my colleagues reduced his daily workload by writing
his system management tasks into a Fabric file.&lt;/p&gt;
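As a concrete illustration, here is a minimal sketch of a fabfile.py using the Fabric 1.x API that was current at the time; the host names and log path are hypothetical:

```python
# A minimal fabfile.py sketch (Fabric 1.x); hosts and paths are made up.
from fabric.api import env, run

# The machines to run tasks on, over SSH.
env.hosts = ["worker1.example.com", "worker2.example.com"]

def tail_logs():
    # Runs on every host in env.hosts when invoked as: fab tail_logs
    run("tail -n 20 /var/log/myservice.log")
```

With this in place, a single `fab tail_logs` in your local terminal tails the log on every node.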
&lt;p&gt;Another great tool is &lt;a class="link" href="http://supervisord.org/" target="_blank" rel="noopener"
&gt;Supervisor&lt;/a&gt;. If you run long
running processes in tmux/screen/nohup, Supervisor might be for you. It allows
you to define the tasks you want to run in an INI file and &amp;ldquo;provides you with
one place to start, stop, and monitor your processes&amp;rdquo;. Supervisor will log the
stdout and stderr to a log location of your choice. It can be a little
confusing to set up, but will likely make your life easier in the long run.&lt;/p&gt;
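For a sense of what that INI file looks like, here is a minimal sketch; the program name, command, and log paths are all hypothetical:

```ini
; A minimal supervisord program definition; names and paths are made up.
[program:myworker]
command=/home/ubuntu/miniconda3/envs/myenv/bin/python worker.py
directory=/home/ubuntu/app
autostart=true
autorestart=true
stdout_logfile=/var/log/myworker.out.log
stderr_logfile=/var/log/myworker.err.log
```

Once loaded, `supervisorctl status myworker` and `supervisorctl restart myworker` replace the tmux-and-Ctrl-C dance.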
&lt;p&gt;A tool I want to learn but haven&amp;rsquo;t is &lt;a class="link" href="https://www.ansible.com/" target="_blank" rel="noopener"
&gt;Ansible&lt;/a&gt;, &amp;ldquo;a
free-software platform for configuring and managing computers which combines
multi-node software deployment, ad hoc task execution, and configuration
management&amp;rdquo;. Unlike Chef and Puppet, Ansible doesn&amp;rsquo;t require an agent on the
systems you need to configure; it does all the configuration over SSH. You can
use Ansible to configure your systems and install your dependencies, even
&lt;a class="link" href="https://github.com/zenoamaro/ansible-supervisord" target="_blank" rel="noopener"
&gt;Supervisor&lt;/a&gt;! Ansible is
written in Python and, mercifully, doesn&amp;rsquo;t require learning a Ruby-based DSL
(as does Chef).&lt;/p&gt;
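To make that concrete, here is a minimal playbook sketch; the host group, package name, and file paths are hypothetical:

```yaml
# A minimal Ansible playbook sketch; group and paths are made up.
- hosts: workers
  become: true
  tasks:
    - name: Install Supervisor
      apt:
        name: supervisor
        state: present
    - name: Deploy the program config
      copy:
        src: myworker.conf
        dest: /etc/supervisor/conf.d/myworker.conf
```

Running `ansible-playbook site.yml` applies this over SSH to every host in the `workers` group, with no agent required.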
&lt;p&gt;Recently I&amp;rsquo;ve been thinking that Fabric, Supervisor, and Ansible combined
become a powerful toolset for management and configuration of data science
systems. Each tool is also open source and can be installed in a few minutes.
Each tool is well documented and offers helpful tutorials on getting started;
however, learning to use them effectively may require some effort.&lt;/p&gt;
&lt;p&gt;I would love to see someone create training materials on these tools (and
others!) focused on how data scientists can improve their system
management, configuration, and operations. A screencast series may be the
perfect thing. Someone please help data scientists be lazier, do less work,
and reduce the mental overhead of dealing with computers!&lt;/p&gt;</description></item><item><title>Filter by date in a Pandas MultiIndex</title><link>https://tdhopper.com/blog/filter-by-date-in-a-pandas-multiindex/</link><pubDate>Tue, 08 Nov 2016 22:17:00 +0000</pubDate><guid>https://tdhopper.com/blog/filter-by-date-in-a-pandas-multiindex/</guid><description>&lt;p&gt;I always forget how to do this.&lt;/p&gt;
&lt;p&gt;The pandas &lt;code&gt;DataFrame.loc&lt;/code&gt; method allows for &lt;em&gt;label&lt;/em&gt; -based filtering of data
frames. The &lt;a class="link" href="http://pandas.pydata.org/pandas-docs/stable/advanced.html#advanced-indexing-with-hierarchical-index" target="_blank" rel="noopener"
&gt;Pandas docs&lt;/a&gt; show how
it can be used to filter a &lt;code&gt;MultiIndex&lt;/code&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt; 1
&lt;/span&gt;&lt;span class="lnt"&gt; 2
&lt;/span&gt;&lt;span class="lnt"&gt; 3
&lt;/span&gt;&lt;span class="lnt"&gt; 4
&lt;/span&gt;&lt;span class="lnt"&gt; 5
&lt;/span&gt;&lt;span class="lnt"&gt; 6
&lt;/span&gt;&lt;span class="lnt"&gt; 7
&lt;/span&gt;&lt;span class="lnt"&gt; 8
&lt;/span&gt;&lt;span class="lnt"&gt; 9
&lt;/span&gt;&lt;span class="lnt"&gt;10
&lt;/span&gt;&lt;span class="lnt"&gt;11
&lt;/span&gt;&lt;span class="lnt"&gt;12
&lt;/span&gt;&lt;span class="lnt"&gt;13
&lt;/span&gt;&lt;span class="lnt"&gt;14
&lt;/span&gt;&lt;span class="lnt"&gt;15
&lt;/span&gt;&lt;span class="lnt"&gt;16
&lt;/span&gt;&lt;span class="lnt"&gt;17
&lt;/span&gt;&lt;span class="lnt"&gt;18
&lt;/span&gt;&lt;span class="lnt"&gt;19
&lt;/span&gt;&lt;span class="lnt"&gt;20
&lt;/span&gt;&lt;span class="lnt"&gt;21
&lt;/span&gt;&lt;span class="lnt"&gt;22
&lt;/span&gt;&lt;span class="lnt"&gt;23
&lt;/span&gt;&lt;span class="lnt"&gt;24
&lt;/span&gt;&lt;span class="lnt"&gt;25
&lt;/span&gt;&lt;span class="lnt"&gt;26
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-fallback" data-lang="fallback"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;In [39]: df
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;Out[39]:
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; A B C
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;first second
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;bar one 0.895717 0.410835 -1.413681
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; two 0.805244 0.813850 1.607920
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;baz one -1.206412 0.132003 1.024180
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; two 2.565646 -0.827317 0.569605
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;foo one 1.431256 -0.076467 0.875906
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; two 1.340309 -1.187678 -2.211372
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;qux one -1.170299 1.130127 0.974466
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; two -0.226169 -1.436737 -2.006747
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;In [40]: df.loc[&amp;#39;bar&amp;#39;]
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;Out[40]:
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; A B C
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;second
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;one 0.895717 0.410835 -1.413681
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;two 0.805244 0.813850 1.607920
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;In [41]: df.loc[&amp;#39;bar&amp;#39;, &amp;#39;two&amp;#39;]
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;Out[41]:
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;A 0.805244
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;B 0.813850
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;C 1.607920
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;Name: (bar, two), dtype: float64
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;It turns out you can easily use it to filter a &lt;code&gt;DateTimeIndex&lt;/code&gt; level by a
single date with &lt;code&gt;df['2016-11-07']&lt;/code&gt; or a range of dates with
&lt;code&gt;df['2016-11-07':'2016-11-11']&lt;/code&gt;. This applies whether or not it&amp;rsquo;s a &lt;code&gt;MultiIndex&lt;/code&gt;.&lt;/p&gt;
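A minimal runnable sketch of that label-based date filtering (the frame, level names, and values here are invented for illustration):

```python
import pandas as pd

# Toy frame with a MultiIndex whose first level is a DateTimeIndex.
dates = pd.date_range("2016-11-07", periods=5, freq="D")
index = pd.MultiIndex.from_product([dates, ["a", "b"]], names=["date", "key"])
df = pd.DataFrame({"value": range(10)}, index=index).sort_index()

one_day = df.loc["2016-11-07"]                   # rows for a single date
date_range = df.loc["2016-11-07":"2016-11-09"]   # inclusive label-based range
```

Note the `.sort_index()` call: label slicing on the date level requires a lexsorted index.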
&lt;p&gt;If you get an error like &lt;code&gt;KeyError: 'Key length (1) was greater than MultiIndex lexsort depth (0)'&lt;/code&gt;, it&amp;rsquo;s because &amp;ldquo;MultiIndex Slicing requires the
index to be fully lexsorted&amp;rdquo;. You may fix your problem by calling &lt;code&gt;df = df.sort_index()&lt;/code&gt;.&lt;/p&gt;</description></item><item><title>Speeding up PyMC3 NUTS Sampler</title><link>https://tdhopper.com/blog/speeding-up-pymc3-nuts-sampler/</link><pubDate>Tue, 08 Nov 2016 00:00:00 +0000</pubDate><guid>https://tdhopper.com/blog/speeding-up-pymc3-nuts-sampler/</guid><description>&lt;img src="https://tdhopper.com/images/til.png" alt="Featured image of post Speeding up PyMC3 NUTS Sampler" /&gt;&lt;p&gt;I&amp;rsquo;m trying to use the &lt;a class="link" href="https://pymc-devs.github.io/pymc3/api.html?highlight=nuts#module-pymc3.step_methods.nuts" target="_blank" rel="noopener"
&gt;NUTS sampler&lt;/a&gt; in &lt;a class="link" href="https://github.com/pymc-devs/pymc3" title="GitHub - pymc-devs/pymc3: Probabilistic Programming in Python. Uses Theano as a backend, supports NUTS and ADVI."
target="_blank" rel="noopener"
&gt;PyMC3&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;However, it was running at 2 iterations per second on my model, while the Metropolis-Hastings sampler ran 450x faster.&lt;/p&gt;
&lt;p&gt;I showed my example to some of the PyMC3 devs on Twitter, and &lt;a class="link" href="https://de.linkedin.com/in/thomas-wiecki-46339244" title="Thomas Wiecki"
target="_blank" rel="noopener"
&gt;Thomas Wiecki&lt;/a&gt; showed me this trick:&lt;/p&gt;
&lt;blockquote class="twitter-tweet" data-lang="en"&gt;&lt;p lang="en" dir="ltr"&gt;&lt;a href="https://twitter.com/tdhopper"&gt;@tdhopper&lt;/a&gt; &lt;a href="https://twitter.com/Springcoil"&gt;@Springcoil&lt;/a&gt; You need pm.NUTS(scaling=np.power(model.dict_to_array(v_params.stds), 2), is_cov=True) (terrible syntax, I know).&lt;/p&gt;&amp;mdash; Thomas Wiecki (@twiecki) &lt;a href="https://twitter.com/twiecki/status/796007019941462016"&gt;November 8, 2016&lt;/a&gt;&lt;/blockquote&gt;
&lt;script async src="//platform.twitter.com/widgets.js" charset="utf-8"&gt;&lt;/script&gt;
&lt;p&gt;It resulted in a 25x speedup of the NUTS sampler. The code looks like this:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt;1
&lt;/span&gt;&lt;span class="lnt"&gt;2
&lt;/span&gt;&lt;span class="lnt"&gt;3
&lt;/span&gt;&lt;span class="lnt"&gt;4
&lt;/span&gt;&lt;span class="lnt"&gt;5
&lt;/span&gt;&lt;span class="lnt"&gt;6
&lt;/span&gt;&lt;span class="lnt"&gt;7
&lt;/span&gt;&lt;span class="lnt"&gt;8
&lt;/span&gt;&lt;span class="lnt"&gt;9
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Model&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="c1"&gt;# SETUP MODEL HERE&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;mu&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sds&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;elbo&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;variational&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;advi&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;200000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;step&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NUTS&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scaling&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;power&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dict_to_array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sds&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sample&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;niter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;step&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;step&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;is_cov&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;mu&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;</description></item><item><title>Understanding Probabilistic Topic Models By Simulation</title><link>https://tdhopper.com/talks/understanding-probabilistic-topic-models-by-simulation/</link><pubDate>Tue, 25 Oct 2016 00:00:00 +0000</pubDate><guid>https://tdhopper.com/talks/understanding-probabilistic-topic-models-by-simulation/</guid><description>&lt;img src="https://tdhopper.com/preso.png" alt="Featured image of post Understanding Probabilistic Topic Models By Simulation" /&gt;&lt;p&gt;I gave a talk last week at Research Triangle Analysts on understanding probabilistic topic models (specifically LDA) by using Python for simulation. Here&amp;rsquo;s the description:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Latent Dirichlet Allocation and related topic models are often presented in the form of complicated equations and confusing diagrams. Tim Hopper presents LDA as a generative model through probabilistic simulation in simple Python. Simulation will help data scientists to understand the model assumptions and limitations and more effectively use black box LDA implementations.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;You can watch the video on &lt;a class="link" href="https://www.youtube.com/watch?v=Wy-XhT2sHgM&amp;amp;feature=youtu.be" target="_blank" rel="noopener"
&gt;YouTube&lt;/a&gt;:&lt;/p&gt;
&lt;div class="video-wrapper"&gt;
&lt;iframe loading="lazy"
src="https://www.youtube.com/embed/Wy-XhT2sHgM"
allowfullscreen
title="YouTube Video"
&gt;
&lt;/iframe&gt;
&lt;/div&gt;
&lt;p&gt;I gave a &lt;a class="link" href="https://www.youtube.com/watch?v=_R66X_udxZQ" target="_blank" rel="noopener"
&gt;shorter version of the talk at PyData NYC 2015&lt;/a&gt;.&lt;/p&gt;
&lt;div class="video-wrapper"&gt;
&lt;iframe loading="lazy"
src="https://www.youtube.com/embed/_R66X_udxZQ"
allowfullscreen
title="YouTube Video"
&gt;
&lt;/iframe&gt;
&lt;/div&gt;</description></item><item><title>Undersampled Radio Interview</title><link>https://tdhopper.com/blog/undersampled-radio/</link><pubDate>Mon, 24 Oct 2016 00:00:00 +0000</pubDate><guid>https://tdhopper.com/blog/undersampled-radio/</guid><description>&lt;p&gt;I was flattered to be asked to be on a burgeoning data science podcast called Undersampled Radio. You can listen on &lt;a class="link" href="https://www.youtube.com/watch?v=q4e_hBUd6zI&amp;amp;feature=youtu.be" target="_blank" rel="noopener"
&gt;YouTube&lt;/a&gt;.&lt;/p&gt;</description></item><item><title>Sharing Your Side Projects Online</title><link>https://tdhopper.com/talks/sharing-side-projects/</link><pubDate>Wed, 05 Oct 2016 00:00:00 +0000</pubDate><guid>https://tdhopper.com/talks/sharing-side-projects/</guid><description>&lt;img src="https://tdhopper.com/preso.png" alt="Featured image of post Sharing Your Side Projects Online" /&gt;&lt;div class="video-wrapper"&gt;
&lt;iframe loading="lazy"
src="https://www.youtube.com/embed/uRul8QdYvqQ"
allowfullscreen
title="YouTube Video"
&gt;
&lt;/iframe&gt;
&lt;/div&gt;</description></item><item><title>Setting Up supervisord</title><link>https://tdhopper.com/blog/setting-up-supervisord/</link><pubDate>Tue, 20 Sep 2016 00:00:00 +0000</pubDate><guid>https://tdhopper.com/blog/setting-up-supervisord/</guid><description>&lt;img src="https://tdhopper.com/images/til.png" alt="Featured image of post Setting Up supervisord" /&gt;&lt;p&gt;First run &lt;code&gt;$ sudo pip install supervisor&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Create &lt;code&gt;/etc/supervisord.conf&lt;/code&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt; 1
&lt;/span&gt;&lt;span class="lnt"&gt; 2
&lt;/span&gt;&lt;span class="lnt"&gt; 3
&lt;/span&gt;&lt;span class="lnt"&gt; 4
&lt;/span&gt;&lt;span class="lnt"&gt; 5
&lt;/span&gt;&lt;span class="lnt"&gt; 6
&lt;/span&gt;&lt;span class="lnt"&gt; 7
&lt;/span&gt;&lt;span class="lnt"&gt; 8
&lt;/span&gt;&lt;span class="lnt"&gt; 9
&lt;/span&gt;&lt;span class="lnt"&gt;10
&lt;/span&gt;&lt;span class="lnt"&gt;11
&lt;/span&gt;&lt;span class="lnt"&gt;12
&lt;/span&gt;&lt;span class="lnt"&gt;13
&lt;/span&gt;&lt;span class="lnt"&gt;14
&lt;/span&gt;&lt;span class="lnt"&gt;15
&lt;/span&gt;&lt;span class="lnt"&gt;16
&lt;/span&gt;&lt;span class="lnt"&gt;17
&lt;/span&gt;&lt;span class="lnt"&gt;18
&lt;/span&gt;&lt;span class="lnt"&gt;19
&lt;/span&gt;&lt;span class="lnt"&gt;20
&lt;/span&gt;&lt;span class="lnt"&gt;21
&lt;/span&gt;&lt;span class="lnt"&gt;22
&lt;/span&gt;&lt;span class="lnt"&gt;23
&lt;/span&gt;&lt;span class="lnt"&gt;24
&lt;/span&gt;&lt;span class="lnt"&gt;25
&lt;/span&gt;&lt;span class="lnt"&gt;26
&lt;/span&gt;&lt;span class="lnt"&gt;27
&lt;/span&gt;&lt;span class="lnt"&gt;28
&lt;/span&gt;&lt;span class="lnt"&gt;29
&lt;/span&gt;&lt;span class="lnt"&gt;30
&lt;/span&gt;&lt;span class="lnt"&gt;31
&lt;/span&gt;&lt;span class="lnt"&gt;32
&lt;/span&gt;&lt;span class="lnt"&gt;33
&lt;/span&gt;&lt;span class="lnt"&gt;34
&lt;/span&gt;&lt;span class="lnt"&gt;35
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-gdscript3" data-lang="gdscript3"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;supervisor&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt; &lt;span class="n"&gt;file&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;unix_http_server&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;file&lt;/span&gt;&lt;span class="o"&gt;=/&lt;/span&gt;&lt;span class="k"&gt;var&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;supervisor&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sock&lt;/span&gt; &lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;path&lt;/span&gt; &lt;span class="n"&gt;to&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;socket&lt;/span&gt; &lt;span class="n"&gt;file&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;chmod&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0700&lt;/span&gt; &lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;sockef&lt;/span&gt; &lt;span class="n"&gt;file&lt;/span&gt; &lt;span class="n"&gt;mode&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;default&lt;/span&gt; &lt;span class="mi"&gt;0700&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;supervisord&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;logfile&lt;/span&gt;&lt;span class="o"&gt;=/&lt;/span&gt;&lt;span class="k"&gt;var&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="nb"&gt;log&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;supervisor&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;supervisord&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log&lt;/span&gt; &lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;main&lt;/span&gt; &lt;span class="nb"&gt;log&lt;/span&gt; &lt;span class="n"&gt;file&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="n"&gt;default&lt;/span&gt; &lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;CWD&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;supervisord&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;pidfile&lt;/span&gt;&lt;span class="o"&gt;=/&lt;/span&gt;&lt;span class="k"&gt;var&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;supervisord&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pid&lt;/span&gt; &lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;supervisord&lt;/span&gt; &lt;span class="n"&gt;pidfile&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="n"&gt;default&lt;/span&gt; &lt;span class="n"&gt;supervisord&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pid&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;childlogdir&lt;/span&gt;&lt;span class="o"&gt;=/&lt;/span&gt;&lt;span class="k"&gt;var&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="nb"&gt;log&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;supervisor&lt;/span&gt; &lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;AUTO&amp;#39;&lt;/span&gt; &lt;span class="n"&gt;child&lt;/span&gt; &lt;span class="nb"&gt;log&lt;/span&gt; &lt;span class="n"&gt;dir&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;default&lt;/span&gt; &lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;TEMP&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;below&lt;/span&gt; &lt;span class="n"&gt;section&lt;/span&gt; &lt;span class="n"&gt;must&lt;/span&gt; &lt;span class="n"&gt;remain&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt; &lt;span class="n"&gt;file&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;RPC&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;supervisorctl&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;web&lt;/span&gt; &lt;span class="n"&gt;interface&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;to&lt;/span&gt; &lt;span class="n"&gt;work&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;additional&lt;/span&gt; &lt;span class="n"&gt;interfaces&lt;/span&gt; &lt;span class="n"&gt;may&lt;/span&gt; &lt;span class="n"&gt;be&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;added&lt;/span&gt; &lt;span class="n"&gt;by&lt;/span&gt; &lt;span class="n"&gt;defining&lt;/span&gt; &lt;span class="n"&gt;them&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;separate&lt;/span&gt; &lt;span class="n"&gt;rpcinterface&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;sections&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;rpcinterface&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;supervisor&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;supervisor&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rpcinterface_factory&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;supervisor&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rpcinterface&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;make_main_rpcinterface&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;supervisorctl&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;serverurl&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;unix&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;///&lt;/span&gt;&lt;span class="k"&gt;var&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;supervisor&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sock&lt;/span&gt; &lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;use&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="n"&gt;unix&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;//&lt;/span&gt; &lt;span class="n"&gt;URL&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="n"&gt;unix&lt;/span&gt; &lt;span class="n"&gt;socket&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;The&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;include&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="n"&gt;section&lt;/span&gt; &lt;span class="n"&gt;can&lt;/span&gt; &lt;span class="n"&gt;just&lt;/span&gt; &lt;span class="n"&gt;contain&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;files&amp;#34;&lt;/span&gt; &lt;span class="n"&gt;setting&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt; &lt;span class="n"&gt;This&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;setting&lt;/span&gt; &lt;span class="n"&gt;can&lt;/span&gt; &lt;span class="n"&gt;list&lt;/span&gt; &lt;span class="n"&gt;multiple&lt;/span&gt; &lt;span class="n"&gt;files&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;separated&lt;/span&gt; &lt;span class="n"&gt;by&lt;/span&gt; &lt;span class="n"&gt;whitespace&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;newlines&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt; &lt;span class="n"&gt;It&lt;/span&gt; &lt;span class="n"&gt;can&lt;/span&gt; &lt;span class="n"&gt;also&lt;/span&gt; &lt;span class="n"&gt;contain&lt;/span&gt; &lt;span class="n"&gt;wildcards&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt; &lt;span class="n"&gt;The&lt;/span&gt; &lt;span class="n"&gt;filenames&lt;/span&gt; &lt;span class="n"&gt;are&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;interpreted&lt;/span&gt; &lt;span class="n"&gt;as&lt;/span&gt; &lt;span class="n"&gt;relative&lt;/span&gt; &lt;span class="n"&gt;to&lt;/span&gt; &lt;span class="n"&gt;this&lt;/span&gt; &lt;span class="n"&gt;file&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt; &lt;span class="n"&gt;Included&lt;/span&gt; &lt;span class="n"&gt;files&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;cannot&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;include&lt;/span&gt; &lt;span class="n"&gt;files&lt;/span&gt; &lt;span class="n"&gt;themselves&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;include&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;files&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;etc&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;supervisor&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;conf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="o"&gt;/*.&lt;/span&gt;&lt;span class="n"&gt;conf&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;program&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;program1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;command&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;program1command&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;autostart&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;true&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;stderr_logfile&lt;/span&gt;&lt;span class="o"&gt;=/&lt;/span&gt;&lt;span class="k"&gt;var&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="nb"&gt;log&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;program1&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;stdout_logfile&lt;/span&gt;&lt;span class="o"&gt;=/&lt;/span&gt;&lt;span class="k"&gt;var&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="nb"&gt;log&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;program1&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Now run &lt;code&gt;$ sudo supervisord -c /etc/supervisord.conf&lt;/code&gt;. See the status with &lt;code&gt;$ sudo supervisorctl status&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;After modifying the config, run &lt;code&gt;$ sudo supervisorctl reread &amp;amp;&amp;amp; sudo supervisorctl update&lt;/code&gt;.&lt;/p&gt;</description></item><item><title>Entropy of a Discrete Probability Distribution</title><link>https://tdhopper.com/blog/entropy-of-a-discrete-probability-distribution/</link><pubDate>Mon, 05 Sep 2016 16:05:00 +0000</pubDate><guid>https://tdhopper.com/blog/entropy-of-a-discrete-probability-distribution/</guid><description>&lt;img src="https://tdhopper.com/entropy.png" alt="Featured image of post Entropy of a Discrete Probability Distribution" /&gt;&lt;p&gt;Suppose we have a discrete set of possible events \(1,\ldots, n\) that occur with probabilities \( (p_1, p_2, \ldots, p_n)\). Claude Shannon asked the question&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Can we find a measure of how much &amp;ldquo;choice&amp;rdquo; is involved in the selection of the event or of how uncertain we are of the outcome?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;For example, suppose we have a coin that lands on heads with probability \(p\) and tails with probability \(1-p\). If \(p=1\), the coin always lands on heads. Since there is no uncertainty, we might want to say the uncertainty is 0. However, if the coin is fair and \(p=0.5\), we maximize our uncertainty: it&amp;rsquo;s a complete tossup whether the coin is heads or tails. We might want to say the uncertainty in this case is 1.&lt;/p&gt;
&lt;p&gt;In general, Shannon wanted to devise a function \(H(p_1, p_2, \ldots, p_n)\) describing the uncertainty of an arbitrary set of discrete events (e.g. an \(n\)-sided die). He thought that &amp;ldquo;it is reasonable&amp;rdquo; that \(H\) should have three properties:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;\(H\) should be a continuous function of each \(p_i\). A small change in a single probability should result in a similarly small entropy (uncertainty) change.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;If each event is equally likely (\(p_i=\frac{1}{n}\)), \(H\) should increase as a function of \(n\): the more events there are, the more uncertain we are.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Finally, entropy should be additive for independent events. Suppose we generate a random variable \(x\) by the following process: Flip a fair coin. If it is heads, \(x=0\). However, if the flip was tails, flip the coin again (an &lt;em&gt;independent&lt;/em&gt; event from the first flip). If the second flip is heads, \(x=1\), if tails \(x=2\). These three outcomes occur with probability $1/2$, $1/4$, and $1/4$, respectively.&lt;/p&gt;
&lt;p&gt;We can compute the entropy of $x$ as \(H(p_0=1/2, p_1=1/4, p_2=1/4)\). By the independence property, this relationship holds:&lt;/p&gt;
&lt;p&gt;\[H\left(\frac{1}{2}, \frac{1}{4}, \frac{1}{4}\right)=H\left(\frac{1}{2}, \frac{1}{2}\right) + \frac{1}{2} H\left(\frac{1}{2}, \frac{1}{2}\right).\]&lt;/p&gt;
&lt;p&gt;As &lt;a class="link" href="http://www.inference.phy.cam.ac.uk/itprnn/book.pdf" target="_blank" rel="noopener"
&gt;David MacKay explains&lt;/a&gt;, this is the general claim that&lt;/p&gt;
&lt;p&gt;\[ H(\mathbf{p})=H(p_1, 1-p_1)+(1-p_1)H\left(\frac{p_2}{1-p_1}, \frac{p_3}{1-p_1}, \ldots, \frac{p_n}{1-p_1}\right).\]&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
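&lt;p&gt;The additivity property can be checked numerically for the coin-flip example above. A minimal sketch (the &lt;code&gt;entropy&lt;/code&gt; helper is illustrative, not from the original post):&lt;/p&gt;

```python
import math

def entropy(probs):
    # Shannon entropy in bits; zero-probability terms contribute nothing.
    return -sum(p * math.log2(p) for p in probs if p > 0)

# H(1/2, 1/4, 1/4) should equal H(1/2, 1/2) + (1/2) * H(1/2, 1/2).
lhs = entropy([0.5, 0.25, 0.25])
rhs = entropy([0.5, 0.5]) + 0.5 * entropy([0.5, 0.5])
print(math.isclose(lhs, rhs))  # True (both equal 1.5 bits)
```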
&lt;p&gt;Shannon showed that given these three assumptions, there is a unique form that \(H\) must take:&lt;/p&gt;
$$ H\propto -\sum_{i=1}^n p_i \log p_i=\sum_{i=1}^n p_i \log \frac{1}{p_i}. $$&lt;p&gt;He named this measure of uncertainty &lt;em&gt;entropy&lt;/em&gt;, because the form of \(H\) bears striking similarity to that of &lt;a class="link" href="https://en.wikipedia.org/wiki/Entropy_%28statistical_thermodynamics%29#Gibbs_Entropy_Formula" target="_blank" rel="noopener"
&gt;Gibbs Entropy&lt;/a&gt; in statistical thermodynamics.&lt;sup id="fnref:1"&gt;&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref"&gt;1&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;p&gt;Shannon observes that \(H\) has many other interesting properties:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Entropy \(H\) is 0 if and only if exactly one event has probability 1 and the rest have probability 0. (Uncertainty vanishes only when we are certain about the outcomes.)&lt;/li&gt;
&lt;li&gt;Entropy \(H\) is maximized when the \(p_i\) values are equal.&lt;/li&gt;
&lt;li&gt;The joint entropy of two events is less than or equal to the sum of the individual entropies, with \(H(x, y)=H(x)+H(y)\) holding only when \(x\) and \(y\) are independent events.&lt;/li&gt;
&lt;/ol&gt;
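&lt;p&gt;The first two of these properties are easy to verify numerically. A small sketch with an illustrative helper:&lt;/p&gt;

```python
import math

def entropy(probs):
    # Shannon entropy in bits of a discrete distribution.
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Property 1: a certain outcome has zero entropy.
assert entropy([1.0, 0.0, 0.0]) == 0

# Property 2: the uniform distribution maximizes entropy -- log2(3) bits
# for three equally likely outcomes, more than any skewed alternative.
assert entropy([1 / 3, 1 / 3, 1 / 3]) > entropy([0.5, 0.25, 0.25])
```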
&lt;p&gt;You can read more about this in Shannon&amp;rsquo;s seminal paper &lt;a class="link" href="http://ieeexplore.ieee.org/xpl/articleDetails.jsp?reload=true&amp;amp;arnumber=6773024" target="_blank" rel="noopener"
&gt;A Mathematical Theory of Communication&lt;/a&gt;.&lt;/p&gt;
&lt;div class="footnotes" role="doc-endnotes"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:1"&gt;
&lt;p&gt;Caianiello and Aizerman &lt;a class="link" href="https://books.google.com/books?id=ak6qCAAAQBAJ&amp;amp;lpg=PP1&amp;amp;ots=bzAoIDJk1K&amp;amp;dq=Topics%20in%20the%20General%20Theory%20of%20Structures&amp;amp;pg=PA200#v=onepage&amp;amp;q=cross%20entropy&amp;amp;f=false" target="_blank" rel="noopener"
&gt;say&lt;/a&gt; the name entropy is thanks to von Neumann who said&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;You should call it entropy, for two reasons. In the first place your uncertainty function has been used in statistical mechanics under that name, so it already has a name. In the second place, and more important, nobody knows what entropy really is, so in a debate you will always have the advantage.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;They argue that the name &amp;ldquo;uncertainty&amp;rdquo; would have been much more helpful since &amp;ldquo;Shannon entropy is simply and avowedly the &amp;lsquo;measure of the uncertainty inherent in a pre-assigned probability scheme.&amp;rsquo;&amp;rdquo;&amp;#160;&lt;a href="#fnref:1" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</description></item><item><title>Cross Entropy and KL Divergence</title><link>https://tdhopper.com/blog/cross-entropy-and-kl-divergence/</link><pubDate>Mon, 05 Sep 2016 14:12:00 +0000</pubDate><guid>https://tdhopper.com/blog/cross-entropy-and-kl-divergence/</guid><description>&lt;p&gt;&lt;a class="link" href="https://tdhopper.com/blog/entropy-of-a-discrete-probability-distribution/" target="_blank" rel="noopener"
&gt;As we saw in an earlier post&lt;/a&gt;, the entropy of a discrete probability distribution is defined to be&lt;/p&gt;
$$H(p)=H(p_1,p_2,\ldots,p_n)=-\sum_{i}p_i \log p_i.$$&lt;p&gt;Kullback and Leibler defined a similar measure now known as &lt;em&gt;KL divergence&lt;/em&gt;. This measure quantifies how similar a probability distribution $p$ is to a candidate distribution $q$.&lt;/p&gt;
$$D_{\text{KL}}(p\| q)=\sum_i p_i \log \frac{p_i}{q_i}.$$&lt;p&gt;$D_\text{KL}$ is non-negative and zero if and only if $ p_i = q_i $ for all $i$. However, it is important to note that it is not in general symmetric:&lt;/p&gt;
$$ D_{\text{KL}}(p\| q) \neq D_{\text{KL}}(q\| p).$$&lt;p&gt;&lt;a class="link" href="http://arxiv.org/pdf/1404.2000v1.pdf" target="_blank" rel="noopener"
&gt;Jonathon Shlens explains&lt;/a&gt; that KL Divergence can be interpreted as measuring the likelihood that samples represented by the empirical distribution $p$ were generated by a fixed distribution $q$. If $D_{\text{KL}}(p\| q)=0$, we can guarantee that $p$ is generated by $q$. As $D_{\text{KL}}(p\| q)\rightarrow\infty$, we can say that it is increasingly unlikely that $p$ was generated by $q$.&lt;/p&gt;
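&lt;p&gt;A quick numeric sketch of the definition and its asymmetry (the &lt;code&gt;kl_divergence&lt;/code&gt; helper is illustrative, not from a library):&lt;/p&gt;

```python
import math

def kl_divergence(p, q):
    # D_KL(p || q) for discrete distributions given as probability lists;
    # terms with p_i equal to zero contribute nothing by convention.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.25]
q = [1 / 3, 1 / 3, 1 / 3]

assert kl_divergence(p, p) == 0  # zero exactly when the distributions match
assert kl_divergence(p, q) > 0   # otherwise strictly positive
# Not symmetric in general:
print(kl_divergence(p, q), kl_divergence(q, p))
```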
&lt;p&gt;Algebraically, we can rewrite the definition as&lt;/p&gt;
$$ \begin{array}{rl} D_{\text{KL}}(p\| q) &amp;=\sum_i p_i \log \frac{p_i}{q_i} \\\\ &amp;=\sum_i \left ( - p_i \log q_i + p_i \log p_i \right)\\\\ &amp;=- \sum_i p_i \log q_i + \sum_i p_i \log p_i \\\\ &amp;=- \sum_i p_i \log q_i - \sum_i p_i \log \frac{1}{p_i} \\\\ &amp;=- \sum_i p_i \log q_i-H(p) \\\\ &amp;=\sum_i p_i \log \frac{1}{q_i}-H(p)\\\\ \end{array} $$&lt;p&gt;KL Divergence breaks down as something that looks similar to entropy (but combining $p$ and $q$) minus the entropy of $p$. This first term is often called &lt;em&gt;cross entropy&lt;/em&gt;:&lt;/p&gt;
$$H(p, q)=\sum_i p_i \log \frac{1}{q_i}.$$&lt;p&gt;We could alternatively use this relationship to define cross entropy as:&lt;/p&gt;
$$H(p, q)=H(p) + D_\text{KL}(p\| q).$$&lt;p&gt;Intuitively, the cross entropy is the uncertainty implicit in $H(p)$ plus the likelihood that $p$ could have been generated by $q$. If we consider $p$ to be a fixed distribution, $H(p, q)$ and $D_\text{KL}(p \| q)$ differ by an additive constant for all $q$.&lt;/p&gt;</description></item><item><title>Photos Featured on Smithsonian Magazine</title><link>https://tdhopper.com/blog/photos-featured-on-smithsonian-magazine/</link><pubDate>Mon, 25 Jul 2016 00:00:00 +0000</pubDate><guid>https://tdhopper.com/blog/photos-featured-on-smithsonian-magazine/</guid><description>&lt;p&gt;A few weeks ago, I introduced my wife to backpacking in the beautiful &lt;a class="link" href="http://www.dcr.virginia.gov/state-parks/grayson-highlands" title="Grayson Highlands State Park"
target="_blank" rel="noopener"
&gt;Grayson Highlands State Park&lt;/a&gt; in southwestern Virginia. Part of my reason for picking this location was to see the herd of wild ponies that live at 5000&amp;rsquo; on the grassy balds.&lt;/p&gt;
&lt;p&gt;I shared some of my best pictures from the trip on Flickr under a Creative Commons license (CC BY-NC-ND 2.0). On Saturday, I stumbled across &lt;a class="link" href="http://www.smithsonianmag.com/travel/more-than-100-ponies-roam-free-this-park-virginia-180959786/" target="_blank" rel="noopener"
&gt;an article about the Grayson Highlands ponies on the Smithsonian Magazine website&lt;/a&gt;. I was pleasantly surprised to see they selected two of my images for the story! I&amp;rsquo;ve been spending more time lately exploring my &lt;a class="link" href="https://twitter.com/tdhopper/status/43876534687248384" target="_blank" rel="noopener"
&gt;longtime interest in wildlife photography&lt;/a&gt;, and I&amp;rsquo;m thrilled to see others sharing my work.&lt;/p&gt;
&lt;p&gt;You can find more of my photography on &lt;a class="link" href="https://photos.tdhopper.com" target="_blank" rel="noopener"
&gt;photos.tdhopper.com&lt;/a&gt; or &lt;a class="link" href="https://www.instagram.com/tdhopper/" target="_blank" rel="noopener"
&gt;Instagram&lt;/a&gt;.&lt;/p&gt;
&lt;figure&gt;&lt;img src="https://tdhopper.com/uploads/grayson1.png"&gt;&lt;figcaption&gt;
&lt;h4&gt;Ponies at Grayson Highlands State Park&lt;/h4&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;figure&gt;&lt;img src="https://tdhopper.com/uploads/grayson2.png"&gt;&lt;figcaption&gt;
&lt;h4&gt;Ponies at Grayson Highlands State Park&lt;/h4&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;</description></item><item><title>Find Files Modified in last N Minutes</title><link>https://tdhopper.com/blog/find-files-modified-in-last-n-minutes/</link><pubDate>Wed, 22 Jun 2016 00:00:00 +0000</pubDate><guid>https://tdhopper.com/blog/find-files-modified-in-last-n-minutes/</guid><description>&lt;img src="https://tdhopper.com/images/til.png" alt="Featured image of post Find Files Modified in last N Minutes" /&gt;&lt;p&gt;I&amp;rsquo;ve been using the Unix &lt;a class="link" href="http://www.freebsd.org/cgi/man.cgi?find%281%29" target="_blank" rel="noopener"
&gt;&lt;code&gt;find&lt;/code&gt;&lt;/a&gt; command a lot more regularly lately. I&amp;rsquo;ve always been scared off by its syntax, but it&amp;rsquo;s great once you&amp;rsquo;ve learned it.&lt;/p&gt;
&lt;p&gt;Today I learned how to filter the results to files changed in the last N minutes using the &lt;code&gt;-cmin&lt;/code&gt; flag:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt;1
&lt;/span&gt;&lt;span class="lnt"&gt;2
&lt;/span&gt;&lt;span class="lnt"&gt;3
&lt;/span&gt;&lt;span class="lnt"&gt;4
&lt;/span&gt;&lt;span class="lnt"&gt;5
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-fallback" data-lang="fallback"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; -cmin [-|+]n
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; True if the difference between the time of last change of file
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; status information and the time find was started, rounded up to
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; the next full minute, is more than n (+n), less than n (-n), or
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; exactly n minutes ago.
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;For example:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt;1
&lt;/span&gt;&lt;span class="lnt"&gt;2
&lt;/span&gt;&lt;span class="lnt"&gt;3
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-fallback" data-lang="fallback"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;find . -cmin +5 # Files changed more than 5 minutes ago
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;find . -cmin -5 # Files changed less than 5 minutes ago
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;find . -cmin 5 # Files changed exactly 5 minutes ago
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
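&lt;p&gt;For comparison, here&amp;rsquo;s a rough Python sketch of &lt;code&gt;find . -cmin -n&lt;/code&gt; (the function name and the &lt;code&gt;pathlib&lt;/code&gt; recursion are my own; like &lt;code&gt;-cmin&lt;/code&gt;, it compares the status-change time, &lt;code&gt;st_ctime&lt;/code&gt;):&lt;/p&gt;

```python
import time
from pathlib import Path

def changed_within(root, minutes):
    """Yield files under root whose status changed less than `minutes`
    minutes ago, roughly mirroring `find root -cmin -minutes`."""
    cutoff = time.time() - minutes * 60
    for path in Path(root).rglob("*"):
        if path.is_file() and path.stat().st_ctime >= cutoff:
            yield path
```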
&lt;/div&gt;</description></item><item><title>Configuring Impala Query Results with impalarc</title><link>https://tdhopper.com/blog/configuring-impala-query-results-with-impalarc/</link><pubDate>Wed, 15 Jun 2016 00:00:00 +0000</pubDate><guid>https://tdhopper.com/blog/configuring-impala-query-results-with-impalarc/</guid><description>&lt;p&gt;I routinely run Impala queries on a &lt;a class="link" href="http://tdhopper.com/notes/faster-ssh-access-to-remote-computers" target="_blank" rel="noopener"
&gt;remote machine&lt;/a&gt; and want the results to save to a CSV file on that machine.&lt;/p&gt;
&lt;p&gt;I recently realized that I should set up an Impala config file on that machine to configure the connection to the Impala cluster and the output file format.&lt;/p&gt;
&lt;p&gt;So I created a text file at &lt;code&gt;~/.impalarc&lt;/code&gt; with the following settings:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt;1
&lt;/span&gt;&lt;span class="lnt"&gt;2
&lt;/span&gt;&lt;span class="lnt"&gt;3
&lt;/span&gt;&lt;span class="lnt"&gt;4
&lt;/span&gt;&lt;span class="lnt"&gt;5
&lt;/span&gt;&lt;span class="lnt"&gt;6
&lt;/span&gt;&lt;span class="lnt"&gt;7
&lt;/span&gt;&lt;span class="lnt"&gt;8
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-fallback" data-lang="fallback"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;[impala]
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;impalad=CLUSTER-ADDRESS:PORT
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;output_delimiter=,
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;verbose=true
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;ignore_query_failure=false
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;default_db=prd
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;write_delimited=true
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;print_header=true
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Now when I run a query with &lt;code&gt;impala-shell&lt;/code&gt;, I don&amp;rsquo;t have to specify the address of the cluster, or the various flags required to get a CSV with a header.&lt;/p&gt;
&lt;p&gt;Other impala-shell config options are specified &lt;a class="link" href="http://www.cloudera.com/documentation/archive/impala/2-x/2-0-x/topics/impala_shell_options.html" target="_blank" rel="noopener"
&gt;here&lt;/a&gt;.&lt;/p&gt;</description></item><item><title>Faster SSH Access to Remote Computers</title><link>https://tdhopper.com/blog/faster-ssh-access-to-remote-computers/</link><pubDate>Wed, 15 Jun 2016 00:00:00 +0000</pubDate><guid>https://tdhopper.com/blog/faster-ssh-access-to-remote-computers/</guid><description>&lt;img src="https://tdhopper.com/images/til.png" alt="Featured image of post Faster SSH Access to Remote Computers" /&gt;&lt;p&gt;I routinely &lt;a class="link" href="https://en.wikipedia.org/wiki/SSH" title="Wikipedia Entry: SSH"
target="_blank" rel="noopener"
&gt;SSH&lt;/a&gt; into a company machine on Openstack to do work. Until yesterday, I&amp;rsquo;d use my Bash history to find the SSH command I needed to access it. I was reading &lt;a class="link" href="https://www.amazon.com/SSH-Mastery-OpenSSH-PuTTY-Tunnels/dp/1470069717" target="_blank" rel="noopener"
&gt;SSH Mastery&lt;/a&gt; on a plane yesterday and realized I&amp;rsquo;ve been foolish to neglect the power of &lt;code&gt;~/.ssh/config&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;I added this to my SSH config file:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt;1
&lt;/span&gt;&lt;span class="lnt"&gt;2
&lt;/span&gt;&lt;span class="lnt"&gt;3
&lt;/span&gt;&lt;span class="lnt"&gt;4
&lt;/span&gt;&lt;span class="lnt"&gt;5
&lt;/span&gt;&lt;span class="lnt"&gt;6
&lt;/span&gt;&lt;span class="lnt"&gt;7
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-fallback" data-lang="fallback"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;Host lab
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; HostName MACHINE.IP.ADDRESS
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; User tdhopper
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; Port 22
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; IdentityFile ~/.ssh/id_rsa
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; ForwardAgent yes
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; RemoteForward 52698 localhost:52698
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Now I can connect to the machine by typing &lt;code&gt;$ ssh lab&lt;/code&gt;. That&amp;rsquo;s it. I don&amp;rsquo;t have to provide the SSH key, username, or machine IP.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;RemoteForward&lt;/code&gt; bit also forwards a port to enable me to open files on the remote machine in my local SublimeText editor using &lt;a class="link" href="https://github.com/henrikpersson/rsub" target="_blank" rel="noopener"
&gt;rsub&lt;/a&gt;. This is great.&lt;/p&gt;</description></item><item><title>Using Conda with the Fish Shell</title><link>https://tdhopper.com/blog/using-conda-with-the-fish-shell/</link><pubDate>Wed, 04 May 2016 00:00:00 +0000</pubDate><guid>https://tdhopper.com/blog/using-conda-with-the-fish-shell/</guid><description>&lt;img src="https://tdhopper.com/images/til.png" alt="Featured image of post Using Conda with the Fish Shell" /&gt;&lt;p&gt;I&amp;rsquo;m trying out the beautiful &lt;a class="link" href="https://fishshell.com/" target="_blank" rel="noopener"
&gt;Fish shell&lt;/a&gt;. I&amp;rsquo;ve been a Bash user up until now.&lt;/p&gt;
&lt;p&gt;I was bummed to see that &lt;code&gt;source activate ENV&lt;/code&gt; (for activating &lt;a class="link" href="https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html" target="_blank" rel="noopener"
&gt;Conda environments&lt;/a&gt;) didn&amp;rsquo;t work natively in Fish.&lt;/p&gt;
&lt;p&gt;Apparently the &lt;a class="link" href="https://groups.google.com/a/continuum.io/d/msg/conda/8aaYESlLZI8/Ov1UcFhWCAAJ" target="_blank" rel="noopener"
&gt;Conda team is adding native support for Fish&lt;/a&gt;, but it&amp;rsquo;s not available as of 2016-05-04.&lt;/p&gt;
&lt;p&gt;There are several projects developed to help in the meantime:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="http://pypi.python.org/pypi/conda-workon/0.3.0" target="_blank" rel="noopener"
&gt;conda-workon&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://github.com/fabioramponi/fish-conda-virtualenv" target="_blank" rel="noopener"
&gt;fish-conda-virtualenv&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://gist.github.com/jiffyclub/9679788" target="_blank" rel="noopener"
&gt;condactivate&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I&amp;rsquo;m going to wait for the Conda update to be released.&lt;/p&gt;</description></item><item><title>Concatenate files with header row</title><link>https://tdhopper.com/blog/concatenate-files-with-header-row/</link><pubDate>Mon, 02 May 2016 00:00:00 +0000</pubDate><guid>https://tdhopper.com/blog/concatenate-files-with-header-row/</guid><description>&lt;img src="https://tdhopper.com/images/til.png" alt="Featured image of post Concatenate files with header row" /&gt;&lt;p&gt;I needed to concatenate a bunch of CSV files while skipping the header row. There was a nice solution on &lt;a class="link" href="http://stackoverflow.com/questions/10103619/unix-merge-many-files-while-deleting-first-line-of-all-files" target="_blank" rel="noopener"
&gt;Stack Overflow&lt;/a&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-fallback" data-lang="fallback"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;find . -name &amp;#34;*.csv&amp;#34; | xargs -n 1 tail -n +2
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;With the GNU version of &lt;code&gt;tail&lt;/code&gt; (sadly not the one installed on OS X by default), you can just use&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-fallback" data-lang="fallback"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;tail -q -n +2 *.csv
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;or&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-fallback" data-lang="fallback"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;awk &amp;#39;FNR != 1&amp;#39; *.csv
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
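&lt;p&gt;The same idea as a small Python sketch (the function name is mine), dropping the first line of every matched file like the &lt;code&gt;awk&lt;/code&gt; version:&lt;/p&gt;

```python
import glob

def concat_without_headers(pattern, out_path):
    """Concatenate files matching pattern into out_path, skipping the
    first (header) line of each file, like `awk 'FNR != 1' *.csv`."""
    with open(out_path, "w") as out:
        for name in sorted(glob.glob(pattern)):
            with open(name) as f:
                next(f, None)  # skip the header row
                for line in f:
                    out.write(line)
```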
&lt;/div&gt;</description></item><item><title>Searching the tweets of people you follow</title><link>https://tdhopper.com/blog/searching-the-tweets-of-people-you-follow/</link><pubDate>Mon, 02 May 2016 00:00:00 +0000</pubDate><guid>https://tdhopper.com/blog/searching-the-tweets-of-people-you-follow/</guid><description>&lt;img src="https://tdhopper.com/images/til.png" alt="Featured image of post Searching the tweets of people you follow" /&gt;&lt;p&gt;This is more of a Today I Taught.&lt;/p&gt;
&lt;blockquote class="twitter-tweet" data-lang="en"&gt;&lt;p lang="en" dir="ltr"&gt;Turns out, it is actually possible to search my timeline. Thanks &lt;a href="https://twitter.com/tdhopper"&gt;@tdhopper&lt;/a&gt; for the tip. &lt;a href="https://t.co/7IsD76qszD"&gt;pic.twitter.com/7IsD76qszD&lt;/a&gt;&lt;/p&gt;&amp;mdash; Ole Zorn (@olemoritz) &lt;a href="https://twitter.com/olemoritz/status/727148272675115008"&gt;May 2, 2016&lt;/a&gt;&lt;/blockquote&gt;
&lt;script async src="//platform.twitter.com/widgets.js" charset="utf-8"&gt;&lt;/script&gt;
&lt;p&gt;I have a quick search in &lt;a class="link" href="https://www.alfredapp.com/" target="_blank" rel="noopener"
&gt;Alfred&lt;/a&gt; to search the tweets of people I follow. I launch Alfred with ⌘-Space and type &lt;code&gt;following {query}&lt;/code&gt;. Restricting a Twitter search to people you follow only requires adding the &lt;code&gt;s=follows&lt;/code&gt; query parameter to the search URL:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-fallback" data-lang="fallback"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;https://twitter.com/search?s=follows&amp;amp;q={query}
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;</description></item><item><title>Testing whether a Python string contains an integer</title><link>https://tdhopper.com/blog/testing-whether-a-python-string-contains-an-integer/</link><pubDate>Fri, 29 Apr 2016 15:13:00 +0000</pubDate><guid>https://tdhopper.com/blog/testing-whether-a-python-string-contains-an-integer/</guid><description>&lt;p&gt;If you want to check whether a Python string is an integer, you can try
casting to an int with &lt;code&gt;int()&lt;/code&gt; and catching the &lt;code&gt;ValueError&lt;/code&gt; if it&amp;rsquo;s not an
integer:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt;1
&lt;/span&gt;&lt;span class="lnt"&gt;2
&lt;/span&gt;&lt;span class="lnt"&gt;3
&lt;/span&gt;&lt;span class="lnt"&gt;4
&lt;/span&gt;&lt;span class="lnt"&gt;5
&lt;/span&gt;&lt;span class="lnt"&gt;6
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;is_integer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;base&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;base&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;base&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kc"&gt;True&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="ne"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kc"&gt;False&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
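&lt;p&gt;Unlike the string-method approach below, the try/except version handles signs and even other bases:&lt;/p&gt;

```python
def is_integer(value, *, base=10):
    try:
        int(value, base=base)
        return True
    except ValueError:
        return False

print(is_integer("-123"))         # True
print(is_integer("ff", base=16))  # True
print(is_integer("12.5"))         # False
```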
&lt;/div&gt;&lt;p&gt;To check for &lt;em&gt;nonnegative&lt;/em&gt; integers, you can use the &lt;a class="link" href="https://docs.python.org/3.8/library/stdtypes.html#str.isdigit" target="_blank" rel="noopener"
&gt;&lt;code&gt;str.isdigit()&lt;/code&gt;
method&lt;/a&gt;. It
will &amp;ldquo;return true if all characters in the string are digits and there is at
least one character, false otherwise&amp;rdquo;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt;1
&lt;/span&gt;&lt;span class="lnt"&gt;2
&lt;/span&gt;&lt;span class="lnt"&gt;3
&lt;/span&gt;&lt;span class="lnt"&gt;4
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;123&amp;#34;&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isdigit&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kc"&gt;True&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;-123&amp;#34;&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isdigit&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kc"&gt;False&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Thanks to &lt;a class="link" href="https://twitter.com/trochee/status/726162607460114433" target="_blank" rel="noopener"
&gt;Jeremy Kahn for reminding
me&lt;/a&gt; that &lt;code&gt;isdigit&lt;/code&gt; only
detects nonnegative integers.&lt;/p&gt;</description></item><item><title>Ultralight Backpacking for the Ultratall</title><link>https://tdhopper.com/blog/backpacking-for-the-very-tall/</link><pubDate>Fri, 29 Apr 2016 00:00:00 +0000</pubDate><guid>https://tdhopper.com/blog/backpacking-for-the-very-tall/</guid><description>&lt;p&gt;I created a single-page website to collect notes on one of my other hobbies: &lt;a class="link" href="https://backpacking.tdhopper.com/" target="_blank" rel="noopener"
&gt;ultralight backpacking&lt;/a&gt;. In particular, notes on ultralight gear for the very tall.&lt;/p&gt;</description></item><item><title>Don't Buffer Python's stdout</title><link>https://tdhopper.com/blog/dont-buffer-pythons-stdout/</link><pubDate>Tue, 26 Apr 2016 00:00:00 +0000</pubDate><guid>https://tdhopper.com/blog/dont-buffer-pythons-stdout/</guid><description>&lt;img src="https://tdhopper.com/images/til.png" alt="Featured image of post Don't Buffer Python's stdout" /&gt;&lt;p&gt;I was using &lt;a class="link" href="http://man7.org/linux/man-pages/man1/tee.1.html" target="_blank" rel="noopener"&gt;&lt;code&gt;tee&lt;/code&gt;&lt;/a&gt; with a long-running Python process, but I wasn&amp;rsquo;t seeing any output. This is a result of Python buffering the stdout stream. You can force Python to run in &lt;a class="link" href="https://docs.python.org/2/using/cmdline.html#cmdoption-u" target="_blank" rel="noopener"
&gt;unbuffered mode&lt;/a&gt; using the &lt;code&gt;-u&lt;/code&gt; flag at the command line.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Force stdin, stdout and stderr to be totally unbuffered. On systems where it matters, also put stdin, stdout and stderr in binary mode.&lt;/p&gt;
&lt;/blockquote&gt;</description></item><item><title>Econtalk</title><link>https://tdhopper.com/blog/econtalk/</link><pubDate>Wed, 13 Apr 2016 01:42:00 +0000</pubDate><guid>https://tdhopper.com/blog/econtalk/</guid><description>&lt;p&gt;Listening to Russ Roberts&amp;rsquo; &lt;a class="link" href="http://www.econtalk.org/" target="_blank" rel="noopener"
&gt;Econtalk&lt;/a&gt; podcast for
the last 5 years has given me a whole new perspective on the world. Roberts
has exposed me to a whole new way of economic thinking, refined my scientific
skepticism, and introduced me to copious topics and scholars.&lt;/p&gt;
&lt;p&gt;Here are episodes from over the years that I&amp;rsquo;ve particularly enjoyed.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="http://www.econtalk.org/archives/2006/06/the_economics_o_4.html" target="_blank" rel="noopener"
&gt;The Economics of Organ Donations&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="http://www.econtalk.org/archives/2006/07/hanushek_on_edu.html" target="_blank" rel="noopener"
&gt;Making Schools Better: A Conversation with Rick Hanushek&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="http://www.econtalk.org/archives/2007/04/boudreaux_on_th.html" target="_blank" rel="noopener"
&gt;Boudreaux on the Economics of &amp;ldquo;Buy Local&amp;rdquo;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="http://www.econtalk.org/archives/2011/01/deer_on_autism.html" target="_blank" rel="noopener"
&gt;Deer on Autism, Vaccination, and Scientific Fraud&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="http://www.econtalk.org/archives/2011/08/odonohoe_on_pot.html" target="_blank" rel="noopener"
&gt;O&amp;rsquo;Donohoe on Potato Chips and Salty Snacks&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="http://www.econtalk.org/archives/2012/02/david_owen_on_t.html" target="_blank" rel="noopener"
&gt;David Owen on the Environment, Unintended Consequences, and The Conundrum&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="http://www.econtalk.org/archives/2012/04/cowen_on_food.html" target="_blank" rel="noopener"
&gt;Cowen on Food&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="http://www.econtalk.org/archives/2012/12/lisa_turner_on.html" target="_blank" rel="noopener"
&gt;Lisa Turner on Organic Farming&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="http://www.econtalk.org/archives/2013/07/narlikar_on_fai.html" target="_blank" rel="noopener"
&gt;Narlikar on Fair Trade and Free Trade&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="http://www.econtalk.org/archives/2014/03/velasquez-manof.html" target="_blank" rel="noopener"
&gt;Velasquez-Manoff on Autoimmune Disease, Parasites, and Complexity&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="http://www.econtalk.org/archives/2015/03/david_skarbek_o.html" target="_blank" rel="noopener"
&gt;David Skarbek on Prison Gangs and the Social Order of the Underworld&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="http://www.econtalk.org/archives/2012/06/jonah_lehrer_on.html" target="_blank" rel="noopener"
&gt;Jonah Lehrer on Creativity and Imagine&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="http://www.econtalk.org/archives/2012/07/scott_atlas_on.html" target="_blank" rel="noopener"
&gt;Scott Atlas on American Health Care&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="http://www.econtalk.org/archives/2014/08/terry_anderson.html" target="_blank" rel="noopener"
&gt;Terry Anderson on the Environment and Property Rights&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="http://www.econtalk.org/archives/2015/04/leonard_wong_on.html" target="_blank" rel="noopener"
&gt;Leonard Wong on Honesty and Ethics in the Military&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="http://www.econtalk.org/archives/2015/05/eric_topol_on_t.html" target="_blank" rel="noopener"
&gt;Eric Topol on the Power of Patients in a Digital World&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="http://www.econtalk.org/archives/2015/06/adam_davidson_o_1.html" target="_blank" rel="noopener"
&gt;Adam Davidson on Hollywood and the Future of Work&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="http://www.econtalk.org/archives/2015/06/matt_ridley_on.html" target="_blank" rel="noopener"
&gt;Matt Ridley on Climate Change&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="http://www.econtalk.org/archives/2015/07/lee_ohanian_arn.html" target="_blank" rel="noopener"
&gt;Lee Ohanian, Arnold Kling, and John Cochrane on the Future of Freedom, Democracy, and Prosperity&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="http://www.econtalk.org/archives/2015/08/rachel_laudan_o.html" target="_blank" rel="noopener"
&gt;Rachel Laudan on the History of Food and Cuisine&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="http://www.econtalk.org/archives/2015/09/william_macaski.html" target="_blank" rel="noopener"
&gt;William MacAskill on Effective Altruism and Doing Good Better&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="http://www.econtalk.org/archives/2016/03/jayson_lusk_on.html" target="_blank" rel="noopener"
&gt;Jayson Lusk on Food, Technology, and Unnaturally Delicious&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;One of my favorite guests is Duke economist Mike Munger. Here are some great
interviews with him:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="http://www.econtalk.org/archives/2007/01/munger_on_price_1.html" target="_blank" rel="noopener"
&gt;Price Gouging&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="http://www.econtalk.org/archives/2013/09/munger_on_milk.html" target="_blank" rel="noopener"
&gt;On Milk&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="http://www.econtalk.org/archives/2007/04/mike_munger_on.html" target="_blank" rel="noopener"
&gt;Division of Labor&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="http://www.econtalk.org/archives/2008/10/munger_on_middl.html" target="_blank" rel="noopener"
&gt;Middlemen&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Incidentally, Priceonomics recently published a &lt;a class="link" href="http://priceonomics.com/russ-roberts-and-the-quest-to-make-economics/" target="_blank" rel="noopener"
&gt;great article&lt;/a&gt; on Roberts and Econtalk.&lt;/p&gt;</description></item><item><title>Column binding two Pandas DataFrames</title><link>https://tdhopper.com/blog/column-binding-two-pandas-dataframes/</link><pubDate>Mon, 11 Apr 2016 20:56:00 +0000</pubDate><guid>https://tdhopper.com/blog/column-binding-two-pandas-dataframes/</guid><description>&lt;p&gt;Joining two Pandas DataFrames with an equal number of rows is slightly harder
than it appears. In R, you just use the &lt;code&gt;cbind&lt;/code&gt; function.&lt;/p&gt;
&lt;p&gt;As &lt;a class="link" href="http://stackoverflow.com/questions/33088010/pandas-column-bind-cbind-two-data-frames" target="_blank" rel="noopener"
&gt;this&lt;/a&gt; StackOverflow question shows, in Pandas it&amp;rsquo;s easy to
end up with something like this:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt;1
&lt;/span&gt;&lt;span class="lnt"&gt;2
&lt;/span&gt;&lt;span class="lnt"&gt;3
&lt;/span&gt;&lt;span class="lnt"&gt;4
&lt;/span&gt;&lt;span class="lnt"&gt;5
&lt;/span&gt;&lt;span class="lnt"&gt;6
&lt;/span&gt;&lt;span class="lnt"&gt;7
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-fallback" data-lang="fallback"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;unique_id lacet_number latitude longitude
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;0 NaN NaN -93.193560 31.217029
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;1 NaN NaN -93.948082 35.360874
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;2 NaN NaN -103.131508 37.787609
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;15 5570613 TLA-0138365 NaN NaN
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;24 5025490 EMP-0138757 NaN NaN
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;36 4354431 DXN-0025343 NaN NaN
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;This results from the indices not being identical. Frustratingly (to me), the
&lt;code&gt;ignore_index&lt;/code&gt; argument doesn&amp;rsquo;t give the 3-row DataFrame I&amp;rsquo;d hoped for.&lt;/p&gt;
&lt;p&gt;As the &lt;a class="link" href="http://stackoverflow.com/a/33088410/982745" target="_blank" rel="noopener"
&gt;accepted answer&lt;/a&gt; on that
question shows, the thing to do is reset the indices on the DataFrames before
concatenating:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;concat&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;df_a&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reset_index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;drop&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;df_b&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reset_index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;drop&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)],&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
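&lt;p&gt;A minimal reproduction (with made-up data) showing why resetting the indices matters:&lt;/p&gt;

```python
import pandas as pd

# Two 3-row frames with non-overlapping indices, as you'd get after filtering
df_a = pd.DataFrame({"latitude": [1.0, 2.0, 3.0]}, index=[0, 1, 2])
df_b = pd.DataFrame({"unique_id": [10, 20, 30]}, index=[15, 24, 36])

# Aligning on the mismatched indices yields 6 rows, half of them NaN
naive = pd.concat([df_a, df_b], axis=1)

# Resetting both indices first gives the 3-row, cbind-style result
bound = pd.concat([df_a.reset_index(drop=True),
                   df_b.reset_index(drop=True)], axis=1)
print(naive.shape, bound.shape)  # (6, 2) (3, 2)
```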
&lt;/div&gt;</description></item><item><title>Printing Pandas Data Frames as HTML in Jupyter Notebooks</title><link>https://tdhopper.com/blog/printing-pandas-data-frames-as-html-in-jupyter-notebooks/</link><pubDate>Wed, 23 Mar 2016 14:18:00 +0000</pubDate><guid>https://tdhopper.com/blog/printing-pandas-data-frames-as-html-in-jupyter-notebooks/</guid><description>&lt;img src="https://tdhopper.com/panda.png" alt="Featured image of post Printing Pandas Data Frames as HTML in Jupyter Notebooks" /&gt;&lt;p&gt;Jupyter notebooks nicely render Pandas data frames if they&amp;rsquo;re the last line in
a cell. It renders the HTML version of the data frame returned by
&lt;code&gt;pandas.DataFrame.to_html()&lt;/code&gt;. However, if you call &lt;code&gt;print(df)&lt;/code&gt; in a cell, the
data frame is rendered in less readable text-based output.&lt;/p&gt;
&lt;p&gt;Despite using Notebooks regularly for years, I&amp;rsquo;d never bothered to figure out
a way around this. However, the solution is easy.&lt;/p&gt;
&lt;p&gt;Instead of &lt;code&gt;print(df)&lt;/code&gt; you use&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt;1
&lt;/span&gt;&lt;span class="lnt"&gt;2
&lt;/span&gt;&lt;span class="lnt"&gt;3
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;IPython.display&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;display&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;display&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;</description></item><item><title>Impala Doesn't Have a histogram() function</title><link>https://tdhopper.com/blog/impala-doesnt-have-a-histogram-function/</link><pubDate>Wed, 23 Mar 2016 00:00:00 +0000</pubDate><guid>https://tdhopper.com/blog/impala-doesnt-have-a-histogram-function/</guid><description>&lt;img src="https://tdhopper.com/images/til.png" alt="Featured image of post Impala Doesn't Have a histogram() function" /&gt;&lt;p&gt;&lt;a class="link" href="https://groups.google.com/a/cloudera.org/d/msg/impala-user/O8u5JFfKJPk/V0ZR1muNBAAJ" target="_blank" rel="noopener"
&gt;Impala Doesn&amp;rsquo;t Have a histogram() function&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Ibis&amp;rsquo;s histogram works by normalizing a column and rounding to integers:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt;1
&lt;/span&gt;&lt;span class="lnt"&gt;2
&lt;/span&gt;&lt;span class="lnt"&gt;3
&lt;/span&gt;&lt;span class="lnt"&gt;4
&lt;/span&gt;&lt;span class="lnt"&gt;5
&lt;/span&gt;&lt;span class="lnt"&gt;6
&lt;/span&gt;&lt;span class="lnt"&gt;7
&lt;/span&gt;&lt;span class="lnt"&gt;8
&lt;/span&gt;&lt;span class="lnt"&gt;9
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-sql" data-lang="sql"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;WITH&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;t0&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;AS&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="o"&gt;`&lt;/span&gt;&lt;span class="k"&gt;table_name&lt;/span&gt;&lt;span class="o"&gt;`&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;floor&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;t0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="o"&gt;`&lt;/span&gt;&lt;span class="k"&gt;column_name&lt;/span&gt;&lt;span class="o"&gt;`&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="o"&gt;`&lt;/span&gt;&lt;span class="n"&gt;min_1fe5be&lt;/span&gt;&lt;span class="o"&gt;`&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;13&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;t1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="o"&gt;`&lt;/span&gt;&lt;span class="n"&gt;max_1fe5be&lt;/span&gt;&lt;span class="o"&gt;`&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="o"&gt;`&lt;/span&gt;&lt;span class="n"&gt;min_1fe5be&lt;/span&gt;&lt;span class="o"&gt;`&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;13&lt;/span&gt;&lt;span 
class="p"&gt;))&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;NUM_BINS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;AS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;`&lt;/span&gt;&lt;span class="n"&gt;tmp&lt;/span&gt;&lt;span class="o"&gt;`&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;t0&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;CROSS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;JOIN&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;`&lt;/span&gt;&lt;span class="k"&gt;column_name&lt;/span&gt;&lt;span class="o"&gt;`&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;AS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;`&lt;/span&gt;&lt;span class="n"&gt;min_1fe5be&lt;/span&gt;&lt;span class="o"&gt;`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;`&lt;/span&gt;&lt;span class="k"&gt;column_name&lt;/span&gt;&lt;span class="o"&gt;`&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;AS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;`&lt;/span&gt;&lt;span class="n"&gt;max_1fe5be&lt;/span&gt;&lt;span class="o"&gt;`&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;t0&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;t1&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
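&lt;p&gt;In Python terms, the bin computation in that generated SQL looks roughly like this (a sketch of the formula only; &lt;code&gt;num_bins&lt;/code&gt; stands in for the &lt;code&gt;NUM_BINS&lt;/code&gt; value Ibis substitutes):&lt;/p&gt;

```python
import math

def bin_index(x, col_min, col_max, num_bins):
    # Mirrors the SQL: nudge the minimum down by 1e-13 so the smallest
    # value falls in bin 0, then divide the range into num_bins - 1 steps.
    adjusted_min = col_min - 1e-13
    step = (col_max - adjusted_min) / (num_bins - 1)
    return math.floor((x - adjusted_min) / step)
```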
&lt;/div&gt;</description></item><item><title>Creating Impala Tables from Pandas Dataframes</title><link>https://tdhopper.com/blog/creating-impala-tables-from-pandas-dataframes/</link><pubDate>Tue, 15 Mar 2016 00:00:00 +0000</pubDate><guid>https://tdhopper.com/blog/creating-impala-tables-from-pandas-dataframes/</guid><description>&lt;img src="https://tdhopper.com/images/til.png" alt="Featured image of post Creating Impala Tables from Pandas Dataframes" /&gt;&lt;p&gt;&lt;a class="link" href="http://wesmckinney.com/" title="Wes McKinney"
target="_blank" rel="noopener"
&gt;Wes McKinney&lt;/a&gt;&amp;rsquo;s &lt;a class="link" href="http://docs.ibis-project.org/index.html" target="_blank" rel="noopener"
&gt;Ibis&lt;/a&gt;, a Pythonic interface to Impala, has functionality for creating &lt;a class="link" href="https://www.cloudera.com/products/apache-hadoop/impala.html" title="Apache Impala"
target="_blank" rel="noopener"
&gt;Impala&lt;/a&gt; tables from &lt;a class="link" href="http://pandas.pydata.org/" title="Python Data Analysis Library &amp;amp;mdash; pandas: Python Data Analysis Library"
target="_blank" rel="noopener"
&gt;Python Pandas&lt;/a&gt; dataframes.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt;1
&lt;/span&gt;&lt;span class="lnt"&gt;2
&lt;/span&gt;&lt;span class="lnt"&gt;3
&lt;/span&gt;&lt;span class="lnt"&gt;4
&lt;/span&gt;&lt;span class="lnt"&gt;5
&lt;/span&gt;&lt;span class="lnt"&gt;6
&lt;/span&gt;&lt;span class="lnt"&gt;7
&lt;/span&gt;&lt;span class="lnt"&gt;8
&lt;/span&gt;&lt;span class="lnt"&gt;9
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;pd&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;ibis&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;hdfs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ibis&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hdfs_connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;webhdfs_host&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;port&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;webhdfs_port&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ibis&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;impala&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;impala_host&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;port&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;impala_port&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;hdfs_client&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;hdfs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;db&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;database&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;ibis_testing&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;foo&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;bar&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;a&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;b&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;c&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;d&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]})&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;create_table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;pandas_table&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;&lt;a class="link" href="http://docs.ibis-project.org/impala.html#creating-tables" target="_blank" rel="noopener"
&gt;This functionality&lt;/a&gt;, added in Ibis 0.6.0, is &lt;em&gt;much&lt;/em&gt; easier than manually moving data to HDFS and loading it into Impala.&lt;/p&gt;</description></item><item><title>Galileo's Conflict was Science vs. Science</title><link>https://tdhopper.com/blog/galileos-conflict-was-science-vs.-science/</link><pubDate>Fri, 11 Mar 2016 00:00:00 +0000</pubDate><guid>https://tdhopper.com/blog/galileos-conflict-was-science-vs.-science/</guid><description>&lt;img src="https://tdhopper.com/images/til.png" alt="Featured image of post Galileo's Conflict was Science vs. Science" /&gt;&lt;p&gt;Galileo&amp;rsquo;s conflict with the Church was not as it is often portrayed:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;In January of 1616, the month before before the Roman Inquisition would infamously condemn the Copernican theory as being &amp;ldquo;foolish and absurd in philosophy&amp;rdquo;, Monsignor Francesco Ingoli addressed Galileo Galilei with an essay entitled &amp;ldquo;Disputation concerning the location and rest of Earth against the system of Copernicus&amp;rdquo;. &amp;hellip; The essay, upon which the Inquisition condemnation was likely based, lists mathematical, physical, and theological arguments against the Copernican theory. Ingoli asks Galileo to respond to those mathematical and physical arguments that are &amp;ldquo;more weighty&amp;rdquo;, and does not ask him to respond to the theological arguments at all. &amp;hellip; Ingoli&amp;rsquo;s emphasis on the scientific arguments of Brahe, and his lack of emphasis on theological arguments, raises the question of whether the condemnation of the Copernican theory was, in contrast to how it is usually viewed, essentially scientific in nature, following the ideas of Brahe.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;From &lt;a class="link" href="http://arxiv.org/abs/1211.4244" target="_blank" rel="noopener"
&gt;Francesco Ingoli&amp;rsquo;s essay to Galileo: Tycho Brahe and science in the Inquisition&amp;rsquo;s condemnation of the Copernican theory&lt;/a&gt;.&lt;/p&gt;</description></item><item><title>Using rbenv for Ruby Versions</title><link>https://tdhopper.com/blog/using-rbenv-for-ruby-versions/</link><pubDate>Fri, 11 Mar 2016 00:00:00 +0000</pubDate><guid>https://tdhopper.com/blog/using-rbenv-for-ruby-versions/</guid><description>&lt;img src="https://tdhopper.com/images/til.png" alt="Featured image of post Using rbenv for Ruby Versions" /&gt;&lt;p&gt;I don&amp;rsquo;t use Ruby much, but I wanted to use Jekyll for this blog. I kept getting this error when running &lt;code&gt;gem install jekyll&lt;/code&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt;1
&lt;/span&gt;&lt;span class="lnt"&gt;2
&lt;/span&gt;&lt;span class="lnt"&gt;3
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-fallback" data-lang="fallback"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;Fetching: colorator-0.1.gem (100%)
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;ERROR: While executing gem ... (Gem::FilePermissionError)
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; You don&amp;#39;t have write permissions for the /Library/Ruby/Gems/2.0.0 directory.
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;I installed rbenv with &lt;code&gt;brew install rbenv&lt;/code&gt; and added&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="o"&gt;[[&lt;/span&gt; -s &lt;span class="nv"&gt;$HOME&lt;/span&gt;/.rbenv/bin &lt;span class="o"&gt;]]&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;export&lt;/span&gt; &lt;span class="nv"&gt;PATH&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/.rbenv/bin:&lt;/span&gt;&lt;span class="nv"&gt;$PATH&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;eval&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;&lt;/span&gt;&lt;span class="k"&gt;$(&lt;/span&gt;rbenv init -&lt;span class="k"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;to my &lt;code&gt;.bashrc&lt;/code&gt; file. Then I changed the default ruby interpreter from the system interpreter to one managed by rbenv with &lt;code&gt;rbenv install 2.2.3 &amp;amp;&amp;amp; rbenv global 2.2.3&lt;/code&gt;.&lt;/p&gt;</description></item><item><title>Stanford Encyclopedia of Philosophy Bot</title><link>https://tdhopper.com/blog/tweet-your-moon/</link><pubDate>Sun, 28 Feb 2016 00:00:00 +0000</pubDate><guid>https://tdhopper.com/blog/tweet-your-moon/</guid><description>&lt;p&gt;I enjoy the &lt;a class="link" href="" &gt;Stanford Encyclopedia of Philosophy&lt;/a&gt;, so I created a &lt;a class="link" href="" &gt;Twitter account&lt;/a&gt; that tweets links to random articles from it.&lt;/p&gt;</description></item><item><title>Mentions of John Cook on Github</title><link>https://tdhopper.com/blog/mentions-of-john-cook-on-github/</link><pubDate>Thu, 14 Jan 2016 02:59:00 +0000</pubDate><guid>https://tdhopper.com/blog/mentions-of-john-cook-on-github/</guid><description>&lt;p&gt;People mention &lt;a class="link" href="http://www.johndcook.com" target="_blank" rel="noopener"
&gt;John Cook&amp;rsquo;s blog&lt;/a&gt; a lot &lt;a class="link" href="https://github.com/search?q=johndcook.com&amp;amp;type=Code&amp;amp;utf8=%E2%9C%93" target="_blank" rel="noopener"
&gt;in GitHub
repos&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I scraped the GitHub search pages to try to figure out which of his pages are
most mentioned. His post &lt;a class="link" href="http://www.johndcook.com/blog/standard_deviation" target="_blank" rel="noopener"
&gt;Accurately computing running
variance&lt;/a&gt; gets many more
mentions than any other post. It provides C++ code for Knuth&amp;rsquo;s algorithm for
computing the mean, sample variance, and standard deviation for a stream of
data.&lt;/p&gt;
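&lt;p&gt;As a rough Python sketch of that algorithm (his post provides C++; the class and method names here are my own):&lt;/p&gt;

```python
class RunningStats:
    """Single-pass (Knuth/Welford) running mean and sample variance."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def push(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def variance(self):
        # Sample variance; needs at least two observations.
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0
```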
&lt;p&gt;Here are the top 12 pages from his site most linked on GitHub:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;a class="link" href="http://www.johndcook.com/blog/standard_deviation" target="_blank" rel="noopener"
&gt;Accurately computing running variance&lt;/a&gt; (377 mentions)&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="http://www.johndcook.com/stand_alone_code.html" target="_blank" rel="noopener"
&gt;Stand-alone code for numerical computing&lt;/a&gt; (58 mentions)&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="http://www.johndcook.com/blog/2011/09/27/bayesian-amazon" target="_blank" rel="noopener"
&gt;A Bayesian view of Amazon Resellers&lt;/a&gt; (52 mentions)&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="http://www.johndcook.com/blog" target="_blank" rel="noopener"
&gt;johndcook.com/blog&lt;/a&gt; (47 mentions)&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="http://www.johndcook.com/python_longitude_latitude.html" target="_blank" rel="noopener"
&gt;Computing the distance between two locations on Earth from coordinates&lt;/a&gt; (44 mentions)&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="http://www.johndcook.com/math_h.html" target="_blank" rel="noopener"
&gt;Math.h in POSIX, ISO, and Visual Studio&lt;/a&gt; (38 mentions)&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="http://www.johndcook.com/blog/2009/08/24/algorithms-convert-color-grayscale" target="_blank" rel="noopener"
&gt;Three algorithms for converting color to grayscale&lt;/a&gt; (26 mentions)&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="http://www.johndcook.com" target="_blank" rel="noopener"
&gt;johndcook.com&lt;/a&gt; (21 mentions)&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="http://www.johndcook.com/blog/skewness_kurtosis" target="_blank" rel="noopener"
&gt;Computing skewness and kurtosis in one pass&lt;/a&gt; (20 mentions)&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="http://www.johndcook.com/blog/2010/06/02/whats-so-hard-about-finding-a-hypotenuse" target="_blank" rel="noopener"
&gt;What’s so hard about finding a hypotenuse?&lt;/a&gt; (19 mentions)&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="http://www.johndcook.com/simplerng.cpp" target="_blank" rel="noopener"
&gt;Random number generation in C++&lt;/a&gt; (19 mentions)&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="http://www.johndcook.com/r_language_for_programmers.html" target="_blank" rel="noopener"
&gt;R language for programmers&lt;/a&gt; (19 mentions)&lt;/li&gt;
&lt;/ol&gt;</description></item><item><title>Quotes from Former Professors</title><link>https://tdhopper.com/blog/quotes-from-former-professors/</link><pubDate>Tue, 12 Jan 2016 00:00:00 +0000</pubDate><guid>https://tdhopper.com/blog/quotes-from-former-professors/</guid><description>&lt;blockquote class="twitter-tweet" lang="en"&gt;&lt;p lang="en" dir="ltr"&gt;I&amp;#39;ll never forget the great words of one of my best math professors:&amp;#10;&amp;#10;&amp;quot;If all I wanted was the answer, I sure as hell wouldn&amp;#39;t ask you.&amp;quot;&lt;/p&gt;&amp;mdash; Tim Hopper (@tdhopper) &lt;a href="https://twitter.com/tdhopper/status/410071644896382977"&gt;December 9, 2013&lt;/a&gt;&lt;/blockquote&gt;
&lt;script async src="//platform.twitter.com/widgets.js" charset="utf-8"&gt;&lt;/script&gt;
&lt;blockquote class="twitter-tweet" lang="en"&gt;&lt;p lang="en" dir="ltr"&gt;Another great former professor quote:&amp;#10;&amp;#10;&amp;quot;I&amp;#39;m not smarter than you. I can just recover from mistakes faster than you.&amp;quot;&lt;/p&gt;&amp;mdash; Tim Hopper (@tdhopper) &lt;a href="https://twitter.com/tdhopper/status/410084681443323904"&gt;December 9, 2013&lt;/a&gt;&lt;/blockquote&gt;
&lt;script async src="//platform.twitter.com/widgets.js" charset="utf-8"&gt;&lt;/script&gt;</description></item><item><title>I Love Twitter</title><link>https://tdhopper.com/blog/i-love-twitter/</link><pubDate>Mon, 07 Dec 2015 00:00:00 +0000</pubDate><guid>https://tdhopper.com/blog/i-love-twitter/</guid><description>&lt;blockquote class="twitter-tweet" lang="en"&gt;&lt;p lang="en" dir="ltr"&gt;&lt;a href="https://twitter.com/tdhopper"&gt;@tdhopper&lt;/a&gt; my father wrote his PhD on this question, and I believe his answer was &amp;quot;depends&amp;quot;&lt;/p&gt;&amp;mdash; Fred Benenson (@fredbenenson) &lt;a href="https://twitter.com/fredbenenson/status/673926014406434816"&gt;December 7, 2015&lt;/a&gt;&lt;/blockquote&gt;
&lt;script async src="//platform.twitter.com/widgets.js" charset="utf-8"&gt;&lt;/script&gt;</description></item><item><title>My Python Environment Workflow with Conda</title><link>https://tdhopper.com/blog/my-python-environment-workflow-with-conda/</link><pubDate>Tue, 24 Nov 2015 14:41:00 +0000</pubDate><guid>https://tdhopper.com/blog/my-python-environment-workflow-with-conda/</guid><description>&lt;img src="https://tdhopper.com/images/anaconda.png" alt="Featured image of post My Python Environment Workflow with Conda" /&gt;&lt;p&gt;&lt;strong&gt;If you enjoy this post, check out my &lt;a class="link" href="https://pydevtools.com" target="_blank" rel="noopener"
&gt;Python Developer Tooling Handbook&lt;/a&gt;!&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Many new Python programmers rely on their system install of Python to run
their scripts. There are several good reasons to stop using the system Python.
First, it&amp;rsquo;s probably an old version of Python. Second, if you install 3rd
party packages with &lt;a class="link" href="https://pypi.python.org/pypi/pip" target="_blank" rel="noopener"
&gt;pip&lt;/a&gt;, every package is
installed into the same globally accessible directory. While this may sound
convenient, it causes problems if you (1) install different packages with the
same name, (2) need to use different versions of the same package, or (3) upgrade
your operating system (OS X will delete all the packages you have installed).&lt;/p&gt;
&lt;p&gt;For many years, best practice for Python developers was to use virtualenv to
create a sandbox-ed environment for each project. If you use virtualenv, each
project you work on can have its own version of Python with its own 3rd party
packages (hopefully specified in a &lt;code&gt;requirements.txt&lt;/code&gt; file). In my
experience, getting started with virtualenv is cumbersome and confusing; to
this day, I have to look up the command to create a Python 3 virtualenv.
virtualenv also provides no help in actually managing Python versions. You
have to install each version yourself and then tell virtualenv to use it.&lt;/p&gt;
&lt;p&gt;In 2015, I have almost exclusively used Python installations provided through
&lt;a class="link" href="https://www.continuum.io/" title="Continuum"
target="_blank" rel="noopener"
&gt;Continuum Analytics&amp;rsquo;s&lt;/a&gt;
&lt;a class="link" href="http://conda.pydata.org/docs/intro.html" title="Intro to conda —
Conda documentation"
target="_blank" rel="noopener"
&gt;Conda/Anaconda&lt;/a&gt; platform. I have also switched from using virtualenvs
to using &lt;a class="link" href="http://conda.pydata.org/docs/using/envs.html" title="Managing environments — Conda documentation"
target="_blank" rel="noopener"
&gt;conda environments&lt;/a&gt;, and I am loving it.&lt;/p&gt;
&lt;p&gt;Before explaining my workflow, here&amp;rsquo;s a quick glossary of the similarly-named
products that Continuum offers.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;a class="link" href="http://conda.pydata.org/docs/intro.html" target="_blank" rel="noopener"
&gt;conda&lt;/a&gt;: &amp;ldquo;Conda is an open source package management system and environment management system for installing multiple versions of software packages and their dependencies and switching easily between them. It works on Linux, OS X and Windows, and was &lt;em&gt;created for Python programs but can package and distribute any software.&lt;/em&gt; &amp;quot; A conda install provides a whole suite of command line tools for installing and managing packages and environments. Because conda works for any software, it can even install different versions of Python (unlike pip).&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://www.continuum.io/downloads" target="_blank" rel="noopener"
&gt;Anaconda&lt;/a&gt;: &amp;ldquo;Anaconda is a completely free Python distribution (including for commercial use and redistribution). It includes more than 300 of the most popular Python packages for science, math, engineering, and data analysis.&amp;rdquo; It is available across platforms and installable through a binary.&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://anaconda.org" target="_blank" rel="noopener"
&gt;Anaconda Cloud&lt;/a&gt;: Also known as Anaconda.org and formerly known as Binstar, &amp;ldquo;Anaconda Cloud is a package management service where you can host software packages of all kinds.&amp;rdquo; Anaconda Cloud is a package repository analogous to PyPI. Packages are installed via the conda command line tool instead of pip. By default, the &lt;code&gt;conda install&lt;/code&gt; command installs packages from a curated collection of packages (a superset of those in Anaconda). Continuum allows users to host their own packages on Anaconda Cloud; these packages can also be installed through &lt;code&gt;conda install&lt;/code&gt; using the &lt;code&gt;-c&lt;/code&gt; flag with the username.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Conda, Anaconda, and Anaconda cloud are distinct but interrelated tools;
keeping them straight can be hard, but is helpful.&lt;/p&gt;
&lt;p&gt;Conda (the package manager) can be installed in two ways. Through the
&lt;a class="link" href="http://conda.pydata.org/miniconda.html" target="_blank" rel="noopener"
&gt;Miniconda&lt;/a&gt; installer or the
&lt;a class="link" href="https://www.continuum.io/downloads" target="_blank" rel="noopener"
&gt;Anaconda&lt;/a&gt; installer. Both install the
package manager, but the latter also installs the 300+ packages for scientific
Python. (Installing Anaconda is equivalent to installing Miniconda and then
running &lt;code&gt;conda install anaconda&lt;/code&gt;.)&lt;/p&gt;
&lt;h2 id="conda-environment-files"&gt;Conda Environment Files
&lt;/h2&gt;&lt;p&gt;It has become standard for pip users to create a &lt;code&gt;requirements.txt&lt;/code&gt; file for
specifying dependencies for a particular project. Often, a developer working on a
project will (1) create and activate a virtual environment and (2) run &lt;code&gt;pip install -r requirements.txt&lt;/code&gt; to build an isolated development environment with
the needed packages.&lt;/p&gt;
&lt;p&gt;Conda provides an analogous (but more powerful) file: &lt;code&gt;environment.yml&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;A simple &lt;code&gt;environment.yml&lt;/code&gt; file might look like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;name: numpy-env
dependencies:
- python=3
- numpy
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If you are in a directory containing this file, you can run &lt;code&gt;$ conda env create&lt;/code&gt; to create a Conda environment named &lt;code&gt;numpy-env&lt;/code&gt; that runs Python 3 and
has &lt;a class="link" href="http://www.numpy.org/" title="NumPy — Numpy"
target="_blank" rel="noopener"
&gt;numpy&lt;/a&gt; installed. Run &lt;code&gt;$ source activate numpy-env&lt;/code&gt; to activate this environment. Once activated,
running &lt;code&gt;$ python&lt;/code&gt; will run Python 3 from your environment instead of the
globally installed Python for your system. Moreover, you will be able to
&lt;code&gt;import numpy&lt;/code&gt; but not any of the 3rd party packages installed globally.&lt;/p&gt;
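&lt;p&gt;A quick way to confirm which interpreter is active from inside Python itself:&lt;/p&gt;

```python
import sys

# Inside an activated conda environment, this path points into the
# environment's directory rather than the system-wide Python install.
print(sys.executable)
```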
&lt;p&gt;&lt;code&gt;environment.yml&lt;/code&gt; can also install packages via pip with this syntax:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;name: pip-env
dependencies:
- python
- pip
- pip:
- pypi-package-name
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I see &lt;code&gt;environment.yml&lt;/code&gt; files as an improvement over
&lt;code&gt;requirements.txt&lt;/code&gt; files for several reasons. Foremost, they allow you to
specify the version of Python you want to use. At Pydata NYC 2015, many
presenters provided their code in Github repositories without specifying
anywhere whether they were using Python 2 or 3. Because I &lt;a class="link" href="https://github.com/tdhopper/pydata-nyc-2015/blob/55b9d2892b18e1d191325fc1890740901723dcfd/environment.yml" target="_blank" rel="noopener"
&gt;included a YAML file&lt;/a&gt;,
attendees could see exactly what version I was using and quickly install it
with &lt;code&gt;conda env create&lt;/code&gt;. I also like being able to specify the name of the
environment in the file; this is particularly helpful when working with
others. Finally, because conda can install from PyPI via pip,
&lt;code&gt;environment.yml&lt;/code&gt; files provide no less functionality than a
&lt;code&gt;requirements.txt&lt;/code&gt; file.&lt;/p&gt;
&lt;h2 id="my-python-environment-workflow"&gt;My Python Environment Workflow
&lt;/h2&gt;&lt;p&gt;Lately, whenever I am working on a new project (however big or small), I
follow these steps:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Create a project folder in the &lt;code&gt;~/repos/&lt;/code&gt; directory on my computer.&lt;/li&gt;
&lt;li&gt;Create an &lt;code&gt;environment.yml&lt;/code&gt; file in the directory. Typically the environment name will be the same as the folder name. At minimum, it will specify the version of Python I want to use; it will often include &lt;code&gt;anaconda&lt;/code&gt; as a dependency.&lt;/li&gt;
&lt;li&gt;Create the conda environment with &lt;code&gt;$ conda env create&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Activate the conda environment with &lt;code&gt;$ source activate ENV_NAME&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Create a &lt;code&gt;.env&lt;/code&gt; file containing the line &lt;code&gt;source activate ENV_NAME&lt;/code&gt;. Because I have &lt;a class="link" href="https://github.com/kennethreitz/autoenv" target="_blank" rel="noopener"
&gt;autoenv&lt;/a&gt; installed, this file will be run every time I navigate to the project folder in the Terminal. Therefore, my conda environment will be activated as soon as I navigate to the folder.&lt;/li&gt;
&lt;li&gt;Run &lt;code&gt;$ git init&lt;/code&gt; to make the folder a Git repository. I then run &lt;code&gt;$ git add environment.yml &amp;amp;&amp;amp; git commit -m 'initial commit'&lt;/code&gt; to add the YAML file to the repository.&lt;/li&gt;
&lt;li&gt;If I want to push the repository to GitHub, I run &lt;code&gt;$ git create&lt;/code&gt; using GitHub&amp;rsquo;s &lt;a class="link" href="https://github.com/github/hub" target="_blank" rel="noopener"
&gt;hub&lt;/a&gt; commands. I then push the master branch with &lt;code&gt;$ git push -u origin master&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;
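The command-line half of these steps (3 through 6) can also be captured as data. The function below is my own sketch for illustration; it returns the commands as strings rather than running them:

```python
def project_setup_commands(env_name):
    # Steps 3-6 of the workflow above, encoded as shell commands.
    # Illustrative only: returns the recipe instead of executing it.
    return [
        "conda env create",
        "source activate " + env_name,
        'echo "source activate ' + env_name + '" > .env',
        "git init",
        "git add environment.yml && git commit -m 'initial commit'",
    ]

for cmd in project_setup_commands("my-project"):
    print(cmd)
```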
&lt;p&gt;As I add dependencies to my project, I make sure to add them to my
&lt;code&gt;environment.yml&lt;/code&gt; file.&lt;/p&gt;
&lt;p&gt;A major benefit of all this is how easily reproducible a development
environment becomes. If a colleague or conference attendee wants to run my
code, they can set up the dependencies (&lt;em&gt;including&lt;/em&gt; the Python version) by (1)
cloning the repository, (2) running &lt;code&gt;$ conda env create&lt;/code&gt;, and (3) running &lt;code&gt;$ source activate ENV_NAME&lt;/code&gt;. It&amp;rsquo;s easy enough for me to put those steps,
along with instructions for running the code, in a README file. If I&amp;rsquo;m feeling
especially helpful, I&amp;rsquo;ll create a
&lt;a class="link" href="http://mrbook.org/blog/tutorials/make/" title="Makefiles – Mrbook&amp;#39;s
Stuff"
target="_blank" rel="noopener"
&gt;Makefile&lt;/a&gt; or &lt;a class="link" href="http://www.fabfile.org/" title="Welcome to Fabric! — Fabric
documentation"
target="_blank" rel="noopener"
&gt;Fabfile&lt;/a&gt; to encapsulate commands for core functionality of the code.&lt;/p&gt;
&lt;p&gt;An even larger benefit is that I can return to a project after days, months,
or years and quickly start developing without first having to hunt for &lt;code&gt;print&lt;/code&gt;
statements to figure out whether I was using Python 2 or 3.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;ve come to love &lt;code&gt;environment.yml&lt;/code&gt; files, and I think you might too.&lt;/p&gt;</description></item><item><title>Sequential Minimal Optimization Algorithm for Support Vector Machines</title><link>https://tdhopper.com/blog/sequential-minimal-optimization-algorithm-for-support-vector-machines/</link><pubDate>Sat, 21 Nov 2015 00:00:00 +0000</pubDate><guid>https://tdhopper.com/blog/sequential-minimal-optimization-algorithm-for-support-vector-machines/</guid><description>&lt;p&gt;In my nonlinear optimization class in grad school at North Carolina State University, I wrote a paper on the famed SMO algorithm for support vector machines. In particular, I derive the Lagrangian dual of the classic formulation of the SVM optimization model and show how it can be solved using the stochastic gradient descent algorithm.&lt;/p&gt;
&lt;p&gt;You can &lt;a class="link" href="https://github.com/tdhopper/smo-svm/blob/master/OR706%20Support%20Vector%20Machines.pdf" target="_blank" rel="noopener"
&gt;find the paper here&lt;/a&gt;.&lt;/p&gt;</description></item><item><title>Knitting</title><link>https://tdhopper.com/blog/knitting/</link><pubDate>Fri, 20 Nov 2015 00:00:00 +0000</pubDate><guid>https://tdhopper.com/blog/knitting/</guid><description>&lt;p&gt;True story: I&amp;rsquo;m a closet knitter. I don&amp;rsquo;t have much time for it these days, but it helped keep me sane in grad school. Here are some things I&amp;rsquo;ve made over the years.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://tdhopper.com/uploads/2015/11/knitting1.JPG"
loading="lazy"
alt="knitting1.JPG"
&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src="https://tdhopper.com/uploads/2015/11/knitting2.JPG"
loading="lazy"
alt="knitting2.JPG"
&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src="https://tdhopper.com/uploads/2015/11/knitting3.jpg"
loading="lazy"
alt="knitting3.jpg"
&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src="https://tdhopper.com/uploads/2015/11/knitting4.jpg"
loading="lazy"
alt="knitting4.jpg"
&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src="https://tdhopper.com/uploads/2015/11/knitting5.JPG"
loading="lazy"
alt="knitting5.JPG"
&gt;&lt;/p&gt;</description></item><item><title>Using Twitter Data to Gain Insights into E-cigarette Marketing and Locations of Use</title><link>https://tdhopper.com/blog/my-first-publication/</link><pubDate>Fri, 06 Nov 2015 00:00:00 +0000</pubDate><guid>https://tdhopper.com/blog/my-first-publication/</guid><description>&lt;img src="https://tdhopper.com/images/ecig.png" alt="Featured image of post Using Twitter Data to Gain Insights into E-cigarette Marketing and Locations of Use" /&gt;&lt;p&gt;When I worked at &lt;a class="link" href="http://www.rti.org/" title="RTI International"
target="_blank" rel="noopener"
&gt;RTI International&lt;/a&gt;, I worked on an exploratory analysis of Twitter discussion of electronic cigarettes. A paper on our work was just published in the &lt;a class="link" href="http://www.jmir.org/" title="JMIR-Journal of Medical Internet Research"
target="_blank" rel="noopener"
&gt;Journal of Medical Internet Research&lt;/a&gt;: &lt;a class="link" href="http://www.jmir.org/2015/11/e251/" target="_blank" rel="noopener"
&gt;Using Twitter Data to Gain Insights into E-cigarette Marketing and Locations of Use: An Infoveillance Study&lt;/a&gt;.&lt;sup id="fnref:1"&gt;&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref"&gt;1&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Marketing and use of electronic cigarettes (e-cigarettes) and other electronic nicotine delivery devices have increased exponentially in recent years fueled, in part, by marketing and word-of-mouth communications via social media platforms, such as Twitter. &amp;hellip; We identified approximately 1.7 million tweets about e-cigarettes between 2008 and 2013, with the majority of these tweets being advertising (93.43%, 1,559,508/1,669,123). Tweets about e-cigarettes increased more than tenfold between 2009 and 2010, suggesting a rapid increase in the popularity of e-cigarettes and marketing efforts. The Twitter handles tweeting most frequently about e-cigarettes were a mixture of e-cigarette brands, affiliate marketers, and resellers of e-cigarette products. Of the 471 e-cigarette tweets mentioning a specific place, most mentioned e-cigarette use in class (39.1%, 184/471) followed by home/room/bed (12.5%, 59/471), school (12.1%, 57/471), in public (8.7%, 41/471), the bathroom (5.7%, 27/471), and at work (4.5%, 21/471).&lt;/p&gt;
&lt;/blockquote&gt;
&lt;div class="footnotes" role="doc-endnotes"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:1"&gt;
&lt;p&gt;I have no idea what &amp;ldquo;Infoveillance&amp;rdquo; means.&amp;#160;&lt;a href="#fnref:1" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</description></item><item><title>Nonparametric Latent Dirichlet Allocation</title><link>https://tdhopper.com/blog/wrapping-up-on-nonparametric-bayes/</link><pubDate>Fri, 16 Oct 2015 00:00:00 +0000</pubDate><guid>https://tdhopper.com/blog/wrapping-up-on-nonparametric-bayes/</guid><description>&lt;p&gt;Today is my last day at &lt;a class="link" href="http://qadium.com" target="_blank" rel="noopener"
&gt;Qadium&lt;/a&gt;. Next week, I am joining the data science team at &lt;a class="link" href="http://distilnetworks.com" target="_blank" rel="noopener"
&gt;Distil Networks&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;ve been privileged to work with &lt;a class="link" href="http://ericjonas.com/" target="_blank" rel="noopener"
&gt;Eric Jonas&lt;/a&gt; on the &lt;a class="link" href="http://datamicroscopes.github.io" target="_blank" rel="noopener"
&gt;data microscopes&lt;/a&gt; project for the past 8 months. In particular, I contributed the implementation of &lt;a class="link" href="https://github.com/datamicroscopes/lda" target="_blank" rel="noopener"
&gt;Nonparametric Latent Dirichlet Allocation&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I published a collection of &lt;a class="link" href="https://tdhopper.com/blog/notes-on-dirichlet-processes/" &gt;notes on nonparametric Bayesian methods and Latent Dirichlet Allocation&lt;/a&gt;. I hope this will be useful to other students and researchers of these methods.&lt;/p&gt;</description></item><item><title>Tweets I'm Proud Of (2)</title><link>https://tdhopper.com/blog/tweets-im-proud-of-2/</link><pubDate>Fri, 16 Oct 2015 00:00:00 +0000</pubDate><guid>https://tdhopper.com/blog/tweets-im-proud-of-2/</guid><description>&lt;p&gt;On MCMC:&lt;/p&gt;
&lt;blockquote class="twitter-tweet" lang="en"&gt;&lt;p lang="en" dir="ltr"&gt;Any sufficiently advanced MCMC is indistinguishable from magic. &amp;#10;&amp;#10;Very slow magic.&lt;/p&gt;&amp;mdash; Tim Hopper (@tdhopper) &lt;a href="https://twitter.com/tdhopper/status/654023792667291648"&gt;October 13, 2015&lt;/a&gt;&lt;/blockquote&gt;
&lt;script async src="//platform.twitter.com/widgets.js" charset="utf-8"&gt;&lt;/script&gt;
&lt;p&gt;On nonparametric Bayesian methods:&lt;/p&gt;
&lt;blockquote class="twitter-tweet" lang="en"&gt;&lt;p lang="en" dir="ltr"&gt;Nonparametric methods are an advanced technique for replacing intuitive parameters with unintuitive hyperparameters &amp;amp; way more computation.&lt;/p&gt;&amp;mdash; Tim Hopper (@tdhopper) &lt;a href="https://twitter.com/tdhopper/status/649679350510424064"&gt;October 1, 2015&lt;/a&gt;&lt;/blockquote&gt;
&lt;script async src="//platform.twitter.com/widgets.js" charset="utf-8"&gt;&lt;/script&gt;
&lt;p&gt;On code:&lt;/p&gt;
&lt;blockquote class="twitter-tweet" lang="en"&gt;&lt;p lang="en" dir="ltr"&gt;The cost of a line of code is larger than the time it takes to write it.&lt;/p&gt;&amp;mdash; Tim Hopper (@tdhopper) &lt;a href="https://twitter.com/tdhopper/status/654386934672924673"&gt;October 14, 2015&lt;/a&gt;&lt;/blockquote&gt;
&lt;script async src="//platform.twitter.com/widgets.js" charset="utf-8"&gt;&lt;/script&gt;
&lt;p&gt;On key pressers:&lt;/p&gt;
&lt;blockquote class="twitter-tweet" lang="en"&gt;&lt;p lang="en" dir="ltr"&gt;&amp;quot;We call people like these two young programmers &amp;#39;key pressers.&amp;#39;&amp;quot;&amp;#10;&amp;#10;(from SEAL Team Six) &lt;a href="http://t.co/VFxZbafLlG"&gt;pic.twitter.com/VFxZbafLlG&lt;/a&gt;&lt;/p&gt;&amp;mdash; Tim Hopper (@tdhopper) &lt;a href="https://twitter.com/tdhopper/status/653902530406187008"&gt;October 13, 2015&lt;/a&gt;&lt;/blockquote&gt;
&lt;script async src="//platform.twitter.com/widgets.js" charset="utf-8"&gt;&lt;/script&gt;</description></item><item><title>Sample from Antoniak Distribution with Python</title><link>https://tdhopper.com/blog/antoniak/</link><pubDate>Thu, 15 Oct 2015 00:00:00 +0000</pubDate><guid>https://tdhopper.com/blog/antoniak/</guid><description>&lt;img src="https://tdhopper.com/images/antoniak-mr-men.png" alt="Featured image of post Sample from Antoniak Distribution with Python" /&gt;&lt;p&gt;&lt;code&gt;rand_antoniak&lt;/code&gt; draws a sample from the distribution of tables created by a Chinese restaurant process with parameter &lt;code&gt;alpha&lt;/code&gt; after &lt;code&gt;n&lt;/code&gt; patrons are seated. Some notes on this distribution are &lt;a class="link" href="http://www.cs.cmu.edu/~tss/antoniak.pdf" target="_blank" rel="noopener"
&gt;available here&lt;/a&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt; 1
&lt;/span&gt;&lt;span class="lnt"&gt; 2
&lt;/span&gt;&lt;span class="lnt"&gt; 3
&lt;/span&gt;&lt;span class="lnt"&gt; 4
&lt;/span&gt;&lt;span class="lnt"&gt; 5
&lt;/span&gt;&lt;span class="lnt"&gt; 6
&lt;/span&gt;&lt;span class="lnt"&gt; 7
&lt;/span&gt;&lt;span class="lnt"&gt; 8
&lt;/span&gt;&lt;span class="lnt"&gt; 9
&lt;/span&gt;&lt;span class="lnt"&gt;10
&lt;/span&gt;&lt;span class="lnt"&gt;11
&lt;/span&gt;&lt;span class="lnt"&gt;12
&lt;/span&gt;&lt;span class="lnt"&gt;13
&lt;/span&gt;&lt;span class="lnt"&gt;14
&lt;/span&gt;&lt;span class="lnt"&gt;15
&lt;/span&gt;&lt;span class="lnt"&gt;16
&lt;/span&gt;&lt;span class="lnt"&gt;17
&lt;/span&gt;&lt;span class="lnt"&gt;18
&lt;/span&gt;&lt;span class="lnt"&gt;19
&lt;/span&gt;&lt;span class="lnt"&gt;20
&lt;/span&gt;&lt;span class="lnt"&gt;21
&lt;/span&gt;&lt;span class="lnt"&gt;22
&lt;/span&gt;&lt;span class="lnt"&gt;23
&lt;/span&gt;&lt;span class="lnt"&gt;24
&lt;/span&gt;&lt;span class="lnt"&gt;25
&lt;/span&gt;&lt;span class="lnt"&gt;26
&lt;/span&gt;&lt;span class="lnt"&gt;27
&lt;/span&gt;&lt;span class="lnt"&gt;28
&lt;/span&gt;&lt;span class="lnt"&gt;29
&lt;/span&gt;&lt;span class="lnt"&gt;30
&lt;/span&gt;&lt;span class="lnt"&gt;31
&lt;/span&gt;&lt;span class="lnt"&gt;32
&lt;/span&gt;&lt;span class="lnt"&gt;33
&lt;/span&gt;&lt;span class="lnt"&gt;34
&lt;/span&gt;&lt;span class="lnt"&gt;35
&lt;/span&gt;&lt;span class="lnt"&gt;36
&lt;/span&gt;&lt;span class="lnt"&gt;37
&lt;/span&gt;&lt;span class="lnt"&gt;38
&lt;/span&gt;&lt;span class="lnt"&gt;39
&lt;/span&gt;&lt;span class="lnt"&gt;40
&lt;/span&gt;&lt;span class="lnt"&gt;41
&lt;/span&gt;&lt;span class="lnt"&gt;42
&lt;/span&gt;&lt;span class="lnt"&gt;43
&lt;/span&gt;&lt;span class="lnt"&gt;44
&lt;/span&gt;&lt;span class="lnt"&gt;45
&lt;/span&gt;&lt;span class="lnt"&gt;46
&lt;/span&gt;&lt;span class="lnt"&gt;47
&lt;/span&gt;&lt;span class="lnt"&gt;48
&lt;/span&gt;&lt;span class="lnt"&gt;49
&lt;/span&gt;&lt;span class="lnt"&gt;50
&lt;/span&gt;&lt;span class="lnt"&gt;51
&lt;/span&gt;&lt;span class="lnt"&gt;52
&lt;/span&gt;&lt;span class="lnt"&gt;53
&lt;/span&gt;&lt;span class="lnt"&gt;54
&lt;/span&gt;&lt;span class="lnt"&gt;55
&lt;/span&gt;&lt;span class="lnt"&gt;56
&lt;/span&gt;&lt;span class="lnt"&gt;57
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;np&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;numpy.random&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;choice&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;stirling&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;N&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="ne"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;Bad input to stirling.&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;N&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;N&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;stirling&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;stirling&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;stirling&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;118124&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;stirling&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;118124&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;stirling&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;stirling&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;stirling&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;stirling&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;4536&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;normalized_stirling_numbers&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="c1"&gt;# * stirling(nn) Gives unsigned Stirling numbers of the first&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="c1"&gt;# * kind s(nn,*) in ss. ss[i] = s(nn,i). ss is normalized so that maximum&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="c1"&gt;# * value is 1. After Teh (npbayes).&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;ss&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;stirling&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;nn&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;max_val&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ss&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ss&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;max_val&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;ss1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;ss2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;ss10&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt; &lt;span class="mf"&gt;3.09439754e-01&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;8.75395242e-01&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;1.00000000e+00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="mf"&gt;6.17105824e-01&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;2.29662318e-01&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;5.39549757e-02&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="mf"&gt;8.05832694e-03&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;7.41877718e-04&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;3.83729854e-05&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="mf"&gt;8.52733009e-07&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="c1"&gt;# Verified with Yee Whye Teh&amp;#39;s code&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;(((&lt;/span&gt;&lt;span class="n"&gt;normalized_stirling_numbers&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;ss1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mf"&gt;0.00001&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;(((&lt;/span&gt;&lt;span class="n"&gt;normalized_stirling_numbers&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;ss2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mf"&gt;0.00001&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;(((&lt;/span&gt;&lt;span class="n"&gt;normalized_stirling_numbers&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;ss10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mf"&gt;0.00001&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;rand_antoniak&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="c1"&gt;# Sample from Antoniak Distribution&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="c1"&gt;# cf http://www.cs.cmu.edu/~tss/antoniak.pdf&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;normalized_stirling_numbers&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;aa&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*=&lt;/span&gt; &lt;span class="n"&gt;aa&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;aa&lt;/span&gt; &lt;span class="o"&gt;*=&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;choice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;rand_antoniak&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;</description></item><item><title>Collapsed Gibbs Sampling for Bayesian Mixture Models (with a Nonparametric Extension)</title><link>https://tdhopper.com/blog/collapsed-gibbs/</link><pubDate>Wed, 14 Oct 2015 00:00:00 +0000</pubDate><guid>https://tdhopper.com/blog/collapsed-gibbs/</guid><description>&lt;img src="https://tdhopper.com/images/collapsed-gibbs-mr-men.png" alt="Featured image of post Collapsed Gibbs Sampling for Bayesian Mixture Models (with a Nonparametric Extension)" /&gt;&lt;p&gt;&lt;a class="link" href="https://tdhopper.com/blog/mixture-model/" &gt;In an earlier notebook&lt;/a&gt;, I showed how we can fit the parameters of a Bayesian mixture model using a Gibbs sampler. The sampler defines a Markov chain that, in steady state, samples from the posterior distribution of the mixture model. To move the chain forward by one step we:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Sample the cluster assignments $z_i$.&lt;/li&gt;
&lt;li&gt;Sample the mixture weights $\pi$.&lt;/li&gt;
&lt;li&gt;Sample the cluster means $\mu_k$.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;It turns out that we can derive a Gibbs sampler that &lt;em&gt;just&lt;/em&gt; samples the assignments instead of the mixture weights and cluster means. This is known as a &lt;em&gt;collapsed&lt;/em&gt; Gibbs sampler. If we integrate out the cluster means $\theta_k$ and mixture weights $\pi$ from the marginal distribution of cluster assignment
&lt;/p&gt;
$$p(z_i=k \,|\,
z_{\neg i}, \pi,
\theta_1, \theta_2, \theta_3, \sigma, \mathbf{x}, \alpha
)$$&lt;p&gt; we are left with
&lt;/p&gt;
$$p(z_i\,|\, z_{\neg i}, \sigma, \mathbf{x}, \alpha).$$&lt;p&gt;By the conditional independence, we can factorize this marginal distribution
&lt;/p&gt;
$$
\begin{align}
p(z_i=k\,|\, z_{\neg i}, \sigma, \mathbf{x}, \alpha)
&amp;\propto
p(z_i=k, x_i \,|\, z_{\neg i}, \sigma, \mathbf{x}_{\neg i}, \alpha)\\
&amp;=
p(z_i=k\,|\, z_{\neg i}, \sigma, \mathbf{x}_{\neg i}, \alpha)
p(x_i \,|\, z, \sigma, \mathbf{x}_{\neg i}, \alpha)\\
&amp;=
p(z_i=k \,|\, z_{\neg i}, \alpha) p(x_i \,|\, z, \mathbf{x}_{\neg i}, \sigma)\\
&amp;=
p(z_i=k \,|\, z_{\neg i}, \alpha)p(x_i \,|\, z_i=k, z_{\neg i}, \mathbf{x}_{\neg i}, \sigma)\\
&amp;=
p(z_i=k \,|\, z_{\neg i}, \alpha)p(x_i \,|\, \left\{x_j \,|\, z_j=k, j\neq i\right\}, \sigma).
\end{align}
$$&lt;p&gt;The two terms have intuitive explanations. $p(z_i = k \,|\, z_{\neg i}, \alpha)$ is the probability that point $x_i$ will be assigned to component $k$ given the other current assignments. Because we are using a symmetric Dirichlet prior, this is the predictive likelihood of a Dirichlet-categorical distribution, given by:
&lt;/p&gt;
$$p(z_i=k \,|\, z_{\neg i}, \alpha)=
\frac{N_k^{-i}+\alpha / K}{N-1+\alpha}$$&lt;p&gt;
where $N_k^{-i}=\sum_{j\neq i} \delta(z_j, k)$ is the number of observations assigned to component $k$, excluding $x_i$. We also define $\bar{x}_k^{-i}$ to be the mean of the observations assigned to component $k$, excluding $x_i$.&lt;/p&gt;
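&lt;p&gt;As a quick sanity check, this predictive term can be computed directly. The counts below are hypothetical, not from the dataset used in this post:&lt;/p&gt;

```python
def dirichlet_categorical_predictive(N_k_minus_i, N, alpha, K):
    """Predictive probability that z_i = k under a symmetric
    Dirichlet(alpha/K) prior: (N_k^{-i} + alpha/K) / (N - 1 + alpha)."""
    return (N_k_minus_i + alpha / K) / (N - 1 + alpha)

# Hypothetical example: N = 100 points, K = 3 clusters, alpha = 1.0,
# and cluster k currently holds 40 points other than x_i.
p = dirichlet_categorical_predictive(N_k_minus_i=40, N=100, alpha=1.0, K=3)
```

&lt;p&gt;Summing this quantity over all $K$ components (the $N_k^{-i}$ sum to $N-1$) gives $1$, as a predictive distribution should.&lt;/p&gt;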
&lt;p&gt;The second term is the predictive likelihood that point $x_i$ is distributed according to cluster $k$ (given the data currently in cluster $k$). For our example, we are assuming unknown cluster means are distributed according to a normal distribution with hyperparameter mean $\lambda_1$ and variance $\lambda_2^2$ and known cluster variance $\sigma^2$.&lt;/p&gt;
&lt;p&gt;Thus,
&lt;/p&gt;
$$
\begin{align}
p(x_i \,|\, \left\{x_j \,|\, z_j=k, j\neq i\right\}, \sigma)
&amp;= \mathcal{N}(x_i \,|\, \mu_k, \sigma_k^2+\sigma^2)
\end{align}
$$&lt;p&gt;
where
&lt;/p&gt;
$$\sigma_k^2 = \left( \frac{N_k^{-i}}{\sigma^2} + \frac{1}{\lambda_2^2} \right)^{-1}$$&lt;p&gt;
and
&lt;/p&gt;
$$\mu_k = \sigma_k^2 \left(
\frac{\lambda_1}{\lambda_2^2}+\frac{N_k^{-i}\cdot \bar{x}_k^{-i}}{\sigma^2}
\right).$$&lt;p&gt;
This is derived in Kevin Murphy&amp;rsquo;s fantastic article &lt;a class="link" href="http://www.cs.ubc.ca/~murphyk/Papers/bayesGauss.pdf" target="_blank" rel="noopener"
&gt;Conjugate Bayesian analysis of the Gaussian distribution&lt;/a&gt;.&lt;/p&gt;
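&lt;p&gt;The two formulas above can be checked numerically. The values below are hypothetical; note that $\sigma_k^2$ shrinks toward $0$ and $\mu_k$ moves toward $\bar{x}_k^{-i}$ as $N_k^{-i}$ grows:&lt;/p&gt;

```python
# Hypothetical values: N_k^{-i} = 5 points in cluster k with mean 2.0,
# known cluster variance sigma^2 = 0.01, prior mean lambda_1 = 0.0,
# prior variance lambda_2^2 = 1.0.
N_k, xbar_k = 5, 2.0
sigma2, lambda_1, lambda2_sq = 0.01, 0.0, 1.0

sigma_k2 = 1.0 / (N_k / sigma2 + 1.0 / lambda2_sq)                 # posterior variance of mu_k
mu_k = sigma_k2 * (lambda_1 / lambda2_sq + N_k * xbar_k / sigma2)  # posterior mean of mu_k
predictive_variance = sigma_k2 + sigma2  # variance of N(x_i | mu_k, sigma_k^2 + sigma^2)
```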
&lt;p&gt;At each step of the collapsed sampler, we sample each $z_i$ as follows:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;For each cluster $k$, compute
$$f_k(x_i) = p(x_i \,|\, \left\{x_j \,|\, z_j=k, j\neq i\right\}, \sigma).$$This is the predictive probability that $x_i$ is in cluster $k$ given the data currently assigned to that cluster.&lt;/li&gt;
&lt;li&gt;Sample
$$z_i\sim \frac{1}{Z_i}\sum_{k=1}^K(N_k^{-i}+\alpha/K)f_k(x_i)\delta(z_i,k)$$&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;where the normalizing constant is $Z_i=\sum_{k=1}^K(N_k^{-i}+\alpha/K)f_k(x_i)$.&lt;/p&gt;
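&lt;p&gt;The two-step update above can be sketched as follows. The predictive likelihoods $f_k(x_i)$ here are placeholder numbers; in the real sampler they come from the normal predictive density derived above:&lt;/p&gt;

```python
import numpy as np

def sample_assignment(counts, f, alpha, seed=None):
    """Sample z_i with probability proportional to
    (N_k^{-i} + alpha/K) * f_k(x_i), normalized by Z_i."""
    counts = np.asarray(counts, dtype=float)  # the N_k^{-i}
    f = np.asarray(f, dtype=float)            # the f_k(x_i)
    K = len(counts)
    weights = (counts + alpha / K) * f
    probs = weights / weights.sum()           # divide by Z_i
    rng = np.random.default_rng(seed)
    return rng.choice(K, p=probs)

# Hypothetical example with K = 3 clusters:
z = sample_assignment(counts=[10, 5, 2], f=[0.2, 0.5, 0.01], alpha=1.0)
```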
&lt;p&gt;Let&amp;rsquo;s write code for this Gibbs sampler!&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt;1
&lt;/span&gt;&lt;span class="lnt"&gt;2
&lt;/span&gt;&lt;span class="lnt"&gt;3
&lt;/span&gt;&lt;span class="lnt"&gt;4
&lt;/span&gt;&lt;span class="lnt"&gt;5
&lt;/span&gt;&lt;span class="lnt"&gt;6
&lt;/span&gt;&lt;span class="lnt"&gt;7
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;pd&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;np&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;plt&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;collections&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;namedtuple&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Counter&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;scipy&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;stats&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;numpy&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;seed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;12345&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;First, load the same dataset we used &lt;a class="link" href="https://tdhopper.com/blog/mixture-model/" &gt;previously&lt;/a&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt;1
&lt;/span&gt;&lt;span class="lnt"&gt;2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Series&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;from_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;clusters.csv&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hist&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bins&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;&lt;img src="https://tdhopper.com/blog/collapsed-gibbs/2015-10-14-collapsed-gibbs-sampling-for-mixture-models_6_0.png"
width="377"
height="256"
srcset="https://tdhopper.com/blog/collapsed-gibbs/2015-10-14-collapsed-gibbs-sampling-for-mixture-models_6_0_hu_746e691434352d36.png 480w, https://tdhopper.com/blog/collapsed-gibbs/2015-10-14-collapsed-gibbs-sampling-for-mixture-models_6_0_hu_d70ff04bbc9236ad.png 1024w"
loading="lazy"
alt="Histogram of clustered data"
class="gallery-image"
data-flex-grow="147"
data-flex-basis="353px"
&gt;&lt;/p&gt;
&lt;p&gt;Again, we want to define a state object and a function for updating the sufficient statistics of the state.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt; 1
&lt;/span&gt;&lt;span class="lnt"&gt; 2
&lt;/span&gt;&lt;span class="lnt"&gt; 3
&lt;/span&gt;&lt;span class="lnt"&gt; 4
&lt;/span&gt;&lt;span class="lnt"&gt; 5
&lt;/span&gt;&lt;span class="lnt"&gt; 6
&lt;/span&gt;&lt;span class="lnt"&gt; 7
&lt;/span&gt;&lt;span class="lnt"&gt; 8
&lt;/span&gt;&lt;span class="lnt"&gt; 9
&lt;/span&gt;&lt;span class="lnt"&gt;10
&lt;/span&gt;&lt;span class="lnt"&gt;11
&lt;/span&gt;&lt;span class="lnt"&gt;12
&lt;/span&gt;&lt;span class="lnt"&gt;13
&lt;/span&gt;&lt;span class="lnt"&gt;14
&lt;/span&gt;&lt;span class="lnt"&gt;15
&lt;/span&gt;&lt;span class="lnt"&gt;16
&lt;/span&gt;&lt;span class="lnt"&gt;17
&lt;/span&gt;&lt;span class="lnt"&gt;18
&lt;/span&gt;&lt;span class="lnt"&gt;19
&lt;/span&gt;&lt;span class="lnt"&gt;20
&lt;/span&gt;&lt;span class="lnt"&gt;21
&lt;/span&gt;&lt;span class="lnt"&gt;22
&lt;/span&gt;&lt;span class="lnt"&gt;23
&lt;/span&gt;&lt;span class="lnt"&gt;24
&lt;/span&gt;&lt;span class="lnt"&gt;25
&lt;/span&gt;&lt;span class="lnt"&gt;26
&lt;/span&gt;&lt;span class="lnt"&gt;27
&lt;/span&gt;&lt;span class="lnt"&gt;28
&lt;/span&gt;&lt;span class="lnt"&gt;29
&lt;/span&gt;&lt;span class="lnt"&gt;30
&lt;/span&gt;&lt;span class="lnt"&gt;31
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;SuffStat&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;namedtuple&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;SuffStat&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;theta N&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;initial_state&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;num_clusters&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;cluster_ids&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;num_clusters&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="s1"&gt;&amp;#39;cluster_ids_&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;cluster_ids&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="s1"&gt;&amp;#39;data_&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="s1"&gt;&amp;#39;num_clusters_&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;num_clusters&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="s1"&gt;&amp;#39;cluster_variance_&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;.01&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="s1"&gt;&amp;#39;alpha_&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="s1"&gt;&amp;#39;hyperparameters_&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="s2"&gt;&amp;#34;mean&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="s2"&gt;&amp;#34;variance&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;},&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="s1"&gt;&amp;#39;suffstats&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;cid&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;None&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;cid&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;cluster_ids&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="s1"&gt;&amp;#39;assignment&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cluster_ids&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="s1"&gt;&amp;#39;pi&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;cid&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;num_clusters&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;cid&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;cluster_ids&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;update_suffstats&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;update_suffstats&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;cluster_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;N&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;Counter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;assignment&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;iteritems&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;points_in_cluster&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cid&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;zip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;data_&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;assignment&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;cid&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;cluster_id&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;mean&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;points_in_cluster&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;suffstats&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;cluster_id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SuffStat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Next we define functions to compute the two terms of our marginal distribution over cluster assignments (as we derived above).&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt; 1
&lt;/span&gt;&lt;span class="lnt"&gt; 2
&lt;/span&gt;&lt;span class="lnt"&gt; 3
&lt;/span&gt;&lt;span class="lnt"&gt; 4
&lt;/span&gt;&lt;span class="lnt"&gt; 5
&lt;/span&gt;&lt;span class="lnt"&gt; 6
&lt;/span&gt;&lt;span class="lnt"&gt; 7
&lt;/span&gt;&lt;span class="lnt"&gt; 8
&lt;/span&gt;&lt;span class="lnt"&gt; 9
&lt;/span&gt;&lt;span class="lnt"&gt;10
&lt;/span&gt;&lt;span class="lnt"&gt;11
&lt;/span&gt;&lt;span class="lnt"&gt;12
&lt;/span&gt;&lt;span class="lnt"&gt;13
&lt;/span&gt;&lt;span class="lnt"&gt;14
&lt;/span&gt;&lt;span class="lnt"&gt;15
&lt;/span&gt;&lt;span class="lnt"&gt;16
&lt;/span&gt;&lt;span class="lnt"&gt;17
&lt;/span&gt;&lt;span class="lnt"&gt;18
&lt;/span&gt;&lt;span class="lnt"&gt;19
&lt;/span&gt;&lt;span class="lnt"&gt;20
&lt;/span&gt;&lt;span class="lnt"&gt;21
&lt;/span&gt;&lt;span class="lnt"&gt;22
&lt;/span&gt;&lt;span class="lnt"&gt;23
&lt;/span&gt;&lt;span class="lnt"&gt;24
&lt;/span&gt;&lt;span class="lnt"&gt;25
&lt;/span&gt;&lt;span class="lnt"&gt;26
&lt;/span&gt;&lt;span class="lnt"&gt;27
&lt;/span&gt;&lt;span class="lnt"&gt;28
&lt;/span&gt;&lt;span class="lnt"&gt;29
&lt;/span&gt;&lt;span class="lnt"&gt;30
&lt;/span&gt;&lt;span class="lnt"&gt;31
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;log_predictive_likelihood&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cluster_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="s2"&gt;&amp;#34;&amp;#34;&amp;#34;Predictive likelihood of the data at data_id is generated
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; by cluster_id given the currenbt state.
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; From Section 2.4 of
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; http://www.cs.ubc.ca/~murphyk/Papers/bayesGauss.pdf
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; &amp;#34;&amp;#34;&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;ss&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;suffstats&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;cluster_id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;hp_mean&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;hyperparameters_&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;mean&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;hp_var&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;hyperparameters_&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;variance&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;param_var&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;cluster_variance_&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;data_&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;data_id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;_log_predictive_likelihood&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ss&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hp_mean&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hp_var&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;param_var&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_log_predictive_likelihood&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ss&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hp_mean&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hp_var&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;param_var&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;posterior_sigma2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ss&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;1.&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;param_var&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mf"&gt;1.&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;hp_var&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;predictive_mu&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;posterior_sigma2&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hp_mean&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;1.&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;hp_var&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;ss&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;ss&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;theta&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;1.&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;param_var&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;predictive_sigma2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;param_var&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;posterior_sigma2&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;predictive_sd&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;predictive_sigma2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;stats&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;norm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;predictive_mu&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;predictive_sd&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;logpdf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;log_cluster_assign_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cluster_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="s2"&gt;&amp;#34;&amp;#34;&amp;#34;Log-likelihood that a new point generated will
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; be assigned to cluster_id given the current state.
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; &amp;#34;&amp;#34;&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;current_cluster_size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;suffstats&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;cluster_id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;num_clusters&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;num_clusters_&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;alpha&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;alpha_&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;current_cluster_size&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;1.&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;num_clusters&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Given these two functions, we can compute the posterior distribution over cluster assignments for a given datapoint. This is the core of our collapsed Gibbs sampler.&lt;/p&gt;
&lt;p&gt;To simplify the computation of things like $N_k^{-i}$ (where we remove point $i$ from the summary statistics), we create two simple functions to add and remove a point from the summary statistics for a given cluster.&lt;/p&gt;
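&lt;p&gt;The update itself is just the running-mean identity: adding a point $x$ to a cluster with mean $\theta$ and count $N$ gives mean $(N\theta + x)/(N+1)$, and removing a point inverts that. A self-contained sanity check (a sketch that redefines the update locally, assuming &lt;code&gt;SuffStat&lt;/code&gt; is the &lt;code&gt;namedtuple&lt;/code&gt; with fields &lt;code&gt;theta&lt;/code&gt; and &lt;code&gt;N&lt;/code&gt; defined earlier) confirms that adding and then removing a point is a no-op:&lt;/p&gt;

```python
from collections import namedtuple

# Local stand-in for the post's SuffStat: theta is the cluster mean, N the count.
SuffStat = namedtuple('SuffStat', 'theta N')

def add_point(x, ss):
    # Running-mean update: new mean = (N * theta + x) / (N + 1)
    return SuffStat((ss.theta * ss.N + x) / (ss.N + 1), ss.N + 1)

def remove_point(x, ss):
    # Inverse update: old mean = (N * theta - x) / (N - 1)
    return SuffStat((ss.theta * ss.N - x) / (ss.N - 1), ss.N - 1)

ss = SuffStat(theta=2.0, N=4)
roundtrip = remove_point(5.0, add_point(5.0, ss))
assert abs(roundtrip.theta - ss.theta) < 1e-12 and roundtrip.N == ss.N
```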
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt; 1
&lt;/span&gt;&lt;span class="lnt"&gt; 2
&lt;/span&gt;&lt;span class="lnt"&gt; 3
&lt;/span&gt;&lt;span class="lnt"&gt; 4
&lt;/span&gt;&lt;span class="lnt"&gt; 5
&lt;/span&gt;&lt;span class="lnt"&gt; 6
&lt;/span&gt;&lt;span class="lnt"&gt; 7
&lt;/span&gt;&lt;span class="lnt"&gt; 8
&lt;/span&gt;&lt;span class="lnt"&gt; 9
&lt;/span&gt;&lt;span class="lnt"&gt;10
&lt;/span&gt;&lt;span class="lnt"&gt;11
&lt;/span&gt;&lt;span class="lnt"&gt;12
&lt;/span&gt;&lt;span class="lnt"&gt;13
&lt;/span&gt;&lt;span class="lnt"&gt;14
&lt;/span&gt;&lt;span class="lnt"&gt;15
&lt;/span&gt;&lt;span class="lnt"&gt;16
&lt;/span&gt;&lt;span class="lnt"&gt;17
&lt;/span&gt;&lt;span class="lnt"&gt;18
&lt;/span&gt;&lt;span class="lnt"&gt;19
&lt;/span&gt;&lt;span class="lnt"&gt;20
&lt;/span&gt;&lt;span class="lnt"&gt;21
&lt;/span&gt;&lt;span class="lnt"&gt;22
&lt;/span&gt;&lt;span class="lnt"&gt;23
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;cluster_assignment_distribution&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="s2"&gt;&amp;#34;&amp;#34;&amp;#34;Compute the marginal distribution of cluster assignment
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; for each cluster.
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; &amp;#34;&amp;#34;&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;scores&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;cid&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;suffstats&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keys&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;cid&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;log_predictive_likelihood&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;cid&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;log_cluster_assign_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;scores&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;cid&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;cid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;iteritems&lt;/span&gt;&lt;span class="p"&gt;()}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;normalization&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="nb"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;scores&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;cid&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;normalization&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;cid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;iteritems&lt;/span&gt;&lt;span class="p"&gt;()}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;scores&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;add_datapoint_to_suffstats&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ss&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="s2"&gt;&amp;#34;&amp;#34;&amp;#34;Add datapoint to sufficient stats for normal component
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; &amp;#34;&amp;#34;&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;SuffStat&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;ss&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;theta&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ss&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ss&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;ss&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;remove_datapoint_from_suffstats&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ss&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="s2"&gt;&amp;#34;&amp;#34;&amp;#34;Remove datapoint from sufficient stats for normal component
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; &amp;#34;&amp;#34;&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;SuffStat&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;ss&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;theta&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ss&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ss&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;ss&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Finally, we&amp;rsquo;re ready to create a function that takes a Gibbs step on the state. For each datapoint, it&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Removes the datapoint from its current cluster.&lt;/li&gt;
&lt;li&gt;Computes the posterior probability of the point being assigned to each cluster (given the other current assignments).&lt;/li&gt;
&lt;li&gt;Assigns the datapoint to a cluster sampled from this probability distribution.&lt;/li&gt;
&lt;/ol&gt;
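&lt;p&gt;One practical caveat before we take the step: &lt;code&gt;cluster_assignment_distribution&lt;/code&gt; exponentiates raw log scores, which can underflow to zero when every cluster is unlikely. A numerically safer sketch (a hypothetical helper, not part of the original sampler) shifts by the maximum log score before exponentiating, the usual log-sum-exp trick:&lt;/p&gt;

```python
import numpy as np

def normalize_log_scores(log_scores):
    """Turn a dict of log scores into normalized probabilities.

    Subtracting the maximum log score before exponentiating leaves the
    normalized result unchanged but keeps np.exp in a safe range.
    """
    cids = list(log_scores)
    logs = np.array([log_scores[cid] for cid in cids])
    probs = np.exp(logs - logs.max())  # largest score maps to exp(0) = 1
    probs /= probs.sum()
    return dict(zip(cids, probs))
```

&lt;p&gt;For log scores near -1000, naive exponentiation produces all zeros and the normalization divides by zero; the shifted version still returns a valid distribution.&lt;/p&gt;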
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt; 1
&lt;/span&gt;&lt;span class="lnt"&gt; 2
&lt;/span&gt;&lt;span class="lnt"&gt; 3
&lt;/span&gt;&lt;span class="lnt"&gt; 4
&lt;/span&gt;&lt;span class="lnt"&gt; 5
&lt;/span&gt;&lt;span class="lnt"&gt; 6
&lt;/span&gt;&lt;span class="lnt"&gt; 7
&lt;/span&gt;&lt;span class="lnt"&gt; 8
&lt;/span&gt;&lt;span class="lnt"&gt; 9
&lt;/span&gt;&lt;span class="lnt"&gt;10
&lt;/span&gt;&lt;span class="lnt"&gt;11
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;gibbs_step&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;pairs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;zip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;data_&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;assignment&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;data_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;datapoint&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cid&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pairs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;suffstats&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;cid&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;remove_datapoint_from_suffstats&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;datapoint&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;suffstats&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;cid&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;scores&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cluster_assignment_distribution&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;scores&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;zip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;cid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;assignment&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;data_id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cid&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;suffstats&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;cid&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;add_datapoint_to_suffstats&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;data_&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;data_id&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;suffstats&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;cid&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Here&amp;rsquo;s our old function to plot the assignments.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt; 1
&lt;/span&gt;&lt;span class="lnt"&gt; 2
&lt;/span&gt;&lt;span class="lnt"&gt; 3
&lt;/span&gt;&lt;span class="lnt"&gt; 4
&lt;/span&gt;&lt;span class="lnt"&gt; 5
&lt;/span&gt;&lt;span class="lnt"&gt; 6
&lt;/span&gt;&lt;span class="lnt"&gt; 7
&lt;/span&gt;&lt;span class="lnt"&gt; 8
&lt;/span&gt;&lt;span class="lnt"&gt; 9
&lt;/span&gt;&lt;span class="lnt"&gt;10
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;plot_clusters&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;gby&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="s1"&gt;&amp;#39;data&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;data_&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="s1"&gt;&amp;#39;assignment&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;assignment&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;groupby&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;by&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;assignment&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;data&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;hist_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;gby&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_group&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cid&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tolist&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;cid&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;gby&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;groups&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keys&lt;/span&gt;&lt;span class="p"&gt;()]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hist&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hist_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;bins&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;histtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;stepfilled&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;.5&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;To start, we randomly assign each datapoint to a cluster.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt;1
&lt;/span&gt;&lt;span class="lnt"&gt;2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;initial_state&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;plot_clusters&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;&lt;img src="https://tdhopper.com/blog/collapsed-gibbs/2015-10-14-collapsed-gibbs-sampling-for-mixture-models_18_0.png"
width="371"
height="256"
srcset="https://tdhopper.com/blog/collapsed-gibbs/2015-10-14-collapsed-gibbs-sampling-for-mixture-models_18_0_hu_ac490161eb7ad43.png 480w, https://tdhopper.com/blog/collapsed-gibbs/2015-10-14-collapsed-gibbs-sampling-for-mixture-models_18_0_hu_32c3566ebe1713ad.png 1024w"
loading="lazy"
alt="Random initial assignments"
class="gallery-image"
data-flex-grow="144"
data-flex-basis="347px"
&gt;&lt;/p&gt;
&lt;p&gt;Look what happens to the assignments after just one Gibbs step!&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt;1
&lt;/span&gt;&lt;span class="lnt"&gt;2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;gibbs_step&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;plot_clusters&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;&lt;img src="https://tdhopper.com/blog/collapsed-gibbs/2015-10-14-collapsed-gibbs-sampling-for-mixture-models_20_0.png"
width="377"
height="256"
srcset="https://tdhopper.com/blog/collapsed-gibbs/2015-10-14-collapsed-gibbs-sampling-for-mixture-models_20_0_hu_6efbddbaf79313b7.png 480w, https://tdhopper.com/blog/collapsed-gibbs/2015-10-14-collapsed-gibbs-sampling-for-mixture-models_20_0_hu_85c1957a4df7c0f0.png 1024w"
loading="lazy"
alt="Assignments after one Gibbs step"
class="gallery-image"
data-flex-grow="147"
data-flex-basis="353px"
&gt;&lt;/p&gt;
&lt;p&gt;Two:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt;1
&lt;/span&gt;&lt;span class="lnt"&gt;2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;gibbs_step&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;plot_clusters&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;&lt;img src="https://tdhopper.com/blog/collapsed-gibbs/2015-10-14-collapsed-gibbs-sampling-for-mixture-models_22_0.png"
width="377"
height="256"
srcset="https://tdhopper.com/blog/collapsed-gibbs/2015-10-14-collapsed-gibbs-sampling-for-mixture-models_22_0_hu_1d7298fc2dd80086.png 480w, https://tdhopper.com/blog/collapsed-gibbs/2015-10-14-collapsed-gibbs-sampling-for-mixture-models_22_0_hu_9cd8ee64aad1305.png 1024w"
loading="lazy"
alt="Assignments after two Gibbs steps"
class="gallery-image"
data-flex-grow="147"
data-flex-basis="353px"
&gt;&lt;/p&gt;
&lt;p&gt;After just two steps, our assignments look &lt;em&gt;really&lt;/em&gt; good. We can run the sampler for twenty more steps and look at the assignments again.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt;1
&lt;/span&gt;&lt;span class="lnt"&gt;2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="n"&gt;gibbs_step&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;plot_clusters&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;&lt;img src="https://tdhopper.com/blog/collapsed-gibbs/2015-10-14-collapsed-gibbs-sampling-for-mixture-models_24_0.png"
width="377"
height="256"
srcset="https://tdhopper.com/blog/collapsed-gibbs/2015-10-14-collapsed-gibbs-sampling-for-mixture-models_24_0_hu_356ba6c275f133a4.png 480w, https://tdhopper.com/blog/collapsed-gibbs/2015-10-14-collapsed-gibbs-sampling-for-mixture-models_24_0_hu_1163205c2de0b919.png 1024w"
loading="lazy"
alt="Assignments after 22 Gibbs steps"
class="gallery-image"
data-flex-grow="147"
data-flex-basis="353px"
&gt;&lt;/p&gt;
&lt;h3 id="nonparametric-mixture-models"&gt;Nonparametric Mixture Models!
&lt;/h3&gt;&lt;p&gt;It turns out, the collapsed Gibbs sampler for mixture models is almost identical in the context of a &lt;em&gt;nonparametric&lt;/em&gt; model. This model uses a &lt;em&gt;Dirichlet process prior&lt;/em&gt; instead of a &lt;em&gt;Dirichlet distribution prior&lt;/em&gt;. It doesn&amp;rsquo;t require us to specify how many clusters we are looking for in our data.&lt;/p&gt;
&lt;p&gt;The cluster assignment score changes slightly. For each known cluster $k$, it is proportional to $N_k^{\neg i}$, the number of datapoints assigned to that cluster when point $i$ is excluded. We assign a datapoint to a &lt;em&gt;new&lt;/em&gt; cluster with probability proportional to $\alpha$ (which is now the DP dispersion parameter).&lt;/p&gt;
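&lt;p&gt;To make these weights concrete: with existing clusters of sizes 5 and 3 and $\alpha = 2$, the unnormalized weights are $(5, 3, 2)$, so a point joins the first cluster with probability 0.5, the second with 0.3, and opens a new cluster with 0.2. Here is a small standalone sketch (my own addition, not part of the original code; &lt;code&gt;crp_assignment_probs&lt;/code&gt; is a hypothetical helper):&lt;/p&gt;

```python
import numpy as np

def crp_assignment_probs(counts, alpha):
    """Assignment probabilities under the Chinese-restaurant-process
    view of the DP prior: each existing cluster k gets weight N_k
    (its current size) and a brand-new cluster gets weight alpha."""
    weights = np.array(list(counts) + [alpha], dtype=float)
    return weights / weights.sum()  # normalize by N + alpha

# Clusters of size 5 and 3 with alpha = 2; the final entry
# is the probability of opening a new cluster.
print(crp_assignment_probs([5, 3], alpha=2.0))  # [0.5 0.3 0.2]
```

&lt;p&gt;If &lt;code&gt;counts&lt;/code&gt; excludes the point currently being resampled, these are exactly the $N_k^{\neg i}$ and $\alpha$ weights the scoring function below works with in log space.&lt;/p&gt;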
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt;1
&lt;/span&gt;&lt;span class="lnt"&gt;2
&lt;/span&gt;&lt;span class="lnt"&gt;3
&lt;/span&gt;&lt;span class="lnt"&gt;4
&lt;/span&gt;&lt;span class="lnt"&gt;5
&lt;/span&gt;&lt;span class="lnt"&gt;6
&lt;/span&gt;&lt;span class="lnt"&gt;7
&lt;/span&gt;&lt;span class="lnt"&gt;8
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;log_cluster_assign_score_dp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cluster_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;    &lt;span class="s2"&gt;&amp;#34;&amp;#34;&amp;#34;Log-likelihood that a newly generated point will
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; be assigned to cluster_id given the current state.
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; &amp;#34;&amp;#34;&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;cluster_id&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;new&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;alpha_&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;suffstats&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;cluster_id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;The predictive likelihood remains the same for known clusters. However, we need to know the likelihood of assigning a datapoint to a new cluster. In this case, we fall back on the hyperparameters to get:&lt;/p&gt;
$$
\begin{align}
p(x_i \,|\, z, x_{\neg_i}, \sigma)
&amp;= \mathcal{N}(x_i \,|\, \lambda_1, \lambda_2^2+\sigma^2)
\end{align}
$$&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt; 1
&lt;/span&gt;&lt;span class="lnt"&gt; 2
&lt;/span&gt;&lt;span class="lnt"&gt; 3
&lt;/span&gt;&lt;span class="lnt"&gt; 4
&lt;/span&gt;&lt;span class="lnt"&gt; 5
&lt;/span&gt;&lt;span class="lnt"&gt; 6
&lt;/span&gt;&lt;span class="lnt"&gt; 7
&lt;/span&gt;&lt;span class="lnt"&gt; 8
&lt;/span&gt;&lt;span class="lnt"&gt; 9
&lt;/span&gt;&lt;span class="lnt"&gt;10
&lt;/span&gt;&lt;span class="lnt"&gt;11
&lt;/span&gt;&lt;span class="lnt"&gt;12
&lt;/span&gt;&lt;span class="lnt"&gt;13
&lt;/span&gt;&lt;span class="lnt"&gt;14
&lt;/span&gt;&lt;span class="lnt"&gt;15
&lt;/span&gt;&lt;span class="lnt"&gt;16
&lt;/span&gt;&lt;span class="lnt"&gt;17
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;log_predictive_likelihood_dp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cluster_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;    &lt;span class="s2"&gt;&amp;#34;&amp;#34;&amp;#34;Predictive likelihood that the data at data_id was generated
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt;    by cluster_id given the current state.
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; From Section 2.4 of
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; http://www.cs.ubc.ca/~murphyk/Papers/bayesGauss.pdf
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; &amp;#34;&amp;#34;&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;cluster_id&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;new&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;ss&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SuffStat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;ss&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;suffstats&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;cluster_id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;hp_mean&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;hyperparameters_&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;mean&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;hp_var&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;hyperparameters_&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;variance&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;param_var&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;cluster_variance_&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;data_&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;data_id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;_log_predictive_likelihood&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ss&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hp_mean&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hp_var&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;param_var&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Given this, we can define the marginal distribution over cluster assignment. The only change is that the &amp;ldquo;&lt;code&gt;new&lt;/code&gt;&amp;rdquo; state now enters the distribution.&lt;/p&gt;
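&lt;p&gt;One numerical caveat: exponentiating raw log scores, as the next function does, can underflow to zero when every score is very negative. A standalone sketch of the standard log-sum-exp remedy (my own addition, not from the original post) that yields the same normalized probabilities:&lt;/p&gt;

```python
import numpy as np

def normalize_log_scores(log_scores):
    """Normalize a dict of log scores into probabilities, shifting by
    the max log score first so np.exp never underflows every entry."""
    labels = list(log_scores)
    logs = np.array([log_scores[cid] for cid in labels])
    logs -= logs.max()          # largest score becomes 0 -> exp(0) = 1
    probs = np.exp(logs)
    probs /= probs.sum()
    return dict(zip(labels, probs))

# Scores this negative would all round to 0.0 under a direct np.exp.
probs = normalize_log_scores({1: -1000.0, 2: -1001.0, "new": -1005.0})
print(probs)
```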
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt; 1
&lt;/span&gt;&lt;span class="lnt"&gt; 2
&lt;/span&gt;&lt;span class="lnt"&gt; 3
&lt;/span&gt;&lt;span class="lnt"&gt; 4
&lt;/span&gt;&lt;span class="lnt"&gt; 5
&lt;/span&gt;&lt;span class="lnt"&gt; 6
&lt;/span&gt;&lt;span class="lnt"&gt; 7
&lt;/span&gt;&lt;span class="lnt"&gt; 8
&lt;/span&gt;&lt;span class="lnt"&gt; 9
&lt;/span&gt;&lt;span class="lnt"&gt;10
&lt;/span&gt;&lt;span class="lnt"&gt;11
&lt;/span&gt;&lt;span class="lnt"&gt;12
&lt;/span&gt;&lt;span class="lnt"&gt;13
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;cluster_assignment_distribution_dp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="s2"&gt;&amp;#34;&amp;#34;&amp;#34;Compute the marginal distribution of cluster assignment
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; for each cluster.
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; &amp;#34;&amp;#34;&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;scores&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;    &lt;span class="n"&gt;cluster_ids&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;suffstats&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keys&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;new&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;cid&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;cluster_ids&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;cid&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;log_predictive_likelihood_dp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;cid&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;log_cluster_assign_score_dp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;    &lt;span class="n"&gt;scores&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;cid&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;cid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;()}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;normalization&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="nb"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;    &lt;span class="n"&gt;scores&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;cid&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;normalization&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;cid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;()}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;scores&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;We also need to be able to create a new cluster when &amp;ldquo;&lt;code&gt;new&lt;/code&gt;&amp;rdquo; is drawn, and destroy a cluster when it is emptied.&lt;/p&gt;
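&lt;p&gt;Two easy-to-miss details in this bookkeeping: &lt;code&gt;num_clusters_&lt;/code&gt; must be decremented (not assigned) when a cluster dies, and pruning must not mutate the list it is iterating over. A tiny self-contained sketch (using a simplified stand-in for &lt;code&gt;SuffStat&lt;/code&gt; that only tracks the count) illustrating both:&lt;/p&gt;

```python
from collections import namedtuple

# Simplified stand-in for the post's SuffStat; only the count matters here.
SuffStat = namedtuple("SuffStat", ["N"])

state = {
    "num_clusters_": 2,
    "cluster_ids_": [1, 2],
    "suffstats": {1: SuffStat(N=3), 2: SuffStat(N=0)},  # cluster 2 is empty
}

def destroy_cluster(state, cluster_id):
    state["num_clusters_"] -= 1          # decrement, not assign
    del state["suffstats"][cluster_id]
    state["cluster_ids_"].remove(cluster_id)

def prune_clusters(state):
    # Iterate over a copy: destroy_cluster mutates cluster_ids_.
    for cid in list(state["cluster_ids_"]):
        if state["suffstats"][cid].N == 0:
            destroy_cluster(state, cid)

prune_clusters(state)
print(state["cluster_ids_"], state["num_clusters_"])  # [1] 1
```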
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt; 1
&lt;/span&gt;&lt;span class="lnt"&gt; 2
&lt;/span&gt;&lt;span class="lnt"&gt; 3
&lt;/span&gt;&lt;span class="lnt"&gt; 4
&lt;/span&gt;&lt;span class="lnt"&gt; 5
&lt;/span&gt;&lt;span class="lnt"&gt; 6
&lt;/span&gt;&lt;span class="lnt"&gt; 7
&lt;/span&gt;&lt;span class="lnt"&gt; 8
&lt;/span&gt;&lt;span class="lnt"&gt; 9
&lt;/span&gt;&lt;span class="lnt"&gt;10
&lt;/span&gt;&lt;span class="lnt"&gt;11
&lt;/span&gt;&lt;span class="lnt"&gt;12
&lt;/span&gt;&lt;span class="lnt"&gt;13
&lt;/span&gt;&lt;span class="lnt"&gt;14
&lt;/span&gt;&lt;span class="lnt"&gt;15
&lt;/span&gt;&lt;span class="lnt"&gt;16
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;create_cluster&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;num_clusters_&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;cluster_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;suffstats&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keys&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;suffstats&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;cluster_id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SuffStat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;cluster_ids_&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cluster_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;cluster_id&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;destroy_cluster&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cluster_id&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;    &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;num_clusters_&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;-=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;del&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;suffstats&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;cluster_id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;cluster_ids_&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;remove&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cluster_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;prune_clusters&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;cid&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;cluster_ids_&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;suffstats&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;cid&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;destroy_cluster&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cid&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Finally, we can define the &lt;code&gt;gibbs_step_dp&lt;/code&gt; function. It&amp;rsquo;s nearly identical to the earlier &lt;code&gt;gibbs_step&lt;/code&gt; function except&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;It uses &lt;code&gt;cluster_assignment_distribution_dp&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;It creates a new cluster when the sampled assignment is &amp;ldquo;&lt;code&gt;new&lt;/code&gt;&amp;rdquo;.&lt;/li&gt;
&lt;li&gt;It destroys a cluster any time it is emptied.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For clarity, I split the code for sampling an assignment out into its own function.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt; 1
&lt;/span&gt;&lt;span class="lnt"&gt; 2
&lt;/span&gt;&lt;span class="lnt"&gt; 3
&lt;/span&gt;&lt;span class="lnt"&gt; 4
&lt;/span&gt;&lt;span class="lnt"&gt; 5
&lt;/span&gt;&lt;span class="lnt"&gt; 6
&lt;/span&gt;&lt;span class="lnt"&gt; 7
&lt;/span&gt;&lt;span class="lnt"&gt; 8
&lt;/span&gt;&lt;span class="lnt"&gt; 9
&lt;/span&gt;&lt;span class="lnt"&gt;10
&lt;/span&gt;&lt;span class="lnt"&gt;11
&lt;/span&gt;&lt;span class="lnt"&gt;12
&lt;/span&gt;&lt;span class="lnt"&gt;13
&lt;/span&gt;&lt;span class="lnt"&gt;14
&lt;/span&gt;&lt;span class="lnt"&gt;15
&lt;/span&gt;&lt;span class="lnt"&gt;16
&lt;/span&gt;&lt;span class="lnt"&gt;17
&lt;/span&gt;&lt;span class="lnt"&gt;18
&lt;/span&gt;&lt;span class="lnt"&gt;19
&lt;/span&gt;&lt;span class="lnt"&gt;20
&lt;/span&gt;&lt;span class="lnt"&gt;21
&lt;/span&gt;&lt;span class="lnt"&gt;22
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;sample_assignment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="s2"&gt;&amp;#34;&amp;#34;&amp;#34;Sample new assignment from marginal distribution.
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; If cluster is &amp;#34;`new`&amp;#34;, create a new cluster.
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; &amp;#34;&amp;#34;&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;scores&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cluster_assignment_distribution_dp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;scores&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;zip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;cid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;cid&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;new&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;create_cluster&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cid&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;gibbs_step_dp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="s2"&gt;&amp;#34;&amp;#34;&amp;#34;Collapsed Gibbs sampler for Dirichlet Process Mixture Model
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; &amp;#34;&amp;#34;&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;pairs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;zip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;data_&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;assignment&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;data_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;datapoint&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cid&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pairs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;suffstats&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;cid&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;remove_datapoint_from_suffstats&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;datapoint&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;suffstats&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;cid&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;prune_clusters&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;cid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sample_assignment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;assignment&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;data_id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cid&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;suffstats&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;cid&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;add_datapoint_to_suffstats&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;data_&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;data_id&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;suffstats&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;cid&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;This time, we will start by randomly assigning our data to two clusters.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt;1
&lt;/span&gt;&lt;span class="lnt"&gt;2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;initial_state&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;num_clusters&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;plot_clusters&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;&lt;img src="https://tdhopper.com/blog/collapsed-gibbs/2015-10-14-collapsed-gibbs-sampling-for-mixture-models_36_0.png"
width="371"
height="256"
srcset="https://tdhopper.com/blog/collapsed-gibbs/2015-10-14-collapsed-gibbs-sampling-for-mixture-models_36_0_hu_a526009d7c13a93c.png 480w, https://tdhopper.com/blog/collapsed-gibbs/2015-10-14-collapsed-gibbs-sampling-for-mixture-models_36_0_hu_d7bdb91ce9801e81.png 1024w"
loading="lazy"
alt="Initial assignment to 2 clusters"
class="gallery-image"
data-flex-grow="144"
data-flex-basis="347px"
&gt;&lt;/p&gt;
&lt;p&gt;Here&amp;rsquo;s what happens when we run our Gibbs sampler once.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt;1
&lt;/span&gt;&lt;span class="lnt"&gt;2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;gibbs_step_dp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;plot_clusters&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;&lt;img src="https://tdhopper.com/blog/collapsed-gibbs/2015-10-14-collapsed-gibbs-sampling-for-mixture-models_38_0.png"
width="377"
height="256"
srcset="https://tdhopper.com/blog/collapsed-gibbs/2015-10-14-collapsed-gibbs-sampling-for-mixture-models_38_0_hu_39fa5631907bb45f.png 480w, https://tdhopper.com/blog/collapsed-gibbs/2015-10-14-collapsed-gibbs-sampling-for-mixture-models_38_0_hu_b258ecc9b572e576.png 1024w"
loading="lazy"
alt="After one DP Gibbs step: 4 clusters"
class="gallery-image"
data-flex-grow="147"
data-flex-basis="353px"
&gt;&lt;/p&gt;
&lt;p&gt;We went from 2 to 4 clusters!&lt;/p&gt;
&lt;p&gt;After 100 iterations:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt;1
&lt;/span&gt;&lt;span class="lnt"&gt;2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;99&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="n"&gt;gibbs_step_dp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;plot_clusters&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;&lt;img src="https://tdhopper.com/blog/collapsed-gibbs/2015-10-14-collapsed-gibbs-sampling-for-mixture-models_41_0.png"
width="377"
height="256"
srcset="https://tdhopper.com/blog/collapsed-gibbs/2015-10-14-collapsed-gibbs-sampling-for-mixture-models_41_0_hu_b18683741734f2b3.png 480w, https://tdhopper.com/blog/collapsed-gibbs/2015-10-14-collapsed-gibbs-sampling-for-mixture-models_41_0_hu_f105d5ebc42b302d.png 1024w"
loading="lazy"
alt="After 100 DP Gibbs iterations: 3 clusters"
class="gallery-image"
data-flex-grow="147"
data-flex-basis="353px"
&gt;&lt;/p&gt;
&lt;p&gt;After 100 iterations, our assignment looks correct! We went back to 3 clusters.&lt;/p&gt;
&lt;p&gt;We can sample the mixture weights, if we need them, using the &amp;ldquo;Conditional Distribution of Mixture Weights&amp;rdquo; derived &lt;a class="link" href="https://tdhopper.com/blog/mixture-model/" &gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt;1
&lt;/span&gt;&lt;span class="lnt"&gt;2
&lt;/span&gt;&lt;span class="lnt"&gt;3
&lt;/span&gt;&lt;span class="lnt"&gt;4
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;ss&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;suffstats&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;alpha&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;ss&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;cid&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;alpha_&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;num_clusters_&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;cid&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;cluster_ids_&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;stats&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dirichlet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rvs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;flatten&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;pre&gt;&lt;code&gt;array([ 0.21330625, 0.29838101, 0.48831275])
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We can also sample the cluster means using &lt;a class="link" href="https://tdhopper.com/blog/mixture-model/" &gt;the method we derived earlier&lt;/a&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt; 1
&lt;/span&gt;&lt;span class="lnt"&gt; 2
&lt;/span&gt;&lt;span class="lnt"&gt; 3
&lt;/span&gt;&lt;span class="lnt"&gt; 4
&lt;/span&gt;&lt;span class="lnt"&gt; 5
&lt;/span&gt;&lt;span class="lnt"&gt; 6
&lt;/span&gt;&lt;span class="lnt"&gt; 7
&lt;/span&gt;&lt;span class="lnt"&gt; 8
&lt;/span&gt;&lt;span class="lnt"&gt; 9
&lt;/span&gt;&lt;span class="lnt"&gt;10
&lt;/span&gt;&lt;span class="lnt"&gt;11
&lt;/span&gt;&lt;span class="lnt"&gt;12
&lt;/span&gt;&lt;span class="lnt"&gt;13
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;cluster_id&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;cluster_ids_&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;cluster_var&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;cluster_variance_&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;hp_mean&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;hyperparameters_&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;mean&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;hp_var&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;hyperparameters_&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;variance&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;ss&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;suffstats&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;cluster_id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;numerator&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hp_mean&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;hp_var&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;ss&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;theta&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;ss&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;cluster_var&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;denominator&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;hp_var&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;ss&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;cluster_var&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;posterior_mu&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;numerator&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;denominator&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;posterior_var&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;denominator&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;mean&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;stats&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;norm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;posterior_mu&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;posterior_var&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rvs&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;cluster_id:&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cluster_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;mean&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;pre&gt;&lt;code&gt;cluster_id: 1 mean -0.0176257860235
cluster_id: 2 mean -0.400581819532
cluster_id: 3 mean 0.600302879661
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Many thanks to Erik Sudderth&amp;rsquo;s excellent introduction to nonparametric Bayes in &lt;a class="link" href="http://cs.brown.edu/~sudderth/papers/sudderthPhD.pdf" target="_blank" rel="noopener"
&gt;Chapter 2 of his dissertation&lt;/a&gt;. Algorithms 2.2 and 2.3 in that piece are the clearest formulation of collapsed Gibbs sampling for mixture models that I have come across.&lt;/p&gt;</description></item><item><title>Nomenclature of Dirichlet Processes</title><link>https://tdhopper.com/blog/nomenclature/</link><pubDate>Tue, 13 Oct 2015 00:00:00 +0000</pubDate><guid>https://tdhopper.com/blog/nomenclature/</guid><description>&lt;img src="https://tdhopper.com/images/nomenclature-mr-men.png" alt="Featured image of post Nomenclature of Dirichlet Processes" /&gt;&lt;p&gt;Much of the literature on Dirichlet Processes makes assertions similar to the following:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&amp;ldquo;DP is the Dirichlet process, a distribution over distributions.&amp;rdquo; (&lt;a class="link" href="http://www.stat.columbia.edu/npbayes/papers/neal_sampling.pdf" target="_blank" rel="noopener"
&gt;Neal, 2000&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;&amp;ldquo;[The Dirichlet process] is a distribution over distributions, i.e. each draw from a Dirichlet process is itself a distribution.&amp;rdquo; (&lt;a class="link" href="http://www.stats.ox.ac.uk/~teh/research/npbayes/Teh2010a.pdf" target="_blank" rel="noopener"
&gt;Teh, 2010&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;&amp;ldquo;The Dirichlet process (DP) is a distribution over distributions.&amp;rdquo; (&lt;a class="link" href="http://web.mit.edu/sjgershm/www/GershmanBlei12.pdf" target="_blank" rel="noopener"
&gt;Gershman and Blei, 2012&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;The &amp;ldquo;Dirichlet process defines a distribution on random probability measures&amp;hellip;&amp;rdquo; (&lt;a class="link" href="http://cs.brown.edu/~sudderth/papers/sudderthPhD.pdf" target="_blank" rel="noopener"
&gt;Sudderth, 2006&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;&amp;ldquo;Dirichlet processes define a distribution over distributions&amp;hellip;&amp;rdquo; (&lt;a class="link" href="http://mlg.eng.cam.ac.uk/zoubin/talks/uai05tutorial-b.pdf" target="_blank" rel="noopener"
&gt;Ghahramani, 2005&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Michael Jordan makes an &lt;a class="link" href="http://www.cs.berkeley.edu/~jordan/courses/281B-spring04/lectures/dp1.pdf" target="_blank" rel="noopener"
&gt;equivalent statement&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Each of these sources makes the claim that a Dirichlet Process is a distribution over probability distributions. That is, given a base distribution $G_0$ and a parameter $\alpha$, $DP(\alpha, G_0)$ &lt;em&gt;is&lt;/em&gt; a Dirichlet process and &lt;em&gt;is&lt;/em&gt; (equivalently) a distribution over probability distributions. Therefore, a sample $G$ drawn from the Dirichlet process $DP(\alpha, G_0)$ is itself a probability distribution. $G$ (where $G\sim DP(\alpha, G_0)$) is a discrete probability distribution whose support is a discrete subset of the support of $G_0$.&lt;/p&gt;
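To make the claim concrete, here is a small sketch (not from the original post) of the stick-breaking construction, which produces exactly such a draw $G$: a discrete distribution whose atoms are samples from the base distribution $G_0$. The function name and the truncation tolerance are my own choices, purely for illustration.

```python
import numpy as np

def draw_from_dp(alpha, base_draw, rng, tol=1e-8):
    """Truncated stick-breaking sketch of G ~ DP(alpha, G_0).

    Returns atoms (samples from G_0) and weights; G is the discrete
    distribution placing probability weights[k] on atoms[k].
    """
    atoms, weights = [], []
    remaining = 1.0  # unbroken length of the stick
    while remaining > tol:
        b = rng.beta(1.0, alpha)       # break off a Beta(1, alpha) fraction
        weights.append(remaining * b)  # that fraction becomes an atom's weight
        atoms.append(base_draw(rng))   # the atom itself is a draw from G_0
        remaining *= 1.0 - b
    return np.array(atoms), np.array(weights)

rng = np.random.default_rng(0)
# G_0 here is a standard normal; the draw G is discrete even though G_0 is continuous.
atoms, weights = draw_from_dp(3.0, lambda r: r.normal(), rng)
```

The weights sum to one (up to the truncation tolerance), and every atom lies in the support of $G_0$, matching the statement above.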
&lt;p&gt;Confusingly, while many sources refer to the DP as a distribution over distributions, when using the phrase &amp;ldquo;sample from a Dirichlet process&amp;rdquo;, they mean a sample from $G$, not from $DP(\alpha, G_0)$. These authors appear to equivocate on the term &amp;ldquo;Dirichlet process&amp;rdquo;. It means both a distribution over distributions and a distribution sampled from this distribution over distributions.&lt;/p&gt;
&lt;p&gt;After being confused by this point for some time, I &lt;a class="link" href="https://tdhopper.com/blog/dirichlet-distribution/" &gt;prepared these notes&lt;/a&gt; arguing that the Dirichlet process &lt;em&gt;is&lt;/em&gt; a distribution over distributions. I argued that the term &lt;em&gt;sample from a Dirichlet process&lt;/em&gt; should refer to a distribution sampled from the DP, not to a point sampled from the support of $G_0$.&lt;/p&gt;
&lt;p&gt;In response to my notes, &lt;a class="link" href="http://danroy.org/marginalia/Nomenclature_for_stochastic_processes_and_Bayesian_nonparametric_statistics" target="_blank" rel="noopener"
&gt;Dan Roy&lt;/a&gt; &lt;a class="link" href="http://danroy.org/marginalia/Nomenclature_for_stochastic_processes_and_Bayesian_nonparametric_statistics" target="_blank" rel="noopener"
&gt;briefly argued&lt;/a&gt; that &amp;ldquo;The Dirichlet process is a distribution on the space of probability measures&amp;rdquo; is a misstatement. In fact, Roy argues that $DP(\alpha, G_0)$ is not a Dirichlet process at all! Instead, $G$ (the distribution sampled from $DP(\alpha, G_0)$) is the Dirichlet process.&lt;/p&gt;
&lt;p&gt;Thomas Ferguson first defined the Dirichlet Process &lt;a class="link" href="http://www.cis.upenn.edu/~taskar/courses/cis700-sp08/papers/ferguson.pdf" target="_blank" rel="noopener"
&gt;in his 1973 paper&lt;/a&gt;. Charles Antoniak (a student of Ferguson) repeats the definition &lt;a class="link" href="http://www.cis.upenn.edu/~taskar/courses/cis700-sp08/papers/antoniak.pdf" target="_blank" rel="noopener"
&gt;in his 1974 paper&lt;/a&gt;. Antoniak&amp;rsquo;s definition is as follows:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Let $\Theta$ be a set, and $\mathcal{A}$ a $\sigma$-field of subsets of $\Theta$. Let $\beta$ be a finite, nonnull, nonnegative, finitely additive measure on $(\Theta, \mathcal{A})$. We say a random probability measure $P$ on $(\Theta, \mathcal{A})$ is a Dirichlet process on $(\Theta, \mathcal{A})$ with parameter $\beta$, denoted $P\in \mathcal{D}(\beta)$, if for every $k=1, 2, \ldots$ and measurable partition $B_1,\ldots,B_k$ of $\Theta$, the joint distribution of the random probabilities $(P(B_1),\ldots,P(B_k))$ is Dirichlet with parameters $(\beta(B_1),\ldots,\beta(B_k))$, denoted $(P(B_1),\ldots,P(B_k))\in \mathcal{D}(\beta(B_1),\ldots,\beta(B_k))$.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Let&amp;rsquo;s unpack this dense, measure-theoretic definition.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Let $\Theta$ be a set, and $\mathcal{A}$ a $\sigma$-field of subsets of $\Theta$. Let $\beta$ be a finite, nonnull, nonnegative, finitely additive measure on $(\Theta, \mathcal{A})$.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;First, note that $\beta$ here is a finite measure, i.e. a not-necessarily-normalized probability distribution. Antoniak&amp;rsquo;s $\beta$ is equivalent to $\alpha\cdot G_0$ in our notation. Essentially, this means we have a probability distribution over some set $\Theta$ with density $p(x) = \beta(x) / \beta(\Theta)$. (The business about the $\sigma$-field of subsets just allows us to avoid things like the &lt;a class="link" href="https://en.wikipedia.org/wiki/Banach%E2%80%93Tarski_paradox" target="_blank" rel="noopener"
&gt;Banach-Tarski paradox&lt;/a&gt;.)&lt;/p&gt;
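To see the partition property in action, here is an illustrative sketch with a made-up finite $\Theta$ and measure $\beta$ (invented for this note, not from Antoniak): for any measurable partition, the vector of random probabilities follows a Dirichlet distribution whose parameters are the $\beta$-masses of the blocks.

```python
import numpy as np

# A hypothetical finite Theta = {a, b, c, d} with an unnormalized
# measure beta; total mass beta(Theta) = 5.0.
beta = {"a": 2.0, "b": 1.0, "c": 1.5, "d": 0.5}
total = sum(beta.values())
base = {x: m / total for x, m in beta.items()}  # p(x) = beta(x) / beta(Theta)

# For the partition B1 = {a, b}, B2 = {c, d}, the definition says
# (P(B1), P(B2)) ~ Dirichlet(beta(B1), beta(B2)) = Dirichlet(3.0, 2.0),
# so P(B1) has mean beta(B1) / beta(Theta) = 3/5.
rng = np.random.default_rng(1)
samples = rng.dirichlet([beta["a"] + beta["b"], beta["c"] + beta["d"]], 100_000)
print(samples[:, 0].mean())  # close to 0.6
```

In other words, $\beta$ plays two roles at once: its normalization gives the base distribution, and its total mass controls how concentrated the random measure is around that base.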
&lt;blockquote&gt;
&lt;p&gt;We say a random probability measure $P$ on $(\Theta, \mathcal{A})$ is a Dirichlet process on $(\Theta, \mathcal{A})$ with parameter $\beta$, denoted $P\in \mathcal{D}(\beta)$&lt;/p&gt;
&lt;/blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;hellip;if for every $k=1, 2, \ldots$ and measurable partition $B_1,\ldots,B_k$ of $\Theta$, the joint distribution of the random probabilities $(P(B_1),\ldots,P(B_k))$ is Dirichlet with parameters $(\beta(B_1),\ldots,\beta(B_k))$, denoted $(P(B_1),\ldots,P(B_k))\in \mathcal{D}(\beta(B_1),\ldots,\beta(B_k))$.&lt;/p&gt;
&lt;/blockquote&gt;</description></item><item><title>Econtalk Topics: Nonparametric LDA in Practice</title><link>https://tdhopper.com/blog/econtalk-topics/</link><pubDate>Wed, 07 Oct 2015 00:00:00 +0000</pubDate><guid>https://tdhopper.com/blog/econtalk-topics/</guid><description>&lt;img src="https://tdhopper.com/images/econtalk-topics-mr-men.png" alt="Featured image of post Econtalk Topics: Nonparametric LDA in Practice" /&gt;&lt;p&gt;&lt;em&gt;The interactive visualizations in this analysis require a live notebook. &lt;a class="link" href="http://nbviewer.ipython.org/github/tdhopper/notes-on-dirichlet-processes/blob/master/2015-10-07-econtalk-topics.ipynb" target="_blank" rel="noopener"
&gt;View the original notebook on NBViewer&lt;/a&gt; for the full interactive experience, including the pyLDAvis topic visualization.&lt;/em&gt;&lt;/p&gt;
&lt;h2 id="nonparametric-latent-dirichlet-allocation"&gt;Nonparametric Latent Dirichlet Allocation
&lt;/h2&gt;&lt;h3 id="analysis-of-the-topics-of-econtalk"&gt;Analysis of the topics of &lt;a class="link" href="http://www.econtalk.org/" target="_blank" rel="noopener"
&gt;Econtalk&lt;/a&gt;
&lt;/h3&gt;&lt;p&gt;In 2003, a groundbreaking statistical model called &amp;ldquo;&lt;a class="link" href="https://www.cs.princeton.edu/~blei/papers/BleiNgJordan2003.pdf" target="_blank" rel="noopener"
&gt;Latent Dirichlet Allocation&lt;/a&gt;&amp;rdquo; was presented by David Blei, Andrew Ng, and Michael Jordan.&lt;/p&gt;
&lt;p&gt;LDA provides a method for summarizing the topics discussed in a document. LDA defines topics to be discrete probability distributions over words. For an introduction to LDA, see &lt;a class="link" href="http://blog.echen.me/2011/08/22/introduction-to-latent-dirichlet-allocation/" target="_blank" rel="noopener"
&gt;Edwin Chen&amp;rsquo;s post&lt;/a&gt;.&lt;/p&gt;
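As a toy illustration of topics as distributions over words (the vocabulary, topics, and weights below are invented for this sketch; they come from no Econtalk data), LDA's generative story picks a topic for each word position and then draws a word from that topic:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["market", "price", "tax", "gene", "cell", "protein"]
# Each topic is a discrete probability distribution over the vocabulary.
topics = np.array([
    [0.4, 0.4, 0.2, 0.0, 0.0, 0.0],  # an "economics"-flavored topic
    [0.0, 0.0, 0.0, 0.3, 0.4, 0.3],  # a "biology"-flavored topic
])
# Each document mixes the topics with Dirichlet-distributed weights.
theta = rng.dirichlet([1.0, 1.0])
words = []
for _ in range(10):
    z = rng.choice(len(topics), p=theta)                       # pick a topic
    words.append(vocab[rng.choice(len(vocab), p=topics[z])])   # then a word from it
print(words)
```

Inference runs this story in reverse: given only the words, recover the topics and each document's mixing weights.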
&lt;p&gt;The original LDA model requires the number of topics in the document to be specified as a known parameter of the model. In 2005, Yee Whye Teh and others published &lt;a class="link" href="http://www.cs.berkeley.edu/~jordan/papers/hdp.pdf" target="_blank" rel="noopener"
&gt;a &amp;ldquo;nonparametric&amp;rdquo; version of this model&lt;/a&gt; that doesn&amp;rsquo;t require the number of topics to be specified. This model uses a prior distribution over the topics called a hierarchical Dirichlet process. &lt;a class="link" href="https://tdhopper.com/blog/nonparametric-lda/" &gt;I wrote an introduction to this HDP-LDA model&lt;/a&gt; earlier this year.&lt;/p&gt;
&lt;p&gt;For the last six months, I have been developing a Python-based Gibbs sampler for the HDP-LDA model. This is part of a larger library of &amp;ldquo;robust, validated Bayesian nonparametric models for discovering structure in data&amp;rdquo; known as &lt;a class="link" href="http://datamicroscopes.github.io" target="_blank" rel="noopener"
&gt;Data Microscopes&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;This notebook demonstrates the functionality of this implementation.&lt;/p&gt;
&lt;p&gt;The Data Microscopes library is available on &lt;a class="link" href="https://anaconda.org/datamicroscopes/" target="_blank" rel="noopener"
&gt;anaconda.org&lt;/a&gt; for Linux and OS X. &lt;code&gt;microscopes-lda&lt;/code&gt; can be installed with:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ conda install -c datamicroscopes -c distributions microscopes-lda
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The Econtalk transcript data used in this analysis is &lt;a class="link" href="https://github.com/tdhopper/notes-on-dirichlet-processes/tree/master/econtalk-data" target="_blank" rel="noopener"
&gt;available on GitHub&lt;/a&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt; 1
&lt;/span&gt;&lt;span class="lnt"&gt; 2
&lt;/span&gt;&lt;span class="lnt"&gt; 3
&lt;/span&gt;&lt;span class="lnt"&gt; 4
&lt;/span&gt;&lt;span class="lnt"&gt; 5
&lt;/span&gt;&lt;span class="lnt"&gt; 6
&lt;/span&gt;&lt;span class="lnt"&gt; 7
&lt;/span&gt;&lt;span class="lnt"&gt; 8
&lt;/span&gt;&lt;span class="lnt"&gt; 9
&lt;/span&gt;&lt;span class="lnt"&gt;10
&lt;/span&gt;&lt;span class="lnt"&gt;11
&lt;/span&gt;&lt;span class="lnt"&gt;12
&lt;/span&gt;&lt;span class="lnt"&gt;13
&lt;/span&gt;&lt;span class="lnt"&gt;14
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pyLDAvis&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;json&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;sys&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;cPickle&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;microscopes.common.rng&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;rng&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;microscopes.lda.definition&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;model_definition&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;microscopes.lda.model&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;initialize&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;microscopes.lda&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;utils&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;microscopes.lda&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;runner&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;numpy&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;genfromtxt&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;numpy&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;linalg&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;numpy&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;array&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;&lt;code&gt;dtm.csv&lt;/code&gt; contains a document-term matrix representation of the words used in Econtalk transcripts. The columns of the matrix correspond to the words in &lt;code&gt;vocab.txt&lt;/code&gt;. The rows in the matrix correspond to the show urls in &lt;code&gt;urls.txt&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Our LDA implementation takes input data as a list of lists of hashable objects (typically words). We can use a utility function to convert the document-term matrix to the list of tokenized documents.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt;1
&lt;/span&gt;&lt;span class="lnt"&gt;2
&lt;/span&gt;&lt;span class="lnt"&gt;3
&lt;/span&gt;&lt;span class="lnt"&gt;4
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;vocab&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;genfromtxt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;./econtalk-data/vocab.txt&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;delimiter&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;,&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;str&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tolist&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;dtm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;genfromtxt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;./econtalk-data/dtm.csv&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;delimiter&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;,&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;docs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;utils&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;docs_from_document_term_matrix&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dtm&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vocab&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;vocab&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;urls&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;./econtalk-data/urls.txt&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;readlines&lt;/span&gt;&lt;span class="p"&gt;()]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Let&amp;rsquo;s set up our model. First we create a model definition describing the basic structure of our data. Next we initialize an MCMC state object using the model definition, documents, random number generator, and hyper-parameters.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt;1
&lt;/span&gt;&lt;span class="lnt"&gt;2
&lt;/span&gt;&lt;span class="lnt"&gt;3
&lt;/span&gt;&lt;span class="lnt"&gt;4
&lt;/span&gt;&lt;span class="lnt"&gt;5
&lt;/span&gt;&lt;span class="lnt"&gt;6
&lt;/span&gt;&lt;span class="lnt"&gt;7
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;V&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;docs&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vocab&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;defn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model_definition&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;V&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;prng&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rng&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;12345&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;initialize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;defn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;docs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prng&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;vocab_hp&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;dish_hps&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;alpha&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;.6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;gamma&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;runner&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;runner&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;defn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;docs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;When we first create a state object, the words are randomly assigned to topics. Thus, our perplexity (model score) is quite high. After running 1000 iterations of the MCMC, the perplexity drops significantly:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;randomly initialized model:
number of documents 454
vocabulary size 16445
perplexity: 16523.1820356 num topics: 9
after 1000 iterations:
perplexity: 2363.65138771 num topics: 11
&lt;/code&gt;&lt;/pre&gt;
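Perplexity here is the exponentiated negative mean per-word log-likelihood, so lower is better. A minimal sketch of the computation (not the microscopes implementation):

```python
import math

def perplexity(word_probs):
    """Exponentiated negative mean log-likelihood of the observed words."""
    return math.exp(-sum(math.log(p) for p in word_probs) / len(word_probs))

# A model that is uniformly uncertain over a 16445-word vocabulary assigns
# each word probability 1/16445, so its perplexity equals the vocabulary
# size -- close to the randomly initialized score above.
print(round(perplexity([1 / 16445] * 100), 3))  # → 16445.0
```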
&lt;p&gt;We can extract the term relevance for each topic. Here are the 10 most relevant words for each of the 11 discovered topics:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;topic 0 : banks fed bank money financial monetary debt inflation crisis rates
topic 1 : party republicans constitution vote democrats republican tax election president stalin
topic 2 : fat science diet eat insulin disease immune replication scientific eating
topic 3 : growth trade water cities china city development climate inequality oil
topic 4 : people think don just going like say things lot way
topic 5 : smith hayek moral economics society adam liberty coase theory rules
topic 6 : bitcoin internet software google technology store bitcoins computer machines company
topic 7 : prison health drug care drugs medicaid medical patients patient women
topic 8 : schools teachers school kids teacher education students parents teaching sports
topic 9 : bees honey pollination colony ants bee queen cheung ant colonies
topic 10 : museum museums art gallery galleries monet seating trustees admission director
&lt;/code&gt;&lt;/pre&gt;
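Given a topic-word probability matrix, pulling out the top words per topic is a few lines of numpy. This is a simplified sketch using raw probabilities and made-up numbers; the library's relevance metric may weight terms differently:

```python
import numpy as np

# Toy topic-word probability matrix: rows are topics, columns index the vocabulary.
vocab = np.array(["banks", "fed", "money", "schools", "teachers", "kids"])
topic_word = np.array([
    [0.40, 0.30, 0.20, 0.05, 0.03, 0.02],   # a "monetary" topic
    [0.02, 0.03, 0.05, 0.35, 0.30, 0.25],   # an "education" topic
])

def top_words(topic_word, vocab, n=3):
    # argsort gives ascending order; reverse and truncate for the top n
    return [vocab[np.argsort(row)[::-1][:n]].tolist() for row in topic_word]

print(top_words(topic_word, vocab))
# → [['banks', 'fed', 'money'], ['schools', 'teachers', 'kids']]
```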
&lt;p&gt;We could assign titles to each of these topics. For example, &lt;em&gt;Topic 5&lt;/em&gt; appears to be about the &lt;em&gt;foundations of classical liberalism&lt;/em&gt;. &lt;em&gt;Topic 6&lt;/em&gt; is obviously &lt;em&gt;Bitcoin and Software&lt;/em&gt;. &lt;em&gt;Topic 0&lt;/em&gt; is the &lt;em&gt;financial system and monetary policy&lt;/em&gt;. &lt;em&gt;Topic 4&lt;/em&gt; seems to be &lt;em&gt;generic words used in most episodes&lt;/em&gt;; unfortunately, the prevalence of &amp;ldquo;don&amp;rdquo; is a result of my preprocessing, which splits up the contraction &amp;ldquo;don&amp;rsquo;t&amp;rdquo;.&lt;/p&gt;
&lt;p&gt;Topic 5 appears to be about the theory of classical liberalism. The 20 episodes with the highest proportion of words from that topic include discussions of the Theory of Moral Sentiments, foundations of liberty, and microeconomics:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;The Economics of Organ Donations
Klein on The Theory of Moral Sentiments, Episode 2
Boudreaux on Law and Legislation
Klein on The Theory of Moral Sentiments, Episode 4
Klein on The Theory of Moral Sentiments, Episode 5
Wolfe on Liberalism
Boettke on Katrina and the Economics of Disaster
Richard Thaler on Libertarian Paternalism
...
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We can also use the topic distributions as low-dimensional projections of the documents, allowing us to find episodes that are similar in content. For example, episodes similar to &amp;ldquo;Kling on Freddie and Fannie and the Recent History of the U.S. Housing Market&amp;rdquo; include many episodes about the financial crisis:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Irwin on the Great Depression and the Gold Standard
Rustici on Smoot-Hawley and the Great Depression
Reinhart on Financial Crises
Posner on the Financial Crisis
Sumner on Monetary Policy
Calomiris on the Financial Crisis
John Taylor on the Financial Crisis
...
&lt;/code&gt;&lt;/pre&gt;
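One simple way to rank episodes by similarity from their topic proportions is cosine similarity. The post doesn't specify the metric used, so this is an illustrative choice with made-up toy distributions:

```python
import numpy as np

# Toy per-document topic proportions (rows sum to 1); numbers are invented.
doc_topics = np.array([
    [0.8, 0.1, 0.1],   # mostly topic 0 (say, the financial-system topic)
    [0.7, 0.2, 0.1],   # also mostly topic 0
    [0.1, 0.1, 0.8],   # mostly topic 2
])

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query = doc_topics[0]
sims = [cosine_similarity(query, d) for d in doc_topics]
ranked = np.argsort(sims)[::-1]   # most similar first; the query matches itself
print(ranked.tolist())  # → [0, 1, 2]
```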
&lt;p&gt;We can also find the topics a given word is most likely to appear in. For example, the word &amp;ldquo;Munger&amp;rdquo; (as in &lt;a class="link" href="http://www.michaelmunger.com/" target="_blank" rel="noopener"
&gt;Mike Munger&lt;/a&gt;) appears most frequently in discussions of classical liberalism and microeconomics:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt;1
&lt;/span&gt;&lt;span class="lnt"&gt;2
&lt;/span&gt;&lt;span class="lnt"&gt;3
&lt;/span&gt;&lt;span class="lnt"&gt;4
&lt;/span&gt;&lt;span class="lnt"&gt;5
&lt;/span&gt;&lt;span class="lnt"&gt;6
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;topics_related_to_word&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;word&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;wd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rel&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;zip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;word_dists&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;relevance&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;wd&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;word&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;rel_words&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39; &amp;#39;&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;rel&lt;/span&gt;&lt;span class="p"&gt;][:&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;bars&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nb"&gt;print&lt;/span&gt; &lt;span class="n"&gt;bars&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;rel_words&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;pre&gt;&lt;code&gt;==== growth trade water cities china city development climate inequality oil
================== smith hayek moral economics society adam liberty coase theory rules
=== bitcoin internet software google technology store bitcoins computer machines company
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Where does Munger come up? In discussing the moral foundations of classical liberalism and microeconomics!&lt;/p&gt;
&lt;p&gt;The word &amp;ldquo;lovely&amp;rdquo;&amp;mdash;which Russ Roberts uses often when talking about the &lt;em&gt;Theory of Moral Sentiments&lt;/em&gt;&amp;mdash;appears most in that topic as well:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;= smith hayek moral economics society adam liberty coase theory rules
&lt;/code&gt;&lt;/pre&gt;</description></item><item><title>A Joke</title><link>https://tdhopper.com/blog/a-joke/</link><pubDate>Thu, 24 Sep 2015 00:00:00 +0000</pubDate><guid>https://tdhopper.com/blog/a-joke/</guid><description>&lt;blockquote class="twitter-tweet" lang="en"&gt;&lt;p lang="en" dir="ltr"&gt;Even though I&amp;#39;ve been coding for 12 years, I&amp;#39;d still just consider myself a grammer.&lt;/p&gt;&amp;mdash; Tim Hopper (@tdhopper) &lt;a href="https://twitter.com/tdhopper/status/531928033553563648"&gt;November 10, 2014&lt;/a&gt;&lt;/blockquote&gt;
&lt;script async src="//platform.twitter.com/widgets.js" charset="utf-8"&gt;&lt;/script&gt;</description></item><item><title>Notes on Gibbs Sampling in Hierarchical Dirichlet Process Models</title><link>https://tdhopper.com/blog/hdp-lda/</link><pubDate>Mon, 21 Sep 2015 00:00:00 +0000</pubDate><guid>https://tdhopper.com/blog/hdp-lda/</guid><description>&lt;img src="https://tdhopper.com/images/hdp-lda-gibbs-mr-men.png" alt="Featured image of post Notes on Gibbs Sampling in Hierarchical Dirichlet Process Models" /&gt;&lt;p&gt;&lt;a class="link" href="http://www.cs.berkeley.edu/~jordan/papers/hdp.pdf" target="_blank" rel="noopener"
&gt;Yee Whye Teh et al&lt;/a&gt;&amp;rsquo;s 2005 paper &lt;em&gt;Hierarchical Dirichlet Processes&lt;/em&gt; describes a nonparametric prior for grouped clustering problems. For example, the HDP helps in generalizing the &lt;a class="link" href="https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation" target="_blank" rel="noopener"
&gt;latent Dirichlet allocation&lt;/a&gt; model to the case where the number of topics in the data is discovered by the inference algorithm instead of being specified as a parameter of the model.&lt;/p&gt;
&lt;p&gt;The authors describe three MCMC-based algorithms for fitting HDP based models. Unfortunately, the algorithms are described somewhat vaguely and in general terms. A fair bit of mathematical leg work is required before the HDP algorithms can be applied to the specific case of nonparametric latent Dirichlet allocation.&lt;/p&gt;
&lt;p&gt;Here are some notes I&amp;rsquo;ve compiled in my effort to understand these algorithms.&lt;/p&gt;
&lt;h2 id="hdp-lda-generative-model"&gt;HDP-LDA Generative Model
&lt;/h2&gt;&lt;p&gt;The generative model for Hierarchical Dirichlet Process Latent Dirichlet Allocation is as follows:&lt;/p&gt;
\begin{equation}
\begin{aligned}
H &amp; \sim \text{Dirichlet}(\beta) \\
G_0 \,|\, \gamma, H &amp; \sim \text{DP}(\gamma, H) \\
G_j \,|\, \alpha_0, G_0 &amp; \sim \text{DP}(\alpha_0, G_0) \\
\theta_{ji} \,|\, G_j &amp; \sim G_j \\
x_{ij} \,|\, \theta_{ji} &amp; \sim \text{Categorical}(\theta_{ji}) \\
\end{aligned}
\end{equation}&lt;ul&gt;
&lt;li&gt;$H$ is a Dirichlet distribution whose dimension is the size of the vocabulary, i.e. it is a distribution over an uncountable set of term distributions (topics).&lt;/li&gt;
&lt;li&gt;$G_0$ is a distribution over a countably-infinite number of categorical term distributions, i.e. topics.&lt;/li&gt;
&lt;li&gt;For each document $j$, $G_j$ is a distribution over a countably-infinite number of categorical term distributions, i.e. topics.&lt;/li&gt;
&lt;li&gt;$\theta_{ji}$ is a categorical distribution over terms, i.e. a topic.&lt;/li&gt;
&lt;li&gt;$x_{ij}$ is a term.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;To see code for sampling from this generative model, see &lt;a class="link" href="https://tdhopper.com/blog/nonparametric-lda/" &gt;this post&lt;/a&gt;.&lt;/p&gt;
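As a complement, here is a truncated stick-breaking sketch of this generative process. The truncation level `T` and the finite Dirichlet approximation to $G_j$ are illustrative simplifications, not part of the model itself:

```python
import numpy as np

rng = np.random.default_rng(0)
V, T = 6, 20                       # vocabulary size, DP truncation level
beta, gamma, alpha0 = 0.5, 2.0, 0.6

# G_0 ~ DP(gamma, H): stick-breaking weights over topic atoms drawn from H
sticks = rng.beta(1.0, gamma, size=T)
weights = sticks * np.concatenate(([1.0], np.cumprod(1.0 - sticks[:-1])))
weights /= weights.sum()           # renormalize after truncation
topics = rng.dirichlet([beta] * V, size=T)   # atoms phi_k ~ H = Dirichlet(beta)

# G_j ~ DP(alpha0, G_0): per-document topic weights over the *same* atoms
doc_weights = rng.dirichlet(alpha0 * weights)

# theta_ji ~ G_j, then x_ij ~ Categorical(theta_ji)
k = rng.choice(T, p=doc_weights)
word = rng.choice(V, p=topics[k])
```

Because every document draws its weights over the same atoms, topics are shared across documents, which is exactly what the hierarchy buys us.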
&lt;h3 id="chinese-restaurant-franchise-approach"&gt;Chinese Restaurant Franchise Approach
&lt;/h3&gt;&lt;p&gt;Instead of the above Dirichlet process model, we can think of an identical &amp;ldquo;Chinese Restaurant Franchise&amp;rdquo; model.&lt;/p&gt;
&lt;p&gt;Each $\theta_{ji}$ is a customer in restaurant $j$. Each customer is sitting at a table, and each table has multiple customers.&lt;/p&gt;
&lt;p&gt;There is a global menu of $K$ dishes that the restaurants serve, $\phi_1,\ldots,\phi_K$.&lt;/p&gt;
&lt;p&gt;Some other definitions:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;$\psi_{jt}$ is the dish served at table $t$ in restaurant $j$; i.e. each $\psi_{jt}$ corresponds to some $\phi_k$.&lt;/li&gt;
&lt;li&gt;$t_{ji}$ is the index of the $\psi_{jt}$ associated with $\theta_{ji}$.&lt;/li&gt;
&lt;li&gt;$k_{jt}$ is the index of $\phi_k$ associated with $\psi_{jt}$.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;em&gt;Customer $i$ in restaurant $j$ sits at table $t_{ji}$ while table $t$ in restaurant $j$ serves dish $k_{jt}$.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;There are two arrays of count variables we will want to track:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;$n_{jtk}$ is the number of customers in restaurant $j$ at table $t$ eating dish $k$.&lt;/li&gt;
&lt;li&gt;$m_{jk}$ is the number of tables in restaurant $j$ serving dish $k$ (multiple tables may serve the same dish).&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;To summarize:&lt;/p&gt;
&lt;p&gt;$x_{ij}$ are observed data (words). We assume $x_{ij}\sim F(\theta_{ji})$. Further, we assume $\theta_{ji}$ is associated with table $t_{ji}$, that is, $\theta_{ji}=\psi_{jt_{ji}}$. Further, we assume the dish for table $t$ in restaurant $j$ is indexed by $k_{jt}$, i.e. $\psi_{jt}=\phi_{k_{jt}}$. Thus, if we know $t_{ji}$ (the table assignment for $x_{ij}$) and $k_{jt}$ (the dish assignment for table $t$) for all $i, j, t$, we can determine the remaining parameters by sampling.&lt;/p&gt;
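The count arrays $n_{jtk}$ and $m_{jk}$ follow mechanically from the assignments $t_{ji}$ and $k_{jt}$. A minimal bookkeeping sketch, with a hypothetical seating arrangement of my own invention:

```python
from collections import Counter

# Hypothetical seating, for illustration only:
# t[(j, i)] = table of customer i in restaurant j
# k[(j, t)] = dish served at table t in restaurant j
t = {(0, 0): 0, (0, 1): 0, (0, 2): 1, (1, 0): 0}
k = {(0, 0): 2, (0, 1): 5, (1, 0): 2}

# n[(j, t, k)]: customers in restaurant j at table t eating dish k
n = Counter()
for (j, i), table in t.items():
    n[(j, table, k[(j, table)])] += 1

# m[(j, k)]: tables in restaurant j serving dish k
m = Counter()
for (j, table), dish in k.items():
    m[(j, dish)] += 1

print(n[(0, 0, 2)], m[(0, 2)])  # → 2 1
```

In a real sampler these counters are updated incrementally as customers are unseated and reseated, rather than recomputed from scratch.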
&lt;h2 id="gibbs-sampling"&gt;Gibbs Sampling
&lt;/h2&gt;&lt;p&gt;&lt;a class="link" href="http://www.cs.berkeley.edu/~jordan/papers/hdp.pdf" target="_blank" rel="noopener"
&gt;Teh et al&lt;/a&gt; describe three Gibbs samplers for this model. The first and third are most applicable to the LDA application. The second helps with more complicated applications of the HDP (e.g. the hidden Markov model).&lt;/p&gt;
&lt;h3 id="53-posterior-sampling-by-direct-assignment"&gt;5.3 Posterior sampling by direct assignment
&lt;/h3&gt;&lt;p&gt;Section 5.3 describes a direct assignment Gibbs sampler that directly assigns words to topics by augmenting the model with an assignment variable $z_{ji}$ that is equivalent to $k_{jt_{ji}}$. This also requires a count variable $m_{jk}$: the number of tables in document/restaurant $j$ assigned to dish/topic $k$. This sampler requires less &amp;ldquo;bookkeeping&amp;rdquo; than the algorithm from 5.1; however, it requires expensive simulation or computation of recursively computed Stirling numbers of the first kind.&lt;/p&gt;
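Those unsigned Stirling numbers of the first kind satisfy the recurrence $s(n, k) = (n-1)\,s(n-1, k) + s(n-1, k-1)$, which is what makes them expensive: a short memoized sketch of the computation:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def stirling_first(n, k):
    """Unsigned Stirling number of the first kind, s(n, k)."""
    if n == 0 and k == 0:
        return 1
    if n == 0 or k == 0:
        return 0
    return (n - 1) * stirling_first(n - 1, k) + stirling_first(n - 1, k - 1)

print(stirling_first(4, 2))  # → 11
```

In practice implementations work with ratios of these numbers in log space, since they overflow quickly as $n$ grows.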
&lt;p&gt;My notes below derive the necessary equations for Gibbs sampling with the algorithm in section 5.1. However, they also provide most of the derivations needed for 5.3.&lt;/p&gt;
&lt;h3 id="51-posterior-sampling-in-the-chinese-restaurant-franchise"&gt;5.1 Posterior sampling in the Chinese restaurant franchise
&lt;/h3&gt;&lt;p&gt;Section 5.1 describes &amp;ldquo;Posterior sampling in the Chinese restaurant franchise&amp;rdquo;. Given observed data $\mathbf{x}$ (i.e. documents), we sample over the index variables $t_{ji}$ (associating tables with customers/words) and $k_{jt}$ (associating tables with dishes/topics). Given these variables, we can reconstruct the distribution over topics for each document and distribution over words for each topic.&lt;/p&gt;
&lt;h4 id="notes-on-implementing-algorithm-51"&gt;Notes on Implementing Algorithm 5.1
&lt;/h4&gt;&lt;p&gt;Teh et al&amp;rsquo;s original HDP paper is sparse on details with regard to applying these samplers to the specific case of nonparametric LDA. For example, both samplers require computing the conditional distribution of word $x_{ji}$ under topic $k$ given all data items except $x_{ji}$: $f_k^{-x_{ji}}(x_{ji})$ (eq. 30).&lt;/p&gt;
&lt;p&gt;&lt;a class="link" href="https://shuyo.wordpress.com/2012/08/15/hdp-lda-updates/" target="_blank" rel="noopener"
&gt;A Blogger Formerly Known As Shuyo&lt;/a&gt; has a brief post where he states (with little-to-no derivation) the equations specific to the LDA case. Below I attempt to provide some of those derivations in pedantic detail.&lt;/p&gt;
&lt;p&gt;As stated above, in the case of topic modeling, $H$ is a Dirichlet distribution over term distributions and $F$ is a multinomial distribution over terms.&lt;/p&gt;
&lt;p&gt;By definition,&lt;/p&gt;
\begin{equation} h(\phi_k)=\frac{1}{Z}\prod_v[\phi_k]_v^{\beta-1} \end{equation}&lt;p&gt;and&lt;/p&gt;
\begin{equation}
f(x_{ji}=v \,|\, \phi_k)=\phi_{kv}.
\end{equation}&lt;h5 id="equation-30"&gt;Equation (30): $f_k^{x_{ji}}(x_{ji})$
&lt;/h5&gt;&lt;p&gt;For convenience, define $v=x_{ji}$ (word $i$ in document $j$), define $k=k_{jt_{ji}}$ (topic assignment for the table in document $j$ containing word $i$), and&lt;/p&gt;
\begin{equation}
n_{kv}^{-ji}=\left|\left\{
x_{mn} \,|\, k_{mt_{mn}}=k,\, x_{mn}=v,\, (m,n)\neq(j,i)
\right\}\right|
\end{equation}&lt;p&gt;(the number of times the term $x_{ji}$, besides $x_{ji}$ itself, is generated by the same topic as was $x_{ji}$).&lt;/p&gt;
&lt;p&gt;First look at the term (for fixed $k$):&lt;/p&gt;
\begin{equation}
\prod_{j'i'\neq ji, z_{j'i'}=k}
f(x_{j'i'} \,|\, \phi_k)=
\prod_{j'}
\prod_{i'\neq i, z_{j'i'}=k}
[\phi_{k}]_{x_{j'i'}}
\end{equation}&lt;p&gt;$[\phi_k]_v$ is the probability that term $v$ is generated by topic $k$. The double product runs over every word generated by topic $k$ in every document. Since $[\phi_{k}]_{x_{j'i'}}$ is fixed for a given word $w$, we can instead take the product over each word of the vocabulary:&lt;/p&gt;
\begin{equation}
\prod_{j'i'\neq ji, z_{j'i'}=k}
f(x_{j'i'} \,|\, \phi_k)
=\prod_{w\in\mathcal{V}}[\phi_k]_w^{n_{kw}^{-ji}}
\end{equation}&lt;p&gt;We use this early on in the big derivation below.&lt;/p&gt;
&lt;p&gt;Also, note that&lt;/p&gt;
\begin{equation}
\int
\phi_{kv}^{n_{kv}^{-ji}+\beta}
\prod_{w\neq v}
\phi_{kw}^{n_{kw}^{-ji}+\beta-1}
d\phi_k \text{ and }
\int
\prod_w
\phi_{kw}^{n_{kw}^{-ji}+\beta-1}
d\phi_k
\end{equation}&lt;p&gt;are the normalizing coefficients for Dirichlet distributions.&lt;/p&gt;
&lt;p&gt;Equation (30) in Teh&amp;rsquo;s paper is:&lt;/p&gt;
\begin{equation}
\begin{aligned}
f_k^{-x_{ji}}(x_{ji})
&amp;=\frac{
\int
f(x_{ji} \,|\, \phi_k)
\left[
\prod_{j'i'\neq ji, z_{j'i'}=k}
f(x_{j'i'} \,|\, \phi_k)
\right]
h(\phi_k)
d\phi_k
}
{
\int
\left[
\prod_{j'i'\neq ji, z_{j'i'}=k}
f(x_{j'i'}|\phi_k)
\right]
h(\phi_k)d\phi_k
} \\
&amp;=\frac{
\int
f(x_{ji} \,|\, \phi_k)
\left[
\prod_{j'}
\prod_{i'\neq i, z_{j'i'}=k}
\phi_{kv}
\right]\cdot
h(\phi_k)
d\phi_k
}
{
\int
\left[
\prod_{j'i'\neq ji, z_{j'i'}=k}
f(x_{j'i'}|\phi_k)
\right]h(\phi_k)d\phi_k
} \\
&amp;\propto\frac{
\int
\phi_{kv}
\prod_w
\phi_{kw}^{n_{kw}^{-ji}}
\prod_{w}
\phi_{kw}^{\beta-1}
d\phi_k
}{
\int
\prod_w
\phi_{kw}^{n_{kw}^{-ji}}
\prod_w
\phi_{kw}^{\beta-1}
d\phi_k
}\\
&amp;=\frac{
\int
\phi_{kv}\cdot
\phi_{kv}^{n_{kv}^{-ji}}\cdot
\prod_{w\neq v}
\phi_{kw}^{n_{kw}^{-ji}}\cdot
\phi_{kv}^{\beta-1}\cdot
\prod_{w\neq v}
\phi_{kw}^{\beta-1}
d\phi_k
}{
\int
\prod_w
\phi_{kw}^{n_{kw}^{-ji}}
\prod_w
\phi_{kw}^{\beta-1}
d\phi_k
}\\
&amp;=
\int
\phi_{kv}^{n_{kv}^{-ji}+\beta}
\prod_{w\neq v}
\phi_{kw}^{n_{kw}^{-ji}+\beta-1}
d\phi_k
\cdot
\frac{
1
}{
\int
\prod_w
\phi_{kw}^{n_{kw}^{-ji}+\beta-1}
d\phi_k.
}
\end{aligned}
\end{equation}&lt;p&gt;Recognizing these integrals as those that occur in the Dirichlet distribution, we have,&lt;/p&gt;
\begin{equation}
\begin{aligned}
f_k^{-x_{ji}}(x_{ji})
&amp;=\frac{
\Gamma\left(\beta+n_{kv}^{-ji}+1\right)
\prod_{w\neq v} \Gamma\left(\beta+n_{kw}^{-ji}\right)
}{
\Gamma\left(
\sum_{w\neq v}
\left[
\beta+n_{kw}^{-ji}
\right]+
(\beta+n_{kv}^{-ji}+1)\right)
}\cdot
\frac{
1
}{
\int\prod_w
\left(\phi_{kw}\right)^{n_{kw}^{-ji}+\beta-1}
d\phi_k
}\\
&amp;=\frac{
\Gamma\left(\beta+n_{kv}^{-ji}+1\right)
\prod_{w\neq v} \Gamma\left(\beta+n_{kw}^{-ji}\right)
}{
\Gamma\left(
\sum_{w\in\mathcal{V}}
\left[
\beta+n_{kw}^{-ji}
\right]
+1\right)
}\cdot
\frac{
1
}{
\int\prod_w
\left(\phi_{kw}\right)^{n_{kw}^{-ji}+\beta-1}
d\phi_k
}\\
&amp;=\frac{
\Gamma\left(\beta+n_{kv}^{-ji}+1\right)
\prod_{w\neq v} \Gamma\left(\beta+n_{kw}^{-ji}\right)
}{
\Gamma\left(
\sum_{w\in\mathcal{V}}
\left[
\beta+n_{kw}^{-ji}
\right]
+1\right)
}\cdot
\frac{
\Gamma\left(
\sum_{w\in\mathcal{V}}
\left[
\beta+n_{kw}^{-ji}
\right]
\right)
}{
\prod_{w}\Gamma\left(\beta+n_{kw}^{-ji}\right)
}\\
&amp;=\frac{
\Gamma\left(\beta+n_{kv}^{-ji}+1\right)
\prod_{w\neq v} \Gamma\left(\beta+n_{kw}^{-ji}\right)
}{
\Gamma\left(V\beta+n_{k\cdot}^{-ji}+1\right)
}\cdot
\frac{
\Gamma\left(V\beta+n_{k\cdot}^{-ji}\right)
}{
\prod_{w}\Gamma\left(\beta+n_{kw}^{-ji}\right)
}\\
&amp;=\frac{
\Gamma\left(\beta+n_{kv}^{-ji}+1\right)
\Gamma\left(V\beta+n_{k\cdot}^{-ji}\right)
}{
\Gamma\left(V\beta+n_{k\cdot}^{-ji}+1\right)
}\cdot
\frac{
\prod_{w\neq v} \Gamma\left(\beta+n_{kw}^{-ji}\right)
}{
\prod_{w}\Gamma\left(\beta+n_{kw}^{-ji}\right)
}\\
&amp;=\frac{
\Gamma\left(\beta+n_{kv}^{-ji}+1\right)
\Gamma\left(V\beta+n_{k\cdot}^{-ji}\right)
}{
\Gamma\left(V\beta+n_{k\cdot}^{-ji} + 1\right)
\Gamma\left(\beta+n_{kv}^{-ji}\right)
}\\
&amp;=\frac{
\Gamma\left(\beta+n_{kv}^{-ji}+1\right)
}{
\Gamma\left(\beta+n_{kv}^{-ji}\right)
}\cdot
\frac{
\Gamma\left(V\beta+n_{k\cdot}^{-ji}\right)
}{
\Gamma\left(V\beta+n_{k\cdot}^{-ji} + 1\right)
}\\
&amp;=\frac{\beta+n_{kv}^{-ji}}{V\beta+n_{k\cdot}^{-ji}}
\end{aligned}
\end{equation}&lt;p&gt;This is validated in the &lt;a class="link" href="http://arxiv.org/pdf/1201.1657.pdf" target="_blank" rel="noopener"
&gt;appendix of this paper&lt;/a&gt;.&lt;/p&gt;
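The final telescoping step, which uses $\Gamma(x+1)/\Gamma(x) = x$, can also be checked numerically with toy counts (the counts below are invented for illustration):

```python
from math import lgamma, exp

beta, V = 0.5, 6
n_kw = [3, 0, 7, 1, 2, 4]    # toy counts n_{kw}^{-ji} for each word w
v = 2                        # index of the observed word x_ji
n_kv, n_kdot = n_kw[v], sum(n_kw)

# The ratio of Dirichlet normalizing constants from the derivation...
lhs = exp(lgamma(beta + n_kv + 1) - lgamma(beta + n_kv)
          - lgamma(V * beta + n_kdot + 1) + lgamma(V * beta + n_kdot))
# ...equals the closed form, since Gamma(x + 1) / Gamma(x) = x.
rhs = (beta + n_kv) / (V * beta + n_kdot)
print(abs(lhs - rhs) < 1e-12)  # → True
```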
&lt;h5 id="equation"&gt;Equation: $f_{k^\text{new}}^{x_{ji}}(x_{ji})$
&lt;/h5&gt;&lt;p&gt;We also need the prior density of $x_{ji}$ to compute the likelihood that $x_{ji}$ will be seated at a new table.&lt;/p&gt;
\begin{equation}
\begin{aligned}
f_{k^{\text{new}}}^{-x_{ji}}(x_{ji})
&amp;=
\int
f(x_{ji} \,|\, \phi)
h(\phi)d\phi_k
\\
&amp;=\int
\phi_v \cdot
\frac{
\Gamma(V\beta)
}{
\prod_w \Gamma(\beta)
}
\prod_w \phi_w^{\beta-1}
d\phi \\
&amp;=\frac{
\Gamma(V\beta)
}{
\prod_w \Gamma(\beta)
}
\int
\phi_v \cdot
\prod_w \phi_w^{\beta-1}
d\phi\\
&amp;=\frac{
\Gamma(V\beta)
}{
\prod_w \Gamma(\beta)
}
\int
\phi_v^\beta \cdot
\prod_{w\neq v} \phi_w^{\beta-1}
d\phi\\
&amp;=\frac{
\Gamma(V\beta)
}{
\prod_w \Gamma(\beta)
}\cdot
\frac{
\Gamma(\beta+1)\prod_{w\neq v}\Gamma(\beta)
}{
\Gamma(V\beta+1)
}\\
&amp;=\frac{
\Gamma(V\beta)
}{
\Gamma(V \beta+1)
}\cdot
\frac{
\beta\prod_{w}\Gamma(\beta)
}{
\prod_w \Gamma(\beta)
}\\
&amp;=\frac{
1
}{
V \beta
}\cdot
\beta\\
&amp;= \frac{1}{V}
\end{aligned}
\end{equation}&lt;h5 id="equation-31"&gt;Equation (31): $p(x_{ji} \,|\, {\bf t}^{-ji}, t_{ji}=t^{new}, {\bf k})$
&lt;/h5&gt;&lt;p&gt;These last two derivations give us Equation (31), the likelihood that $t_{ji}=t^{new}$:&lt;/p&gt;
\begin{equation}
\begin{aligned}
p(x_{ji} \,|\, {\bf t}^{-ji}, t_{ji}=t^{new}, {\bf k})
&amp;=\sum_{k=1}^K \left[
\frac{
m_{\cdot k}
}{
m_{\cdot\cdot}+\gamma
}\cdot
\frac{\beta+n_{kv}^{-ji}}{V\beta+n_{k\cdot}^{-ji}}
\right] \\
&amp;\phantom{=}+ \frac{
\gamma
}{
m_{\cdot\cdot}+\gamma
}
\cdot \frac{1}{V}
\end{aligned}
\end{equation}&lt;h5 id="equation-32"&gt;Equation (32): $p(t_{ji}=t \,|\, {\bf t}^{-ji}, {\bf k})$
&lt;/h5&gt;&lt;p&gt;From this, we know the conditional distribution of $t_{ji}$ is:&lt;/p&gt;
\begin{equation}
p(t_{ji}=t \,|\, {\bf t}^{-ji}, {\bf k}) \propto
\begin{cases}
n_{jt\cdot}^{-ji} \cdot \frac{\beta+n_{k_{jt}v}^{-ji}}{V\beta+n_{k_{jt}\cdot}^{-ji}}
&amp; {\tiny \text{if } t \text{ previously used,}}\\
{\tiny
\alpha_0 \cdot
p(x_{ji} \,|\, {\bf t}^{-ji}, t_{ji}=t^{new}, {\bf k})}
&amp; {\tiny \text{if } t=t^{\text{new}}}.
\end{cases}
\end{equation}&lt;h5 id="equation-33"&gt;Equation (33): $p(k_{jt^\text{new}}=k \,|\, {\bf t}, {\bf k}^{-jt^\text{new}})$
&lt;/h5&gt;&lt;p&gt;If the sampled value of $t_{ji}$ is $t^{\text{new}}$, we sample a dish $k_{jt^\text{new}}$ for the table with:&lt;/p&gt;
\begin{equation}
p(k_{jt^\text{new}}=k \,|\, {\bf t}, {\bf k}^{-jt^\text{new}}) \propto
\begin{cases}
m_{\cdot k}\cdot\frac{\beta+n_{kv}^{-ji}}{V\beta+n_{k\cdot}^{-ji}}
&amp; {\tiny \text{if } k \text{ previously used,}}\\
\frac{
\gamma
}{V}
&amp; {\tiny \text{if } k=k^{\text{new}}}.
\end{cases}
\end{equation}&lt;h5 id="equation-34"&gt;Equation (34): $p(k_{jt}=k \,|\, {\bf t}, {\bf k}^{-jt})$
&lt;/h5&gt;&lt;p&gt;We need to sample $k_{jt}$ (the dish/topic for table $t$ in restaurant $j$):&lt;/p&gt;
\begin{equation}\displaystyle
p(k_{jt}=k \,|\, {\bf t}, {\bf k}^{-jt}) \propto
\begin{cases}
m_{\cdot k}^{-jt}\cdot f_k^{-{\bf x}_{jt}}({\bf x}_{jt})
&amp; {\tiny \text{if } k \text{ previously used,}}\\
\gamma\cdot f_{k^\text{new}}^{-{\bf x}_{jt}}({\bf x}_{jt})
&amp; {\tiny \text{if } k=k^{\text{new}}}.
\end{cases}
\end{equation}&lt;p&gt;where $f_k^{-{\bf x}_{jt}}({\bf x}_{jt})$ is the &amp;ldquo;conditional density of ${\bf x}_{jt}$ given all data items associated with mixture component $k$ leaving out ${\bf x}_{jt}$&amp;rdquo; (Teh, et al). (${\bf x}_{jt}$ is every customer in restaurant $j$ seated at table $t$). $m_{\cdot k}^{-jt}$ is the number of tables (in all restaurants) serving dish $k$ when we remove table $jt$.&lt;/p&gt;
&lt;p&gt;This requires $f_k^{-{\bf x}_{jt}}({\bf x}_{jt})$, which differs from Equation (30), though the two look quite similar.&lt;/p&gt;
\begin{equation}
\begin{aligned}
f_k^{-{\bf x}_{jt}}({\bf x}_{jt})
&amp;=\frac{\displaystyle
\int
{\prod_{x_{ji}\in {\bf x}_{jt}}}
f(x_{ji} \,|\, \phi_k)
\left[
\prod_{x_{j'i'}\not\in {\bf x}_{jt}, z_{j'i'}=k}
f(x_{j'i'} \,|\, \phi_k)
\right]
h(\phi_k)
d\phi_k
}
{\displaystyle
\int
\left[
\displaystyle\prod_{x_{j'i'}\not\in {\bf x}_{jt}, z_{j'i'}=k}
f(x_{j'i'}|\phi_k)
\right]
h(\phi_k)d\phi_k
} \\
&amp;=\frac{\displaystyle
\int
{\prod_{x_{ji}\in {\bf x}_{jt}}}
\phi_{k x_{ji}}
\left[
\prod_{x_{j'i'}\not\in {\bf x}_{jt}, z_{j'i'}=k}
\phi_{k x_{j'i'}}
\right]
\prod_{w}
\phi_{kw}^{\beta-1}
d\phi_k
}
{\displaystyle
\int
\left[
\displaystyle\prod_{x_{j'i'}\not\in {\bf x}_{jt}, z_{j'i'}=k}
f(x_{j'i'}|\phi_k)
\right]
\prod_{w}
\phi_{kw}^{\beta-1}
d\phi_k
} \\
&amp;=\frac{\displaystyle
\int
{\prod_{x_{ji}\in {\bf x}_{jt}}}
\phi_{k x_{ji}}
\left[
\prod_{x_{j'i'}\not\in {\bf x}_{jt}, z_{j'i'}=k}
\phi_{k x_{j'i'}}
\right]
\prod_{w}
\phi_{kw}^{\beta-1}
d\phi_k
}
{\displaystyle
\int
\left[
\displaystyle\prod_{x_{j'i'}\not\in {\bf x_{jt}}, z_{j'i'}=k}
\phi_{k x_{j'i'}}
\right]
\prod_{w}
\phi_{kw}^{\beta-1}
d\phi_k
}
\end{aligned}
\end{equation}&lt;p&gt;The denominator is&lt;/p&gt;
\begin{equation}
\begin{aligned}
\text{denominator}&amp;=
\int\left[
\prod_{x_{j'i'}\not\in {\bf x_{jt}}, z_{j'i'}=k}
\phi_{k x_{j'i'}}
\right]
\prod_{w}
\phi_{kw}^{\beta-1}
d\phi_k \\
&amp;=\int\left[
\prod_w \phi_{kw}^{n_{kw}^{-jt}} \prod_w \phi_{kw}^{\beta-1}
\right]
d\phi_k \\
&amp;=\int\left[
\prod_w \phi_{kw}^{n_{kw}^{-jt}+\beta-1}
\right]
d\phi_k \\
&amp;=\frac{
\prod_w \Gamma\left(
n_{kw}^{-jt} + \beta
\right)
}{
\Gamma\left( \sum_w
n_{kw}^{-jt}+\beta
\right)
} \\
&amp;=\frac{
\prod_w \Gamma\left(
n_{kw}^{-jt} + \beta
\right)
}{
\Gamma\left(
n_{k\cdot}^{-jt}+V\beta
\right)
}
\end{aligned}
\end{equation}&lt;p&gt;The numerator is&lt;/p&gt;
\begin{equation}
\begin{aligned}
\text{numerator}
&amp;=\int
{\prod_{x_{ji}\in {\bf x}_{jt}}}
\phi_{k x_{ji}}
\left[
\prod_{x_{j'i'}\not\in {\bf x}_{jt}, z_{j'i'}=k}
\phi_{k x_{j'i'}}
\right]
\prod_{w}
\phi_{kw}^{\beta-1}
d\phi_k \\
&amp;=\int
\prod_{w}
\phi_{kw}^{
n_{kw}^{-jt} +
n_{\cdot w}^{jt} +
\beta - 1
}
d\phi_k \\
&amp;=\frac{
\prod_w \Gamma\left(
n_{kw}^{-jt} + n_{\cdot w}^{jt} + \beta
\right)
}{
\Gamma \left(
\sum_w
n_{kw}^{-jt} + n_{\cdot w}^{jt} + \beta
\right)
}\\
&amp;=\frac{
\prod_w \Gamma\left(
n_{kw}^{-jt} + n_{\cdot w}^{jt} + \beta
\right)
}{
\Gamma \left(
n_{k\cdot}^{-jt} + n_{\cdot \cdot}^{jt} + V\beta
\right)
}
\end{aligned}
\end{equation}&lt;p&gt;This gives us a closed-form expression for this conditional density:&lt;/p&gt;
\begin{equation}
\begin{aligned}
f_k^{-{\bf x_{jt}}}({\bf x}_{jt})
&amp;= \displaystyle\frac{
\prod_w \Gamma\left(
n_{kw}^{-jt} + n_{\cdot w}^{jt} + \beta
\right)
}{
\prod_w \Gamma\left(
n_{kw}^{-jt} + \beta
\right)
}
\frac{
\Gamma\left(
n_{k\cdot}^{-jt}+V\beta
\right)
}{
\Gamma \left(
n_{k\cdot}^{-jt} + n_{\cdot \cdot}^{jt} + V\beta
\right)
}.
\end{aligned}
\end{equation}&lt;p&gt;We also need the conditional density when $k$ is a new dish: $f_{k^\text{new}}^{-{\bf x}_{jt}}({\bf x}_{jt})$. Shuyo provides it without derivation:&lt;/p&gt;
\begin{equation}
\begin{aligned}
f_{k^\text{new}}^{-{\bf x}_{jt}}({\bf x}_{jt})
&amp;=\int
\left[
\prod_{x_{ji}\in \mathbf{x_{jt}}}
f(x_{ji} \,|\, \phi)
\right]
h(\phi)\,d\phi
\\
&amp;=\int
\prod_{x_{ji}\in \mathbf{x_{jt}}}
\phi_{x_{ji}}
\cdot
\frac{
\Gamma(V\beta)
}{
\prod_w \Gamma(\beta)
}
\prod_w \phi_w^{\beta-1}
d\phi \\
&amp;=\frac{
\Gamma(V\beta)
}{
\prod_w \Gamma(\beta)
}
\int
\prod_{x_{ji}\in \mathbf{x_{jt}}}
\phi_{x_{ji}}
\cdot
\prod_w \phi_w^{\beta-1}
d\phi\\
&amp;=\frac{
\Gamma(V\beta)
}{
\prod_w \Gamma(\beta)
}
\int
\prod_{x_{ji}\in \mathbf{x_{jt}}}
\phi_{x_{ji}}^{\left(\beta+1\right)-1}
\cdot
\prod_{w\not\in \mathbf{x}_{jt}} \phi_{w}^{\beta-1}
d\phi\\
&amp;=\frac{
\Gamma(V\beta)
}{
\prod_w \Gamma(\beta)
}\cdot
\frac{
\prod_{x_{ji}\in \mathbf{x_{jt}}}
\Gamma(\beta+1)
\prod_{w\not\in \mathbf{x}_{jt}}\Gamma(\beta)
}{
\Gamma(V\beta+\sum_{x_{ji}\in \mathbf{x_{jt}}} 1)
}\\
&amp;=\frac{
\Gamma(V\beta)
\prod_w \Gamma(\beta+n_{\cdot w}^{jt})
}{
\Gamma(V\beta+n_{\cdot\cdot}^{jt})
\prod_w \Gamma(\beta)
}.
\end{aligned}
\end{equation}&lt;p&gt;Given these equations for $f_{k}^{-{\bf x}_{jt}}({\bf x}_{jt})$ and $f_{k^\text{new}}^{-{\bf x}_{jt}}({\bf x}_{jt})$, we can draw samples from $p(k_{jt}=k \,|\, {\bf t}, {\bf k}^{-jt})$ by enumeration over topics. We now have a complete Gibbs sampler for the &lt;a class="link" href="http://www.cs.berkeley.edu/~jordan/papers/hdp.pdf" target="_blank" rel="noopener"
&gt;posterior sampling in the Chinese restaurant franchise&lt;/a&gt; of Teh et al.&lt;/p&gt;</description></item><item><title>Data Science and Agriculture</title><link>https://tdhopper.com/blog/data-science-and-agriculture/</link><pubDate>Thu, 17 Sep 2015 00:00:00 +0000</pubDate><guid>https://tdhopper.com/blog/data-science-and-agriculture/</guid><description>&lt;p&gt;I&amp;rsquo;m excited to see tools developed for the web being applied to offline domains like agriculture and health. I posed a question on Twitter yesterday:&lt;/p&gt;
&lt;blockquote class="twitter-tweet" lang="en"&gt;&lt;p lang="en" dir="ltr"&gt;Who is hiring at the intersection data science and agriculture?&lt;/p&gt;&amp;mdash; Tim Hopper (@tdhopper) &lt;a href="https://twitter.com/tdhopper/status/644138076521021440"&gt;September 16, 2015&lt;/a&gt;&lt;/blockquote&gt;
&lt;script async src="//platform.twitter.com/widgets.js" charset="utf-8"&gt;&lt;/script&gt;
&lt;p&gt;I got a number of replies. Companies in this space (not all hiring for DS) include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://www.climate.com/" target="_blank" rel="noopener"
&gt;The Climate Corporation&lt;/a&gt;: Known for their high-profile acquisition by Monsanto, Climate Corp &amp;ldquo;combines hyper-local weather monitoring, agronomic modeling, and high-resolution weather simulations&amp;rdquo; &amp;ldquo;that help farmers improve profitability by making better informed operating and financing decisions&amp;rdquo;. They are based in San Francisco.&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://farmlogs.com/" target="_blank" rel="noopener"
&gt;FarmLogs&lt;/a&gt;: A YCombinator-backed startup in Ann Arbor, MI, FarmLogs is combining satellite maps, soil surveys, GPS data, and more to &amp;ldquo;expose critical operational data insights&amp;rdquo; to row crop farmers.&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="http://dairymaster.ie/careers/data-scientists/" target="_blank" rel="noopener"
&gt;Dairymaster&lt;/a&gt;: In Ireland, Dairymaster is hiring data scientists for its dairy equipment manufacturing business.&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="http://www.awhere.com/" target="_blank" rel="noopener"
&gt;aWhere&lt;/a&gt;: aWhere &amp;ldquo;delivers agricultural intelligence into the hands of farmers, commercial growers, and policymakers everywhere&amp;rdquo; by collecting detailed weather data from around the world. They are based in Denver.&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="http://www.pulsepod.io/" target="_blank" rel="noopener"
&gt;pulsepod&lt;/a&gt;: This small startup is building hardware to help farmers &amp;ldquo;know when to plant, fertilize, irrigate, and harvest to achieve quality and yield goals using data from your own field&amp;rdquo;. They are based in Princeton, NJ.&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="http://www.granular.ag/" target="_blank" rel="noopener"
&gt;Granular&lt;/a&gt;: &amp;ldquo;Granular, a new kind of farm management software and analytics platform, helps you improve efficiency, profit and yield so you are ready to farm more acres.&amp;rdquo; They are based in San Francisco.&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="http://www.precisionhawk.com/" target="_blank" rel="noopener"
&gt;PrecisionHawk&lt;/a&gt;: Based in my home of Raleigh, NC, PrecisionHawk is &amp;ldquo;an information delivery company that combines unmanned aerial systems, cutting-edge artificial intelligence and remote sensing technologies to improve business operations and day-to-day decision making.&amp;rdquo;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="http://www.mavrx.com" target="_blank" rel="noopener"
&gt;mavrx&lt;/a&gt;: &amp;ldquo;We use aerial imagery to provide actionable insights to the global agriculture industry.&amp;rdquo; They are based in San Francisco.&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="http://www.agsmarts.com/" target="_blank" rel="noopener"
&gt;AgSmarts&lt;/a&gt;: Memphis-based AgSmarts &amp;ldquo;is a Precision Agriculture company that automates existing agricultural irrigation systems with our universal retrofit solution to optimize crop yield, save water and minimize input costs via mesh networks of IP-enabled controllers &amp;amp; environmental sensors.&amp;rdquo;&lt;/li&gt;
&lt;/ul&gt;</description></item><item><title>Classes Future Programmers Should Take</title><link>https://tdhopper.com/blog/classes-future-programmers-should-take/</link><pubDate>Sat, 12 Sep 2015 01:55:00 +0000</pubDate><guid>https://tdhopper.com/blog/classes-future-programmers-should-take/</guid><description>&lt;p&gt;I appreciated &lt;a class="link" href="https://twitter.com/dadgumjames" target="_blank" rel="noopener"
&gt;James Hague&lt;/a&gt;&amp;rsquo;s post on
&lt;a class="link" href="http://prog21.dadgum.com/210.html" target="_blank" rel="noopener"
&gt;Computer Science Courses that Don&amp;rsquo;t Exist, But
Should&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I really like &lt;a class="link" href="http://www.se-radio.net/2009/11/episode-148-software-archaeology-with-dave-thomas/" target="_blank" rel="noopener"
&gt;Dave Thomas&amp;rsquo;s idea of a Software Archaeology class&lt;/a&gt;. I have spent a huge amount of time as a developer reading (vs.
writing) code. I wish I&amp;rsquo;d been taught how to read code effectively.&lt;/p&gt;
&lt;p&gt;Similarly, I have thought there should be a class (or series of classes) in
&amp;ldquo;interacting with others&amp;rsquo; code&amp;rdquo;. Topics could include inheriting a software
project, handing off a software project, writing code using 3rd party
libraries, using package repositories, and understanding software licenses.
These are such important skills in real world software development, but they
seem to be rarely taught in the classroom. Perhaps a follow-up class could be
&amp;ldquo;contributing to open source&amp;rdquo;.&lt;/p&gt;</description></item><item><title>Dealing with Outliers</title><link>https://tdhopper.com/blog/dealing-with-outliers/</link><pubDate>Tue, 08 Sep 2015 00:00:00 +0000</pubDate><guid>https://tdhopper.com/blog/dealing-with-outliers/</guid><description>&lt;p&gt;From &lt;a class="link" href="http://www.pnas.org/content/110/39/15746.full" target="_blank" rel="noopener"
&gt;Testicular volume is inversely correlated with nurturing-related brain activity in human fathers&lt;/a&gt; in PNAS:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;One participant&amp;rsquo;s testes volume measurement was excluded because his value was 2.8 SDs above the mean (mean = 38,064; SD = 11,183) and was more than 13,000 mm^3 larger than any recorded value found in the literature. Of the more than 1,500 healthy, age-matched men in these studies, the largest reported value was 56,000 mm^3, and this participant’s measurement was 69,736 mm^3.&lt;/p&gt;
&lt;/blockquote&gt;</description></item><item><title>Fitting a Mixture Model with Gibbs Sampling</title><link>https://tdhopper.com/blog/mixture-model/</link><pubDate>Wed, 02 Sep 2015 00:00:00 +0000</pubDate><guid>https://tdhopper.com/blog/mixture-model/</guid><description>&lt;img src="https://tdhopper.com/images/mixture-model-mr-men.png" alt="Featured image of post Fitting a Mixture Model with Gibbs Sampling" /&gt;&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt;1
&lt;/span&gt;&lt;span class="lnt"&gt;2
&lt;/span&gt;&lt;span class="lnt"&gt;3
&lt;/span&gt;&lt;span class="lnt"&gt;4
&lt;/span&gt;&lt;span class="lnt"&gt;5
&lt;/span&gt;&lt;span class="lnt"&gt;6
&lt;/span&gt;&lt;span class="lnt"&gt;7
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;pd&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;np&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;random&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;plt&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;scipy&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;stats&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;collections&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;namedtuple&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Counter&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Suppose we receive some data that looks like the following:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt;1
&lt;/span&gt;&lt;span class="lnt"&gt;2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Series&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;from_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;clusters.csv&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hist&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bins&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;&lt;img src="https://tdhopper.com/blog/mixture-model/2015-09-02-fitting-a-mixture-model_3_0.png"
width="377"
height="256"
srcset="https://tdhopper.com/blog/mixture-model/2015-09-02-fitting-a-mixture-model_3_0_hu_746e691434352d36.png 480w, https://tdhopper.com/blog/mixture-model/2015-09-02-fitting-a-mixture-model_3_0_hu_d70ff04bbc9236ad.png 1024w"
loading="lazy"
alt="Histogram of clustered data"
class="gallery-image"
data-flex-grow="147"
data-flex-basis="353px"
&gt;&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;pre&gt;&lt;code&gt;1000
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;It appears that these data exist in three separate clusters. We want to develop a method for finding these &lt;em&gt;latent&lt;/em&gt; clusters. One way to start developing a method is to attempt to describe the process that may have generated these data.&lt;/p&gt;
&lt;p&gt;For simplicity and sanity, let&amp;rsquo;s assume that each data point is generated independently of the others. Moreover, we will assume that within each cluster, the data points are identically distributed. In this case, we will assume each cluster is normally distributed and that each cluster has the same variance, $\sigma^2$.&lt;/p&gt;
&lt;p&gt;Given these assumptions, our data could have been generated by the following process. For each data point, randomly select 1 of 3 clusters from the distribution $\text{Discrete}(\pi_1, \pi_2, \pi_3)$. Each cluster $k$ corresponds to a parameter $\theta_k$; given the selected cluster $k$, sample a data point from $\mathcal{N}(\theta_k, \sigma^2)$.&lt;/p&gt;
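&lt;p&gt;This generative story is easy to simulate. A minimal sketch in Python (the specific values of $\pi$, $\theta$, and $\sigma$ below are illustrative, not the parameters behind the dataset above):&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(42)
pi = np.array([0.3, 0.5, 0.2])      # mixing proportions (illustrative)
theta = np.array([-1.0, 0.0, 1.0])  # cluster means (illustrative)
sigma = 0.1                         # shared standard deviation

def generate(n):
    # For each point: draw a cluster from Discrete(pi),
    # then draw the point from Normal(theta[z], sigma**2).
    z = rng.choice(len(pi), size=n, p=pi)
    x = rng.normal(theta[z], sigma)
    return z, x

z, x = generate(1000)
```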
&lt;p&gt;Equivalently, we could consider these data to be generated from a probability distribution with this probability density function:&lt;/p&gt;
$$
p(x_i \,|\, \pi, \theta_1, \theta_2, \theta_3, \sigma)=
\sum_{k=1}^3 \pi_k\cdot
\frac{1}{\sigma\sqrt{2\pi}}
\text{exp}\left\{
\frac{-(x_i-\theta_k)^2}{2\sigma^2}
\right\}
$$&lt;p&gt;where $\pi$ is a 3-dimensional vector giving the &lt;em&gt;mixing proportions&lt;/em&gt;. In other words, $\pi_k$ describes the proportion of points that occur in cluster $k$.&lt;/p&gt;
&lt;p&gt;That is, &lt;em&gt;the probability distribution describing $x$ is a linear combination of normal distributions&lt;/em&gt;.&lt;/p&gt;
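&lt;p&gt;Concretely, the density can be evaluated as a weighted sum of normal densities (a quick sketch; the parameter values here are illustrative):&lt;/p&gt;

```python
import numpy as np
from scipy import stats

def mixture_pdf(x, pi, theta, sigma):
    # p(x) = sum_k pi_k * Normal(x; theta_k, sigma^2):
    # a convex combination of normal density functions.
    return sum(p * stats.norm.pdf(x, loc=t, scale=sigma)
               for p, t in zip(pi, theta))

density = mixture_pdf(0.0, [0.3, 0.5, 0.2], [-1.0, 0.0, 1.0], 0.1)
```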
&lt;p&gt;We want to use this &lt;em&gt;generative&lt;/em&gt; model to formulate an algorithm for determining the particular parameters that generated the dataset above. The $\pi$ vector is unknown to us, as is each cluster mean $\theta_k$.&lt;/p&gt;
&lt;p&gt;We would also like to know $z_i\in\{1, 2, 3\}$, the latent cluster for each point. It turns out that introducing $z_i$ into our model will help us solve for the other values.&lt;/p&gt;
&lt;p&gt;The joint distribution of our observed data (&lt;code&gt;data&lt;/code&gt;) along with the assignment variables is given by:&lt;/p&gt;
\begin{align}
p(\mathbf{x}, \mathbf{z} \,|\, \pi, \theta_1, \theta_2, \theta_3, \sigma)&amp;=
p(\mathbf{z} \,|\, \pi)
p(\mathbf{x} \,|\, \mathbf{z}, \theta_1, \theta_2, \theta_3, \sigma)\\
&amp;= \prod_{i=1}^N p(z_i \,|\, \pi)
\prod_{i=1}^N p(x_i \,|\, z_i, \theta_1, \theta_2, \theta_3, \sigma) \\
&amp;= \prod_{i=1}^N \pi_{z_i}
\prod_{i=1}^N
\frac{1}{\sigma\sqrt{2\pi}}
\text{exp}\left\{
\frac{-(x_i-\theta_{z_i})^2}{2\sigma^2}
\right\}\\
&amp;= \prod_{i=1}^N
\left(
\pi_{z_i}
\frac{1}{\sigma\sqrt{2\pi}}
\text{exp}\left\{
\frac{-(x_i-\theta_{z_i})^2}{2\sigma^2}
\right\}
\right)\\
&amp;=
\prod_{i=1}^N
\prod_{k=1}^K
\left(
\pi_k
\frac{1}{\sigma\sqrt{2\pi}}
\text{exp}\left\{
\frac{-(x_i-\theta_k)^2}{2\sigma^2}
\right\}
\right)^{\delta(z_i, k)}
\end{align}&lt;h3 id="keeping-everything-straight"&gt;Keeping Everything Straight
&lt;/h3&gt;&lt;p&gt;Before moving on, we need to devise a way to keep all our data and parameters straight. Following ideas suggested by &lt;a class="link" href="http://people.csail.mit.edu/bonawitz/Composable%20Probabilistic%20Inference%20with%20Blaise%20-%20Keith%20Bonawitz%20PhD%20Thesis.pdf" target="_blank" rel="noopener"
&gt;Keith Bonawitz&lt;/a&gt;, let&amp;rsquo;s define a &amp;ldquo;state&amp;rdquo; object to store all of this data.&lt;/p&gt;
&lt;p&gt;It won&amp;rsquo;t yet be clear why we are defining some components of &lt;code&gt;state&lt;/code&gt;; however, we will use each part eventually! As an attempt at clarity, I am using a trailing underscore in the names of members that are fixed. We will update the other parameters as we try to fit the model.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt; 1
&lt;/span&gt;&lt;span class="lnt"&gt; 2
&lt;/span&gt;&lt;span class="lnt"&gt; 3
&lt;/span&gt;&lt;span class="lnt"&gt; 4
&lt;/span&gt;&lt;span class="lnt"&gt; 5
&lt;/span&gt;&lt;span class="lnt"&gt; 6
&lt;/span&gt;&lt;span class="lnt"&gt; 7
&lt;/span&gt;&lt;span class="lnt"&gt; 8
&lt;/span&gt;&lt;span class="lnt"&gt; 9
&lt;/span&gt;&lt;span class="lnt"&gt;10
&lt;/span&gt;&lt;span class="lnt"&gt;11
&lt;/span&gt;&lt;span class="lnt"&gt;12
&lt;/span&gt;&lt;span class="lnt"&gt;13
&lt;/span&gt;&lt;span class="lnt"&gt;14
&lt;/span&gt;&lt;span class="lnt"&gt;15
&lt;/span&gt;&lt;span class="lnt"&gt;16
&lt;/span&gt;&lt;span class="lnt"&gt;17
&lt;/span&gt;&lt;span class="lnt"&gt;18
&lt;/span&gt;&lt;span class="lnt"&gt;19
&lt;/span&gt;&lt;span class="lnt"&gt;20
&lt;/span&gt;&lt;span class="lnt"&gt;21
&lt;/span&gt;&lt;span class="lnt"&gt;22
&lt;/span&gt;&lt;span class="lnt"&gt;23
&lt;/span&gt;&lt;span class="lnt"&gt;24
&lt;/span&gt;&lt;span class="lnt"&gt;25
&lt;/span&gt;&lt;span class="lnt"&gt;26
&lt;/span&gt;&lt;span class="lnt"&gt;27
&lt;/span&gt;&lt;span class="lnt"&gt;28
&lt;/span&gt;&lt;span class="lnt"&gt;29
&lt;/span&gt;&lt;span class="lnt"&gt;30
&lt;/span&gt;&lt;span class="lnt"&gt;31
&lt;/span&gt;&lt;span class="lnt"&gt;32
&lt;/span&gt;&lt;span class="lnt"&gt;33
&lt;/span&gt;&lt;span class="lnt"&gt;34
&lt;/span&gt;&lt;span class="lnt"&gt;35
&lt;/span&gt;&lt;span class="lnt"&gt;36
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;SuffStat&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;namedtuple&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;SuffStat&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;theta N&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;update_suffstats&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;cluster_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;N&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;Counter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;assignment&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;iteritems&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;points_in_cluster&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cid&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;zip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;data_&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;assignment&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;cid&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;cluster_id&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;mean&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;points_in_cluster&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;suffstats&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;cluster_id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SuffStat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;initial_state&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;num_clusters&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;alpha&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;cluster_ids&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;num_clusters&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="s1"&gt;&amp;#39;cluster_ids_&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;cluster_ids&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="s1"&gt;&amp;#39;data_&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="s1"&gt;&amp;#39;num_clusters_&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;num_clusters&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="s1"&gt;&amp;#39;cluster_variance_&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;.01&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="s1"&gt;&amp;#39;alpha_&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="s1"&gt;&amp;#39;hyperparameters_&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="s2"&gt;&amp;#34;mean&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="s2"&gt;&amp;#34;variance&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;},&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="s1"&gt;&amp;#39;suffstats&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="s1"&gt;&amp;#39;assignment&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cluster_ids&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="s1"&gt;&amp;#39;pi&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;alpha&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;num_clusters&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;cluster_ids&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="s1"&gt;&amp;#39;cluster_means&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;update_suffstats&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;initial_state&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt;1
&lt;/span&gt;&lt;span class="lnt"&gt;2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;pre&gt;&lt;code&gt;num_clusters_
suffstats
data_
cluster_means
cluster_variance_
cluster_ids_
assignment
pi
alpha_
hyperparameters_
&lt;/code&gt;&lt;/pre&gt;
&lt;h3 id="gibbs-sampling"&gt;Gibbs Sampling
&lt;/h3&gt;&lt;p&gt;The &lt;a class="link" href="https://en.wikipedia.org/wiki/Gibbs_sampling" target="_blank" rel="noopener"
&gt;theory of Gibbs sampling&lt;/a&gt; tells us that given some data $\bf y$ and a probability distribution $p$ parameterized by $\gamma_1, \ldots, \gamma_d$, we can successively draw samples from the distribution by sampling from&lt;/p&gt;
$$\gamma_j^{(t)}\sim p(\gamma_j \,|\, \gamma_{\neg j}^{(t-1)})$$&lt;p&gt;where $\gamma_{\neg j}^{(t-1)}$ denotes the current values of all the $\gamma_i$ except $\gamma_j$. If we sample this way long enough, the draws will converge to samples from $p$.&lt;/p&gt;
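&lt;p&gt;Before turning to our mixture model, here is a minimal, self-contained sketch of this generic scheme on a toy target: a bivariate normal with correlation $\rho$, whose full conditionals are themselves normal, $\gamma_1 \,|\, \gamma_2 \sim \mathcal{N}(\rho\gamma_2, 1-\rho^2)$ and symmetrically for $\gamma_2$. The target and the value $\rho=0.8$ are made up for illustration; they are not part of the model below.&lt;/p&gt;

```python
import numpy as np

# Toy Gibbs sampler: alternately draw each coordinate from its full
# conditional; the pairs (x, y) converge to draws from the joint.
rng = np.random.default_rng(0)
rho = 0.8                      # illustrative correlation
sd = np.sqrt(1 - rho ** 2)     # conditional standard deviation
x = y = 0.0
samples = []
for t in range(20000):
    x = rng.normal(rho * y, sd)  # sample gamma_1 | gamma_2
    y = rng.normal(rho * x, sd)  # sample gamma_2 | gamma_1
    samples.append((x, y))
samples = np.array(samples[1000:])  # discard burn-in

print(samples.mean(axis=0))          # roughly [0, 0]
print(np.corrcoef(samples.T)[0, 1])  # roughly 0.8
```

&lt;p&gt;The three updates we derive next (assignments, mixture weights, cluster means) follow exactly this pattern: sample each block of parameters from its full conditional while holding the others fixed.&lt;/p&gt;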
&lt;p&gt;In deriving a Gibbs sampler, it is often helpful to observe that&lt;/p&gt;
$$
p(\gamma_j \,|\, \gamma_{\neg j})
= \frac{
p(\gamma_1,\ldots,\gamma_d)
}{
p(\gamma_{\neg j})
} \propto p(\gamma_1,\ldots,\gamma_d).
$$&lt;p&gt;The conditional distribution is proportional to the joint distribution. We will get a lot of mileage out of this simple observation: when sampling $\gamma_j$, we can drop from the joint distribution any factors that are constant with respect to $\gamma_j$.&lt;/p&gt;
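&lt;p&gt;A quick numerical sanity check of this observation, using a small made-up discrete joint: normalizing a slice of the joint recovers the conditional, and any multiplicative factor we drop along the way washes out in the normalization.&lt;/p&gt;

```python
import numpy as np

# Hypothetical joint p(g1, g2) over two discrete variables
# (rows index g1, columns index g2); the values are made up.
joint = np.array([[0.10, 0.20, 0.10],
                  [0.05, 0.40, 0.15]])

# Conditional p(g2 | g1=0) two ways: by definition (joint / marginal) ...
direct = joint[0] / joint[0].sum()
# ... and via "proportional to the joint": drop a multiplicative
# constant, then renormalize over g2.
unnormalized = 3.7 * joint[0]   # 3.7 stands in for dropped constant factors
via_prop = unnormalized / unnormalized.sum()

print(direct)  # proportions 0.25, 0.5, 0.25
assert np.allclose(direct, via_prop)
```

&lt;p&gt;This is exactly what the assignment sampler below does: compute unnormalized scores for each cluster and normalize them before sampling.&lt;/p&gt;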
&lt;p&gt;The $\gamma$ values in our model are each of the $\theta_k$ values, the $z_i$ values, and the $\pi_k$ values. Thus, we need to derive the conditional distributions for each of these.&lt;/p&gt;
&lt;p&gt;Many derivations of Gibbs samplers that I have seen rely on handwaving and casual appeals to conjugacy. I have tried to add more mathematical detail here; I would gladly accept feedback on how to present the derivations more clearly! I have also tried to make the derivations concrete by immediately providing code for the computations in this specific case.&lt;/p&gt;
&lt;h4 id="conditional-distribution-of-assignment"&gt;Conditional Distribution of Assignment
&lt;/h4&gt;&lt;p&gt;For brevity, we will use&lt;/p&gt;
$$
p(z_i=k \,|\, \cdot)=
p(z_i=k \,|\,
z_{\neg i}, \pi,
\theta_1, \theta_2, \theta_3, \sigma, \bf x
).
$$&lt;p&gt;Because cluster assignments are conditionally independent given the cluster weights and parameters,&lt;/p&gt;
\begin{align}
p(z_i=k \,|\, \cdot)
&amp;\propto
\prod_{i=1}^N
\prod_{k=1}^K
\left(
\pi_k
\frac{1}{\sigma\sqrt{2\pi}}
\text{exp}\left\{
\frac{-(x_i-\theta_k)^2}{2\sigma^2}
\right\}
\right)^{\delta(z_i, k)} \\
&amp;\propto
\pi_k \cdot
\frac{1}{\sigma\sqrt{2\pi}}
\text{exp}\left\{
\frac{-(x_i-\theta_k)^2}{2\sigma^2}
\right\}
\end{align}&lt;p&gt;This equation makes intuitive sense: point $i$ is more likely to be assigned to cluster $k$ if $k$ itself is probable ($\pi_k\gg 0$) and $x_i$ is close to the cluster mean $\theta_k$.&lt;/p&gt;
&lt;p&gt;For each data point $i$, we can compute $p(z_i=k \,|\, \cdot)$ for each cluster $k$. These values are the unnormalized weights of a discrete distribution from which we can sample an assignment.&lt;/p&gt;
&lt;p&gt;Below, we define functions for doing this sampling. &lt;code&gt;sample_assignment&lt;/code&gt; will generate a sample from the posterior assignment distribution for the specified data point. &lt;code&gt;update_assignment&lt;/code&gt; will sample from the posterior assignment for each data point and update the &lt;code&gt;state&lt;/code&gt; object.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt; 1
&lt;/span&gt;&lt;span class="lnt"&gt; 2
&lt;/span&gt;&lt;span class="lnt"&gt; 3
&lt;/span&gt;&lt;span class="lnt"&gt; 4
&lt;/span&gt;&lt;span class="lnt"&gt; 5
&lt;/span&gt;&lt;span class="lnt"&gt; 6
&lt;/span&gt;&lt;span class="lnt"&gt; 7
&lt;/span&gt;&lt;span class="lnt"&gt; 8
&lt;/span&gt;&lt;span class="lnt"&gt; 9
&lt;/span&gt;&lt;span class="lnt"&gt;10
&lt;/span&gt;&lt;span class="lnt"&gt;11
&lt;/span&gt;&lt;span class="lnt"&gt;12
&lt;/span&gt;&lt;span class="lnt"&gt;13
&lt;/span&gt;&lt;span class="lnt"&gt;14
&lt;/span&gt;&lt;span class="lnt"&gt;15
&lt;/span&gt;&lt;span class="lnt"&gt;16
&lt;/span&gt;&lt;span class="lnt"&gt;17
&lt;/span&gt;&lt;span class="lnt"&gt;18
&lt;/span&gt;&lt;span class="lnt"&gt;19
&lt;/span&gt;&lt;span class="lnt"&gt;20
&lt;/span&gt;&lt;span class="lnt"&gt;21
&lt;/span&gt;&lt;span class="lnt"&gt;22
&lt;/span&gt;&lt;span class="lnt"&gt;23
&lt;/span&gt;&lt;span class="lnt"&gt;24
&lt;/span&gt;&lt;span class="lnt"&gt;25
&lt;/span&gt;&lt;span class="lnt"&gt;26
&lt;/span&gt;&lt;span class="lnt"&gt;27
&lt;/span&gt;&lt;span class="lnt"&gt;28
&lt;/span&gt;&lt;span class="lnt"&gt;29
&lt;/span&gt;&lt;span class="lnt"&gt;30
&lt;/span&gt;&lt;span class="lnt"&gt;31
&lt;/span&gt;&lt;span class="lnt"&gt;32
&lt;/span&gt;&lt;span class="lnt"&gt;33
&lt;/span&gt;&lt;span class="lnt"&gt;34
&lt;/span&gt;&lt;span class="lnt"&gt;35
&lt;/span&gt;&lt;span class="lnt"&gt;36
&lt;/span&gt;&lt;span class="lnt"&gt;37
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;log_assignment_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cluster_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="s2"&gt;&amp;#34;&amp;#34;&amp;#34;log p(z_i=k \,|\, \cdot)
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; We compute these scores in log space for numerical stability.
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; &amp;#34;&amp;#34;&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;data_&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;data_id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;theta&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;cluster_means&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;cluster_id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;var&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;cluster_variance_&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;log_pi&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;pi&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;cluster_id&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;log_pi&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;stats&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;norm&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;logpdf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;theta&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;var&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;assigment_probs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="s2"&gt;&amp;#34;&amp;#34;&amp;#34;p(z_i=cid \,|\, \cdot) for cid in cluster_ids
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; &amp;#34;&amp;#34;&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;scores&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;log_assignment_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;cid&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;cluster_ids_&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;scores&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;scores&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;sample_assignment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="s2"&gt;&amp;#34;&amp;#34;&amp;#34;Sample cluster assignment for data_id given current state
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; cf Step 1 of Algorithm 2.1 in Sudderth 2006
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; &amp;#34;&amp;#34;&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;assigment_probs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;cluster_ids_&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;update_assignment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="s2"&gt;&amp;#34;&amp;#34;&amp;#34;Update cluster assignment for each data point given current state
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; cf Step 1 of Algorithm 2.1 in Sudderth 2006
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; &amp;#34;&amp;#34;&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;data_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;data_&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;assignment&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;data_id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sample_assignment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;update_suffstats&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;h4 id="conditional-distribution-of-mixture-weights"&gt;Conditional Distribution of Mixture Weights
&lt;/h4&gt;&lt;p&gt;We can similarly derive the conditional distribution of the mixture weights by an application of Bayes&amp;rsquo; theorem. Instead of updating each component of $\pi$ separately, we update them all together (this is called blocked Gibbs sampling).&lt;/p&gt;
\begin{align}
p(\pi \,|\, \cdot)&amp;=
p(\pi \,|\,
\bf{z},
\theta_1, \theta_2, \theta_3,
\sigma, \mathbf{x}, \alpha
)\\
&amp;\propto
p(\pi \,|\,
\mathbf{x},
\theta_1, \theta_2, \theta_3,
\sigma, \alpha
)
p(\bf{z}\ \,|\,
\mathbf{x},
\theta_1, \theta_2, \theta_3,
\sigma, \pi, \alpha
)\\
&amp;=
p(\pi \,|\,
\alpha
)
p(\bf{z}\ \,|\,
\mathbf{x},
\theta_1, \theta_2, \theta_3,
\sigma, \pi, \alpha
)\\
&amp;=
\prod_{k=1}^K \pi_k^{\alpha/K - 1}
\prod_{k=1}^K \pi_k^{\sum_{i=1}^N \delta(z_i, k)} \\
&amp;=\prod_{k=1}^K \pi_k^{\alpha/K+\sum_{i=1}^N \delta(z_i, k)-1}\\
&amp;\propto \text{Dir}\left(
\sum_{i=1}^N \delta(z_i, 1)+\alpha/K,
\sum_{i=1}^N \delta(z_i, 2)+\alpha/K,
\sum_{i=1}^N \delta(z_i, 3)+\alpha/K
\right)
\end{align}&lt;p&gt;Here are Python functions to sample from the mixture weights given the current &lt;code&gt;state&lt;/code&gt; and to update the mixture weights in the &lt;code&gt;state&lt;/code&gt; object.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt; 1
&lt;/span&gt;&lt;span class="lnt"&gt; 2
&lt;/span&gt;&lt;span class="lnt"&gt; 3
&lt;/span&gt;&lt;span class="lnt"&gt; 4
&lt;/span&gt;&lt;span class="lnt"&gt; 5
&lt;/span&gt;&lt;span class="lnt"&gt; 6
&lt;/span&gt;&lt;span class="lnt"&gt; 7
&lt;/span&gt;&lt;span class="lnt"&gt; 8
&lt;/span&gt;&lt;span class="lnt"&gt; 9
&lt;/span&gt;&lt;span class="lnt"&gt;10
&lt;/span&gt;&lt;span class="lnt"&gt;11
&lt;/span&gt;&lt;span class="lnt"&gt;12
&lt;/span&gt;&lt;span class="lnt"&gt;13
&lt;/span&gt;&lt;span class="lnt"&gt;14
&lt;/span&gt;&lt;span class="lnt"&gt;15
&lt;/span&gt;&lt;span class="lnt"&gt;16
&lt;/span&gt;&lt;span class="lnt"&gt;17
&lt;/span&gt;&lt;span class="lnt"&gt;18
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;sample_mixture_weights&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="s2"&gt;&amp;#34;&amp;#34;&amp;#34;Sample new mixture weights from current state according to
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; a Dirichlet distribution
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; cf Step 2 of Algorithm 2.1 in Sudderth 2006
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; &amp;#34;&amp;#34;&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;ss&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;suffstats&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;alpha&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;ss&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;cid&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;alpha_&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;num_clusters_&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;cid&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;cluster_ids_&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;stats&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dirichlet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rvs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;flatten&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;update_mixture_weights&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="s2"&gt;&amp;#34;&amp;#34;&amp;#34;Update state with new mixture weights from current state
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; sampled according to a Dirichlet distribution
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; cf Step 2 of Algorithm 2.1 in Sudderth 2006
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; &amp;#34;&amp;#34;&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;pi&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sample_mixture_weights&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;h4 id="conditional-distribution-of-cluster-means"&gt;Conditional Distribution of Cluster Means
&lt;/h4&gt;&lt;p&gt;Finally, we need to compute the conditional distribution for the cluster means.&lt;/p&gt;
&lt;p&gt;We assume the unknown cluster means are distributed according to a normal distribution with hyperparameter mean $\lambda_1$ and variance $\lambda_2^2$. The final step in this derivation comes from the normal-normal conjugacy. For more information see &lt;a class="link" href="http://www.cs.ubc.ca/~murphyk/Papers/bayesGauss.pdf" target="_blank" rel="noopener"
&gt;section 2.3 of this&lt;/a&gt; and &lt;a class="link" href="https://web.archive.org/web/20160304125731/http://fisher.osu.edu/~schroeder.9/AMIS900/ech6.pdf" target="_blank" rel="noopener"
&gt;section 6.2 of this&lt;/a&gt;.&lt;/p&gt;
\begin{align}
p(\theta_k \,|\, \cdot)&amp;=
p(\theta_k \,|\,
\bf{z}, \pi,
\theta_{\neg k},
\sigma, \bf x, \lambda_1, \lambda_2
) \\
&amp;\propto p(\left\{x_i \,|\, z_i=k\right\} \,|\, \bf{z}, \pi,
\theta_1, \theta_2, \theta_3,
\sigma, \lambda_1, \lambda_2) \cdot\\
&amp;\phantom{==}p(\theta_k \,|\, \bf{z}, \pi,
\theta_{\neg k},
\sigma, \lambda_1, \lambda_2)\\
&amp;\propto p(\left\{x_i \,|\, z_i=k\right\} \,|\, \mathbf{z},
\theta_k, \sigma)
p(\theta_k \,|\, \lambda_1, \lambda_2)\\
&amp;= \mathcal{N}(\theta_k \,|\, \mu_n, \sigma_n^2)\\
\end{align}&lt;p&gt;where&lt;/p&gt;$$ \sigma_n^2 = \frac{1}{
\frac{1}{\lambda_2^2} + \frac{N_k}{\sigma^2}
} $$&lt;p&gt;and&lt;/p&gt;
$$\mu_n = \sigma_n^2
\left(
\frac{\lambda_1}{\lambda_2^2} +
\frac{N_k\bar{x}_k}{\sigma^2}
\right)
$$&lt;p&gt;where $N_k$ is the number of points currently assigned to cluster $k$ and $\bar{x}_k$ is their sample mean. Here is the code for sampling those means and for updating our state accordingly.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt; 1
&lt;/span&gt;&lt;span class="lnt"&gt; 2
&lt;/span&gt;&lt;span class="lnt"&gt; 3
&lt;/span&gt;&lt;span class="lnt"&gt; 4
&lt;/span&gt;&lt;span class="lnt"&gt; 5
&lt;/span&gt;&lt;span class="lnt"&gt; 6
&lt;/span&gt;&lt;span class="lnt"&gt; 7
&lt;/span&gt;&lt;span class="lnt"&gt; 8
&lt;/span&gt;&lt;span class="lnt"&gt; 9
&lt;/span&gt;&lt;span class="lnt"&gt;10
&lt;/span&gt;&lt;span class="lnt"&gt;11
&lt;/span&gt;&lt;span class="lnt"&gt;12
&lt;/span&gt;&lt;span class="lnt"&gt;13
&lt;/span&gt;&lt;span class="lnt"&gt;14
&lt;/span&gt;&lt;span class="lnt"&gt;15
&lt;/span&gt;&lt;span class="lnt"&gt;16
&lt;/span&gt;&lt;span class="lnt"&gt;17
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;sample_cluster_mean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cluster_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;cluster_var&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;cluster_variance_&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;hp_mean&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;hyperparameters_&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;mean&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;hp_var&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;hyperparameters_&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;variance&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;ss&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;suffstats&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;cluster_id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;numerator&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hp_mean&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;hp_var&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;ss&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;theta&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;ss&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;cluster_var&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;denominator&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;hp_var&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;ss&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;cluster_var&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;posterior_mu&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;numerator&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;denominator&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;posterior_var&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;denominator&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;stats&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;norm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;posterior_mu&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;posterior_var&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rvs&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;update_cluster_means&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;cluster_means&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;sample_cluster_mean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;cid&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;cluster_ids_&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Doing each of these three updates in sequence makes a complete &lt;em&gt;Gibbs step&lt;/em&gt; for our mixture model. Here is a function to do that:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt;1
&lt;/span&gt;&lt;span class="lnt"&gt;2
&lt;/span&gt;&lt;span class="lnt"&gt;3
&lt;/span&gt;&lt;span class="lnt"&gt;4
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;gibbs_step&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;update_assignment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;update_mixture_weights&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;update_cluster_means&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Initially, we assigned each data point to a random cluster. We can see this by plotting a histogram of each cluster.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt; 1
&lt;/span&gt;&lt;span class="lnt"&gt; 2
&lt;/span&gt;&lt;span class="lnt"&gt; 3
&lt;/span&gt;&lt;span class="lnt"&gt; 4
&lt;/span&gt;&lt;span class="lnt"&gt; 5
&lt;/span&gt;&lt;span class="lnt"&gt; 6
&lt;/span&gt;&lt;span class="lnt"&gt; 7
&lt;/span&gt;&lt;span class="lnt"&gt; 8
&lt;/span&gt;&lt;span class="lnt"&gt; 9
&lt;/span&gt;&lt;span class="lnt"&gt;10
&lt;/span&gt;&lt;span class="lnt"&gt;11
&lt;/span&gt;&lt;span class="lnt"&gt;12
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;plot_clusters&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;gby&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="s1"&gt;&amp;#39;data&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;data_&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="s1"&gt;&amp;#39;assignment&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;assignment&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;groupby&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;by&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;assignment&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;data&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;hist_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;gby&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_group&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cid&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tolist&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;cid&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;gby&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;groups&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keys&lt;/span&gt;&lt;span class="p"&gt;()]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hist&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hist_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;bins&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;histtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;stepfilled&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;.5&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;plot_clusters&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;&lt;img src="https://tdhopper.com/blog/mixture-model/2015-09-02-fitting-a-mixture-model_18_0.png"
width="371"
height="256"
srcset="https://tdhopper.com/blog/mixture-model/2015-09-02-fitting-a-mixture-model_18_0_hu_74aade54de2321.png 480w, https://tdhopper.com/blog/mixture-model/2015-09-02-fitting-a-mixture-model_18_0_hu_bad9282f49896216.png 1024w"
loading="lazy"
alt="Random initial cluster assignments"
class="gallery-image"
data-flex-grow="144"
data-flex-basis="347px"
&gt;&lt;/p&gt;
&lt;p&gt;Each time we run &lt;code&gt;gibbs_step&lt;/code&gt;, our &lt;code&gt;state&lt;/code&gt; is updated with newly sampled assignments. Look what happens to our histogram after 5 steps:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt;1
&lt;/span&gt;&lt;span class="lnt"&gt;2
&lt;/span&gt;&lt;span class="lnt"&gt;3
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;gibbs_step&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;plot_clusters&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;&lt;img src="https://tdhopper.com/blog/mixture-model/2015-09-02-fitting-a-mixture-model_20_0.png"
width="377"
height="256"
srcset="https://tdhopper.com/blog/mixture-model/2015-09-02-fitting-a-mixture-model_20_0_hu_3b9ab8eb2b89b326.png 480w, https://tdhopper.com/blog/mixture-model/2015-09-02-fitting-a-mixture-model_20_0_hu_38438ae42e47a343.png 1024w"
loading="lazy"
alt="Cluster assignments after 5 Gibbs steps"
class="gallery-image"
data-flex-grow="147"
data-flex-basis="353px"
&gt;&lt;/p&gt;
&lt;p&gt;Suddenly, we are seeing clusters that appear very similar to what we would intuitively expect: three Gaussian clusters.&lt;/p&gt;
&lt;p&gt;Another way to see the progress made by the Gibbs sampler is to plot the change in the model&amp;rsquo;s log-likelihood after each step. The log-likelihood is given by:&lt;/p&gt;
$$
\log p(\mathbf{x} \,|\, \pi, \theta_1, \theta_2, \theta_3)
\propto \sum_x \log \left(
\sum_{k=1}^3 \pi_k \exp
\left\{
-(x-\theta_k)^2 / (2\sigma^2)
\right\}
\right)
$$&lt;p&gt;We can define this as a function of our &lt;code&gt;state&lt;/code&gt; object:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt; 1
&lt;/span&gt;&lt;span class="lnt"&gt; 2
&lt;/span&gt;&lt;span class="lnt"&gt; 3
&lt;/span&gt;&lt;span class="lnt"&gt; 4
&lt;/span&gt;&lt;span class="lnt"&gt; 5
&lt;/span&gt;&lt;span class="lnt"&gt; 6
&lt;/span&gt;&lt;span class="lnt"&gt; 7
&lt;/span&gt;&lt;span class="lnt"&gt; 8
&lt;/span&gt;&lt;span class="lnt"&gt; 9
&lt;/span&gt;&lt;span class="lnt"&gt;10
&lt;/span&gt;&lt;span class="lnt"&gt;11
&lt;/span&gt;&lt;span class="lnt"&gt;12
&lt;/span&gt;&lt;span class="lnt"&gt;13
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;log_likelihood&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="s2"&gt;&amp;#34;&amp;#34;&amp;#34;Data log-likeliehood
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; Equation 2.153 in Sudderth
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt; &amp;#34;&amp;#34;&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;ll&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;data_&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;pi&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;pi&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;mean&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;cluster_means&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;sd&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;cluster_variance_&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;ll&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pi&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;stats&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;norm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sd&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pdf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;ll&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt;1
&lt;/span&gt;&lt;span class="lnt"&gt;2
&lt;/span&gt;&lt;span class="lnt"&gt;3
&lt;/span&gt;&lt;span class="lnt"&gt;4
&lt;/span&gt;&lt;span class="lnt"&gt;5
&lt;/span&gt;&lt;span class="lnt"&gt;6
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;initial_state&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;ll&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;log_likelihood&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;gibbs_step&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;ll&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;log_likelihood&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Series&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ll&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;&lt;img src="https://tdhopper.com/blog/mixture-model/2015-09-02-fitting-a-mixture-model_23_1.png"
width="390"
height="256"
srcset="https://tdhopper.com/blog/mixture-model/2015-09-02-fitting-a-mixture-model_23_1_hu_119c8a58f507ae94.png 480w, https://tdhopper.com/blog/mixture-model/2015-09-02-fitting-a-mixture-model_23_1_hu_d07c747be6ccefc1.png 1024w"
loading="lazy"
alt="Log-likelihood over Gibbs sampling iterations"
class="gallery-image"
data-flex-grow="152"
data-flex-basis="365px"
&gt;&lt;/p&gt;
&lt;p&gt;See that the log likelihood improves with iterations of the Gibbs sampler. This is what we should expect: the Gibbs sampler finds state configurations that make the data we have seem &amp;ldquo;likely&amp;rdquo;. However, the likelihood isn&amp;rsquo;t strictly monotonic: it jitters up and down. Though it behaves similarly, the Gibbs sampler isn&amp;rsquo;t optimizing the likelihood function. In its steady state, it is sampling from the posterior distribution. The &lt;code&gt;state&lt;/code&gt; after each step of the Gibbs sampler is a sample from the posterior.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Series&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ll&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ylim&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;150&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;&lt;img src="https://tdhopper.com/blog/mixture-model/2015-09-02-fitting-a-mixture-model_25_1.png"
width="384"
height="256"
srcset="https://tdhopper.com/blog/mixture-model/2015-09-02-fitting-a-mixture-model_25_1_hu_32e7829d8a57a400.png 480w, https://tdhopper.com/blog/mixture-model/2015-09-02-fitting-a-mixture-model_25_1_hu_939234e0f3886b43.png 1024w"
loading="lazy"
alt="Log-likelihood zoomed in"
class="gallery-image"
data-flex-grow="150"
data-flex-basis="360px"
&gt;&lt;/p&gt;
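&lt;p&gt;Since each post-burn-in &lt;code&gt;state&lt;/code&gt; is a posterior sample, one simple way to summarize the sampler&amp;rsquo;s output is to average the sampled cluster means after discarding an initial burn-in. Below is a minimal, self-contained sketch of that idea on a two-cluster problem; it does not reuse the &lt;code&gt;state&lt;/code&gt; machinery from this post, and all numerical values are illustrative:&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(1)

# A self-contained sketch (not the post's exact code): a two-cluster Gibbs
# sampler with known variance and fixed, equal mixture weights. We keep the
# cluster-mean samples drawn after a burn-in period; their average estimates
# the posterior mean of each cluster center.
sigma2 = 0.25                         # known cluster variance
data = np.concatenate([rng.normal(-2, 0.5, 100), rng.normal(2, 0.5, 100)])
means = np.array([-1.0, 1.0])         # initial cluster means
hp_mean, hp_var = 0.0, 100.0          # vague prior on each cluster mean

samples = []
for step in range(200):
    # 1. sample assignments given the current means (equal weights)
    logp = -(data[:, None] - means[None, :]) ** 2 / (2 * sigma2)
    p = np.exp(logp - logp.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    z = np.array([rng.choice(2, p=row) for row in p])
    # 2. sample each mean from its conjugate normal posterior
    for k in range(2):
        x = data[z == k]
        precision = 1.0 / hp_var + len(x) / sigma2
        mu = (hp_mean / hp_var + x.sum() / sigma2) / precision
        means[k] = rng.normal(mu, np.sqrt(1.0 / precision))
    if step >= 50:                    # discard burn-in samples
        samples.append(means.copy())

print(np.mean(samples, axis=0))       # roughly [-2, 2]
```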
&lt;p&gt;&lt;a class="link" href="https://tdhopper.com/blog/collapsed-gibbs/" &gt;In another post&lt;/a&gt;, I show how we can &amp;ldquo;collapse&amp;rdquo; the Gibbs sampler and sampling the assignment parameter without sampling the $\pi$ and $\theta$ values. This collapsed sampler can also be extended to the model with a Dirichet process prior that allows the number of clusters to be a parameter fit by the model.&lt;/p&gt;
&lt;h2 id="notation-helper"&gt;Notation Helper
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;$N_k$, &lt;code&gt;state['suffstat'][k].N&lt;/code&gt;: Number of points in cluster $k$.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;$\theta_k$, &lt;code&gt;state['suffstat'][k].theta&lt;/code&gt;: Mean of cluster $k$.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;$\lambda_1$, &lt;code&gt;state['hyperparameters_']['mean']&lt;/code&gt;: Mean of prior distribution over cluster means.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;$\lambda_2^2$, &lt;code&gt;state['hyperparameters_']['variance']&lt;/code&gt;: Variance of prior distribution over cluster means.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;$\sigma^2$, &lt;code&gt;state['cluster_variance_']&lt;/code&gt;: Known, fixed variance of clusters.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The superscript $(t)$ on $\theta_k$, $\pi_k$, and $z_i$ indicates the value of that variable at step $t$ of the Gibbs sampler.&lt;/p&gt;</description></item><item><title>Profile in Computational Imagination</title><link>https://tdhopper.com/blog/profile-in-computational-imagination/</link><pubDate>Tue, 01 Sep 2015 00:00:00 +0000</pubDate><guid>https://tdhopper.com/blog/profile-in-computational-imagination/</guid><description>&lt;p&gt;I recently had the honor of being interviewed by &lt;a class="link" href="http://computationalimagination.com" target="_blank" rel="noopener"
&gt;Michael Swenson&lt;/a&gt; for his interview series called &amp;ldquo;Profiles in Computational Imagination&amp;rdquo;. I talked a bit about my current work, my wandering road to data science, and my love for remote work. You can &lt;a class="link" href="https://web.archive.org/web/20160304120901/http://computationalimagination.com/interview_tim_hopper.php" target="_blank" rel="noopener"
&gt;read it here&lt;/a&gt;.&lt;/p&gt;</description></item><item><title>On Showing Your Work</title><link>https://tdhopper.com/blog/on-showing-your-work/</link><pubDate>Wed, 12 Aug 2015 00:00:00 +0000</pubDate><guid>https://tdhopper.com/blog/on-showing-your-work/</guid><description>&lt;blockquote class="twitter-tweet" lang="en"&gt;&lt;p lang="en" dir="ltr"&gt;I&amp;#39;ll never forget the great words of one of my best math professors:&amp;#10;&amp;#10;&amp;quot;If all I wanted was the answer, I sure as hell wouldn&amp;#39;t ask you.&amp;quot;&lt;/p&gt;&amp;mdash; Tim Hopper (@tdhopper) &lt;a href="https://twitter.com/tdhopper/status/410071644896382977"&gt;December 9, 2013&lt;/a&gt;&lt;/blockquote&gt;
&lt;script async src="//platform.twitter.com/widgets.js" charset="utf-8"&gt;&lt;/script&gt;</description></item><item><title>A Programmer's Portfolio</title><link>https://tdhopper.com/blog/a-programmers-portfolio/</link><pubDate>Wed, 05 Aug 2015 00:00:00 +0000</pubDate><guid>https://tdhopper.com/blog/a-programmers-portfolio/</guid><description>&lt;p&gt;I am convinced that a programming student hoping to get a job in that field should be actively building a portfolio online. Turn those class projects, presentations, and reports into Github repositories or blog posts! I felt vindicated as I read this anecdote in &lt;a class="link" href="https://en.wikipedia.org/wiki/Peopleware:_Productive_Projects_and_Teams" target="_blank" rel="noopener"
&gt;Peopleware&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;In the spring of 1979, while teaching together in western Canada, we got a call from a computer science professor at the local technical college. He proposed to stop by our hotel after class one evening and buy us beers in exchange for ideas. That&amp;rsquo;s the kind of offer we seldom turn down. What we learned from him that evening was almost certainly worth more than whatever he learned from us.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;The teacher was candid about what he needed to be judged a success in his work: He needed his students to get good job offers and lots of them. &amp;ldquo;A Harvard diploma is worth something in and of itself, but our diploma isn&amp;rsquo;t worth squat. If this year&amp;rsquo;s graduates don&amp;rsquo;t get hired fast, there are no students next year and I&amp;rsquo;m out of a job.&amp;rdquo; So he had developed a formula to make his graduates optimally attractive to the job market. Of course he taught them modern techniques for system construction, including structured analysis and design, data-driven design, information hiding, structured coding, walkthroughs, and metrics. He also had them work on real applications for nearby companies and agencies. But the centerpiece of his formula was the portfolio that all students put together to show samples of their work.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;He described how his students had been coached to show off their portfolios as part of each interview:&lt;/p&gt;
&lt;/blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;ldquo;I&amp;rsquo;ve brought along some samples of the kind of work I do. Here, for instance, is a subroutine in Pascal from one project and a set of COBOL paragraphs from another. As you can see in this portion, we use the loop-with-exit extension advocated by Knuth, but aside from that, it&amp;rsquo;s pure structured code, pretty much the sort of thing that your company standard calls for. And here is the design that this code was written from. The hierarchies and coupling analysis use Myers&amp;rsquo; notation. I designed all of this particular subsystem, and this one little section where we used some Orr methods because the data structure really imposed itself on the process structure. And these are the leveled data flow diagrams that make up the guts of our specification, and the associated data dictionary. &amp;hellip;&amp;rdquo;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;In the years since, we&amp;rsquo;ve often heard more about that obscure technical college and those portfolios. We&amp;rsquo;ve met recruiters from as far away as Triangle Park, North Carolina, and Tampa, Florida, who regularly converge upon that distant Canadian campus for a shot at its graduates.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;Of course, this was a clever scheme of the professor&amp;rsquo;s to give added allure to his graduates, but what struck us most that evening was the report that interviewers were always surprised by the portfolios. That meant they weren&amp;rsquo;t regularly requiring all candidates to arrive with portfolios. Yet why not? What could be more sensible than asking each candidate to bring along some samples of work to the interview?&lt;/p&gt;
&lt;/blockquote&gt;</description></item><item><title>Nonparametric Latent Dirichlet Allocation</title><link>https://tdhopper.com/blog/nonparametric-lda/</link><pubDate>Mon, 03 Aug 2015 00:00:00 +0000</pubDate><guid>https://tdhopper.com/blog/nonparametric-lda/</guid><description>&lt;img src="https://tdhopper.com/images/nonparametric-lda-mr-men.png" alt="Featured image of post Nonparametric Latent Dirichlet Allocation" /&gt;&lt;p&gt;&lt;em&gt;Latent Dirichlet Allocation&lt;/em&gt; is a &lt;a class="link" href="https://en.wikipedia.org/wiki/Generative_model" target="_blank" rel="noopener"
&gt;generative&lt;/a&gt; model for topic modeling. Given a collection of documents, an LDA inference algorithm attempts to determine (in an unsupervised manner) the topics discussed in the documents. It makes the assumption that each document is generated by a probability model, and, when doing inference, we try to find the parameters that best fit the model (as well as unseen/latent variables generated by the model). If you are unfamiliar with LDA, Edwin Chen has a &lt;a class="link" href="http://blog.echen.me/2011/08/22/introduction-to-latent-dirichlet-allocation/" target="_blank" rel="noopener"
&gt;friendly introduction&lt;/a&gt; you should read.&lt;/p&gt;
&lt;p&gt;Because LDA is a &lt;em&gt;generative&lt;/em&gt; model, we can simulate the construction of documents by forward-sampling from the model. The generative algorithm is as follows (following &lt;a class="link" href="http://www.arbylon.net/publications/text-est.pdf" target="_blank" rel="noopener"
&gt;Heinrich&lt;/a&gt;):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;for each topic $k\in [1,K]$ do
&lt;ul&gt;
&lt;li&gt;sample term distribution for topic $\overrightarrow \phi_k \sim \text{Dir}(\overrightarrow \beta)$&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;for each document $m\in [1, M]$ do
&lt;ul&gt;
&lt;li&gt;sample topic distribution for document $\overrightarrow\theta_m\sim \text{Dir}(\overrightarrow\alpha)$&lt;/li&gt;
&lt;li&gt;sample document length $N_m\sim\text{Pois}(\xi)$&lt;/li&gt;
&lt;li&gt;for all words $n\in [1, N_m]$ in document $m$ do
&lt;ul&gt;
&lt;li&gt;sample topic index $z_{m,n}\sim\text{Mult}(\overrightarrow\theta_m)$&lt;/li&gt;
&lt;li&gt;sample term for word $w_{m,n}\sim\text{Mult}(\overrightarrow\phi_{z_{m,n}})$&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;You can implement this with &lt;a class="link" href="https://gist.github.com/tdhopper/521006b60e1311d45509" target="_blank" rel="noopener"
&gt;a little bit of code&lt;/a&gt; and start to simulate documents.&lt;/p&gt;
&lt;p&gt;In LDA, we assume each word in the document is generated by a two-step process:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Sample a topic from the topic distribution for the document.&lt;/li&gt;
&lt;li&gt;Sample a word from the term distribution of that topic.&lt;/li&gt;
&lt;/ol&gt;
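&lt;p&gt;As a minimal sketch of this two-step draw, with a made-up topic distribution and term distributions (not parameters taken from the post):&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters for a single document (all values invented).
vocabulary = ['see', 'spot', 'run']
theta_m = [0.7, 0.3]                 # topic distribution for this document
phi = [[0.1, 0.6, 0.3],              # term distribution for topic 0
       [0.5, 0.2, 0.3]]              # term distribution for topic 1

# 1. sample a topic index from the document's topic distribution
z = rng.choice(2, p=theta_m)
# 2. sample a word from that topic's term distribution
word = rng.choice(vocabulary, p=phi[z])
print(z, word)
```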
&lt;p&gt;When we fit the LDA model to a given text corpus with an inference algorithm, our primary objective is to find the topic distributions $\underline \Theta$ and term distributions $\underline \Phi$ that generated the documents, along with the latent topic indices $z_{m,n}$ for each word.&lt;/p&gt;
&lt;p&gt;To run the generative model, we need to specify each of these parameters:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt;1
&lt;/span&gt;&lt;span class="lnt"&gt;2
&lt;/span&gt;&lt;span class="lnt"&gt;3
&lt;/span&gt;&lt;span class="lnt"&gt;4
&lt;/span&gt;&lt;span class="lnt"&gt;5
&lt;/span&gt;&lt;span class="lnt"&gt;6
&lt;/span&gt;&lt;span class="lnt"&gt;7
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;vocabulary&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;see&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;spot&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;run&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;num_terms&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vocabulary&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;num_topics&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="c1"&gt;# K&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;num_documents&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt; &lt;span class="c1"&gt;# M&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;mean_document_length&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt; &lt;span class="c1"&gt;# xi&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;term_dirichlet_parameter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="c1"&gt;# beta&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;topic_dirichlet_parameter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="c1"&gt;# alpha&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;The collection of term distributions $\underline\Phi$ consists of samples from a Dirichlet distribution. Each sample describes how our three terms are distributed within one of the two topics.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt;1
&lt;/span&gt;&lt;span class="lnt"&gt;2
&lt;/span&gt;&lt;span class="lnt"&gt;3
&lt;/span&gt;&lt;span class="lnt"&gt;4
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;scipy.stats&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;dirichlet&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;poisson&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;numpy&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nb"&gt;round&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;collections&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;defaultdict&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;random&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;choice&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;stl_choice&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt;1
&lt;/span&gt;&lt;span class="lnt"&gt;2
&lt;/span&gt;&lt;span class="lnt"&gt;3
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;term_dirichlet_vector&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;num_terms&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;term_dirichlet_parameter&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;term_distributions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dirichlet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;term_dirichlet_vector&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rvs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;num_topics&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;term_distributions&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;pre&gt;&lt;code&gt;[[ 0.41 0.02 0.57]
[ 0.38 0.36 0.26]]
&lt;/code&gt;&lt;/pre&gt;
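As a quick sanity check (a small sketch, not part of the original derivation), each row returned by `dirichlet(...).rvs(...)` is a probability vector over the three terms, so each row should sum to 1:

```python
import numpy as np
from scipy.stats import dirichlet

term_dirichlet_vector = 3 * [1]   # beta = 1 for each of the 3 terms
term_distributions = dirichlet(term_dirichlet_vector).rvs(size=2)

# One row per topic; each row is a distribution over the 3 terms.
print(term_distributions.shape)          # (2, 3)
print(term_distributions.sum(axis=1))    # each row sums to 1
```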
&lt;p&gt;Each document corresponds to a categorical distribution over these topics (in this case, a 2-dimensional categorical distribution). Since each topic is itself a distribution over terms, this categorical distribution is a &lt;em&gt;distribution of distributions&lt;/em&gt;; we could look at it as a Dirichlet process!&lt;/p&gt;
&lt;p&gt;The base distribution of our Dirichlet process is a uniform distribution over topics (remember, topics are term distributions).&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt;1
&lt;/span&gt;&lt;span class="lnt"&gt;2
&lt;/span&gt;&lt;span class="lnt"&gt;3
&lt;/span&gt;&lt;span class="lnt"&gt;4
&lt;/span&gt;&lt;span class="lnt"&gt;5
&lt;/span&gt;&lt;span class="lnt"&gt;6
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;base_distribution&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;lambda&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;stl_choice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;term_distributions&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# A sample from base_distribution is a distribution over terms&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# Each of our two topics has equal probability&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;collections&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Counter&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;topic&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;count&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;Counter&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="nb"&gt;tuple&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;base_distribution&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10000&lt;/span&gt;&lt;span class="p"&gt;)])&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;most_common&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;count:&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;topic:&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prob&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;prob&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;topic&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;pre&gt;&lt;code&gt;count: 5066 topic: [0.40999999999999998, 0.02, 0.56999999999999995]
count: 4934 topic: [0.38, 0.35999999999999999, 0.26000000000000001]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Recall that a sample from a Dirichlet process is a distribution that approximates (but varies from) the base distribution. In this case, a sample from the Dirichlet process will be a distribution over topics that varies from the uniform distribution we provided as a base. If we use the stick-breaking metaphor, we are effectively breaking a stick one time and the size of each portion corresponds to the proportion of a topic in the document.&lt;/p&gt;
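The stick-breaking metaphor can be sketched on its own before defining the full DP class. This is a minimal illustration (the concentration parameter `alpha` and the number of breaks are arbitrary choices for demonstration): each break takes a Beta(1, alpha)-distributed fraction of whatever stick remains.

```python
from scipy.stats import beta

def stick_breaking_weights(alpha, num_pieces):
    """Break a unit-length stick num_pieces times; return the piece lengths."""
    remaining = 1.0
    weights = []
    for _ in range(num_pieces):
        # Take a Beta(1, alpha) fraction of the remaining stick.
        piece = beta(1, alpha).rvs() * remaining
        weights.append(piece)
        remaining -= piece
    return weights

weights = stick_breaking_weights(alpha=1.0, num_pieces=10)
print(sum(weights))  # close to, but never reaching, 1
```

Smaller `alpha` concentrates most of the stick in the first few pieces; larger `alpha` spreads it across many pieces.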
&lt;p&gt;To construct a sample from the DP, we need to &lt;a class="link" href="https://tdhopper.com/blog/dirichlet-distribution/" &gt;again define our DP class&lt;/a&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt; 1
&lt;/span&gt;&lt;span class="lnt"&gt; 2
&lt;/span&gt;&lt;span class="lnt"&gt; 3
&lt;/span&gt;&lt;span class="lnt"&gt; 4
&lt;/span&gt;&lt;span class="lnt"&gt; 5
&lt;/span&gt;&lt;span class="lnt"&gt; 6
&lt;/span&gt;&lt;span class="lnt"&gt; 7
&lt;/span&gt;&lt;span class="lnt"&gt; 8
&lt;/span&gt;&lt;span class="lnt"&gt; 9
&lt;/span&gt;&lt;span class="lnt"&gt;10
&lt;/span&gt;&lt;span class="lnt"&gt;11
&lt;/span&gt;&lt;span class="lnt"&gt;12
&lt;/span&gt;&lt;span class="lnt"&gt;13
&lt;/span&gt;&lt;span class="lnt"&gt;14
&lt;/span&gt;&lt;span class="lnt"&gt;15
&lt;/span&gt;&lt;span class="lnt"&gt;16
&lt;/span&gt;&lt;span class="lnt"&gt;17
&lt;/span&gt;&lt;span class="lnt"&gt;18
&lt;/span&gt;&lt;span class="lnt"&gt;19
&lt;/span&gt;&lt;span class="lnt"&gt;20
&lt;/span&gt;&lt;span class="lnt"&gt;21
&lt;/span&gt;&lt;span class="lnt"&gt;22
&lt;/span&gt;&lt;span class="lnt"&gt;23
&lt;/span&gt;&lt;span class="lnt"&gt;24
&lt;/span&gt;&lt;span class="lnt"&gt;25
&lt;/span&gt;&lt;span class="lnt"&gt;26
&lt;/span&gt;&lt;span class="lnt"&gt;27
&lt;/span&gt;&lt;span class="lnt"&gt;28
&lt;/span&gt;&lt;span class="lnt"&gt;29
&lt;/span&gt;&lt;span class="lnt"&gt;30
&lt;/span&gt;&lt;span class="lnt"&gt;31
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;scipy.stats&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;beta&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;numpy.random&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;choice&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;DirichletProcessSample&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="fm"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;base_measure&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;base_measure&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;base_measure&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;alpha&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cache&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;weights&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;total_stick_used&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="fm"&gt;__call__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;remaining&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;total_stick_used&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;DirichletProcessSample&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;roll_die&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;weights&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;remaining&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="kc"&gt;None&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;weights&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;stick_piece&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;beta&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rvs&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;remaining&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;total_stick_used&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;stick_piece&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;weights&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stick_piece&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;new_value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;base_measure&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;new_value&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;new_value&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nd"&gt;@staticmethod&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;roll_die&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;weights&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;weights&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;choice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;weights&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;weights&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kc"&gt;None&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;For each document, we will draw a topic distribution from the Dirichlet process:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt;1
&lt;/span&gt;&lt;span class="lnt"&gt;2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;topic_distribution&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;DirichletProcessSample&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;base_measure&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;base_distribution&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;topic_dirichlet_parameter&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;A sample from this &lt;em&gt;topic&lt;/em&gt; distribution is a &lt;em&gt;distribution over terms&lt;/em&gt;. However, unlike our base distribution which returns each term distribution with equal probability, the topics will be unevenly weighted.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt;1
&lt;/span&gt;&lt;span class="lnt"&gt;2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;topic&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;count&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;Counter&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="nb"&gt;tuple&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;topic_distribution&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10000&lt;/span&gt;&lt;span class="p"&gt;)])&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;most_common&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;count:&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;topic:&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prob&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;prob&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;topic&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;pre&gt;&lt;code&gt;count: 9589 topic: [0.38, 0.35999999999999999, 0.26000000000000001]
count: 411 topic: [0.40999999999999998, 0.02, 0.56999999999999995]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To generate each word in a document, we draw a topic from the document&amp;rsquo;s topic distribution and then draw a term from that topic&amp;rsquo;s term distribution.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt; 1
&lt;/span&gt;&lt;span class="lnt"&gt; 2
&lt;/span&gt;&lt;span class="lnt"&gt; 3
&lt;/span&gt;&lt;span class="lnt"&gt; 4
&lt;/span&gt;&lt;span class="lnt"&gt; 5
&lt;/span&gt;&lt;span class="lnt"&gt; 6
&lt;/span&gt;&lt;span class="lnt"&gt; 7
&lt;/span&gt;&lt;span class="lnt"&gt; 8
&lt;/span&gt;&lt;span class="lnt"&gt; 9
&lt;/span&gt;&lt;span class="lnt"&gt;10
&lt;/span&gt;&lt;span class="lnt"&gt;11
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;topic_index&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;defaultdict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;documents&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;defaultdict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;num_documents&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;topic_distribution_rvs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;DirichletProcessSample&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;base_measure&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;base_distribution&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;topic_dirichlet_parameter&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;document_length&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;poisson&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mean_document_length&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rvs&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;word&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;document_length&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;topic_distribution&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;topic_distribution_rvs&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;topic_index&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;tuple&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;topic_distribution&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;choice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vocabulary&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;topic_distribution&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Here are the documents we generated:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt;1
&lt;/span&gt;&lt;span class="lnt"&gt;2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;pre&gt;&lt;code&gt;['see', 'run', 'see', 'spot', 'see', 'spot']
['see', 'run', 'see']
['see', 'run', 'see', 'see', 'run', 'spot', 'spot']
['run', 'run', 'run', 'spot', 'run']
['run', 'run', 'see', 'spot', 'run', 'run']
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We can see how each topic (term-distribution) is distributed across the documents:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt;1
&lt;/span&gt;&lt;span class="lnt"&gt;2
&lt;/span&gt;&lt;span class="lnt"&gt;3
&lt;/span&gt;&lt;span class="lnt"&gt;4
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Counter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;term_dist&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;most_common&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;term_dist&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;topic_index&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="p"&gt;()):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;Doc:&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;topic&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;count&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34; &amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;count:&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;topic:&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prob&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;prob&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;topic&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;pre&gt;&lt;code&gt;Doc: 0
count: 6 topic: [0.38, 0.36, 0.26]
Doc: 1
count: 3 topic: [0.41, 0.02, 0.57]
Doc: 2
count: 5 topic: [0.41, 0.02, 0.57]
count: 2 topic: [0.38, 0.36, 0.26]
Doc: 3
count: 5 topic: [0.38, 0.36, 0.26]
Doc: 4
count: 5 topic: [0.41, 0.02, 0.57]
count: 1 topic: [0.38, 0.36, 0.26]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To recap: for each document we draw a &lt;em&gt;sample&lt;/em&gt; from a Dirichlet &lt;em&gt;Process&lt;/em&gt;. The base distribution for the Dirichlet process is a categorical distribution over term distributions; we can think of the base distribution as an $n$-sided die where $n$ is the number of topics and each side of the die is a distribution over terms for that topic. By sampling from the Dirichlet process, we are effectively reweighting the sides of the die (changing the distribution of the topics).&lt;/p&gt;
&lt;p&gt;For each word in the document, we draw a &lt;em&gt;sample&lt;/em&gt; (a term distribution) from the distribution (over term distributions) &lt;em&gt;sampled&lt;/em&gt; from the Dirichlet process (with a distribution over term distributions as its base measure). Each term distribution uniquely identifies the topic for the word. We can sample from this term distribution to get the word.&lt;/p&gt;
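&lt;p&gt;The die-reweighting intuition can be sketched with only the standard library. The snippet below is an illustrative aside, not the code from this post: it runs a Chinese restaurant process (equivalent to drawing from a Dirichlet process sample) over a hypothetical three-sided base die, and the topic labels, &lt;code&gt;alpha&lt;/code&gt;, and draw count are arbitrary choices.&lt;/p&gt;

```python
import random

# Hypothetical finite base distribution: a fair three-sided "die" over topic labels.
topics = ["topic_a", "topic_b", "topic_c"]

def crp_weights(alpha, n):
    """Empirical topic weights after n Chinese-restaurant-process draws,
    equivalent to n draws from a single DP(alpha, base) sample."""
    values, counts = [], []
    for _ in range(n):
        total = sum(counts)
        if total == 0 or random.random() < alpha / (alpha + total):
            values.append(random.choice(topics))  # "new table": fresh draw from the base
            counts.append(1)
        else:
            # "old table": repeat a previous draw, proportional to its popularity
            i = random.choices(range(len(values)), weights=counts)[0]
            counts[i] += 1
    weights = {t: 0.0 for t in topics}
    for value, count in zip(values, counts):
        weights[value] += count / n
    return weights

random.seed(0)
print(crp_weights(alpha=1, n=1000))  # the three sides end up unevenly reweighted
```

&lt;p&gt;The empirical weights still sum to one, but they no longer match the uniform base die: that is the reweighting described above.&lt;/p&gt;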
&lt;p&gt;Given this formulation, we might ask if we can roll an &lt;em&gt;infinite&lt;/em&gt;-sided die to draw from an unbounded number of topics (term distributions). We can do exactly this with a &lt;em&gt;hierarchical&lt;/em&gt; Dirichlet process. Instead of making the base distribution of our Dirichlet process a &lt;em&gt;finite&lt;/em&gt; distribution over topics (term distributions), we will make it an &lt;em&gt;infinite&lt;/em&gt; distribution over topics by using yet another Dirichlet process! This base Dirichlet process will have as its base distribution a Dirichlet &lt;em&gt;distribution&lt;/em&gt; over terms.&lt;/p&gt;
&lt;p&gt;We will again draw a &lt;em&gt;sample&lt;/em&gt; from a Dirichlet &lt;em&gt;Process&lt;/em&gt; for each document. The base distribution for this Dirichlet process is itself a Dirichlet process whose base distribution is a Dirichlet distribution over terms. (Try saying that five times fast.) We can think of this as a die with countably infinite sides, where each side is a distribution over terms for a topic. The sample we draw is a topic (a distribution over terms).&lt;/p&gt;
&lt;p&gt;For each word in the document, we will draw a &lt;em&gt;sample&lt;/em&gt; (a term distribution) from the distribution (over term distributions) &lt;em&gt;sampled&lt;/em&gt; from the Dirichlet process (with a distribution over term distributions as its base measure). Each term distribution uniquely identifies the topic for the word. We can sample from this term distribution to get the word.&lt;/p&gt;
&lt;p&gt;These last few paragraphs are confusing! Let&amp;rsquo;s illustrate with code.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt;1
&lt;/span&gt;&lt;span class="lnt"&gt;2
&lt;/span&gt;&lt;span class="lnt"&gt;3
&lt;/span&gt;&lt;span class="lnt"&gt;4
&lt;/span&gt;&lt;span class="lnt"&gt;5
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;term_dirichlet_vector&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;num_terms&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;term_dirichlet_parameter&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;base_distribution&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;lambda&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;dirichlet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;term_dirichlet_vector&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rvs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;base_dp_parameter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;base_dp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;DirichletProcessSample&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;base_distribution&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;base_dp_parameter&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;This sample from the base Dirichlet process is our infinite-sided die: a probability distribution over a countably infinite number of topics.&lt;/p&gt;
&lt;p&gt;The fact that our die is countably infinite is important. The sampler &lt;code&gt;base_distribution&lt;/code&gt; draws topics (term-distributions) from an uncountable set. If we used this as the base distribution of the Dirichlet process below, each document would be constructed from a &lt;em&gt;completely unique set of topics&lt;/em&gt;. By feeding &lt;code&gt;base_distribution&lt;/code&gt; into a Dirichlet process (stochastic memoizer), we allow the topics to be shared across documents.&lt;/p&gt;
&lt;p&gt;In other words, &lt;code&gt;base_distribution&lt;/code&gt; will never return the same topic twice; however, every topic sampled from &lt;code&gt;base_dp&lt;/code&gt; would be sampled an infinite number of times (if we sampled from &lt;code&gt;base_dp&lt;/code&gt; forever). At the same time, &lt;code&gt;base_dp&lt;/code&gt; will also return an &lt;em&gt;infinite number&lt;/em&gt; of topics. In our formulation of the LDA sampler above, our base distribution only ever returned a finite number of topics (&lt;code&gt;num_topics&lt;/code&gt;); there is no &lt;code&gt;num_topics&lt;/code&gt; parameter here.&lt;/p&gt;
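&lt;p&gt;To make the contrast concrete, here is a small self-contained sketch, not the scipy-based class from this post: &lt;code&gt;dp_memoizer&lt;/code&gt; is a standard-library stand-in for &lt;code&gt;DirichletProcessSample&lt;/code&gt;, implemented via the equivalent Chinese restaurant process. Draws taken straight from a continuous base never repeat, while draws through the memoizer collapse onto a shared set of atoms.&lt;/p&gt;

```python
import random

def dp_memoizer(base_draw, alpha):
    """Standard-library stand-in for DirichletProcessSample: lazily draws from
    one DP(alpha, base) sample via the Chinese restaurant process."""
    values, counts = [], []
    def draw():
        total = sum(counts)
        if total == 0 or random.random() < alpha / (alpha + total):
            values.append(base_draw())  # unseen "table": fresh draw from the base
            counts.append(1)
            return values[-1]
        i = random.choices(range(len(values)), weights=counts)[0]
        counts[i] += 1                  # seen "table": return an earlier draw again
        return values[i]
    return draw

random.seed(0)
base_draw = random.random  # a continuous base: it never repeats a value
memoized = dp_memoizer(base_draw, alpha=10)

raw = [base_draw() for _ in range(1000)]
shared = [memoized() for _ in range(1000)]
print(len(set(raw)), len(set(shared)))  # all raw draws distinct; memoized draws repeat
```

&lt;p&gt;The memoized sampler returns only a modest number of distinct atoms across a thousand draws, which is exactly what lets topics be shared across documents.&lt;/p&gt;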
&lt;p&gt;Given this setup, we can generate documents from the &lt;em&gt;hierarchical Dirichlet process&lt;/em&gt; with an algorithm that is essentially identical to that of the original &lt;em&gt;latent Dirichlet allocation&lt;/em&gt; generative sampler:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt; 1
&lt;/span&gt;&lt;span class="lnt"&gt; 2
&lt;/span&gt;&lt;span class="lnt"&gt; 3
&lt;/span&gt;&lt;span class="lnt"&gt; 4
&lt;/span&gt;&lt;span class="lnt"&gt; 5
&lt;/span&gt;&lt;span class="lnt"&gt; 6
&lt;/span&gt;&lt;span class="lnt"&gt; 7
&lt;/span&gt;&lt;span class="lnt"&gt; 8
&lt;/span&gt;&lt;span class="lnt"&gt; 9
&lt;/span&gt;&lt;span class="lnt"&gt;10
&lt;/span&gt;&lt;span class="lnt"&gt;11
&lt;/span&gt;&lt;span class="lnt"&gt;12
&lt;/span&gt;&lt;span class="lnt"&gt;13
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;nested_dp_parameter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;topic_index&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;defaultdict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;documents&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;defaultdict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;num_documents&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;topic_distribution_rvs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;DirichletProcessSample&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;base_measure&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;base_dp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;nested_dp_parameter&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;document_length&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;poisson&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mean_document_length&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rvs&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;word&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;document_length&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;topic_distribution&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;topic_distribution_rvs&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;topic_index&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;tuple&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;topic_distribution&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;choice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vocabulary&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;topic_distribution&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Here are the documents we generated:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt;1
&lt;/span&gt;&lt;span class="lnt"&gt;2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;pre&gt;&lt;code&gt;['spot', 'spot', 'spot', 'spot', 'run']
['spot', 'spot', 'see', 'spot']
['spot', 'spot', 'spot', 'see', 'spot', 'spot', 'spot']
['run', 'run', 'spot', 'spot', 'spot', 'spot', 'spot', 'spot']
['see', 'run', 'see', 'run', 'run', 'run']
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And here are the latent topics used:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt;1
&lt;/span&gt;&lt;span class="lnt"&gt;2
&lt;/span&gt;&lt;span class="lnt"&gt;3
&lt;/span&gt;&lt;span class="lnt"&gt;4
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Counter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;term_dist&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;most_common&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;term_dist&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;topic_index&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="p"&gt;()):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;Doc:&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;topic&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;count&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34; &amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;count:&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;topic:&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prob&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;prob&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;topic&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;pre&gt;&lt;code&gt;Doc: 0
count: 2 topic: [0.18, 0.79, 0.02]
count: 1 topic: [0.23, 0.59, 0.18]
count: 1 topic: [0.09, 0.54, 0.36]
count: 1 topic: [0.22, 0.4, 0.38]
Doc: 1
count: 2 topic: [0.23, 0.59, 0.18]
count: 1 topic: [0.18, 0.79, 0.02]
count: 1 topic: [0.36, 0.55, 0.09]
Doc: 2
count: 4 topic: [0.11, 0.65, 0.24]
count: 2 topic: [0.07, 0.65, 0.27]
count: 1 topic: [0.29, 0.65, 0.07]
Doc: 3
count: 2 topic: [0.18, 0.79, 0.02]
count: 2 topic: [0.25, 0.55, 0.2]
count: 2 topic: [0.29, 0.65, 0.07]
count: 1 topic: [0.23, 0.59, 0.18]
count: 1 topic: [0.09, 0.54, 0.36]
Doc: 4
count: 3 topic: [0.4, 0.23, 0.37]
count: 2 topic: [0.43, 0.18, 0.4]
count: 1 topic: [0.23, 0.3, 0.46]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Our documents were generated from an unspecified number of topics, and yet the topics were shared across the five documents. This is the power of the hierarchical Dirichlet process!&lt;/p&gt;
&lt;p&gt;This non-parametric formulation of Latent Dirichlet Allocation was first published by &lt;a class="link" href="http://www.cs.berkeley.edu/~jordan/papers/hdp.pdf" target="_blank" rel="noopener"
&gt;Yee Whye Teh et al.&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Unfortunately, forward sampling is the easy part. Fitting the model on data requires &lt;a class="link" href="http://psiexp.ss.uci.edu/research/papers/sciencetopics.pdf" target="_blank" rel="noopener"
&gt;complex MCMC&lt;/a&gt; or &lt;a class="link" href="http://www.cs.princeton.edu/~chongw/papers/WangPaisleyBlei2011.pdf" target="_blank" rel="noopener"
&gt;variational inference&lt;/a&gt;. There are a &lt;a class="link" href="http://www.stats.ox.ac.uk/~teh/software.html" target="_blank" rel="noopener"
&gt;limited&lt;/a&gt; &lt;a class="link" href="https://github.com/shuyo/iir/blob/master/lda/hdplda2.py" target="_blank" rel="noopener"
&gt;number&lt;/a&gt; of &lt;a class="link" href="https://github.com/renaud/hdp-faster" target="_blank" rel="noopener"
&gt;implementations&lt;/a&gt; &lt;a class="link" href="http://www.arbylon.net/resources.html" target="_blank" rel="noopener"
&gt;of HDP-LDA&lt;/a&gt; available, and none of them are great.&lt;/p&gt;</description></item><item><title>High Quality Code at Quora</title><link>https://tdhopper.com/blog/high-quality-code-at-quora/</link><pubDate>Thu, 30 Jul 2015 00:00:00 +0000</pubDate><guid>https://tdhopper.com/blog/high-quality-code-at-quora/</guid><description>&lt;p&gt;I love &lt;a class="link" href="http://engineering.quora.com/Moving-Fast-With-High-Code-Quality" target="_blank" rel="noopener"
&gt;this new post&lt;/a&gt; on Quora&amp;rsquo;s engineering blog. The post states &amp;ldquo;high code quality is the long-term boost to development speed&amp;rdquo; and goes on to explain how they go about accomplishing this.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;ve inherited large code bases at each of my jobs out of grad school, and I&amp;rsquo;ve spent a lot of time thinking about this question. At least on the surface, I love the solutions Quora has in place for ensuring quality code: thoughtful code review, careful testing, style guidelines, static checking, and intentional code cleanup.&lt;/p&gt;</description></item><item><title>Sampling from a Hierarchical Dirichlet Process</title><link>https://tdhopper.com/blog/hdp-sampling/</link><pubDate>Thu, 30 Jul 2015 00:00:00 +0000</pubDate><guid>https://tdhopper.com/blog/hdp-sampling/</guid><description>&lt;img src="https://tdhopper.com/images/hdp-sampling-mr-men.png" alt="Featured image of post Sampling from a Hierarchical Dirichlet Process" /&gt;&lt;p&gt;&lt;a class="link" href="https://tdhopper.com/blog/dirichlet-distribution/" &gt;As we saw earlier&lt;/a&gt;, the Dirichlet process describes the &lt;em&gt;distribution&lt;/em&gt; of a random probability distribution. The Dirichlet process takes two parameters: a base distribution $H_0$ and a dispersion parameter $\alpha$. A sample from the Dirichlet process is itself a probability distribution that &lt;em&gt;looks like&lt;/em&gt; $H_0$. On average, the larger $\alpha$ is, the closer a sample from $\text{DP}(\alpha H_0)$ will be to $H_0$.&lt;/p&gt;
&lt;p&gt;Suppose we&amp;rsquo;re feeling masochistic and want to use a distribution sampled from a Dirichlet process as the base distribution of a new Dirichlet process. (It will turn out that there are good reasons for this!) Conceptually this makes sense. But can we construct such a thing in practice? Said another way, can we build a sampler that will draw samples from a probability distribution drawn from these nested Dirichlet processes? We might initially try to construct a sample (a probability distribution) from the first Dirichlet process before feeding it into the second.&lt;/p&gt;
&lt;p&gt;But recall that fully constructing a sample (a probability distribution!) from a Dirichlet process would require drawing a countably infinite number of samples from $H_0$ and from the beta distribution to generate the weights. This would take forever, even with Hadoop!&lt;/p&gt;
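&lt;p&gt;We can see the problem in a few lines. This sketch (standard library only; the &lt;code&gt;alpha&lt;/code&gt; and truncation level are arbitrary choices, not values from this post) computes the first k stick-breaking weights: after any finite number of breaks, some of the stick always remains, so a full construction never terminates.&lt;/p&gt;

```python
import random

def stick_breaking(alpha, k):
    """First k stick-breaking weights of a sample from DP(alpha, H0)."""
    weights, remaining = [], 1.0
    for _ in range(k):
        piece = random.betavariate(1, alpha) * remaining  # break off a piece of the stick
        weights.append(piece)
        remaining -= piece
    return weights, remaining

random.seed(0)
weights, leftover = stick_breaking(alpha=10, k=100)
print(sum(weights), leftover)  # the finite weights never quite sum to 1
```

&lt;p&gt;However large we make k, &lt;code&gt;leftover&lt;/code&gt; stays positive, which is why a just-in-time construction is needed.&lt;/p&gt;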
&lt;p&gt;&lt;a class="link" href="http://danroy.org/papers/RoyManGooTen-ICMLNPB-2008.pdf" target="_blank" rel="noopener"
&gt;Dan Roy et al.&lt;/a&gt; helpfully described a technique of using &lt;em&gt;stochastic memoization&lt;/em&gt; to construct a distribution sampled from a Dirichlet process in a just-in-time manner. This process provides us with the equivalent of the &lt;a class="link" href="http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.rv_continuous.rvs.html" target="_blank" rel="noopener"
&gt;Scipy &lt;code&gt;rvs&lt;/code&gt;&lt;/a&gt; method for the sampled distribution. Stochastic memoization is equivalent to the &lt;a class="link" href="http://www.cs.princeton.edu/courses/archive/fall07/cos597C/scribe/20070921.pdf" target="_blank" rel="noopener"
&gt;Chinese restaurant process&lt;/a&gt;: sometimes you get seated at an occupied table (i.e. sometimes you&amp;rsquo;re given a sample you&amp;rsquo;ve seen before) and sometimes you&amp;rsquo;re put at a new table (given a unique sample).&lt;/p&gt;
&lt;p&gt;Here is our memoization class again:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt; 1
&lt;/span&gt;&lt;span class="lnt"&gt; 2
&lt;/span&gt;&lt;span class="lnt"&gt; 3
&lt;/span&gt;&lt;span class="lnt"&gt; 4
&lt;/span&gt;&lt;span class="lnt"&gt; 5
&lt;/span&gt;&lt;span class="lnt"&gt; 6
&lt;/span&gt;&lt;span class="lnt"&gt; 7
&lt;/span&gt;&lt;span class="lnt"&gt; 8
&lt;/span&gt;&lt;span class="lnt"&gt; 9
&lt;/span&gt;&lt;span class="lnt"&gt;10
&lt;/span&gt;&lt;span class="lnt"&gt;11
&lt;/span&gt;&lt;span class="lnt"&gt;12
&lt;/span&gt;&lt;span class="lnt"&gt;13
&lt;/span&gt;&lt;span class="lnt"&gt;14
&lt;/span&gt;&lt;span class="lnt"&gt;15
&lt;/span&gt;&lt;span class="lnt"&gt;16
&lt;/span&gt;&lt;span class="lnt"&gt;17
&lt;/span&gt;&lt;span class="lnt"&gt;18
&lt;/span&gt;&lt;span class="lnt"&gt;19
&lt;/span&gt;&lt;span class="lnt"&gt;20
&lt;/span&gt;&lt;span class="lnt"&gt;21
&lt;/span&gt;&lt;span class="lnt"&gt;22
&lt;/span&gt;&lt;span class="lnt"&gt;23
&lt;/span&gt;&lt;span class="lnt"&gt;24
&lt;/span&gt;&lt;span class="lnt"&gt;25
&lt;/span&gt;&lt;span class="lnt"&gt;26
&lt;/span&gt;&lt;span class="lnt"&gt;27
&lt;/span&gt;&lt;span class="lnt"&gt;28
&lt;/span&gt;&lt;span class="lnt"&gt;29
&lt;/span&gt;&lt;span class="lnt"&gt;30
&lt;/span&gt;&lt;span class="lnt"&gt;31
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;numpy.random&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;choice&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;scipy.stats&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;beta&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;DirichletProcessSample&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="fm"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;base_measure&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;base_measure&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;base_measure&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;alpha&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cache&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;weights&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;total_stick_used&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="fm"&gt;__call__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;remaining&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;total_stick_used&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;DirichletProcessSample&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;roll_die&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;weights&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;remaining&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="kc"&gt;None&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;weights&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;stick_piece&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;beta&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rvs&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;remaining&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;total_stick_used&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;stick_piece&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;weights&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stick_piece&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;new_value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;base_measure&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;new_value&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;new_value&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nd"&gt;@staticmethod&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;roll_die&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;weights&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;weights&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;choice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;weights&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;weights&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kc"&gt;None&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Let&amp;rsquo;s illustrate again with a standard normal base measure. We can construct a function &lt;code&gt;base_measure&lt;/code&gt; that generates samples from it.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt;1
&lt;/span&gt;&lt;span class="lnt"&gt;2
&lt;/span&gt;&lt;span class="lnt"&gt;3
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;scipy.stats&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;norm&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;base_measure&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;lambda&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;norm&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rvs&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Because the normal distribution has continuous support, we can generate samples from it forever and we will never see the same sample twice (in theory). We can illustrate this by drawing from the distribution ten thousand times and seeing that we get ten thousand unique values.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt;1
&lt;/span&gt;&lt;span class="lnt"&gt;2
&lt;/span&gt;&lt;span class="lnt"&gt;3
&lt;/span&gt;&lt;span class="lnt"&gt;4
&lt;/span&gt;&lt;span class="lnt"&gt;5
&lt;/span&gt;&lt;span class="lnt"&gt;6
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Series&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;ndraws&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10000&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;Number of unique samples after &lt;/span&gt;&lt;span class="si"&gt;{}&lt;/span&gt;&lt;span class="s2"&gt; draws:&amp;#34;&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ndraws&lt;/span&gt;&lt;span class="p"&gt;),)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;draws&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Series&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;base_measure&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ndraws&lt;/span&gt;&lt;span class="p"&gt;)])&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;draws&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;unique&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;pre&gt;&lt;code&gt;Number of unique samples after 10000 draws: 10000
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;However, when we feed the base measure through the stochastic memoization procedure and then sample, we get many duplicate samples. The number of unique samples goes down as $\alpha$ decreases.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt;1
&lt;/span&gt;&lt;span class="lnt"&gt;2
&lt;/span&gt;&lt;span class="lnt"&gt;3
&lt;/span&gt;&lt;span class="lnt"&gt;4
&lt;/span&gt;&lt;span class="lnt"&gt;5
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;norm_dp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;DirichletProcessSample&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;base_measure&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;Number of unique samples after &lt;/span&gt;&lt;span class="si"&gt;{}&lt;/span&gt;&lt;span class="s2"&gt; draws:&amp;#34;&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ndraws&lt;/span&gt;&lt;span class="p"&gt;),)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;dp_draws&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Series&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;norm_dp&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ndraws&lt;/span&gt;&lt;span class="p"&gt;)])&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dp_draws&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;unique&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;pre&gt;&lt;code&gt;Number of unique samples after 10000 draws: 446
&lt;/code&gt;&lt;/pre&gt;
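&lt;p&gt;As a rough check on this count (a sketch added here, not part of the original analysis): when the base measure is continuous, the expected number of distinct values among $n$ draws from $\text{DP}(\alpha H_0)$ is $\sum_{i=0}^{n-1} \alpha/(\alpha+i)$. For $\alpha=100$ and $n=10000$ this comes to roughly 460, in line with the 446 unique samples observed above.&lt;/p&gt;

```python
# Expected number of distinct values among n draws from DP(alpha * H0)
# when the base measure H0 is continuous (the Chinese restaurant process
# expectation): E[K_n] = sum_{i=0}^{n-1} alpha / (alpha + i).
def expected_unique(alpha, n):
    return sum(alpha / (alpha + i) for i in range(n))

for alpha in (1, 10, 100):
    print(alpha, round(expected_unique(alpha, 10000), 1))
```

&lt;p&gt;The count grows with $\alpha$: a larger concentration parameter means new atoms are drawn from the base measure more often.&lt;/p&gt;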
&lt;p&gt;At this point, we have a function &lt;code&gt;dp_draws&lt;/code&gt; that returns samples from a probability distribution (specifically, a probability distribution sampled from $\text{DP}(\alpha H_0)$). We can use &lt;code&gt;dp_draws&lt;/code&gt; as a base distribution for another Dirichlet process!&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;norm_hdp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;DirichletProcessSample&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;norm_dp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;How do we interpret this? &lt;code&gt;norm_dp&lt;/code&gt; is a sampler from a probability distribution that looks like the standard normal distribution. &lt;code&gt;norm_hdp&lt;/code&gt; is a sampler from a probability distribution that &amp;ldquo;looks like&amp;rdquo; the distribution &lt;code&gt;norm_dp&lt;/code&gt; samples from.&lt;/p&gt;
&lt;p&gt;Here is a histogram of samples drawn from &lt;code&gt;norm_dp&lt;/code&gt;, our first sampler.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt;1
&lt;/span&gt;&lt;span class="lnt"&gt;2
&lt;/span&gt;&lt;span class="lnt"&gt;3
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;plt&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;Series&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;norm_dp&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10000&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hist&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;Histogram of Samples from norm_dp&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;&lt;img src="https://tdhopper.com/blog/hdp-sampling/2015-07-30-sampling-from-a-hierarchical-dirichlet-process_12_0.png"
width="379"
height="266"
srcset="https://tdhopper.com/blog/hdp-sampling/2015-07-30-sampling-from-a-hierarchical-dirichlet-process_12_0_hu_fa3a40d1ad4f513e.png 480w, https://tdhopper.com/blog/hdp-sampling/2015-07-30-sampling-from-a-hierarchical-dirichlet-process_12_0_hu_b9841ac93f9f1ff7.png 1024w"
loading="lazy"
alt="Histogram of samples from norm_dp"
class="gallery-image"
data-flex-grow="142"
data-flex-basis="341px"
&gt;&lt;/p&gt;
&lt;p&gt;And here is a histogram for samples drawn from &lt;code&gt;norm_hdp&lt;/code&gt;, our second sampler.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt;1
&lt;/span&gt;&lt;span class="lnt"&gt;2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;Series&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;norm_hdp&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10000&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hist&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;Histogram of Samples from norm_hdp&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;&lt;img src="https://tdhopper.com/blog/hdp-sampling/2015-07-30-sampling-from-a-hierarchical-dirichlet-process_14_0.png"
width="380"
height="266"
srcset="https://tdhopper.com/blog/hdp-sampling/2015-07-30-sampling-from-a-hierarchical-dirichlet-process_14_0_hu_b7c4a27709c0d78d.png 480w, https://tdhopper.com/blog/hdp-sampling/2015-07-30-sampling-from-a-hierarchical-dirichlet-process_14_0_hu_1ce189f5227764ec.png 1024w"
loading="lazy"
alt="Histogram of samples from norm_hdp"
class="gallery-image"
data-flex-grow="142"
data-flex-basis="342px"
&gt;&lt;/p&gt;
&lt;p&gt;The second plot doesn&amp;rsquo;t look very much like the first! The level to which a sample from a Dirichlet process approximates the base distribution is a function of the dispersion parameter $\alpha$. Because I set $\alpha=10$ (which is relatively small), the approximation is fairly course. In terms of memoization, a small $\alpha$ value means the stochastic memoizer will more frequently reuse values already seen instead of drawing new ones.&lt;/p&gt;
&lt;p&gt;This nesting procedure, where a sample from one Dirichlet process is fed into another Dirichlet process as a base distribution, is more than just a curiosity. It is known as a &lt;a class="link" href="http://www.cs.berkeley.edu/~jordan/papers/hdp.pdf" target="_blank" rel="noopener"
&gt;Hierarchical Dirichlet Process, and it plays an important role in the study of Bayesian Nonparametrics&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Without the stochastic memoization framework, constructing a sampler for a hierarchical Dirichlet process is a daunting task. We want to be able to draw samples from a distribution drawn from the second level Dirichlet process. However, to be able to do that, we need to be able to draw samples from a distribution sampled from a &lt;em&gt;base distribution of the second-level Dirichlet process&lt;/em&gt;: this base distribution is a &lt;em&gt;distribution drawn from the first-level Dirichlet process&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;At first, it appears we would need to fully construct the first-level sample (by drawing a countably infinite number of samples from the first-level base distribution). However, stochastic memoization allows us to construct the first-level distribution just-in-time, as it is needed at the second level.&lt;/p&gt;
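&lt;p&gt;To make the just-in-time idea concrete, here is a minimal, self-contained sketch (hypothetical helper name, standard library only) of a stochastic memoizer: it wraps any sampler and only consults the underlying base measure at the moment a new stick piece is broken off.&lt;/p&gt;

```python
import random

def stochastic_memoize(draw, alpha):
    """Sketch of stochastic memoization via stick-breaking: wrap a sampler
    so each call either reuses a previously drawn value or lazily draws a
    new one. `draw` is never consulted until a new atom is actually needed."""
    weights, values = [], []
    used = 0.0
    def sample():
        nonlocal used
        r = random.random()
        # Land on an existing atom with probability equal to its weight...
        for w, v in zip(weights, values):
            if r < w:
                return v
            r -= w
        # ...otherwise break a new piece off the remaining stick.
        piece = random.betavariate(1, alpha) * (1.0 - used)
        used += piece
        weights.append(piece)
        values.append(draw())  # base measure consulted just-in-time
        return values[-1]
    return sample

dp = stochastic_memoize(random.random, alpha=5)
draws = [dp() for _ in range(1000)]
print(len(set(draws)))  # far fewer than 1000 distinct values
```

&lt;p&gt;Because the wrapper closes over its weights and values, nesting it (memoizing a memoized sampler) works without ever materializing the inner distribution, which is exactly what the hierarchical construction below exploits.&lt;/p&gt;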
&lt;p&gt;We can define a Python class that encapsulates the Hierarchical Dirichlet Process as a subclass of the &lt;code&gt;DirichletProcessSample&lt;/code&gt; class.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt;1
&lt;/span&gt;&lt;span class="lnt"&gt;2
&lt;/span&gt;&lt;span class="lnt"&gt;3
&lt;/span&gt;&lt;span class="lnt"&gt;4
&lt;/span&gt;&lt;span class="lnt"&gt;5
&lt;/span&gt;&lt;span class="lnt"&gt;6
&lt;/span&gt;&lt;span class="lnt"&gt;7
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;HierarchicalDirichletProcessSample&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;DirichletProcessSample&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="fm"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;base_measure&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;alpha1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;alpha2&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;first_level_dp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;DirichletProcessSample&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;base_measure&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;alpha1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;second_level_dp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;DirichletProcessSample&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;first_level_dp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;alpha2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="fm"&gt;__call__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;second_level_dp&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Since the Hierarchical DP is a Dirichlet process inside of a Dirichlet process, we must provide it with both first- and second-level $\alpha$ values.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;norm_hdp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;HierarchicalDirichletProcessSample&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;base_measure&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;alpha1&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;alpha2&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;We can sample directly from the probability distribution drawn from the Hierarchical Dirichlet Process.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt;1
&lt;/span&gt;&lt;span class="lnt"&gt;2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;Series&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;norm_hdp&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10000&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hist&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;Histogram of samples from distribution drawn from Hierarchical DP&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;&lt;img src="https://tdhopper.com/blog/hdp-sampling/2015-07-30-sampling-from-a-hierarchical-dirichlet-process_20_0.png"
width="420"
height="266"
srcset="https://tdhopper.com/blog/hdp-sampling/2015-07-30-sampling-from-a-hierarchical-dirichlet-process_20_0_hu_16e78b7f63bd8051.png 480w, https://tdhopper.com/blog/hdp-sampling/2015-07-30-sampling-from-a-hierarchical-dirichlet-process_20_0_hu_6d79c0e976bba9b0.png 1024w"
loading="lazy"
alt="Histogram of samples from distribution drawn from Hierarchical DP"
class="gallery-image"
data-flex-grow="157"
data-flex-basis="378px"
&gt;&lt;/p&gt;
&lt;p&gt;&lt;code&gt;norm_hdp&lt;/code&gt; is not equivalent to the Hierarchical Dirichlet Process; it samples from a &lt;em&gt;single distribution&lt;/em&gt; sampled from this HDP. Each time we instantiate the &lt;code&gt;norm_hdp&lt;/code&gt; variable, we are getting a sampler for a unique distribution. Below we sample five times and get five different distributions.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt;1
&lt;/span&gt;&lt;span class="lnt"&gt;2
&lt;/span&gt;&lt;span class="lnt"&gt;3
&lt;/span&gt;&lt;span class="lnt"&gt;4
&lt;/span&gt;&lt;span class="lnt"&gt;5
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;norm_hdp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;HierarchicalDirichletProcessSample&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;base_measure&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;alpha1&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;alpha2&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Series&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;norm_hdp&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hist&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;Histogram of samples from distribution drawn from Hierarchical DP&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;figure&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;&lt;img src="https://tdhopper.com/blog/hdp-sampling/2015-07-30-sampling-from-a-hierarchical-dirichlet-process_22_0.png"
width="420"
height="266"
srcset="https://tdhopper.com/blog/hdp-sampling/2015-07-30-sampling-from-a-hierarchical-dirichlet-process_22_0_hu_221dae2742e6e55a.png 480w, https://tdhopper.com/blog/hdp-sampling/2015-07-30-sampling-from-a-hierarchical-dirichlet-process_22_0_hu_a6c34d0c4cbf429d.png 1024w"
loading="lazy"
alt="Hierarchical DP sample 1"
class="gallery-image"
data-flex-grow="157"
data-flex-basis="378px"
&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src="https://tdhopper.com/blog/hdp-sampling/2015-07-30-sampling-from-a-hierarchical-dirichlet-process_22_1.png"
width="420"
height="266"
srcset="https://tdhopper.com/blog/hdp-sampling/2015-07-30-sampling-from-a-hierarchical-dirichlet-process_22_1_hu_708891a9b9a679e9.png 480w, https://tdhopper.com/blog/hdp-sampling/2015-07-30-sampling-from-a-hierarchical-dirichlet-process_22_1_hu_7aa26536164fbf21.png 1024w"
loading="lazy"
alt="Hierarchical DP sample 2"
class="gallery-image"
data-flex-grow="157"
data-flex-basis="378px"
&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src="https://tdhopper.com/blog/hdp-sampling/2015-07-30-sampling-from-a-hierarchical-dirichlet-process_22_2.png"
width="420"
height="266"
srcset="https://tdhopper.com/blog/hdp-sampling/2015-07-30-sampling-from-a-hierarchical-dirichlet-process_22_2_hu_46ff23202fc2254f.png 480w, https://tdhopper.com/blog/hdp-sampling/2015-07-30-sampling-from-a-hierarchical-dirichlet-process_22_2_hu_cd0b1569d438d869.png 1024w"
loading="lazy"
alt="Hierarchical DP sample 3"
class="gallery-image"
data-flex-grow="157"
data-flex-basis="378px"
&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src="https://tdhopper.com/blog/hdp-sampling/2015-07-30-sampling-from-a-hierarchical-dirichlet-process_22_3.png"
width="420"
height="266"
srcset="https://tdhopper.com/blog/hdp-sampling/2015-07-30-sampling-from-a-hierarchical-dirichlet-process_22_3_hu_7d5e5b13ea24133.png 480w, https://tdhopper.com/blog/hdp-sampling/2015-07-30-sampling-from-a-hierarchical-dirichlet-process_22_3_hu_732962871c7dda4d.png 1024w"
loading="lazy"
alt="Hierarchical DP sample 4"
class="gallery-image"
data-flex-grow="157"
data-flex-basis="378px"
&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src="https://tdhopper.com/blog/hdp-sampling/2015-07-30-sampling-from-a-hierarchical-dirichlet-process_22_4.png"
width="420"
height="266"
srcset="https://tdhopper.com/blog/hdp-sampling/2015-07-30-sampling-from-a-hierarchical-dirichlet-process_22_4_hu_5f4328cf005d49d2.png 480w, https://tdhopper.com/blog/hdp-sampling/2015-07-30-sampling-from-a-hierarchical-dirichlet-process_22_4_hu_c8c60bc93d1c9bb1.png 1024w"
loading="lazy"
alt="Hierarchical DP sample 5"
class="gallery-image"
data-flex-grow="157"
data-flex-basis="378px"
&gt;&lt;/p&gt;
&lt;p&gt;In a later post, I will discuss how these tools are applied in the realm of Bayesian nonparametrics.&lt;/p&gt;</description></item><item><title>10x Engineering</title><link>https://tdhopper.com/blog/10x-engineering/</link><pubDate>Tue, 28 Jul 2015 00:00:00 +0000</pubDate><guid>https://tdhopper.com/blog/10x-engineering/</guid><description>&lt;blockquote class="twitter-tweet" lang="en"&gt;&lt;p lang="en" dir="ltr"&gt;How to be a 10x engineer: Incur technical debt fast enough to appear 10x as productive as the ten engineers tasked with cleaning it up.&lt;/p&gt;&amp;mdash; Brian Degenhardt (@bmdhacks) &lt;a href="https://twitter.com/bmdhacks/status/560949130999365633"&gt;January 29, 2015&lt;/a&gt;&lt;/blockquote&gt; &lt;script async src="//platform.twitter.com/widgets.js" charset="utf-8"&gt;&lt;/script&gt;</description></item><item><title>Dirichlet Distribution and Dirichlet Processes</title><link>https://tdhopper.com/blog/dirichlet-distribution/</link><pubDate>Tue, 28 Jul 2015 00:00:00 +0000</pubDate><guid>https://tdhopper.com/blog/dirichlet-distribution/</guid><description>&lt;img src="https://tdhopper.com/images/dirichlet-distribution-mr-men.png" alt="Featured image of post Dirichlet Distribution and Dirichlet Processes" /&gt;&lt;h2 id="dirichlet-distribution"&gt;Dirichlet Distribution
&lt;/h2&gt;&lt;p&gt;The symmetric &lt;a class="link" href="https://en.wikipedia.org/wiki/Dirichlet_distribution" target="_blank" rel="noopener"
&gt;Dirichlet distribution&lt;/a&gt; (DD) can be considered a distribution of distributions. Each sample from the DD is a &lt;a class="link" href="https://en.wikipedia.org/wiki/Categorical_distribution" target="_blank" rel="noopener"
&gt;categorical distribution&lt;/a&gt; over $K$ categories. It is parameterized by $G_0$, a distribution over $K$ categories, and $\alpha$, a scale factor.&lt;/p&gt;
&lt;p&gt;The expected value of the DD is $G_0$. The variance of the DD is a function of the scale factor. When $\alpha$ is large, samples from $DD(\alpha\cdot G_0)$ will be very close to $G_0$. When $\alpha$ is small, samples will vary more widely.&lt;/p&gt;
&lt;p&gt;We demonstrate below by setting $G_0=[.2, .2, .6]$ and varying $\alpha$ from 0.1 to 1000. In each case, the mean of the samples is roughly $G_0$, but the standard deviation decreases as $\alpha$ increases.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt; 1
&lt;/span&gt;&lt;span class="lnt"&gt; 2
&lt;/span&gt;&lt;span class="lnt"&gt; 3
&lt;/span&gt;&lt;span class="lnt"&gt; 4
&lt;/span&gt;&lt;span class="lnt"&gt; 5
&lt;/span&gt;&lt;span class="lnt"&gt; 6
&lt;/span&gt;&lt;span class="lnt"&gt; 7
&lt;/span&gt;&lt;span class="lnt"&gt; 8
&lt;/span&gt;&lt;span class="lnt"&gt; 9
&lt;/span&gt;&lt;span class="lnt"&gt;10
&lt;/span&gt;&lt;span class="lnt"&gt;11
&lt;/span&gt;&lt;span class="lnt"&gt;12
&lt;/span&gt;&lt;span class="lnt"&gt;13
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;np&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;scipy.stats&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;dirichlet&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;set_printoptions&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;precision&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;stats&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scale_factor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;G0&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;.6&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10000&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;samples&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dirichlet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;alpha&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;scale_factor&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;G0&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rvs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34; alpha:&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;scale_factor&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34; element-wise mean:&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;samples&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;element-wise standard deviation:&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;samples&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;scale&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;stats&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scale&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;pre&gt;&lt;code&gt; alpha: 0.1
element-wise mean: [ 0.2 0.2 0.6]
element-wise standard deviation: [ 0.38 0.38 0.47]
alpha: 1
element-wise mean: [ 0.2 0.2 0.6]
element-wise standard deviation: [ 0.28 0.28 0.35]
alpha: 10
element-wise mean: [ 0.2 0.2 0.6]
element-wise standard deviation: [ 0.12 0.12 0.15]
alpha: 100
element-wise mean: [ 0.2 0.2 0.6]
element-wise standard deviation: [ 0.04 0.04 0.05]
alpha: 1000
element-wise mean: [ 0.2 0.2 0.6]
element-wise standard deviation: [ 0.01 0.01 0.02]
&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id="dirichlet-process"&gt;Dirichlet Process
&lt;/h2&gt;&lt;p&gt;The &lt;a class="link" href="https://en.wikipedia.org/wiki/Dirichlet_process" target="_blank" rel="noopener"
&gt;Dirichlet Process&lt;/a&gt; can be considered a way to &lt;em&gt;generalize&lt;/em&gt; the Dirichlet distribution. While the Dirichlet distribution is parameterized by a discrete distribution $G_0$ and generates samples that are similar discrete distributions, the Dirichlet process is parameterized by a generic distribution $H_0$ and generates samples which are distributions similar to $H_0$. The Dirichlet process also has a parameter $\alpha$ that determines how widely samples will vary from $H_0$.&lt;/p&gt;
&lt;p&gt;We can construct a sample $H$ (recall that $H$ is a probability distribution) from a Dirichlet process $\text{DP}(\alpha H_0)$ by drawing a countably infinite number of samples $\theta_k$ from $H_0$ and setting:&lt;/p&gt;
$$H=\sum_{k=1}^\infty \pi_k \cdot\delta(x-\theta_k)$$&lt;p&gt;where the $\pi_k$ are carefully chosen weights (more later) that sum to 1. ($\delta$ is the &lt;a class="link" href="https://en.wikipedia.org/wiki/Dirac_delta_function" target="_blank" rel="noopener"
&gt;Dirac delta function&lt;/a&gt;.)&lt;/p&gt;
&lt;p&gt;$H$, a sample from $DP(\alpha H_0)$, is a &lt;em&gt;probability distribution&lt;/em&gt; that looks similar to $H_0$ (also a distribution). In particular, $H$ is a &lt;em&gt;discrete&lt;/em&gt; distribution that takes the value $\theta_k$ with probability $\pi_k$. This sampled distribution $H$ is a discrete distribution &lt;em&gt;even if $H_0$ has continuous support&lt;/em&gt;; the &lt;a class="link" href="http://www.statlect.com/glossary/support_of_a_random_variable.htm" target="_blank" rel="noopener"
&gt;support&lt;/a&gt; of $H$ is a countably infinite subset of the support of $H_0$.&lt;/p&gt;
&lt;p&gt;The weights (the $\pi_k$ values) of a Dirichlet process sample relate the Dirichlet &lt;em&gt;process&lt;/em&gt; back to the Dirichlet &lt;em&gt;distribution&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;&lt;a class="link" href="http://www.arbylon.net/publications/ilda.pdf" target="_blank" rel="noopener"
&gt;Gregor Heinrich&lt;/a&gt; writes:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The defining property of the DP is that its samples have weights $\pi_k$ and locations $\theta_k$ distributed in such a way that when partitioning $S(H)$ into finitely many arbitrary disjoint subsets $S_1, \ldots, S_J$, $J&lt;\infty$, the sums of the weights $\pi_k$ in each of these $J$ subsets are distributed according to a Dirichlet distribution that is parameterized by $\alpha$ and a discrete base distribution (like $G_0$) whose weights are equal to the integrals of the base distribution $H_0$ over the subsets $S_n$.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;As an example, Heinrich imagines a DP with a standard normal base measure $H_0\sim \mathcal{N}(0,1)$. Let $H$ be a sample from $DP(H_0)$ and partition the real line (the support of a normal distribution) as $S_1=(-\infty, -1]$, $S_2=(-1, 1]$, and $S_3=(1, \infty)$; then&lt;/p&gt;
$$H(S_1),H(S_2), H(S_3) \sim \text{Dir}\left(\alpha\,\Phi(-1), \alpha\,(\Phi(1) - \Phi(-1)), \alpha\,(1-\Phi(1))\right)$$&lt;p&gt;where $\Phi$ is the standard normal CDF and $H(S_n)$ is the sum of the $\pi_k$ values whose $\theta_k$ lie in $S_n$.&lt;/p&gt;
&lt;p&gt;These $S_n$ subsets are chosen for convenience; similar results would hold for &lt;em&gt;any&lt;/em&gt; choice of $S_n$. For any sample from a Dirichlet &lt;em&gt;process&lt;/em&gt;, we can construct a sample from a Dirichlet &lt;em&gt;distribution&lt;/em&gt; by partitioning the support of the sample into a finite number of bins.&lt;/p&gt;
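This partition property can be checked numerically. The sketch below is my own addition, not code from the original post; the helper name `stick_breaking_weights` and the truncation level `K` are illustrative choices. It draws many truncated stick-breaking samples with a standard normal base measure and compares the average mass each sample assigns to the three subsets against the base measure's masses $H_0(S_n)$:

```python
# Sanity check of the DP partition property (a sketch, not from the original post):
# the average mass a DP sample assigns to S1=(-inf,-1], S2=(-1,1], S3=(1,inf)
# should match the base measure's probability of each subset.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def stick_breaking_weights(alpha, K, rng):
    """Truncated stick-breaking: K weights that sum to (almost) 1."""
    betas = rng.beta(1, alpha, size=K)
    # Length of stick remaining before each break: 1, (1-b1), (1-b1)(1-b2), ...
    remaining = np.concatenate([[1.0], np.cumprod(1 - betas)[:-1]])
    return betas * remaining

alpha, K, reps = 5.0, 500, 2000
edges = np.array([-np.inf, -1.0, 1.0, np.inf])
masses = np.zeros((reps, 3))
for r in range(reps):
    pis = stick_breaking_weights(alpha, K, rng)
    thetas = rng.standard_normal(K)  # atom locations drawn from N(0,1)
    for j in range(3):
        masses[r, j] = pis[(thetas > edges[j]) & (thetas <= edges[j + 1])].sum()

print("empirical mean masses:", masses.mean(axis=0))
print("base-measure masses:  ", np.diff(norm.cdf(edges)))
```

With these settings the empirical means land within a few hundredths of $[\Phi(-1),\ \Phi(1)-\Phi(-1),\ 1-\Phi(1)]$, as the partition property predicts.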
&lt;p&gt;There are several equivalent ways to choose the $\pi_k$ so that this property is satisfied: the Chinese restaurant process, the stick-breaking process, and the Pólya urn scheme.&lt;/p&gt;
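The Chinese restaurant process, for instance, can be sketched in a few lines. This is an illustrative sketch of my own (not code from the post): customer $n+1$ joins an existing table with probability proportional to its occupancy, or starts a new table with probability proportional to $\alpha$, and the long-run table proportions play the role of the $\pi_k$.

```python
# Minimal Chinese restaurant process sketch (illustrative, not from the post).
import numpy as np

def chinese_restaurant_process(alpha, n_customers, rng):
    counts = []  # number of customers seated at each table
    for n in range(n_customers):
        # Existing tables weighted by occupancy; a new table weighted by alpha.
        probs = np.array(counts + [alpha], dtype=float) / (n + alpha)
        table = rng.choice(len(probs), p=probs)
        if table == len(counts):
            counts.append(1)  # open a new table
        else:
            counts[table] += 1
    return np.array(counts)

rng = np.random.default_rng(1)
counts = chinese_restaurant_process(alpha=10.0, n_customers=5000, rng=rng)
print("tables:", len(counts), "total customers:", counts.sum())
```

Note the characteristic rich-get-richer behavior: a handful of early tables accumulate most of the customers, while the expected number of tables grows only logarithmically in the number of customers.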
&lt;p&gt;To generate $\left\{\pi_k\right\}$ according to a stick-breaking process, we define $\beta_k$ to be a sample from $\text{Beta}(1,\alpha)$. $\pi_1$ is equal to $\beta_1$. Successive values are defined recursively as&lt;/p&gt;
$$\pi_k=\beta_k \prod_{j=1}^{k-1}(1-\beta_j).$$&lt;p&gt;Thus, if we want to draw a sample from a Dirichlet process, we could, in theory, sample an infinite number of $\theta_k$ values from the base distribution $H_0$ and an infinite number of $\beta_k$ values from the Beta distribution. Of course, sampling an infinite number of values is easier in theory than in practice.&lt;/p&gt;
&lt;p&gt;However, because the $\pi_k$ values are &lt;em&gt;positive&lt;/em&gt; and sum to 1, they must, in expectation, get increasingly small as $k\rightarrow\infty$. Thus, we can reasonably approximate a sample $H\sim DP(\alpha H_0)$ by drawing &lt;em&gt;enough&lt;/em&gt; samples such that $\sum_{k=1}^K \pi_k\approx 1$.&lt;/p&gt;
&lt;p&gt;We use this method below to draw approximate samples from several Dirichlet processes with a standard normal ($\mathcal{N}(0,1)$) base distribution but varying $\alpha$ values.&lt;/p&gt;
&lt;p&gt;Recall that a single sample from a Dirichlet process is a probability distribution over a countably infinite subset of the support of the base measure.&lt;/p&gt;
&lt;p&gt;The blue line is the PDF for a standard normal. The black lines represent the $\theta_k$ and $\pi_k$ values; $\theta_k$ is indicated by the position of the black line on the $x$-axis; $\pi_k$ is proportional to the height of each line.&lt;/p&gt;
&lt;p&gt;We generate enough $\pi_k$ values that their sum is greater than 0.99. When $\alpha$ is small, very few $\theta_k$&amp;rsquo;s will have corresponding $\pi_k$ values larger than $0.01$. However, as $\alpha$ grows large, the sample becomes a more accurate (though still discrete) approximation of $\mathcal{N}(0,1)$.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt; 1
&lt;/span&gt;&lt;span class="lnt"&gt; 2
&lt;/span&gt;&lt;span class="lnt"&gt; 3
&lt;/span&gt;&lt;span class="lnt"&gt; 4
&lt;/span&gt;&lt;span class="lnt"&gt; 5
&lt;/span&gt;&lt;span class="lnt"&gt; 6
&lt;/span&gt;&lt;span class="lnt"&gt; 7
&lt;/span&gt;&lt;span class="lnt"&gt; 8
&lt;/span&gt;&lt;span class="lnt"&gt; 9
&lt;/span&gt;&lt;span class="lnt"&gt;10
&lt;/span&gt;&lt;span class="lnt"&gt;11
&lt;/span&gt;&lt;span class="lnt"&gt;12
&lt;/span&gt;&lt;span class="lnt"&gt;13
&lt;/span&gt;&lt;span class="lnt"&gt;14
&lt;/span&gt;&lt;span class="lnt"&gt;15
&lt;/span&gt;&lt;span class="lnt"&gt;16
&lt;/span&gt;&lt;span class="lnt"&gt;17
&lt;/span&gt;&lt;span class="lnt"&gt;18
&lt;/span&gt;&lt;span class="lnt"&gt;19
&lt;/span&gt;&lt;span class="lnt"&gt;20
&lt;/span&gt;&lt;span class="lnt"&gt;21
&lt;/span&gt;&lt;span class="lnt"&gt;22
&lt;/span&gt;&lt;span class="lnt"&gt;23
&lt;/span&gt;&lt;span class="lnt"&gt;24
&lt;/span&gt;&lt;span class="lnt"&gt;25
&lt;/span&gt;&lt;span class="lnt"&gt;26
&lt;/span&gt;&lt;span class="lnt"&gt;27
&lt;/span&gt;&lt;span class="lnt"&gt;28
&lt;/span&gt;&lt;span class="lnt"&gt;29
&lt;/span&gt;&lt;span class="lnt"&gt;30
&lt;/span&gt;&lt;span class="lnt"&gt;31
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;plt&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;scipy.stats&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;beta&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;norm&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;dirichlet_sample_approximation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;base_measure&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tol&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.01&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;betas&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;pis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;betas&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;beta&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rvs&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;pis&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;betas&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="nb"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pis&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;1.&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;tol&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;betas&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;new_beta&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;beta&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rvs&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;betas&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;new_beta&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;pis&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;new_beta&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;pis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pis&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;thetas&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;base_measure&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;pis&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;pis&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;thetas&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;plot_normal_dp_approximation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;figure&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;Dirichlet Process Sample with N(0,1) Base Measure&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;suptitle&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;alpha: &lt;/span&gt;&lt;span class="si"&gt;%s&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;pis&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;thetas&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dirichlet_sample_approximation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;norm&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rvs&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;pis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pis&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;norm&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pdf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;pis&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;max&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;vlines&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;thetas&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pis&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;X&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linspace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;norm&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pdf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;plot_normal_dp_approximation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;.1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;plot_normal_dp_approximation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;plot_normal_dp_approximation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;plot_normal_dp_approximation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;&lt;img src="https://tdhopper.com/blog/dirichlet-distribution/2015-07-28-dirichlet-distribution-dirichlet-process_5_0.png"
width="376"
height="274"
srcset="https://tdhopper.com/blog/dirichlet-distribution/2015-07-28-dirichlet-distribution-dirichlet-process_5_0_hu_a75610b858e14bf9.png 480w, https://tdhopper.com/blog/dirichlet-distribution/2015-07-28-dirichlet-distribution-dirichlet-process_5_0_hu_bab3ccc13872cf62.png 1024w"
loading="lazy"
alt="Dirichlet Process sample with alpha=0.1"
class="gallery-image"
data-flex-grow="137"
data-flex-basis="329px"
&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src="https://tdhopper.com/blog/dirichlet-distribution/2015-07-28-dirichlet-distribution-dirichlet-process_5_1.png"
width="376"
height="274"
srcset="https://tdhopper.com/blog/dirichlet-distribution/2015-07-28-dirichlet-distribution-dirichlet-process_5_1_hu_2a51ea7464097484.png 480w, https://tdhopper.com/blog/dirichlet-distribution/2015-07-28-dirichlet-distribution-dirichlet-process_5_1_hu_4ef86ec0c9c34775.png 1024w"
loading="lazy"
alt="Dirichlet Process sample with alpha=1"
class="gallery-image"
data-flex-grow="137"
data-flex-basis="329px"
&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src="https://tdhopper.com/blog/dirichlet-distribution/2015-07-28-dirichlet-distribution-dirichlet-process_5_2.png"
width="376"
height="274"
srcset="https://tdhopper.com/blog/dirichlet-distribution/2015-07-28-dirichlet-distribution-dirichlet-process_5_2_hu_31c884ed34da6c44.png 480w, https://tdhopper.com/blog/dirichlet-distribution/2015-07-28-dirichlet-distribution-dirichlet-process_5_2_hu_396be73f32945635.png 1024w"
loading="lazy"
alt="Dirichlet Process sample with alpha=10"
class="gallery-image"
data-flex-grow="137"
data-flex-basis="329px"
&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src="https://tdhopper.com/blog/dirichlet-distribution/2015-07-28-dirichlet-distribution-dirichlet-process_5_3.png"
width="376"
height="274"
srcset="https://tdhopper.com/blog/dirichlet-distribution/2015-07-28-dirichlet-distribution-dirichlet-process_5_3_hu_91ec0fc5cca07904.png 480w, https://tdhopper.com/blog/dirichlet-distribution/2015-07-28-dirichlet-distribution-dirichlet-process_5_3_hu_90761313b86e15ef.png 1024w"
loading="lazy"
alt="Dirichlet Process sample with alpha=1000"
class="gallery-image"
data-flex-grow="137"
data-flex-basis="329px"
&gt;&lt;/p&gt;
&lt;p&gt;Often we want to draw samples from a &lt;em&gt;distribution sampled from a Dirichlet process&lt;/em&gt; instead of from the Dirichlet process itself. Much of the literature on the topic unhelpfully refers to this as sampling from a Dirichlet process.&lt;/p&gt;
&lt;p&gt;Fortunately, we don&amp;rsquo;t have to draw an infinite number of samples from the base distribution and stick-breaking process to do this. Instead, we can draw these samples &lt;em&gt;as they are needed&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;Suppose, for example, we know a finite number of the $\theta_k$ and $\pi_k$ values for a sample $H\sim \text{DP}(\alpha H_0)$. For example, we know&lt;/p&gt;
$$\pi_1=0.5,\; \pi_2=0.3,\; \theta_1=0.1,\; \theta_2=-0.5.$$&lt;p&gt;To sample from $H$, we can generate a uniform random number $u$ between 0 and 1. If $u &lt; 0.5$, our sample is $0.1$. If $0.5 \le u &lt; 0.8$, our sample is $-0.5$. If $u \ge 0.8$, our sample from $H$ will be a new draw $\theta_3$ from $H_0$. At the same time, we should also sample and store the corresponding weight $\pi_3$. When we draw our next sample, we will again draw $u\sim\text{Uniform}(0,1)$ but will compare it against $\pi_1$, $\pi_2$, &lt;em&gt;and&lt;/em&gt; $\pi_3$.&lt;/p&gt;
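The bookkeeping just described can be sketched as a stand-alone function. This is a minimal illustration of the procedure rather than the class used in this post; it assumes $\alpha = 1$ and a standard normal $H_0$, with `random.betavariate` and `random.gauss` standing in for the beta and base-measure draws:

```python
import random

# Known prefix of the sample H: pi_1, pi_2 and theta_1, theta_2 from above.
weights = [0.5, 0.3]
atoms = [0.1, -0.5]
alpha = 1.0  # concentration parameter (assumed here)

def draw():
    """Draw one sample from H, lazily extending the known prefix."""
    u = random.random()
    acc = 0.0
    for weight, theta in zip(weights, atoms):
        acc += weight
        if acc > u:      # u landed in this atom's interval
            return theta
    # u landed in the not-yet-broken remainder of the stick:
    # draw a new atom from H_0 and a new stick piece, and store both.
    remaining = 1.0 - sum(weights)
    weights.append(random.betavariate(1, alpha) * remaining)
    atoms.append(random.gauss(0, 1))  # H_0 assumed standard normal
    return atoms[-1]
```

Each call either returns one of the cached atoms (with probability equal to its weight) or extends the stick-breaking representation by one new atom.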
&lt;p&gt;The class below will take a base distribution $H_0$ and $\alpha$ as arguments to its constructor. The class instance can then be called to generate samples from $H\sim \text{DP}(\alpha H_0)$.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt; 1
&lt;/span&gt;&lt;span class="lnt"&gt; 2
&lt;/span&gt;&lt;span class="lnt"&gt; 3
&lt;/span&gt;&lt;span class="lnt"&gt; 4
&lt;/span&gt;&lt;span class="lnt"&gt; 5
&lt;/span&gt;&lt;span class="lnt"&gt; 6
&lt;/span&gt;&lt;span class="lnt"&gt; 7
&lt;/span&gt;&lt;span class="lnt"&gt; 8
&lt;/span&gt;&lt;span class="lnt"&gt; 9
&lt;/span&gt;&lt;span class="lnt"&gt;10
&lt;/span&gt;&lt;span class="lnt"&gt;11
&lt;/span&gt;&lt;span class="lnt"&gt;12
&lt;/span&gt;&lt;span class="lnt"&gt;13
&lt;/span&gt;&lt;span class="lnt"&gt;14
&lt;/span&gt;&lt;span class="lnt"&gt;15
&lt;/span&gt;&lt;span class="lnt"&gt;16
&lt;/span&gt;&lt;span class="lnt"&gt;17
&lt;/span&gt;&lt;span class="lnt"&gt;18
&lt;/span&gt;&lt;span class="lnt"&gt;19
&lt;/span&gt;&lt;span class="lnt"&gt;20
&lt;/span&gt;&lt;span class="lnt"&gt;21
&lt;/span&gt;&lt;span class="lnt"&gt;22
&lt;/span&gt;&lt;span class="lnt"&gt;23
&lt;/span&gt;&lt;span class="lnt"&gt;24
&lt;/span&gt;&lt;span class="lnt"&gt;25
&lt;/span&gt;&lt;span class="lnt"&gt;26
&lt;/span&gt;&lt;span class="lnt"&gt;27
&lt;/span&gt;&lt;span class="lnt"&gt;28
&lt;/span&gt;&lt;span class="lnt"&gt;29
&lt;/span&gt;&lt;span class="lnt"&gt;30
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;numpy.random&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;choice&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;DirichletProcessSample&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="fm"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;base_measure&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;base_measure&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;base_measure&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;alpha&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cache&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;weights&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;total_stick_used&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="fm"&gt;__call__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;remaining&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;total_stick_used&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;DirichletProcessSample&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;roll_die&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;weights&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;remaining&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="kc"&gt;None&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;weights&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;stick_piece&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;beta&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rvs&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;remaining&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;total_stick_used&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;stick_piece&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;weights&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stick_piece&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;new_value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;base_measure&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;new_value&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;new_value&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nd"&gt;@staticmethod&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;roll_die&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;weights&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;weights&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;choice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;weights&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;weights&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kc"&gt;None&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;What this Dirichlet process class does could be called &lt;em&gt;stochastic memoization&lt;/em&gt;. This idea was first articulated in somewhat abstruse terms by &lt;a class="link" href="http://danroy.org/papers/RoyManGooTen-ICMLNPB-2008.pdf" target="_blank" rel="noopener"
&gt;Daniel Roy et al.&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Below are histograms of 10,000 samples drawn from &lt;em&gt;distributions sampled from&lt;/em&gt; Dirichlet processes with a standard normal base distribution and varying $\alpha$ values.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt; 1
&lt;/span&gt;&lt;span class="lnt"&gt; 2
&lt;/span&gt;&lt;span class="lnt"&gt; 3
&lt;/span&gt;&lt;span class="lnt"&gt; 4
&lt;/span&gt;&lt;span class="lnt"&gt; 5
&lt;/span&gt;&lt;span class="lnt"&gt; 6
&lt;/span&gt;&lt;span class="lnt"&gt; 7
&lt;/span&gt;&lt;span class="lnt"&gt; 8
&lt;/span&gt;&lt;span class="lnt"&gt; 9
&lt;/span&gt;&lt;span class="lnt"&gt;10
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;pd&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;base_measure&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;lambda&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;norm&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rvs&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;n_samples&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10000&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;samples&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;dirichlet_norm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;DirichletProcessSample&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;base_measure&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;base_measure&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;samples&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;Alpha: &lt;/span&gt;&lt;span class="si"&gt;%s&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;dirichlet_norm&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_samples&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;samples&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hist&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;&lt;img src="https://tdhopper.com/blog/dirichlet-distribution/2015-07-28-dirichlet-distribution-dirichlet-process_9_0.png"
width="380"
height="266"
srcset="https://tdhopper.com/blog/dirichlet-distribution/2015-07-28-dirichlet-distribution-dirichlet-process_9_0_hu_97d84cf08684a4f0.png 480w, https://tdhopper.com/blog/dirichlet-distribution/2015-07-28-dirichlet-distribution-dirichlet-process_9_0_hu_fa32d1a1f23d0dc1.png 1024w"
loading="lazy"
alt="Histograms of samples from Dirichlet processes with varying alpha"
class="gallery-image"
data-flex-grow="142"
data-flex-basis="342px"
&gt;&lt;/p&gt;
&lt;p&gt;Note that these histograms look very similar to the corresponding plots of sampled distributions above. However, these histograms are plotting &lt;em&gt;points sampled from a distribution sampled from a Dirichlet process&lt;/em&gt;, while the plots above showed approximate &lt;em&gt;distributions sampled from the Dirichlet process&lt;/em&gt;. Of course, as the number of samples from each $H$ grows large, we would expect the histogram to be a very good empirical approximation of $H$.&lt;/p&gt;
&lt;p&gt;&lt;a class="link" href="https://tdhopper.com/blog/hdp-sampling/" &gt;In another post&lt;/a&gt;, I will look at how this &lt;code&gt;DirichletProcessSample&lt;/code&gt; class can be used to draw samples from a &lt;em&gt;hierarchical&lt;/em&gt; Dirichlet process.&lt;/p&gt;</description></item><item><title>Handy One-off Webpages</title><link>https://tdhopper.com/blog/handy-one-off-webpages/</link><pubDate>Mon, 27 Jul 2015 00:00:00 +0000</pubDate><guid>https://tdhopper.com/blog/handy-one-off-webpages/</guid><description>&lt;p&gt;I&amp;rsquo;m starting to love single-page informational websites. For example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="http://keepachangelog.com" target="_blank" rel="noopener"
&gt;Keep a CHANGELOG&lt;/a&gt;: &lt;a class="link" href="http://olivierlacan.com/" title="Olivier Lacan"
target="_blank" rel="noopener"
&gt;Olivier Lacan&lt;/a&gt;&amp;rsquo;s guide to writing a &lt;em&gt;CHANGELOG.md&lt;/em&gt; for open source projects.&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="http://strftime.org" target="_blank" rel="noopener"
&gt;strftime&lt;/a&gt;: &lt;a class="link" href="https://twitter.com/mccutchen" title="Will McCutchen (@mccutchen) | Twitter"
target="_blank" rel="noopener"
&gt;Will McCutchen&lt;/a&gt;&amp;rsquo;s little page of the easily forgettable date formatting codes for Python programmers.&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="http://pyformat.info" target="_blank" rel="noopener"
&gt;PyFormat&lt;/a&gt;: &lt;a class="link" href="https://twitter.com/ulope" title="Ulrich Petri (@ulope) | Twitter"
target="_blank" rel="noopener"
&gt;Ulrich Petri&lt;/a&gt; and &lt;a class="link" href="https://github.com/zerok" title="zerok (Horst Gutmann) · GitHub"
target="_blank" rel="noopener"
&gt;Horst Gutmann&lt;/a&gt;&amp;rsquo;s guide to new style (&lt;code&gt;.format()&lt;/code&gt;) string formatting in Python.&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://twofactorauth.org" target="_blank" rel="noopener"
&gt;Two Factor Auth&lt;/a&gt;: &amp;ldquo;List of websites and whether or not they support 2FA.&amp;rdquo;&lt;/li&gt;
&lt;/ul&gt;
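As a small illustration of what the strftime and PyFormat pages catalog (this example is mine, not taken from either site):

```python
from datetime import datetime

d = datetime(2015, 7, 27, 9, 30)

# The directive codes strftime.org lists:
assert d.strftime("%Y-%m-%d %H:%M") == "2015-07-27 09:30"

# The new-style .format() mini-language PyFormat walks through:
assert "{:%B %d, %Y}".format(d) == "July 27, 2015"
```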
&lt;p&gt;My website &lt;a class="link" href="https://freespace.tdhopper.com" target="_blank" rel="noopener"
&gt;Free Space&lt;/a&gt; is in this same vein.&lt;/p&gt;
&lt;p&gt;Publishing sites like this is free with &lt;a class="link" href="https://netlify.com" target="_blank" rel="noopener"
&gt;Netlify&lt;/a&gt; and &lt;a class="link" href="https://pages.cloudflare.com" target="_blank" rel="noopener"
&gt;Cloudflare Pages&lt;/a&gt;. I would love to see more of them created!&lt;/p&gt;</description></item><item><title>Thinking at Work</title><link>https://tdhopper.com/blog/thinking-at-work/</link><pubDate>Wed, 22 Jul 2015 00:00:00 +0000</pubDate><guid>https://tdhopper.com/blog/thinking-at-work/</guid><description>&lt;p&gt;Having worked from home for the last few years, I have a hard time understanding how people get anything done in open-floor plan offices. I would be overwhelmed and frustrated by the noise and commotion.&lt;/p&gt;
&lt;p&gt;I assumed open-floor plans for software shops were a relatively new invention. However, I just started reading &lt;a class="link" href="https://en.wikipedia.org/wiki/Peopleware:_Productive_Projects_and_Teams" target="_blank" rel="noopener"
&gt;Peopleware: Productive Projects and Teams&lt;/a&gt;, first published in 1987, and discovered that the first third of the book rails against open floor plan offices. I particularly enjoyed this quote:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;In my years at Bell Labs, we worked in two-person offices. They were spacious, quiet, and the phones could be diverted. I shared my office with Wendl Thomis, who went on to build a small empire as an electric toy maker. In those days, he was working on the Electronic Switching System fault dictionary. The dictionary scheme relied upon the notion of n-space proximity, a concept that was hairy enough to challenge even Wendel&amp;rsquo;s powers of concentration. One afternoon, I was bent over a program listing while Wendl was staring into space, his feet propped up on his desk. Our boss came in and asked, &amp;ldquo;Wendl! What are you doing?&amp;rdquo; Wendl said, &amp;ldquo;I&amp;rsquo;m thinking.&amp;rdquo; And the boss said, &amp;ldquo;Can&amp;rsquo;t you do that at home?&amp;rdquo;&lt;/p&gt;
&lt;p&gt;The difference between that Bell Labs environment and a typical modern-day office plan is that in those quiet offices, one at least had the option of thinking on the job. In most of the office space we encounter today, there is enough noise and interruption to make any serious thinking virtually impossible. More is the shame: Your people bring their brains with them every morning. They could put them to work for you at no additional cost if only there were a small measure of peace and quiet in the workplace.&lt;/p&gt;
&lt;/blockquote&gt;</description></item><item><title>How I Became a Data Scientist Despite Having Been a Math Major</title><link>https://tdhopper.com/blog/how-i-became-a-data-scientist/</link><pubDate>Mon, 11 May 2015 00:00:00 +0000</pubDate><guid>https://tdhopper.com/blog/how-i-became-a-data-scientist/</guid><description>&lt;img src="https://tdhopper.com/math-class.png" alt="Featured image of post How I Became a Data Scientist Despite Having Been a Math Major" /&gt;&lt;p&gt;Caution: the following post is laden with qualitative extrapolation of
anecdotes and impressions. Perhaps ironically (though perhaps not), it is not
a data-driven approach to measuring the efficacy of math majors as data
scientists. If you have a differing opinion, I would greatly appreciate it if
you would carefully articulate it and share it with the world.&lt;/p&gt;
&lt;p&gt;I recently started my third &amp;ldquo;real&amp;rdquo; job since finishing school; at my first and
third jobs I have been a &amp;ldquo;data scientist&amp;rdquo;. I was a math major in college (and
pretty good at it) and spent a year in the math Ph.D. program at the
University of Virginia (and performed well there as well). These two facts
alone would not have equipped me for a career in data science. In fact, it
remains unclear to me that those two facts alone would have prepared me for
any career (with the possible exception of teaching) without significantly
more training.&lt;/p&gt;
&lt;h2 id="there-has-never-been-a-better-time-to-be-a-mathematician"&gt;&amp;ldquo;There has never been a better time to be a mathematician&amp;rdquo;?
&lt;/h2&gt;&lt;p&gt;When I was in college &lt;a class="link" href="https://web.archive.org/web/20160315082911/https://www.bloomberg.com/news/articles/2006-01-22/math-will-rock-your-world" target="_blank" rel="noopener"
&gt;Business Week published an article&lt;/a&gt; declaring &amp;ldquo;There has never been a better time to be a mathematician.&amp;rdquo;
At the time, I saw an enormous disconnect between the piece and what I was
being taught in math classes (and thus what I considered to be a
&amp;ldquo;mathematician&amp;rdquo;). I have come across other pieces lauding this as the age of
the mathematicians, and more often than not, I&amp;rsquo;ve wondered if the author knew
what students actually studied in math departments.&lt;/p&gt;
&lt;h3 id="background-on-me"&gt;Background on Me
&lt;/h3&gt;&lt;p&gt;The math courses I had as an undergraduate were:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Linear algebra&lt;/li&gt;
&lt;li&gt;Discrete math&lt;/li&gt;
&lt;li&gt;Differential equations (ODEs and numerical)&lt;/li&gt;
&lt;li&gt;Theory of statistics 1&lt;/li&gt;
&lt;li&gt;Numerical analysis 1 (numerical linear algebra) and 2 (quadrature)&lt;/li&gt;
&lt;li&gt;Abstract algebra&lt;/li&gt;
&lt;li&gt;Number theory&lt;/li&gt;
&lt;li&gt;Real analysis&lt;/li&gt;
&lt;li&gt;Complex analysis&lt;/li&gt;
&lt;li&gt;Intermediate analysis (point set topology)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;My program also required a one semester intro to C++ and two semesters of
freshman physics. In my year as a math Ph.D. student, I took analysis,
algebra, and topology classes; had I stayed in the program, my future
coursework would have been similar: pure math where homework problems
consisted almost exclusively of proofs done with pen and paper (or in LaTeX).&lt;/p&gt;
&lt;h2 id="what-is-data-science"&gt;What is Data Science?
&lt;/h2&gt;&lt;p&gt;Though my current position occasionally requires mathematical proof, I suspect
that is rare among data scientists. While the &amp;ldquo;data science&amp;rdquo; demarcation
problem is challenging (and I will not seek to solve it here), it seems
evident that my curriculum lacked preparation in many essential areas of data
science. Chief among these are programming skill, knowledge of experimental
statistics, and experience with math modeling.&lt;/p&gt;
&lt;h3 id="data-science-requires-programming-and-engineering"&gt;Data Science Requires Programming and Engineering
&lt;/h3&gt;&lt;p&gt;Few would argue that programming ability is not a key skill of data science.
As Drew Conway &lt;a class="link" href="http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram" target="_blank" rel="noopener"
&gt;has argued&lt;/a&gt;, a data scientist need not have a degree in computer
science, but &amp;ldquo;Being able to manipulate text files at the command-line,
understanding vectorized operations, thinking algorithmically; these are the
hacking skills that make for a successful data hacker.&amp;rdquo; Many of my undergrad
peers, having briefly seen C++ freshman year and occasionally used Mathematica
to solve ODEs for homework assignments, would have been unaware that
manipulation of a file from the command-line was even possible, much less have
been able to write a simple sed script; there was little difference with my
grad school classmates.&lt;/p&gt;
&lt;p&gt;Many data science positions require even more than the ability to solve
problems with code. As &lt;a class="link" href="http://treycausey.com/software_dev_skills.html" target="_blank" rel="noopener"
&gt;Trey Causey has recently
explained&lt;/a&gt;, many positions
require understanding of &lt;em&gt;software engineering&lt;/em&gt; skills and tools such as
writing reusable code, using version control, software testing, and logging.
Though I gained a fair bit of programming skill in college, these skills, now
essential in my daily work, remained foreign to me until years later.&lt;/p&gt;
&lt;h3 id="data-science-requires-applied-statistics"&gt;Data Science Requires Applied Statistics
&lt;/h3&gt;&lt;p&gt;My math training was light on statistics courses. Though my brief exposure to
mathematical statistics has been valuable in picking up machine learning,
experimental statistics was missing altogether. Many data science teams are
interested in questions of causal inference and design and analysis of
experiments; some would make these &lt;em&gt;essential&lt;/em&gt; skills for a data scientist. I
learned nothing about these topics in math departments. Moreover, machine
learning, also a cornerstone of data science, is not a subject I could have
even defined until after I was finished with my math coursework; at the end of
college, I would have said artificial intelligence was mostly about rule-based
systems in Lisp and Prolog.&lt;/p&gt;
&lt;h3 id="data-science-involves-very-applied-math"&gt;Data Science Involves &lt;em&gt;Very Applied Math&lt;/em&gt;
&lt;/h3&gt;&lt;p&gt;Even if statistics had played a more prominent role in my coursework, those who
have studied statistics know there is often a gulf between understanding
textbook statistics and being able to effectively apply statistical models and
methods to real world problems. This is only an aspect of a bigger issue:
mathematical (including statistical) modeling is an extraordinarily
challenging problem, but instruction on effectively modeling real-world problems
is absent from many math programs. To this day, defining my problem in
mathematical terms is one of the hardest problems I face; I am certain that I am
not alone on this. Though I am now armed with a wide variety of mathematical
models, it is rarely clear exactly which model can or should be applied in a
given situation.&lt;/p&gt;
&lt;p&gt;I suspect that many people, even technical people, are uncertain as to what
academic math is beyond undergraduate calculus. Mathematicians mostly work in
the logical manipulation of abstractly defined structures. These structures
rarely bear any necessary relationship to physical entities or data sets
outside the abstractly defined domain of discourse. Though some might argue I
am speaking only of &amp;ldquo;pure&amp;rdquo; mathematics, this is often true of what is formally
known as &amp;ldquo;applied mathematics&amp;rdquo;. &lt;a class="link" href="http://www.johndcook.com/veryappliedmath.html" target="_blank" rel="noopener"
&gt;John D. Cook has made similar
observations&lt;/a&gt; about the
limitations of pure and applied math (as proper disciplines) in dubbing
himself a &amp;ldquo;very applied mathematician&amp;rdquo;. Very applied mathematics is &amp;ldquo;an
interest in the grubby work required to see the math actually used and a
willingness to carry it out. This involves not just math but also computing,
consulting, managing, marketing, etc.&amp;rdquo; These skills are conspicuously absent
from most math curricula I am familiar with.&lt;/p&gt;
&lt;h2 id="math--data-science"&gt;Math → Data Science
&lt;/h2&gt;&lt;p&gt;Given this description of how my schooling left me woefully unprepared for a
career in data science, one might ask how I have had two jobs with that title.
I can think of several (though probably not all) reasons.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;First, the academic study of mathematics provides much of the theoretical
underpinnings of data science.&lt;/strong&gt; Mathematics underlies the study of machine
learning, statistics, optimization, data structures, analysis of algorithms,
computer architecture, and other important aspects of data science. Knowledge
of mathematics (potentially) allows the learner to more quickly grasp each of
these fields. For example, learning how &lt;a class="link" href="http://en.wikipedia.org/wiki/Principal_component_analysis" target="_blank" rel="noopener"
&gt;principal component
analysis&lt;/a&gt;—a math
model that can be applied and interpreted by someone without formal
mathematical training—works will be significantly easier for someone with
earlier exposure to linear algebra. On a meta-level, training in mathematics
forces students to think carefully and solve hard problems; these skills are
valuable in many fields, including data science.&lt;/p&gt;
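&lt;p&gt;For instance, the core of PCA is a short linear-algebra computation: center the data, eigendecompose its covariance matrix, and project onto the leading eigenvectors. A minimal sketch in Python with NumPy (the data here is synthetic, purely for illustration):&lt;/p&gt;

```python
import numpy as np

# Synthetic data: 100 samples, 5 features (purely illustrative)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))

# Center the data; PCA is defined on mean-centered columns.
Xc = X - X.mean(axis=0)

# Eigendecomposition of the (symmetric) covariance matrix.
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)

# Sort principal components by explained variance, largest first
# (eigh returns eigenvalues in ascending order).
order = np.argsort(eigvals)[::-1]
components = eigvecs[:, order]

# Project onto the top two principal components.
projected = Xc @ components[:, :2]
print(projected.shape)
```

&lt;p&gt;Seeing &lt;em&gt;why&lt;/em&gt; the covariance matrix&amp;rsquo;s eigenvectors are the directions of maximal variance is exactly the kind of step that prior linear algebra makes routine.&lt;/p&gt;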
&lt;p&gt;&lt;strong&gt;My second reason is connected to the first: I unwittingly took a number of
courses that later played important roles in my data science toolkit.&lt;/strong&gt; For
example, my current work in Bayesian inference has been made possible by my
knowledge of linear algebra, numerical analysis, stochastic processes, measure
theory, and mathematical statistics.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Third, I did a minor in computer science as an undergraduate.&lt;/strong&gt; That
provided a solid foundation for me when I decided to get serious about
building programming skill in 2010. Though my academic exposure to computer
science lacked any software engineering skills, I left college with a solid grasp
of basic data structures, analysis of algorithms, complexity theory, and a
handful of programming languages.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Fourth, I did a master&amp;rsquo;s degree in operations research (after my year as a
math PhD student convinced me pure math wasn&amp;rsquo;t for me).&lt;/strong&gt; This provided me
with experience in math modeling, a broad knowledge of mathematical
optimization (central to machine learning), and the opportunity to take
graduate-level machine learning classes.&lt;sup id="fnref:1"&gt;&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref"&gt;1&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Fifth, my insatiable curiosity in computers and problem solving has played a
key role in my career success.&lt;/strong&gt; Eager to learn something about computer
programming, I taught myself PHP and SQL as a high school student (to make
Tolkien fan sites, incidentally). Having been given small Mathematica-based
homework assignments in freshman differential equations, I bought and read a
book on programming Mathematica. Throughout college and grad school, I often
tried—and sometimes managed—to write programs to solve homework problems
that professors expected to be solved by hand. This curiosity has proven
valuable time and time again as I&amp;rsquo;ve been required to learn new skills and
solve technical problems of all varieties. I&amp;rsquo;m comfortable jumping in to solve
a new problem at work, because I&amp;rsquo;ve been doing that on my own time for fifteen
years.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Sixth, I have been fortunate enough to have employers who have
patiently taught me and given me the freedom to learn on my own.&lt;/strong&gt; I have
learned an enormous amount in my two-and-a-half-year professional career, and
I don&amp;rsquo;t anticipate slowing down any time soon. As &lt;a class="link" href="https://twitter.com/mat_kelcey/status/328179275893207042" target="_blank" rel="noopener"
&gt;Mat
Kelcey&lt;/a&gt; has said:
always be sure you&amp;rsquo;re not the smartest one in the room. I am very thankful for
three jobs where I&amp;rsquo;ve been surrounded by smart people who have taught me a
lot, and for supervisors who trust me enough to let me learn on my own.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Finally,&lt;sup id="fnref:2"&gt;&lt;a href="#fn:2" class="footnote-ref" role="doc-noteref"&gt;2&lt;/a&gt;&lt;/sup&gt; it would be hard for me to overstate the value of my four and a half years of
participation in the data science community on Twitter.&lt;/strong&gt; Through Twitter, I
have the ear of some of data science&amp;rsquo;s brightest minds (most of whom I&amp;rsquo;ve
never met in person), and I&amp;rsquo;ve built a peer network that has helped me find my
current and last job. However, I mostly want to emphasize the pedagogical
value of Twitter. Every day, I&amp;rsquo;m updated on the release of new software tools
for data science, the best new blog posts for our field, and the musings of
some of my data science heroes. Of course, I don&amp;rsquo;t read every blog post or
learn every software tool. But Twitter helps me to recognize which posts are
most worth my time, and because of Twitter, I know &lt;em&gt;something&lt;/em&gt; instead of
&lt;em&gt;nothing&lt;/em&gt; about Theano, Scalding, and dplyr.&lt;sup id="fnref:3"&gt;&lt;a href="#fn:3" class="footnote-ref" role="doc-noteref"&gt;3&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;h2 id="conclusions"&gt;Conclusions
&lt;/h2&gt;&lt;p&gt;I don&amp;rsquo;t know to what extent my experience generalizes&lt;sup id="fnref:4"&gt;&lt;a href="#fn:4" class="footnote-ref" role="doc-noteref"&gt;4&lt;/a&gt;&lt;/sup&gt;, in either the
limitations of my education or my analysis of my success, but I am obviously
not going to let that stop me from drawing some general conclusions.&lt;/p&gt;
&lt;h3 id="hiring-data-scientists"&gt;Hiring Data Scientists
&lt;/h3&gt;&lt;p&gt;For those hiring data scientists, recognize that mathematics as taught might
not be the same mathematics you need from your team. Plenty of people with
PhDs in mathematics would be unable to define linear regression or Bloom
filters. At the same time, recognize that math majors are taught to think well
and solve hard problems; these skills shouldn&amp;rsquo;t be undervalued. Math majors
are also experienced in reading and learning math! They may be able to read
academic papers and understand difficult (even if new) mathematical ideas more
quickly than a computer scientist or social scientist. Given enough practice
and training, they would probably be excellent programmers.&lt;/p&gt;
&lt;h3 id="studying-math"&gt;Studying Math
&lt;/h3&gt;&lt;p&gt;For those studying math, recognize that the field you love, in its formal
sense, may be keeping you away from enjoyable and lucrative careers. Most of
your math professors have spent their adult lives solving math problems on
paper or on a chalkboard. They are inexperienced and, possibly,
unknowledgeable about &lt;em&gt;very applied mathematics&lt;/em&gt;. A successful career in pure
mathematics will be very hard and will require you to be very good. While
there seem to be lots of jobs in teaching, they will rarely pay well.&lt;/p&gt;
&lt;p&gt;If you&amp;rsquo;re still a student, you have a great opportunity to take control of
your career path. Consider taking computer science classes (e.g. data
structures, algorithms, software engineering, machine learning) and statistics
classes (e.g. experimental design, data analysis, data mining).&lt;/p&gt;
&lt;p&gt;For both students and graduates, recognize that your math knowledge becomes very
marketable when combined with skills such as programming and machine learning;
there is a wealth of good books, MOOCs, and blog posts that can help you
learn these things. Moreover, the barrier to entry for getting started with
production-quality tools has never been lower. Don&amp;rsquo;t let your coursework be
the extent of your education. There is so much more to learn!&lt;sup id="fnref:5"&gt;&lt;a href="#fn:5" class="footnote-ref" role="doc-noteref"&gt;5&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Update (Oct. 2017)&lt;/strong&gt; : &lt;a class="link" href="https://www.youtube.com/watch?v=Q_UrZ4pmKmI" target="_blank" rel="noopener"
&gt;I gave a talk based on this
post&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Update (Mar. 2018)&lt;/strong&gt; : I get a lot of emails with questions about this post,
so I wrote an &lt;a class="link" href="https://tdhopper.com/blog/faq/" &gt;FAQ&lt;/a&gt; post trying to answer some of
them.&lt;/p&gt;
&lt;div class="footnotes" role="doc-endnotes"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:1"&gt;
&lt;p&gt;At the same time, my academic training in operations research failed, in some respects, to prepare me for a successful career in operations research. For example, practical math modeling was not sufficiently emphasized, and the skills of computer programming and software development were undervalued.&amp;#160;&lt;a href="#fnref:1" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:2"&gt;
&lt;p&gt;Of course, I have plenty of data science skills left to learn. My knowledge of experimental design is still pretty fuzzy. I still struggle with effective mathematical modeling. I haven&amp;rsquo;t deployed a large scale machine learning system to production. I suck at software logging. I have no idea how deep learning works.&amp;#160;&lt;a href="#fnref:2" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:3"&gt;
&lt;p&gt;I have successfully answered more than one interview question by regurgitating knowledge gleaned from tweets.&amp;#160;&lt;a href="#fnref:3" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:4"&gt;
&lt;p&gt;Among other reasons, I didn&amp;rsquo;t really plan to get where I am today. I changed majors no fewer than three times in college (physics, CS, and math) and essentially dropped out of two PhD programs!&amp;#160;&lt;a href="#fnref:4" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:5"&gt;
&lt;p&gt;For example, install &lt;a class="link" href="https://store.continuum.io/cshop/anaconda/" target="_blank" rel="noopener"
&gt;Anaconda&lt;/a&gt; and start playing with some of &lt;a class="link" href="https://github.com/ipython/ipython/wiki/A-gallery-of-interesting-IPython-Notebooks" target="_blank" rel="noopener"
&gt;these IPython notebooks&lt;/a&gt;.&amp;#160;&lt;a href="#fnref:5" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</description></item><item><title>Links for March 23, 2015</title><link>https://tdhopper.com/blog/sundry-links-for-march-23-2015/</link><pubDate>Mon, 23 Mar 2015 00:00:00 +0000</pubDate><guid>https://tdhopper.com/blog/sundry-links-for-march-23-2015/</guid><description>&lt;p&gt;&lt;a class="link" href="https://aws.amazon.com/blogs/aws/start-using-landsat-on-aws/" target="_blank" rel="noopener"
&gt;Start Using Landsat on AWS&lt;/a&gt;: &amp;ldquo;The Landsat program has been running since 1972 and is the longest ongoing project to collect such imagery. Landsat 8 is the newest Landsat satellite and it gathers data based on visible, infrared, near-infrared, and thermal-infrared light. … You can now access over 85,000 Landsat 8 scenes&amp;rdquo; on AWS.&lt;/p&gt;
&lt;p&gt;&lt;a class="link" href="http://www.lurklurk.org/linkers/linkers.html" target="_blank" rel="noopener"
&gt;Beginner&amp;rsquo;s Guide to Linkers&lt;/a&gt;: I’m getting back into doing a little C++ programming. Having spent the last 5 years in scripting languages, this was a helpful refresher on compilation.&lt;/p&gt;
&lt;p&gt;&lt;a class="link" href="http://www.labnol.org/internet/auto-forward-gmail-messages/20665/" target="_blank" rel="noopener"
&gt;How to Auto-Forward your Gmail Messages in Bulk&lt;/a&gt;: Use Google App Scripts to autoforward emails by simply adding a label. I use this to add things to my &lt;a class="link" href="https://support.omnigroup.com/omnifocus-mail-drop" target="_blank" rel="noopener"
&gt;Omnifocus task list&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;a class="link" href="http://math.stackexchange.com/questions/2949/which-one-result-in-mathematics-has-surprised-you-the-most" target="_blank" rel="noopener"
&gt;Which one result in mathematics has surprised you the most?&lt;/a&gt;: On Mathematics Stack Exchange. It might have been &lt;a class="link" href="http://en.wikipedia.org/wiki/Huffman_coding" target="_blank" rel="noopener"
&gt;Huffman Coding&lt;/a&gt; for me.&lt;/p&gt;
&lt;p&gt;&lt;a class="link" href="https://www.youtube.com/watch?v=3OHNYIg6N5Y&amp;amp;feature=youtube_gdata_player" target="_blank" rel="noopener"
&gt;Ruby Midwest 2013 The Most Important Optimization: Happiness&lt;/a&gt;: Ernie Miller explains why he doesn’t let his career trump his happiness.&lt;/p&gt;
&lt;p&gt;&lt;a class="link" href="http://tonyfischetti.github.io/sake/" target="_blank" rel="noopener"
&gt;Sake by tonyfischetti&lt;/a&gt;: Something of a modern GNU Make: &amp;ldquo;Sake is a way to easily design, share, build, and visualize workflows with intricate interdependencies. Sake is self-documenting because the instructions for building a project also serve as the documentation of the project&amp;rsquo;s workflow.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;&lt;a class="link" href="https://github.com/n1k0/SublimeHighlight" target="_blank" rel="noopener"
&gt;n1k0/SublimeHighlight&lt;/a&gt;: &amp;ldquo;An humble SublimeText package for exporting highlighted code as RTF or HTML.&amp;rdquo;&lt;/p&gt;</description></item><item><title>Concatenate MP3 Files</title><link>https://tdhopper.com/blog/concatenate-mp3-files/</link><pubDate>Sun, 15 Mar 2015 00:00:00 +0000</pubDate><guid>https://tdhopper.com/blog/concatenate-mp3-files/</guid><description>&lt;img src="https://tdhopper.com/images/til.png" alt="Featured image of post Concatenate MP3 Files" /&gt;&lt;blockquote class="twitter-tweet" data-lang="en"&gt;&lt;p lang="en" dir="ltr"&gt;TIL `cat` will correctly join MP3 files &lt;a href="http://t.co/mmc4NN2Kmp"&gt;http://t.co/mmc4NN2Kmp&lt;/a&gt;&lt;/p&gt;&amp;mdash; Tim Hopper (@tdhopper) &lt;a href="https://twitter.com/tdhopper/status/577149891077619713"&gt;March 15, 2015&lt;/a&gt;&lt;/blockquote&gt;
&lt;script async src="//platform.twitter.com/widgets.js" charset="utf-8"&gt;&lt;/script&gt;
&lt;p&gt;From Stackoverflow:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;An MP3 file is nothing more than the raw MPEG2-Layer 3 (audio) stream data, there is no file level header structure with, for example, duration, original source, encoding info.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Thus&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;cat file1.mp3 file2.mp3 &amp;gt; out.mp3
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;can join MP3 files.&lt;/p&gt;</description></item><item><title>Links for March 13, 2015</title><link>https://tdhopper.com/blog/sundry-links-for-march-13-2015/</link><pubDate>Fri, 13 Mar 2015 00:00:00 +0000</pubDate><guid>https://tdhopper.com/blog/sundry-links-for-march-13-2015/</guid><description>&lt;p&gt;&lt;a class="link" href="http://stackoverflow.com/questions/21360361/how-to-dynamically-update-a-plot-in-a-loop-in-ipython-notebook-within-one-cell" target="_blank" rel="noopener"
&gt;Dynamically Update a Plot in IPython&lt;/a&gt;: One thing I miss about Mathematica is &lt;a class="link" href="http://reference.wolfram.com/language/ref/Animate.html" target="_blank" rel="noopener"
&gt;Animate&lt;/a&gt; and &lt;a class="link" href="http://reference.wolfram.com/language/ref/Manipulate.html" target="_blank" rel="noopener"
&gt;Manipulate&lt;/a&gt;. IPython is slowly getting similar functionality. Here’s how to dynamically update a plot.&lt;/p&gt;
&lt;p&gt;&lt;a class="link" href="https://github.com/jiahao/ipython-profile" target="_blank" rel="noopener"
&gt;Jiahao&amp;rsquo;s IPython Notebook customizations&lt;/a&gt;: Drop this CSS file on your machine, and suddenly your IPython notebooks look quite beautiful!&lt;/p&gt;
&lt;p&gt;&lt;a class="link" href="http://www.duetdisplay.com/" target="_blank" rel="noopener"
&gt;Duet Display&lt;/a&gt;: I tried Air Display a few years ago, and it wasn’t worth the hassle. But Duet Display is a fantastic way to turn your iPad into an external display.&lt;/p&gt;
&lt;p&gt;&lt;a class="link" href="https://github.com/jbmouret/matplotlib_for_papers" target="_blank" rel="noopener"
&gt;Creating publication-quality figures with Matplotlib&lt;/a&gt;: Plotting in Python frustrates me to no end. But here’s a nice tutorial on creating nice figures with Matplotlib.&lt;/p&gt;
&lt;p&gt;&lt;a class="link" href="https://pypi.python.org/pypi/retrying" target="_blank" rel="noopener"
&gt;retrying 1.3.3 : Python Package Index&lt;/a&gt;: Python decorators &amp;ldquo;to simplify the task of adding retry behavior to just about anything.&amp;rdquo; These work like a charm!&lt;/p&gt;</description></item><item><title>Introduction to PySpark</title><link>https://tdhopper.com/talks/intro-to-pyspark/</link><pubDate>Sat, 28 Feb 2015 00:00:00 +0000</pubDate><guid>https://tdhopper.com/talks/intro-to-pyspark/</guid><description>&lt;img src="https://tdhopper.com/preso.png" alt="Featured image of post Introduction to PySpark" /&gt;&lt;p&gt;I gave a talk at the Research Triangle Analysts meetup about Pyspark. It wasn&amp;rsquo;t recorded, but you can &lt;a class="link" href="http://nbviewer.jupyter.org/format/slides/github/tdhopper/rta-pyspark-presentation/blob/master/slides.ipynb#/" target="_blank" rel="noopener"
&gt;see the IPython notebook I presented from&lt;/a&gt;.&lt;/p&gt;</description></item><item><title>Links for January 19, 2015</title><link>https://tdhopper.com/blog/sundry-links-for-january-19-2015/</link><pubDate>Mon, 19 Jan 2015 00:00:00 +0000</pubDate><guid>https://tdhopper.com/blog/sundry-links-for-january-19-2015/</guid><description>&lt;p&gt;&lt;a class="link" href="http://www.mattblodgett.com/2015/01/but-where-do-people-work-in-this-office.html" target="_blank" rel="noopener"
&gt;Matt Blodgett: But Where Do People Work in This Office?&lt;/a&gt;: &amp;ldquo;After looking through tons of cool office photos of many of the hottest companies in the Valley, I started to play a fun game I made up called &amp;lsquo;spot the desks’. I’ll show you what I mean.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;&lt;a class="link" href="http://blog.stackoverflow.com/2015/01/why-we-still-believe-in-private-offices/" target="_blank" rel="noopener"
&gt;Why We (Still) Believe in Private Offices&lt;/a&gt;: Joel Spolsky and Fog Creek Software have been relentless defenders of quiet, private offices for developers. They continue that defense here.&lt;/p&gt;
&lt;p&gt;&lt;a class="link" href="http://www.alfredforum.com/topic/4686-pandoctor-an-alfred-gui-for-pandoc/" target="_blank" rel="noopener"
&gt;Pandoctor: An Alfred GUI for Pandoc&lt;/a&gt;: If you use &lt;a class="link" href="http://johnmacfarlane.net/pandoc/" target="_blank" rel="noopener"
&gt;Pandoc&lt;/a&gt; and &lt;a class="link" href="http://alfredapp.com/" target="_blank" rel="noopener"
&gt;Alfred&lt;/a&gt;, this is worth trying.&lt;/p&gt;
&lt;p&gt;&lt;a class="link" href="https://github.com/tdhopper/alfred-hop" target="_blank" rel="noopener"
&gt;Alfred Hop&lt;/a&gt;: I use a little bash tool called &lt;a class="link" href="https://github.com/Cue/hop" target="_blank" rel="noopener"
&gt;Hop&lt;/a&gt; to bookmark frequently used directories. I made this tool to give me quick access to my bookmarks from Alfred.&lt;/p&gt;
&lt;p&gt;&lt;a class="link" href="https://github.com/realpython/discover-flask" target="_blank" rel="noopener"
&gt;Discover Flask&lt;/a&gt;: Flask, the lightweight Python framework, is a joy to use. Here’s a nice introduction to it.&lt;/p&gt;
&lt;p&gt;&lt;a class="link" href="http://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html" target="_blank" rel="noopener"
&gt;Introduction to dplyr&lt;/a&gt;: I haven’t used R much since leaving my last job, but the ecosystem has been booming with great tools; dplyr is one of them.&lt;/p&gt;</description></item><item><title>Links for December 29, 2014</title><link>https://tdhopper.com/blog/sundry-links-for-december-29-2014/</link><pubDate>Mon, 29 Dec 2014 00:00:00 +0000</pubDate><guid>https://tdhopper.com/blog/sundry-links-for-december-29-2014/</guid><description>&lt;p&gt;&lt;a class="link" href="https://www.youtube.com/watch?v=gurZp2JoD-s" target="_blank" rel="noopener"
&gt;Sublime: Nice Features &amp;amp; Plugins&lt;/a&gt;: A brief talk introducing my favorite editor.&lt;/p&gt;
&lt;p&gt;&lt;a class="link" href="https://github.com/spamwax/alfred-pinboard" target="_blank" rel="noopener"
&gt;Alfred Workflow for Pinboard&lt;/a&gt;: I&amp;rsquo;ve started using &lt;a class="link" href="http://pinboard.in/" target="_blank" rel="noopener"
&gt;Pinboard&lt;/a&gt; a bit for organizing links. Here&amp;rsquo;s something that has the chance of getting me much deeper into pinboard: a powerful Alfred Workflow for interacting with Pinboard from your Mac&amp;rsquo;s keyboard. HT &lt;a class="link" href="http://www.bachyaproductions.com/link-mash-december-23-2014/" target="_blank" rel="noopener"
&gt;Aaron Bachya&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a class="link" href="https://github.com/alvarotrigo/fullPage.js" target="_blank" rel="noopener"
&gt;fullPage.js&lt;/a&gt;: I&amp;rsquo;ve been using this jquery plugin in a forthcoming project. It makes it really easy to create slide-like single page websites.&lt;/p&gt;
&lt;p&gt;&lt;a class="link" href="http://benkurtovic.com/2014/06/01/obfuscating-hello-world.html" target="_blank" rel="noopener"
&gt;Obfuscating &amp;ldquo;Hello world!&amp;rdquo;&lt;/a&gt;: The author attempts to write the worst &amp;lsquo;hello world&amp;rsquo; possible in Python. He does a good job.&lt;/p&gt;
&lt;p&gt;&lt;a class="link" href="http://ryanmo.co/2014/05/10/show-time-in-multiple-time-zones-with-textexpander/" target="_blank" rel="noopener"
&gt;Show Time in Multiple Time Zones with TextExpander&lt;/a&gt;: As I spend more time working with people in different time zones, tools like this help remove the cognitive challenge of translating time.&lt;/p&gt;</description></item><item><title>Links for December 22, 2014</title><link>https://tdhopper.com/blog/sundry-links-for-december-22-2014/</link><pubDate>Mon, 22 Dec 2014 00:00:00 +0000</pubDate><guid>https://tdhopper.com/blog/sundry-links-for-december-22-2014/</guid><description>&lt;p&gt;&lt;a class="link" href="http://unix4lyfe.org/time/?v=1" target="_blank" rel="noopener"
&gt;Time&lt;/a&gt;: Programmers all hate time, timezones, etc. Here are some helpful &amp;ldquo;notes about time&amp;rdquo;.&lt;/p&gt;
&lt;p&gt;&lt;a class="link" href="http://strftime.org/" target="_blank" rel="noopener"
&gt;Python strftime reference&lt;/a&gt;: Speaking of time: &amp;ldquo;A quick reference for Python&amp;rsquo;s strftime formatting directives.&amp;rdquo; I have to look this stuff up each time I need it.&lt;/p&gt;
&lt;p&gt;&lt;a class="link" href="https://github.com/earwig/git-repo-updater" target="_blank" rel="noopener"
&gt;gitup&lt;/a&gt;: &amp;ldquo;A console script that allows you to easily update multiple git repositories at once&amp;rdquo;&lt;/p&gt;
&lt;p&gt;&lt;a class="link" href="http://www.slate.com/articles/podcasts/working/2014/12/google_software_engineer_nina_kang_talks_about_the_differences_between_writing.html" target="_blank" rel="noopener"
&gt;The “How Does a Google Coder Work?” Edition
&lt;/a&gt;: I enjoyed this interview. My favorite quote? &amp;ldquo;When you&amp;rsquo;re reading code is it as clear as reading English?&amp;rdquo; &amp;ldquo;If I&amp;rsquo;m reading C++ code, it&amp;rsquo;s clearer.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;&lt;a class="link" href="http://www-stat.wharton.upenn.edu/~steele/HoldingPen/SunsetSalvo.pdf" target="_blank" rel="noopener"
&gt;Sunset Salvo&lt;/a&gt;: John Tukey discusses practical data analysis and statistical humility.&lt;/p&gt;
&lt;p&gt;&lt;a class="link" href="https://stat.duke.edu/bnp10/index.html%3Fpage_id=1.html" target="_blank" rel="noopener"
&gt;10th Conference on Bayesian Nonparametrics&lt;/a&gt;: This is coming up in my own back yard. I’m excited!&lt;/p&gt;
&lt;p&gt;&lt;a class="link" href="http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/43146.pdf" target="_blank" rel="noopener"
&gt;Machine Learning: The High-Interest Credit Card of Technical Debt&lt;/a&gt;: I haven’t read this in detail, but the premise makes tons of sense to me: &amp;ldquo;It is remarkably easy to incur massive ongoing maintenance costs at the system level when applying machine learning.&amp;rdquo;&lt;/p&gt;</description></item><item><title>Should I get a Ph.D.?</title><link>https://tdhopper.com/phd/</link><pubDate>Mon, 08 Dec 2014 00:00:00 +0000</pubDate><guid>https://tdhopper.com/phd/</guid><description>&lt;img src="https://tdhopper.com/images/crossroads2.png" alt="Featured image of post Should I get a Ph.D.?" /&gt;&lt;p&gt;Last year, I published nine interviews with Internet friends about how an academically-minded, 22-year-old college senior should work on a Ph.D. Many people have told me the interviews have been helpful for them or that they&amp;rsquo;ve emailed them to others.&lt;/p&gt;
&lt;p&gt;I hope this continues to be a valuable resource. I&amp;rsquo;d encourage you to share this with anyone you know who is thinking through this question.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://s3.us-east-1.amazonaws.com/tdhopper.com/shouldigetaphd.pdf" target="_blank" rel="noopener"
&gt;Download PDF version&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://s3.us-east-1.amazonaws.com/tdhopper.com/shouldigetaphd.epub" target="_blank" rel="noopener"
&gt;Download ePUB version&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="excerpts"&gt;Excerpts
&lt;/h2&gt;&lt;h3 id="dr-john-d-cookfreelance-consultant-and-blogger"&gt;DR. JOHN D. COOK, FREELANCE CONSULTANT AND BLOGGER
&lt;/h3&gt;&lt;blockquote&gt;
&lt;p&gt;There are basically two reasons to get a Ph.D.: personal satisfaction, and credentials for a job requiring a Ph.D.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;If a professor has never worked outside of academia, I’d be skeptical of anything he or she says about “the real world.”&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3 id="dr-paul-rubinprofessor-emeritus"&gt;DR. PAUL RUBIN, PROFESSOR EMERITUS
&lt;/h3&gt;&lt;blockquote&gt;
&lt;p&gt;If you do not enjoy doing research, pursuing a Ph.D. will be difficult, unfulfilling and possibly pointless (since you will not want a job with research expectations).&lt;/p&gt;
&lt;/blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;Having both a masters and doctorate in mathematics is no better than having just a doctorate.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3 id="dr-eric-jonaspostdoc-and-serial-entrepreneur"&gt;DR. ERIC JONAS, POSTDOC AND SERIAL ENTREPRENEUR
&lt;/h3&gt;&lt;blockquote&gt;
&lt;p&gt;My best advice to an undergraduate curious about the “experience” of graduate school is: work in a lab while you’re an undergrad.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;A lot of people think of a PhD as being like an undergraduate degree in that you’ve “learned a lot of material”. This is false.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3 id="mr-carl-vogeldata-scientist"&gt;MR. CARL VOGEL, DATA SCIENTIST
&lt;/h3&gt;&lt;blockquote&gt;
&lt;p&gt;The world is full of miserable grad students.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;Successful grad students aren’t like normal humans.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3 id="dr-melissa-santosdata-scientist"&gt;DR. MELISSA SANTOS, DATA SCIENTIST
&lt;/h3&gt;&lt;blockquote&gt;
&lt;p&gt;To some extent, the process of getting the Ph.D. helped me have the mindset of putting together methods and being creative in my approach to problems that I’m not sure I would have with just the masters degrees.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;The only reason you HAVE to do a Ph.D. is to become a professor. &lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3 id="dr-paul-harperprofessor-of-operations-research"&gt;DR. PAUL HARPER, PROFESSOR OF OPERATIONS RESEARCH
&lt;/h3&gt;&lt;blockquote&gt;
&lt;p&gt;A very common misconception is that applicants can simply pick a supervisor of their choice, but this requires mutual consent.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;Perhaps the best way to summarise the life of a Ph.D. student is to look at the awesome Ph.D. comics (&lt;a class="link" href="http://phdcomics.com/" target="_blank" rel="noopener"
&gt;phdcomics.com&lt;/a&gt;) by Jorge Cham, which are spot on&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3 id="dr-laura-mclayprofessor-of-operations-research"&gt;DR. LAURA MCLAY, PROFESSOR OF OPERATIONS RESEARCH
&lt;/h3&gt;&lt;blockquote&gt;
&lt;p&gt;It’s important to think about how a Ph.D. fits in with other life decisions. I definitely felt like it would be hard to go back to graduate school if I started another career&lt;/p&gt;
&lt;/blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;Let me be clear: a Ph.D. is not a Masters degree plus a little more coursework and a small project. &lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3 id="mr-mike-nuterecovering-actuary-and-phd-student"&gt;MR. MIKE NUTE, RECOVERING ACTUARY AND PH.D. STUDENT
&lt;/h3&gt;&lt;blockquote&gt;
&lt;p&gt;More specifically, if you don’t think that getting the Ph.D. is going to be fun on its own, then there’s a strong chance you’ll be miserable and it will end badly.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;This reinforces the last point above: the only real reason to do a Ph.D. program is for love of the subject.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3 id="dr-oscar-boykinsoftware-engineer"&gt;DR. OSCAR BOYKIN, SOFTWARE ENGINEER
&lt;/h3&gt;&lt;blockquote&gt;
&lt;p&gt;The number one question: does he or she have a burning desire to do a PhD?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;As a professor, you are running a startup that can never be profitable: you are always raising money and hiring.&lt;/p&gt;
&lt;/blockquote&gt;</description></item><item><title>Links for December 6, 2014</title><link>https://tdhopper.com/blog/sundry-links-for-december-06-2014/</link><pubDate>Sat, 06 Dec 2014 00:00:00 +0000</pubDate><guid>https://tdhopper.com/blog/sundry-links-for-december-06-2014/</guid><description>&lt;p&gt;&lt;a class="link" href="http://researcher.watson.ibm.com/researcher/files/us-dpwoodru/journal.pdf" target="_blank" rel="noopener"
&gt;Sketching as a Tool for Numerical Linear Algebra&lt;/a&gt;: A neat paper on sketching algorithms for linear algebra. No, not that kind of sketching. &amp;ldquo;One first compresses it to a much smaller matrix by multiplying it by a (usually) random matrix with certain properties. Much of the expensive computation can then be performed on the smaller matrix, thereby accelerating the solution for the original problem.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;&lt;a class="link" href="https://www.coursera.org/course/maps" target="_blank" rel="noopener"
&gt;Maps and the Geospatial Revolution&lt;/a&gt;: Coursera is teaching a class in the spring on how geospatial technology has changed our world.&lt;/p&gt;
&lt;p&gt;&lt;a class="link" href="http://www.gis.usu.edu/~chrisg/python/2009/" target="_blank" rel="noopener"
&gt;Geoprocessing with Python using Open Source GIS&lt;/a&gt;: Speaking of geospatial technology, here are some slides and problems from a class on &amp;ldquo;Geoprocessing with Python&amp;rdquo;.&lt;/p&gt;
&lt;p&gt;&lt;a class="link" href="http://blog.pluralsight.com/how-to-use-bash-command-line-history" target="_blank" rel="noopener"
&gt;How to use the bash command line history&lt;/a&gt;: Bash&amp;rsquo;s history can do more than I realized!&lt;/p&gt;
&lt;p&gt;&lt;a class="link" href="http://www.visiondummy.com/2014/04/geometric-interpretation-covariance-matrix/" target="_blank" rel="noopener"
&gt;A geometric interpretation of the covariance matrix&lt;/a&gt;: Superb little post explaining covariance matrices with pictures and geometry.&lt;/p&gt;
&gt;How do I draw a pair of buttocks?&lt;/a&gt;: Have you ever wondered how to plot a pair of buttocks in Mathematica? Of course you have.&lt;/p&gt;
&lt;p&gt;&lt;a class="link" href="http://arxiv.org/abs/1411.5018" target="_blank" rel="noopener"
&gt;Frequentism and Bayesianism: A Python-driven Primer&lt;/a&gt;: Jake Vanderplas wrote a &amp;ldquo;brief, semi-technical comparison&amp;rdquo; of frequentist and Bayesian statistical inference using examples in Python.&lt;/p&gt;
&lt;p&gt;&lt;a class="link" href="https://github.com/EducationalTestingService/skll" target="_blank" rel="noopener"
&gt;skll&lt;/a&gt;: Dan Blanchard released version 1.0 of his very cool command line tool for doing experiments with scikit-learn.&lt;/p&gt;
&lt;p&gt;&lt;a class="link" href="https://codeascraft.com/2014/11/17/personalized-recommendations-at-etsy/" target="_blank" rel="noopener"
&gt;Personalized Recommendations at Etsy&lt;/a&gt;: A fantastic post from Etsy&amp;rsquo;s engineering blog on building scalable, personalized recommendations using linear algebra and locality-sensitive hashing. I like math.&lt;/p&gt;
&lt;p&gt;&lt;a class="link" href="http://www.pixelmonkey.org/2014/11/02/clojonic" target="_blank" rel="noopener"
&gt;Pythonic Clojure&lt;/a&gt;: Andrew Montalenti wrote a post analyzing Clojure from a Python programmer&amp;rsquo;s perspective. It&amp;rsquo;s great.&lt;/p&gt;</description></item><item><title>Links for November 24, 2014</title><link>https://tdhopper.com/blog/sundry-links-for-november-24-2014/</link><pubDate>Mon, 24 Nov 2014 00:00:00 +0000</pubDate><guid>https://tdhopper.com/blog/sundry-links-for-november-24-2014/</guid><description>&lt;p&gt;&lt;a class="link" href="https://github.com/stripe/brushfire" target="_blank" rel="noopener"
&gt;brushfire&lt;/a&gt;: Avi Bryant has been building Brushfire, &amp;lsquo;a framework for distributed supervised learning of decision tree ensemble models in Scala.&amp;rsquo; Fun stuff!&lt;/p&gt;
&lt;p&gt;&lt;a class="link" href="http://stackoverflow.com/questions/500607/what-are-the-lesser-known-but-useful-data-structures" target="_blank" rel="noopener"
&gt;What are the lesser known but useful data structures?&lt;/a&gt;: I always enjoy StackOverflow questions like this, &lt;strong&gt;though it&amp;rsquo;s not considered a good, on-topic question for the site&lt;/strong&gt;, of course.&lt;/p&gt;
&lt;p&gt;&lt;a class="link" href="https://github.com/vhf/free-programming-books/blob/master/free-programming-books.md" target="_blank" rel="noopener"
&gt;Free Programming Books&lt;/a&gt;: A huge, crowd-sourced list of free programming books by language and topic.&lt;/p&gt;
&lt;p&gt;&lt;a class="link" href="https://www.ml.cmu.edu/research/phd-dissertations.html" target="_blank" rel="noopener"
&gt;PhD Dissertations-Machine Learning Department&lt;/a&gt;: Seven years of ML PhD dissertations from Carnegie Mellon University. I wish I had time to read &lt;a class="link" href="http://reports-archive.adm.cs.cmu.edu/anon/cald/CMU-CALD-05-107.pdf" target="_blank" rel="noopener"
&gt;Tools for Large Graph Mining&lt;/a&gt;.&lt;/p&gt;</description></item><item><title>Links for November 17, 2014</title><link>https://tdhopper.com/blog/sundry-links-for-november-17-2014/</link><pubDate>Mon, 17 Nov 2014 00:00:00 +0000</pubDate><guid>https://tdhopper.com/blog/sundry-links-for-november-17-2014/</guid><description>&lt;p&gt;&lt;a class="link" href="https://www.hackerschool.com/blog/14-there-is-no-magic-virtualenv-edition" target="_blank" rel="noopener"
&gt;There&amp;rsquo;s no magic: virtualenv edition&lt;/a&gt;: I didn&amp;rsquo;t really get virtualenvs until long after I started programming Python, though they&amp;rsquo;re now an essential part of my toolkit. This is a great post explaining how they work.&lt;/p&gt;
&lt;p&gt;&lt;a class="link" href="http://python-notes.curiousefficiency.org/en/latest/python_concepts/import_traps.html" target="_blank" rel="noopener"
&gt;Traps for the Unwary in Python’s Import System&lt;/a&gt;: &amp;ldquo;Python’s import system is powerful, but also quite complicated.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;&lt;a class="link" href="https://github.com/Psycojoker/pyfmt" target="_blank" rel="noopener"
&gt;pyfmt&lt;/a&gt;: I recently learned about &lt;a class="link" href="https://golang.org/cmd/gofmt/" target="_blank" rel="noopener"
&gt;gofmt&lt;/a&gt; for auto-formatting Go code. Here&amp;rsquo;s a similar tool for Python.&lt;/p&gt;
&lt;p&gt;&lt;a class="link" href="https://groups.google.com/forum/#!msg/comp.lang.java/aSPAJO05LIU/ushhUIQQ-ogJ" target="_blank" rel="noopener"
&gt;Q: Setting User-Agent Field?&lt;/a&gt;: A 1996 question in comp.lang.java on how to set the user agent field for a Java crawler. The signature on the question? &lt;em&gt;Thanks, Larry Page&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;&lt;a class="link" href="https://github.com/alecthomas/importmagic" target="_blank" rel="noopener"
&gt;alecthomas/importmagic&lt;/a&gt;: Python tool and Sublime extension for automatically adding imports.&lt;/p&gt;</description></item><item><title>Links for November 12, 2014</title><link>https://tdhopper.com/blog/sundry-links-for-november-12-2014/</link><pubDate>Wed, 12 Nov 2014 00:00:00 +0000</pubDate><guid>https://tdhopper.com/blog/sundry-links-for-november-12-2014/</guid><description>&lt;p&gt;&lt;a class="link" href="http://amazonpickingchallenge.org/" target="_blank" rel="noopener"
&gt;Amazon Picking Challenge&lt;/a&gt;: Kiva Systems (where I interned in 2011) is setting up a robotics challenge for picking items off warehouse shelves.&lt;/p&gt;
&lt;p&gt;&lt;a class="link" href="https://pypi.python.org/pypi/contexttimer/0.3.1" target="_blank" rel="noopener"
&gt;contexttimer 0.3.1&lt;/a&gt;: A handy Python context manager and decorator for timing things.&lt;/p&gt;
&lt;p&gt;&lt;a class="link" href="http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark/" target="_blank" rel="noopener"
&gt;How-to: Translate from MapReduce to Apache Spark&lt;/a&gt;: This is a helpful bit from Cloudera on moving algorithms from MapReduce to Spark.&lt;/p&gt;
&lt;p&gt;&lt;a class="link" href="https://pypi.python.org/pypi/combinatorics/1.4.3" target="_blank" rel="noopener"
&gt;combinatorics 1.4.3&lt;/a&gt;: Here&amp;rsquo;s a Python module adding some combinatorial functions to the language.&lt;/p&gt;
&lt;p&gt;&lt;a class="link" href="https://mail.python.org/pipermail/python-3000/2006-November/004643.html" target="_blank" rel="noopener"
&gt;Special methods and interface-based type system&lt;/a&gt;: Guido van Rossum explains (in 2006) why Python uses &lt;code&gt;len(x)&lt;/code&gt; instead of &lt;code&gt;x.len()&lt;/code&gt;.&lt;/p&gt;</description></item><item><title>Links for November 3, 2014</title><link>https://tdhopper.com/blog/sundry-links-for-november-03-2014/</link><pubDate>Mon, 03 Nov 2014 00:00:00 +0000</pubDate><guid>https://tdhopper.com/blog/sundry-links-for-november-03-2014/</guid><description>&lt;p&gt;&lt;a class="link" href="https://aws.amazon.com/datasets" target="_blank" rel="noopener"
&gt;Public Data Sets : Amazon Web Services&lt;/a&gt;: Amazon hosts a number of public datasets on AWS (including the Common Crawl corpus and the &amp;ldquo;Marvel Universe Social Graph&amp;rdquo;).&lt;/p&gt;
&lt;p&gt;&lt;a class="link" href="https://www.youtube.com/watch?v=muMRNYPmQxE" target="_blank" rel="noopener"
&gt;Rapid Web Prototyping with Lightweight Tools&lt;/a&gt;: I&amp;rsquo;ve shared this before, but my boss Andrew did a fantastic tutorial last year on Flask, Jinja2, MongoDB, and Twitter Bootstrap. Combined with Heroku, it&amp;rsquo;s surprisingly easy to get a website running these days.&lt;/p&gt;
&lt;p&gt;&lt;a class="link" href="https://github.com/wichert/rest_toolkit" target="_blank" rel="noopener"
&gt;rest_toolkit&lt;/a&gt;: REST has been my obsession of late. Here&amp;rsquo;s a little Python package for quickly writing RESTful APIs.&lt;/p&gt;
&lt;p&gt;&lt;a class="link" href="http://www.davar.net/PROGRAM/EXTRACTS/CRAFTJOY.HTM" target="_blank" rel="noopener"
&gt;The Joys of the Craft&lt;/a&gt;: A quote from Fred Brooks&amp;rsquo; &lt;em&gt;The Mythical Man-Month&lt;/em&gt; on why programming is fun.&lt;/p&gt;
&lt;p&gt;&lt;a class="link" href="http://unix.stackexchange.com/questions/77077/how-do-i-use-pushd-and-popd-commands" target="_blank" rel="noopener"
&gt;How do I use pushd and popd commands?&lt;/a&gt;: I recently learned Bash has pushd and popd commands for temporarily changing directories. This is very handy for scripting.&lt;/p&gt;</description></item><item><title>Links for November 1, 2014</title><link>https://tdhopper.com/blog/sundry-links-for-november-01-2014/</link><pubDate>Sat, 01 Nov 2014 00:00:00 +0000</pubDate><guid>https://tdhopper.com/blog/sundry-links-for-november-01-2014/</guid><description>&lt;p&gt;&lt;a class="link" href="http://www.joelonsoftware.com/articles/Unicode.html" target="_blank" rel="noopener"
&gt;The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)&lt;/a&gt;: I guess the title says it all. By Joel Spolsky.&lt;/p&gt;
&lt;p&gt;&lt;a class="link" href="http://hyperpolyglot.org/unix-shells" target="_blank" rel="noopener"
&gt;Unix Shells - Hyperpolyglot&lt;/a&gt;: Very cool comparison of basic command syntax in Bash, Fish, Ksh, Tcsh, and Zsh.&lt;/p&gt;
&lt;p&gt;&lt;a class="link" href="http://blog.sanctum.geek.nz/better-bash-history/" target="_blank" rel="noopener"
&gt;Better Bash history&lt;/a&gt;: I&amp;rsquo;m pretty stuck on Bash at the moment. Here&amp;rsquo;s a way to get a better history in Bash. (Other shells often improve on Bash&amp;rsquo;s history.)&lt;/p&gt;
&lt;p&gt;&lt;a class="link" href="http://usaddress.readthedocs.org/en/latest/" target="_blank" rel="noopener"
&gt;usaddress 0.1&lt;/a&gt;: I always love seeing a Python library for something I&amp;rsquo;ve tried to do poorly on my own: &amp;ldquo;usaddress is a python library for parsing unstructured address strings into address components, using advanced NLP methods.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;&lt;a class="link" href="https://pythonhosted.org/more-itertools/api.html" target="_blank" rel="noopener"
&gt;more-itertools&lt;/a&gt;: A great extension to the helpful &lt;a class="link" href="https://docs.python.org/2/library/itertools.html" target="_blank" rel="noopener"
&gt;itertools&lt;/a&gt; module in Python. Some particularly helpful functions: &lt;code&gt;chunked&lt;/code&gt;, &lt;code&gt;first&lt;/code&gt;, &lt;code&gt;peekable&lt;/code&gt;, and &lt;code&gt;take&lt;/code&gt;. Unfortunately, it doesn&amp;rsquo;t have Python 3 support at the moment.&lt;/p&gt;</description></item><item><title>Pyspark's AggregateByKey Method</title><link>https://tdhopper.com/blog/pysparks-aggregatebykey-method/</link><pubDate>Fri, 03 Oct 2014 00:00:00 +0000</pubDate><guid>https://tdhopper.com/blog/pysparks-aggregatebykey-method/</guid><description>&lt;p&gt;The &lt;a class="link" href="https://spark.apache.org/docs/2.2.0/rdd-programming-guide.html" target="_blank" rel="noopener"
&gt;pyspark documentation&lt;/a&gt; doesn&amp;rsquo;t include an example for the aggregateByKey RDD method. I didn&amp;rsquo;t find any nice examples online, so I wrote my own.&lt;/p&gt;
&lt;p&gt;Here&amp;rsquo;s what the documentation does say:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;aggregateByKey(self, zeroValue, seqFunc, combFunc, numPartitions=None)&lt;/code&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Aggregate the values of each key, using given combine functions and a neutral &amp;ldquo;zero value&amp;rdquo;. This function can return a different result type, U, than the type of the values in this RDD, V. Thus, we need one operation for merging a V into a U and one operation for merging two U&amp;rsquo;s. The former operation is used for merging values within a partition, and the latter is used for merging values between partitions. To avoid memory allocation, both of these functions are allowed to modify and return their first argument instead of creating a new U.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;code&gt;reduceByKey&lt;/code&gt; and &lt;code&gt;aggregateByKey&lt;/code&gt; are much more efficient than &lt;code&gt;groupByKey&lt;/code&gt; and should be used for aggregations as much as possible.&lt;/p&gt;
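&lt;p&gt;That contract can be sketched in plain Python (a rough local analogue of my own, not Spark itself): the sequence function folds each value into a per-partition accumulator starting from the zero value, and the combine function then merges the per-partition accumulators.&lt;/p&gt;

```python
from collections import defaultdict

def aggregate_by_key(partitions, zero, seq_func, comb_func):
    """Local sketch of aggregateByKey over explicit partitions."""
    # Step 1: within each partition, fold values into a per-key
    # accumulator, starting from the neutral zero value.
    per_partition = []
    for part in partitions:
        acc = defaultdict(lambda: zero)
        for key, value in part:
            acc[key] = seq_func(acc[key], value)
        per_partition.append(acc)
    # Step 2: merge the per-partition accumulators with the
    # combine function.
    merged = {}
    for acc in per_partition:
        for key, value in acc.items():
            merged[key] = comb_func(merged[key], value) if key in merged else value
    return merged

# Two "partitions" of (key, value) pairs:
parts = [[("a", 1), ("a", 2), ("b", 3)], [("a", 4), ("b", 5)]]
print(aggregate_by_key(parts, 0, lambda s, v: s + v, lambda a, b: a + b))
# {'a': 7, 'b': 8}
```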
&lt;p&gt;In the example below, I create an RDD that is a short list of characters. My functions will aggregate the values with concatenation. I added brackets to the two types of concatenation to help give you an idea of what &lt;code&gt;aggregateByKey&lt;/code&gt; is doing.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt; 1
&lt;/span&gt;&lt;span class="lnt"&gt; 2
&lt;/span&gt;&lt;span class="lnt"&gt; 3
&lt;/span&gt;&lt;span class="lnt"&gt; 4
&lt;/span&gt;&lt;span class="lnt"&gt; 5
&lt;/span&gt;&lt;span class="lnt"&gt; 6
&lt;/span&gt;&lt;span class="lnt"&gt; 7
&lt;/span&gt;&lt;span class="lnt"&gt; 8
&lt;/span&gt;&lt;span class="lnt"&gt; 9
&lt;/span&gt;&lt;span class="lnt"&gt;10
&lt;/span&gt;&lt;span class="lnt"&gt;11
&lt;/span&gt;&lt;span class="lnt"&gt;12
&lt;/span&gt;&lt;span class="lnt"&gt;13
&lt;/span&gt;&lt;span class="lnt"&gt;14
&lt;/span&gt;&lt;span class="lnt"&gt;15
&lt;/span&gt;&lt;span class="lnt"&gt;16
&lt;/span&gt;&lt;span class="lnt"&gt;17
&lt;/span&gt;&lt;span class="lnt"&gt;18
&lt;/span&gt;&lt;span class="lnt"&gt;19
&lt;/span&gt;&lt;span class="lnt"&gt;20
&lt;/span&gt;&lt;span class="lnt"&gt;21
&lt;/span&gt;&lt;span class="lnt"&gt;22
&lt;/span&gt;&lt;span class="lnt"&gt;23
&lt;/span&gt;&lt;span class="lnt"&gt;24
&lt;/span&gt;&lt;span class="lnt"&gt;25
&lt;/span&gt;&lt;span class="lnt"&gt;26
&lt;/span&gt;&lt;span class="lnt"&gt;27
&lt;/span&gt;&lt;span class="lnt"&gt;28
&lt;/span&gt;&lt;span class="lnt"&gt;29
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-fallback" data-lang="fallback"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;Welcome to
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; ____ __
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; / __/__ ___ _____/ /__
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; _\ \/ _ \/ _ `/ __/ &amp;#39;_/
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; /__ / .__/\_,_/_/ /_/\_\ version 1.1.0
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; /_/
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;Using Python version 2.7.5 (default, Mar 9 2014 22:15:05)
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;SparkContext available as sc.
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;In [1]: # Create rdd that is a list of characters
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;In [2]: sc.parallelize(list(&amp;#34;aaaaabbbbcccdd&amp;#34;)) \
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; ...: .map(lambda letter: (letter, {&amp;#34;value&amp;#34;: letter})) \
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; ...: .aggregateByKey(
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; ...: # Value to start aggregation (passed as s to `lambda s, d`)
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; ...: &amp;#34;start&amp;#34;,
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; ...: # Function to join final data type (string) and rdd data type
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; ...: lambda s, d: &amp;#34;[ %s %s ]&amp;#34; % (s, d[&amp;#34;value&amp;#34;]),
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; ...: # Function to join two final data types.
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; ...: lambda s1, s2: &amp;#34;{ %s %s }&amp;#34; % (s1, s2),
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; ...: ) \
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; ...: .collect()
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;Out[2]:
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;[(&amp;#39;a&amp;#39;, &amp;#39;{ { [ start a ] [ [ start a ] a ] } [ [ start a ] a ] }&amp;#39;),
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; (&amp;#39;c&amp;#39;, &amp;#39;{ [ start c ] [ [ start c ] c ] }&amp;#39;),
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; (&amp;#39;b&amp;#39;, &amp;#39;{ { [ [ start b ] b ] [ start b ] } [ start b ] }&amp;#39;),
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; (&amp;#39;d&amp;#39;, &amp;#39;[ [ start d ] d ]&amp;#39;)]
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;</description></item><item><title>Links for September 30, 2014</title><link>https://tdhopper.com/blog/sundry-links-for-september-30-2014/</link><pubDate>Tue, 30 Sep 2014 00:00:00 +0000</pubDate><guid>https://tdhopper.com/blog/sundry-links-for-september-30-2014/</guid><description>&lt;p&gt;&lt;a class="link" href="https://github.com/kadirpekel/hammock/blob/master/README.rst" target="_blank" rel="noopener"
&gt;Hammock&lt;/a&gt;: A lightweight wrapper around the Python requests module to convert REST APIs into &amp;ldquo;dead simple programmatic APIs&amp;rdquo;. It&amp;rsquo;s a clever idea. I&amp;rsquo;ll have to play around with it before I can come up with a firm opinion.&lt;/p&gt;
&lt;p&gt;&lt;a class="link" href="https://github.com/mitsuhiko/pipsi#readme" target="_blank" rel="noopener"
&gt;pipsi&lt;/a&gt;: Pipsi wraps pip and virtualenv to allow you to install Python command line utilities without polluting your global environment.&lt;/p&gt;
&lt;p&gt;&lt;a class="link" href="http://nvie.com/posts/writing-a-cli-in-python-in-under-60-seconds/" target="_blank" rel="noopener"
&gt;Writing a Command-Line Tool in Python&lt;/a&gt;: Speaking of Python command line utilities, here&amp;rsquo;s a little post from Vincent Driessen on writing them.&lt;/p&gt;
&lt;p&gt;&lt;a class="link" href="http://nvie.com/posts/iterators-vs-generators/" target="_blank" rel="noopener"
&gt;Iterables vs. Iterators vs. Generators&lt;/a&gt;: Vincent has been on a roll lately. He also wrote this &amp;ldquo;little pocket reference on iterables, iterators and generators&amp;rdquo; in Python.&lt;/p&gt;
&lt;p&gt;&lt;a class="link" href="http://mcfunley.com/design-for-continuous-experimentation" target="_blank" rel="noopener"
&gt;Design for Continuous Experimentation: Talk and Slides&lt;/a&gt;: I didn&amp;rsquo;t watch the lecture, but Dan McKinley&amp;rsquo;s slides on web experimentation are excellent.&lt;/p&gt;
&lt;p&gt;&lt;a class="link" href="http://blog.cloudera.com/blog/2014/03/apache-spark-a-delight-for-developers/" target="_blank" rel="noopener"
&gt;Apache Spark: A Delight for Developers&lt;/a&gt;: I&amp;rsquo;ve been playing with PySpark lately, and it really is fun.&lt;/p&gt;</description></item><item><title>Links for September 25, 2014</title><link>https://tdhopper.com/blog/sundry-links-for-september-25-2014/</link><pubDate>Thu, 25 Sep 2014 00:00:00 +0000</pubDate><guid>https://tdhopper.com/blog/sundry-links-for-september-25-2014/</guid><description>&lt;p&gt;&lt;a class="link" href="http://plato.stanford.edu/entries/statistics/" target="_blank" rel="noopener"
&gt;Philosophy of Statistics (Stanford Encyclopedia of Philosophy)&lt;/a&gt;: I suspect that a lot of the Bayesian vs Frequentist debates ignore the important epistemological underpinnings of statistics. I haven’t finished reading this yet, but I wonder if it might help.&lt;/p&gt;
&lt;p&gt;&lt;a class="link" href="https://ifttt.com/sunlightfoundation" target="_blank" rel="noopener"
&gt;Connect Sunlight Foundation to anything&lt;/a&gt;: “The Sunlight Foundation is a nonpartisan non-profit organization that uses the power of the Internet to catalyze greater U.S. Government openness and transparency.” They now have an &lt;a class="link" href="https://ifttt.com" target="_blank" rel="noopener"
&gt;IFTTT&lt;/a&gt; channel. Get push notifications when the president signs a bill!&lt;/p&gt;
&lt;p&gt;&lt;a class="link" href="http://furbo.org/2014/09/03/the-terminal/" target="_blank" rel="noopener"
&gt;furbo.org · The Terminal&lt;/a&gt;: &lt;a class="link" href="http://www.twitter.com/chockenberry" target="_blank" rel="noopener"
&gt;Craig Hockenberry&lt;/a&gt; wrote a massive post on how he uses the Terminal on OS X for fun and profit. You will learn things.&lt;/p&gt;
&lt;p&gt;&lt;a class="link" href="http://snapsnapsnap.photos/a-sneak-peek-at-camera-6-manual-controls-are-coming-soon-to-you/" target="_blank" rel="noopener"
&gt;A sneak peek at Camera+ 6… manual controls are coming soon to you! &lt;/a&gt;: I’ve been a Camera+ user on iOS for a long time. The new version coming out soon is very exciting.&lt;/p&gt;
&lt;p&gt;&lt;a class="link" href="http://githut.info/" target="_blank" rel="noopener"
&gt;GitHut - Programming Languages and GitHub&lt;/a&gt;: A very clever visualization of various languages represented on Github and of the properties of their respective repositories.&lt;/p&gt;</description></item><item><title>Links for September 20, 2014</title><link>https://tdhopper.com/blog/sundry-links-for-september-20-2014/</link><pubDate>Sat, 20 Sep 2014 00:00:00 +0000</pubDate><guid>https://tdhopper.com/blog/sundry-links-for-september-20-2014/</guid><description>&lt;p&gt;&lt;a class="link" href="http://www.jeffknupp.com/blog/2013/08/16/open-sourcing-a-python-project-the-right-way/" target="_blank" rel="noopener"
&gt;Open Sourcing a Python Project the Right Way&lt;/a&gt;: Great stuff that should be taught in school: “Most Python developers have written at least one tool, script, library or framework that others would find useful. My goal in this article is to make the process of open-sourcing existing Python code as clear and painless as possible.”&lt;/p&gt;
&lt;p&gt;&lt;a class="link" href="https://github.com/elasticsearch/elasticsearch-dsl-py/blob/master/README.rst" target="_blank" rel="noopener"
&gt;elasticsearch/elasticsearch-dsl-py&lt;/a&gt;: Elasticsearch is an incredible datastore. Unfortunately, its JSON-based query language is tedious, at best. Here’s a nice higher-level Python DSL being developed for it. It’s great!&lt;/p&gt;
&lt;p&gt;&lt;a class="link" href="http://www.podcastinghandbook.co/podcasting-equipment-guide/" target="_blank" rel="noopener"
&gt;Equipment Guide — The Podcasting Handbook&lt;/a&gt;: Dan Benjamin of 5by5 podcasting fame is writing a book on podcasting. Here’s his brief equipment guide.&lt;/p&gt;
&lt;p&gt;&lt;a class="link" href="https://github.com/bachya/pinpress" target="_blank" rel="noopener"
&gt;bachya/pinpress&lt;/a&gt;: Aaron Bach
put together a neat Ruby script that he uses to generate his link posts. This is similar to but better than my &lt;a class="link" href="https://github.com/tdhopper/sundry" target="_blank" rel="noopener"
&gt;sundry&lt;/a&gt; tool.&lt;/p&gt;
&lt;p&gt;&lt;a class="link" href="http://there4development.com/markdown-resume/" target="_blank" rel="noopener"
&gt;Markdown Resume Builder&lt;/a&gt;: I haven’t tried this yet, but I like the idea: a Markdown based resume format that can be converted into HTML or PDF.&lt;/p&gt;
&lt;p&gt;&lt;a class="link" href="http://git-scm.com/book/en/Git-Basics-Tips-and-Tricks#Auto-Completion" target="_blank" rel="noopener"
&gt;Git - Tips and Tricks&lt;/a&gt;: Enabling autocomplete in Git is something I should have done long ago.&lt;/p&gt;
&lt;p&gt;&lt;a class="link" href="http://hortonworks.com/blog/apache-storm-design-pattern-micro-batching/" target="_blank" rel="noopener"
&gt;Apache Storm Design Pattern—Micro Batching&lt;/a&gt;: Micro batching is a valuable tool when doing stream processing. Hortonworks put up a helpful post outlining three ways of doing it.&lt;/p&gt;</description></item><item><title>Quickly Converting Python Dict to JSON</title><link>https://tdhopper.com/blog/quickly-converting-python-dict-to-json/</link><pubDate>Thu, 18 Sep 2014 00:00:00 +0000</pubDate><guid>https://tdhopper.com/blog/quickly-converting-python-dict-to-json/</guid><description>&lt;p&gt;Recently, I&amp;rsquo;ve spent a lot of time going back and forth between Python dicts and JSON. For some reason, I decided last week that it&amp;rsquo;d be useful to be able to quickly convert a Python dict to pretty-printed JSON.&lt;/p&gt;
&lt;p&gt;I created a &lt;a class="link" href="http://smilesoftware.com/TextExpander/index.html" target="_blank" rel="noopener"
&gt;TextExpander&lt;/a&gt; snippet that takes a Python dict from the clipboard, converts it to JSON, and pastes it.&lt;/p&gt;
&lt;p&gt;Here are the details:&lt;/p&gt;
&lt;figure&gt;&lt;img src="https://tdhopper.com/uploads/2014/09/textexpander.png"&gt;
&lt;/figure&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt; 1
&lt;/span&gt;&lt;span class="lnt"&gt; 2
&lt;/span&gt;&lt;span class="lnt"&gt; 3
&lt;/span&gt;&lt;span class="lnt"&gt; 4
&lt;/span&gt;&lt;span class="lnt"&gt; 5
&lt;/span&gt;&lt;span class="lnt"&gt; 6
&lt;/span&gt;&lt;span class="lnt"&gt; 7
&lt;/span&gt;&lt;span class="lnt"&gt; 8
&lt;/span&gt;&lt;span class="lnt"&gt; 9
&lt;/span&gt;&lt;span class="lnt"&gt;10
&lt;/span&gt;&lt;span class="lnt"&gt;11
&lt;/span&gt;&lt;span class="lnt"&gt;12
&lt;/span&gt;&lt;span class="lnt"&gt;13
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="ch"&gt;#!/usr/bin/env python&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;os&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nn"&gt;json&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;subprocess&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;getClipboardData&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;subprocess&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Popen&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;pbpaste&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;stdout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;subprocess&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;PIPE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;retcode&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;wait&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stdout&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;cb&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;eval&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;getClipboardData&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nb"&gt;print&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cb&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sort_keys&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;indent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;separators&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;,&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;: &amp;#39;&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;</description></item><item><title>Links for September 10, 2014</title><link>https://tdhopper.com/blog/sundry-links-for-september-10-2014/</link><pubDate>Wed, 10 Sep 2014 00:00:00 +0000</pubDate><guid>https://tdhopper.com/blog/sundry-links-for-september-10-2014/</guid><description>&lt;p&gt;&lt;a class="link" href="http://textract.readthedocs.org/en/latest/" target="_blank" rel="noopener"
&gt;textract&lt;/a&gt;: textract is a Python module and a command line tool for text extraction from many file formats. It cleverly pulls together many libraries into a consistent API.&lt;/p&gt;
&lt;p&gt;&lt;a class="link" href="https://github.com/semirook/flask-kit" target="_blank" rel="noopener"
&gt;Flask Kit&lt;/a&gt;: I&amp;rsquo;ve been reading a lot about Flask (the Python web framework) lately. Flask Kit is a little tool to give some structure to new Flask projects.&lt;/p&gt;
&lt;p&gt;&lt;a class="link" href="https://github.com/audreyr/cookiecutter" target="_blank" rel="noopener"
&gt;cookiecutter&lt;/a&gt;: I was looking for this recently, but I couldn&amp;rsquo;t find it. &amp;ldquo;A command-line utility that creates projects from cookiecutters (project templates). E.g. Python package projects, jQuery plugin projects.&amp;rdquo; There&amp;rsquo;s even a &lt;a class="link" href="https://github.com/sloria/cookiecutter-flask" target="_blank" rel="noopener"
&gt;Flask template&lt;/a&gt;!&lt;/p&gt;
&lt;p&gt;&lt;a class="link" href="http://www.sciencedaily.com/releases/2010/08/100830094930.htm" target="_blank" rel="noopener"
&gt;Over 50? You Probably Prefer Negative Stories About Young People&lt;/a&gt;: A research paper from a few years ago shows that older people prefer to read negative news about young people. &amp;ldquo;In fact, older readers who chose to read negative stories about young individuals actually get a small boost in their self-esteem.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;&lt;a class="link" href="http://www.npr.org/blogs/money/2014/08/29/344034815/episode-564-the-signature" target="_blank" rel="noopener"
&gt;Episode 564: The Signature&lt;/a&gt;: The fantastic Planet Money podcast explains why signatures are meaningless in a modern age. My scribbles have become even worse since listening to this.&lt;/p&gt;
&lt;p&gt;&lt;a class="link" href="https://github.com/thieman/github-selfies" target="_blank" rel="noopener"
&gt;github-selfies&lt;/a&gt;: Here&amp;rsquo;s a Chrome and Firefox extension that allows you to quickly embed GIF selfies in GitHub posts. Caution: &lt;a class="link" href="http://www.threechords.org/blog/how-animated-gif-selfies-fixed-our-teams-morale/" target="_blank" rel="noopener"
&gt;may lead to improved team morale&lt;/a&gt;.&lt;/p&gt;</description></item><item><title>Shortcut to Switch to Previous Git Branch</title><link>https://tdhopper.com/blog/shortcut-to-switch-to-previous-git-branch/</link><pubDate>Wed, 10 Sep 2014 00:00:00 +0000</pubDate><guid>https://tdhopper.com/blog/shortcut-to-switch-to-previous-git-branch/</guid><description>&lt;img src="https://tdhopper.com/images/til.png" alt="Featured image of post Shortcut to Switch to Previous Git Branch" /&gt;&lt;blockquote class="twitter-tweet" data-lang="en"&gt;&lt;p lang="en" dir="ltr"&gt;TIL how to checkout the last branch you were on in Git (before the current):&lt;br&gt;&lt;br&gt;git checkout -&lt;/p&gt;&amp;mdash; Tim Hopper (@tdhopper) &lt;a href="https://twitter.com/tdhopper/status/509767391103172608"&gt;September 10, 2014&lt;/a&gt;&lt;/blockquote&gt;
&lt;script async src="//platform.twitter.com/widgets.js" charset="utf-8"&gt;&lt;/script&gt;</description></item><item><title>Links for August 30, 2014</title><link>https://tdhopper.com/blog/sundry-links-for-august-30-2014/</link><pubDate>Sat, 30 Aug 2014 00:00:00 +0000</pubDate><guid>https://tdhopper.com/blog/sundry-links-for-august-30-2014/</guid><description>&lt;p&gt;&lt;a class="link" href="http://jimhester.github.io/ggplot2ToGgvis/" target="_blank" rel="noopener"
&gt;Ggplot2 To Ggvis&lt;/a&gt;: I&amp;rsquo;m a huge fan of ggplot2 for data visualization in R. Here&amp;rsquo;s a brief tutorial for ggplot2 users to learn ggvis for generating &lt;em&gt;interactive&lt;/em&gt; plots in R using the grammar of graphics.&lt;/p&gt;
&lt;p&gt;&lt;a class="link" href="http://danielfrg.com/blog/2014/08/01/storm-sklearn/" target="_blank" rel="noopener"
&gt;From zero to storm cluster for scikit-learn classification | Daniel Rodriguez&lt;/a&gt;: This is a very cool, if brief, blog post on using &lt;a class="link" href="https://github.com/Parsely/streamparse" title="Parsely/streamparse · GitHub"
target="_blank" rel="noopener"
&gt;streamparse&lt;/a&gt;, my company&amp;rsquo;s open source wrapper for &lt;a class="link" href="https://storm.incubator.apache.org/" title="Storm, distributed and fault-tolerant realtime computation"
target="_blank" rel="noopener"
&gt;Apache Storm&lt;/a&gt;, and &lt;a class="link" href="http://scikit-learn.org/stable/" title="scikit-learn: machine learning in Python &amp;amp;mdash; scikit-learn 0.15.1 documentation"
target="_blank" rel="noopener"
&gt;scikit-learn&lt;/a&gt;, &lt;a class="link" href="https://www.youtube.com/watch?v=2kx19t8bNMU" title="An Introduction to Scikit-Learn - YouTube"
target="_blank" rel="noopener"
&gt;my favorite machine learning library&lt;/a&gt;, to do machine learning on data streams.&lt;/p&gt;
&lt;p&gt;&lt;a class="link" href="http://www.pixelmonkey.org/2010/11/03/pythonic-means-idiomatic-and-tasteful" target="_blank" rel="noopener"
&gt;Pythonic means idiomatic and tasteful&lt;/a&gt;: My boss Andrew recently shared an old blog post he wrote on what it means for code to be Pythonic; I think he&amp;rsquo;s right on track.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Pythonic isn’t just idiomatic Python — it’s tasteful Python. It’s less an objective property of code, more a compliment bestowed onto especially nice Python code.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;a class="link" href="https://github.com/deanishe/alfred-repos" target="_blank" rel="noopener"
&gt;git workflow&lt;/a&gt;: In my ever-continuing attempt to be able to run my entire life from &lt;a class="link" href="http://www.alfredapp.com/" title="Alfred App - Productivity App for Mac OS X"
target="_blank" rel="noopener"
&gt;Alfred&lt;/a&gt;, I recently installed this workflow that makes git repositories on my computer easily searchable.&lt;/p&gt;
&lt;p&gt;&lt;a class="link" href="http://www.deanishe.net/alfred-workflow/index.html" target="_blank" rel="noopener"
&gt;Alfred-Workflow&lt;/a&gt;: Speaking of Alfred, here&amp;rsquo;s a handy Python library that makes it easy to write your own (if you&amp;rsquo;re a Python programmer).&lt;/p&gt;
&lt;p&gt;&lt;a class="link" href="http://gitready.com/advanced/2009/02/10/squashing-commits-with-rebase.html" target="_blank" rel="noopener"
&gt;Squashing commits with rebase&lt;/a&gt;: Turns out you can use git rebase to clean up your commits before you push them to a remote repository. This can be a great way to make the commits your team sees more meaningful; don&amp;rsquo;t abuse it.&lt;/p&gt;</description></item><item><title>Tracking Weight Loss with R, Hazel, Withings, and IFTTT</title><link>https://tdhopper.com/blog/tracking-weight-loss-with-r-hazel-withings-and-ifttt/</link><pubDate>Sat, 30 Aug 2014 00:00:00 +0000</pubDate><guid>https://tdhopper.com/blog/tracking-weight-loss-with-r-hazel-withings-and-ifttt/</guid><description>&lt;p&gt;As I have &lt;a class="link" href="https://tdhopper.com/blog/2013/Jul/26/noisy-series-and-body-weight/" &gt;noted&lt;/a&gt; &lt;a class="link" href="https://tdhopper.com/blog/2013/Nov/28/noisy-series-and-body-weight-take-2/" &gt;before&lt;/a&gt;, body weight is a noisy thing. Day to day, your weight will probably fluctuate by several pounds. If you&amp;rsquo;re trying to lose weight, this noise can cause unfounded frustration and premature excitement.&lt;/p&gt;
&lt;p&gt;When I started a serious weight loss plan a year and a half ago, I bought a Wi-Fi-enabled &lt;a class="link" href="http://www.withings.com" target="_blank" rel="noopener"
&gt;Withings Scale&lt;/a&gt;. The scale allows me to automatically sync my weight with &lt;a class="link" href="http://www.bustan.net/" target="_blank" rel="noopener"
&gt;Monitor Your Weight&lt;/a&gt;, &lt;a class="link" href="http://www.myfitnesspal.com" target="_blank" rel="noopener"
&gt;MyFitnessPal&lt;/a&gt;, &lt;a class="link" href="http://runkeeper.com/home" title="RunKeeper"
target="_blank" rel="noopener"
&gt;RunKeeper&lt;/a&gt;, and other fitness apps on my phone. IFTTT also has &lt;a class="link" href="https://ifttt.com/withings" title="Withings Channel - IFTTT"
target="_blank" rel="noopener"
&gt;great Withings support&lt;/a&gt; allowing me to push my weight to various other web services.&lt;/p&gt;
&lt;p&gt;One IFTTT rule I have appends my weight to a text file in Dropbox. This file looks like this:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt; 1
&lt;/span&gt;&lt;span class="lnt"&gt; 2
&lt;/span&gt;&lt;span class="lnt"&gt; 3
&lt;/span&gt;&lt;span class="lnt"&gt; 4
&lt;/span&gt;&lt;span class="lnt"&gt; 5
&lt;/span&gt;&lt;span class="lnt"&gt; 6
&lt;/span&gt;&lt;span class="lnt"&gt; 7
&lt;/span&gt;&lt;span class="lnt"&gt; 8
&lt;/span&gt;&lt;span class="lnt"&gt; 9
&lt;/span&gt;&lt;span class="lnt"&gt;10
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-fallback" data-lang="fallback"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;263.86 August 21, 2014 at 05:56AM
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;264.62 August 22, 2014 at 08:27AM
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;264.56 August 23, 2014 at 09:41AM
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;263.99 August 24, 2014 at 08:02AM
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;265.64 August 25, 2014 at 08:08AM
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;267.4 August 26, 2014 at 08:16AM
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;265.25 August 27, 2014 at 09:08AM
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;264.17 August 28, 2014 at 07:21AM
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;264.03 August 29, 2014 at 08:43AM
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;262.71 August 30, 2014 at 08:47AM
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;For a few months, I have been experimenting with using this time series to give myself a less-noisy update on my weight, and I&amp;rsquo;ve come up with a decent solution.&lt;/p&gt;
&lt;p&gt;This &lt;a class="link" href="http://www.r-project.org/" target="_blank" rel="noopener"
&gt;R script&lt;/a&gt; will take my weight time series, resample it, smooth it with a rolling median over the last month, and write summary stats to a text file in my Dropbox. It&amp;rsquo;s not the prettiest script, but it gets the job done for now.&lt;sup id="fnref:1"&gt;&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref"&gt;1&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt; 1
&lt;/span&gt;&lt;span class="lnt"&gt; 2
&lt;/span&gt;&lt;span class="lnt"&gt; 3
&lt;/span&gt;&lt;span class="lnt"&gt; 4
&lt;/span&gt;&lt;span class="lnt"&gt; 5
&lt;/span&gt;&lt;span class="lnt"&gt; 6
&lt;/span&gt;&lt;span class="lnt"&gt; 7
&lt;/span&gt;&lt;span class="lnt"&gt; 8
&lt;/span&gt;&lt;span class="lnt"&gt; 9
&lt;/span&gt;&lt;span class="lnt"&gt;10
&lt;/span&gt;&lt;span class="lnt"&gt;11
&lt;/span&gt;&lt;span class="lnt"&gt;12
&lt;/span&gt;&lt;span class="lnt"&gt;13
&lt;/span&gt;&lt;span class="lnt"&gt;14
&lt;/span&gt;&lt;span class="lnt"&gt;15
&lt;/span&gt;&lt;span class="lnt"&gt;16
&lt;/span&gt;&lt;span class="lnt"&gt;17
&lt;/span&gt;&lt;span class="lnt"&gt;18
&lt;/span&gt;&lt;span class="lnt"&gt;19
&lt;/span&gt;&lt;span class="lnt"&gt;20
&lt;/span&gt;&lt;span class="lnt"&gt;21
&lt;/span&gt;&lt;span class="lnt"&gt;22
&lt;/span&gt;&lt;span class="lnt"&gt;23
&lt;/span&gt;&lt;span class="lnt"&gt;24
&lt;/span&gt;&lt;span class="lnt"&gt;25
&lt;/span&gt;&lt;span class="lnt"&gt;26
&lt;/span&gt;&lt;span class="lnt"&gt;27
&lt;/span&gt;&lt;span class="lnt"&gt;28
&lt;/span&gt;&lt;span class="lnt"&gt;29
&lt;/span&gt;&lt;span class="lnt"&gt;30
&lt;/span&gt;&lt;span class="lnt"&gt;31
&lt;/span&gt;&lt;span class="lnt"&gt;32
&lt;/span&gt;&lt;span class="lnt"&gt;33
&lt;/span&gt;&lt;span class="lnt"&gt;34
&lt;/span&gt;&lt;span class="lnt"&gt;35
&lt;/span&gt;&lt;span class="lnt"&gt;36
&lt;/span&gt;&lt;span class="lnt"&gt;37
&lt;/span&gt;&lt;span class="lnt"&gt;38
&lt;/span&gt;&lt;span class="lnt"&gt;39
&lt;/span&gt;&lt;span class="lnt"&gt;40
&lt;/span&gt;&lt;span class="lnt"&gt;41
&lt;/span&gt;&lt;span class="lnt"&gt;42
&lt;/span&gt;&lt;span class="lnt"&gt;43
&lt;/span&gt;&lt;span class="lnt"&gt;44
&lt;/span&gt;&lt;span class="lnt"&gt;45
&lt;/span&gt;&lt;span class="lnt"&gt;46
&lt;/span&gt;&lt;span class="lnt"&gt;47
&lt;/span&gt;&lt;span class="lnt"&gt;48
&lt;/span&gt;&lt;span class="lnt"&gt;49
&lt;/span&gt;&lt;span class="lnt"&gt;50
&lt;/span&gt;&lt;span class="lnt"&gt;51
&lt;/span&gt;&lt;span class="lnt"&gt;52
&lt;/span&gt;&lt;span class="lnt"&gt;53
&lt;/span&gt;&lt;span class="lnt"&gt;54
&lt;/span&gt;&lt;span class="lnt"&gt;55
&lt;/span&gt;&lt;span class="lnt"&gt;56
&lt;/span&gt;&lt;span class="lnt"&gt;57
&lt;/span&gt;&lt;span class="lnt"&gt;58
&lt;/span&gt;&lt;span class="lnt"&gt;59
&lt;/span&gt;&lt;span class="lnt"&gt;60
&lt;/span&gt;&lt;span class="lnt"&gt;61
&lt;/span&gt;&lt;span class="lnt"&gt;62
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-R" data-lang="R"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;INPUT_PATH&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt; &lt;span class="s"&gt;&amp;#34;~/Dropbox/Text Notes/Weight.txt&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;OUTPUT_PATH&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt; &lt;span class="s"&gt;&amp;#34;~/Dropbox/Text Notes/Weight Stats.txt&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nf"&gt;library&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lubridate&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nf"&gt;library&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ggplot2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nf"&gt;library&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;zoo&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# READ FILE&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;con&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt; &lt;span class="nf"&gt;file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;INPUT_PATH&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;&amp;#34;rt&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;lines&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt; &lt;span class="nf"&gt;readLines&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;con&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;con&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# PARSE INTO LISTS OF WEIGHTS AND DATES&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;parse.line&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt; &lt;span class="kr"&gt;function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt; &lt;span class="nf"&gt;strsplit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;split&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;&amp;#34; &amp;#34;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="n"&gt;[[1]]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;date.str&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt; &lt;span class="nf"&gt;paste&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s[2&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="m"&gt;10&lt;/span&gt;&lt;span class="n"&gt;][&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nf"&gt;is.na&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s[2&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="m"&gt;10&lt;/span&gt;&lt;span class="n"&gt;]&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="n"&gt;]&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;collapse&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;&amp;#34; &amp;#34;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;date&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt; &lt;span class="nf"&gt;mdy_hm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;date.str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;quiet&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;TRUE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;l&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt; &lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;as.numeric&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s[1]&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nf"&gt;names&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt; &lt;span class="nf"&gt;c&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;#34;weight&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;&amp;#34;date&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;l&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;list.weight.date&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt; &lt;span class="nf"&gt;lapply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lines&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;parse.line&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;weights&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt; &lt;span class="nf"&gt;lapply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;list.weight.date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kr"&gt;function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;weight&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;dates&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt; &lt;span class="nf"&gt;lapply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;list.weight.date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kr"&gt;function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# BUILD DATA FRAME&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt; &lt;span class="nf"&gt;data.frame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;weight&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;unlist&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;weights&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;date&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;do.call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;#34;c&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dates&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# CREATE TIME SERIES AND RESAMPLE&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;ts&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt; &lt;span class="nf"&gt;zoo&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;c&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;weight&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;ts&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt; &lt;span class="nf"&gt;aggregate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ts&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;tail&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;g&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;seq&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;start&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ts&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="nf"&gt;end&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ts&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="m"&gt;60&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="m"&gt;60&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="m"&gt;24&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s"&gt;&amp;#34;days&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;ts&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt; &lt;span class="nf"&gt;na.approx&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;xout&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# FUNCTION TO GET WEIGHT N-DAYS AGO IF WEIGHT IS SMOOTHED BY ROLLING MEDIAN&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# OVER A GIVEN (smooth.n) NUMBER OF DAYS&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;days.ago&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt; &lt;span class="kr"&gt;function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;days&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;smooth.n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;date&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt; &lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;tail&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ts&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="n"&gt;days&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;smoothed&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt; &lt;span class="nf"&gt;rollmedianr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;smooth.n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nf"&gt;as.numeric&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;smoothed[date]&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# SMOOTH WEIGHT BY 29 DAYS AND GENERATE SOME SUMMARY STATS&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;days&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="m"&gt;29&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;current.weight&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt; &lt;span class="nf"&gt;days.ago&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;days&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt; &lt;span class="nf"&gt;c&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;current.weight&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;current.weight&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nf"&gt;days.ago&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;days&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;current.weight&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nf"&gt;days.ago&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;30&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;days&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;current.weight&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nf"&gt;days.ago&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;365&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;days&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;current.weight&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ts&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nf"&gt;names&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;c&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;#34;current&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;&amp;#34;7days&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;&amp;#34;30days&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;&amp;#34;365days&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;&amp;#34;max&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;fileConn&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="nf"&gt;file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;OUTPUT_PATH&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt; &lt;span class="nf"&gt;c&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;paste&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;#34;Weight (lbs):&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x[&lt;/span&gt;&lt;span class="s"&gt;&amp;#34;current&amp;#34;&lt;/span&gt;&lt;span class="n"&gt;]&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nf"&gt;paste&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;#34;Total Δ:&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x[&lt;/span&gt;&lt;span class="s"&gt;&amp;#34;max&amp;#34;&lt;/span&gt;&lt;span class="n"&gt;]&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nf"&gt;paste&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;#34;1 Week Δ:&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x[&lt;/span&gt;&lt;span class="s"&gt;&amp;#34;7days&amp;#34;&lt;/span&gt;&lt;span class="n"&gt;]&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nf"&gt;paste&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;#34;1 Month Δ:&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x[&lt;/span&gt;&lt;span class="s"&gt;&amp;#34;30days&amp;#34;&lt;/span&gt;&lt;span class="n"&gt;]&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nf"&gt;paste&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;#34;1 Year Δ:&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x[&lt;/span&gt;&lt;span class="s"&gt;&amp;#34;365days&amp;#34;&lt;/span&gt;&lt;span class="n"&gt;]&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nf"&gt;writeLines&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;fileConn&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fileConn&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;The output looks something like this:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt;1
&lt;/span&gt;&lt;span class="lnt"&gt;2
&lt;/span&gt;&lt;span class="lnt"&gt;3
&lt;/span&gt;&lt;span class="lnt"&gt;4
&lt;/span&gt;&lt;span class="lnt"&gt;5
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-fallback" data-lang="fallback"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;Weight (lbs): 265.7
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;Total Δ: -112
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;1 Week Δ: -0.8
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;1 Month Δ: -4.8
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;1 Year Δ: -75
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;I want this script to be run every time my weight is updated, so I created a second IFTTT rule that will create a new file in my Dropbox, called &lt;em&gt;new_weight_measurement&lt;/em&gt;, every time I weigh in. On my Mac Mini, I have a &lt;a class="link" href="http://www.noodlesoft.com/hazel.php" title="Noodlesoft | Hazel"
target="_blank" rel="noopener"
&gt;Hazel&lt;/a&gt; rule to watch for a file of this name to be created. When Hazel sees the file, it runs my R script and deletes that file.&lt;/p&gt;
&lt;p&gt;My Hazel rule looks like this:&lt;/p&gt;
&lt;figure&gt;&lt;img src="https://tdhopper.com/uploads/2014/08/hazel-weight-1.png"&gt;
&lt;/figure&gt;
&lt;p&gt;The &amp;lsquo;embedded script&amp;rsquo; that is run is the R script above; I just have to tell Hazel to use the &lt;code&gt;Rscript&lt;/code&gt; shell.&lt;sup id="fnref:2"&gt;&lt;a href="#fn:2" class="footnote-ref" role="doc-noteref"&gt;2&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;figure&gt;&lt;img src="https://tdhopper.com/uploads/2014/08/r-script-from-hazel.png"&gt;
&lt;/figure&gt;
&lt;p&gt;At this point, every time I step on my scale, a text file with readable statistics about my smoothed weight appears in my Dropbox folder.&lt;/p&gt;
&lt;p&gt;Of course, I want this updated information to be pushed directly to me. Hazel is again the perfect tool for the job. I have a second Hazel rule that watches for &lt;em&gt;Weight Stats.txt&lt;/em&gt; to be created. Hazel can pass the path of the updated file into any script of your choice. You could, for example, use &lt;a class="link" href="http://www.mailgun.com/" title="Transactional Email API Service for Developers - Mailgun"
target="_blank" rel="noopener"
&gt;Mailgun&lt;/a&gt; to email it to yourself or &lt;a class="link" href="https://pushover.net/" title="Pushover: Simple Notifications for Android, iOS, and Desktop"
target="_blank" rel="noopener"
&gt;Pushover&lt;/a&gt; to push it to your mobile devices. Obviously, I want to tweet mine.&lt;/p&gt;
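&lt;p&gt;As a sketch of the Pushover route (the token and key variables below are placeholders, not a working configuration), the script Hazel runs could be a single &lt;code&gt;curl&lt;/code&gt; call to the Pushover messages API:&lt;/p&gt;

```shell
# Push the contents of the stats file to your devices via Pushover
# (https://pushover.net/api). APP_TOKEN and USER_KEY are placeholders
# for real credentials; Hazel passes the file's path in as $1.
push_stats() {
    curl -s \
        -F "token=$APP_TOKEN" \
        -F "user=$USER_KEY" \
        -F "message=$(cat "$1")" \
        https://api.pushover.net/1/messages.json
}
```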
&lt;p&gt;I have a Twitter account called &lt;a class="link" href="https://twitter.com/hopsfitness" target="_blank" rel="noopener"
&gt;@hopsfitness&lt;/a&gt; where I&amp;rsquo;ve recently been tracking my fitness progress. On my Mac Mini, I have &lt;a class="link" href="https://github.com/sferik/t" title="t GitHub"
target="_blank" rel="noopener"
&gt;t&lt;/a&gt; configured to access @hopsfitness from the command line. Thus, tweeting my updated statistics is just a matter of a little shell script executed by Hazel:&lt;/p&gt;
&lt;figure&gt;&lt;img src="https://tdhopper.com/uploads/2014/08/hazel-weight-2.png"&gt;
&lt;/figure&gt;
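&lt;p&gt;The shell script behind that rule amounts to one &lt;code&gt;t&lt;/code&gt; command; a minimal sketch, assuming Hazel passes the matched file&amp;rsquo;s path as the first argument:&lt;/p&gt;

```shell
# Tweet the contents of the stats file from the configured account
# using the `t` CLI (github.com/sferik/t). $1 is the file path that
# Hazel hands to the script.
tweet_stats() {
    t update "$(cat "$1")"
}
```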
&lt;p&gt;Since this data goes to Twitter, I can get it painlessly pushed to my phone: Twitter still allows you to subscribe to accounts via text message, which I&amp;rsquo;ve done with @hopsfitness. A minute or so after I step on my scale, I get a text with useful information about where I am and where I&amp;rsquo;m going; this is much preferable to the noisy weight I see on my scale.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Update&lt;/strong&gt; (2014-12-06): I replaced my R script with a Python/pandas script. It requires Python 3 (to render the delta characters).&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt; 1
&lt;/span&gt;&lt;span class="lnt"&gt; 2
&lt;/span&gt;&lt;span class="lnt"&gt; 3
&lt;/span&gt;&lt;span class="lnt"&gt; 4
&lt;/span&gt;&lt;span class="lnt"&gt; 5
&lt;/span&gt;&lt;span class="lnt"&gt; 6
&lt;/span&gt;&lt;span class="lnt"&gt; 7
&lt;/span&gt;&lt;span class="lnt"&gt; 8
&lt;/span&gt;&lt;span class="lnt"&gt; 9
&lt;/span&gt;&lt;span class="lnt"&gt;10
&lt;/span&gt;&lt;span class="lnt"&gt;11
&lt;/span&gt;&lt;span class="lnt"&gt;12
&lt;/span&gt;&lt;span class="lnt"&gt;13
&lt;/span&gt;&lt;span class="lnt"&gt;14
&lt;/span&gt;&lt;span class="lnt"&gt;15
&lt;/span&gt;&lt;span class="lnt"&gt;16
&lt;/span&gt;&lt;span class="lnt"&gt;17
&lt;/span&gt;&lt;span class="lnt"&gt;18
&lt;/span&gt;&lt;span class="lnt"&gt;19
&lt;/span&gt;&lt;span class="lnt"&gt;20
&lt;/span&gt;&lt;span class="lnt"&gt;21
&lt;/span&gt;&lt;span class="lnt"&gt;22
&lt;/span&gt;&lt;span class="lnt"&gt;23
&lt;/span&gt;&lt;span class="lnt"&gt;24
&lt;/span&gt;&lt;span class="lnt"&gt;25
&lt;/span&gt;&lt;span class="lnt"&gt;26
&lt;/span&gt;&lt;span class="lnt"&gt;27
&lt;/span&gt;&lt;span class="lnt"&gt;28
&lt;/span&gt;&lt;span class="lnt"&gt;29
&lt;/span&gt;&lt;span class="lnt"&gt;30
&lt;/span&gt;&lt;span class="lnt"&gt;31
&lt;/span&gt;&lt;span class="lnt"&gt;32
&lt;/span&gt;&lt;span class="lnt"&gt;33
&lt;/span&gt;&lt;span class="lnt"&gt;34
&lt;/span&gt;&lt;span class="lnt"&gt;35
&lt;/span&gt;&lt;span class="lnt"&gt;36
&lt;/span&gt;&lt;span class="lnt"&gt;37
&lt;/span&gt;&lt;span class="lnt"&gt;38
&lt;/span&gt;&lt;span class="lnt"&gt;39
&lt;/span&gt;&lt;span class="lnt"&gt;40
&lt;/span&gt;&lt;span class="lnt"&gt;41
&lt;/span&gt;&lt;span class="lnt"&gt;42
&lt;/span&gt;&lt;span class="lnt"&gt;43
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;dateutil&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;pd&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;random&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;os.path&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;expanduser&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;join&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;home&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;expanduser&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;~&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nb"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;home&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;Dropbox/Text Notes/Weight.txt&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;r&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;lines&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;readlines&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;parse_line&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34; &amp;#34;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;weight&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;date&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dateutil&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39; &amp;#39;&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;weight&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;weight&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;parse_line&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;l&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;lines&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;date&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;weight&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; \
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;set_index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;date&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;resample&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;1D&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;how&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;mean&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;weight&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;missing&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;weight&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;weight&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isnull&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;weight&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;weight&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;weight&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;weight&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;interpolate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;method&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;linear&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;std&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;weight&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;weight&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;diff&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dropna&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;noise&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;weight&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;missing&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;missing&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;normalvariate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;missing&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;weight&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;weight&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;weight&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;weight&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;noise&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;smoothed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ewma&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;weight&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;weight&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;current&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;smoothed&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;stats&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;&amp;#34;&amp;#34;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt;Weight (lbs): &lt;/span&gt;&lt;span class="si"&gt;%(weight).1f&lt;/span&gt;&lt;span class="s2"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt;Total Δ: &lt;/span&gt;&lt;span class="si"&gt;%(total).1f&lt;/span&gt;&lt;span class="s2"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt;1 Week Δ: &lt;/span&gt;&lt;span class="si"&gt;%(week).1f&lt;/span&gt;&lt;span class="s2"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt;1 Month Δ: &lt;/span&gt;&lt;span class="si"&gt;%(month).1f&lt;/span&gt;&lt;span class="s2"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt;1 Year Δ: &lt;/span&gt;&lt;span class="si"&gt;%(year).1f&lt;/span&gt;&lt;span class="s2"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="s2"&gt;&amp;#34;&amp;#34;&amp;#34;&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;weight&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;current&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="s2"&gt;&amp;#34;total&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;current&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;smoothed&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="s2"&gt;&amp;#34;week&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;current&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;smoothed&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="s2"&gt;&amp;#34;month&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;current&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;smoothed&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="s2"&gt;&amp;#34;year&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;current&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;smoothed&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;366&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nb"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;home&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;Dropbox/Text Notes/Weight Stats.txt&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;wb&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;bytes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stats&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;UTF-8&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;div class="footnotes" role="doc-endnotes"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:1"&gt;
&lt;p&gt;This assumes your input file is formatted like mine, but you could easily adjust the first part of the code for other formats.&amp;#160;&lt;a href="#fnref:1" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:2"&gt;
&lt;p&gt;You can &lt;a class="link" href="http://www.r-project.org/" title="The R Project for Statistical Computing"
target="_blank" rel="noopener"
&gt;download R here&lt;/a&gt;; installing it should add &lt;code&gt;Rscript&lt;/code&gt; to your system path.&amp;#160;&lt;a href="#fnref:2" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</description></item><item><title>Keeping IPython Notebooks Running in the Background</title><link>https://tdhopper.com/blog/keeping-ipython-notebooks-running-in-the-background/</link><pubDate>Thu, 28 Aug 2014 00:00:00 +0000</pubDate><guid>https://tdhopper.com/blog/keeping-ipython-notebooks-running-in-the-background/</guid><description>&lt;p&gt;I spend a lot of time in &lt;a class="link" href="http://ipython.org/notebook.html" title="The IPython Notebook &amp;amp;mdash; IPython"
target="_blank" rel="noopener"
&gt;IPython Notebooks&lt;/a&gt; for work. One of the few annoyances of IPython Notebooks is that they require keeping a terminal window open to run the notebook server and kernel. I routinely launch a Notebook kernel in a directory where I keep my work-related notebooks. Earlier this week, I started to wonder if there was a way for me to keep this kernel running all the time &lt;em&gt;without&lt;/em&gt; having to keep a terminal window open.&lt;/p&gt;
&lt;p&gt;If you&amp;rsquo;ve ever tried to do chron-like automation on OS X, you&amp;rsquo;ve surely come across &lt;a class="link" href="http://launchd.info/" target="_blank" rel="noopener"
&gt;launchd&lt;/a&gt;, &amp;ldquo;a unified, open-source service management framework for starting, stopping and managing daemons, applications, processes, and scripts&amp;rdquo;. You&amp;rsquo;ve probably also gotten frustrated with launchd and given up.&lt;/p&gt;
&lt;p&gt;I recently started using &lt;a class="link" href="http://www.soma-zone.com/LaunchControl/" title="soma-zone: LaunchControl"
target="_blank" rel="noopener"
&gt;LaunchControl&lt;/a&gt;, &amp;ldquo;a fully-featured launchd GUI&amp;rdquo;; it&amp;rsquo;s pretty nice and worth $10. It occurred to me that LaunchControl would be a good way to keep my Notebook kernel running in the background.&lt;/p&gt;
&lt;p&gt;I created a LaunchControl job to run the following command:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-fallback" data-lang="fallback"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;/usr/local/bin/IPython notebook --matplotlib inline --port=9777 --browser=false
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;This launches an IPython Notebook kernel accessible on port 9777; setting the browser flag to something other than an installed browser prevents a browser window from opening when the kernel is launched.&lt;/p&gt;
&lt;p&gt;I added three other launchd keys in LaunchControl:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A &lt;em&gt;Working Directory&lt;/em&gt; key to tell LaunchControl to start my notebook in my desired folder.&lt;/li&gt;
&lt;li&gt;A &lt;em&gt;Run At Load&lt;/em&gt; key to tell it to start my kernel as soon as I load the job.&lt;/li&gt;
&lt;li&gt;And a &lt;em&gt;Keep alive&lt;/em&gt; key to tell LaunchControl to restart my kernel should the process ever die.&lt;/li&gt;
&lt;/ul&gt;
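&lt;p&gt;If you&amp;rsquo;d rather skip the GUI, the equivalent job can be written by hand as a property list in &lt;code&gt;~/Library/LaunchAgents&lt;/code&gt;. A sketch (the label and working directory below are placeholders):&lt;/p&gt;

```xml
&lt;?xml version="1.0" encoding="UTF-8"?&gt;
&lt;!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
  "http://www.apple.com/DTDs/PropertyList-1.0.dtd"&gt;
&lt;plist version="1.0"&gt;
&lt;dict&gt;
    &lt;key&gt;Label&lt;/key&gt;
    &lt;string&gt;com.example.ipython-notebook&lt;/string&gt;
    &lt;key&gt;ProgramArguments&lt;/key&gt;
    &lt;array&gt;
        &lt;string&gt;/usr/local/bin/ipython&lt;/string&gt;
        &lt;string&gt;notebook&lt;/string&gt;
        &lt;string&gt;--matplotlib&lt;/string&gt;
        &lt;string&gt;inline&lt;/string&gt;
        &lt;string&gt;--port=9777&lt;/string&gt;
        &lt;string&gt;--browser=false&lt;/string&gt;
    &lt;/array&gt;
    &lt;key&gt;WorkingDirectory&lt;/key&gt;
    &lt;string&gt;/Users/you/notebooks&lt;/string&gt;
    &lt;key&gt;RunAtLoad&lt;/key&gt;
    &lt;true/&gt;
    &lt;key&gt;KeepAlive&lt;/key&gt;
    &lt;true/&gt;
&lt;/dict&gt;
&lt;/plist&gt;
```

&lt;p&gt;Load it with &lt;code&gt;launchctl load ~/Library/LaunchAgents/com.example.ipython-notebook.plist&lt;/code&gt;.&lt;/p&gt;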
&lt;p&gt;Here&amp;rsquo;s how it looks in LaunchControl:&lt;/p&gt;
&lt;figure&gt;&lt;img src="https://tdhopper.com/uploads/2014/08/launchcontrol.png"&gt;
&lt;/figure&gt;
&lt;p&gt;After I created it, I just had to save and load, and I was off to the races; the IPython kernel starts and runs in the background. I can access my Notebooks by navigating to 127.0.0.1:9777 in my browser. Actually, I added &lt;code&gt;127.0.0.1 parsely.scratch&lt;/code&gt; to my hosts file so I can access my Notebooks at parsely.scratch:9777. This works nicely with Chrome&amp;rsquo;s autocomplete feature. I&amp;rsquo;m avoiding the temptation to run nginx and give it an even prettier url.&lt;/p&gt;</description></item><item><title>Links for August 28, 2014</title><link>https://tdhopper.com/blog/sundry-links-for-august-28-2014/</link><pubDate>Thu, 28 Aug 2014 00:00:00 +0000</pubDate><guid>https://tdhopper.com/blog/sundry-links-for-august-28-2014/</guid><description>&lt;p&gt;&lt;a class="link" href="http://stackoverflow.com/questions/2161406/how-do-i-generate-a-uniform-random-integer-partition" target="_blank" rel="noopener"
&gt;How do I generate a uniform random integer partition?&lt;/a&gt;: This week, I wanted to generate random partitions of integers. Unsurprisingly, stackoverflow pulled through with a Python snippet to do just that.&lt;/p&gt;
&lt;p&gt;&lt;a class="link" href="http://support.alfredapp.com/kb:firefox-chrome-bookmarks" target="_blank" rel="noopener"
&gt;Firefox and Chrome Bookmarks&lt;/a&gt;: I love &lt;a class="link" href="http://www.alfredapp.org" target="_blank" rel="noopener"
&gt;Alfred&lt;/a&gt; as a launcher in OS X. I use it many, many times a day. I just found this helpful workflow for quickly searching and opening my Chrome bookmarks.&lt;/p&gt;
&lt;p&gt;&lt;a class="link" href="http://www.youneedabudget.com/blog/2014/ynab-for-ipad-is-here/" target="_blank" rel="noopener"
&gt;YNAB for iPad is Here&lt;/a&gt;: YNAB has been the best thing to ever happen to my financial life. I use it to track all my finances. They just released a beautiful iPad app. Importantly, it brings the ability to modify a budget to mobile!&lt;/p&gt;
&lt;p&gt;&lt;a class="link" href="http://the-paper-trail.org/blog/distributed-systems-theory-for-the-distributed-systems-engineer/" target="_blank" rel="noopener"
&gt;Distributed systems theory for the distributed systems engineer&lt;/a&gt;: I work on distributed systems these days. I need to read some of these papers.&lt;/p&gt;</description></item><item><title>Links for August 25, 2014</title><link>https://tdhopper.com/blog/sundry-links-for-august-25-2014/</link><pubDate>Mon, 25 Aug 2014 00:00:00 +0000</pubDate><guid>https://tdhopper.com/blog/sundry-links-for-august-25-2014/</guid><description>&lt;p&gt;&lt;a class="link" href="http://stackoverflow.com/questions/352098/how-can-i-pretty-print-json/1920585#1920585" target="_blank" rel="noopener"
&gt;How can I pretty-print JSON at the command line?&lt;/a&gt;: I needed to pretty print some JSON at the command line earlier today. The easiest way might be to pipe it through &lt;code&gt;python -m json.tool&lt;/code&gt;.&lt;/p&gt;
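&lt;p&gt;A quick demonstration (shown here with &lt;code&gt;python3&lt;/code&gt;; the module is the same either way):&lt;/p&gt;

```shell
# json.tool ships with the standard library: it reads JSON on stdin
# and writes an indented copy to stdout.
echo '{"name": "tdhopper", "links": 4}' | python3 -m json.tool
```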
&lt;p&gt;&lt;a class="link" href="http://blog.iansinnott.com/integrating-alfred-and-keyboard-maestro/" target="_blank" rel="noopener"
&gt;Integrating Alfred &amp;amp; Keyboard Maestro&lt;/a&gt;: I love &lt;a class="link" href="http://www.keyboardmaestro.com/main/" target="_blank" rel="noopener"
&gt;Keyboard Maestro&lt;/a&gt; for automating all kinds of things on my Mac, but I&amp;rsquo;m reaching a limit of keyboard shortcuts I can remember. Here&amp;rsquo;s an Alfred workflow for launching macros instead.&lt;/p&gt;
&lt;p&gt;&lt;a class="link" href="https://github.com/Parsely/streamparse/releases/tag/v1.0.0" target="_blank" rel="noopener"
&gt;streamparse 1.0.0&lt;/a&gt;: My team at &lt;a class="link" href="http://www.parsely.com" target="_blank" rel="noopener"
&gt;Parsely&lt;/a&gt; is building a tool for easily writing Storm topologies (for processing large volumes of streaming data) in Python. We just released 1.0.0!&lt;/p&gt;
&lt;p&gt;&lt;a class="link" href="http://brettterpstra.com/projects/te-tools/" target="_blank" rel="noopener"
&gt;TextExpander Tools&lt;/a&gt;: Brett Terpstra, the king of Mac hacks, has some really handy tools for TextExpander.&lt;/p&gt;
&lt;p&gt;&lt;a class="link" href="http://www.gnu.org/software/parallel/" target="_blank" rel="noopener"
&gt;GNU Parallel&lt;/a&gt;: GNU parallel is a shell tool for executing jobs in parallel on one or more computers, with xargs-like syntax. Pretty cool. HT &lt;a class="link" href="http://www.twitter.com/oceankidbilly" target="_blank" rel="noopener"
&gt;@oceankidbilly&lt;/a&gt;.&lt;/p&gt;</description></item><item><title>Links for August 23, 2014</title><link>https://tdhopper.com/blog/sundry-links-for-august-23-2014/</link><pubDate>Sat, 23 Aug 2014 00:00:00 +0000</pubDate><guid>https://tdhopper.com/blog/sundry-links-for-august-23-2014/</guid><description>&lt;p&gt;&lt;a class="link" href="http://crsmithdev.com/arrow/" target="_blank" rel="noopener"
&gt;Arrow: better dates and times for Python&lt;/a&gt;: Arrow is a slick Python library &amp;ldquo;that offers a sensible, human-friendly approach to creating, manipulating, formatting and converting dates, times, and timestamps&amp;rdquo;. It&amp;rsquo;s a friendly alternative to datetime.&lt;/p&gt;
&lt;p&gt;&lt;a class="link" href="http://penandpants.com/2014/03/09/docker-via-homebrew/" target="_blank" rel="noopener"
&gt;Docker via Homebrew&lt;/a&gt;: I&amp;rsquo;m starting to use Docker (&amp;ldquo;Docker is an open platform for developers and sysadmins to build, ship, and run distributed applications&amp;rdquo;) on occasion. Here are easy install instructions for Mac users.&lt;/p&gt;
&lt;p&gt;&lt;a class="link" href="https://www.coursera.org/course/mmds" target="_blank" rel="noopener"
&gt;Mining Massive Datasets MOOC&lt;/a&gt;: I&amp;rsquo;m terrible at completing MOOCs, but I&amp;rsquo;m &lt;em&gt;really&lt;/em&gt; interested in this new one on Mining Massive Datasets.&lt;/p&gt;
&lt;p&gt;&lt;a class="link" href="https://chrome.google.com/webstore/detail/url-pinner/lchefjdnocignejmkklgakfmnjhiimjh" target="_blank" rel="noopener"
&gt;URL Pinner - Chrome Web Store&lt;/a&gt;: URL Pinner is one of my favorite Chrome Extensions. I use it to automatically pin my Gmail and Rdio windows (which I almost always have open).&lt;/p&gt;
&lt;p&gt;&lt;a class="link" href="http://www.seengee.co.uk/2012/09/08/using-multitail-for-monitoring-multiple-log-files/" target="_blank" rel="noopener"
&gt;Using multitail for monitoring multiple log files&lt;/a&gt;: If you work with distributed systems, you&amp;rsquo;re probably used to SSH-ing into multiple machines to access logs. Multitail might save you some time.&lt;/p&gt;
&lt;p&gt;&lt;a class="link" href="http://www.smbc-comics.com/?id=3454" target="_blank" rel="noopener"
&gt;Saturday Morning Breakfast Cereal&lt;/a&gt;: SMBC shows how job interviews would go if we were more honest.&lt;/p&gt;
&lt;p&gt;I continue to be fascinated with how noisy my weight time series is. While I&amp;rsquo;ve continued to lose weight over time, my weight goes up two out of five mornings.&lt;/p&gt;
&lt;p&gt;Here&amp;rsquo;s a plot of the time series of my change in weight. Note how often the change is positive, i.e. I appear to have gained weight:&lt;/p&gt;
&lt;figure&gt;&lt;img src="https://tdhopper.com/uploads/2013/11/weight_change1.png"&gt;
&lt;/figure&gt;
&lt;p&gt;This volatility can hide the fact that I&amp;rsquo;m making progress! When I put a regression line through the points, you can see that the average change is slightly below zero:&lt;sup id="fnref:1"&gt;&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref"&gt;1&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;figure&gt;&lt;img src="https://tdhopper.com/uploads/2013/11/weight_change2.png"&gt;
&lt;/figure&gt;
&lt;p&gt;I have wondered recently if my average change in weight is correlated with the day of the week. My hypothesis is that my weight tends to go up over the weekends, so I created a boxplot of my change in weight categorized by day.&lt;/p&gt;
&lt;figure&gt;&lt;img src="https://tdhopper.com/uploads/2013/11/weight_box.png"&gt;
&lt;/figure&gt;
&lt;p&gt;Indeed, on Sundays and Mondays (i.e. weight change from Saturday morning to Sunday morning and Sunday morning to Monday morning) my median weight change is slightly above zero. This makes sense to me: on Saturdays, I&amp;rsquo;m more likely to be doing things with friends, and thus I have less control over my meals.&lt;sup id="fnref:2"&gt;&lt;a href="#fn:2" class="footnote-ref" role="doc-noteref"&gt;2&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;p&gt;I wish I had a good explanation for why the change on Friday is so dramatic, but I don&amp;rsquo;t. Any guesses?&lt;/p&gt;
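&lt;p&gt;The grouping behind the boxplot is easy to reproduce. Here is a minimal Python sketch, with made-up dates and numbers rather than my actual weigh-ins, that buckets each daily change by weekday and takes the median of each bucket:&lt;/p&gt;

```python
# Bucket daily weight changes by weekday and take medians.
# The dates and changes below are illustrative, not real data.
import datetime
import statistics
from collections import defaultdict

changes = [
    (datetime.date(2013, 11, day), change)
    for day, change in [(1, -0.4), (2, 0.6), (3, 0.8), (4, -0.2),
                        (5, -0.6), (6, -0.3), (7, 0.1), (8, -0.5)]
]

by_weekday = defaultdict(list)
for date, change in changes:
    by_weekday[date.strftime("%A")].append(change)

medians = {day: statistics.median(vals) for day, vals in by_weekday.items()}
```

&lt;p&gt;In pandas this would be a one-line groupby, but the standard library is plenty for a handful of weigh-ins.&lt;/p&gt;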
&lt;div class="footnotes" role="doc-endnotes"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:1"&gt;
&lt;p&gt;I mentioned this to my college roommate who is a financial planner. He noted how similar this is to investing; it&amp;rsquo;s a constant battle for him to convince his clients to look at &lt;em&gt;average&lt;/em&gt; behavior instead of daily changes.&amp;#160;&lt;a href="#fnref:1" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:2"&gt;
&lt;p&gt;Also, beer.&amp;#160;&lt;a href="#fnref:2" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</description></item><item><title>Tweeting Primes</title><link>https://tdhopper.com/blog/tweeting-primes/</link><pubDate>Fri, 08 Nov 2013 00:00:00 +0000</pubDate><guid>https://tdhopper.com/blog/tweeting-primes/</guid><description>&lt;p&gt;I recently discovered the Twitter account &lt;a class="link" href="https://twitter.com/_primes_" target="_blank" rel="noopener"
&gt;@&lt;em&gt;primes&lt;/em&gt;&lt;/a&gt;. Every hour, they tweet the next prime number. This made me wonder two things. First, what is the largest prime that you can tweet (in base-10 encoding in 140 characters)?&lt;sup id="fnref:1"&gt;&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref"&gt;1&lt;/a&gt;&lt;/sup&gt; Second, how long until they get there?&lt;/p&gt;
&lt;p&gt;Doing some quick calculations in Mathematica, I believe the largest 140-digit prime is the following:&lt;/p&gt;
&lt;p&gt;9999999999999999999999999999999999999999999999&lt;br /&gt;
9999999999999999999999999999999999999999999999&lt;br /&gt;
999999999999999999999999999999999999999999999997&lt;/p&gt;
&lt;p&gt;Wolfram Alpha &lt;a class="link" href="http://www.wolframalpha.com/input/?i=IsPrime%5B99999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999997" target="_blank" rel="noopener"
&gt;confirms that this is prime&lt;/a&gt; and that the &lt;a class="link" href="http://www.wolframalpha.com/input/?i=Length%5BIntegerDigits%5BNextPrime%5B99999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999997%5D%5D%5D" target="_blank" rel="noopener"
&gt;next prime is 141 characters&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;As for how long it would take, recall that the number of primes less than $n$ &lt;a class="link" href="http://en.wikipedia.org/wiki/Prime_number_theorem" target="_blank" rel="noopener"
&gt;is approximately&lt;/a&gt; $\frac{n}{\ln n}$. The number of primes less than $10^{140}$ is approximately&lt;/p&gt;
$$\pi(10^{140}) \approx \frac{10^{140}}{140\cdot \ln 10} \approx 3.1\cdot 10^{137}.$$&lt;p&gt;That&amp;rsquo;s $3\cdot 10^{57}$ times the estimated number of atoms in the universe. Looks like &lt;a class="link" href="https://twitter.com/_primes_" target="_blank" rel="noopener"
&gt;@&lt;em&gt;primes&lt;/em&gt;&lt;/a&gt; should be able to tweet for a while.&lt;/p&gt;
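&lt;p&gt;The arithmetic is easy to sanity-check in Python. This is a back-of-the-envelope script, not part of the original calculation:&lt;/p&gt;

```python
# Sanity check of the prime number theorem estimate:
# pi(n) is roughly n / ln(n), so pi(10**140) is roughly 10**140 / (140 ln 10).
import math

estimate = 10.0**140 / (140 * math.log(10))
atoms = 1e80  # commonly cited rough count of atoms in the observable universe

print(f"primes below 10**140: about {estimate:.1e}")       # about 3.1e137
print(f"ratio to atoms in the universe: about {estimate / atoms:.1e}")
```

&lt;p&gt;At one tweet per hour, that works out to something like $3.5\cdot 10^{133}$ years of tweeting.&lt;/p&gt;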
&lt;div class="footnotes" role="doc-endnotes"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:1"&gt;
&lt;p&gt;The &lt;a class="link" href="http://en.wikipedia.org/wiki/Mersenne_primes" target="_blank" rel="noopener"
&gt;largest known prime&lt;/a&gt; is $2^{57,885,161} − 1$ and has 17,425,170 digits.&amp;#160;&lt;a href="#fnref:1" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</description></item><item><title>Sublime Text and Markdown</title><link>https://tdhopper.com/blog/sublime-text-and-markdown/</link><pubDate>Sat, 19 Oct 2013 00:00:00 +0000</pubDate><guid>https://tdhopper.com/blog/sublime-text-and-markdown/</guid><description>&lt;p&gt;I have largely moved from Textmate to Sublime Text 2 for text editing. Among other reasons, Sublime Text is cross platform, and I use Windows at work and a Mac at home. I have also started writing as much as I can in &lt;a class="link" href="http://daringfireball.net/projects/markdown/" target="_blank" rel="noopener"
&gt;Markdown&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I intended to write a blog post about using Sublime Text as a tool for writing Markdown. However, the inimitable Federico Viticci, of &lt;a class="link" href="http://www.macstories.net/" target="_blank" rel="noopener"
&gt;macstories.net&lt;/a&gt;, has &lt;a class="link" href="http://www.macstories.net/roundups/sublime-text-2-and-markdown-tips-tricks-and-links/" target="_blank" rel="noopener"
&gt;already written that post&lt;/a&gt;, so I will simply refer you there.&lt;/p&gt;</description></item><item><title>The Incessant Commentary on Being Tall</title><link>https://tdhopper.com/blog/the-incessant-commentary-on-being-tall/</link><pubDate>Wed, 02 Oct 2013 00:00:00 +0000</pubDate><guid>https://tdhopper.com/blog/the-incessant-commentary-on-being-tall/</guid><description>&lt;p&gt;Ralph Keyes, &lt;a class="link" href="http://www.ralphkeyes.com/height/" target="_blank" rel="noopener"
&gt;The Height of Your Life&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;I&amp;rsquo;ve heard this sort of thing repeatedly from tall men. It&amp;rsquo;s not the incessant commentary about their height that is so annoying, it&amp;rsquo;s the stupefying boredom of it all. Were anyone to say something original or witty or different in any way, the constant chatter thrown their way might at least be entertaining. But soon after reaching their full height, tall people realize to their horror that the lifetime&amp;rsquo;s commentary to which they&amp;rsquo;ve been sentenced comes mostly from those with least to say.&lt;/p&gt;
&lt;/blockquote&gt;</description></item><item><title>Noisy Series and Body Weight</title><link>https://tdhopper.com/blog/noisy-series-and-body-weight/</link><pubDate>Fri, 26 Jul 2013 00:00:00 +0000</pubDate><guid>https://tdhopper.com/blog/noisy-series-and-body-weight/</guid><description>&lt;p&gt;I put on some weight during my time in grad school, and this spring, I decided to do something about it.
In April, I started using &lt;a class="link" href="https://www.myfitnesspal.com" target="_blank" rel="noopener"
&gt;MyFitnessPal&lt;/a&gt; to track my food intake and exercise, and I run a net calorie deficit every day. Thankfully this seems to be working.&lt;/p&gt;
&lt;p&gt;In May, I bought a &lt;a class="link" href="http://www.amazon.com/Withings-Wireless-Scale-WS-30-Black/dp/B00AXYL4M6/ref=sr_1_4?ie=UTF8&amp;amp;qid=1374864050&amp;amp;sr=8-4&amp;amp;keywords=withings" target="_blank" rel="noopener"
&gt;Withings WS-30&lt;/a&gt; wireless scale.
When I first heard about these wifi scales, I thought they sounded like a gimmick; however, the Withings has become a helpful tool in my weight loss process.&lt;/p&gt;
&lt;p&gt;Every morning, I step on the scale and my weight is automatically broadcast to MyFitnessPal, &lt;a class="link" href="https://itunes.apple.com/us/app/monitor-your-weight/id413313086?mt=8" target="_blank" rel="noopener"
&gt;Monitor Your Weight&lt;/a&gt; on iOS, and a text file in my Dropbox folder (via &lt;a class="link" href="https://ifttt.com" target="_blank" rel="noopener"
&gt;IFTTT&lt;/a&gt; and Withings&amp;rsquo; API).
MyFitnessPal adjusts my daily calorie limit based on my weight, Monitor Your Weight is a great tool for visualizing progress, and I use the text file to import a ggplot time series of my weight into &lt;a class="link" href="https://dayoneapp.com" target="_blank" rel="noopener"
&gt;Day One&lt;/a&gt; each month.&lt;/p&gt;
&lt;p&gt;An interesting aspect of my weight time series is how noisy it is. (No doubt this is true for others as well.)
On many mornings, my weight is up from the day before (despite a fairly consistent net caloric deficit).
As you can see from the plot, my weight jumps up and down daily even though the overall trend is downward.&lt;/p&gt;
&lt;figure&gt;&lt;img src="https://tdhopper.com/uploads/2013/07/weightseries.png"&gt;
&lt;/figure&gt;
&lt;p&gt;I have been wondering what percentage of days I actually lose weight, so I decided to find out.
The plot below is a histogram of my weight change from day to day.&lt;sup id="fnref:1"&gt;&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref"&gt;1&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;figure&gt;&lt;img src="https://tdhopper.com/uploads/2013/07/weighthist.png"&gt;
&lt;/figure&gt;
&lt;p&gt;The data appear nearly Gaussian around 0! (In fact, the p-value of the Shapiro-Wilk normality test is 0.11, arguably not small enough to reject the null hypothesis that the data &lt;em&gt;are&lt;/em&gt; normally distributed.) Fortunately, the mean of the differences is about -0.24 pounds/day, and my progress is downward.&lt;/p&gt;
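&lt;p&gt;A quick simulation makes the point concrete. Drawing daily changes from a normal distribution with a mean of -0.24 pounds per day (synthetic numbers standing in for my data, not the data itself) still produces an apparent gain on roughly two days in five:&lt;/p&gt;

```python
# Simulate daily weight changes: mean -0.24 lb/day with noise of
# comparable scale. These are synthetic draws, not my weigh-ins.
import random
import statistics

rng = random.Random(42)
changes = [rng.gauss(-0.24, 1.0) for _ in range(1000)]

mean_change = statistics.mean(changes)
frac_up = sum(1 for c in changes if c > 0) / len(changes)
# mean_change comes out near -0.24, yet frac_up is roughly 0.4:
# a steady average loss that still shows gains on many mornings.
```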
&lt;p&gt;In total, I lost weight on 48 days, gained on 33, and stayed the same on 4.
That means I&amp;rsquo;ve steadily lost weight while moving down on only 56% of days.
I guess I don&amp;rsquo;t need to be depressed every time my weight jumps up slightly&amp;hellip;.&lt;/p&gt;
&lt;div class="footnotes" role="doc-endnotes"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:1"&gt;
&lt;p&gt;This isn&amp;rsquo;t 100% true. I&amp;rsquo;m hiding the fact that I missed weighing-in on some days.&amp;#160;&lt;a href="#fnref:1" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</description></item><item><title>Guide to Monte Carlo Methods?</title><link>https://tdhopper.com/blog/guide-to-monte-carlo-methods/</link><pubDate>Tue, 23 Jul 2013 00:00:00 +0000</pubDate><guid>https://tdhopper.com/blog/guide-to-monte-carlo-methods/</guid><description>&lt;p&gt;I have started to realize that &lt;a class="link" href="http://en.wikipedia.org/wiki/Monte_Carlo_method" target="_blank" rel="noopener"
&gt;Monte Carlo methods&lt;/a&gt; of various kinds keep coming up in my work. Despite significant application of Monte Carlo in my grad school research, I think I only know enough to be dangerous. I&amp;rsquo;d like to get a better grasp on Monte Carlo methods (especially MCMC and simulation).&lt;/p&gt;
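&lt;p&gt;For the MCMC side of things, the core idea fits in a few lines. Here is a bare-bones random-walk Metropolis sampler, a toy sketch of my own (not an excerpt from any book) targeting a standard normal distribution:&lt;/p&gt;

```python
# Random-walk Metropolis sampling from a standard normal target.
# A toy sketch for intuition; real applications need burn-in,
# convergence diagnostics, and tuning of the proposal width.
import math
import random

def metropolis(n_samples, step=2.0, seed=0):
    rng = random.Random(seed)
    x = 0.0
    samples = []
    for _ in range(n_samples):
        proposal = x + rng.uniform(-step, step)
        # Ratio of target densities: exp(-proposal**2/2) over exp(-x**2/2)
        accept = math.exp((x * x - proposal * proposal) / 2)
        if accept > rng.random():
            x = proposal
        samples.append(x)
    return samples
```

&lt;p&gt;With enough samples, the sample mean and variance settle near 0 and 1, as they should for the target distribution.&lt;/p&gt;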
&lt;p&gt;I &lt;a class="link" href="https://twitter.com/tdhopper/status/359642533950787585" target="_blank" rel="noopener"
&gt;asked on Twitter&lt;/a&gt; if anyone had a recommended reference that was readable and practical. Despite my love of measure theory, what I want is &lt;em&gt;Monte Carlo Methods for the &lt;a class="link" href="http://www.johndcook.com/veryappliedmath.html" target="_blank" rel="noopener"
&gt;Very Applied Mathematician&lt;/a&gt;&lt;/em&gt;, not a theoretical text.&lt;/p&gt;
&lt;p&gt;I got several recommendations. I&amp;rsquo;m not sure that any are exactly what I&amp;rsquo;m looking for, but I am certainly going to look deeper into them. Interestingly, most are Springer books.&lt;/p&gt;
&lt;p&gt;Several people recommended Glasserman&amp;rsquo;s &lt;a class="link" href="http://www.amazon.com/Financial-Engineering-Stochastic-Modelling-Probability/dp/0387004513" target="_blank" rel="noopener"
&gt;Monte Carlo Methods in Financial Engineering&lt;/a&gt;. I don&amp;rsquo;t work in the financial sector, so it&amp;rsquo;s hard for me to evaluate the table of contents to tell how well it generalizes.&lt;/p&gt;
&lt;p&gt;Someone else recommended both &lt;a class="link" href="http://www.amazon.com/dp/038787836X" target="_blank" rel="noopener"
&gt;Explorations in Monte Carlo Methods&lt;/a&gt; and &lt;a class="link" href="http://www.amazon.com/Handbook-Chapman-Handbooks-Statistical-Methods/dp/1420079417" target="_blank" rel="noopener"
&gt;Handbook of Markov Chain Monte Carlo&lt;/a&gt; for two levels of MCMC.&lt;/p&gt;
&lt;p&gt;Finally, I got a recommendation for &lt;a class="link" href="http://www.amazon.com/Introducing-Monte-Carlo-Methods-ebook/dp/B00440E0JO/ref=sr_1_1?s=books&amp;amp;ie=UTF8&amp;amp;qid=1374616614&amp;amp;sr=1-1&amp;amp;keywords=introducing&amp;#43;monte&amp;#43;carlo&amp;#43;methods&amp;#43;with&amp;#43;R" target="_blank" rel="noopener"
&gt;Introducing Monte Carlo Methods with R&lt;/a&gt;. This might be closest to what I&amp;rsquo;m looking for. It appears to cover a breadth of topics, and it includes lots of code.&lt;/p&gt;</description></item><item><title>Bandit Algorithms for Website Optimization</title><link>https://tdhopper.com/blog/bandit-algorithms-for-website-optimization/</link><pubDate>Tue, 29 Jan 2013 00:00:00 +0000</pubDate><guid>https://tdhopper.com/blog/bandit-algorithms-for-website-optimization/</guid><description>&lt;img src="https://tdhopper.com/images/mabandit.png" alt="Featured image of post Bandit Algorithms for Website Optimization" /&gt;&lt;p&gt;I had the honor of being a technical reviewer for John Myles White&amp;rsquo;s outstanding book, &amp;ldquo;&lt;a class="link" href="https://amzn.to/4g8G1K8" target="_blank" rel="noopener"
&gt;Bandit Algorithms for Website Optimization: Developing, Deploying, and Debugging&lt;/a&gt;.&amp;rdquo; I strongly suggest it to anyone curious about the subject. This book is an excellent start to understanding the field and provides practical advice on applying these algorithms effectively.&lt;/p&gt;
&lt;p&gt;&lt;a class="link" href="https://amzn.to/3SXD9a8" target="_blank" rel="noopener"
&gt;Check it out!&lt;/a&gt;&lt;/p&gt;</description></item><item><title>Introduction to Scikit-Learn</title><link>https://tdhopper.com/talks/scikit/</link><pubDate>Mon, 21 Jan 2013 00:00:00 +0000</pubDate><guid>https://tdhopper.com/talks/scikit/</guid><description>&lt;img src="https://tdhopper.com/preso.png" alt="Featured image of post Introduction to Scikit-Learn" /&gt;&lt;p&gt;I gave a talk at a recent Research Triangle Analysts meetup on scikit-learn, the excellent machine learning library for Python. You can see the &lt;a class="link" href="http://nbviewer.jupyter.org/format/slides/github/tdhopper/Research-Triangle-Analysts--Intro-to-scikit-learn/blob/master/Intro%20to%20Scikit-Learn.ipynb" target="_blank" rel="noopener"
&gt;IPython notebook that I presented from&lt;/a&gt;.&lt;/p&gt;
&lt;div class="video-wrapper"&gt;
&lt;iframe loading="lazy"
src="https://www.youtube.com/embed/2kx19t8bNMU"
allowfullscreen
title="YouTube Video"
&gt;
&lt;/iframe&gt;
&lt;/div&gt;</description></item><item><title>Bringing Operations Research into the 21st Century with Online Video</title><link>https://tdhopper.com/talks/orvideo/</link><pubDate>Sat, 01 Dec 2012 00:00:00 +0000</pubDate><guid>https://tdhopper.com/talks/orvideo/</guid><description>&lt;img src="https://tdhopper.com/preso.png" alt="Featured image of post Bringing Operations Research into the 21st Century with Online Video" /&gt;&lt;p&gt;I presented at INFORMS 2012 on &lt;a class="link" href="https://www.youtube.com/watch?v=0gfBH4mC_iU" target="_blank" rel="noopener"
&gt;Bringing Operations Research into the 21st Century with Online Video&lt;/a&gt;. You can see the recording on &lt;a class="link" href="https://www.youtube.com/watch?v=0gfBH4mC_iU" target="_blank" rel="noopener"
&gt;Youtube&lt;/a&gt;.&lt;/p&gt;
&lt;div class="video-wrapper"&gt;
&lt;iframe loading="lazy"
src="https://www.youtube.com/embed/0gfBH4mC_iU"
allowfullscreen
title="YouTube Video"
&gt;
&lt;/iframe&gt;
&lt;/div&gt;</description></item><item><title>Pickle and Redis</title><link>https://tdhopper.com/blog/pickle-and-redis/</link><pubDate>Mon, 22 Oct 2012 00:00:00 +0000</pubDate><guid>https://tdhopper.com/blog/pickle-and-redis/</guid><description>&lt;img src="https://tdhopper.com/images/pickle.png" alt="Featured image of post Pickle and Redis" /&gt;&lt;p&gt;I gave a talk at PyCarolinas 2012 about using Pickle and Redis to persist data with Python. It wasn&amp;rsquo;t recorded, but you can &lt;a class="link" href="http://nbviewer.jupyter.org/github/tdhopper/Pickle-and-Redis/blob/master/Pickle%20and%20Redis.ipynb" target="_blank" rel="noopener"
&gt;see the IPython notebook I presented from&lt;/a&gt;.&lt;/p&gt;</description></item><item><title>Operations Research and Computer Programming</title><link>https://tdhopper.com/blog/operations-research-and-computer-programming/</link><pubDate>Thu, 12 Jul 2012 00:00:00 +0000</pubDate><guid>https://tdhopper.com/blog/operations-research-and-computer-programming/</guid><description>&lt;p&gt;For part of my sophomore year of college, I was a computer science major. When I realized that I loved my CS theory courses while my classmates hated them, I decided to major in math instead. I enjoyed the programming classes enough, but programming is not what I wanted to spend my time doing.&lt;/p&gt;
&lt;p&gt;The summer after my junior year, I was accepted to a &lt;em&gt;math&lt;/em&gt; REU at Rochester Institute of Technology. The first thing my adviser &lt;a class="link" href="http://www.cs.rit.edu/~spr/" target="_blank" rel="noopener"
&gt;Stanislaw Radziszowski&lt;/a&gt; asked me was whether or not I could program! I spent the whole summer programming combinatorial graph theory-related algorithms in C&lt;sup id="fnref:1"&gt;&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref"&gt;1&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;
&lt;p&gt;Now I, like many of my operations research classmates, spend much of my time programming. Despite the importance of writing code for solving operations research problems, I am surprised how little programming is discussed. The &lt;a class="link" href="http://www.or.ncsu.edu/admission/" target="_blank" rel="noopener"
&gt;admissions page for my program&lt;/a&gt; says nothing about programming ability, but it is implicitly assumed that programming is a skill that students have.&lt;/p&gt;
&lt;p&gt;Moreover, I suspect the operations research-specific parts of the research behind many journal articles are only a fraction of the actual work done by the authors. Much of the required work is implementation and debugging of their algorithms. Yet, articles contain little-to-no discussion of the actual code. Even worse, the code is often not published or reviewed. I can only imagine how many coding errors underlie the results of peer-reviewed papers.&lt;/p&gt;
&lt;p&gt;&lt;a class="link" href="https://kuomarc.wordpress.com/2012/07/02/top-hackers-we-need-you-for-operations-research/" target="_blank" rel="noopener"
&gt;Marc Kuo recently blogged&lt;/a&gt; about how operations researchers need to get with the program (pun intended). His post kicked off tons of discussion in its &lt;a class="link" href="https://kuomarc.wordpress.com/2012/07/02/top-hackers-we-need-you-for-operations-research/#comments" target="_blank" rel="noopener"
&gt;comments&lt;/a&gt;, on &lt;a class="link" href="https://plus.google.com/110464871801965858778/posts/QDtHN7PcsCL" target="_blank" rel="noopener"
&gt;Google+&lt;/a&gt;, on &lt;a class="link" href="http://news.ycombinator.com/item?id=4191847" target="_blank" rel="noopener"
&gt;Hacker News&lt;/a&gt;, and on &lt;a class="link" href="http://www.or-exchange.com/questions/5783/which-programming-skillsexperience-are-essential-for-being-a-good-or-researcher-if-any?utm_medium=twitter&amp;amp;utm_source=twitterfeed" target="_blank" rel="noopener"
&gt;OR-Exchange&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;This discussion came at a good time for me. I&amp;rsquo;m in the middle of my first big coding project of my PhD research. Despite completing a computer science minor and spending two summers doing nothing but coding, I never learned good software engineering practices. I decided at the beginning of the summer to force myself not to just write this code to get the job done but to write &lt;em&gt;good&lt;/em&gt; code.&lt;/p&gt;
&lt;p&gt;To start, I finally started using &lt;a class="link" href="http://git-scm.com/" target="_blank" rel="noopener"
&gt;git&lt;/a&gt; and &lt;a class="link" href="https://github.com/" target="_blank" rel="noopener"
&gt;github&lt;/a&gt; for version control. I have tried several times before, but I have always found it rather confusing&lt;sup id="fnref:2"&gt;&lt;a href="#fn:2" class="footnote-ref" role="doc-noteref"&gt;2&lt;/a&gt;&lt;/sup&gt;. &lt;a class="link" href="http://www.vogella.com/articles/Git/article.html" target="_blank" rel="noopener"
&gt;This git tutorial&lt;/a&gt; finally got me over the hump. Now I can easily branch my code into different versions, and I have the ability to go back to old versions when I screw something up.&lt;/p&gt;
&lt;p&gt;Second, I started teaching myself about &lt;a class="link" href="http://en.wikipedia.org/wiki/Unit_testing" target="_blank" rel="noopener"
&gt;unit testing&lt;/a&gt;: Code testing was never mentioned in any of my classes in college, and I never hear operations researchers talk about it. Again, I have no doubt that the code behind much published work is full of mistakes. Operations researchers need good testing practices.&lt;sup id="fnref:3"&gt;&lt;a href="#fn:3" class="footnote-ref" role="doc-noteref"&gt;3&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
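&lt;p&gt;One pattern I&amp;rsquo;ve found for testing stochastic code is to fix the random seed and assert that results land within a statistically generous tolerance. A hypothetical example, a Monte Carlo estimate of pi rather than code from my research:&lt;/p&gt;

```python
# Unit-testing a stochastic algorithm: seed the generator for
# reproducibility and use a tolerance several standard deviations wide.
import math
import random

def estimate_pi(n, seed=0):
    """Monte Carlo estimate of pi from points in the unit square."""
    rng = random.Random(seed)
    hits = sum(
        1 for _ in range(n)
        if 1.0 >= rng.random() ** 2 + rng.random() ** 2
    )
    return 4 * hits / n

def test_estimate_pi():
    # Standard error at n=100000 is about 0.005, so 0.05 is roomy:
    # the test is deterministic with the seed and not flaky without it.
    assert math.isclose(estimate_pi(100_000), math.pi, abs_tol=0.05)

test_estimate_pi()
```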
&lt;p&gt;Third, I&amp;rsquo;m trying to write clean, object-oriented, well-commented code. My intention is to publish this code on &lt;a class="link" href="https://github.com/tdhopper" target="_blank" rel="noopener"
&gt;github&lt;/a&gt; when the corresponding paper is published. I want my results to be easily reproducible by others and open to scrutiny. I would also like my code to be reusable for future research. My design patterns might not be quite there yet, but I&amp;rsquo;m trying to move in that direction.&lt;/p&gt;
&lt;p&gt;I realized that I&amp;rsquo;ve used the word &lt;em&gt;I&lt;/em&gt; as much as a Stephen Wolfram blog post. I have no desire to toot my own horn here; I&amp;rsquo;m just thankful this conversation is happening, and I want to continue it. Good software is crucial to good operations research (both in the academy and out), and yet academic operations researchers, in my experience, talk very little about good software engineering practices. We can do better.&lt;/p&gt;
&lt;div class="footnotes" role="doc-endnotes"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:1"&gt;
&lt;p&gt;I&amp;rsquo;m eternally indebted to my brilliant research partner Evan who taught me how to use bash, vim, and subversion, among other things.&amp;#160;&lt;a href="#fnref:1" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:2"&gt;
&lt;p&gt;I feel vindicated by a &lt;a class="link" href="http://news.ycombinator.com/item?id=4199772" target="_blank" rel="noopener"
&gt;recent thread on Hacker News&lt;/a&gt;.&amp;#160;&lt;a href="#fnref:2" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:3"&gt;
&lt;p&gt;Incidentally, here&amp;rsquo;s &lt;a class="link" href="http://www.quora.com/How-do-people-write-automated-tests-for-their-stochastic-algorithms" target="_blank" rel="noopener"
&gt;an interesting Quora thread&lt;/a&gt; about testing stochastic algorithms.&amp;#160;&lt;a href="#fnref:3" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</description></item><item><title>Notes on the Derivation of Least Squares Policy Iteration</title><link>https://tdhopper.com/blog/notes-on-the-derivation-of-least-squares-policy-iteration/</link><pubDate>Mon, 18 Jun 2012 00:00:00 +0000</pubDate><guid>https://tdhopper.com/blog/notes-on-the-derivation-of-least-squares-policy-iteration/</guid><description>&lt;img src="https://tdhopper.com/images/lspi.png" alt="Featured image of post Notes on the Derivation of Least Squares Policy Iteration" /&gt;&lt;p&gt;&lt;a class="link" href="https://tdhopper.com/uploads/lspi.pdf" &gt;Here are my notes&lt;/a&gt; on the derivation of the Least Squares Policy Iteration (LSPI) algorithm. The notes are based on the &lt;a class="link" href="https://users.cs.duke.edu/~parr/jmlr03.pdf" target="_blank" rel="noopener"
&gt;original paper by Lagoudakis and Parr&lt;/a&gt;.&lt;/p&gt;
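&lt;p&gt;As a pointer for anyone skimming the notes: as I read the paper, the heart of LSPI is the LSTDQ step, which solves for the weights $w$ of a linear Q-function approximation $\hat{Q}(s,a) = \phi(s,a)^{\top} w$ via $Aw = b$, where, for samples $(s, a, r, s')$ and current policy $\pi$,&lt;/p&gt;
$$A = \sum_{(s,a,r,s')} \phi(s,a) \left( \phi(s,a) - \gamma\, \phi(s', \pi(s')) \right)^{\top}, \qquad b = \sum_{(s,a,r,s')} \phi(s,a)\, r.$$
&lt;p&gt;Policy iteration then alternates this solve with the greedy update $\pi(s) = \arg\max_a \phi(s,a)^{\top} w$.&lt;/p&gt;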
&lt;script type="text/javascript" src= '/js/pdf-js/build/pdf.js'&gt;&lt;/script&gt;
&lt;style&gt;
#embed-pdf-container {
position: relative;
width: 100%;
height: auto;
min-height: 20vh;
}
.pdf-canvas {
border: 1px solid black;
direction: ltr;
width: 100%;
height: auto;
display: none;
}
#the-canvas {
border: 1px solid black;
direction: ltr;
width: 100%;
height: auto;
display: none;
}
.pdf-loadingWrapper {
display: none;
justify-content: center;
align-items: center;
width: 100%;
height: 350px;
}
.pdf-loading {
display: inline-block;
width: 50px;
height: 50px;
border: 3px solid #d2d0d0;
border-radius: 50%;
border-top-color: #383838;
animation: spin 1s ease-in-out infinite;
-webkit-animation: spin 1s ease-in-out infinite;
}
#overlayText {
word-wrap: break-word;
display: grid;
justify-content: end;
}
#overlayText a {
position: relative;
top: 10px;
right: 4px;
color: #000;
margin: auto;
background-color: #eeeeee;
padding: 0.3em 1em;
border: solid 2px;
border-radius: 12px;
border-color: #00000030;
text-decoration: none;
}
#overlayText svg {
height: clamp(1em, 2vw, 1.4em);
width: clamp(1em, 2vw, 1.4em);
}
@keyframes spin {
to { -webkit-transform: rotate(360deg); }
}
@-webkit-keyframes spin {
to { -webkit-transform: rotate(360deg); }
}
&lt;/style&gt;&lt;div class="embed-pdf-container" id="embed-pdf-container-8750537f"&gt;
&lt;div class="pdf-loadingWrapper" id="pdf-loadingWrapper-8750537f"&gt;
&lt;div class="pdf-loading" id="pdf-loading-8750537f"&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div id="overlayText"&gt;
&lt;a href="https://tdhopper.com/uploads/lspi.pdf" aria-label="Download" download&gt;
&lt;svg aria-hidden="true" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 18 18"&gt;
&lt;path d="M9 13c.3 0 .5-.1.7-.3L15.4 7 14 5.6l-4 4V1H8v8.6l-4-4L2.6 7l5.7 5.7c.2.2.4.3.7.3zm-7 2h14v2H2z" /&gt;
&lt;/svg&gt;
&lt;/a&gt;
&lt;/div&gt;
&lt;canvas class="pdf-canvas" id="pdf-canvas-8750537f"&gt;&lt;/canvas&gt;
&lt;/div&gt;
&lt;div class="pdf-paginator" id="pdf-paginator-8750537f"&gt;
&lt;button id="pdf-prev-8750537f"&gt;Previous&lt;/button&gt;
&lt;button id="pdf-next-8750537f"&gt;Next&lt;/button&gt; &amp;nbsp; &amp;nbsp;
&lt;span&gt;
&lt;span class="pdf-pagenum" id="pdf-pagenum-8750537f"&gt;&lt;/span&gt; / &lt;span class="pdf-pagecount" id="pdf-pagecount-8750537f"&gt;&lt;/span&gt;
&lt;/span&gt;
&lt;a class="pdf-source" id="pdf-source-8750537f" href="https://tdhopper.com/uploads/lspi.pdf"&gt;[pdf]&lt;/a&gt;
&lt;/div&gt;
&lt;noscript&gt;
View the PDF file &lt;a class="pdf-source" id="pdf-source-noscript-8750537f" href="https://tdhopper.com/uploads/lspi.pdf"&gt;here&lt;/a&gt;.
&lt;/noscript&gt;
&lt;script type="text/javascript"&gt;
(function(){
var url = '\/uploads\/lspi.pdf';
var hidePaginator = "" === "true";
var hideLoader = "" === "true";
var selectedPageNum = parseInt("") || 1;
var pdfjsLib = window['pdfjs-dist/build/pdf'];
if (pdfjsLib.GlobalWorkerOptions.workerSrc == '')
pdfjsLib.GlobalWorkerOptions.workerSrc = "https:\/\/tdhopper.com\/" + '/js/pdf-js/build/pdf.worker.js';
var pdfDoc = null,
pageNum = selectedPageNum,
pageRendering = false,
pageNumPending = null,
scale = 3,
canvas = document.getElementById('pdf-canvas-8750537f'),
ctx = canvas.getContext('2d'),
paginator = document.getElementById("pdf-paginator-8750537f"),
loadingWrapper = document.getElementById('pdf-loadingWrapper-8750537f');
showPaginator();
showLoader();
function renderPage(num) {
pageRendering = true;
pdfDoc.getPage(num).then(function(page) {
var viewport = page.getViewport({scale: scale});
canvas.height = viewport.height;
canvas.width = viewport.width;
var renderContext = {
canvasContext: ctx,
viewport: viewport
};
var renderTask = page.render(renderContext);
renderTask.promise.then(function() {
pageRendering = false;
showContent();
if (pageNumPending !== null) {
renderPage(pageNumPending);
pageNumPending = null;
}
});
});
document.getElementById('pdf-pagenum-8750537f').textContent = num;
}
function showContent() {
loadingWrapper.style.display = 'none';
canvas.style.display = 'block';
}
function showLoader() {
if(hideLoader) return
loadingWrapper.style.display = 'flex';
canvas.style.display = 'none';
}
function showPaginator() {
if(hidePaginator) return
paginator.style.display = 'block';
}
function queueRenderPage(num) {
if (pageRendering) {
pageNumPending = num;
} else {
renderPage(num);
}
}
function onPrevPage() {
if (pageNum &lt;= 1) {
return;
}
pageNum--;
queueRenderPage(pageNum);
}
document.getElementById('pdf-prev-8750537f').addEventListener('click', onPrevPage);
function onNextPage() {
if (pageNum &gt;= pdfDoc.numPages) {
return;
}
pageNum++;
queueRenderPage(pageNum);
}
document.getElementById('pdf-next-8750537f').addEventListener('click', onNextPage);
pdfjsLib.getDocument(url).promise.then(function(pdfDoc_) {
pdfDoc = pdfDoc_;
var numPages = pdfDoc.numPages;
document.getElementById('pdf-pagecount-8750537f').textContent = numPages;
if(pageNum &gt; numPages) {
pageNum = numPages
}
renderPage(pageNum);
});
})();
&lt;/script&gt;</description></item><item><title>Calculus Haikus and Limericks</title><link>https://tdhopper.com/blog/calculus-haikus-and-limericks/</link><pubDate>Tue, 08 May 2012 00:00:00 +0000</pubDate><guid>https://tdhopper.com/blog/calculus-haikus-and-limericks/</guid><description>&lt;img src="https://tdhopper.com/images/poems.png" alt="Featured image of post Calculus Haikus and Limericks" /&gt;&lt;p&gt;I offered my calculus students bonus points to write a limerick or haiku on their final exam. I got some great answers! And the students seemed to enjoy it. My favorite limerick (on improper integrals):&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;An integral has bounds from zero to one&lt;br /&gt;
You finish the problem and think you are done&lt;br /&gt;
Then your mind double checks&lt;br /&gt;
The equation is one over x&lt;br /&gt;
And the improper integral has a different sum&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;My favorite haiku gives the formula for integration by parts:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Int of mu dv:&lt;br /&gt;
Equal to mu v minus&lt;br /&gt;
Int of v d mu&lt;/p&gt;
&lt;/blockquote&gt;
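&lt;p&gt;In symbols, the formula the haiku encodes (reading &amp;ldquo;mu&amp;rdquo; as the usual u):&lt;/p&gt;

```latex
\int u \, dv = uv - \int v \, du
```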
&lt;p&gt;Here are some other good ones:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Doing Integrals… &lt;br /&gt;
Oh How I shall miss thee, &lt;br /&gt;
Calc! Goodbye, It&amp;rsquo;s been real?&lt;/p&gt;
&lt;p&gt;Calculus is hard &lt;br /&gt;
Derivative and limit &lt;br /&gt;
¡Yo no se hombre!&lt;/p&gt;
&lt;p&gt;MacLaurin Series &lt;br /&gt;
is an infinite series &lt;br /&gt;
Centered at zero&lt;/p&gt;
&lt;p&gt;Calculus is great&lt;br /&gt;
I&amp;rsquo;m sure I&amp;rsquo;ll use this in life&lt;br /&gt;
Or maybe I won&amp;rsquo;t&lt;/p&gt;
&lt;p&gt;Studying &amp;rsquo;til 5&lt;br /&gt;
And almost sleeping thru test&lt;br /&gt;
Was a real bad call&lt;/p&gt;
&lt;p&gt;Finding integrals&lt;br /&gt;
Is used to find area&lt;br /&gt;
Underneath the curve&lt;/p&gt;
&lt;p&gt;Calc is fun to do&lt;br /&gt;
If you like to integrate&lt;br /&gt;
x from a to b&lt;/p&gt;
&lt;p&gt;Calculus two, sigh&lt;br /&gt;
Why must you torment me so?&lt;br /&gt;
I thought we were friends&lt;/p&gt;
&lt;p&gt;F of e to x&lt;br /&gt;
Derived is e to the x&lt;br /&gt;
Lone function like it&lt;/p&gt;
&lt;p&gt;I like calculus&lt;br /&gt;
The feeling is mutual&lt;br /&gt;
But not all the time&lt;/p&gt;
&lt;p&gt;Integrals are fun&lt;br /&gt;
u-sub can be tricky… yes!&lt;br /&gt;
Don&amp;rsquo;t forget plus C&lt;/p&gt;
&lt;p&gt;Calculus is hard&lt;br /&gt;
Eight A.M. is too early&lt;br /&gt;
Had a good time though&lt;/p&gt;
&lt;p&gt;The ratio test:&lt;br /&gt;
If the limit equals one&lt;br /&gt;
Use another test&lt;/p&gt;
&lt;p&gt;Does this series telescope?&lt;br /&gt;
For my sake, I truly hope&lt;br /&gt;
Comparisons are icky,&lt;br /&gt;
Integrals make me sickly,&lt;br /&gt;
Turns out the answer is nope. :-(&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Here are a few from when I taught at UVA:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;An integral is&lt;br /&gt;
A derivative reversed&lt;br /&gt;
Don’t forget constants!&lt;/p&gt;
&lt;p&gt;Thanks to L’Hopital&lt;br /&gt;
We can use derivatives&lt;br /&gt;
to find a limit!&lt;/p&gt;
&lt;p&gt;Find the area,&lt;br /&gt;
Between the two stated curves,&lt;br /&gt;
Using integrals.&lt;/p&gt;
&lt;p&gt;Natural log of Zero,&lt;br /&gt;
does not exist, but&lt;br /&gt;
Ln of 1 doesn&amp;rsquo;t&lt;/p&gt;
&lt;p&gt;Calculus is great&lt;br /&gt;
But only if taught by&lt;br /&gt;
Tim At U of V.A.&lt;/p&gt;
&lt;/blockquote&gt;</description></item><item><title>Teaching Students to Fail</title><link>https://tdhopper.com/blog/teaching-students-to-fail/</link><pubDate>Fri, 13 Apr 2012 00:00:00 +0000</pubDate><guid>https://tdhopper.com/blog/teaching-students-to-fail/</guid><description>&lt;img src="https://tdhopper.com/images/failing.png" alt="Featured image of post Teaching Students to Fail" /&gt;&lt;p&gt;I am teaching the dreaded calculus II this semester. I&amp;rsquo;ve known many students who flew through calc I in college (having taken calculus in high school) only to receive a reality check from calc II the next semester. In the US, calc II often involves a significant section on &amp;ldquo;techniques of integration&amp;rdquo; where students learn techniques such as partial fractions, trig substitutions, integration by parts. Unlike much of differential calculus, which is taught in calc I, and unlike much of the math taught before college, integration is harder to do algorithmically. That is, a calc II professor cannot simply outline surefire steps guaranteed to give an antiderivative for any function. The inimitable Robert Ghrist explains it this way in his &amp;ldquo;&lt;a class="link" href="http://www.math.upenn.edu/~ghrist/FLCT/" target="_blank" rel="noopener"
&gt;funny little calculus text&lt;/a&gt;&amp;rdquo;:&lt;/p&gt;
&lt;p&gt;&lt;a class="link" href="https://tdhopper.com/uploads/2012/01/Screen-Shot-2012-01-10-at-5.09.27-PM.png" &gt;&lt;figure&gt;&lt;img src="https://tdhopper.com/uploads/2012/01/Screen-Shot-2012-01-10-at-5.09.27-PM.png"&gt;
&lt;/figure&gt;
&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Of course, algorithms for computing many antiderivatives do exist (and are used in Maple, Mathematica, and Wolfram Alpha), but I&amp;rsquo;d be fired if I tried to take my undergrads through &lt;em&gt;Symbolic Integration I: Transcendental Functions&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;Instead, calculus II teachers teach a handful of methods and attempt to teach students intuition for where to use what technique. Even more important, I try to teach my students the skill of trying a method, seeing that the method does not work, then trying something else. Try-fail-try-fail-try again. I do not think that high school students learn that skill—a skill vital to success in calc II and every discipline requiring analytical problem-solving. Yesterday, my adviser and I were discussing the first big research problem that I&amp;rsquo;ll be tackling this summer. He noted that our first attempt at solving a massive problem would probably fail; they usually do.&lt;/p&gt;
&lt;p&gt;Fortunately, I&amp;rsquo;ve been failing at solving problems, at least since taking number theory with Dan Krider in 2003. I know what Edison meant by &amp;ldquo;I have not failed. I&amp;rsquo;ve just found 10,000 ways that won&amp;rsquo;t work.&amp;rdquo; I hope my students are learning how to fail and how to try again. However, I think that kids need to learn this earlier. High school assignments should not be set up for students to succeed the first time, every time. Somehow, teachers need to allow students to take risks, learn from their mistakes, and rebound. I&amp;rsquo;d love to hear feedback from students who are learning these lessons and teachers trying to teach them.&lt;/p&gt;</description></item><item><title>Average Area of a Random Hull</title><link>https://tdhopper.com/blog/average-area-of-a-random-hull/</link><pubDate>Thu, 05 Apr 2012 00:00:00 +0000</pubDate><guid>https://tdhopper.com/blog/average-area-of-a-random-hull/</guid><description>&lt;p&gt;Yesterday, someone on MathOverflow asked&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Consider $n$ points generated randomly and uniformly on a unit square. What is the expected value of the area (as a function of $n$) enclosed by the convex hull of the set of points?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Someone quickly cited a &lt;a class="link" href="http://www.math.kth.se/~johanph/area12.pdf" target="_blank" rel="noopener"
&gt;2004 paper&lt;/a&gt; that provides an analytical result for the cases where $n=3$ and $n=4$:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;For $n=3$ the expected value is $11/144$ and for $n=4$ it is $11/72$.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This is certainly a nontrivial result. However, the value can be approximated by generating a large number of random points, finding the area of the convex hull, and averaging the areas. Of course, finding the convex hull of a set of points and its area requires a little work. Mathematica provides functions for quickly generating random points and finding the area of their convex hull. As a result, I was able to perform a Monte Carlo simulation for the $n=3$ and $n=4$ cases in a couple of lines of Mathematica code:&lt;/p&gt;
&lt;p&gt;&lt;a class="link" href="https://tdhopper.com/uploads/2012/04/Screen-Shot-2012-04-04-at-11.29.04-AM.png" &gt;&lt;figure&gt;&lt;img src="https://tdhopper.com/uploads/2012/04/Screen-Shot-2012-04-04-at-11.29.04-AM.png"&gt;
&lt;/figure&gt;
&lt;/a&gt;&lt;/p&gt;
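&lt;p&gt;The same Monte Carlo estimate can be sketched outside Mathematica. Here is a minimal Python version (my own stand-in for the Mathematica one-liner above; the function names are illustrative), using Andrew&amp;rsquo;s monotone chain for the hull and the shoelace formula for its area:&lt;/p&gt;

```python
import random

def convex_hull_area(points):
    # Andrew's monotone chain builds the hull; the shoelace formula gives its area.
    pts = sorted(points)
    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])
    lower, upper = [], []
    for chain, seq in ((lower, pts), (upper, reversed(pts))):
        for p in seq:
            # Pop while the last two hull points and p fail to make a left turn.
            while len(chain) >= 2 and not cross(chain[-2], chain[-1], p) > 0:
                chain.pop()
            chain.append(p)
    hull = lower[:-1] + upper[:-1]
    n = len(hull)
    return abs(sum(hull[i][0] * hull[(i + 1) % n][1] - hull[(i + 1) % n][0] * hull[i][1]
                   for i in range(n))) / 2

def expected_hull_area(num_points, trials=20000, seed=0):
    # Average the hull area over many uniform samples from the unit square.
    rng = random.Random(seed)
    return sum(
        convex_hull_area([(rng.random(), rng.random()) for _ in range(num_points)])
        for _ in range(trials)
    ) / trials
```

&lt;p&gt;With a few tens of thousands of trials, the estimates land close to the analytical values 11/144 and 11/72.&lt;/p&gt;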
&lt;p&gt;Sampling 5000 cases for each returned results fairly close to the predicted average.&lt;/p&gt;</description></item><item><title>My Email Analytics</title><link>https://tdhopper.com/blog/my-email-analytics/</link><pubDate>Thu, 05 Apr 2012 00:00:00 +0000</pubDate><guid>https://tdhopper.com/blog/my-email-analytics/</guid><description>&lt;p&gt;Last month, Stephen Wolfram did a blog post on the &lt;a class="link" href="http://blog.stephenwolfram.com/2012/03/the-personal-analytics-of-my-life/" target="_blank" rel="noopener"
&gt;Personal Analytics&lt;/a&gt; of his life. For years, he&amp;rsquo;s recorded every phone call, keyboard stroke, email, and step. He made beautiful graphs to show his activity over the years. A Wolfram Alpha developer just posted a &lt;a class="link" href="http://blog.wolfram.com/2012/04/05/analyzing-your-email-with-mathematica/" target="_blank" rel="noopener"
&gt;Mathematica notebook&lt;/a&gt; on the Wolfram blog allowing anyone to do the same email analysis that Wolfram did with any IMAP email account. Of course, I dropped what I was doing to try it out with my Gmail account. At first, it failed to finish processing my incoming email because the JVM ran out of memory. It took me a while to figure out how to tell JLink to let Java have more memory&lt;sup id="fnref:1"&gt;&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref"&gt;1&lt;/a&gt;&lt;/sup&gt;. Here&amp;rsquo;s a plot of emails I sent from Gmail over the last six years:&lt;/p&gt;
&lt;figure&gt;&lt;img src="https://tdhopper.com/uploads/2012/04/blog-post-email-code.png"&gt;
&lt;/figure&gt;
&lt;p&gt;I started using Gmail regularly after I graduated from college in 2008 (once my college Exchange-based email account was gone). My emailing was noticeably sparse from May 2008-May 2009. During that time I was a teacher, and I didn&amp;rsquo;t spend nearly as much time on a computer as I do now. You can also see a gap during the summer of 2011. I was working at Kiva Systems at the time and primarily used my company email. On the horizontal axis, you notice I&amp;rsquo;m pretty silent between 11 PM and 7 AM. I need my sleep, and I never work at night! My emailing is light from 6-9 PM too. Here&amp;rsquo;s a graph of my email received over the past six years. It comes in pretty heavy from 8 AM to midnight!&lt;/p&gt;
&lt;figure&gt;&lt;img src="https://tdhopper.com/uploads/2012/04/blog-post-email-code1.png"&gt;
&lt;/figure&gt;
&lt;p&gt;The thick line just under 6 AM is the Google Calendar email updates I used to get every morning. I stopped getting those once I got an iPod Touch this past Christmas. I can&amp;rsquo;t remember what email used to come at 3 AM for a few years. This next graphic shows the average number of emails I receive per day for each month. My email volume ramped up once I started using Gmail regularly. Notice the downward trend in incoming emails recently: I&amp;rsquo;ve been unsubscribing from unnecessary mailing lists and circulars. My emailing had a pretty serious peak last May right before I moved to Boston. Not sure why.&lt;/p&gt;
&lt;figure&gt;&lt;img src="https://tdhopper.com/uploads/2012/04/Screen-Shot-2012-04-05-at-2.17.28-PM.png"&gt;
&lt;/figure&gt;
&lt;p&gt;Here we have a histogram of the time at which I send emails. Apparently I&amp;rsquo;m most likely to send an email just after 10 PM. I wouldn&amp;rsquo;t have guessed that. Don&amp;rsquo;t expect to hear from me after midnight!&lt;/p&gt;
&lt;figure&gt;&lt;img src="https://tdhopper.com/uploads/2012/04/blog-post-email-code3.png"&gt;
&lt;/figure&gt;
&lt;p&gt;As a good operations researcher, I wondered if I received email according to a &lt;a class="link" href="http://en.wikipedia.org/wiki/Poisson_process" target="_blank" rel="noopener"
&gt;Poisson process&lt;/a&gt;. I pulled the email timestamp data into R. I get email pretty steadily between 8 AM and 10 PM. I looked at the emails that arrived in that interval since September 2011. The mean interarrival time is 0.53 hours. The standard deviation is 0.92. If it were a Poisson process, interarrival times would be exponentially distributed, and the mean and standard deviation would be equal. Below is a histogram of the interarrival times of my emails. The red line is an exponential distribution with the rate set to 1 over the mean interarrival time of my emails. It&amp;rsquo;s not a terrible fit!&lt;sup id="fnref:2"&gt;&lt;a href="#fn:2" class="footnote-ref" role="doc-noteref"&gt;2&lt;/a&gt;&lt;/sup&gt; One reason the interarrival times might not be exponential is that I frequently have back-and-forth email conversations with people, which skews the distribution towards short interarrival times. I might do some more statistics later, but I have homework to do.&lt;/p&gt;
&lt;figure&gt;&lt;img src="https://tdhopper.com/uploads/2012/04/Screen-Shot-2012-04-05-at-4.01.35-PM.png"&gt;
&lt;/figure&gt;
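&lt;p&gt;The check is easy to sketch in Python (illustrative code, not the R I actually used): under a Poisson process, the interarrival gaps are exponential, so their sample mean and standard deviation should roughly agree.&lt;/p&gt;

```python
import random
import statistics

def interarrival_stats(timestamps):
    # Sort the arrival times (in hours), difference them, then summarize the gaps.
    ts = sorted(timestamps)
    gaps = [b - a for a, b in zip(ts, ts[1:])]
    return statistics.mean(gaps), statistics.pstdev(gaps)

# Simulate a Poisson process with a mean gap of 0.53 hours, matching my data.
rng = random.Random(0)
arrivals, t = [], 0.0
for _ in range(5000):
    t += rng.expovariate(1 / 0.53)  # exponential gaps; rate = 1 / mean
    arrivals.append(t)

mean_gap, std_gap = interarrival_stats(arrivals)
# For the simulated process, mean and std are close; for my real inbox,
# mean = 0.53 but std = 0.92, so the exponential fit is imperfect.
```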
&lt;div class="footnotes" role="doc-endnotes"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:1"&gt;
&lt;p&gt;&lt;code&gt;ReinstallJava[CommandLine -&amp;gt; "java", JVMArguments -&amp;gt; "-Xmx4024m"]&lt;/code&gt;&amp;#160;&lt;a href="#fnref:1" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:2"&gt;
&lt;p&gt;That&amp;rsquo;s not an official statistical statement!&amp;#160;&lt;a href="#fnref:2" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</description></item><item><title>Operations Research, Machine Learning, and Optimization</title><link>https://tdhopper.com/blog/operations-research-machine-learning-and-optimization/</link><pubDate>Tue, 03 Apr 2012 00:00:00 +0000</pubDate><guid>https://tdhopper.com/blog/operations-research-machine-learning-and-optimization/</guid><description>&lt;p&gt;Over the past 18 months, I&amp;rsquo;ve been slowly learning some machine learning. One thing I&amp;rsquo;ve noticed is that most of the math in machine learning is optimization. Regression is typically minimization of some error term. Support vector machines are a quadratic optimization problem with linear constraints. Learning a neural network is &lt;em&gt;simply&lt;/em&gt; solving a nonconvex optimization problem. Clustering often takes the form of expectation-maximization. I&amp;rsquo;m currently studying Bayesian network structure learning, which is an extremely difficult combinatorial optimization problem.&lt;/p&gt;
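&lt;p&gt;To make the &amp;ldquo;regression is optimization&amp;rdquo; point concrete, here is a minimal sketch (my own illustration, not from any library): fitting a line by running gradient descent on the squared-error objective.&lt;/p&gt;

```python
def fit_line(xs, ys, lr=0.01, steps=5000):
    # Minimize (1/n) * sum((m*x + b - y)^2) over the slope m and intercept b.
    m, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Partial derivatives of the mean squared error.
        gm = sum(2 * (m * x + b - y) * x for x, y in zip(xs, ys)) / n
        gb = sum(2 * (m * x + b - y) for x, y in zip(xs, ys)) / n
        m -= lr * gm
        b -= lr * gb
    return m, b
```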
&lt;p&gt;Yesterday on Twitter, I commented that I am surprised at how little operations research people and machine learning people talk. Most of the math of OR is, like machine learning, optimization. All the same &lt;a class="link" href="http://en.wikipedia.org/wiki/Karush%E2%80%93Kuhn%E2%80%93Tucker_conditions" target="_blank" rel="noopener"
&gt;theorems&lt;/a&gt; apply, and we use many of the same &lt;a class="link" href="http://en.wikipedia.org/wiki/Newton%27s_method" target="_blank" rel="noopener"
&gt;algorithms&lt;/a&gt;; we just apply them in different ways. I got helpful feedback from the nerds that follow me. &lt;a class="link" href="https://twitter.com/#!/jefflinderoth" target="_blank" rel="noopener"
&gt;Jeff Linderoth&lt;/a&gt; pointed to the recent book &lt;a class="link" href="http://mitpress.mit.edu/catalog/item/default.asp?ttype=2&amp;amp;tid=12674" target="_blank" rel="noopener"
&gt;Optimization for Machine Learning&lt;/a&gt; by his colleague Stephen J. Wright (et al.) at the University of Wisconsin, Madison. From what I can tell, Wright is an OR guy in a computer scientist&amp;rsquo;s clothing. There&amp;rsquo;s a &lt;a class="link" href="http://videolectures.net/nips2010_wright_oaml/" target="_blank" rel="noopener"
&gt;two-hour lecture&lt;/a&gt; by Wright on the same topic that I look forward to watching. Jeff also pointed to the work of his colleague &lt;a class="link" href="http://pages.cs.wisc.edu/~brecht/publications.html" target="_blank" rel="noopener"
&gt;Ben Recht&lt;/a&gt; who&amp;rsquo;s looking at the optimization problems in ML from a theoretical standpoint. &lt;/p&gt;
&lt;p&gt;&lt;a class="link" href="https://twitter.com/#!/pykpyk" target="_blank" rel="noopener"
&gt;Paul Kerl&lt;/a&gt; linked to &lt;a class="link" href="http://users.eecs.northwestern.edu/~nocedal/publications.html" target="_blank" rel="noopener"
&gt;Jorge Nocedal&lt;/a&gt;&amp;rsquo;s work at Northwestern. Nocedal and Recht seem to have feet in both worlds. &lt;a class="link" href="https://twitter.com/johnmyleswhite" target="_blank" rel="noopener"
&gt;John Myles White&lt;/a&gt; noted that the legendary optimizer Stephen Boyd came to the New York Academy of Science&amp;rsquo;s Machine Learning event last year. I also came across a 2006 paper on &lt;a class="link" href="http://jmlr.csail.mit.edu/papers/volume7/MLOPT-intro06a/MLOPT-intro06a.pdf" target="_blank" rel="noopener"
&gt;The Interplay of Optimization and Machine Learning Research&lt;/a&gt;. The authors note some differences between the OR and ML perspectives on optimization:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;We observe that the qualities of good optimization algorithms from the machine learning and optimization perspectives can be quite different. Mathematical programming puts a premium on accuracy, speed, and robustness. Since generalization is the bottom line in machine learning and training is normally done off-line, accuracy and small speed improvements are of little concern in machine learning. Machine learning prefers simpler algorithms that work in reasonable computational time for specific classes of problems&lt;sup id="fnref:1"&gt;&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref"&gt;1&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;A bigger question might be where optimization lies as a discipline. Since I&amp;rsquo;ve been in OR, I&amp;rsquo;ve always considered optimization as a subfield of OR. But as I read applied OR literature, I find it jarring to see the details of &lt;em&gt;solving&lt;/em&gt; a difficult optimization problem mixed with the application of the solution to a real world problem. Of course, both ML and OR require practitioners to understand how the algorithms work. Optimization problems are hard, and a black box solution rarely works for any of us. But perhaps optimization will become a field of its own that OR and ML can both feed from instead of the two working independently.&lt;/p&gt;
&lt;div class="footnotes" role="doc-endnotes"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:1"&gt;
&lt;p&gt;I haven&amp;rsquo;t read the whole paper (that&amp;rsquo;s from the abstract), but I&amp;rsquo;m not entirely convinced that is true. Modern machine learning often requires large-scale problems to be solved quickly online, while optimizers often solve a problem offline and speed is negotiable.&amp;#160;&lt;a href="#fnref:1" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</description></item><item><title>Stephen Wolfram's AMA</title><link>https://tdhopper.com/blog/stephen-wolframs-ama/</link><pubDate>Mon, 05 Mar 2012 00:00:00 +0000</pubDate><guid>https://tdhopper.com/blog/stephen-wolframs-ama/</guid><description>&lt;p&gt;Stephen Wolfram, of Wolfram Research and Mathematica fame, did a &lt;a class="link" href="http://www.reddit.com/r/IAmA/comments/qisot/im_stephen_wolfram_mathematica_nks_wolframalpha/" target="_blank" rel="noopener"
&gt;Q&amp;amp;A (i.e. AMA) on Reddit&lt;/a&gt; today. I just enjoyed reading through his answers. A few interesting answers stood out to me.&lt;/p&gt;
&lt;p&gt;Someone asked Wolfram&amp;rsquo;s opinion on P=NP. &lt;a class="link" href="http://www.reddit.com/r/IAmA/comments/qisot/im_stephen_wolfram_mathematica_nks_wolframalpha/c3xy53q" target="_blank" rel="noopener"
&gt;He thinks it&amp;rsquo;s undecidable&lt;/a&gt;&lt;sup id="fnref:1"&gt;&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref"&gt;1&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;
&lt;p&gt;Some smart aleck &lt;a class="link" href="http://www.reddit.com/r/IAmA/comments/qisot/im_stephen_wolfram_mathematica_nks_wolframalpha/c3xxmie" target="_blank" rel="noopener"
&gt;threw the Riemann hypothesis at him&lt;/a&gt;. Interestingly, Wolfram also suspects this is undecidable&lt;sup id="fnref:2"&gt;&lt;a href="#fn:2" class="footnote-ref" role="doc-noteref"&gt;2&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;
&lt;p&gt;One questioner asked about open sourcing old versions of Mathematica. &lt;a class="link" href="http://www.reddit.com/r/IAmA/comments/qisot/im_stephen_wolfram_mathematica_nks_wolframalpha/c3xz8by" target="_blank" rel="noopener"
&gt;Wolfram responded very winsomely&lt;/a&gt;, in my view. I didn&amp;rsquo;t know that they&amp;rsquo;ve thought about making the core language more freely available. I&amp;rsquo;d like to see that.&lt;/p&gt;
&lt;p&gt;His most interesting answer is &lt;a class="link" href="http://www.reddit.com/r/IAmA/comments/qisot/im_stephen_wolfram_mathematica_nks_wolframalpha/c3xxhvt" target="_blank" rel="noopener"
&gt;his opinion on Matlab&lt;/a&gt;. He argues that Matlab has remained matrix-centric when so much of contemporary mathematics goes beyond that. &amp;ldquo;In the complete web of algorithms in Mathematica, things that can reasonably be represented as numerical matrices are perhaps 5 or 10% of the total.&amp;rdquo; However, Wolfram believes that Mathematica isn&amp;rsquo;t outdone by Matlab in the realm of matrices.&lt;/p&gt;
&lt;p&gt;Wolfram relays that a major goal of Mathematica is &amp;ldquo;to make a single coherent system in which one can work, and in which everything fits nicely together.&amp;rdquo; &lt;a class="link" href="http://www.tdhopper.com/blog/mathematica-a-love-story/" target="_blank" rel="noopener"
&gt;I argued&lt;/a&gt; that that&amp;rsquo;s one thing they&amp;rsquo;ve done quite well.&lt;/p&gt;
&lt;p&gt;I appreciate Wolfram doing this. I continue to be optimistic about Mathematica as a product, and I hope they have a bright future ahead of them.&lt;/p&gt;
&lt;div class="footnotes" role="doc-endnotes"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:1"&gt;
&lt;p&gt;See the Wikipedia page on &lt;a class="link" href="http://en.wikipedia.org/wiki/Undecidable_problem" target="_blank" rel="noopener"
&gt;undecidability&lt;/a&gt; for more.&amp;#160;&lt;a href="#fnref:1" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:2"&gt;
&lt;p&gt;Both the Riemann hypothesis and P=NP have been around for many years and have a big bounty on solving them: http://www.claymath.org/millennium/.&amp;#160;&lt;a href="#fnref:2" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</description></item><item><title>The Spelling Corrector that Got Me Interested in Machine Learning</title><link>https://tdhopper.com/blog/the-spelling-corrector-that-got-me-interested-in-machine-learning/</link><pubDate>Fri, 02 Mar 2012 00:00:00 +0000</pubDate><guid>https://tdhopper.com/blog/the-spelling-corrector-that-got-me-interested-in-machine-learning/</guid><description>&lt;p&gt;Although I was a computer science minor, I&amp;rsquo;d never heard of statistical machine learning until after college. Now I dabble in machine learning on the side. In the long run, I&amp;rsquo;m interested in studying the intersection of operations research and learning, i.e. intelligent optimization systems. Two years ago, I stumbled across Peter Norvig&amp;rsquo;s essay &lt;a class="link" href="http://norvig.com/spell-correct.html" target="_blank" rel="noopener"
&gt;How to Write a Spelling Corrector&lt;/a&gt;. Google is a notoriously good spelling corrector; just try googling &amp;ldquo;&lt;a class="link" href="http://www.google.com/#hl=en&amp;amp;sclient=psy-ab&amp;amp;q=spellign&amp;amp;pbx=1&amp;amp;oq=spellign&amp;amp;aq=f&amp;amp;aqi=g-s4&amp;amp;aql=&amp;amp;gs_sm=3&amp;amp;gs_upl=355l2969l0l3671l8l8l0l0l0l0l152l775l4.4l8l0&amp;amp;gs_l=hp.3..0i10l4.355l2969l0l3672l8l8l0l0l0l0l152l775l4j4l8l0&amp;amp;bav=on.2,or.r_gc.r_pw.r_qf.,cf.osb&amp;amp;fp=6274c5cd76393bc1&amp;amp;biw=1257&amp;amp;bih=802" target="_blank" rel="noopener"
&gt;spellign&lt;/a&gt;.&amp;rdquo; I find that Google knows what word I&amp;rsquo;m trying to spell even when an application&amp;rsquo;s built-in spell check fails. Norvig explains how to use Bayes theorem to write a pretty-good spelling corrector in 21 lines of Python. In college, I had a good grounding in probability and computing, but I don&amp;rsquo;t recall having seen the two mixed so elegantly. I&amp;rsquo;m pleased to now know that Norvig was only scratching the surface! This week, I&amp;rsquo;ve been watching Sebastian Thrun&amp;rsquo;s lectures on &lt;a class="link" href="http://www.udacity.com/" target="_blank" rel="noopener"
&gt;Kalman Filters&lt;/a&gt; for AI. Probabilistic techniques for filtering noisy data can be used, for example, in a robot keeping track of objects moving around it.&lt;/p&gt;</description></item><item><title>Mathematica: A Love Story</title><link>https://tdhopper.com/blog/mathematica-a-love-story/</link><pubDate>Fri, 10 Feb 2012 00:00:00 +0000</pubDate><guid>https://tdhopper.com/blog/mathematica-a-love-story/</guid><description>&lt;p&gt;As an undergraduate, our math department used Wolfram Research&amp;rsquo;s &lt;a class="link" href="http://www.wolfram.com/mathematica/" target="_blank" rel="noopener"
&gt;Mathematica&lt;/a&gt; heavily for instruction in a number of classes. Initially, I found it perplexing and frustrating. While most of my peers remained in that state (and never used it again after those classes), I soon found myself ordering and reading &lt;a class="link" href="http://www.amazon.com/Introduction-Programming-Mathematica-Paul-Wellin/dp/0521846781/ref=sr_1_4?ie=UTF8&amp;amp;qid=1328824356&amp;amp;sr=8-4" target="_blank" rel="noopener"
&gt;An Introduction to Programming with Mathematica&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Seven years later, I find myself using Mathematica almost daily. As a student, it is one of the most helpful tools at my disposal, and it has saved me countless hours of tedious computation by hand. I’m not sure I can express all the ways in which I appreciate it, but I hope to share some.&lt;/p&gt;
&lt;p&gt;I admit that I primarily use Mathematica as a glorified calculator. Most of my code is single-use code to help me with a homework assignment. I have written some longer code for class projects, but rarely more than a few hundred lines. However, for the work I have done over the past seven years, it is exactly the right tool, and I don’t know of any other language which comprehensively offers all the features I need within its core language.&lt;/p&gt;
&lt;p&gt;One other note to the Redditors and cynics (but I repeat myself): I&amp;rsquo;m not recommending or encouraging programmers to jump ship from their main languages to Mathematica. I’m not suggesting that Mathematica doesn’t have any shortcomings. I’m not arguing that Mathematica is good for everything. I’m well aware that Mathematica is an expensive, closed platform. I’m well aware that Mathematica has the worst undo ever. I’m not writing an advertisement or getting paid by Wolfram. I’m simply sharing the story of a program that has become an invaluable part of my schooling.&lt;/p&gt;
&lt;h2 id="1-powerful-symbolic-computations"&gt;1. Powerful Symbolic Computations
&lt;/h2&gt;&lt;p&gt;Perhaps the thing Mathematica is most well-known for is symbolic computations. The oldest Mathematica file I have on my computer is a single line of code that I apparently used on a differential equations quiz in 2005. In it, I did a partial fraction decomposition: the bane of calculus 2 students, but easy for a computer.&lt;/p&gt;
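&lt;p&gt;An illustrative decomposition of the kind Mathematica&amp;rsquo;s Apart handles instantly (not the one from the quiz):&lt;/p&gt;

```latex
\frac{1}{x^2 - 1} = \frac{1}{2(x - 1)} - \frac{1}{2(x + 1)}
```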
&lt;p&gt;&lt;img src="https://tdhopper.com/uploads/2012/02/alovestory_1.gif"
loading="lazy"
alt="alovestory_1.gif"
&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src="https://tdhopper.com/uploads/2012/02/alovestory_2.gif"
loading="lazy"
alt="alovestory_2.gif"
&gt;&lt;/p&gt;
&lt;p&gt;One of the benefits of Mathematica is the elegance of typesetting in both the input and the output. Wolfram has taken great care to make Mathematica an aesthetically excellent experience, and I’m grateful for that.&lt;/p&gt;
&lt;p&gt;These days, I avoid doing algebraic manipulations by hand at all costs. It’s not worth it to me to risk making errors that might trickle down into my work. I let computers handle such things for me. Thus, when I’m doing homework, I usually have a notebook open filled with one-off expressions like&lt;/p&gt;
&lt;p&gt;&lt;img src="https://tdhopper.com/uploads/2012/02/alovestory_3.gif"
loading="lazy"
alt="alovestory_3.gif"
&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src="https://tdhopper.com/uploads/2012/02/alovestory_4.gif"
loading="lazy"
alt="alovestory_4.gif"
&gt;&lt;/p&gt;
&lt;p&gt;Of course, it can solve much harder problems too. Integration is no problem. Here’s a triple integral I solved in my electricity and magnetism class sophomore year. (I wish I remembered what it all means.)&lt;/p&gt;
&lt;p&gt;&lt;img src="https://tdhopper.com/uploads/2012/02/alovestory_5.gif"
loading="lazy"
alt="alovestory_5.gif"
&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src="https://tdhopper.com/uploads/2012/02/alovestory_6.gif"
loading="lazy"
alt="alovestory_6.gif"
&gt;&lt;/p&gt;
&lt;p&gt;The output is messy because Mathematica tried to solve the integral as generally as possible. We can get a clearer answer by stating some assumptions we’re making about the parameters.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://tdhopper.com/uploads/2012/02/alovestory_7.gif"
loading="lazy"
alt="alovestory_7.gif"
&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src="https://tdhopper.com/uploads/2012/02/alovestory_8.gif"
loading="lazy"
alt="alovestory_8.gif"
&gt;&lt;/p&gt;
&lt;h2 id="2-functional-programming"&gt;2. Functional Programming
&lt;/h2&gt;&lt;p&gt;Like R, Mathematica allows procedural programming.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://tdhopper.com/uploads/2012/02/alovestory_9.gif"
loading="lazy"
alt="alovestory_9.gif"
&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src="https://tdhopper.com/uploads/2012/02/alovestory_10.gif"
loading="lazy"
alt="alovestory_10.gif"
&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src="https://tdhopper.com/uploads/2012/02/alovestory_11.gif"
loading="lazy"
alt="alovestory_11.gif"
&gt;&lt;/p&gt;
&lt;p&gt;However, again like R, Mathematica is really built for functional programming. &lt;a class="link" href="http://reference.wolfram.com/mathematica/tutorial/FunctionalOperationsOverview.html" target="_blank" rel="noopener"
&gt;Wolfram has a great tutorial&lt;/a&gt; on the topic, but let me share a brief example from my own use.&lt;/p&gt;
&lt;p&gt;On a homework assignment this week, I wanted to measure the total tardiness of various schedules in a single-machine problem. Each of the four jobs has a processing time, given by {2,4,6,8}, and a due date, given by {4,14,10,16}. The tardiness of a job under a given schedule (i.e. ordering of the jobs) is 0 if the job is completed on time and the amount by which it is late otherwise. First I set the processing times and due dates for the four jobs:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://tdhopper.com/uploads/2012/02/alovestory_12.gif"
loading="lazy"
alt="alovestory_12.gif"
&gt;&lt;/p&gt;
&lt;p&gt;Any permutation of the job indices {1,2,3,4} gives a valid schedule. Suppose we want to know the total tardiness of the schedule x={1,4,3,2}. The ordering of processing times is given by p[[x]], so the time when each job is completed is a running total of the processing times:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://tdhopper.com/uploads/2012/02/alovestory_13.gif"
loading="lazy"
alt="alovestory_13.gif"
&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src="https://tdhopper.com/uploads/2012/02/alovestory_14.gif"
loading="lazy"
alt="alovestory_14.gif"
&gt;&lt;/p&gt;
&lt;p&gt;The lateness of each job is defined by the completion time minus the due date:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://tdhopper.com/uploads/2012/02/alovestory_15.gif"
loading="lazy"
alt="alovestory_15.gif"
&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src="https://tdhopper.com/uploads/2012/02/alovestory_16.gif"
loading="lazy"
alt="alovestory_16.gif"
&gt;&lt;/p&gt;
&lt;p&gt;To get the tardiness, we want the max of the lateness and zero. There are a number of ways to do this, but one is to apply a max function to each element of the list:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://tdhopper.com/uploads/2012/02/alovestory_17.gif"
loading="lazy"
alt="alovestory_17.gif"
&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src="https://tdhopper.com/uploads/2012/02/alovestory_18.gif"
loading="lazy"
alt="alovestory_18.gif"
&gt;&lt;/p&gt;
&lt;p&gt;(The &lt;em&gt;#&lt;/em&gt; and &lt;em&gt;&amp;amp;&lt;/em&gt; are part of Mathematica’s notation for &lt;a class="link" href="http://reference.wolfram.com/mathematica/tutorial/PureFunctions.html" target="_blank" rel="noopener"
&gt;pure functions&lt;/a&gt;.) Or, more succinctly,&lt;/p&gt;
&lt;p&gt;&lt;img src="https://tdhopper.com/uploads/2012/02/alovestory_19.gif"
loading="lazy"
alt="alovestory_19.gif"
&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src="https://tdhopper.com/uploads/2012/02/alovestory_20.gif"
loading="lazy"
alt="alovestory_20.gif"
&gt;&lt;/p&gt;
&lt;p&gt;In a non-functional language, this would have required a for-loop and several lines of code. In a functional language, it naturally fits into a single line. Getting the total tardiness adds no more complexity:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://tdhopper.com/uploads/2012/02/alovestory_21.gif"
loading="lazy"
alt="alovestory_21.gif"
&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src="https://tdhopper.com/uploads/2012/02/alovestory_22.gif"
loading="lazy"
alt="alovestory_22.gif"
&gt;&lt;/p&gt;
&lt;p&gt;since % returns the result of the last evaluation.&lt;/p&gt;
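&lt;p&gt;(For reference, here is the whole computation reconstructed as typed input; the variable names are mine, but the values follow the discussion above:)&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;p = {2, 4, 6, 8};    (* processing times *)
d = {4, 14, 10, 16}; (* due dates *)
x = {1, 4, 3, 2};    (* a candidate schedule *)

c = Accumulate[p[[x]]]              (* completion times: {2, 10, 16, 20} *)
lateness = c - d[[x]]               (* {-2, -6, 6, 6} *)
tardiness = Max[#, 0] &amp; /@ lateness (* {0, 0, 6, 6} *)
Total[tardiness]                    (* 12 *)&lt;/code&gt;&lt;/pre&gt;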
&lt;p&gt;This is only a simple example of a functional operation in Mathematica. Expressions can become much more complex. All my Mathematica code is littered with functional expressions, but rarely will you see a for-loop or a while-loop in my code. And I like it that way.&lt;/p&gt;
&lt;p&gt;Oh, and if you want to parallelize these operations: &lt;a class="link" href="https://reference.wolfram.com/language/guide/ParallelComputing.html" target="_blank" rel="noopener"
&gt;not a problem&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id="3-optimization"&gt;3. Optimization
&lt;/h2&gt;&lt;p&gt;As a student of operations research, I spend a lot of time solving optimization problems, and solvers for many flavors of optimization problem are built right into Mathematica. Solving linear programs given the matrices is easy with the LinearProgramming function. Because most of the problems I’ve solved up to this point have been “toy” problems for class, I can’t attest to Mathematica’s ability to handle large-scale problems, but Wolfram &lt;a class="link" href="http://reference.wolfram.com/mathematica/tutorial/ConstrainedOptimizationLinearProgramming.html" target="_blank" rel="noopener"
&gt;claims&lt;/a&gt; it can. Mathematica’s ExampleData function gives easy access to many data sets, including &lt;a class="link" href="http://www.netlib.org/lp/data/readme" target="_blank" rel="noopener"
&gt;NetLib.org&lt;/a&gt;’s LP problems. Mathematica solved one such problem, with 6072 rows and 12230 columns, in 60 seconds on my 11-inch MacBook Air.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://tdhopper.com/uploads/2012/02/alovestory_23.gif"
loading="lazy"
alt="alovestory_23.gif"
&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src="https://tdhopper.com/uploads/2012/02/alovestory_24.gif"
loading="lazy"
alt="alovestory_24.gif"
&gt;&lt;/p&gt;
&lt;p&gt;The built-in solver certainly isn’t as robust as CPLEX or other commercial solvers, but it does, at least, provide &lt;a class="link" href="http://reference.wolfram.com/mathematica/tutorial/ConstrainedOptimizationLinearProgramming.html" target="_blank" rel="noopener"
&gt;several solution methods&lt;/a&gt;.&lt;/p&gt;
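&lt;p&gt;(As a toy illustration of the interface, not the NetLib problem above: LinearProgramming[c, m, b] finds the x minimizing c.x subject to m.x ≥ b and x ≥ 0.)&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;(* Minimize x + y subject to x + 2y &gt;= 3 and 3x + y &gt;= 4, with x, y &gt;= 0 *)
LinearProgramming[{1, 1}, {{1, 2}, {3, 1}}, {3, 4}]
(* {1, 1} *)&lt;/code&gt;&lt;/pre&gt;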
&lt;p&gt;I most often find myself using the Minimize and Maximize functions with explicit constraints:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://tdhopper.com/uploads/2012/02/alovestory_25.gif"
loading="lazy"
alt="alovestory_25.gif"
&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src="https://tdhopper.com/uploads/2012/02/alovestory_26.gif"
loading="lazy"
alt="alovestory_26.gif"
&gt;&lt;/p&gt;
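&lt;p&gt;(A small stand-in for the input pictured above:)&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;(* Minimize a quadratic subject to an explicit linear constraint *)
Minimize[{x^2 + y^2, x + y &gt;= 1}, {x, y}]
(* {1/2, {x -&gt; 1/2, y -&gt; 1/2}} *)&lt;/code&gt;&lt;/pre&gt;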
&lt;p&gt;In the future, I hope to do a post on using Mathematica as a pseudo-modeling language. You can see the &lt;a class="link" href="http://reference.wolfram.com/mathematica/guide/Optimization.html" target="_blank" rel="noopener"
&gt;documentation&lt;/a&gt; for a number of other optimization related functions.&lt;/p&gt;
&lt;p&gt;Recently, I’ve been working with stochastic dynamic programming problems (i.e. Markov decision processes). Mathematica offers &lt;a class="link" href="http://reference.wolfram.com/mathematica/tutorial/FunctionsThatRememberValuesTheyHaveFound.html" target="_blank" rel="noopener"
&gt;the easiest memoization I’ve ever seen in any language&lt;/a&gt;. Combined with functional aspects, I can solve dynamic programs with relatively little code.&lt;/p&gt;
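&lt;p&gt;The whole trick is the idiom f[x_] := f[x] = …, which stores each computed value as a new definition. The classic example:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;(* Memoized Fibonacci: each fib[n] is remembered after its first computation *)
fib[0] = 0; fib[1] = 1;
fib[n_] := fib[n] = fib[n - 1] + fib[n - 2];
fib[100]
(* 354224848179261915075 *)&lt;/code&gt;&lt;/pre&gt;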
&lt;h2 id="4-graphics"&gt;4. Graphics
&lt;/h2&gt;&lt;p&gt;When I am doing school work, I want to be able to do complicated computations and then visualize the results quickly. Because native graphics are built so tightly into the core language, I don’t have to go out of my way to do this.&lt;/p&gt;
&lt;p&gt;Last semester, I wanted to demonstrate a Monte Carlo algorithm for navigating a maze for a report I was writing. Over 200 iterations, a relatively simple solver (built, of course, in Mathematica) could find an optimal path through a 4-by-3 maze.&lt;/p&gt;
&lt;p&gt;Working straight from the output of the solver, in about twenty lines of code, I produced a grid showing the candidate solution at every other iteration (the green cells indicate cells where the action is optimal). After a call to the Export function, the graphic was ready to be included in my &lt;img src="https://tdhopper.com/uploads/2012/02/alovestory_27.gif"
loading="lazy"
alt="alovestory_27.gif"
&gt; file. All of this without having to open another program or import any graphics packages.&lt;/p&gt;
&lt;p&gt;&lt;a class="link" href="https://tdhopper.com/uploads/2012/02/maze.png" &gt;&lt;figure&gt;&lt;img src="https://tdhopper.com/uploads/2012/02/maze-249x300.png"&gt;
&lt;/figure&gt;
&lt;/a&gt;&lt;/p&gt;
&lt;h2 id="5-documentation"&gt;5. Documentation
&lt;/h2&gt;&lt;p&gt;Wolfram has been careful to write readable and thorough documentation for Mathematica. Though Mathematica is not free software, its 10,000+ pages of documentation are &lt;a class="link" href="http://reference.wolfram.com/mathematica/guide/Mathematica.html" target="_blank" rel="noopener"
&gt;available online&lt;/a&gt;. Not only is there documentation for every function (usually including Basic Examples, Scope, Generalizations &amp;amp; Extensions, Applications, Properties &amp;amp; Relations, and Neat Examples), the documentation is also full of tutorials on various aspects of the language. If you read the help inside of Mathematica, the files are simply notebooks, so the code can be evaluated within the documentation. I think you’d be hard-pressed to find a language with better documentation.&lt;/p&gt;
&lt;h2 id="6-naming-conventions"&gt;6. Naming Conventions
&lt;/h2&gt;&lt;p&gt;If there is one debate Mathematica wins hands down, it’s naming conventions. By Wolfram’s own &lt;a class="link" href="http://reference.wolfram.com/mathematica/tutorial/NamingConventions.html" target="_blank" rel="noopener"
&gt;standards&lt;/a&gt;, “As with most Mathematica functions, the names are usually complete English words, fully spelled out.” If you know the mathematical name for something, you can probably guess the Mathematica form.&lt;/p&gt;
&lt;p&gt;Stephen Wolfram wrote a &lt;a class="link" href="http://blog.wolfram.com/2008/01/10/ten-thousand-hours-of-design-reviews/" target="_blank" rel="noopener"
&gt;blog post&lt;/a&gt; a few years ago on his personal role in naming Mathematica functions. Perhaps not too different from his late friend Steve Jobs, Wolfram desires intense control of the finest details of his products.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;I just realized that over the course of the decade during which we were developing Mathematica 6—and accelerating greatly towards the end—I spent altogether about 10,000 hours doing what we call “design reviews” for Mathematica 6, trying to make all those new functions and pieces of functionality in Mathematica 6 be as clean and simple as possible, and all fit together.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I think this has paid off.&lt;/p&gt;
&lt;p&gt;Some people would complain about a language with such an enormous number of named expressions, but Wolfram (the man and the company) has been so careful in constructing it that it doesn’t feel bloated.&lt;/p&gt;
&lt;h2 id="7-interactivity"&gt;7. Interactivity
&lt;/h2&gt;&lt;p&gt;In version 7, Wolfram introduced interactivity into Mathematica. The Manipulate function is one I have found extremely valuable. It allows you to parametrize an expression and adjust the parameters while seeing the results in real time. For example, you could use Manipulate to adjust the region over which a function is plotted:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://tdhopper.com/uploads/2012/02/alovestory_28.gif"
loading="lazy"
alt="alovestory_28.gif"
&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src="https://tdhopper.com/uploads/2012/02/alovestory_29.gif"
loading="lazy"
alt="alovestory_29.gif"
&gt;&lt;/p&gt;
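&lt;p&gt;(A minimal Manipulate in typed form; the plotted function is my stand-in, not necessarily the one in the screenshot:)&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;(* Drag the slider for xmax to change the plotted region in real time *)
Manipulate[Plot[Sin[x], {x, 0, xmax}], {xmax, 1, 20}]&lt;/code&gt;&lt;/pre&gt;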
&lt;p&gt;A side benefit of the careful construction of the language is that functions with related behavior often have interchangeable argument lists. Manipulate can be replaced with Animate with no other changes.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://tdhopper.com/uploads/2012/02/alovestory_30.gif"
loading="lazy"
alt="alovestory_30.gif"
&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src="https://tdhopper.com/uploads/2012/02/Manip.gif"
loading="lazy"
alt="Manip.gif"
&gt;&lt;/p&gt;
&lt;p&gt;Or Table for that matter:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://tdhopper.com/uploads/2012/02/alovestory_32.gif"
loading="lazy"
alt="alovestory_32.gif"
&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src="https://tdhopper.com/uploads/2012/02/alovestory_33.gif"
loading="lazy"
alt="alovestory_33.gif"
&gt;&lt;/p&gt;
&lt;h2 id="8-continual-development"&gt;8. Continual Development
&lt;/h2&gt;&lt;p&gt;Thankfully, Wolfram hasn’t given up on Mathematica. It’s been in development now for nearly 24 years. Mathematica 7 (released in November 2008) introduced interactivity features, access to many data sets, and built-in parallel computing, among &lt;a class="link" href="http://www.wolfram.com/products/mathematica/newin7/" target="_blank" rel="noopener"
&gt;other things&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Mathematica 8, released in November 2010, brought integration with Wolfram Alpha and free-form input. I find myself using this frequently when I’m teaching calculus. For example:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://tdhopper.com/uploads/2012/02/alovestory_34.gif"
loading="lazy"
alt="alovestory_34.gif"
&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src="https://tdhopper.com/uploads/2012/02/alovestory_35.gif"
loading="lazy"
alt="alovestory_35.gif"
&gt;&lt;/p&gt;
&lt;p&gt;Mathematica 8 also brought incredible probability computations. What’s the probability that a standard normal random variable is less than a Uniform(0,1) random variable? No problem.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://tdhopper.com/uploads/2012/02/alovestory_36.gif"
loading="lazy"
alt="alovestory_36.gif"
&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src="https://tdhopper.com/uploads/2012/02/alovestory_37.gif"
loading="lazy"
alt="alovestory_37.gif"
&gt;&lt;/p&gt;
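&lt;p&gt;(The input in the screenshot is presumably something like this reconstruction:)&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;(* P(Z &lt; U) for Z ~ Normal(0, 1) and an independent U ~ Uniform(0, 1) *)
Probability[z &lt; u,
  {z \[Distributed] NormalDistribution[0, 1],
   u \[Distributed] UniformDistribution[{0, 1}]}] // N
(* 0.684373 *)&lt;/code&gt;&lt;/pre&gt;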
&lt;p&gt;The &lt;a class="link" href="http://www.wolfram.com/mathematica/new-in-8/" target="_blank" rel="noopener"
&gt;list of things new in version 8&lt;/a&gt; goes on.&lt;/p&gt;
&lt;h2 id="9-comprehensiveness"&gt;9. Comprehensiveness
&lt;/h2&gt;&lt;p&gt;A feature of Mathematica that is hard to articulate is the comprehensiveness of the features I’ve already mentioned, plus many more. It’s a full-featured programming language, but the core language also extends to the depths of applied and pure mathematics. Symbolic manipulation? Check. Numerical methods? Check. &lt;a class="link" href="http://www.wolfram.com/mathematica/new-in-8/group-theory-algorithms/" target="_blank" rel="noopener"
&gt;Abstract algebra&lt;/a&gt;? Check. &lt;a class="link" href="http://reference.wolfram.com/mathematica/guide/GraphsAndNetworks.html" target="_blank" rel="noopener"
&gt;Graph theory&lt;/a&gt;? &lt;a class="link" href="http://reference.wolfram.com/mathematica/tutorial/BasicStatistics.html" target="_blank" rel="noopener"
&gt;Statistics&lt;/a&gt;? &lt;a class="link" href="http://reference.wolfram.com/mathematica/howto/CustomizePlotsAndGraphics.html" target="_blank" rel="noopener"
&gt;Visualization&lt;/a&gt;? &lt;a class="link" href="http://reference.wolfram.com/mathematica/guide/Optimization.html" target="_blank" rel="noopener"
&gt;Optimization&lt;/a&gt;? &lt;a class="link" href="http://reference.wolfram.com/mathematica/guide/ProcessingTextualData.html" target="_blank" rel="noopener"
&gt;String Processing&lt;/a&gt;? &lt;a class="link" href="http://reference.wolfram.com/mathematica/guide/DifferentialEquations.html" target="_blank" rel="noopener"
&gt;Differential equations&lt;/a&gt;? &lt;a class="link" href="http://reference.wolfram.com/mathematica/ref/ChemicalData.html" target="_blank" rel="noopener"
&gt;Computational chemistry&lt;/a&gt;? &lt;a class="link" href="http://reference.wolfram.com/mathematica/guide/Calculus.html" target="_blank" rel="noopener"
&gt;Calculus&lt;/a&gt;? Check. Check. Check.&lt;/p&gt;
&lt;p&gt;The comprehensiveness of &lt;em&gt;Mathematica&lt;/em&gt;&amp;rsquo;s functionality along with dynamic typing and functional programming allows me to write code to do complicated tasks very quickly. I love it.&lt;/p&gt;
&lt;h2 id="conclusion"&gt;Conclusion
&lt;/h2&gt;&lt;p&gt;No doubt, Mathematica has its limitations&lt;sup id="fnref:1"&gt;&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref"&gt;1&lt;/a&gt;&lt;/sup&gt;: the world’s worst undo. Not object-oriented. Closed platform. Expensive. No autosave. No data frame structure.&lt;/p&gt;
&lt;p&gt;However, for me, it’s an invaluable tool. Last semester, I saw a less computer-savvy fellow graduate student writing out a huge table by hand. I don’t recall the name of what he was doing, but it had something to do with measuring the distance between permutations. I told him I could do it for him in a single line of Mathematica.&lt;/p&gt;
&lt;p&gt;In just a few minutes I wrote him the following code. It ended up taking me more than one line, but I wrote the code much faster than he was generating it by hand. (His table was actually for the 4-permutation case, so it was 24x24 instead of 6x6.)&lt;/p&gt;
&lt;p&gt;&lt;img src="https://tdhopper.com/uploads/2012/02/alovestory_38.gif"
loading="lazy"
alt="alovestory_38.gif"
&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src="https://tdhopper.com/uploads/2012/02/Screen-Shot-2012-02-10-at-10.55.19-AM.png"
loading="lazy"
alt="Screen-Shot-2012-02-10-at-10.55.19-AM.png"
&gt;&lt;/p&gt;
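&lt;p&gt;(Since I don’t remember which distance he needed, here is a sketch with HammingDistance standing in for his measure:)&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;(* All pairwise distances between the 3-permutations: a 6x6 table.
   HammingDistance is only a stand-in for whatever measure he used. *)
perms = Permutations[Range[3]];
Outer[HammingDistance, perms, perms, 1] // MatrixForm&lt;/code&gt;&lt;/pre&gt;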
&lt;p&gt;I use Mathematica for little things like this; it allows me to spend my time and brain power on the things that computers can’t handle.&lt;/p&gt;
&lt;p&gt;I love Mathematica. And maybe you will too.&lt;/p&gt;
&lt;p&gt;(I wrote this post in Mathematica. You can check out the &lt;a class="link" href="https://github.com/tdhopper/Mathematica-A-Love-Story" target="_blank" rel="noopener"
&gt;notebook&lt;/a&gt; on GitHub.)&lt;/p&gt;
&lt;div class="footnotes" role="doc-endnotes"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:1"&gt;
&lt;p&gt;I might follow up with a post on that very point.&amp;#160;&lt;a href="#fnref:1" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</description></item><item><title>Mathematicians, Engineers, and Businessmen on NPC Problems</title><link>https://tdhopper.com/blog/mathematicians-engineers-and-businessmen-on-npc-problems/</link><pubDate>Thu, 02 Feb 2012 00:00:00 +0000</pubDate><guid>https://tdhopper.com/blog/mathematicians-engineers-and-businessmen-on-npc-problems/</guid><description>&lt;p&gt;I assume most people who are nerdy enough to read this blog are nerdy enough to know about the &lt;a class="link" href="http://www.claymath.org/millennium/P_vs_NP/" target="_blank" rel="noopener"
&gt;$\mathcal{P}$ vs $\mathcal{NP}$ problem&lt;/a&gt;&lt;sup id="fnref:1"&gt;&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref"&gt;1&lt;/a&gt;&lt;/sup&gt;. I first learned about this problem taking computer science classes in college, and it all seemed very theoretical at the time. Now that I study operations research, the problem is very real. Operations researchers are often limited in their pursuits by the challenges of $\mathcal{NP}$-hard &lt;a class="link" href="http://en.wikipedia.org/wiki/Integer_program" target="_blank" rel="noopener"
&gt;problems&lt;/a&gt;, and many operations researchers spend their careers trying to solve hard problems. Last semester, I came across &lt;a class="link" href="http://www.core.ucl.ac.be/~wolsey/" target="_blank" rel="noopener"
&gt;Laurence Wolsey&lt;/a&gt;&amp;rsquo;s beautiful description of how various people might view this problem. (Of course, the inimitable Randall Munroe has offered &lt;a class="link" href="http://xkcd.com/664" target="_blank" rel="noopener"
&gt;a similar look&lt;/a&gt; at the issue.)&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;A &lt;em&gt;pessimist&lt;/em&gt; might say that as most problems appear to be hard (i.e., their decision version lies in $\mathcal{NPC}$), we have no hope of solving instances of large size (because in the worst case we cannot hope to do better than enumeration), and so we should give up. A &lt;em&gt;mathematician (optimist)&lt;/em&gt; might set out to become famous by proving that $\mathcal{P=NP}$. A &lt;em&gt;mathematician (pessimist)&lt;/em&gt; might set out to become famous by proving that $\mathcal{P\neq NP}$. A &lt;em&gt;mathematician (thoughtful)&lt;/em&gt; might decide to ask a different question: Can I find an algorithm that is guaranteed to find a solution &amp;ldquo;close to optimal&amp;rdquo; in polynomial time in all cases&lt;sup id="fnref:2"&gt;&lt;a href="#fn:2" class="footnote-ref" role="doc-noteref"&gt;2&lt;/a&gt;&lt;/sup&gt;. A &lt;em&gt;probabilist (thoughtful)&lt;/em&gt; might also ask a different question: Can I find an algorithm that runs in polynomial time with high probability and that is guaranteed to find an optimal or &amp;ldquo;close to optimal&amp;rdquo; solution with high probability? An &lt;em&gt;engineer&lt;/em&gt; would start looking for a heuristic algorithm that produces practically usable solutions. Your &lt;em&gt;boss&lt;/em&gt; might say: I don&amp;rsquo;t care a damn about integer programming theory. You just worry about our scheduling problem. Give me a feasible production schedule for tomorrow in which William Brown and Daughters&amp;rsquo; order is out of the door by 4 P.M. A &lt;em&gt;struggling professor&lt;/em&gt; might say: Great. Previously I was trying to develop one algorithm to solve all integer programs, and publishing one paper every two years explaining why I was not succeeding. Now I know that I might as well study each $\mathcal{NP}$ problem individually. As there are thousands of them, I should be able to write twenty papers a year. Needless to say they are all right. 
There is no easy and rapid solution, but the problems will not go away, and more and more fascinating and important practical problems are being formulated as integer programs. So in spite of the $\mathcal{NP}$-completeness theory, using an appropriate combination of theory, algorithms, experience, and intensive calculation, verifiably good solutions for large instances can and must be found&lt;sup id="fnref:3"&gt;&lt;a href="#fn:3" class="footnote-ref" role="doc-noteref"&gt;3&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;div class="footnotes" role="doc-endnotes"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:1"&gt;
&lt;p&gt;If that&amp;rsquo;s not the case, you might check out this page: &lt;a class="link" href="http://simple.wikipedia.org/wiki/P_versus_NP" target="_blank" rel="noopener"
&gt;http://simple.wikipedia.org/wiki/P_versus_NP&lt;/a&gt;.&amp;#160;&lt;a href="#fnref:1" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:2"&gt;
&lt;p&gt;See &lt;a class="link" href="http://en.wikipedia.org/wiki/Polynomial_approximation_scheme" target="_blank" rel="noopener"
&gt;http://en.wikipedia.org/wiki/Polynomial_approximation_scheme&lt;/a&gt;.&amp;#160;&lt;a href="#fnref:2" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:3"&gt;
&lt;p&gt;From &lt;a class="link" href="http://books.google.com/books/about/Integer_programming.html?id=x7RvQgAACAAJ" target="_blank" rel="noopener"
&gt;Integer Programming&lt;/a&gt;, Page 87.&amp;#160;&lt;a href="#fnref:3" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</description></item><item><title>The Efficient Research Hypothesis</title><link>https://tdhopper.com/blog/the-efficient-research-hypothesis/</link><pubDate>Fri, 27 Jan 2012 00:00:00 +0000</pubDate><guid>https://tdhopper.com/blog/the-efficient-research-hypothesis/</guid><description>&lt;img src="https://tdhopper.com/images/money.png" alt="Featured image of post The Efficient Research Hypothesis" /&gt;&lt;p&gt;In economics, some have asserted the efficient market hypothesis. The idea is that market prices take into account all the information currently available. If the efficient market hypothesis holds, an investor couldn&amp;rsquo;t consistently beat the market because his knowledge about the markets is no better than anyone else&amp;rsquo;s. An accompanying &amp;ldquo;joke&amp;rdquo; is that an economist would never bend over to pick up a $20 bill on the street. If there was really $20 to be taken, someone would have taken it already.&lt;/p&gt;
&lt;p&gt;As I&amp;rsquo;m moving towards research and away from classes in my PhD program, I sometimes find myself believing the efficient research hypothesis: if an idea I have is good and correct, someone must have had it already. I have the same temptation outside of school. On occasion, I consider an idea for a website or computer program but then decide that if it were really a good idea, someone would have had it already. This can&amp;rsquo;t always be right. I don&amp;rsquo;t know if markets are efficient or not; I&amp;rsquo;m neither an economist nor an investor. But the progression of knowledge certainly is not efficient. New things are always available to be learned and studied, and these things aren&amp;rsquo;t always obvious from the information currently available. Research (and other projects) requires more than just thinking of a good idea; it requires sweat and elbow grease.&lt;/p&gt;</description></item><item><title>From Nuclear Weapons to Operations Research</title><link>https://tdhopper.com/blog/from-nuclear-weapons-to-operations-research/</link><pubDate>Wed, 25 Jan 2012 00:00:00 +0000</pubDate><guid>https://tdhopper.com/blog/from-nuclear-weapons-to-operations-research/</guid><description>&lt;img src="https://tdhopper.com/images/tanker.png" alt="Featured image of post From Nuclear Weapons to Operations Research" /&gt;&lt;p&gt;In 1953, the eminent mathematician John Von Neumann (as they say, &amp;ldquo;Most mathematicians prove what they can, von Neumann proves what he wants&amp;rdquo;) wrote a letter to T. V. Moore of Standard Oil. Apparently, Moore had written Von Neumann about an operations problem he had. Eighteen tankers would transport fuel from La Salina to Las Piedras to Aruba. 
Moore wanted to &amp;ldquo;determine the economic value of increasing the number of berths for the loading of these tankers in La Salina from 3 to 4 or to 5.&amp;rdquo;&lt;sup id="fnref:1"&gt;&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref"&gt;1&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;p&gt;However, due to &amp;ldquo;weather, conditions in the ports of call, etc&amp;hellip; the comings and goings of the tankers are described in statistical terms only.&amp;rdquo; Significant uncertainty existed in the time between ports and in loading and unloading times. Because of the complexity added by these uncertainties, Von Neumann speculated that the system would be very difficult to describe analytically. A few years earlier, in 1946, Von Neumann and Stanislaw Ulam worked to solve the radiation shielding problem. And, &amp;ldquo;Despite having most of the necessary data&amp;hellip; the problem could not be solved with analytical calculations.&amp;rdquo;&lt;sup id="fnref:2"&gt;&lt;a href="#fn:2" class="footnote-ref" role="doc-noteref"&gt;2&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;p&gt;In this similarly probabilistic system, Ulam and Von Neumann decided instead to approximate the results they needed by repeatedly generating possible outcomes based on statistical manipulations and a (pseudo) random number generator. Their code name for this project was &amp;ldquo;Monte Carlo.&amp;rdquo; Von Neumann suggested that Standard Oil could build a &lt;em&gt;probabilistic&lt;/em&gt; model of the tanker problem. For example, a trip from La Salina to Las Piedras might take 8 hours with a probability of 0.8 (good weather conditions), 10 hours with a probability of 0.15 (moderate weather), and 15 hours with a probability of 0.05 (poor weather). Based on this model and given a table of random numbers, he could simulate possible outcomes of a trip that (roughly) match reality.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The procedure would have to be somewhat like this: Represent each tanker by some form of record, e.g., by a punch card, showing its exit time from La Salina&amp;hellip;. Program calculations which will develop the further history of this tanker, always deriving those quantities which depend on chance&amp;hellip;, with the use of suitable tables of random numbers.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Of course, Moore would not want to generate just one possible trip. But if the process was done repeatedly, the average benefit of having some number of berths in the simulation would be an approximation of the real benefit:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;It is thus possible to trace the history of as many days of operation as desired (say, a few years). One can then work out the behavior of sufficiently large samples for any assumed number of berths and thereby get an evaluation of the economic significance of any particular arrangement (i.e. any particular number of berths).&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The same method Ulam and Von Neumann used to understand radioactive particles (which behave probabilistically) could be used to model full shipments. Fortunately, Monte Carlo simulation now requires neither punch cards nor random number tables. Von Neumann&amp;rsquo;s punch-card method would quickly be replaced by computer programs and &lt;a class="link" href="http://www.codeproject.com/Articles/25172/Simple-Random-Number-Generation" target="_blank" rel="noopener"
&gt;pseudo-random number generation algorithms&lt;/a&gt;. This early application of Monte Carlo methods to operations problems is one of many.&lt;/p&gt;
&lt;p&gt;I have recently been looking at online stochastic scheduling problems. In these problems, a system is asked to fulfill some scheduling requests while there is uncertainty (perhaps uncertainty about what other requests will come or how long the job might take to complete). A common approach to these problems is to solve &lt;em&gt;deterministic&lt;/em&gt; scheduling problems based on Monte Carlo simulations of the future. The system somehow combines these various schedules to make a decision. There are many &lt;a class="link" href="http://en.wikipedia.org/wiki/Monte_Carlo_method#Applications" target="_blank" rel="noopener"
&gt;applications of Monte Carlo methods&lt;/a&gt; in other fields as well.&lt;/p&gt;
&lt;div class="footnotes" role="doc-endnotes"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:1"&gt;
&lt;p&gt;John von Neumann (2005). Miklós Rédei. ed. John von Neumann: Selected letters. History of Mathematics. 27. American Mathematical Society. p. 123. ISBN 0-8218-3776-1.&amp;#160;&lt;a href="#fnref:1" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:2"&gt;
&lt;p&gt;&lt;a class="link" href="http://en.wikipedia.org/wiki/Monte_Carlo_method#History" target="_blank" rel="noopener"
&gt;http://en.wikipedia.org/wiki/Monte_Carlo_method#History&lt;/a&gt;&amp;#160;&lt;a href="#fnref:2" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</description></item><item><title>What do all operations researchers know?</title><link>https://tdhopper.com/blog/what-do-all-or-folks-know/</link><pubDate>Mon, 16 Jan 2012 00:00:00 +0000</pubDate><guid>https://tdhopper.com/blog/what-do-all-or-folks-know/</guid><description>&lt;p&gt;My friend &lt;a class="link" href="http://www.johndcook.com/blog/" target="_blank" rel="noopener"
&gt;John Cook&lt;/a&gt; asked me an interesting question recently:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;If you had a room full of people with a graduate degree in [operations research], what things would nearly everyone in the room know?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Operations research is notoriously hard to define. According to the Institute for Operations Research and the Management Sciences, “In a nutshell, operations research (O.R.) is the discipline of applying advanced analytical methods to help make better decisions.” I suspect graduate programs spend most of their time teaching those “analytical methods,” i.e. mathematical and computational techniques for modeling and solving problems related to decisions. Examples include &lt;a class="link" href="http://en.wikipedia.org/wiki/Linear_programming" target="_blank" rel="noopener"
&gt;linear programming&lt;/a&gt;, &lt;a class="link" href="http://en.wikipedia.org/wiki/Nonlinear_programming" target="_blank" rel="noopener"
&gt;nonlinear programming&lt;/a&gt;, &lt;a class="link" href="http://en.wikipedia.org/wiki/Integer_programming" target="_blank" rel="noopener"
&gt;integer programming&lt;/a&gt;, &lt;a class="link" href="http://en.wikipedia.org/wiki/Dynamic_programming" target="_blank" rel="noopener"
&gt;dynamic programming&lt;/a&gt;, &lt;a class="link" href="http://en.wikipedia.org/wiki/Stochastic_programming" target="_blank" rel="noopener"
&gt;stochastic programming&lt;/a&gt;, &lt;a class="link" href="http://en.wikipedia.org/wiki/Stochastic_models" target="_blank" rel="noopener"
&gt;stochastic models&lt;/a&gt;, &lt;a class="link" href="http://en.wikipedia.org/wiki/Queueing_theory" target="_blank" rel="noopener"
&gt;queueing theory&lt;/a&gt;, &lt;a class="link" href="http://en.wikipedia.org/wiki/Game_theory" target="_blank" rel="noopener"
&gt;game theory&lt;/a&gt;, and &lt;a class="link" href="http://en.wikipedia.org/wiki/Simulation" target="_blank" rel="noopener"
&gt;simulation&lt;/a&gt;. The &lt;a class="link" href="http://www.or.ncsu.edu/academics/msc.php" target="_blank" rel="noopener"
&gt;course requirements&lt;/a&gt; for OR PhD students at my university provide an upper bound for this problem: the only courses everyone must take are linear programming, nonlinear programming, and stochastic modeling. Some topics are surprisingly optional, in particular simulation, statistics, and integer programming/combinatorial optimization. John suggests that statistics PhD programs are similar. Topics diverge rather quickly after first-year courses. Are all graduate programs like this? Is this a necessary evil (or evil at all)?&lt;/p&gt;</description></item><item><title>Carrots, Oatmeal, Operations Research</title><link>https://tdhopper.com/blog/carrots-oatmeal-operations-research/</link><pubDate>Mon, 09 Jan 2012 00:00:00 +0000</pubDate><guid>https://tdhopper.com/blog/carrots-oatmeal-operations-research/</guid><description>&lt;p&gt;A former grad school classmate of mine lived largely off of oatmeal and carrots. Suppose he wanted to be sure to get 2000 calories per day, 60 milligrams of vitamin C per day, and no more than 40 grams of fat per day. How should he balance his intake of oatmeal and carrots while minimizing his expense at the grocery store? Suppose he eats \(C\) pounds of carrots per day and \(Q\) pounds of dry oatmeal. A pound of carrots contains 172 calories; a pound of oatmeal contains 1742 calories&lt;sup id="fnref:1"&gt;&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref"&gt;1&lt;/a&gt;&lt;/sup&gt;. The total number of calories he gets in a day is \(172\cdot C+1742\cdot Q.\) To be sure he gets an adequate number of calories, he needs &lt;/p&gt;
$$172\cdot C+1742\cdot Q\geq 2000.$$&lt;p&gt; A pound of carrots contains 0.839 grams of fat, and a pound of oatmeal contains 25 grams of fat. Since the total amount of fat must be no more than 40 grams: &lt;/p&gt;
$$0.839\cdot C+25\cdot Q\leq 40.$$&lt;p&gt; A pound of carrots contains 19 milligrams of vitamin C, and a pound of oatmeal contains 15 milligrams of vitamin C. Since the total amount of vitamin C must be at least 60: &lt;/p&gt;
$$19\cdot C+15\cdot Q\geq 60.$$&lt;p&gt; If  you graph these three inequalities, you get the blue region shown below. Any point in the blue region represents a combination of carrots and oatmeal that would provide sufficient nutrients without giving too much fat. What we want to know is which combination is least expensive.&lt;/p&gt;
&lt;figure&gt;&lt;img src="https://tdhopper.com/uploads/2012/01/Whats-a-Stigler-Diet-Feasible1.gif"&gt;
&lt;/figure&gt;
&lt;p&gt;A pound of carrots and a pound of oatmeal both cost about $0.50. So, we want &lt;em&gt;to minimize&lt;/em&gt; the function \(0.5\cdot C+0.5\cdot Q\) while still staying inside the blue region. We can write all of these things in a form that operations researchers call a &amp;ldquo;linear program.&amp;rdquo;&lt;sup id="fnref:2"&gt;&lt;a href="#fn:2" class="footnote-ref" role="doc-noteref"&gt;2&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
$$
\begin{align}\text{minimize }\\;\\; &amp; 0.5\cdot C+0.5\cdot Q\\\\
\text{subject to }\\;\\; &amp;172\cdot C+1742\cdot Q\geq 2000\\\\
&amp;0.839\cdot C+25\cdot Q\leq 40\\\\
&amp;19\cdot C+15\cdot Q\geq 60\\\\
&amp;C\geq 0, \\;Q\geq 0\end{align}
$$&lt;p&gt;It turns out that the optimal solution is for my friend to eat 2.44 pounds of carrots per day and 0.91 pounds of oatmeal. With that combination, he&amp;rsquo;ll get all his nutrients, restrict his fat, and keep his grocery store bill as low as possible (about $1.67 per day!). Below, I show another graph of the possible carrot-oatmeal combinations. This time, the color of a point represents the cost of that diet. The optimal diet occurs at the black dot, i.e. where the graph is most red. The most expensive diet occurs at the bottom-right corner, where the graph is most blue. That diet corresponds to eating nearly 48 pounds of carrots per day!&lt;/p&gt;
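The diet above is small enough to solve by hand, but it can also be checked numerically. Here is a minimal Python sketch (fresh illustrative code, not from the original post; a real model would use a proper LP solver): because a two-variable linear program attains its minimum at a vertex of the feasible region, it suffices to enumerate the pairwise intersections of the constraint boundaries and keep the cheapest feasible one.

```python
from itertools import combinations

# Constraint boundaries a*C + b*Q = c (the three diet constraints
# plus the two axes C = 0 and Q = 0).
lines = [
    (172.0, 1742.0, 2000.0),  # calories >= 2000
    (0.839, 25.0, 40.0),      # fat <= 40
    (19.0, 15.0, 60.0),       # vitamin C >= 60
    (1.0, 0.0, 0.0),          # C >= 0
    (0.0, 1.0, 0.0),          # Q >= 0
]

def feasible(C, Q, tol=1e-6):
    """True when (C, Q) satisfies every diet constraint (with slack for float error)."""
    return (172.0 * C + 1742.0 * Q >= 2000.0 - tol
            and 0.839 * C + 25.0 * Q <= 40.0 + tol
            and 19.0 * C + 15.0 * Q >= 60.0 - tol
            and C >= -tol and Q >= -tol)

# A two-variable LP attains its minimum at a vertex of the feasible
# region, so checking every pairwise boundary intersection suffices.
best = None
for (a1, b1, c1), (a2, b2, c2) in combinations(lines, 2):
    det = a1 * b2 - a2 * b1
    if abs(det) < 1e-12:  # parallel boundaries never intersect
        continue
    # Cramer's rule for the 2x2 system of boundary equations.
    C = (c1 * b2 - c2 * b1) / det
    Q = (a1 * c2 - a2 * c1) / det
    if feasible(C, Q):
        cost = 0.5 * C + 0.5 * Q
        if best is None or cost < best[0]:
            best = (cost, C, Q)

cost, C, Q = best
print(f"carrots: {C:.2f} lb, oatmeal: {Q:.2f} lb, cost: ${cost:.2f}")
```

This recovers the solution in the post: about 2.44 pounds of carrots and 0.91 pounds of oatmeal per day, at the intersection of the calorie and vitamin C constraints.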
&lt;p&gt;&lt;a class="link" href="https://tdhopper.com/uploads/2012/01/Whats-a-Stigler-Diet-Temp.gif" &gt;&lt;figure&gt;&lt;img src="https://tdhopper.com/uploads/2012/01/Whats-a-Stigler-Diet-Temp.gif"&gt;
&lt;/figure&gt;
&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Operations researchers use linear programming models like this one to solve all kinds of problems. While I don&amp;rsquo;t know of anyone who uses linear programming to plan their diet, the economist George Stigler suggested many years ago that it is possible. I will follow up with another post about &lt;a class="link" href="http://en.wikipedia.org/wiki/Stigler_diet" target="_blank" rel="noopener"
&gt;Stigler&amp;rsquo;s Diet problem&lt;/a&gt;.&lt;/p&gt;
&lt;div class="footnotes" role="doc-endnotes"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:1"&gt;
&lt;p&gt;See &lt;a class="link" href="http://www.wolframalpha.com/input/?i=pound&amp;#43;of&amp;#43;carrots%2C&amp;#43;pound&amp;#43;of&amp;#43;oatmeal" target="_blank" rel="noopener"
&gt;http://www.wolframalpha.com/input/?i=pound+of+carrots%2C+pound+of+oatmeal&lt;/a&gt;&amp;#160;&lt;a href="#fnref:1" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:2"&gt;
&lt;p&gt;&lt;em&gt;Program&lt;/em&gt; doesn&amp;rsquo;t refer to computer programming. It actually goes back to an older use of the word related to planning something out.&amp;#160;&lt;a href="#fnref:2" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</description></item><item><title>👋 Hey, I'm Tim Hopper!</title><link>https://tdhopper.com/blog/welcome/</link><pubDate>Fri, 01 Jan 2010 00:00:00 +0000</pubDate><guid>https://tdhopper.com/blog/welcome/</guid><description>&lt;h1 id="-hey-im-tim-hopper"&gt;👋 Hey, I&amp;rsquo;m Tim Hopper!
&lt;/h1&gt;&lt;p&gt;I&amp;rsquo;m an experienced machine learning platform engineer and Python developer. You can check out my resume at &lt;a class="link" href="https://resume.tdhopper.com" target="_blank" rel="noopener"
&gt;resume.tdhopper.com&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;For over 10 years, I&amp;rsquo;ve helped companies solve business problems with machine learning in domains such as audio information retrieval, banking, cybersecurity, environmental science, and
weather forecasting. I see my role as helping data scientists and researchers shorten feedback loops and spend time on their business problems (instead of fussing with cloud resources).&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;m also excited about developer productivity, especially in Python development. I am the author of an &lt;a class="link" href="https://pydevtools.com" target="_blank" rel="noopener"
&gt;ebook on Python developer tooling&lt;/a&gt; and like to help teams use Python more effectively.&lt;/p&gt;
&lt;p&gt;✍🏻 &lt;strong&gt;Writings&lt;/strong&gt;:&lt;/p&gt;
&lt;p&gt;&lt;a class="link" href="https://tdhopper.com" target="_blank" rel="noopener"
&gt;tdhopper.com&lt;/a&gt; has been a place for my thoughts and writings since grad school. If you&amp;rsquo;re new here, start with these:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://tdhopper.com/blog/some-reflections-on-being-turned-down-for-a-lot-of-data-science-jobs/" target="_blank" rel="noopener"
&gt;Some Reflections on Being Turned Down for a Lot of Data Science Jobs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://tdhopper.com/blog/how-i-became-a-data-scientist/" target="_blank" rel="noopener"
&gt;How I Became a Data Scientist Despite Having Been a Math Major&lt;/a&gt; and &lt;a class="link" href="https://tdhopper.com/blog/faq/" target="_blank" rel="noopener"
&gt;A Subjective and Anecdotal FAQ on Becoming a Data Scientist&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://tdhopper.com/blog/goodnight-zoom/" target="_blank" rel="noopener"
&gt;Goodnight Zoom&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://tdhopper.com/blog/entropy-of-a-discrete-probability-distribution/" target="_blank" rel="noopener"
&gt;Entropy of a Discrete Probability Distribution&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;🖥️ &lt;strong&gt;Personal Projects&lt;/strong&gt;:&lt;/p&gt;
&lt;p&gt;I have an &lt;a class="link" href="https://podcast.tdhopper.com" target="_blank" rel="noopener"
&gt;occasional podcast&lt;/a&gt; in which I talk to friends about things they&amp;rsquo;re interested in. You can find it in your podcast directory or at &lt;a class="link" href="https://podcast.tdhopper.com" target="_blank" rel="noopener"
&gt;podcast.tdhopper.com&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Years ago, I created &lt;a class="link" href="https://tdhopper.com/phd" target="_blank" rel="noopener"
&gt;Should I Get a PhD?&lt;/a&gt; where I interviewed nine friends about whether a young, bright student should consider pursuing a PhD. This is the resource I wish I&amp;rsquo;d had before starting a PhD program, and it&amp;rsquo;s been useful to many.&lt;/p&gt;
&lt;p&gt;&lt;a class="link" href="https://pythonplot.com" target="_blank" rel="noopener"
&gt;Python Plotting for Exploratory Data Analysis&lt;/a&gt; is a Rosetta Stone for Python plotting libraries, and it also compares them to the GOAT of plotting libraries: ggplot.&lt;/p&gt;
&lt;p&gt;I created &lt;a class="link" href="https://dp.tdhopper.com" target="_blank" rel="noopener"
&gt;Notes on Dirichlet Processes&lt;/a&gt; after working on a &lt;a class="link" href="https://www.darpa.mil" target="_blank" rel="noopener"
&gt;DARPA&lt;/a&gt;-funded open source project for developing &lt;a class="link" href="http://datamicroscopes.github.io" target="_blank" rel="noopener"
&gt;Bayesian nonparametric models in Python&lt;/a&gt;. I did a lot of work to understand Bayesian nonparametrics and derive the Gibbs sampler for Hierarchical Dirichlet Processes. &lt;a class="link" href="https://dp.tdhopper.com" target="_blank" rel="noopener"
&gt;Notes on Dirichlet Processes&lt;/a&gt; shares what I learned for the benefit of others.&lt;/p&gt;
&lt;p&gt;I enjoy wildlife and nature photography in my free time. &lt;a class="link" href="https://photos.tdhopper.com" target="_blank" rel="noopener"
&gt;dothopper photo&lt;/a&gt; is my gallery.&lt;/p&gt;
&lt;p&gt;&lt;a class="link" href="https://freespace.tdhopper.com" target="_blank" rel="noopener"
&gt;Free Disk Space&lt;/a&gt; is a little site I maintain with commands for freeing up disk space on your computer.&lt;/p&gt;
&lt;p&gt;⌨️ &lt;strong&gt;Open Source&lt;/strong&gt;:&lt;/p&gt;
&lt;p&gt;I love to contribute to open source as I&amp;rsquo;m able. I&amp;rsquo;ve contributed to libraries like
&lt;a class="link" href="https://github.com/python/cpython/commits?author=tdhopper" target="_blank" rel="noopener"
&gt;cpython&lt;/a&gt;,
&lt;a class="link" href="https://datamicroscopes.github.io/" target="_blank" rel="noopener"
&gt;datamicroscopes&lt;/a&gt;,
&lt;a class="link" href="https://github.com/parsely/streamparse/commits?author=tdhopper" target="_blank" rel="noopener"
&gt;Streamparse&lt;/a&gt;,
&lt;a class="link" href="https://github.com/conda/conda/commits?author=tdhopper" target="_blank" rel="noopener"
&gt;Conda&lt;/a&gt;,
&lt;a class="link" href="https://github.com/ariddell/lda/commits?author=tdhopper" target="_blank" rel="noopener"
&gt;lda&lt;/a&gt;,
and
&lt;a class="link" href="https://github.com/pandas-dev/pandas/commits?author=tdhopper" target="_blank" rel="noopener"
&gt;Pandas&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;👨🏻‍💻 &lt;strong&gt;Social Media&lt;/strong&gt;:&lt;/p&gt;
&lt;p&gt;You can find me on &lt;a class="link" href="https://twitter.com/tdhopper" target="_blank" rel="noopener"
&gt;Twitter&lt;/a&gt; and &lt;a class="link" href="https://www.linkedin.com/in/tdhopper/" target="_blank" rel="noopener"
&gt;LinkedIn&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;🗣️ &lt;strong&gt;Talks&lt;/strong&gt;:&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;ve been speaking at conferences and meetups for many years. I keep a list of my recorded talks &lt;a class="link" href="http://tdhopper.com/talks/" target="_blank" rel="noopener"
&gt;here&lt;/a&gt;. If you&amp;rsquo;d like to get a taste of my talks, start with &lt;a class="link" href="https://www.youtube.com/watch?v=6flt_3yMNb0" target="_blank" rel="noopener"
&gt;Five semesters of linear algebra and all I do is solve Python dependency problems&lt;/a&gt; or &lt;a class="link" href="https://www.youtube.com/watch?v=OS6j1Y2tJLY" target="_blank" rel="noopener"
&gt;Challenges in Applying Machine Learning to Cybersecurity&lt;/a&gt;.&lt;/p&gt;</description></item><item><title>On Ramsey Numbers for Quadrilaterals in 3 and 4 Colors.</title><link>https://tdhopper.com/blog/on-ramsey-numbers-for-quadrilaterals-in-3-and-4-colors./</link><pubDate>Mon, 07 Jan 2008 21:00:00 +0000</pubDate><guid>https://tdhopper.com/blog/on-ramsey-numbers-for-quadrilaterals-in-3-and-4-colors./</guid><description>&lt;img src="https://tdhopper.com/images/graph.png" alt="Featured image of post On Ramsey Numbers for Quadrilaterals in 3 and 4 Colors." /&gt;&lt;p&gt;Last summer, I had the privilege of serving as a student researcher in the &lt;a class="link" href="https://www.rit.edu/science/school-mathematics-and-statistics" target="_blank" rel="noopener"
&gt;Rochester Institute of Technology&lt;/a&gt;&amp;rsquo;s Research Experience for Undergraduates alongside &lt;a class="link" href="https://en.wikipedia.org/wiki/Stanis%C5%82aw_Radziszowski" target="_blank" rel="noopener"
&gt;Dr. Stanislaw Radziszowski&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Our research focused on computational combinatorics, specifically &lt;a class="link" href="https://mathworld.wolfram.com/RamseyNumber.html" target="_blank" rel="noopener"
&gt;Ramsey Numbers&lt;/a&gt;. My work focused on the Ramsey number \(R_4(C_4)\), which is the smallest number \(n\) where, if you color all the edges of a complete graph with \(n\) vertices (called \(K_n\)) using four colors, you&amp;rsquo;ll always get a monochromatic \(C_4\) (a 4-cycle all in the same color). Before our work, the number was known to be 18 or 19. I wrote fast, distributed programs to generate \(C_4\)-free 4-colorings of \(K_{17}\) with the hope that an 18th node could be added while preserving the \(C_4\)-free property. Had we discovered such a graph, we would have proven the Ramsey number to be 18.&lt;/p&gt;
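The core test those programs ran can be sketched in a few lines of Python (a fresh illustration, not the original distributed code): a color class of an edge coloring contains a 4-cycle exactly when some pair of vertices has at least two common neighbors within that color.

```python
from itertools import combinations

def has_mono_c4(n, coloring):
    """Return True if an edge-colored K_n contains a monochromatic C_4.

    `coloring` maps each edge (u, v) with u < v to a color.  A color
    class contains a 4-cycle exactly when some pair of vertices has
    two or more common neighbors within that color.
    """
    for color in set(coloring.values()):
        # Adjacency sets restricted to this color class.
        adj = {v: set() for v in range(n)}
        for (u, v), c in coloring.items():
            if c == color:
                adj[u].add(v)
                adj[v].add(u)
        for u, v in combinations(range(n), 2):
            if len(adj[u] & adj[v]) >= 2:
                return True
    return False

# A one-colored K_4 is a C_4 plus chords, so it must contain one:
print(has_mono_c4(4, {e: 0 for e in combinations(range(4), 2)}))  # True

# Round-robin 4-coloring of K_5: each color gets at most three edges,
# too few to form a 4-cycle, so no monochromatic C_4 exists:
k5 = {e: i % 4 for i, e in enumerate(combinations(range(5), 2))}
print(has_mono_c4(5, k5))  # False
```

The real search, of course, ran checks like this over enormous families of candidate colorings of \(K_{17}\) and \(K_{18}\), where clever pruning matters far more than the per-coloring test.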
&lt;p&gt;This week, my research partner Evan Heidtmann and I presented our research at the &lt;a class="link" href="https://jointmathematicsmeetings.org/meetings/national/jmm/2109_program_monday.html" target="_blank" rel="noopener"
&gt;AMS-MAA-SIAM Special Session on Research in Mathematics by Undergraduates&lt;/a&gt; in San Diego. Here is our abstract:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;We discuss the two multicolor Ramsey numbers concerning 4-cycles in 3 and 4 colors. For 3 colors, we find that there are exactly 1000 nonisomorphic critical colorings of \(K_{10}\) for the Ramsey number \(R_3(C_4)\) = 11, verifying our results using two independent computations. One of these colorings contains the Petersen graph as one of the colors and is more symmetric than all published colorings for this Ramsey number. In 4 colors, we were not able to improve the currently best known bounds \(18 ≤ R_4(C_4) ≤ 19\), but we gather extensive computational evidence and then conjecture that no 4-coloring of \(K_{18}\) can avoid monochromatic \(C_4\)’s. We generate more than 28,000 nonisomorphic \(C_4\)-free 4-colorings of \(K_{17}\) (only 2 of which were previously published), but none can be extended to successfully color \(K_{18}\). Several searches, both heuristic and deterministic, also failed to produce a desired coloring. An exhaustive search seems to be extremely difficult computationally, even with all known constraints. We conjecture that \(R_4(C_4) = 18\).&lt;/p&gt;
&lt;/blockquote&gt;</description></item><item><title>More from Tim</title><link>https://tdhopper.com/more-from-tim/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://tdhopper.com/more-from-tim/</guid><description/></item><item><title>Search</title><link>https://tdhopper.com/search/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://tdhopper.com/search/</guid><description/></item></channel></rss>