Showing posts with label programming. Show all posts

Wednesday, July 21, 2010

Distractions -- there's an app for that

Today I finally gave in to temptation & developed a Hello World application for my Droid. Okay, developed is a gross overstatement -- I successfully followed a recipe. But it did take a while to install the SDK & its plugin for the Eclipse environment, plus the necessary device driver so I can debug stuff on my phone.

Since I purchased my Droid in November, the idea of writing something for it has periodically tempted me. Indeed, one attraction of Scala (which I've done little with for weeks) was that it can be used to write Android apps, though it definitely means a new layer of complexity. This week's caving in had two drivers.

First, Google last week announced AppInventor, a "you can write an app even if you can't program" tool for novices. I rushed to try it out, only to find that they hadn't actually made it available -- only a registration form. Supposedly they'll get back to you, but they haven't yet. Perhaps it's because I'm not an educator -- the form has lots of fields aimed at educators.

The second trigger is that an Android book I had requested came in at the library. Now, it's for an OS version a few releases back -- but certainly okay for a start (keeping public library collections current on technical topics is a quixotic task in my opinion, though I do enjoy the fruits of the effort). So that was my train reading this morning & it got me stoked. The book is certainly not much more than a starting springboard -- I'm debating buying one called "Advanced Android Programming" (or something close to that) or just sponging off on-line resources.

The big question is what to do next. The general challenge is choosing between apps that don't do anything particularly sophisticated but are clearly doable vs. more interesting apps that might be a bit much to take on -- especially given the challenge of operating a simulator for a device very unlike my laptop (accelerometers! GPS!). I have a bunch of ideas for silly games or demos, most of which shouldn't be too hard -- and then one concept that could be somewhat cool but would also really push the envelope on difficulty.

It would be nice to come up with something practical for my work, but right now I don't have many ideas in that area. Given that most of the datasets I work with now are enormous, it's hard to see any point in trying to access them via phone. A tiny browser for the UCSC genome database has some appeal, but that's sounding a bit ambitious.

If I were still back at Codon Devices, I could definitely see some app opportunities, either to demo "tech cred" or to be really useful. For example, at one point we were developing (through an outsourced vendor) a drag-and-drop gene design interface. The full version probably wouldn't be very app-appropriate, but something along those lines could be envisioned -- call up any protein out of Entrez & have it codon optimized with appropriate constraints & sent to the quoting system. In our terminal phase, it would have been very handy to have a phone app to browse metabolic databases such as KEGG or BioCyc.

That thought has suggested what I would develop if I were back in school. There is a certain amount of simple rote memorization that is either demanded or turns out to expedite later studies. For example, I really do feel you need to memorize the single letter IUPAC codes for nucleotides and amino acids. I remember having to memorize amino acid structures and the Krebs cycle and glycolysis and all sorts of organic synthesis reactions and so forth. I often devised either decks of flash cards or study sheets, which I would look at while standing in line for the cafeteria or other bits of solitary time. Some of those decks were a bit sophisticated -- for the pathways I remember making both compound-centric and reaction-centric cards for the same pathways. That sort of flashcard app could be quite valuable -- and perhaps even profitable if you could get students to try it out. I can't quite see myself committing to such a business, even as a side-line, so I'm okay with suggesting it here.

Saturday, May 22, 2010

Just say no to programming primitivism

A consistently reappearing thread in any bioinformatics discussion space is "What programming language should I learn/teach?". As one might expect, just about every language under the sun has some proponents (still waiting -- hopefully forever -- for a BioCobol fan), but the responses tend to cluster into a few camps & someone could probably carefully classify the arguments for each language into a small number of bins. I'm not going to do that, but each time I see these threads I do try to evaluate my own muddled opinions in this space. I've been debating writing one long post on some of the recent traffic, but the areas I think worth commenting on are distinct enough that they can't really fit well into a single body. So, another entry in my erratic & indeterminate series of thematic posts.

One viewpoint I strongly disagree with was stated in one thread on SEQAnswers:
Learn C and bash and the most basic stuff first. LEARN vi as your IDE and your word processor and your only way of knowing how to enter text. Understand how to log into a machine with the most basic of linux available and to actually do something functional to bring it back to life. There will be times when there is no python, no jvm, no eclipse. If you cannot function in such an environment then you are shooting yourself in the foot.


Yes, there is something to be admired about being able to be dropped in the wilderness with nothing but a pocketknife and emerging alive. But the reality is that this is a very rare occurrence. Similarly, it is a neat trick to be able to work in a completely bare bones computing environment -- but few will ever face this. Nearly twenty years in the business, and I have yet to encounter such a situation.

The cost of such an attitude is what worries me. First, the demands of such a primitivist approach to programming will drive a lot of people out very early. That may appeal to some people, but not me. I would like to see as many people as possible get a taste of programming. In order to do that, you need to focus on stripping away the impediments and roadblocks which will trip up a newcomer. So from this viewpoint, a good IDE is not only desirable but near essential. Having to fire up a debugger and learn some terse syntax for exploring your code's behavior is far more daunting than a good graphical IDE. Similarly, the sort of down-to-the-metal programming that C enables is very undesirable; you want a newcomer to be able to focus on program design and not on tracking down memory leaks. Also, I believe Object Oriented Programming should be learned early, perhaps from the very beginning. That's easily the subject of an entire post. Finally, I strongly believe the first language learned should have powerful inherent support for advanced collection types such as associative arrays (aka hashtables or dictionaries).
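To make the associative-array point concrete, here is a minimal sketch in Scala (the sequence string is invented for illustration): tallying character frequencies with a built-in Map, with no memory management or library hunting for the newcomer to stumble over.

```scala
// Count base frequencies in a (made-up) DNA string using Scala's
// built-in collections -- the kind of task that is a one-liner with
// an associative array but a chore in bare C.
object BaseCounter {
  def main(args: Array[String]): Unit = {
    val seq = "GATTACAGATTACA" // hypothetical input sequence
    // groupBy builds a Map[Char, String]; mapping each group to its
    // length yields a Map[Char, Int] of counts
    val counts = seq.groupBy(identity).map { case (base, occs) => (base, occs.length) }
    for ((base, n) <- counts.toSeq.sortBy(_._1))
      println(base + ": " + n)
  }
}
```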

Once you have passed those tests, then I get much less passionate. I increasingly believe Perl should only be taught as a handy text mangler and not a language in which to develop large systems -- but still break those rules daily (and will probably use Perl as a core piece of my teaching this summer). Python is generally what I recommend to others -- I simply am not comfortable enough in it to teach it. I'm liking Scala, but should it be a first language? I'm not quite ready to make that leap. Java or C#? Not bad choices either. R? Another one I don't really feel comfortable to teach (though there are some textbooks to help me get past that discomfort).

Thursday, January 28, 2010

A little more Scala

I can't believe how thrilled I was to get a run-time error today! Because that was the first sign I had gotten past the Scala roadblock I mentioned in my previous post. It would have been nicer for the case to just work, but apparently my SAM file was incomplete or corrupt. But, moments later it ran correctly on a BAM file. For better or worse, I deserve nearly no credit for this step forward -- Mr. Google found me a key code example.

The problem I faced is that I have a Java class (from the Picard library for reading alignment data in SAM/BAM format). To get each record, an iterator is provided. But my first few attempts to guess the syntax just didn't work, so it was off to Google.

My first working version is:

package hello
import java.io.File
import org.biojava.bio.Annotation
import org.biojava.bio.seq.Sequence
import org.biojava.bio.seq.impl.SimpleSequence
import org.biojava.bio.symbol.SymbolList
import org.biojava.bio.program.abi.ABITrace
import org.biojava.bio.seq.io.SeqIOTools
import net.sf.samtools.SAMFileReader

object HelloWorld extends Application {

  val samFile = new File("C:/workspace/short-reads/aln.se.2.sorted.bam")
  val inputSam = new SAMFileReader(samFile)
  var counter = 0

  val recs = inputSam.iterator
  while (recs.hasNext) {
    val samRec = recs.next
    counter = counter + 1
  }

  println("records: " + counter)
}

Ah, sweet success. But while that's a step forward, it doesn't really play with anything novel that Scala lends me. The example I found this in was actually implementing something richer, which I then borrowed (same imports as before).

First, I define a class which wraps an iterator and defines a foreach method:

class IteratorWrapper[A](iter: java.util.Iterator[A]) {
  def foreach(f: A => Unit): Unit = {
    while (iter.hasNext) {
      f(iter.next)
    }
  }
}

Second is the definition, within the body of my object, of a rule which allows iterators to be automatically converted to my wrapper object. Now, this sounds powerfully dangerous (and vice versa). A key constraint is that Scala won't do this if there is any ambiguity -- if there are multiple legal conversions to choose from, it won't compile. Finally, I rewrite the loop using the foreach construct.

object HelloWorld extends Application {
  implicit def iteratorToWrapper[T](iter: java.util.Iterator[T]): IteratorWrapper[T] = new IteratorWrapper[T](iter)

  val samFile = new File("C:/workspace/short-reads/aln.se.2.sorted.bam")
  val inputSam = new SAMFileReader(samFile)
  var counter = 0

  for (samRec <- inputSam.iterator) {
    counter = counter + 1
  }

  println("records: " + counter)
}

Is this really better? Well, I think so -- for me. The code is terse but still clear. This also saves a lot of looking up some standard wordy idioms -- for some reason I never quite locked in the standard read-lines-one-at-a-time loop in C# -- always had to copy an example.
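Speaking of wordy idioms I could never memorize: the read-lines-one-at-a-time loop is short enough in Scala to actually stick. A self-contained sketch (the temp-file setup is only there so the example runs standalone; in real use you'd open your own file):

```scala
import java.io.{File, PrintWriter}
import scala.io.Source

object LineReader {
  def main(args: Array[String]): Unit = {
    // Create a tiny demo file so the example is self-contained
    val demo = File.createTempFile("reads", ".txt")
    val pw = new PrintWriter(demo)
    pw.println("@read1")
    pw.println("GATTACA")
    pw.close()

    // The read-lines idiom: getLines() returns an Iterator[String],
    // so a for loop (or foreach) consumes the file line by line
    val source = Source.fromFile(demo)
    for (line <- source.getLines())
      println(line)
    source.close()
    demo.delete()
  }
}
```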

You can take some of this a bit far in Scala -- the syntax allows a lot of flexibility and some of the examples in the O'Reilly book are almost scary. I probably once would have been inspired to write my own domain specific language within Scala, but for now I'll pass.

Am I taking a performance hit with this? Good question -- I'm sort of trusting that the Scala compiler is smart enough to treat this all as syntactic sugar, but for most of what I do performance is well behind readability and ease of coding & maintenance. Well, until the code becomes painfully slow.

I don't have them in front of me, but I can think of examples from back at Codon where I wanted to treat something like an iterator -- especially a strongly typed one. C# does let you use foreach loops over anything which implements the IEnumerable interface, but it can get tedious to wrap everything up when using a library class which I think should implement IEnumerable but whose designer didn't.

I still have some playing to do, but maybe soon I'll put something together that I didn't have code to do previously. That would be a serious milestone.

Wednesday, January 27, 2010

The Scala Experiment

Well, I've taken the plunge -- yet another programming language.

I've written before about this. It's also a common question on various professional bioinformatics discussion boards: which programming language to use.

It is a decent time to ponder some sort of shift. I've written a bit of code, but not a lot -- partly because I've been more disciplined about using libraries as much as possible (versus rolling my own) but mostly because coding is a small -- but critical -- slice of my regular workflow.

At Codon I had become quite enamored with C#. Especially with the Visual Studio Integrated Development Environment (IDE), I found it very productive and a good fit for my brain & tastes. But, as a bioinformatics language it hasn't found much favor. That means no good libraries out there, so I must build everything myself. I've knocked out basic bioinformatics libraries a number of times (read FASTA, reverse complement a sequence, translate to protein, etc), but I don't enjoy it -- and there are plenty of silly mistakes that can be easy to make but subtle enough to resist detection for an extended period. Plus, there are other things I really don't feel like writing -- like my own SAM/BAM parser. I did have one workaround for this at Codon -- I could tap into Python libraries via a package called Python.NET, but it imposed a severe performance penalty & I would have to write small (but annoying) Python glue code. The final straw is that I'm finding it essential to have a Linux (Ubuntu) installation for serious second-generation sequencing analysis (most packages do not compile cleanly -- if at all -- in my hands on a Windows box using MinGW or Cygwin).

The obvious fallback is Perl -- which is exactly how I've fallen so far. I'm very fluent with it & the appropriate libraries are out there. I've just gotten less and less fond of the language & its many design kludges (I haven't quite gotten to my brother's opinion: Perl is just plain bad taste). I lose a lot of time with stupid errors that could have been caught at compile time with more static typing. It doesn't help that I have (until recently) been using the Perl mode in Emacs as my IDE -- once you've used a really polished tool like Visual Studio you realize how primitive that is.

Other options? There's R, which I must use for certain projects (microarrays) due to the phenomenal set of libraries out there. But R has just never been an easy fit for me -- somehow I just don't grok it. I did write a little serious Python (i.e. not just glue code) at Codon & I could see myself getting into it if I had peers also working in it -- but I don't. Infinity, like many company bioinformatics groups, is pretty much a C# shop, though with ecumenical attitudes towards any other language. I've also realized I need at least a basic comprehension of Ruby, as I'm starting to encounter useful code in it. But, as with Python, I can't seem to quite push myself to switch over -- it doesn't appeal to me enough to kick the Perl habit.

While playing around with various second generation sequencing analysis tools, I stumbled across a bit of weird code in the Broad's Genome Analysis ToolKit (GATK) -- a directory labeled "scala". Turns out, that's yet another language -- and one that has me intrigued enough to try it out.

My first bit of useful code (derived from a Hello World program that I customized to output in canine) is below and gives away some of the intriguing features. This program goes through a set of ABI trace files that fit a specific naming convention and writes out FASTA of their sequences to STDOUT:

package hello
import java.io.File
import org.biojava.bio.Annotation
import org.biojava.bio.seq.Sequence
import org.biojava.bio.seq.impl.SimpleSequence
import org.biojava.bio.symbol.SymbolList
import org.biojava.bio.program.abi.ABITrace
import org.biojava.bio.seq.io.SeqIOTools

object HelloWorld extends Application {

  for (i <- 1 to 32) {
    val lz = new java.text.DecimalFormat("00")
    var primerSuffix = "M13F(-21)"
    val fnPrefix = "C:/somedir/readprefix-"
    if (i > 16) primerSuffix = "M13R"
    val fn = fnPrefix + lz.format(i) + "-" + primerSuffix + ".ab1"
    val traceFile = new File(fn)
    val name = traceFile.getName()
    val trace = new ABITrace(traceFile)
    val symbols = trace.getSequence()
    val seq = new SimpleSequence(symbols, name, name, Annotation.EMPTY_ANNOTATION)
    SeqIOTools.writeFasta(System.out, seq)
  }
}

A reader might ask, "Wait a minute -- what's all this java.this and biojava.that in there?" This is one of the appeals of Scala -- it compiles to Java Virtual Machine bytecode and can pretty much freely use Java libraries. Now, I mentioned this to a colleague and he pointed out there is Jython (a Python-to-JVM compiler), which reminded me of references to JRuby (a Ruby-to-JVM compiler). So perhaps I should revisit my skipping over those two languages. But in any case, in theory Scala can cleanly drive any Java library.

The example also illustrates something that I find a tad confusing. The book keeps stressing that Scala is statically typed -- but I didn't declare a type for any of my variables above! However, I could have -- so I can get the type safety I find very useful when I want it (or hold myself to it -- it will take some discipline) but can also ignore it in many cases.
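A minimal sketch of what that flexibility looks like (the variable names are invented for illustration): the compiler infers a static type when the annotation is omitted, but you can always spell it out -- and either way a type mismatch is caught at compile time, not run time.

```scala
object TypeInferenceDemo {
  def main(args: Array[String]): Unit = {
    val count = 42              // type Int is inferred at compile time
    val label: String = "reads" // type declared explicitly
    // count = "oops"           // would not compile: val is immutable
    // val n: Int = "seven"     // would not compile: type mismatch
    println(label + ": " + count)
  }
}
```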

Scala has a lot in it, most of which I've only read about in the O'Reilly book & haven't tried. It borrows from both the Object Oriented Programming (OOP) lore and Functional Programming (FP). OOP is pretty much old hat, as most modern languages are OO and if not (e.g. Perl) the language supports it. Some FP constructs will be very familiar to Perl programmers -- I've written a few million anonymous functions to customize sorting. Others, perhaps not so much. And, like most modern languages all sorts of things not strictly in the language are supplied by libraries -- such as a concurrency model (Actors) that shouldn't be as much of a swamp as trying to work with threads (at least when I tried to do it way back yonder under Java). Scala also has some syntactic flexibility that is both intriguing and scary -- the opportunities for obfuscating code would seem endless. Plus, you can embed XML right in your file. Clearly I'm still at the "look at all these neat gadgets" phase of learning the language.

Is it a picnic? No, clearly not. My second attempt at a useful Scala program is a bit stalled -- I haven't figured out quite how to rewrite a Java example from the Picard (Java implementation of SAMTools) library into Scala -- my tries so far have raised errors. Partly that's because the particular Java idiom being used was unfamiliar -- if I thought Scala was a way to avoid learning modern Java, I was quite deluding myself. And, I did note that tonight, when I had something critical to get done on my commute, I reached for Perl. There are still a lot of idioms I need to relearn -- constructing & using regular expressions, parsing delimited text files, etc. Plus, it doesn't help that I'm learning a whole new development environment (Eclipse) virtually simultaneously -- though there is Eclipse support for all of the languages it looks like I might be using (Java, Scala, Perl, Python, Ruby), so that's a good general tool to have under my belt.
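For my own future reference, those two idioms turn out to be compact in Scala as well; a sketch with made-up data (the field names and record layout are invented for illustration):

```scala
object IdiomDemo {
  def main(args: Array[String]): Unit = {
    // Regular expressions: .r turns a string into a scala.util.matching.Regex
    val GeneField = """gene=(\w+)""".r
    val hit = GeneField.findFirstMatchIn("id=7;gene=lacZ;len=3075")
    println(hit.map(_.group(1)).getOrElse("none")) // prints lacZ

    // Delimited text: split a tab-separated record into fields
    val fields = "chr1\t1000\t2000".split("\t")
    println(fields(1)) // prints 1000
  }
}
```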

If I do really take this on, then the last decision is how much of my code to convert to Scala. I haven't written a lot of code -- but I haven't written none either. Some just won't be relevant anymore (one offs or cases where I backslid and wrote code that is redundant with free libraries) but some may matter. It probably won't be hard to just do a simple transformation into Scala -- but I'll probably want to go whole-hog and show off (to myself) my comprehension of some of the novel (to me) aspects of the language. That would really up the ante.

Monday, March 12, 2007

You say tomato, I say $tomato

When I first started programming thirty or so years ago, my choice of language was simple: machine code or bust. I didn't like machine code much, so I never wrote very much. A pattern, however, was established which would be maintained for a long time. A limited set of computer languages would be available at any one time, and I would pick the one that I liked the best and work solely in that. Machine code gave way to assembler (never mastered) to BASIC to APL to Pascal. Transitions were short and sweet; once a better language was available to me, I switched completely. A few languages (Logo, Forth, Modula 2) were contemplated, but never had the necessary immediate availability to be adopted.
A summer internship tweaked the formula slightly -- at work I would use RS/1, because that's what the system was, but at home I stuck to Pascal. For four years of college this was the pattern.

Grad school was supposed to mean one more shift: to C++. However, soon I discovered the universe of useful UNIX utility languages, and sed and awk and shell scripts started popping up. Eventually I discovered make, which is a very different language. A proprietary GUI language based on C++ came in handy. Prolog didn't quite get a proper trial, but at least I read the book. Finally, I found Perl and tried to focus on that, but the mold had been broken -- and for good measure I wrote one of the world's first interactive genome viewers in Java. My thesis work consisted of an awful mess of all of these.

Come Millennium, I swore I would write nothing but Perl. But soon, that had to be modified as I needed to read and write relational databases, which requires SQL. Ultimately, I wanted to do statistics -- and these days that means R.

There are a number of computer language taxonomies which can be employed. For example, with the exceptions of make, SQL (as I used it), and Prolog, all of these languages are procedural -- you write a series of steps and they are executed. The other three fit more of a pattern of the programmer specifying assertions, conditions or constraints, and the language interpreter or compiler executes commands or returns data according to those specifications.

Within the procedural languages, there is a lot of variation. Some of this represents shared history. For example, C++ is largely an extension of C, so it shares many syntactic features. Perl also borrowed heavily from C, so much is similar. R is also loosely in the C syntax family. All of these languages tend to be terse and heavily use non-alphabetic characters. On the other hand, SQL is intrinsically loquacious.

The fun part comes when you are trying to use multiple languages simultaneously, as you must keep the differences straight & properly shift gears. Currently, I'm working semi-daily in Perl, SQL and R, and there is plenty to trip me up if I'm napping. For example, many Perl and R statements can interchange single and double quotes freely -- as long as you do so symmetrically; SQL needs single quotes around strings.
Perl & R use the C-style != for inequality; SQL uses the older-style <>, and in parallel Perl & R use == for equality whereas SQL uses a single = -- and since a single = in Perl is assignment, forgetting this rule can lead to interesting errors! R is a little easier to keep straight, as assignment is <-. R and Perl also diverge on $ -- in Perl it precedes every single-value (scalar) variable, whereas in R it specifies a column of a table. I haven't done C++ or Java for over ten years, but my mind still wants to parse an R variable foo.bar as bar being a member of class instance foo (perhaps because that's the SQL idiom as well), but in R the period is just another legal character for composing a name -- and in Perl it's yet another syntax ( ->{'key'} ) to access the members of a class.

While I know all the rules, inevitably there is a mistake a day (or worse an hour!) where my R variables start growing $ and I try to select something out of my SQL using != . Eventually my mind melts down and all I can write is:
select tzu->{'name'},shih$color from $shih,$tzu where shih.dog==tzu.dog

which doesn't work in any language!