Mahotas 1.1.0 Released!

Mahotas 1.1 Released

released mahotas 1.1.0 yesterday.

Use pip install mahotas --upgrade to upgrade.

Mahotas is my computer vision library for Python.

Summary of Changes

It adds the functions resize_to and resize_rgb_to, which can be used like:

import mahotas as mh
lena = mh.demos.load('lena')
big = mh.resize.resize_rgb_to(lena, [1024, 1024])

As well as remove_regions_where, which is useful for handling labeled images:

import mahotas as mh
nuclear = mh.demos.load('nuclear')
nuclear = mh.gaussian_filter(nuclear, 2)
labeled,_ = mh.label(nuclear > nuclear.mean())

# Ok, now remove small regions:

sizes = mh.labeled.labeled_size(labeled)

labeled = mh.labeled.remove_regions_where(
        labeled, sizes < 100)

Moments computation can now be done in a normalized mode, which is robust against scale changes:

import mahotas as mh
lena = mh.demos.load('lena', as_grey=1)
print mh.features.moments.moments(lena, 1, 2, normalize=1)
print mh.features.moments.moments(lena[::2], 1, 2, normalize=1)
print mh.features.moments.moments(lena[::2,::3], 1, 2, normalize=1)

prints 126.609789161 126.618233592 126.640228523

You can even spell the keyword argument “normalise”!

print mh.features.moments.moments(lena[::2,::3], 1, 2, normalise=1)

This release also contains some bugfixes to SLIC superpixels and to convolutions of very small images.

(If you like and use mahotas, please cite the software paper.)

How to save & load large pandas dataframes

Update (April 2018): Use feather format.

How to save & load large pandas dataframes

I have recently started using Pandas for many projects, but one feature which I felt was missing was a native file format the data. This is especially important as the data grows.

Numpy has a simple data format which is just a header plus a memory dump, which is great as it allows you to memory-map the data into memory. Pandas does not have the same thing.

After looking at the code a little bit, I realized it’s pretty easy to fake it though:

  1. The data in a Pandas DataFrame is held in a numpy array.
  2. You can save that array using the numpy format.
  3. The numpy code does not care about the file beyond the header: it just maps the rest of the data into memory.
  4. In particular, it does not care if there is something in the file after the data. Thus, you can save the Pandas extra-data after the numpy array on disk.

I wrote this up. Here is the writing code:

np.save(open(fname, 'w'), data)
meta = data.index,data.columns
s = pickle.dumps(meta)
s = s.encode('string_escape')
with open(fname, 'a') as f:
    f.seek(0, 2)
    f.write(s)

We save the array to disk with the numpy machinery, then seek to the end and write out the metadata.

Here is the corresponding loading code:

values = np.load(fname, mmap_mode='r')
with open(fname) as f:
    numpy.lib.format.read_magic(f)
    numpy.lib.format.read_array_header_1_0(f)
    f.seek(values.dtype.alignment*values.size, 1)
    meta = pickle.loads(f.readline().decode('string_escape'))
frame = pd.DataFrame(values, index=meta[0], columns=meta[1])

Check out this gist for a better version of these, which also supports pandas.Series.

As an added bonus, you can load the saved data as a numpy array directly if you do not care for the metadata:

save_pandas('data.pdy', data)
raw = np.load('data.pdy')

Note: I’m not sure this code covers all Pandas cases. It fits in my use case and I hope it can be useful for you, but feel free to point out shortcomings (or improvements) in the comments.

I’m Doubling Down on a Mistake

Let me double down on a mistake. Two days ago, I wrote that Python lists are not lists, they’re arrays.

I got called out on twitter for not realising that Lists refers to the abstract data type called list and not a particular implementation.

These people are right that list is used for an abstract data type that may be implemented by an array.

§

However, the word list carries more meaning than the abstract data type list. When people say a list they are not thinking of an array. If a group of theoreticians uses the term list to in a restricted context, sure, they can go ahead, but it does not change the fact that when people say list in the context of computer programming, they most often mean “some type of linked list”.

Which is why it is not meaningless to say Python lists are arrays, not lists or to ask should we use an array or a list for this?

§

If I look at the Most used programming languages, Python is the only one I know of whose list type is not a list. Even PHP calls its arrays arrays [1]. Other languages, use array or vector for arrays, and list for, well, lists.

§

Calling arrays lists is confusing. I mean this quite literally [2]: I have cleared up confusions in other people’s minds about Python lists [3] by just saying they’re not lists, they’re arrays.

§

I also do not particularly care about implementation details, but I do care about performance. I think that if you are going to call something a list, it should have O(1) splicing.

I quite like the way that C++ does it (one of my #UnpopularOpinon is that C++ is still one of the most advanced programming languages around and others are still catching up). Its standard data structures are defined as abstract data types plus big-O performance guarantees. Naturally, this often constrains the implementation to a specific family of techniques (particularly for the basic types), but it still leaves a bit of wiggle room for implementers to tune to their particular architectures.

More importantly, it lets me use the structure without caring for the internal implementation if I have the right performance guarantees. If you tell me here’s this structure with these functions it is not really very usable until I have an idea of the performance. This is more and more important as data grows. Implementing car/cdr with arrays is fine if you have 100 elements, not if you have a billion.

[1] Cheap shot, sorry. PHP probably got this right by coincidence.
[2] In this case, I am using literal in the strict sense, although literally also means emphatically. The history of the word literally is similar to the overall gist of this post: in a restricted context, literally means non-figuratively, but the word has morphed to mean figuratively.
[3] Because I started using Python a long time ago and am fairly vocal about how great it is, a lot of people turn to me for doubts when they’re learning the language.

Python “lists” are not lists. A history

Python “lists” are not lists. They are arrays.

§

This seems like the sort of rookie mistake that starting scientists make: they use terms imprecisely, they write like journalists not scientists.

I can totally imagine myself telling a student starting out: a list is a list and an array is an array, these are well defined computer science terms, you cannot use them interchangibly (and I can imagine their mental reaction: this guy is a pedantic prick [1]).

I tell the students to think like a politician: everything you write must still be defensible even taken out of context[2]

Calling an array a list is wrong.

§

I wondered whether they had been lists in the past, so I went back and checked the earliest version of Python available, which happens to be Python 1.0.1

After a minor bug fix [3], it actually compiled and ran on my Ubuntu 13.04 machine! It is somewhat crashy on errors, but otherwise works.

§

First find: already in Python 1.0.1, lists were arrays! This was already year old code base, so maybe at an even earlier time point, men were men and lists were lists. But by 1994, lists were arrays.

§

A surprising thing about the code: if you’re familiar with recent Python code, this will feel very familiar. Here is how you set a list member:

int
setlistitem(op, i, newitem)
    register object *op;
    register int i;
    register object *newitem;
{
    register object *olditem;
    if (!is_listobject(op)) {
        XDECREF(newitem);
        err_badcall();
        return -1;
    }
    if (i < 0 || i >= ((listobject *)op) -> ob_size) {
        XDECREF(newitem);
        err_setstr(IndexError, "list assignment index out of range");
        return -1;
    }
    olditem = ((listobject *)op) -> ob_item[i];
    ((listobject *)op) -> ob_item[i] = newitem;
    XDECREF(olditem);
    return 0;
}

This feels very similar to recent Python.

§

Even more surpring, here is the inner loop:

switch (opcode) {

case POP_TOP:
    v = POP();
    DECREF(v);
    break;

case ROT_TWO:
    v = POP();
    w = POP();
    PUSH(v);
    PUSH(w);
    break;

and so on… This is almost exactly the same as the most recent Python. We are still using fundamentally the same implementation of Python that was out 20 years ago.

Update: See the next post for a response to the notion that I just didn’t get what a list refers to.

[1] Student: I meant list as a general one-thing-after-another-in-order, Wise older researcher You mean a sequence Student: Yeah, that. Wise older researcher: then you should write sequence. At this point, the student attains enlightment and starts applying for jobs in industry.
[2] A common complaint I have heard after several MS thesis defenses is the jury member took this one sentence from my introduction out of context and made a big deal out of it. It wasn’t even that related to my work.
[3] There is a function getline() in the standard library which was not there in the early 1990s. So, Python’s use of an internal function with the same name gets the compiler confused. Renaming the internal function to python_getline fixes it.

Story of Two Bugs

I while back, I got a bug report for mahotas-imread [1]:

PNG reads in inverted: This image loads in imread 0.3.1 incorrectly. It looks right if I xor it with 255, but I don’t think that’s all that’s wrong.

im16_as8

The first thing I noticed was that this was a 16 bit image. If you’ve been coding for as long as I have, you’ll immediately think: It’s a byte-endiness issue [2]. So, I tested:

imshow(im.byteswap())

and saw the following:

byteswap0

Not good. The I looked at the hint that the original poster provided and it did seem to be true: imshow(~f) worked reasonably well. My working hypothesis was thus that there is a flag whereby the PNG data needs to be interpreted after a bit reversion. I also noticed another thing, though:

max_16bit_value = 2**16-1
imshow(max_16bit_value - f)

Also looks decent.

The TIFF format does allow you to specify whether zero is supposed to be white or black. Maybe PNG has a similar “feature.”

I read through the libpng documentation (which is not great), a bit through its source, and through online descriptions of PNG format. Along the way, I noticed that converting the image to TIFF (with ImageMagick) and loading it with imread also gave the wrong result. Perhaps the TIFF reader had the same bug or ImageMagick [3].

Eventually, I realised that PNG files are in network order (i.e., in big-endian format) and the code did not convert them to little-endian. Thus, my initial intuition had been right: it was a byte-swapping error!

But in this case, why did imshow(f.byteswap()) result in a mangled image?

I stated to suspect that matplotlib had a bug. I tried to do:

imshow(f.byteswap() / 2.**16)

and it resulted in the correct image being shown.

As it turned out, matplotlib does not do the right thing when given 16 bit files. Thus, I had this figured out. I fixed imread and released version 0.3.2 and closed the bug report.

§

A single bug is often easy to debug, but when you have multiple bugs interacting; it is much more than twice as hard.

§

Hairy details: You may want to stop reading now.

Consider the following identities:

255 == 0xff
-f == (f ^ 0xff + 1)
2**16 - f = -f + 2**16 == -f (because of overflow)

Thus, it should not be surprising that flipping the bits or subtracting the image resulted int , in two-bit complement, ~f is roughly -f. Not exactly, but similarly enough that, by eye, it is hard to tell apart.

Finally, it all makes sense when you realise that matplotlib assumes that non-8 bit images are floating point and does:

final_image = (input_image * 255)
final_image = final_image.astype(np.uint8)

Because what is multiplying by 255? It’s the same as multiplying by -1! Thus, matplotlib would multiply by -1 and then take the low order bits. Thus, it showed a correct image if you pre-multiplied it by -1 (or flipped the bits) and gave it a byteswapped image!

[1] People don’t always appreciate how valuable good bug reports are. Seriously, they are a huge help: you are testing the software for me. Unfortunately, either shyness or past bad experiences will often cause people who see something worng to not report it.
[2] I now have over 15 years of experience coding (having had a relative late start [I didn’t care much about computers until I was close to college age], I’ve caught up.) If there is an area where I really feel that my experience shines through is debugging: I’ve seen enough mistakes and errors that my guesses as to what the bug is are more and more accurate (this is true even in code I have not seen).
[3] One of the reasons I started mahotas-imread was that I had not found a single library that could read the formats I wanted without a lot of bugs. So, I trust no one on this. In this case, the paranoia was unwarranted, as we’ll see.

Saturday Links

Last few weeks have been a bit hectic and my post queue dwindled, but here are some Saturday links:

1. The I Quit Academia genre. I like the meta-idea. I also notice that often the people leaving academia use the word “corporatization” (or some equivalent) to discuss what they dislike about the public or non-profit institutions they are leaving, in contrast to the less bureaucratic for-profit enterprises they are now joining. Should we conclude that corporatization means “becoming less like a 21st century corporation and more like a 20th century bureaucracy”.

2. Do things, write about it. This is about blogging, but the title could apply to science too.

3. Nice colour schemes: http://colorbrewer2.org/ with a Python implementation. Pretty nice.

4. Why Python uses 0-based indices. It’s the closed-open interval that matters, the 0 vs 1 discussion is actually a distraction.

This is one of those things where the there is a solution (zero-based) which is not intuitive at first but is actually more intuitive in the long run. With 1-based indices and inclusive splicing, you always need to keep track of +1s and -1s all over the place.

Why Python is Better than Matlab for Scientific Software

Why Python is Better than Matlab for Scientific Software

This is an argument I made at EuBIAS when arguing for the use of mahotas and the Python numpy-stack for bioimage informatics. I had a few conversations around this and decided to write a longer post summarizing the whole argument.

0. My argument mostly applies for new projects

If you have legacy code in MATLAB, then it may make sense to just continue using it. If it works, don’t fix it. However, if your Matlab code keeps causing you pain, Python might be a solution.

Note too that porting code is not the same as writing from scratch. You can often convert code from MATLAB to Python in a small fraction of the time it would take you to start from scratch.

1. Python has caught up with Matlab and is in the process of overtaking it.

This is my main argument: the momentum is in Python’s direction. Even two or three years ago, Python was behind. Now, it’s sailing past Matlab.

nr_lines_python

This graph shows the number of lines of code in some important projects for bioimage informatics/science in general (numpy, matplotlib, mahotas, skimage, and sklearn). As you can see, the base projects on the top (numpy and matplotlib) have been stable for some years, while the more applied packages at the bottom have exploded in recent years.

Depending on what you are doing, Python may even better support it. It is now, Matlab which is playing catch-up with open source software (for example, Matlab is now introducing their own versions of Dataframe, which Python has through Pandas [ itself, a version of R’s Dataframe object]).

The Python projects are also newer and tend, therefore, to be programmed in a more modern way: it is typical to find automated testing, excellent and complete documentation, a dedicated peer-reviewed publication, &c. This ain’t your grandfather’s open source with a dump on sourceforge and a single README file full of typos.

As an example of the large amount of activity going on in the Python world, just this week, Yhat released ggplot for Python [1]. So, while last week, I was still pointing to plotting as one of the weakneses of Python, it might no longer be true.

2. Python is a real programming language

Matlab is not, it is a linear algebra package. This means that if you need to add some non-numerical capabilities to your application, it gets hairy very fast.

For scientific purposes, when writing a small specialized script, Python may often be the second best choice: for linear algebra, Matlab may have nicer syntax; for statistics, R is probably nicer; for heavy regular expression usage, Perl (ugh) might still be nicer; if you want speed, Fortran or C(++) may be a better choice. To design a webpage; perhaps you want node.js. Python is not perfect for any of these, but is acceptable for all of them.

In every area, specialized languages are the best choice, but Python is the second best in more areas [2].

3. Python can easily interface with other languages

Python can interfact with any language which can be interacted through C, which is most languages. There is a missing link to some important Java software, but some work is being done to address that too. Technically, the same is true of Matlab.

However, the Python community and especially the scientific Python community has been extremely active in developing tools to make this as easy as you’d like (e.g., Cython).  Therefore, many tools/libraries in C-based languages are already wrapped in Python for you. I semi-routinely get sent little snippets of R code to do something. I will often just import rpy2 and use it from Python without having to change the rest of my code.

4. With Python, you can have a full open-source stack

This means that you are allowed to, for example, ship a whole virtual machine image with your paper. You can also see look at all of the code that your computation depends on. No black boxes.

5. Matlab Licensing issues are a pain. And expensive.

Note that I left the word expensive to the end, although in some contexts it may be crucial. Besides the basic MATLAB licenses, you will often need to buy a few more licenses for specific toolboxes. If you need to run the code on a cluster, often that will mean more licenses.

However, even when you do have the money, this does not make the problem go away: now you need admin for the licensing. When I was at CMU, we had campus-wide licenses and, yet, it took a while to configure the software on every new user’s computer (with some annoying minor issues like the fact that the username on your private computer needed to match the username you had on campus), you couldn’t run it outside the network (unless you set up a VPN, but this still means you need network access to run a piece of software), &c. Every so often, the license server would go down and stop everybody’s work. These secondary costs can be as large as the licensing costs.

Furthermore, using Python means you can more easily collaborate with people who don’t have access to Matlab. Even with a future version of yourself who decided to economize on Matlab licenses (or if the number of users shrinks and your institution decides to drop the campus licensing, you will not be one of the few groups now forced to buy it out-of-pocket).

By the way, if you do want support, there are plenty of options for purchasing it [3]: larger companies as well as individual consultants available in any city. The issue of support is orthogonal to the licensing of the software itself.

§

Python does have weaknesses, of course; but that’s for another post.

[1] Yhat is a commercial company releasing open-source software, by the way; for those keeping track at home.
[2] I remember people saying this about C++ back in the day.
[3] However, because science is a third-world economy, it may be easier to spend 10k on Matlab licenses which come with phone support to spending 1k on a local Python consultant.