A Weird Python 3 Unicode Failure

The following code can fail on my system:

from os import listdir
for f in listdir('.'):
    print(f)

Why?

UnicodeEncodeError: 'utf-8' codec can't encode character '\udce9' in position 13: surrogates not allowed

What?

I have a file with the name b'Latin1 file: \xe9'. This is a filename with an “é” encoded using Latin-1 (which is byte value \xe9). Python attempts to decode it using the encoding of the current locale, which is UTF-8. Unfortunately, a lone \xe9 byte is not valid UTF-8, so Python works around this by substituting a lone surrogate code point (\udce9; this is the surrogateescape error handler). So, I get a variable f which can still be used to open the file.

However, I cannot print the value of this f because when it attempts to convert back to UTF-8 to print, an error is triggered.
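The mechanism can be reproduced without touching the filesystem. The sketch below assumes the UTF-8 codec and spells out the surrogateescape error handler explicitly; the variable names are only for illustration.

```python
# Raw bytes of the filename as stored on disk: "é" in Latin-1 is the single byte 0xE9.
raw = b'Latin1 file: \xe9'

# Decode the way Python does for filenames: the undecodable byte becomes
# the lone surrogate U+DCE9 instead of raising an error.
name = raw.decode('utf-8', errors='surrogateescape')
print(ascii(name))

# The mapping is reversible, which is why the name can still be used to open the file.
roundtrip = name.encode('utf-8', errors='surrogateescape')

# A strict encode (which is what printing to a UTF-8 terminal needs) rejects the surrogate.
try:
    name.encode('utf-8')
    strict_error = None
except UnicodeEncodeError as err:
    strict_error = err
    print('strict encoding fails:', err.reason)
```

The surrogate sits at index 13 of the string, which matches the position reported in the error message.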

I can understand what is happening, but it’s just a mess. [1]

§

Here is a complete example:

f = open('Latin1 file: é'.encode('latin1'), 'w')
f.write("My file")
f.close()

from os import listdir
for f in listdir('.'):
    print(f)

On a modern Unix system (i.e., one that uses UTF-8 as its system encoding), this will fail.
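One possible workaround, sketched under the assumption of a Unix system with a UTF-8 terminal: convert the name back to its original bytes with os.fsencode and decode those bytes leniently before printing. The helper name printable is hypothetical.

```python
import os

def printable(name):
    """Return a version of a filename that can be printed without raising."""
    # os.fsencode undoes the surrogate escaping and gives back the bytes on disk.
    raw = os.fsencode(name)
    # Invalid UTF-8 bytes become U+FFFD (the replacement character).
    return raw.decode('utf-8', errors='replace')

for f in os.listdir('.'):
    print(printable(f))
```

The printed name is no longer the real filename, so it should only be used for display; the original f is still the value to pass to open().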

§

A good essay on the failure of the Python 3 transition is out there to be written.

Update 10 Aug 2019: Still failing on Python 3.6.5

[1] ls on the same directory generates a placeholder character, which is a decent fallback.

mahotas 1.2.4 released

Just released mahotas 1.2.4. The single improvement over 1.2.3 is that PIL (or Pillow) based IO is automatically used as a fallback. This will hopefully make the package easier to use for new users who have one fewer dependency to install.

(If you like and use mahotas, please cite the software paper.)

Protip: Enable deprecation warnings in Python

Since version 2.7, Python no longer outputs anything when a deprecation warning is hit. This is a good idea if you are running apps that use Python internally as you might not care about this sort of thing. However, it’s a bad idea if you are developing in Python yourself (as opposed to just using an application that was written in Python).

You can turn them back on by either (1) passing -Wd on the command line, or (2) setting the PYTHONWARNINGS environment variable to d. In fact, this allows you to just set this option in your .bashrc (or .zshrc):

export PYTHONWARNINGS=d

Now, all your Python will warn you about deprecations, which you should care about as a developer.
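The same switch can be flipped from inside Python, which can be handy in a test suite. This is a minimal sketch: old_api is a hypothetical deprecated function, and 'default' is the warnings action that the d above abbreviates.

```python
import warnings

def old_api():
    """A hypothetical function that has been deprecated."""
    warnings.warn('old_api() is deprecated', DeprecationWarning, stacklevel=2)
    return 42

# Record warnings instead of printing them, with the 'default' action enabled
# so that DeprecationWarning is no longer ignored.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('default')
    result = old_api()

for w in caught:
    print(w.category.__name__, w.message)
```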

§

I only discovered this myself a few days ago, but have since already stumbled upon several deprecated usages in my code.

How I Use Jug & IPython Notebooks

Having just released Jug 1.0 and having recently started using IPython notebooks for data analysis, I want to describe how I combine these two technologies:

  1. Jug is for heavy computation. This should run in batch mode so that it can take advantage of a computer cluster.
  2. The IPython notebook is for visualization of the results. From the notebook, I will load the results of the jug run and plot them.

I am going to use, as an example, the sort of work I did on classifying images with local features for my Bioinformatics paper last year. That code did not use the IPython notebook, but I already used a split between heavy computation and plotting [1].

I write a jugfile.py with my heavy computation, in this case, feature computation and classification [2]:

from jug import TaskGenerator
from features import computefeatures
from ml import crossvalidation

# computefeatures takes an image path and returns features
computefeatures = TaskGenerator(computefeatures)

# crossvalidation returns a confusion matrix
crossvalidation = TaskGenerator(crossvalidation)

# load_images (not shown here) loads all the images and their labels
images, labels = load_images()
features = [computefeatures(im) for im in images]
results = crossvalidation(features, labels)

This way, if I have 1000 images, the computefeatures step can be run in parallel and use many cores.

§

When the computation is finished, I will want to look at the results and display them. For example, graphically plot a confusion matrix.

The only non-obvious trick is how to load the results from jug:

from jug import value, set_jugdir
import jugfile
set_jugdir('jugfile.jugdata')
results = value(jugfile.results)

And, boom!, results is a variable in our notebook with all the data from the computations (if the computation is not finished, an exception will be raised). Let’s unpack this one by one:

from jug import value, set_jugdir

Imports from jug. Nothing special. You are just importing jug in a Python notebook.

import jugfile

Here you import your jugfile.

set_jugdir('jugfile.jugdata')

This is the important step! You need to tell jug where its data is. Here I assumed you used the defaults, otherwise just pass a different argument to this function.

results = value(jugfile.results)

You now use the value function to load the value from disk. Done.

Now, use a second cell to plot:

from matplotlib import pyplot as plt
from matplotlib import cm

plt.imshow(results, interpolation='nearest', cmap=cm.OrRd)

[Figure: plot of the confusion matrix]

§

I find that this division of labour maximizes the value of each tool: jug does well for long computations and ensures that the results are consistent while making it easy to use the cluster; IPython is nicer for exploring the results and tweaking the graphical outputs.

[1] I would save the results from jug to a text file and load it from another script.
[2] This is a very simplified form of what the original actually looked like. I started to write this post trying to make it realistic, but the complexity was too much. The plot is from real data, though.

Building ImageJ Hyperstacks from Python

ImageJ calls a 4D image a hyperstack (it can actually be 5D or even higher). You can save these and open them, and ImageJ will show a nice GUI for them.

Unfortunately, it’s not very well documented (if at all) how it recognises some images as hyperstacks. This is what I found: A hyperstack is a multi-page TIFF with the image description tag containing information on the hyperstack.

I can generate one by outputting the individual slices to files (in the right order) and then calling tiffcp on the command line to concatenate them together. If the metadata is there, ImageJ will recognise it as a hyperstack.

§

Here is how to do it in Python and mahotas-imread [1].

Define the metadata (as a template):

_imagej_metadata = """ImageJ=1.47a
images={nr_images}
channels={nr_channels}
slices={nr_slices}
hyperstack=true
mode=color
loop=false"""
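To make the template concrete, here is a small sketch of how it gets filled in for a stack with 2 channels and 5 slices. The helper name hyperstack_description is hypothetical; the only rule it encodes is that the number of images is channels times slices.

```python
_imagej_metadata = """ImageJ=1.47a
images={nr_images}
channels={nr_channels}
slices={nr_slices}
hyperstack=true
mode=color
loop=false"""

def hyperstack_description(nr_channels, nr_slices):
    """Fill in the template; every (channel, slice) pair is one TIFF page."""
    return _imagej_metadata.format(
        nr_images=nr_channels * nr_slices,
        nr_channels=nr_channels,
        nr_slices=nr_slices)

print(hyperstack_description(2, 5))
```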

Now, we write a function which takes a z-stack and a filename to write to:

def output_hyperstack(zs, oname):
    '''
    Write out a hyperstack to ``oname``

    Parameters
    ----------
    zs : 4D ndarray
        dimensions should be (c,z,x,y)
    oname : str
        filename to write to
    '''

Some basic imports:

    import tempfile
    import shutil
    from os import system
    # Assumption: ``mh`` is mahotas, whose imsave is provided by mahotas-imread
    import mahotas as mh

    # We create a directory to save the results
    tmp_dir = tempfile.mkdtemp(prefix='hyperstack')
    try:
        # Channels are in first dimension
        nr_channels = zs.shape[0]
        nr_slices = zs.shape[1]
        nr_images = nr_channels*nr_slices
        metadata = _imagej_metadata.format(
                        nr_images=nr_images,
                        nr_slices=nr_slices,
                        nr_channels=nr_channels)

We built up the final metadata string by substituting the right values into the template. Now, we output all the images as separate TIFF files:

        frames = []
        frame_nr = 0
        # Slices in the outer loop, channels in the inner loop
        for s1 in range(zs.shape[1]):
            for s0 in range(zs.shape[0]):
                fname = '{}/s{:03}.tiff'.format(tmp_dir, frame_nr)
                # Do not forget to output the metadata!
                mh.imsave(fname, zs[s0,s1], metadata=metadata)
                frames.append(fname)
                frame_nr += 1

We call tiffcp to concatenate all the inputs:

        cmd = "tiffcp {inputs} {tmp_dir}/stacked.tiff".format(inputs=" ".join(frames), tmp_dir=tmp_dir)
        r = system(cmd)
        if r != 0:
            raise IOError('tiffcp call failed')
        shutil.copy('{tmp_dir}/stacked.tiff'.format(tmp_dir=tmp_dir), oname)

Finally, we remove the temporary directory:

    finally:
        shutil.rmtree(tmp_dir)

And, voilà! This function will output a file with the right format for ImageJ to think it is a hyperstack.
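One variation that may be worth considering is to call tiffcp through subprocess with an argument list instead of building a shell string, so that a temporary directory containing spaces does not break the command. This is only a sketch; the helper names tiffcp_command and concatenate_frames are hypothetical.

```python
import subprocess

def tiffcp_command(frames, output):
    """Build the argument list for tiffcp: input files first, output file last."""
    return ['tiffcp'] + list(frames) + [output]

def concatenate_frames(frames, output):
    """Run tiffcp; raises subprocess.CalledProcessError if it exits non-zero."""
    subprocess.run(tiffcp_command(frames, output), check=True)

print(tiffcp_command(['s000.tiff', 's001.tiff'], 'stacked.tiff'))
```

Because no shell is involved, each filename is passed to tiffcp as a single argument regardless of the characters it contains.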

[1] Naturally, you can use other packages, but you need one which lets you write the image description TIFF tag.

Jug 1.0-release candidate 0

New release of Jug: 1.0 release candidate

I’ve put out a new release of jug, which I’m calling 1.0-rc0. This is a release candidate for version 1.0 and if there are no bugs, in a few days, I’ll just call it 1.0!

There are few changes from the previous version, but this has reached maturity now.

For the future, I want to start developing the hooks interface and use hooks for more functionality.

In the meantime, download jug from PyPI, watch the video, or read the tutorials or the full documentation.

Modernity

The Bourne shell was first released in 1977 (37 years ago).

The C Programming Language book was published in 1978 (36 years ago), describing the C language which had been out for a few years.

Python was first released in 1991 (23 years ago). The first version that looks very much like the current Python was version 2.2, released in 2001 (13 years ago), but the code from Python 1.0.1 is already pretty familiar to my eyes.

The future might point in the direction of functional languages such as Haskell, which first appeared in 1990 (24 years ago), although the first modern version is from 1998 (Haskell 98).

Vim was first released in 1991, based on vi, released in 1976. Some people prefer Emacs, released a bit earlier (GNU Emacs, however, is fairly recent, only released in 1985; so 29 years ago).

The web was first implemented in 1989 with some preliminary work in the early 1980s (although the idea of hypertext had been around for longer).

§

The only really new software I use regularly is distributed version control systems, which are less than 20 years old in both implementation and ideas.

Edit: the original version of the post had a silly arithmetic error, which was pointed out in the comments. Thanks.

Denmark

I was in Denmark last week, teaching Software Carpentry. The students were very enthusiastic, but they had very different starting points, which made teaching harder.

For a course aimed at complete beginners to programming, I typically rely heavily on the Python Tutor created by Philip Guo, which is an excellent tool. Then, my goal is to get them to understand names, objects, and the flow of control.

I don’t use the term variable when discussing Python as I don’t think it’s a very good concept. C has variables, which work like little boxes you put values in. If you’re thinking of little boxes in Python, things get confusing. If you try to think of little boxes plus pointers (or references), it’s still not a very good map of what Python is actually doing.
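A short illustration of the names-and-objects view (my own minimal example, not taken from the course material): assignment binds a name to an object and never copies a value into a box.

```python
a = [1, 2, 3]
b = a            # binds a second name to the same list object; nothing is copied
b.append(4)      # mutates the one object that both names refer to
print(a)         # [1, 2, 3, 4]
print(a is b)    # True

b = [0]          # rebinds the name b to a new object; a is unaffected
print(a)         # [1, 2, 3, 4]
print(a is b)    # False
```

With the little-boxes picture, the first print would be expected to show [1, 2, 3], which is exactly the confusion that the names-and-objects vocabulary avoids.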

For more intermediate students (the kind who have used one programming language), I typically still go through this closely. I find that many still have major faults in their mental model of how names and objects work. However, for these intermediate students, this can go much faster [1]. If it’s the first time they are even seeing the idea of writing code, then it naturally needs to be slow.

Last week, because the class was so mixed, it was probably too slow for some and too fast for others.

§

A bit of Danish weirdness:

[Image: a sausage display at a local pub]

[1] I suppose if students knew Haskell quite well but no imperative programming, this may no longer apply, but teaching Python to Haskell programmers is not a situation I have been in.