
Launching Jupyter notebook server on startup

In this post, I show how to automatically run the Jupyter notebook server on an OS X machine at startup, without keeping extra terminal windows open.

Background

Jupyter notebooks provide a useful interactive environment (think of it as a better alternative to the console) for scripting languages like Python, Julia, and Lua/Torch. It has become a favourite tool of many data scientists for handling datasets that fit into a laptop's memory. If you have not used it, try it the next time you do data analysis or prototype an algorithm.

The Pain

There is a slight technical difficulty, though. The way Jupyter works is that you run a server process on your machine (or somewhere on the network) and then access it through a web client. I have noticed a popular usage pattern in the local-server case: a user runs the command to start the server process in a terminal and then keeps the terminal window open while she is using the notebook. This method has several flaws:

  • you need to start the process manually every time you need a notebook (e.g. after a reboot);
  • you have to remember the command-line arguments;
  • the process takes up a terminal window/tab, or, if you run it in the background, you can still accidentally close the tab and lose the notebook state held in memory.

The Solution

If your perfectionist's feelings are hurt by the nuisances above, my setup may remedy them. It uses the launchd framework, which starts, stops, and manages background processes on OS X.

First, create a property list file in your ~/Library/LaunchAgents directory: 

~/Library/LaunchAgents/com.blippar.jupyter-ipython.plist

and put the following text there. This is Apple's property list format: although it looks like XML, the order of sibling nodes matters (each <key> is immediately followed by its value). You will have to change the paths and other variables to match your setup:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>KeepAlive</key>
    <true/>
    <key>Label</key>
    <string>com.blippar.jupyter-ipython</string>
    <key>ProgramArguments</key>
    <array>
        <string>zsh</string>
        <string>-c</string>
        <string>source ~/.zshrc;workon py3;jupyter notebook --notebook-dir /Users/roman/jupyter-nb --no-browser</string>
    </array>
    <key>RunAtLoad</key>
    <true/>
    <key>StandardErrorPath</key>
    <string>/Users/roman/Library/LaunchAgents/jupyter-notebook.stderr</string>
    <key>StandardOutPath</key>
    <string>/Users/roman/Library/LaunchAgents/jupyter-notebook.stdout</string>
</dict>
</plist>
     
The most interesting part is the ProgramArguments entry. Since it is impossible to specify several commands in a single argument, I had to run a separate shell instance and pass the commands as one string separated by semicolons. If you do not need to activate a virtual environment (workon py3), you might get away without calling the shell at all (see the sketch below), although I recommend always using virtualenv for Python on a local machine. Also, I use zsh; replace it with your shell of choice.
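
For illustration, a shell-free ProgramArguments entry might look like the sketch below; the path to the jupyter binary is an assumption and will differ on your machine:

    <key>ProgramArguments</key>
    <array>
        <string>/usr/local/bin/jupyter</string>
        <string>notebook</string>
        <string>--notebook-dir</string>
        <string>/Users/roman/jupyter-nb</string>
        <string>--no-browser</string>
    </array>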

To check if everything works, run
launchctl load ~/Library/LaunchAgents/com.blippar.jupyter-ipython.plist
launchctl start com.blippar.jupyter-ipython

and open http://localhost:8888/tree in your browser. Hopefully, you will see the list of available notebooks.
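
Note that because KeepAlive is set, launchd restarts the server whenever it exits, so launchctl stop alone will not keep it down. To actually stop the server, or to pick up changes after editing the plist, unload the agent (and load it again if needed):

launchctl unload ~/Library/LaunchAgents/com.blippar.jupyter-ipython.plist
launchctl load ~/Library/LaunchAgents/com.blippar.jupyter-ipython.plist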

Linux users, you should be able to replicate this with systemd (although chances are you already know that =).
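
For what it's worth, a rough sketch of an equivalent systemd user unit might look like this (assuming zsh, the same virtualenv, and that the file is saved as ~/.config/systemd/user/jupyter.service; the unit name and paths are just placeholders):

[Unit]
Description=Jupyter notebook server

[Service]
ExecStart=/bin/zsh -c 'source ~/.zshrc; workon py3; jupyter notebook --notebook-dir %h/jupyter-nb --no-browser'
Restart=always

[Install]
WantedBy=default.target

Here %h expands to the user's home directory; enable and start the unit with systemctl --user enable jupyter followed by systemctl --user start jupyter.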


Asynchronous Fitter design pattern for training in IPython notebooks

[Cross-posted from CodeReview@SX]

The problem

When you work with Python interactively (e.g. in an IPython shell or notebook) and run a computationally intensive operation, such as fitting a machine-learning model implemented in native code, you cannot interrupt it: the native code does not return control to the Python interpreter until the operation finishes. The problem is not specific to machine learning, but it is typical for training runs whose duration you cannot predict. If training takes longer than you expected, the only way to stop it is to restart the kernel, which loses the pre-processed features and other variables stored in memory; in other words, you cannot interrupt just that one operation in order to try a simpler model that would presumably fit faster.

The solution

I propose an Asynchronous Fitter design pattern that runs fitting in a separate process and communicates the results back when they are available. It allows you to stop training gracefully by killing the spawned process and then to train a simpler model instead. It also allows you to train several models simultaneously and to keep working in the IPython notebook while a model is being trained. Note that multithreading is probably not an option, since we cannot stop a thread that runs uncontrolled native code.

Here is a draft implementation:
from multiprocessing import Process, Queue
import time

class AsyncFitter(object):
    def __init__(self, model):
        self.queue = Queue()
        self.model = model
        self.proc = None
        self.start_time = None

    def fit(self, x_train, y_train):
        # terminate the previous training run, if any, and spawn a new process
        self.terminate()
        self.proc = Process(target=AsyncFitter.async_fit_,
            args=(self.model, x_train, y_train, self.queue))
        self.start_time = time.time()
        self.proc.start()

    def try_get_result(self):
        # non-blocking: return the fitted model if it is ready, otherwise None
        if self.queue.empty():
            return None

        return self.queue.get()

    def is_alive(self):
        return self.proc is not None and self.proc.is_alive()

    def terminate(self):
        # kill the spawned process if it is still running
        if self.proc is not None and self.proc.is_alive():
            self.proc.terminate()
        self.proc = None

    def time_elapsed(self):
        # seconds since the last call to fit(), or 0 if fit() has not been called
        if not self.start_time:
            return 0

        return time.time() - self.start_time

    @staticmethod
    def async_fit_(model, x_train, y_train, queue):
        # runs in the child process: fit the model and send it back through the queue
        model.fit(x_train, y_train)
        queue.put(model)

Usage

It is easy to modify code that uses scikit-learn to adopt this pattern. Here is an example:
import numpy as np
from sklearn.svm import SVC

model = SVC(C=1e3, kernel='linear')
fitter = AsyncFitter(model)
x_train = np.random.rand(500, 30)
y_train = np.random.randint(0, 2, size=(500,))
fitter.fit(x_train, y_train)

You can check whether training is still running by calling fitter.is_alive() and see how much time has elapsed by calling fitter.time_elapsed(). Whenever you want, you can terminate() the process, or simply train another model, which will terminate the previous one. Finally, you can obtain the trained model via try_get_result(), which returns None while training is in progress.
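
For instance, a later notebook cell could poll the fitter along these lines (just a sketch; the score() call is only an arbitrary way to exercise the trained model):

if fitter.is_alive():
    print('Still training, {:.0f} s elapsed'.format(fitter.time_elapsed()))
else:
    trained = fitter.try_get_result()
    if trained is not None:
        print('Training accuracy: {:.3f}'.format(trained.score(x_train, y_train)))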

The issues

As far as I understand, the training set is being pickled and copied, which may be a problem if it is large. Is there an easy way to avoid that? Note that training needs only read-only access to the training set.

What happens if someone loses a reference to an AsyncFitter instance that wraps a running process? Is there a way to implement an asynchronous delayed resource cleanup?


OpenCV bindings

If you are a computer vision researcher or engineer, you cannot have missed the OpenCV library. Even if you are not, it could still be useful to you. For example, here is a funny application: the camera takes a shot of the programmer's face whenever a merge fails. If you are not familiar with the library, I recommend looking through the list of its features on Wikipedia.

OpenCV is written in C to be extremely portable (for example, to DSPs). The fact is, C is not very popular nowadays. The recent 2.0 release also contains (besides other decent stuff) C++ and Python wrappers. What about other languages?

There are a number of C# wrappers. The best known is EmguCV, which is reported to be the only C# wrapper that supports OpenCV 2.0 (actually, I don't know exactly what that means, but I suppose the API should correspond to the C++ interface). It is distributed under the GPL or a "commercial license with a small fee".

As for Java, JavaCV seems to be the only viable wrapper. It also contains wrappers for other popular libraries such as FFmpeg. It is also distributed under the GPL, but the author has promised to discuss weakening it if needed.

OpenCV used to be supported by Intel, but it recently became an independent FOSS project. They are also going to participate in Google Summer of Code. If they succeed, you might want to apply. I think developing such a popular library and being paid for it would be a nice experience. :)

One of the areas they want to develop is augmented reality support for the Android operating system. When I learned that Android has an AR API, I decided to try it. So this is going to be a good opportunity!

UPD (Aug 6, 2011). A Haskell (!) wrapper for OpenCV by Noam Lewis has appeared.
Also, the OpenCV folks are developing an official Java wrapper. Looking forward to using it!
