Showing posts with label tutorial. Show all posts

On structured learning

This post is devoted to structured learning, a novel machine learning paradigm that can be applied to a variety of computer vision problems. For example, finding the optimal parameters of graphical models is essentially a structured learning problem. I am going to give an introductory talk on structured learning for Dmitry Vetrov's graphical models class this Friday at 4:20 pm, so please feel free to drop by if you are in Moscow (the talk will be in Russian).

Structured learning is basically a very general supervised learning setting. Consider classification as the basic problem: one needs to find the parameters of a function that maps a feature vector to one of the pre-defined class labels, $\mathbb{R}^m \to \{c_1, c_2, \dots, c_K\}$. The fundamental property is that the classes are unordered and orthogonal. The latter means the probabilities of an object being assigned different labels are uncorrelated (some negative correlation is natural, since confidently classifying an object as $c_i$ decreases the probability of the remaining labels).

Now consider regression, another supervised learning problem, where feature vectors map to real values: $\mathbb{R}^m \to \mathbb{R}$. It might be considered a classification problem with an infinite number of classes. There are two obvious flaws in this reduction. First, a training set is unlikely to have at least one example for each class. To overcome this, one can quantize the codomain and train a classifier over a finite set of class labels. However, that leaves us with the second flaw: the method does not take into account the correlation between possible outcomes. The bins are ordered: training features that correspond to neighbouring bins should be handled differently from those that correspond to distant bins. That's why global methods (like linear regression) are used instead.
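To make the second flaw concrete, here is a tiny sketch (all numbers are my own toy values, not from any real task): a 0-1 loss over quantization bins treats a near miss and a wild miss identically, while a regression-style loss distinguishes them.

```python
import bisect

# Quantize the codomain [0, 10) into ten unit-width bins.
bin_edges = [float(b) for b in range(11)]

def to_bin(y):
    """Index of the quantization bin containing y."""
    return bisect.bisect_right(bin_edges, y) - 1

y_true = 2.0
predictions = [3.1, 9.5]   # a near miss and a wild miss

# 0-1 classification loss over bins treats both mistakes identically...
zero_one = [int(to_bin(p) != to_bin(y_true)) for p in predictions]
# ...while a regression-style loss tells them apart.
abs_err = [round(abs(p - y_true), 2) for p in predictions]

print(zero_one)  # [1, 1]
print(abs_err)   # [1.1, 7.5]
```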

A similar situation arises with structured outcomes. In this case one usually has plenty of possible outcomes, but they are not independent. The techniques of structured learning are applicable when the outcomes have some underlying structure, and the elements of the structure have similar sets of features. Also, it should be possible to estimate how bad a prediction is (often in terms of incorrectly predicted elements), which is called the structured loss. The methods allow small deviations from the ideal training outcome, which are possible e.g. because of noise, but penalize substantial ones (just like regression!). The prediction function (whose parameters are tuned) can thus be formalized as a mapping $\mathbb{R}^{m \times l} \to \{c_1, c_2, \dots, c_K\}^l$.
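As a minimal illustration (the tag sequences are invented), a Hamming-style structured loss simply counts incorrectly predicted elements, so small deviations from the training outcome are penalized lightly and substantial ones heavily:

```python
def hamming_loss(y_true, y_pred):
    """Number of elements where the predicted structure disagrees
    with the ground truth: small deviations cost little."""
    assert len(y_true) == len(y_pred)
    return sum(a != b for a, b in zip(y_true, y_pred))

y_star = ['N', 'V', 'D', 'N']      # e.g. part-of-speech tags
print(hamming_loss(y_star, ['N', 'V', 'D', 'N']))  # 0: perfect prediction
print(hamming_loss(y_star, ['N', 'V', 'D', 'V']))  # 1: small deviation
print(hamming_loss(y_star, ['V', 'N', 'N', 'V']))  # 4: heavily penalized
```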

A possible example is hidden Markov model learning, where the elements are emission potentials and transition potentials, and the loss can be defined as the number of wrong HMM outputs. In terms of the above formalization, there are $l$ elements, each represented by $m$ features (in practice, $m$ can vary for different types of elements). Since the labels of transition elements are strictly determined by the emission outputs, not every outcome is possible.
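For completeness, here is a sketch of max-product decoding in an HMM (the classic Viterbi algorithm), written in log space; the weather model and all its numbers are invented for illustration.

```python
import math

def viterbi(obs, states, log_emit, log_trans, log_init):
    """Max-product decoding for an HMM in log space.
    log_emit[s][o]: emission potential, log_trans[s][t]: transition."""
    # delta[s] = best log-score of any state sequence ending in s
    delta = {s: log_init[s] + log_emit[s][obs[0]] for s in states}
    back = []
    for o in obs[1:]:
        prev = delta
        delta, ptr = {}, {}
        for s in states:
            best_t = max(states, key=lambda t: prev[t] + log_trans[t][s])
            delta[s] = prev[best_t] + log_trans[best_t][s] + log_emit[s][o]
            ptr[s] = best_t
        back.append(ptr)
    # backtrack from the best final state
    last = max(states, key=lambda s: delta[s])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))

# Toy weather model (numbers are made up for illustration).
states = ['rain', 'sun']
log_init = {'rain': math.log(0.5), 'sun': math.log(0.5)}
log_trans = {'rain': {'rain': math.log(0.7), 'sun': math.log(0.3)},
             'sun':  {'rain': math.log(0.3), 'sun': math.log(0.7)}}
log_emit = {'rain': {'umbrella': math.log(0.9), 'none': math.log(0.1)},
            'sun':  {'umbrella': math.log(0.2), 'none': math.log(0.8)}}
print(viterbi(['umbrella', 'umbrella', 'none'], states,
              log_emit, log_trans, log_init))  # ['rain', 'rain', 'sun']
```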

Another example of a structured prediction problem is natural language parsing. Given a sentence of a natural language, the corresponding parsing tree is to be constructed. Surprisingly, parsing trees can also be represented as high-dimensional vectors with constraints applied. To summarize, structured prediction has an outcome that is multivariate, correlated and constrained.

Ideally, the parameters of structured prediction should be tuned via likelihood maximization, but this turns out to be intractable due to the need to compute the partition function at each step of gradient optimization. That's why the L2-regularized hinge loss is usually minimized instead. The prediction algorithm is represented as a scoring function over possible outcomes. The task of the learning algorithm is to find parameters of the scoring function that make it return maximum values for the true outcomes of the training set, and low values for instances that are far from optimal. In other words, the margin between good and bad possible outcomes should be maximized (in terms of the scoring function value).
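The margin idea can be sketched in a few lines. Everything below (the template-matching scoring function, the label space) is a made-up toy; real structured learners replace the exhaustive maximization with efficient loss-augmented decoding.

```python
from itertools import product

def structured_hinge(score, y_true, label_set, length, loss):
    """max_y [score(y) + loss(y_true, y)] - score(y_true):
    zero iff the true outcome beats every rival by a margin of loss."""
    best = max(score(list(y)) + loss(y_true, list(y))
               for y in product(label_set, repeat=length))
    return best - score(y_true)

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

# Hypothetical scoring function: reward per position matched against a
# fixed "template", mimicking tuned unary potentials.
template = [0, 1, 1]
strong = lambda y: 2.0 * sum(a == b for a, b in zip(y, template))
weak = lambda y: 0.5 * sum(a == b for a, b in zip(y, template))

y_true = [0, 1, 1]
print(structured_hinge(strong, y_true, [0, 1], 3, hamming))  # 0.0: margin held
print(structured_hinge(weak, y_true, [0, 1], 3, hamming))    # 1.5: violated
```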

You can learn more about structured learning from Ben Taskar's NIPS 2007 tutorial. See also our recent 3DIMPVT paper for an up-to-date survey of MRF/CRF training methods and their applications.

PS. I've installed the MathJax equation engine in the blog. Seems to work, huh?


Max-product optimization methods and software

I hoped I would never post a sorry-for-not-posting message, but here I am (and now I feel like I owe you a long post =). In fact, I have been really busy, particularly with putting my master's thesis down on paper. However, this had a positive outcome: I systematized my knowledge of inference techniques in MRFs, and I'd like to share it here.

Please don't consider it a survey. While it claims to be quite comprehensive, it is neither deep nor verbose. You are not gonna see a lot of formulas, but rather links to papers and implementations, along with recommendations on the applicability of the methods and software. It could be useful for anybody who wants to quickly try energy minimization in her task but doesn't want to spend a lot of time digging into the literature and implementing algorithms. So let's call this form of posting an unsurvey. You can always find more details on the topic in Stan Li's book (not to be confused with Stan Lee, who recently appeared in The Big Bang Theory).

The max-product optimization problem arises in a number of fields such as statistical physics, social network analysis, and, of course, computer vision, where problems like image denoising, dense stereo reconstruction and semantic segmentation lead to the same discrete optimization problem: maximize the product of functions defined over the nodes and edges of a graph G=(N,E), typically a grid over image pixels in low-level vision. After taking logarithms, this is equivalent to maximizing the sum

$$\max_{Y} \; \sum_{i \in N} \varphi_i(Y_i) + \sum_{(i,j) \in E} \varphi_{ij}(Y_i, Y_j),$$
where each $Y_i$ corresponds to a node of the graph and takes a value from a fixed discrete range. It could be a class label in semantic segmentation, or the disparity value of a pixel in dense stereo. The so-called potential functions φ (unary in the first term, pairwise in the second) define how plausible it is to assign given class labels to the nodes and edges, respectively. They can be conditioned on the data. For example, unary potentials could reflect colour information in a pixel, while pairwise potentials could encourage assigning similar labels to the endpoints of an edge, providing smoothness in the resulting labelling. If the graph contains cliques of order 3 or greater, one should consider more terms (as I explained earlier), but in practice higher-order cliques are often ignored. Moreover, an important special case, the 4-connected grid, contains only cliques of order two, which is why the research community has focused on this "pairwise" problem.
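As a toy illustration (a hypothetical 2×2 image; all numbers invented), the score of a labelling under unary potentials plus a Potts-style smoothness term can be computed directly:

```python
def labelling_score(labels, unary, edges, lam):
    """Sum of unary potentials plus Potts pairwise potentials:
    an edge contributes lam if its endpoints agree, 0 otherwise."""
    s = sum(unary[i][labels[i]] for i in range(len(labels)))
    s += sum(lam for (i, j) in edges if labels[i] == labels[j])
    return s

# 2x2 grid of pixels, 4-connected: nodes 0 1 / 2 3
edges = [(0, 1), (2, 3), (0, 2), (1, 3)]
# unary[i][c]: how well label c fits pixel i (made-up colour scores)
unary = [{0: 2.0, 1: 0.5},
         {0: 1.5, 1: 1.0},
         {0: 0.4, 1: 1.8},
         {0: 0.3, 1: 2.2}]
print(labelling_score([0, 0, 1, 1], unary, edges, lam=0.5))  # 8.5
```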

In general, this maximization problem is NP-hard. That's why in the 1980s it was tackled with genetic algorithms or simulated annealing. Fortunately, more specific and effective methods have since been developed.

First of all, there are two cases where the exact global optimum can be found in polynomial time. The first is when the graph has a tree structure. Actually, I don't know of a vision application where this case is useful; it is probably good for the analysis of parsing trees in natural language processing. In this case the maximum is found by the dynamic-programming-based belief propagation algorithm: one node is selected as the root, messages are passed from the leaves towards the root and then back, and the optimal assignment is recovered.
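Here is a sketch of that two-pass dynamic programme on an explicit tree (all potentials invented, result checked against brute force); it is not any particular library's implementation.

```python
from itertools import product

def tree_map(n, edges, labels, unary, pair, root=0):
    """Max-product on a tree, in log (sum) form: an upward pass of
    messages from the leaves to the root, then downward backtracking."""
    adj = {i: [] for i in range(n)}
    for i, j in edges:
        adj[i].append(j)
        adj[j].append(i)
    order, parent, stack = [], {root: None}, [root]
    while stack:                         # iterative DFS: parents first
        v = stack.pop()
        order.append(v)
        for u in adj[v]:
            if u not in parent:
                parent[u] = v
                stack.append(u)
    msg, arg = {}, {}                    # msg[v][xp]: best score of v's
    for v in reversed(order):            # subtree given parent label xp
        if parent[v] is None:
            continue
        children = [c for c in adj[v] if parent[c] == v]
        msg[v], arg[v] = {}, {}
        for xp in labels:
            def score(xv):
                return (unary[v][xv] + pair(v, parent[v], xv, xp) +
                        sum(msg[c][xv] for c in children))
            best = max(labels, key=score)
            msg[v][xp], arg[v][xp] = score(best), best
    x = {}
    root_children = [c for c in adj[root] if parent[c] == root]
    x[root] = max(labels, key=lambda xr: unary[root][xr] +
                  sum(msg[c][xr] for c in root_children))
    for v in order:                      # downward pass: follow argmaxes
        if parent[v] is not None:
            x[v] = arg[v][x[parent[v]]]
    return [x[i] for i in range(n)]

# Invented potentials on a 4-node tree, checked against brute force.
n, labels = 4, [0, 1]
edges = [(0, 1), (0, 2), (2, 3)]
unary = [[0.0, 1.0], [0.5, 0.0], [1.5, 0.1], [0.0, 0.9]]
pair = lambda i, j, xi, xj: 0.6 if xi == xj else 0.0   # smoothness

best_bp = tree_map(n, edges, labels, unary, pair)

def total(y):
    return (sum(unary[i][y[i]] for i in range(n)) +
            sum(pair(i, j, y[i], y[j]) for i, j in edges))

best_bf = max(product(labels, repeat=n), key=total)
print(best_bp, total(best_bp))
```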

The other polynomial case allows an arbitrary graph topology but restricts the number of labels and the form of the pairwise potentials. Only one of two labels can be assigned to each node, and the pairwise potentials must be submodular, i.e. φ(0,0) + φ(1,1) >= φ(0,1) + φ(1,0). A binary assignment thus divides all the nodes into two groups according to their labels. This division is found by computing a minimum cut on an auxiliary graph: two extra nodes, declared the source and the sink, are added and made adjacent to all the nodes of the original graph. Using your favourite min-cut/max-flow algorithm, you can then separate the nodes into two sets, which gives you the resulting labelling.
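To make the construction concrete, here is a toy sketch, written in energy-minimization form (i.e. with the potentials negated, so the submodularity inequality flips accordingly; all numbers invented). Each binary submodular pairwise term is reparameterized into arcs of an s-t graph, and a textbook Edmonds-Karp max-flow recovers the optimal labelling, checked against brute force.

```python
from collections import deque

def max_flow(cap, s, t):
    """Edmonds-Karp max-flow; cap[u][v] holds residual capacities.
    Returns the flow value and the set of nodes on the source side."""
    flow = 0.0
    while True:
        parent, q = {s: None}, deque([s])
        while q and t not in parent:           # BFS for a shortest path
            u = q.popleft()
            for v, c in cap[u].items():
                if c > 1e-12 and v not in parent:
                    parent[v] = u
                    q.append(v)
        if t not in parent:
            return flow, set(parent)           # no augmenting path left
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        push = min(cap[u][v] for u, v in path)
        for u, v in path:                      # push flow, update residuals
            cap[u][v] -= push
            cap[v][u] = cap[v].get(u, 0.0) + push
        flow += push

def minimize_binary_submodular(n, unary, pairwise):
    """unary[i] = (cost of x_i=0, cost of x_i=1); pairwise[(i,j)] =
    (A,B,C,D) = costs of (0,0),(0,1),(1,0),(1,1), with B+C >= A+D."""
    u0 = [unary[i][0] for i in range(n)]
    u1 = [unary[i][1] for i in range(n)]
    const = 0.0
    cap = {x: {} for x in list(range(n)) + ['s', 't']}
    for (i, j), (A, B, C, D) in pairwise.items():
        assert B + C >= A + D, "pairwise term is not submodular"
        # reparameterization: theta = A + (C-A)x_i + (D-C)x_j
        #                             + (B+C-A-D)(1-x_i)x_j
        const += A
        u1[i] += C - A
        u1[j] += D - C
        cap[i][j] = cap[i].get(j, 0.0) + (B + C - A - D)
    for i in range(n):
        m = min(u0[i], u1[i])
        const += m
        cap['s'][i] = u1[i] - m                # arc cut iff x_i = 1
        cap[i]['t'] = cap[i].get('t', 0.0) + (u0[i] - m)
    value, source_side = max_flow(cap, 's', 't')
    labels = [0 if i in source_side else 1 for i in range(n)]
    return labels, const + value

# A 3-node chain with Potts ("pay 1 for disagreeing") pairwise costs.
unary = [(0.0, 2.0), (1.5, 0.5), (3.0, 0.0)]
pairwise = {(0, 1): (0.0, 1.0, 1.0, 0.0), (1, 2): (0.0, 1.0, 1.0, 0.0)}
labels, energy = minimize_binary_submodular(3, unary, pairwise)

def brute(y):   # exhaustive check of the result
    return (sum(unary[i][y[i]] for i in range(3)) +
            sum(pairwise[e][2 * y[e[0]] + y[e[1]]] for e in pairwise))
best = min(brute((a, b, c)) for a in (0, 1) for b in (0, 1) for c in (0, 1))
print(labels, energy, best)   # the two energies coincide
```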

Unfortunately, the binary labelling problem is not so common either. In practice people want to choose one of multiple labels for each node. This problem is NP-hard, but the iterative α-expansion and αβ-swap procedures give good approximations with certain theoretical guarantees [Boykov et al., 2001]. During each iteration a min-cut algorithm is run, so there is an important constraint: every projection of the pairwise potentials onto any two variables must be submodular. If this holds for your energy, I recommend this method, since it is really fast and accurate. The implementation by Olga Veksler is available.
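The shape of the iteration can be sketched as follows. Only the shape: this toy solves each expansion move by brute force instead of a min-cut, so it is exponential and purely illustrative; all the potentials are invented.

```python
from itertools import product

def alpha_expansion(n, K, unary, edges, lam, sweeps=2):
    """Iterate over labels alpha; each move lets every node either keep
    its label or switch to alpha, and the best such move is applied."""
    def energy(y):
        return (sum(unary[i][y[i]] for i in range(n)) +
                sum(lam for i, j in edges if y[i] != y[j]))
    # start from the unary-only optimum
    y = [min(range(K), key=lambda c: unary[i][c]) for i in range(n)]
    for _ in range(sweeps):
        for alpha in range(K):
            # brute-force the binary subproblem (2^n: toy-sized only);
            # a real implementation solves it with a single min-cut
            moves = ([alpha if m else yi for m, yi in zip(move, y)]
                     for move in product((0, 1), repeat=n))
            best = min(moves, key=energy)
            if energy(best) < energy(y):   # moves never increase energy
                y = best
    return y, energy(y)

# 4-node chain, 3 labels, Potts smoothness (all numbers invented).
unary = [[0.0, 2.0, 2.0], [2.0, 0.3, 2.0],
         [2.0, 0.2, 2.0], [2.0, 2.0, 0.0]]
edges = [(0, 1), (1, 2), (2, 3)]
labels, e = alpha_expansion(4, 3, unary, edges, lam=0.5)
print(labels, e)
```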

What should we do if the energy is not submodular and the graph contains cycles? There are methods for this case too, but they are not as fast and typically not as effective. First of all, the stated integer programming problem can be relaxed to a linear program. You can use any LP solver, but it will eventually take days to find the optimum. Moreover, it won't be exact, because the solution of the relaxed problem can be rounded back in multiple ways. No surprise here, since the ability to obtain the global optimum would prove that P = NP.

One of the methods for approaching the general problem is loopy belief propagation [Kschischang et al., 2001]. No doubt, it is really loopy: dynamic programming is applied to a graph with loops, so it can end up passing messages along a cycle endlessly, without converging. It is now considered outdated, but you can still try it; it is implemented in libDAI. The input is a factor graph, which is inconvenient in most cases but really flexible (you can create cliques of arbitrary order).
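A minimal sketch of max-product loopy BP in the log domain (my own toy potentials on a 4-cycle; libDAI's actual factor-graph interface is different). The exact MAP is computed by brute force for comparison.

```python
from itertools import product

def loopy_bp(n, edges, K, unary, pair, iters=30):
    """Max-product loopy BP, log domain: synchronous message updates,
    no convergence guarantee on graphs with cycles."""
    nbrs = {i: [] for i in range(n)}
    for i, j in edges:
        nbrs[i].append(j)
        nbrs[j].append(i)
    msg = {(i, j): [0.0] * K for i in nbrs for j in nbrs[i]}
    for _ in range(iters):
        new = {}
        for (i, j) in msg:
            new[(i, j)] = [
                max(unary[i][xi] + pair(xi, xj) +
                    sum(msg[(k, i)][xi] for k in nbrs[i] if k != j)
                    for xi in range(K))
                for xj in range(K)]
            m = max(new[(i, j)])              # normalize to avoid drift
            new[(i, j)] = [v - m for v in new[(i, j)]]
        msg = new
    def belief(i):
        return max(range(K), key=lambda xi: unary[i][xi] +
                   sum(msg[(k, i)][xi] for k in nbrs[i]))
    return [belief(i) for i in range(n)]

n, K = 4, 2
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]       # a 4-cycle
unary = [[1.0, 0.0], [0.8, 0.0], [0.0, 0.7], [0.0, 0.6]]
pair = lambda a, b: 0.3 if a == b else 0.0     # attractive coupling
y = loopy_bp(n, edges, K, unary, pair)

def total(z):
    return (sum(unary[i][z[i]] for i in range(n)) +
            sum(pair(z[i], z[j]) for i, j in edges))
exact = max(product(range(K), repeat=n), key=total)
print(y, list(exact))
```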

The state-of-the-art approach in the general case, tree-reweighted message passing (TRW), was developed by Wainwright and Kolmogorov [Kolmogorov, 2006]. It is based on decomposing the original graph into subgraphs where exact message-passing inference is possible, i.e. trees. Message passing is performed iteratively on the trees, which are forced to agree on the values assigned to shared nodes. The procedure is guaranteed to converge, although unfortunately not always to the global maximum. Still, it performs better than loopy BP. Kolmogorov's implementation provides LBP and TRW-S under a common interface, so you can compare their performance. To understand the decomposition technique better, you can read Komodakis's paper [2007], which describes a general decomposition framework.

Once, a group of world-leading MRF researchers gathered, implemented LBP, graph cuts and TRW-S under a common interface, compared their performance, and even shared the code on a Middlebury College page. It would be great to use those implementations interchangeably, but unfortunately the interface is tailored to the stereo problem, i.e. it is impossible to use potentials of general form. My latest research is about using MRFs for laser scan classification, where the graph topology is far from a rectangular grid and the potentials are more complex than those of the Potts model, so I ended up writing my own wrappers.

Another interesting method was re-discovered by Kolmogorov and Rother [2007]: quadratic pseudo-boolean optimization (QPBO). This graph-cut-based method can either assign a label to a node or leave it unlabelled. A consistency property is guaranteed: if a label has been assigned, the same label is assigned in a global optimum. An equivalent solution can be obtained from a specific LP relaxation that allows probabilities of 1/2 for class labels. I have not used the technique myself, so I cannot recommend an implementation.

To summarize, the algorithm for choosing an optimization method is as follows. If your graph is actually a tree, use belief propagation. Otherwise, if your energy is submodular, use graph-cut-based inference: it is both effective and efficient (and even exact in the binary case). If not, use TRW-S; it often yields a good approximation.


Computer Vision: Fact & Fiction

I was surfing the web today and came upon the Stanford CS 223B course (Introduction to Computer Vision), which is said to be fucking hard. The first course homework is to watch the film series "Computer Vision: Fact & Fiction", in which computer vision stars (like David Forsyth and Andrew Zisserman) analyse computer vision technologies featured in Hollywood movies. The videos require no background in vision and might be interesting to everyone. To me, it is also interesting to see how the famous vision folks look and talk.


My friend Tolya Yudanov once said that arguing about how realistic the vision technology in The Terminator is feels like "arguing about the physical correctness of animé: the Terminator is a complex AI of the future, and it is stupid to apply modern computer vision criteria to it". Still, it is a good illustration of the concept of computer blindness. I encourage you to watch the videos; they are worth watching.
