adventures in optimization

🖍 Visualizing Decision Diagrams with Dash and Cytoscape

Sat, 18 Oct 2025 00:00:00 +0000

I posted a couple years ago about my adventures applying various graph visualization tools to Decision Diagrams (DD). This is an interesting problem because DDs have characteristics that don’t apply to other graphs. They are layered¹ and each arc from a layer n-1 to layer n has the same direction. Nodes in a layer can be merged or split apart, with those operations generally staying within the same layer². Sub-diagrams can even be peeled off of parent diagrams during branch-and-bound, while maintaining much of their original structure.

At the time, I settled on using Mermaid for automated DD rendering, but that still had a few issues. The rendering itself was nice, but it was hard to keep labels readable. Mermaid uses its own data format for graph representation. I’d rather draw graphs based on something like a Python data structure without translating that data into an intermediate format.

Since then, I’ve found myself stepping iteratively through processes that modify DDs in various ways for a side research project³. None of the options I looked at before are quite suitable. I need to visually inspect the impacts of DD operations like reduction and restriction, and to separate specific arcs in an iterative process that I can drive interactively. This led me to experiment with Dash and, by extension, Cytoscape⁴.

Dash & Cytoscape

Let’s look at the same example diagram as before using Dash. First we initialize a Python list of graph elements to display. We’ll feed this list directly into Dash’s Cytoscape layout.

ELEMENTS = [
    # nodes: layer 0
    {"data": {"id": "r", "label": "r"}},
    # arcs: layer 0 -> layer 1
    {"data": {"source": "r", "target": "1"}},
    {"data": {"source": "r", "target": "2"}},
    {"data": {"source": "r", "target": "3"}},
    # nodes: layer 1
    {"data": {"id": "1", "label": "0"}},
    {"data": {"id": "2", "label": "[[1,2],4]"}},
    {"data": {"id": "3", "label": "3"}},
    # arcs: layer 1 -> layer 2
    {"data": {"source": "2", "target": "4"}},
    {"data": {"source": "2", "target": "5"}},
    {"data": {"source": "3", "target": "6"}},
    # nodes: layer 2
    {"data": {"id": "4", "label": "10"}},
    {"data": {"id": "5", "label": "20"}},
    {"data": {"id": "6", "label": "100"}},
    # arcs: layer 2 -> layer 3
    {"data": {"source": "4", "target": "t"}},
    {"data": {"source": "5", "target": "t"}},
    {"data": {"source": "6", "target": "t"}},
    # nodes: layer 3
    {"data": {"id": "t", "label": "t"}},
]

Already this is pretty nice. This list is easy to generate from any graph data model. It’s just a flat list of nodes and arcs. If we need to, we can add additional information directly to the data dictionary.

    {"data": {"source": "6", "target": "t", "xyzzy": "plugh"}},
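As a rough sketch of what "easy to generate" means, here is one way to flatten a layered diagram into that element list. The `to_elements` helper and the shapes of `example_layers` and `example_arcs` are made up for illustration; they are not part of Dash or Cytoscape, so adapt them to whatever data model you already have.

```python
def to_elements(layers, arcs):
    """Flatten a layered diagram into a list of Cytoscape elements.

    layers: list of layers, each a list of (node_id, label) pairs.
    arcs: list of (source_id, target_id, extra_data) triples.
    Both input shapes are hypothetical.
    """
    elements = []
    for layer in layers:
        for node_id, label in layer:
            elements.append({"data": {"id": str(node_id), "label": str(label)}})
    for source, target, extra in arcs:
        # Extra keys ride along in the same data dictionary.
        elements.append(
            {"data": {"source": str(source), "target": str(target), **extra}}
        )
    return elements


example_layers = [[("r", "r")], [("1", "0"), ("2", "[[1,2],4]")], [("t", "t")]]
example_arcs = [("r", "1", {}), ("r", "2", {}), ("2", "t", {"xyzzy": "plugh"})]
elements = to_elements(example_layers, example_arcs)
```

The resulting list has the same flat structure as ELEMENTS, with nodes first and arcs after.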

Next we initialize Cytoscape layouts and create a Dash app.

import dash_cytoscape as cyto
from dash import Dash

cyto.load_extra_layouts()
app = Dash()

The app is responsible for serving a web page containing our Cytoscape layout. We can add more data and layouts, interactive elements such as buttons, and logic through callbacks to the app as well⁵.

Now we just add a Cytoscape layout to the app and run it. Note that the dagre layout renders the diagram top down, while klay renders it left to right.

app.layout = cyto.Cytoscape(
    layout={ "name": "dagre" },
    elements=ELEMENTS,
)
app.run()

That’s it! So what does our beautiful diagram look like?

Styling

Wait, that’s not very good, is it? If anything, it’s at least as bad as any of the other options, right?

At this point, yes, but one of the qualities that separates Cytoscape from other graph visualization options is its capacity for element styling. Let’s improve on this visualization by adding some styles.

STYLES = [
    {
        "selector": "edge",
        "style": {
            "curve-style": "bezier",
            "target-arrow-shape": "triangle",
        },
    },
    {
        "selector": "node",
        "style": {
            "shape": "rectangle",
            "width": "label",
            "height": "label",
            "content": "data(label)",
            "font-family": "monospace",
            "text-valign": "center",
            "padding": "10px",
        },
    },
]

app.layout = cyto.Cytoscape(
    layout={ "name": "dagre" },
    elements=ELEMENTS,
    stylesheet=STYLES,
)
app.run()

Already, this is significantly better than the default style.

Since the elements and styles are cleanly separated, it’s convenient to style nodes and arcs based on aspects of their data. To give you a sense of what this means for DDs, here is a screenshot from that research project I mentioned.

In this case, border colors, background colors, and line styles have different meanings. It’s easy to add interactivity like toggling on more information in the node labels, or restructuring the diagram and its representation based on user input. Try out the example to see how Dash is built from the ground up for interactivity.
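To make the idea of data-driven styling concrete, here is a minimal sketch. The `state` and `kind` data fields and the colors are assumptions chosen for illustration, not the ones used in my research project. The selectors use Cytoscape's attribute-matching syntax.

```python
# Hypothetical data fields: "state" on nodes and "kind" on arcs.
DATA_ELEMENTS = [
    {"data": {"id": "r", "label": "r", "state": "exact"}},
    {"data": {"id": "m", "label": "[[1,2],4]", "state": "relaxed"}},
    {"data": {"source": "r", "target": "m", "kind": "restricted"}},
]

DATA_STYLES = [
    # Base style for every node.
    {
        "selector": "node",
        "style": {
            "content": "data(label)",
            "border-width": 2,
            "border-color": "black",
            "background-color": "white",
        },
    },
    # Override colors only for nodes whose data marks them as relaxed.
    {
        "selector": 'node[state = "relaxed"]',
        "style": {"border-color": "red", "background-color": "mistyrose"},
    },
    # Draw arcs of a particular kind with a dashed line.
    {
        "selector": 'edge[kind = "restricted"]',
        "style": {"line-style": "dashed"},
    },
]
```

These would be passed to cyto.Cytoscape through the elements and stylesheet arguments, just like ELEMENTS and STYLES earlier.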

Resources

dash-cytoscape.py provides the full example visualization

Though this is less the case as DD implementations become more like Dynamic Programming and abandon layers. ↩︎
Ibid. ↩︎
Not to make excuses, but research projects go a lot slower lately than they used to. ↩︎
Dash delegates all of its graph rendering functionality to Cytoscape, and provides an API layer for graph data management and interactivity. ↩︎
These are out of scope for this post. ↩︎

🐏 RAMS Reboot

Wed, 25 Jun 2025 00:00:00 +0000

Some years ago, I worked on real-time meal delivery at Zoomer, a YC startup based out of Philadelphia. Zoomer’s production tech stack was primarily Ruby. As it grew we moved from using heuristics for things like routing and scheduling to open source optimization solvers.

Like most languages that aren’t Python, Ruby doesn’t have an especially mature ecosystem for optimization (or data science, or machine learning, for that matter). For some use cases that didn’t matter. When we upgraded the routing engine, we built a model in C++ using Gecode and wrapped a Ruby gem around a SWIG wrapper. But when we wanted to use integer programming to build schedules, the lack of solver APIs proved inconvenient.¹

At the time, PuLP was probably the most commonly used open source multi-solver Python library for linear and integer programming.² This gave me the opportunity to develop RAMS, a PuLP-inspired library for basic MILP modeling in Ruby.

Then the Zoomer team became part of Grubhub. We moved to a Java stack and a commercial optimization solver. Improvements to the RAMS project languished on my todo list. It lagged behind major versions of Ruby, optimization solvers, and dependencies, and ended up painfully out of date and unmaintained.

Then, last month, GitHub released its Copilot agent. Unlike vibe coding directly in the editor, which sounds like speeding maniacally through a bad acid trip, the idea here is more like running a project: create issues, receive and comment on pull requests, iterate.

I figured the grunt work of library upgrades should be perfect fodder to try out an AI developer assistant. RAMS is already well structured and tested. The upgrade is well defined. No creativity required.

A RAMS modeling example

This post is meandering through two topics: solving optimization models with Ruby and RAMS, and my experiences maintaining that library using Copilot. I could have split this into two posts, but that didn’t feel right. So let’s show what building a model in RAMS looks like first.

I don’t use Ruby with any regularity these days³, but modeling with RAMS reminded me how elegant Ruby DSLs can be. Here’s a simple example of a binary integer program.

#!/usr/bin/env ruby

require 'rams'

m = RAMS::Model.new

x1 = m.variable type: :binary
x2 = m.variable type: :binary
x3 = m.variable type: :binary

m.constrain(x1 + x2 + x3 <= 2)
m.constrain(x2 + x3 <= 1)

m.sense = :max
m.objective = 1 * x1 + 2 * x2 + 3 * x3

solution = m.solve
puts <<-HERE
objective: #{solution.objective}
x1 = #{solution[x1]}
x2 = #{solution[x2]}
x3 = #{solution[x3]}
HERE

I think that’s rather nice, and very clean.

RAMS enhancements

The biggest change in RAMS is that it now supports the HiGHS optimization solver. Prior to v0.2.0, GLPK was the default solver; now the default is HiGHS. There are a number of smaller changes as well.

RAMS requires Ruby v3.1.
CPLEX support was removed since I can’t test it.⁴
One can set solver paths using environment variables (e.g. RAMS_SOLVER_PATH_CBC).
Improved documentation and a logo!

The Copilot agent as coding companion

While I tend to err on the side of LLM skepticism, working with the Copilot agent for this upgrade was generally positive. It was a bit like working with a fast, responsive, and inexperienced developer. The issues it ran into were pretty much the same, but the time scale was compressed.

I had it open three pull requests for me.

🤨 PR 29: Upgrade Ruby and dependencies

Performance here was middling. Copilot got through some of the task without assistance. It also made a number of changes that were unhelpful and irrelevant to the request.

On a positive note, I forgot to ask it to change from CircleCI to GitHub Actions for testing. This gave me the opportunity to test its response to feature creep. It responded with a partially working GitHub Actions workflow (and no grumbling!).

Copilot made a number of errors and wasn’t able to finish the upgrade on its own.

It decided to build the optimizers from source instead of simply installing binary packages using apt or dnf. Not only is this wasteful and overly complicated, it ultimately wasn’t able to build and install them from source.
Once I told it to use a Fedora 42 base image, this improved, but it couldn’t figure out what package to use for the CBC solver. It switched back and forth without prompting between cbc (incorrect) and coin-or-Cbc (correct).
It inexplicably couldn’t figure out the latest stable version of Ruby.
It added a bunch of architecture-specific package definitions to the build, unprompted. This was unnecessary given that RAMS is a vanilla Ruby project.
I had to help it figure out that the CBC binary is now called coin.cbc on Fedora. This wasn’t entirely surprising.

🤩 PR 32: Add environment variables for solver paths

Copilot did a great job on this task. I had no issue with the code it wrote. It followed the style of the rest of the package nicely. It added appropriate documentation and unit tests.

👌 PR 34: Support the HiGHS optimization solver

Copilot did pretty well here, even though it didn’t get the feature working. It was able to create a new solver interface and get most of the logic for solution parsing right. I was a little surprised that it forgot to test the new solver integration in GitHub Actions. The biggest issue it needed my help on was solution status parsing, where it didn’t realize that the second condition here will never trigger: any status containing “infeasible” also contains “feasible”, so the first line always returns first.

return :feasible if status =~ /feasible/i
return :infeasible if status =~ /infeasible/i

This should have been the following (note the ^).

return :feasible if status =~ /^feasible/i
return :infeasible if status =~ /infeasible/i
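The same pitfall is easy to reproduce outside of Ruby. Here is a small Python sketch of both versions. The function names and the status strings "Feasible" and "Infeasible" are assumptions for illustration, not the actual RAMS code. Also note that Python’s ^ anchors the start of the string by default, while Ruby’s anchors the start of a line; for single-line statuses the two behave the same.

```python
import re


def parse_status_buggy(status):
    # "Infeasible" contains "feasible", so this branch always wins.
    if re.search(r"feasible", status, re.IGNORECASE):
        return "feasible"
    if re.search(r"infeasible", status, re.IGNORECASE):
        return "infeasible"
    return "unknown"


def parse_status_fixed(status):
    # Anchoring at the start keeps "Infeasible" from matching here.
    if re.search(r"^feasible", status, re.IGNORECASE):
        return "feasible"
    if re.search(r"infeasible", status, re.IGNORECASE):
        return "infeasible"
    return "unknown"


buggy = parse_status_buggy("Infeasible")
fixed = parse_status_fixed("Infeasible")
```

Checking the infeasible case first would be another way to avoid the problem.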

I don’t remember finding any MILP modeling interfaces for Ruby like PuLP in 2016-17. More recently, Rulp and Opt have been developed. ↩︎
PuLP is still heavily used and developed. ↩︎
Once upon a time I was a Perl programmer. Ruby was originally written to be a better Perl. I’ve long since given up the old ways. ↩︎
For now, RAMS is focussing on open source solvers. Maintaining commercial solver licenses can be challenging when you’re not part of academia. PRs welcome. ↩︎

🖖 Hi, I'm Ryan.

Thu, 10 Apr 2025 00:00:00 +0000

I build decision science and optimization software.

By day, I am an optimization engineer, coder, and co-founder of Nextmv. I’m interested in hybrid optimization, decision diagrams, and mixed integer programming. My applications skew toward logistics for delivery platforms, with detours into cutting and packing. Lately I’ve been embedding a lot of trained machine learning models in optimization problems, and exploring applications of inverse optimization.

For the past several years, I’ve worked in real-time optimization for on-demand delivery, scheduling, forecasting, and simulation. I received an MS in Operations Research by night at George Mason University, then a PhD in the same department under the advisement of Karla Hoffman.

Appearances

This is a running list of talks I’ve given and am scheduled to give. It probably isn’t exhaustive. Some of them have slides or videos available.


2025	Jun 26	🎥	Nextmv Videos	Operationalize a Gurobi price optimization notebook: Deploy, collaborate, test, and visualize with Nextmv
	May 15		ODSC East 2025	Predict & Prescribe: Combining forecasts with optimized plans
	Apr 8		University of Luxembourg	Decision model, meet reality: Testing lessons from food logistics and delivery operations
	Mar 14-16	📄 📄	INFORMS Computing Society Conference 2025	Chair and organizer of the solvers cluster
	Mar 6	🎥	Nextmv Videos	Nextmv Hexaly Integration: How to run, test, and manage with DecisionOps workflows
2024	Nov 7	🎥	Nextmv Videos	Nextmv ML/OR connectors: A price optimization example with Gurobipy, Gurobi ML, and Gurobipy Pandas
	Oct 21	📄 💻	INFORMS Annual Meeting 2024	Solving the Weapon Target Assignment Problem with Decision Diagrams
	Oct 3	🎥	Nextmv Videos	Uncertainty, ML + OR, and stochastic optimization: Demo and Q&A with Seeker creator
	Jul 30	🎥	Nextmv Videos	Operationalizing HiGHS-based MIP models and Q&A with project developers
	Jun 27	💻	HiGHS Workshop 2024	Symphonic HiGHS: Operationalizing next moves with DecisionOps
	Jun 18	🎥	Nextmv Videos	Combining machine learning (ML) and operations research (OR) through horizontal computing
	Jun 7	📄 💻 🎥	EURO Practitioners’ Forum	Three model problem: Combining machine learning (ML) and operations research (OR) through horizontal computing
	Apr 14	📄	INFORMS Analytics Conference 2024	The sushi is ready. How do I deliver it? Forecast, schedule, route with DecisionOps
	Apr 10	🎥	Nextmv Videos	Getting started with DecisionOps for decision science models using Gurobi
2023	Dec 6	📄 🧑‍💻️ 💻 🎥	PyData Global 2023	Order up! How do I deliver it? Build on-demand logistics apps with Python, OR-Tools, and DecisionOps
	Nov 16	🎥	Nextmv Videos	Forecast, schedule, route: 3 starter models for on-demand logistics
	Oct 17	📄	INFORMS Annual Meeting 2023	Adapting to Change in On-Demand Delivery: Unpacking a Suite of Testing Methodologies
	Sep 20	📄 💻 🎥	DecisionCAMP 2023	Decision model, meet the real world: Testing optimization models for use in production environments
	Aug 27	💻	DPSOLVE 2023	Implementing Decision Diagrams in Production Systems
	May 11	🎥	Nextmv Videos	Several people are optimizing: Collaborative workflows for decision model operations
	Apr 17	📄	INFORMS Analytics Conference 2023	Decision Model, Meet Production: A Collaborative Workflow for Optimizing More Operations
	Feb 16	🎥	Nextmv Videos	Decision diagrams in operations research, optimization, vehicle routing, and beyond
	Jan 18	🎥	Nextmv Videos	In conversation with Karla Hoffman
2022	Nov 16	🎥	Nextmv Videos	Decision model, meet production
2020	Oct 5	🎥	INFORMS Philadelphia Chapter	Real-Time Routing for On-Demand Delivery
2019	Oct 22	💻	INFORMS Annual Meeting 2019	Decision Diagrams for Real-Time Routing
2017	July 6	📄 🎥	PyData Seattle 2017	Practical Optimization for Stats Nerds
	Mar 5	💻	Data Science DC	Practical Optimization for Stats Nerds
2015	Dec 4	💻 🎥	PyData NYC 2015	Optimize your Docker Infrastructure with Python
2014	Jul 17	📄 💻	IFORS 2014	A MIP-Based Dual Bounding Technique for the Irregular Nesting Problem
2010	Feb 19	🎥	PyCon 2010	Optimal Resource Allocation using Python

Articles, papers & patents

I’m a desultory blogger and intermittent academic. Most of my current and old posts live here. Some of my other content is below.


2024	Dec 4		ORMS Today	Ops Researchers, It’s Time to Git with the Flow
	Mar 26	🖨️	USPTO	Fast computational generation of digital pickup and delivery plans describes algorithms for fast on-demand routing in pickup and delivery problems.
	Oct 31		Nextmv Blog	5 things software teams should know about operations research and decision science
	Oct 17		Nextmv Blog	New integration: Bring your Hexaly decision model to Nextmv
	Mar 7		Nextmv Blog	Nextmv Gurobi integration: Build, test, deploy decision models using Gurobi and DecisionOps
	Feb 13		Nextmv Blog	CI/CD for decision science: What is it, how does it work, and why does it matter?
	Feb 1		Nextmv Blog	New decision apps, an open source decision model hub, and an individual plan
2023	Dec 26	🖨️	USPTO	Prediction of travel time and determination of prediction interval describes technology for predicting travel times for on-demand delivery platforms.
	Dec 19		Nextmv Blog	Shift scheduling optimization: Generating shift types, planning for demand, and assigning workers
	Jun 13	🖨️	USPTO	Runners for optimization solvers and simulators describes technology for creating and executing Decision Diagram-based optimization solvers and state-based simulators in cloud environments.
2022	Apr 20		Nextmv Blog	You need a solver. What is a solver?
2021	Mar 2		Nextmv Blog	Binaries are beautiful
2020	Sep 11	🖨️	Operations Research Forum	MIPLIBing: Seamless Benchmarking of Mathematical Optimization Problems and Metadata Extensions presents a Python library that automatically downloads queried subsets from the current versions of MIPLIB, MINLPLib, and QPLIB, provides a centralized local cache across projects, and tracks the best solution values and bounds on record for each problem.
	Mar 2		Nextmv Blog	How Hop Hops
2019	May	🖨️	Operations Research Letters	Decision diagrams for solving traveling salesman problems with pickup and delivery in real time explores the use of Multivalued Decision Diagrams and Assignment Problem inference duals for real-time optimization of TSPPDs.
2018	Oct 2	🖨️	Optimization Online	Integer Models for the Asymmetric Traveling Salesman Problem with Pickup and Delivery proposes a new ATSPPD model, new valid inequalities for the Sarin-Sherali-Bhootra ATSPPD, and studies the impact of relaxing complicating constraints in these.
	Sep 13		Grubhub Bytes	Decisions are first class citizens: an introduction to Decision Engineering
	Sep 2	🖨️	Optimization Online	Exact Methods for Solving Traveling Salesman Problems with Pickup and Delivery in Real Time examines exact methods for solving TSPPDs with consolidation in real-time applications. It considers enumerative, Mixed Integer Programming, Constraint Programming, and hybrid optimization approaches under various time budgets.
	Apr 10	🖨️	Optimization Online	The Meal Delivery Routing Problem introduces the MDRP to formalize and study an important emerging class of dynamic delivery operations. It also develops optimization-based algorithms tailored to solve the courier assignment (dynamic vehicle routing) and capacity management (offline shift scheduling) problems encountered in meal delivery operations.
2015	Jan 5		The Yhat Blog	Currency Portfolio Optimization Using ScienceOps
2014	Nov 10		The Yhat Blog	How Yhat Does Cloud Balancing: A Case Study

Software

Most of my work is proprietary, but some of it is open. Here are a few projects I’ve built or made significant contributions to. I’ve also done some work on projects such as PuLP, MIPLIBing, and MDRPlib.

Active(ish) projects

The Ruby Algebraic Modeling System is a simple modeling tool for formulating and solving MILPs in Ruby.
ap.cpp is an incremental primal-dual assignment problem solver written in C++. It can vastly improve propagation in hybrid optimization models that use AP relaxations. I use it within custom propagators in Gecode and in Decision Diagrams for solving the Traveling Salesman Problem with side constraints.
ap is a Go version of ap.cpp.
TSPPD Hybrid Optimization Code and TSPPD Decision Diagram Code are both used in my dissertation. The former contains C++14 code for hybrid CP and MIP models for solving TSPPDs. The latter uses a hybridized Decision Diagram implementation with an Assignment Problem inference dual inside a branch-and-bound.
TSPPDlib is a standard test set for TSPPDs. The instances are based on observed meal delivery data at Grubhub.

Defunct projects

python-zibopt was a Python interface to the SCIP Optimization Suite. This was no longer necessary once PySCIPOpt emerged.
Chute was a simple, lightweight tool for running discrete event simulations in Python.
PyGEP was a simple library suitable for academic study of GEP (Gene Expression Programming) in Python 2.

Et al

In my spare time, I’m a cat and early music enthusiast, plus…

a board member of Classical Uprising,
a mentor of startup founders at the Roux Institute,
chair of the INFORMS Membership Committee,
and a cellist in the Southern Maine Symphony Orchestra.

Iconography

📄 = abstract
🧑‍💻️ = code
🖨️ = pdf
🎟 = registration
💻 = slides
🎥 = video

🧺 ICS 2025 Solvers Cluster Takeaways

Tue, 18 Mar 2025 00:00:00 +0000

I just returned from the 2025 INFORMS Computing Society conference, where I had the privilege of organizing a cluster on optimization solvers. The cluster had two sessions, Solvers I and Solvers II, and focussed on new developments in the implementation of optimization solvers.

In the coming days, I’m going to explore some of these solvers in more depth. For now, I wanted to give a few hot takes from the sessions while they are still fresh in my mind.

Hybrid optimization is everywhere

Hybrid optimization combines multiple techniques to solve a given problem. Most of the hybrid optimization literature focuses on leveraging strengths of different techniques to solve a particular well-defined problem, such as a routing problem with time windows, but it can also provide clear benefits to general-purpose solvers.

OR-Tools, which is likely the most commonly used open source solver, gave a talk on the design of their CP-SAT[-LP] algorithm. To users interacting with OR-Tools through its APIs, CP-SAT looks like an ordinary constraint programming (CP) solver. Internally, however, they boost its CP solver with techniques from satisfiability (SAT) and linear programming (LP). This gives a whole that is much more powerful than its parts, as shown below.

☝ Please pardon the poor quality of this photo I took during their ICS talk.

Meanwhile, the commercial optimizer Hexaly incorporates a basketful of technologies under the hood. These include techniques from exact methods, heuristics like large neighborhood search, and even some ideas from Decision Diagrams (DD).

Interestingly, both solvers admit to some component algorithms being well behind leading implementations. OR-Tools’s SAT and LP solvers are somewhat rudimentary, and Hexaly’s simplex and interior point algorithms would not be competitive on their own. It is the combination of multiple algorithms and approaches that makes the solvers powerful.

State-based modeling has a big opportunity

Mixed Integer Programming (MIP) (and other math programming classes), CP, and Dynamic Programming (DP) have all been standard techniques in the optimization toolkit for decades. While MIP and CP both benefit from standard formats and solver interoperability through systems like MiniZinc, AMPL, and other projects, that never really happened for DP. Even now, DP models are usually bespoke and lack both modeling standards and standard solvers.

That is rapidly changing with the development of both Domain-Independent Dynamic Programming (DIDP), and new DD solvers like CODD. These efforts are still nascent, but there is growing momentum toward building both domain-independent solvers and modeling languages for state-based models. If this succeeds, DP and state-based models have the potential to become similar to MIP and CP in power, portability, and expressiveness.

Established technologies are rapidly innovating, too

Other talks in the cluster showed MaxiCP, a CP solver with roots in MiniCP that is suitable for real-life use, recent developments in proving global optimality for Mixed Integer Non-Linear Programs (MINLP) in Xpress, and an interesting new heuristic solver based on a technique called Random-Key Optimization (RKO) which represents solutions as vectors between 0 and 1 and changes the modeling exercise into solution decoding.
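To give a feel for the random-key idea, here is a minimal sketch under my own assumptions. It decodes a vector of keys in [0, 1) into a tour by sorting, then runs a plain random search over key vectors. This is not the RKO solver from the talk, and the distance matrix is made up; it only illustrates how the modeling work moves into the decoder.

```python
import random


def decode(keys):
    """Decode random keys into a permutation: visit items in key order."""
    return sorted(range(len(keys)), key=lambda i: keys[i])


def tour_length(tour, dist):
    """Length of the closed tour that returns to its starting node."""
    return sum(dist[tour[i]][tour[(i + 1) % len(tour)]] for i in range(len(tour)))


def random_key_search(dist, iterations=1000, seed=0):
    """Sample key vectors at random and keep the best decoded tour."""
    rng = random.Random(seed)
    best_keys, best_cost = None, float("inf")
    for _ in range(iterations):
        keys = [rng.random() for _ in range(len(dist))]
        cost = tour_length(decode(keys), dist)
        if cost < best_cost:
            best_keys, best_cost = keys, cost
    return decode(best_keys), best_cost


# Hypothetical instance: four nodes on a square, with diagonals of length 2.
DIST = [
    [0, 1, 2, 1],
    [1, 0, 1, 2],
    [2, 1, 0, 1],
    [1, 2, 1, 0],
]
best_tour, best_cost = random_key_search(DIST)
```

The search itself never sees a tour, only vectors of numbers. Everything problem-specific lives in decode and tour_length.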

During an interview several years ago, an optimization team leader at a major logistics company told me that “optimization is a solved problem” and that new solver development was therefore not interesting. That isn’t what I see, though. Instead, I see the practical application of optimization continuing to grow beyond the boundaries of what today’s solvers can handle, and a ton of activity in development of those solvers to make them ever more powerful and flexible.

👔 Hierarchical Optimization with HiGHS

Mon, 11 Nov 2024 00:00:00 +0000

In the last post, we used Gurobi’s hierarchical optimization features to compute the Pareto front for primary and secondary objectives in an assignment problem. This relied on Gurobi’s setObjectiveN method and its internal code for managing hierarchical problems.

Some practitioners may need to do this without access to a commercial license. This post adapts the previous example to use HiGHS and its native Python interface, highspy. Working through the procedure by hand is also a good way to understand it. This isn’t exactly what I’d call hard, but it is easy to mess up.¹

Code

The mathematical models are available in the last post, so I won’t restate them here. We start in roughly the same manner as before²: create a binary variable for each worker-patient pair, add assignment problem constraints, and state the primary objective.

from itertools import product
import highspy

n = len(data["cost"])
workers = range(n)
patients = range(n)
workers_patients = list(product(workers, patients))

h = highspy.Highs()

# x[w,p] = 1 if worker w is assigned to patient p.
x = {(w, p): h.addBinary(obj=data["cost"][w][p]) for w, p in workers_patients}

# Each worker is assigned to one patient.
h.addConstrs(sum(x[w, p] for p in patients) == 1 for w in workers)

# Each patient is assigned one worker.
h.addConstrs(sum(x[w, p] for w in workers) == 1 for p in patients)

# Primary objective: minimize cost.
h.setMinimize()
h.solve()
cost = h.getObjectiveValue()

Note that if the costs and affinities were lists instead of matrices, we could have used h.addBinaries instead of h.addBinary.

From here we’ll be solving the model twice for every value of alpha. These expressions for total cost and affinity will make the code a little cleaner.

cost_expr = sum(data["cost"][w][p] * x[w, p] for w, p in workers_patients)
affinity_expr = sum(data["affinity"][w][p] * x[w, p] for w, p in workers_patients)

Now comes the hierarchical optimization logic. For every value of alpha, we find the best affinity possible while keeping cost within alpha of its best possible value.

Update the objective function to maximize affinity (see the calls to h.changeColCost and h.setMaximize).
Constrain the cost to be within alpha of the original optimal cost (see cost_cons).
Re-optimize and save the maximal affinity.

Now we constrain the affinity and re-optimize cost.³

Update the objective function to minimize cost again.
Constrain the affinity.

Once that’s done, we remove the additional constraints and repeat for a new value of alpha.

for alpha in alphas:
    # Secondary objective: maximize affinity.
    for (w, p), x_wp in x.items():
        h.changeColCost(x_wp.index, data["affinity"][w][p])

    # Constrain cost to be within alpha of its optimal value.
    cost_cons = h.addConstr(cost_expr <= (1 + alpha) * cost)

    h.setMaximize()
    h.solve()
    affinity = h.getObjectiveValue()

    # Re-optimize with original cost objective, constraining affinity.
    for (w, p), x_wp in x.items():
        h.changeColCost(x_wp.index, data["cost"][w][p])
    affinity_cons = h.addConstr(affinity_expr >= affinity)

    h.setMinimize()
    h.solve()

    yield alpha, h.getObjectiveValue(), affinity

    # Remove cost and affinity constraints for the next value of alpha.
    h.removeConstr(cost_cons)
    h.removeConstr(affinity_cons)

Encouragingly, running this using the model.py linked below gives the same values as the Gurobi model, albeit not as quickly. Floating point values are rounded for readability.

| alpha | cost     | affinity |
| ----- | -------- | -------- |
| 0.0   | 11212.0  | 53816.0  |
| 0.05  | 11761.0  | 74001.0  |
| 0.1   | 12332.0  | 79981.0  |
| 0.15  | 12886.0  | 83103.0  |
| 0.2   | 13454.0  | 85394.0  |
| 0.25  | 13996.0  | 87136.0  |
| 0.3   | 14557.0  | 88546.0  |
| 0.35  | 15125.0  | 89751.0  |
| 0.4   | 15670.0  | 90664.0  |
| 0.45  | 16255.0  | 91345.0  |
| 0.5   | 16816.0  | 91997.0  |
| 0.55  | 17370.0  | 92537.0  |
| 0.6   | 17924.0  | 93012.0  |
| 0.65  | 18495.0  | 93491.0  |
| 0.7   | 19055.0  | 93829.0  |
| 0.75  | 19591.0  | 94228.0  |
| 0.8   | 20167.0  | 94530.0  |
| 0.85  | 20737.0  | 94833.0  |
| 0.9   | 21295.0  | 95114.0  |
| 0.95  | 21812.0  | 95361.0  |
| 1.0   | 22402.0  | 95613.0  |
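One cheap way to sanity-check results like these is to confirm that every cost stays within its alpha budget and that affinity never decreases as alpha grows. The sketch below does that for a handful of rows copied from the table. The `check` helper is hypothetical, and the half-unit tolerance is my assumption to allow for the rounding.

```python
# (alpha, cost, affinity) rows copied from the table above.
rows = [
    (0.0, 11212.0, 53816.0),
    (0.05, 11761.0, 74001.0),
    (0.25, 13996.0, 87136.0),
    (0.5, 16816.0, 91997.0),
    (0.75, 19591.0, 94228.0),
    (1.0, 22402.0, 95613.0),
]

# The alpha = 0 row gives the optimal cost.
best_cost = rows[0][1]


def check(rows, best_cost, tol=0.5):
    """True if each cost respects its budget and affinity never decreases."""
    previous_affinity = float("-inf")
    for alpha, cost, affinity in rows:
        if cost > (1 + alpha) * best_cost + tol:
            return False
        if affinity < previous_affinity:
            return False
        previous_affinity = affinity
    return True


table_ok = check(rows, best_cost)
```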

Resources

model.py hierarchical objectives HiGHS model

It gets even easier to mess up with more than two objectives. ↩︎
Isn’t it nice that MIP modeling is similar across different APIs? ↩︎
Exercise for the reader: why do we need to re-optimize cost? ↩︎

👔 Hierarchical Optimization with Gurobi

Fri, 08 Nov 2024 00:00:00 +0000

One of the first technology choices to make when setting up an optimization stack is which modeling interface to use. Even if we restrict our choices to Python interfaces for MIP modeling, there are lots of options to consider.

If you use a specific solver, you can opt for its native Python interface. Examples include libraries like gurobipy, Fusion, highspy, or PySCIPOpt. This approach provides access to important solver-specific features such as lazy constraints, heuristics, and various solver settings. However, it can also lock you into a solver before you’re ready for that.

You can also choose a modeling API that targets multiple solvers. In the Python ecosystem, these are libraries like amplpy, Pyomo, PyOptInterface, and linopy. These interfaces target multiple solver backends (both open source and commercial) and provide a subset of the functionality of each. Since they make it easy to switch between solvers, this is usually where I start.¹

Hierarchical assignment

However, there are plenty of times when solver-specific APIs are useful, or even critical. One example is hierarchical optimization. This is a simple technique for managing trade-offs between multiple objectives in a problem. Let’s look at an example.

Imagine we are assigning in-home health care workers ($w \in W$) to patients ($p \in P$). For simplicity, let’s say we have $n$ workers and $n$ patients, and we are assigning them one-to-one. Each worker has a given cost ($c_{wp}$) of assignment to each patient, which may reflect something like the travel time to get to them. We want to assign each worker to exactly one patient while minimizing the overall cost.

Model

So far, what we have is a simple linear sum assignment problem.

$$ \begin{align*} & \text{min} && z = \sum_{wp} c_{wp} x_{wp} \\ & \text{s.t.} && \sum_w x_{wp} = 1 && \forall \quad p \in P \\ & && \sum_p x_{wp} = 1 && \forall \quad w \in W \\ & && x \in \{0,1\}^{|W \times P|} \end{align*} $$

Solving this model gives us the minimum cost assignment. That’s all well and good, but now say we have a secondary objective of maximizing affinity of workers to patients ($a_{wp}$). That is, we want to prefer assignments that increase overall affinity while still minimizing cost. This is actually a common goal in health care scheduling: if possible, send the same worker to a given patient that you usually send.

Hierarchical optimization gives us a simple way to solve this problem. First, we optimize the model as stated above. This gives us an optimal objective value $z^*$. Then we re-solve the same optimization model, while constraining the cost to be $z^*$ and using the secondary objective function. This says to the optimizer, “improve the affinity as much as you can, but keep the cost optimal.”

$$ \begin{align*} & \text{max} && w = \sum_{wp} a_{wp} x_{wp} \\ & \text{s.t.} && \sum_{wp} c_{wp} x_{wp} \le z^* \\ & && \sum_w x_{wp} = 1 && \forall \quad p \in P \\ & && \sum_p x_{wp} = 1 && \forall \quad w \in W \\ & && x \in \{0,1\}^{|W \times P|} \end{align*} $$

From here, the natural question becomes: what if we trade off some cost for affinity? If we’re willing to increase cost by some percentage, how much more affinity do we get? We can do this by setting a constant $\alpha \ge 0$ and solving the model a number of times.²

$$ \begin{align*} & \text{max} && w = \sum_{wp} a_{wp} x_{wp} \\ & \text{s.t.} && \sum_{wp} c_{wp} x_{wp} \le (1 + \alpha) z^* \\ & && \sum_w x_{wp} = 1 && \forall \quad p \in P \\ & && \sum_p x_{wp} = 1 && \forall \quad w \in W \\ & && x \in \{0,1\}^{|W \times P|} \end{align*} $$

For example, if $\alpha = 0.05$, then we’re willing to accept a 5% increase in overall cost to improve affinity. Setting different values of $\alpha$ lets us explore the space of that trade-off and its impact on cost and affinity.

Once we solve this and get the optimal affinity ($w^*$), we should re-optimize for the primary objective again while constraining the secondary one.

$$ \begin{align*} & \text{min} && \sum_{wp} c_{wp} x_{wp} \\ & \text{s.t.} && \sum_{wp} a_{wp} x_{wp} \ge w^* \\ & && \sum_w x_{wp} = 1 && \forall \quad p \in P \\ & && \sum_p x_{wp} = 1 && \forall \quad w \in W \\ & && x \in \{0,1\}^{|W \times P|} \end{align*} $$
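The three solves above can be sketched without any solver by enumerating every one-to-one assignment. This is only an illustration for tiny $n$, not how a real model would be solved. The function names and the 3×3 data are made up for this example.

```python
from itertools import permutations


def total(matrix, assignment):
    """Sum matrix[w][p] over an assignment, where assignment[w] = p."""
    return sum(matrix[w][p] for w, p in enumerate(assignment))


def hierarchical_assignment(cost, affinity, alpha=0.0):
    """Three-stage hierarchical solve by brute force (tiny n only)."""
    candidates = list(permutations(range(len(cost))))

    # Stage 1: minimize cost to get z*.
    z_star = min(total(cost, a) for a in candidates)

    # Stage 2: maximize affinity while keeping cost within (1 + alpha) z*.
    budget = (1 + alpha) * z_star
    within_budget = [a for a in candidates if total(cost, a) <= budget]
    w_star = max(total(affinity, a) for a in within_budget)

    # Stage 3: minimize cost again while holding affinity at w* or better.
    best = min(
        (a for a in candidates if total(affinity, a) >= w_star),
        key=lambda a: total(cost, a),
    )
    return best, total(cost, best), total(affinity, best)


# Hypothetical data: rows are workers, columns are patients.
cost = [[10, 20, 30], [20, 10, 30], [30, 30, 10]]
affinity = [[1, 5, 1], [5, 1, 1], [1, 1, 1]]

for alpha in (0.0, 1.0):
    print(alpha, hierarchical_assignment(cost, affinity, alpha))
```

With this data, a larger $\alpha$ lets the second stage swap the first two workers to pick up more affinity at a higher cost.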

Code

So the math looks reasonable. How do we implement it? If we have a Gurobi license, we can use its built-in facilities for multiobjective optimization. This means that, instead of solving a model multiple times and adding constraints to keep cost within $\alpha$ of its optimal value, we can create a single model that does all of this for us.

Assume we have input data which looks like this.

{
    "cost": [
        [10, 20, ...],
        [30, 40, ...],
        ...
    ],
    "affinity": [
        [25, 15, ...],
        [35, 25, ...],
        ...
    ]
}

We start with a simple assignment problem formulation.

import gurobipy as gp

n = len(data["cost"])
workers = range(n)
patients = range(n)

m = gp.Model()
m.ModelSense = gp.GRB.MINIMIZE

# x[w,p] = 1 if worker w is assigned to patient p.
x = m.addVars(n, n, vtype=gp.GRB.BINARY)

for i in range(n):
    # Each worker is assigned to one patient.
    m.addConstr(gp.quicksum(x[i, p] for p in patients) == 1)

    # Each patient is assigned one worker.
    m.addConstr(gp.quicksum(x[w, i] for w in workers) == 1)

We add primary and secondary objectives, and call optimize. The objectives are solved in descending order of the priority flag for Model.setObjectiveN. reltol allows us to degrade the primary objective by some amount (e.g. 5%) to improve the secondary objective.

One catch is that the model only has one objective sense. Since we are minimizing the primary objective, we give the secondary objective a weight of -1 in order to maximize it.

from itertools import product

# Primary objective: minimize cost.
z = (data["cost"][w][p] * x[w, p] for w, p in product(workers, patients))
m.setObjectiveN(expr=gp.quicksum(z), index=0, name="cost", priority=1, reltol=alpha)

# Secondary objective: maximize affinity. Since the model sense is minimize,
# we negate the secondary objective in order to maximize it.
w = (data["affinity"][w][p] * x[w, p] for w, p in product(workers, patients))
m.setObjectiveN(
    expr=gp.quicksum(w), index=1, name="affinity", priority=0, weight=-1
)

m.optimize()

Then we use this magic syntax to pull out the optimal cost and affinity.

m.params.ObjNumber = 0
cost = m.ObjNVal

m.params.ObjNumber = 1
affinity = m.ObjNVal

Results

If we solve this in a loop with alpha values from 0 to 1 in increments of 0.05, we can plot the trade-off between cost and affinity. Going from $\alpha = 0$ to $\alpha = 0.05$ or $\alpha = 0.1$ gives a pretty sizable improvement in affinity. After that, the return starts to gradually level off. This allows us to make a more informed choice about these two objectives.

Resources

generate.py generates input data
input-100x100.json contains input data
model.py hierarchical objectives Gurobi model

¹ While commercial libraries like AMPL have always focussed on modeling performance, some of the open source options targeting multiple solvers come with significant performance penalties during formulation and model handoff to the solver. Newer options like linopy (benchmarks) and PyOptInterface (benchmarks) don’t have that issue.
² This gives us a Pareto front, which explores the trade-offs between different objectives.

📅 Reducing Overscheduling

Sun, 26 Nov 2023 00:00:00 +0000

At a Nextmv tech talk a couple weeks ago, I showed a least absolute deviations (LAD) regression model using OR-Tools. This isn’t new – I pulled the formulation from Rob Vanderbei’s “Local Warming” paper, and I’ve shown similar models at conference talks in the past using other modeling APIs and solvers.

There are a couple reasons I keep coming back to this problem. One is that it’s a great example of how to build a machine learning model using an optimization solver. Unless you have an optimization background, it’s probably not obvious you can do this. Building a regression or classification model with a solver directly is a great way to understand the model better. And you can customize it in interesting ways, like adding epsilon insensitivity.

Another is that least squares, while the most commonly used form of regression, has a fatal flaw: it isn’t robust to outliers in the input data. This is because least squares minimizes the sum of squared residuals, as shown in the formulation below. Here, $A$ is an $m \times n$ matrix of feature data, $b$ is a vector of observations to fit, and $x$ is a vector of coefficients the optimizer must find.

$$ \min f(x) = \Vert Ax-b \Vert^2 $$

Since the objective function minimizes squared residuals, outliers have a much bigger impact than other data. LAD regression solves this by summing the absolute values of the residuals instead of squaring them.

$$ \min f(x) = \Vert Ax-b \Vert_1 $$
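A quick way to see the difference is the simplest possible case, where $A$ is a single column of ones and $x$ is one number. There, the least squares fit is the mean of $b$ and the LAD fit is the median. The sketch below uses made-up observations with one outlier. The helper names are only for illustration.

```python
from statistics import mean, median


def sum_squared(x, b):
    """Least squares objective for a constant fit x."""
    return sum((x - b_i) ** 2 for b_i in b)


def sum_absolute(x, b):
    """LAD objective for a constant fit x."""
    return sum(abs(x - b_i) for b_i in b)


# Made-up observations; the last value is an outlier.
observations = [1.0, 2.0, 3.0, 4.0, 100.0]

ls_fit = mean(observations)     # minimizes sum_squared
lad_fit = median(observations)  # minimizes sum_absolute

print("least squares fit:", ls_fit)
print("LAD fit:", lad_fit)
```

The outlier drags the least squares fit far away from the bulk of the data, while the LAD fit stays in the middle of it.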

So why isn’t this used more? Simple – least squares has a convenient analytical solution, while LAD requires an algorithm to solve. For instance, you can formulate LAD regression as a linear program, but now you need a solver.

$$ \begin{align*} \min \quad & 1^\top z \\ \text{s.t.}\ \quad & z \ge Ax - b \\ & z \ge b - Ax \end{align*} $$

While I like using this example, it paints a rather negative picture of squaring. If it does funny things to solvers, is there any good reason to square? Thus I’ve been on the lookout for a practical example where squaring a variable or expression makes a model more useful.

Luckily for me, Erwin Kalvelagen recently posted about using optimization to schedule team meetings. This is an application where minimizing squared values of overbooking can be beneficial – it may be worse to be triple booked than double booked.

I won’t recreate the reasoning behind Erwin’s post here. You can read his blog for that. What we’ll do is look at both the formulations in his post, along with a couple extras using Julia for code, JuMP for modeling, SCIP for optimization, and Gadfly for visualization. All model code and data are linked in the resources section at the end.

Maximize attendance

To start off, I built a new data set, which you can find in the resources section. This differentiates team membership between two types of employees: individual contributors (starting with ic in the data), who attend meetings for 1 or 2 teams, and managers (prefixed with mgr), who attend meetings to coordinate across multiple teams. We schedule meetings for 10 teams (prefix t) into 3 time slots (s).

The first model in Erwin’s post maximizes attendance. This means it tries to schedule team members for as many unique time slots as possible. It doesn’t consider overbooking.

$$ \begin{align*} \max\quad & \sum_{i,s} y_{i,s} \\ \text{s.t.}\quad& \sum_{s} x_{t,s} = 1 &\quad\forall&\ t & \text{schedule each team meeting once}\\ & y_{i,s} \le \sum_{t} m_{i,t}\ x_{t,s} &\quad\forall&\ i,s & \text{individuals attend team meetings}\\ & x_{t,s} \in \{0,1\} &\quad\forall&\ t,s\\ & y_{i,s} \in \{0,1\} &\quad\forall&\ i,s \end{align*} $$

This yields the following team schedule, with red representing a scheduled team meeting.

If we look at the manager schedules, we’ll see that every manager is completely booked. This makes sense. That’s what managers do, right? Go to meetings?

Minimize overbooking

The model gets more interesting once we account for overbooking. Erwin’s post has a model that minimizes overbooking, where overbooking is the number of additional meetings in a time slot. If a team member is double booked, that’s 1 overbooking. If they are triple booked, that’s 2 overbookings.

Sum of overbooking

The second model in Erwin’s post minimizes the sum of all overbookings. He does this by adding a continuous c vector that only incurs value once a team member goes over a single meeting in a given time slot.

$$ \begin{align*} \min\quad & \sum_{i,s} c_{i,s} \\ \text{s.t.}\quad& \sum_{s} x_{t,s} = 1 &\quad\forall&\ t & \text{schedule each team meeting once}\\ & c_{i,s} \ge \sum_{t} m_{i,t}\ x_{t,s} - 1 &\quad\forall&\ i,s & \text{measure overbooking}\\ & x_{t,s} \in \{0,1\} &\quad\forall&\ t,s\\ & c_{i,s} \ge 0 &\quad\forall&\ i,s \end{align*} $$

Given our data this results in the following team schedule, which is probably not all that interesting. I’ll leave this visualization out from now on.

Where it gets interesting is plotting overbookings for the managers. Here we see that 3 manager time slots are triple booked (red), while 8 are double booked (gray).

Sum of squared overbooking

Let’s say it’s worse to triple book (or, gasp, quadruple book) than to double book. How can the model account for this? One answer, if you have a MIQP-enabled solver, is to simply square the c values.

$$ \begin{align*} \min\quad & \sum_{i,s} c_{i,s}^2 \\ \text{s.t.}\quad& \sum_{s} x_{t,s} = 1 &\quad\forall&\ t & \text{schedule each team meeting once}\\ & c_{i,s} \ge \sum_{t} m_{i,t}\ x_{t,s} - 1 &\quad\forall&\ i,s & \text{measure overbooking}\\ & x_{t,s} \in \{0,1\} &\quad\forall&\ t,s\\ & c_{i,s} \ge 0 &\quad\forall&\ i,s \end{align*} $$

This completely eliminates triple booking, as shown below. No manager is worse off than being double booked, which seems normal given my experiences.

The problem with this is that the solver now takes a lot longer. It’s not bad for the data in this example, but if you try it with something larger you’ll see what I mean. You can find the data generator code in the resources section.
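To see why squaring changes the outcome, it helps to measure overbooking for two candidate schedules directly. The sketch below uses a hypothetical manager on four teams and two time slots. The data layout (a dict of team lists and a dict mapping teams to slots) is an assumption for this example, not the format of the linked data set.

```python
def overbooking(membership, schedule, slots):
    """c[person, slot] = meetings in that slot beyond the first one."""
    return {
        (person, s): max(0, sum(1 for t in teams if schedule[t] == s) - 1)
        for person, teams in membership.items()
        for s in slots
    }


def linear_penalty(c):
    """Objective of the sum-of-overbooking model."""
    return sum(c.values())


def squared_penalty(c):
    """Objective of the sum-of-squared-overbooking model."""
    return sum(value**2 for value in c.values())


# Hypothetical data: one manager who attends four team meetings.
membership = {"mgr1": ["t1", "t2", "t3", "t4"]}
slots = [1, 2]

lopsided = {"t1": 1, "t2": 1, "t3": 1, "t4": 2}  # triple booked in slot 1
balanced = {"t1": 1, "t2": 1, "t3": 2, "t4": 2}  # double booked twice

for name, schedule in [("lopsided", lopsided), ("balanced", balanced)]:
    c = overbooking(membership, schedule, slots)
    print(name, linear_penalty(c), squared_penalty(c))
```

The linear objective cannot tell these two schedules apart, while the squared objective prefers the balanced one.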

Constrained bottleneck

So how can we do something similar without the computational cost? One option is to continue using MILP formulations, but in the context of hierarchical optimization. This means splitting the model into two. First, we try to minimize the maximum overbookings for any team member (the bottleneck, if you will). This involves adding a variable $b$ representing that maximum.

$$ b = \max\Bigl\{\sum_{t} m_{i,t}\ x_{t,s} - 1 : i \in I, s \in S \Bigr\} $$

Now we can simply minimize $b$ using a MILP instead of a MIQP.

$$ \begin{align*} \min\quad & b \\ \text{s.t.}\quad& \sum_{s} x_{t,s} = 1 &\quad\forall&\ t & \text{schedule each team meeting once}\\ & b \ge \sum_{t} m_{i,t}\ x_{t,s} - 1 &\quad\forall&\ i,s & \text{maximum overbooking}\\ & x_{t,s} \in \{0,1\} &\quad\forall&\ t,s \end{align*} $$

Once we solve the first model, we get the minimal value of $b$, which we call $b^*$. We then solve the second model, the original sum-of-overbooking formulation, with $b^*$ as an upper bound on each overbooking $c_{i,s}$.

As we see below, this model also eliminates triple bookings, and it’s quite a bit faster to solve than the MIQP.
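The two-stage idea can be sketched by enumerating every schedule for a tiny instance. This stands in for the two MILP solves and is only meant to show the logic. The membership data and function names are hypothetical.

```python
from itertools import product


def overbookings(membership, schedule, slots):
    """Overbooking per (person, slot): meetings in the slot beyond the first."""
    return [
        max(0, sum(1 for t in teams if schedule[t] == s) - 1)
        for teams in membership.values()
        for s in slots
    ]


def solve_bottleneck(membership, teams, slots):
    """Two-stage solve by enumerating every schedule (tiny instances only)."""
    schedules = [
        dict(zip(teams, choice)) for choice in product(slots, repeat=len(teams))
    ]

    # Stage 1: minimize the worst overbooking of any person in any slot.
    b_star = min(max(overbookings(membership, s, slots)) for s in schedules)

    # Stage 2: minimize total overbooking, capping each one at b*.
    capped = [
        s for s in schedules if max(overbookings(membership, s, slots)) <= b_star
    ]
    best = min(capped, key=lambda s: sum(overbookings(membership, s, slots)))
    return b_star, best


# Hypothetical data: one manager on four teams, two individual contributors.
membership = {
    "mgr1": ["t1", "t2", "t3", "t4"],
    "ic1": ["t1"],
    "ic2": ["t2", "t3"],
}
teams = ["t1", "t2", "t3", "t4"]
slots = [1, 2]

b_star, schedule = solve_bottleneck(membership, teams, slots)
print(b_star, schedule)
```

The first stage only fixes the worst case. The second stage is what separates schedules that share the same bottleneck value.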

Resources

main.go generates input data
membership.csv contains input data
maximize-attendance.jl MILP model
minimize-overbooking.jl MILP model
minimize-overbooking-squared.jl MIQP model
minimize-bottleneck.jl hierarchical MILP models

🖍 Visualizing Decision Diagrams

Wed, 13 Sep 2023 00:00:00 +0000

I attended DPSOLVE 2023 recently and found lots of good inspiration for the next version of Nextmv’s Decision Diagram (DD) solver, Hop. It’s a few years old now, and we learned a lot applying it in the field. Hop formed the basis for our first routing models. While those models moved to a different structure in our latest routing code, the first version broke ground combining DDs with Adaptive Large Neighborhood Search (ALNS), and its use continues to grow organically.

A feature I’d love for Hop is the ability to visualize DDs and monitor the search. That could work interactively, like Gecode’s GIST, or passively during the search process. This requires automatic generation of images representing potentially large diagrams. So I spent a few hours looking at graph rendering options for DDs.

Manual rendering

We’ll start with examples of visualizations built by hand. These form a good standard for how we want DDs to look if we automate rendering. We’ll start with some examples from academic literature, look at some we’ve used in Nextmv presentations, and show an interesting example that embeds in Hugo, the popular static site generator I use for this blog.

All the literature on using Decision Diagrams (DD) for optimization that I’m aware of depicts DDs as top-down, layered, directed graphs (digraphs). Some of the diagrams we come across appear to be coded and rendered, while some are fussily created by hand with a diagramming tool.

Academia

I believe most of the examples we find in academic literature are coded by hand and rendered using the LaTeX TikZ package. Below is one of the first diagrams that newcomers to DDs encounter. It’s from Decision Diagrams for Optimization by Bergman et al, 2016.

It doesn’t matter here what model this represents. It’s a Binary Decision Diagram (BDD), which means that each variable can be $0$ or $1$. The BDD on the left is exact, while the BDD on the right is a relaxed version of the same.

There’s quite a bit going on, so it’s worth an explanation. Let’s look at the “exact” BDD on the left first.

Horizontal layers group arcs with a binary variable (e.g. $x_1$, $x_2$).
Arcs assign either the value $0$ or $1$ to their layer’s variable. Dotted lines assign $0$ while solid lines assign $1$.
Arc labels specify their costs. The BDD searches for a longest (or shortest) path from the root node $r$ to the terminal node $t$.

The “relaxed” BDD on the right overapproximates both the objective value and the set of feasible solutions of the exact BDD on the left.

The diagram is limited to a fixed width (2, in this case) at each layer.
To achieve this, the DD merges exact nodes together.
Thus, on the left of the relaxed BDD, there is a single node in which $x_2$ can be $0$ or $1$.
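The merging step can be sketched in a few lines. This is one possible merge rule, not necessarily the one behind the figure. It assumes a maximization problem, node states that are sets where taking a union is a valid relaxation, and a policy of merging the lowest-valued nodes. The `merge_layer` name and the sample layer are made up.

```python
def merge_layer(nodes, max_width):
    """Limit a layer to max_width nodes by merging the lowest-valued ones.

    nodes is a list of (state, value) pairs, where state is a frozenset.
    """
    if len(nodes) <= max_width:
        return list(nodes)

    ranked = sorted(nodes, key=lambda node: node[1], reverse=True)
    keep, rest = ranked[: max_width - 1], ranked[max_width - 1 :]

    # Assumed merge rule: union the states and keep the best value, so the
    # merged node overapproximates every node it replaces.
    merged_state = frozenset().union(*(state for state, _ in rest))
    merged_value = max(value for _, value in rest)
    return keep + [(merged_state, merged_value)]


layer = [
    (frozenset({2, 3}), 5),
    (frozenset({3, 4}), 3),
    (frozenset({4, 5}), 1),
]
print(merge_layer(layer, max_width=2))
```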

Here’s another example of an exact BDD from the same book.

In this diagram, each node has a state. For example, the state of $r$ is $\{1,2,3,4,5\}$. If we start at the root node $r$ and assign $x_1 = 0$, we end up at node $u_1$ with state $\{2,3,4,5\}$.

Most other academic literature about DDs uses images similar to these.

Nextmv

We’ve rendered a number of DDs over the years at Nextmv. Most of these images demonstrate a concept instead of a particular model. We usually create them by hand in a diagramming tool like Whimsical, Lucidchart, or Excalidraw. I built the diagrams below by hand in Whimsical. I think the result is nice, if time consuming and fussy.

This is a representation of an exact DD. It doesn’t indicate whether this is a BDD or a Multivalued Decision Diagram (MDD). It doesn’t have any labels or variable names. It just shows what a DD search might look like in the abstract.

The restricted DD below is more involved. In addition to horizontal layers, it divides nodes into explored and deferred groups. Most of the examples I’ve seen mix different types of nodes, like exact and relaxed. I really like differentiating node types like this.

In this representation, deferred nodes are in Hop’s queue for later exploration. Thus they don’t connect to any child nodes yet. This is the kind of thing I’d like to generate with real diagrams during search so I can examine the state of the solver.

My favorite of my DD renderings so far is the next one. This shows a single-vehicle pickup-and-delivery problem. The arc labels are stops (e.g. 🐶, 🐱). The path the 🚗 follows to the terminal node is the route. The gray boxes group together nodes to merge based on state, which removes isomorphisms from the diagram.

We also have some images, like those in our post on expanders, that we coded by hand. As you can see, coding these by hand gets tedious.

GoAT

TikZ is a program that renders manually coded graphics, while Whimsical is a WYSIWYG diagram editor. I like the Whimsical images a lot better – they feel cleaner and easier to understand.

Hugo supports GoAT diagrams by default, so I tried that out too. Here is an arbitrary MDD with two layers. The $[[1,2],4]$ node is a relaxed node; it doesn’t really matter here what the label means.

I like the way GoAT renders this diagram. It’s very readable. Unfortunately, it isn’t easy to automate. Creating a GoAT diagram is like using ASCII as a WYSIWYG diagramming tool, as you can see from the code for that image.

                                .-.
                   .-----------+ o +-----------.
                  |             '+'             |
                  |              |              |
                  v              v              v
                 .-.        .---------.        .-.
        x1      | 0 |      | [[1,2],4] |      | 3 |
                 '-'        '----+----'        '+'
                                 |              |
                    .------------+              |
                   |             |              |
                   v             v              v
                 .--.          .--.           .---.
        x2      | 10 |        | 20 |         | 100 |
                 '-+'          '-+'           '-+-'
                   |             |              |
                   |             v              |
                   |            .-.             |
                    '--------->| * |<----------'
                                '-'

Automated rendering

Now we’ll look at a couple options for automatically generating visualizations of DDs. These convert descriptions of graphs into images.

Graphviz

Graphviz is the tried and true graph visualizer. It’s used in the Go pprof library for examining CPU and memory profiles, and lots of other places.

Graphviz accepts a language called DOT. It uses different layout engines to convert input into a visual representation. The user doesn’t have control over node position. That’s the job of the layout engine.

Here’s the same MDD as written in DOT. The start -> end lines specify arcs in the digraph. The subgraphs organize nodes into layers. We add a dotted border around each layer and a label to say which variable it assigns. There isn’t any way of vertically centering and horizontally aligning the layer labels, so I thought it made more sense this way.

digraph G {
    s1 [label = 0]
    s2 [label = "[[1,2],4]"]
    s3 [label = 3]
    s4 [label = 10]
    s5 [label = 20]
    s6 [label = 100]

    r -> s1 [label = 2]
    r -> s2 [label = 4]
    r -> s3 [label = 1]
    s2 -> s4 [label = 10]
    s2 -> s5 [label = 4]
    s3 -> s6 [label = 2]

    subgraph cluster_0 {
        label = "x1"
        labeljust = "l"
        style = "dotted"
        s1
        s2
        s3
    }

    subgraph cluster_1 {
        label = "x2"
        labeljust = "l"
        style = "dotted"
        s4
        s5
        s6
    }

    s4 -> t
    s5 -> t
    s6 -> t
}

The result is comprehensible if not very attractive. With some fiddling, it’s possible to improve things like the spacing around arc labels. I couldn’t figure out how to align the layer labels and boxes. It doesn’t seem possible to move the relaxed nodes into their own column either, but that limitation isn’t unique to Graphviz.
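Since the goal is automation, DOT like the listing above would normally come from a program instead of being typed. Here is a minimal sketch that builds it from plain Python lists. The `to_dot` function and its input layout are assumptions for this example and have nothing to do with Hop’s data structures. Labels containing double quotes would need escaping, which this sketch skips.

```python
def to_dot(layers, arcs):
    """Render a layered diagram as DOT text.

    layers: list of (variable, [(node_id, label), ...]) pairs.
    arcs: list of (source, target, label) triples; label may be None.
    """
    lines = ["digraph G {"]
    for index, (variable, nodes) in enumerate(layers):
        lines.append(f"    subgraph cluster_{index} {{")
        lines.append(f'        label = "{variable}"')
        lines.append('        labeljust = "l"')
        lines.append('        style = "dotted"')
        for node_id, label in nodes:
            lines.append(f'        {node_id} [label = "{label}"]')
        lines.append("    }")
    for source, target, label in arcs:
        attrs = "" if label is None else f' [label = "{label}"]'
        lines.append(f"    {source} -> {target}{attrs}")
    lines.append("}")
    return "\n".join(lines)


layers = [
    ("x1", [("s1", "0"), ("s2", "[[1,2],4]"), ("s3", "3")]),
    ("x2", [("s4", "10"), ("s5", "20"), ("s6", "100")]),
]
arcs = [
    ("r", "s1", 2), ("r", "s2", 4), ("r", "s3", 1),
    ("s2", "s4", 10), ("s2", "s5", 4), ("s3", "s6", 2),
    ("s4", "t", None), ("s5", "t", None), ("s6", "t", None),
]
print(to_dot(layers, arcs))
```

The output can be piped to the Graphviz `dot` command to produce an image.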

Mermaid

Mermaid is a JavaScript library for diagramming and charting. One can use it on the web or, presumably, embed it in an application.

Mermaid is similar to Graphviz in many ways, but it supports more diagram types. The input for that MDD in Mermaid is a bit simpler. Labels go inside arcs (e.g. -- 2 -->), and there are more sensible rendering defaults.

graph TD
    start((( )))
    stop((( )))
    A(0)
    B("[[1,2],4]")
    C(3)
    D(10)
    E(20)
    F(100)

    start -- 2 --> A
    start -- 4 --> B
    start -- 1 --> C
    B -- 10 --> D
    B -- 4 --> E
    C -- 2 --> F
    D --> stop
    E --> stop
    F --> stop

    subgraph "x1 "
        A; B; C
    end
    subgraph "x2"
        D; E; F
    end

The result has a lot of the same limitations as the Graphviz version, but it looks more like the GoAT version. The biggest problem, as we see below, is that it’s not possible to left-align the layer labels. They can be obscured by arcs.

graph TD
    start((( )))
    stop((( )))
    A(0)
    B("[[1,2],4]")
    C(3)
    D(10)
    E(20)
    F(100)

    start -- 2 --> A
    start -- 4 --> B
    start -- 1 --> C
    B -- 10 --> D
    B -- 4 --> E
    C -- 2 --> F
    D --> stop
    E --> stop
    F --> stop

    subgraph "x1 "
        A; B; C
    end
    subgraph "x2"
        D; E; F
    end

This got me thinking that there isn’t a strong reason DDs have to progress downward layer by layer. They could just as easily go from left to right. If we change the opening line from graph TD to graph LR, then we get the following image.

graph LR
    start((( )))
    stop((( )))
    A(0)
    B("[[1,2],4]")
    C(3)
    D(10)
    E(20)
    F(100)

    start -- 2 --> A
    start -- 4 --> B
    start -- 1 --> C
    B -- 10 --> D
    B -- 4 --> E
    C -- 2 --> F
    D --> stop
    E --> stop
    F --> stop

    subgraph "x1 "
        A; B; C
    end
    subgraph "x2"
        D; E; F
    end

I think that’s pretty nice for a generated image.
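Generating the Mermaid text from code makes the layout direction a one-argument change. The sketch below is a hypothetical helper, not part of Mermaid or Hop. It quotes every node label, which Mermaid accepts, and draws the root and terminal nodes with the same `((( )))` shape used above.

```python
def to_mermaid(nodes, arcs, layers, terminals=(), direction="TD"):
    """Render a layered diagram as Mermaid flowchart text.

    nodes: dict of node id to label.
    arcs: list of (source, target, label) triples; label may be None.
    layers: dict of layer name to list of node ids.
    terminals: node ids drawn as unlabeled circles.
    """
    lines = [f"graph {direction}"]
    for node_id in terminals:
        lines.append(f"    {node_id}((( )))")
    for node_id, label in nodes.items():
        lines.append(f'    {node_id}("{label}")')
    for source, target, label in arcs:
        link = "-->" if label is None else f"-- {label} -->"
        lines.append(f"    {source} {link} {target}")
    for variable, members in layers.items():
        lines.append(f'    subgraph "{variable}"')
        lines.append("        " + "; ".join(members))
        lines.append("    end")
    return "\n".join(lines)


nodes = {"A": "0", "B": "[[1,2],4]", "C": "3", "D": "10", "E": "20", "F": "100"}
arcs = [
    ("start", "A", 2), ("start", "B", 4), ("start", "C", 1),
    ("B", "D", 10), ("B", "E", 4), ("C", "F", 2),
    ("D", "stop", None), ("E", "stop", None), ("F", "stop", None),
]
layers = {"x1": ["A", "B", "C"], "x2": ["D", "E", "F"]}

print(to_mermaid(nodes, arcs, layers, terminals=["start", "stop"], direction="LR"))
```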

👾 Detecting Polygon Intersections

Sun, 27 Sep 2015 00:00:00 +0000

Note: This post has been updated to work with HiGHS.

A fun geometry problem to think about is: given two polygons, do they intersect? That is, do they touch on the border or overlap? Does one reside entirely within the other? While this question has obvious applications in computer graphics (see: arcade games of the 1980s), it’s also important in areas such as cutting and packing problems.

There are a number of ways to answer this. In computer graphics, the problem is often approached using a clipping algorithm. This post examines a couple of simpler techniques using linear inequalities and properties of convexity. To simplify the presentation, we assume we’re only interested in convex polygons in two dimensions. We also assume that rotation is not an issue. That is, if one of the polygons is rotated, we can simply re-test to see if they overlap.

Problem

Let’s say we have two objects: a right triangle and a square. We can place them anywhere inside a larger rectangle. The triangle has vertices:

$$\{\left(x_t, y_t\right), \left(x_t, y_t + a\right), \left(x_t + a, y_t\right)\}$$

The square has vertices:

$$\{\left(x_s, y_s\right), \left(x_s, y_s + a\right), \left(x_s + a, y_s + a\right), \left(x_s + a, y_s\right)\}$$

We will be given $\left(x_t, y_t\right)$, $\left(x_s, y_s\right)$, and $a$, but we do not know them a priori. We would like to know, for any set of values these can take, whether or not the triangle and square they define intersect.

$\left(x_t, y_t\right)$ and $\left(x_s, y_s\right)$ are the offsets of the triangle and square with respect to the bottom left corner of the rectangle. If they are far enough apart in any direction, the two objects do not intersect. The figure below shows such a case, with small gray circles representing $\left(x_t, y_t\right)$ and $\left(x_s, y_s\right)$.

However, if they are too close in some manner, the objects will either touch or overlap, as shown below.

The two polygons can intersect in a few different ways. They may touch on their borders, in which case they will share a single point or line segment. They may overlap such that their intersecting region has nonzero relative interior but each polygon contains points outside the other. Or one of them might live entirely within the other, so that the former is a subset of the latter. Our goal is to determine if any of these cases are true given any $\left(x_t, y_t\right)$, $\left(x_s, y_s\right)$, and $a$.

Method 1. Define the intersecting polygon with linear inequalities

The first method we use to detect intersection is based on the fact that our polygons themselves are the intersections of finite numbers of linear inequalities. Instead of defining them based on their vertices, we can equivalently represent them as the set of $\left(x, y\right)$ that satisfy a known inequality for each edge.

Let $S_t$ be the set of points in our triangle. It can be defined as follows. $x$ must be greater than or equal to $x_t$. $y$ must be greater than or equal to $y_t$. And the point $\left(x, y\right)$ must lie on or below the triangle’s hypotenuse, which means $x + y \le x_t + y_t + a$. There are three sides on the triangle, so we have three inequalities.

$$ \begin{array}{rcl} S_t = \{\,\left(x, y\right) & | & x \ge x_t,\\ & & y \ge y_t,\\ & & x + y \le x_t + y_t + a \,\} \end{array} $$

Similarly, let $S_s$ be the set of points in our square. This set is defined using four inequalities, which are shown in a slightly compacted form.

$$ \begin{array}{rcl} S_s = \{\,\left(x, y\right) & | & x_s \le x \le x_s + a,\\ & & y_s \le y \le y_s + a \,\} \end{array} $$

Finally, let $S_i = S_t \cap S_s$ be the set of points that satisfy all seven inequalities.

$$ \begin{array}{rcl} S_i = \{\,\left(x, y\right) & | & x \ge x_t,\\ & & y \ge y_t,\\ & & x + y \le x_t + y_t + a,\\ & & x_s \le x \le x_s + a,\\ & & y_s \le y \le y_s + a \,\} \end{array} $$

If $S_i \ne \emptyset$, then there must exist some point that satisfies the inequalities of both the triangle and the square. This point resides in both of them, therefore they intersect. If $S_i = \emptyset$, then there is no such point and they do not intersect.
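For this particular triangle and square, the seven inequalities are simple enough to check by hand. The lower bounds on $x$ and $y$ combine into $\max(x_t, x_s)$ and $\max(y_t, y_s)$, and that corner minimizes $x + y$ over the box. So $S_i$ is nonempty exactly when that corner respects the square’s upper bounds and the hypotenuse. The sketch below encodes that reasoning. It assumes $a \ge 0$, it is specific to these two shapes, and the `intersects` name is made up.

```python
def intersects(xy_t, xy_s, a):
    """Closed-form emptiness check for S_i with this triangle and square."""
    x_t, y_t = xy_t
    x_s, y_s = xy_s

    # Tightest lower bounds on x and y from both shapes.
    lo_x = max(x_t, x_s)
    lo_y = max(y_t, y_s)

    return (
        lo_x <= x_s + a                      # square's right edge
        and lo_y <= y_s + a                  # square's top edge
        and lo_x + lo_y <= x_t + y_t + a     # triangle's hypotenuse
    )


print(intersects((0, 0), (0, 0), 1))      # same offset
print(intersects((0, 0), (0.5, 0.5), 1))  # square corner on the hypotenuse
print(intersects((0, 0), (2, 2), 1))      # far apart
```

General convex polygons don’t reduce this neatly, which is where an LP solver earns its keep.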

Method 2. Use convex combinations of the polygon vertices

Both of our polygons are convex. That is, they contain every convex combination of their vertices. So every point in the triangle, regardless of where it is located, can be represented as a convex combination of $\{\left(x_t, y_t\right), \left(x_t + a, y_t\right), \left(x_t, y_t + a\right)\}$ with weights $\lambda_1, \lambda_2, \lambda_3 \ge 0$ and $\lambda_1 + \lambda_2 + \lambda_3 = 1$.

We can define the set $S_t$ equivalently using this concept.

$$ \begin{array}{rl} S_t = \{\, & \lambda_1 \left(\begin{array}{c} x_t \\ y_t \end{array}\right) + \lambda_2 \left(\begin{array}{c} x_t + a \\ y_t \end{array}\right) + \lambda_3 \left(\begin{array}{c} x_t \\ y_t + a \end{array}\right) \, | \\ & \lambda_1 + \lambda_2 + \lambda_3 = 1, \\ & \lambda_i \ge 0, \, i \in \{1, \ldots, 3\} \,\} \end{array} $$

Similarly, the square is defined as the set of convex combinations of its vertices.

$$ \begin{array}{rl} S_s = \{\, & \lambda_4 \left(\begin{array}{c} x_s \\ y_s \end{array}\right) + \lambda_5 \left(\begin{array}{c} x_s + a \\ y_s \end{array}\right) + \lambda_6 \left(\begin{array}{c} x_s \\ y_s + a \end{array}\right) + \lambda_7 \left(\begin{array}{c} x_s + a \\ y_s + a \end{array}\right) \, | \\ & \lambda_4 + \lambda_5 + \lambda_6 + \lambda_7 = 1, \\ & \lambda_i \ge 0, \, i \in \{4, \ldots, 7\} \,\} \end{array} $$

If there exists a point inside both the triangle and the square, then it must satisfy both convex combinations. Thus we can define our intersecting set $S_i$ as follows. (This is a little loose with the notation, but I think it makes the point a bit better.)

$$ \begin{array}{rl} S_i = \{\, & \\ & \lambda_1 \left(\begin{array}{c} x_t \\ y_t \end{array}\right) + \lambda_2 \left(\begin{array}{c} x_t + a \\ y_t \end{array}\right) + \lambda_3 \left(\begin{array}{c} x_t \\ y_t + a \end{array}\right) =\\ & \lambda_4 \left(\begin{array}{c} x_s \\ y_s \end{array}\right) + \lambda_5 \left(\begin{array}{c} x_s + a \\ y_s \end{array}\right) + \lambda_6 \left(\begin{array}{c} x_s \\ y_s + a \end{array}\right) + \lambda_7 \left(\begin{array}{c} x_s + a \\ y_s + a \end{array}\right),\\ & \lambda_1 + \lambda_2 + \lambda_3 = 1,\\ & \lambda_4 + \lambda_5 + \lambda_6 + \lambda_7 = 1,\\ & \lambda_i \ge 0, \, i \in \{1, \ldots, 7\}\\ \,\} & \end{array} $$

Just as before, if $S_i \ne \emptyset$, our polygons intersect.
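For the triangle, the weights can be recovered in closed form, which also shows how the two methods line up. Expanding the convex combination gives $x = x_t + \lambda_2 a$ and $y = y_t + \lambda_3 a$, so $\lambda_2 \ge 0$ is $x \ge x_t$, $\lambda_3 \ge 0$ is $y \ge y_t$, and $\lambda_1 \ge 0$ is $x + y \le x_t + y_t + a$. The sketch below assumes $a > 0$. The function names are made up for this example.

```python
def triangle_weights(point, xy_t, a):
    """Weights (lambda_1, lambda_2, lambda_3) that reproduce point.

    Vertices are ordered (x_t, y_t), (x_t + a, y_t), (x_t, y_t + a).
    """
    x, y = point
    x_t, y_t = xy_t
    lambda_2 = (x - x_t) / a
    lambda_3 = (y - y_t) / a
    lambda_1 = 1 - lambda_2 - lambda_3
    return lambda_1, lambda_2, lambda_3


def in_triangle(point, xy_t, a):
    """A point is in the triangle exactly when all weights are nonnegative."""
    return all(weight >= 0 for weight in triangle_weights(point, xy_t, a))


print(triangle_weights((0.25, 0.25), (0, 0), 1))
print(in_triangle((0.25, 0.25), (0, 0), 1))
print(in_triangle((0.75, 0.75), (0, 0), 1))
```

The square has four vertices in two dimensions, so its weights are not unique. That is one reason the general test is posed as a feasibility problem.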

Code

Both models are pretty easy to implement using an LP solver. But they look very different. That’s because in the first method we’re thinking about the problem in terms of inequalities and in the second we’re thinking about it in terms of vertices. The code below generates a thousand random instances of the problem and tests that each method produces the same result.

import highspy
import random


def method1(xy_t, xy_s, a):
    x_t, y_t = xy_t
    x_s, y_s = xy_s

    h = highspy.Highs()
    h.silent()

    x = h.addVariable()
    y = h.addVariable()
    h.addConstrs(
        x_t <= x <= x_t + a,
        x_s <= x <= x_s + a,
        y_t <= y <= y_t + a,
        y_s <= y <= y_s + a,
        x + y <= x_t + y_t + a,
    )

    return h


def method2(xy_t, xy_s, a):
    x_t, y_t = xy_t
    x_s, y_s = xy_s

    h = highspy.Highs()
    h.silent()

    lm = [h.addVariable(lb=0, ub=1) for _ in range(7)]

    conv_xt = lm[0] * x_t + lm[1] * (x_t + a) + lm[2] * x_t
    conv_xs = lm[3] * x_s + lm[4] * (x_s + a) + lm[5] * x_s + lm[6] * (x_s + a)

    conv_yt = lm[0] * y_t + lm[1] * y_t + lm[2] * (y_t + a)
    conv_ys = lm[3] * y_s + lm[4] * y_s + lm[5] * (y_s + a) + lm[6] * (y_s + a)

    h.addConstrs(
        conv_xt == conv_xs,
        conv_yt == conv_ys,
        sum(lm[:3]) == 1,
        sum(lm[3:]) == 1,
    )

    return h


if __name__ == "__main__":
    problems1 = []
    problems2 = []

    for _ in range(1000):
        a = random.random() * 2.5 + 1
        x_t = random.random() * 10
        y_t = random.random() * 10
        x_s = random.random() * 10
        y_s = random.random() * 10

        problems1.append(method1([x_t, y_t], [x_s, y_s], a))
        problems2.append(method2([x_t, y_t], [x_s, y_s], a))

    overlap1 = []
    for h in problems1:
        h.solve()
        overlap1.append(h.getModelStatus())

    overlap2 = []
    for h in problems2:
        h.solve()
        overlap2.append(h.getModelStatus())

    assert overlap1 == overlap2

These aren’t necessarily the best ways to solve this particular problem, but they are quick and flexible. And they leverage existing solver technology. One downside is that they aren’t easy to adapt to certain decision making contexts. That is, we can use them to determine whether objects overlap, but not to force objects not to overlap. In the next post, we’ll go over another tool from computational geometry that allows us to embed decisions about the relative locations of objects in our models.

Exercises

We assumed convex polygons in this presentation. How might one extend the model to work on non-convex polygons? What problems does this introduce?
The two methods shown above are equivalent. How can this be proven?
This post only answers the question of whether two convex polygons intersect. Devise models for determining if they only touch, or if one is a subset of the other.

😁 Are We Getting Happier?

Fri, 18 Jul 2014 00:00:00 +0000

Note: This post was originally written using Julia v0.2, GLPK, and Hedonometer data through 2014. It has been updated to use Julia v1.11, HiGHS, and data through May 26, 2025.

Hedonometer popped onto my radar a couple weeks ago. It’s a nifty project, attempting to convert samples of words found in the Twitter Gardenhose feed into a time series of happiness.

While I’m not a computational social scientist, I must say the data does have a nice intuitive quality to it. There are obvious trends in happiness associated with major holidays, days of the week, and seasons. It seems like the sort of data that could be decomposed into trends based on those various components. The Hedonometer group has, of course, done extensive analyses of their own data which you can find on their papers page.

This post examines another approach. It follows the structure of Robert Vanderbei’s excellent “Local Warming” project to separate out the Hedonometer averages into daily, seasonal, solar, and day-of-the-week trends. We’ll be using Julia with JuMP and HiGHS for linear optimization, Gadfly for graphing, and a few other libraries. If you haven’t installed Julia, first do that. Missing packages should be installed when you import them.

Data

Hedonometer provides an API which we can use to pull daily happiness data in JSON format. We can request specific date ranges, or leave the dates off to retrieve the full data set.

We use the HTTP, JSON3, and DataFrames packages to read the Hedonometer data into a data frame. Calls to parse convert strings to date and float types. Finally, we sort the data frame in place by ascending date.

using DataFrames, Dates, HTTP, JSON3

url = "https://hedonometer.org/api/v1/happiness/?format=json&timeseries__title=en_all"
response = HTTP.get(url)
df = DataFrame(JSON3.read(response.body)[:objects])

df.date = parse.(Date, df.date)
df.happiness = parse.(Float64, df.happiness)
sort!(df, :date)

5367×4 DataFrame
  Row │ date        frequency  happiness  timeseries            
      │ Date        Int64      Float64    String                
──────┼─────────────────────────────────────────────────────────
    1 │ 2008-09-09    2009276      6.042  /api/v1/timeseries/3/
    2 │ 2008-09-10    5263723      6.028  /api/v1/timeseries/3/
    3 │ 2008-09-11    5298101      6.02   /api/v1/timeseries/3/
    4 │ 2008-09-12    5351503      6.028  /api/v1/timeseries/3/
    5 │ 2008-09-13    5153710      6.035  /api/v1/timeseries/3/
    6 │ 2008-09-14    5170835      6.04   /api/v1/timeseries/3/
    7 │ 2008-09-15    5553350      6.004  /api/v1/timeseries/3/
    8 │ 2008-09-16    5421531      6.011  /api/v1/timeseries/3/
    9 │ 2008-09-17    5380008      6.02   /api/v1/timeseries/3/
   10 │ 2008-09-18    5591645      6.034  /api/v1/timeseries/3/
   11 │ 2008-09-19    5695345      6.063  /api/v1/timeseries/3/
   12 │ 2008-09-20    5291298      6.081  /api/v1/timeseries/3/
   13 │ 2008-09-21    5363113      6.066  /api/v1/timeseries/3/
  ⋮   │     ⋮           ⋮          ⋮                ⋮
 5356 │ 2023-05-15  170487394      6.026  /api/v1/timeseries/3/
 5357 │ 2023-05-16  174192397      6.021  /api/v1/timeseries/3/
 5358 │ 2023-05-17  186034773      6.016  /api/v1/timeseries/3/
 5359 │ 2023-05-18  189092448      6.03   /api/v1/timeseries/3/
 5360 │ 2023-05-19  179957496      6.026  /api/v1/timeseries/3/
 5361 │ 2023-05-20  167540306      6.044  /api/v1/timeseries/3/
 5362 │ 2023-05-21  167091303      6.031  /api/v1/timeseries/3/
 5363 │ 2023-05-22  171660415      6.03   /api/v1/timeseries/3/
 5364 │ 2023-05-23  166443756      6.033  /api/v1/timeseries/3/
 5365 │ 2023-05-24  183687637      6.025  /api/v1/timeseries/3/
 5366 │ 2023-05-25  170265817      6.014  /api/v1/timeseries/3/
 5367 │ 2023-05-26  180664806      6.032  /api/v1/timeseries/3/
                                               5342 rows omitted

Note that the data does seem to be missing a few days. This means we need to compute day offsets in our model instead of using row indices.

last(df.date) - first(df.date)
nrow(df)

5372 days
5367
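To illustrate why offsets matter, here is a minimal Python sketch comparing row positions with day offsets when a date is missing. The three dates are made up for the example and are not taken from the Hedonometer data.

```python
from datetime import date

# Hypothetical observation dates with one day (2008-09-11) missing.
dates = [date(2008, 9, 9), date(2008, 9, 10), date(2008, 9, 12)]
start = min(dates)

# Day offsets measure elapsed calendar days, so gaps are preserved.
offsets = [(d - start).days for d in dates]

# Row indices silently close the gap left by the missing day.
row_indices = list(range(len(dates)))

# A gap-free series spanning the same dates would have span + 1 rows.
expected_rows = (max(dates) - start).days + 1
missing = expected_rows - len(dates)

print(offsets, row_indices, missing)
```

By the same arithmetic, a span of 5372 days would need 5373 rows without gaps, so the 5367 rows above imply 6 missing days.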

Now let’s take a look at happiness over time, as computed by Hedonometer.

using Gadfly

function plot_happiness(df::DataFrame)
    plot(
        df,
        x=:date,
        y=:happiness,
        color=[colorant"darkblue"],
        Guide.xlabel("Date"),
        Guide.ylabel("Happiness"),
        Coord.cartesian(
            xmin=minimum(df.date),
            xmax=maximum(df.date)
        ),
        Theme(
            key_position=:none,
            line_width=0.75px,
            background_color=colorant"white"
        ),
        Geom.line
    )
end

plot_happiness(df)

The data looks right, so we’re off to a good start. Now we have to think about what sort of components we believe are important factors to this index. We’ll start with the same ones as in the Vanderbei model:

A linear happiness trend describing how our overall happiness changes over time.
Seasonal trends accounting for mood changes with weather.
Solar cycle trends.

We’ll add to this weekly trends, as zooming into the data shows we tend to be happier on the weekends than on work days. In the next section we’ll build a model to separate out the effects of these trends on the Hedonometer index.

Model

Vanderbei’s model analyzes daily temperature data for a particular location using least absolute deviations (LAD). This is similar to the well-known least squares approach, but while the latter penalizes the model quadratically more for bigger errors, the former does not. In mathematical notation, the least squares model takes in a known $m \times n$ matrix $A$ and $m \times 1$ vector $y$ of observed data, then searches for a vector $x$ such that $Ax = \hat{y}$ and $\sum_i \left\lVert y_i - \hat{y}_i \right\rVert_2^2$ is minimized.

The LAD model is similar in that it takes in the same data, but instead of minimizing the sum of the squared $L^2$ norms, it minimizes the sum of the $L^1$ norms. Thus we penalize our model using simply the absolute values of its errors instead of their squares. This makes the LAD model more robust, that is, less sensitive to outliers in our input data.
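The simplest case makes the difference concrete. When the model is a single constant, least squares picks the mean of the observations and LAD picks the median. The following Python sketch uses made-up values with one outlier to show how differently the two react.

```python
from statistics import mean, median

def sum_abs(c, ys):
    """Sum of absolute errors when every observation is fitted by c."""
    return sum(abs(y - c) for y in ys)

def sum_sq(c, ys):
    """Sum of squared errors when every observation is fitted by c."""
    return sum((y - c) ** 2 for y in ys)

# Hypothetical observations; the last one is an outlier.
ys = [6.0, 6.1, 5.9, 6.0, 9.0]

l2_fit = mean(ys)    # minimizes the sum of squared errors
l1_fit = median(ys)  # minimizes the sum of absolute errors

print(l2_fit, l1_fit)
```

The outlier drags the least squares fit well away from the bulk of the data, while the LAD fit stays with the typical values.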

Using a robust model with this data set makes sense because it clearly contains a lot of outliers. While some of them, such as December 25th, may be recurrent, we’re going to ignore that detail for now. After all, not every day is Christmas.

We formulate our model below using JuMP with the HiGHS solver. The code works by defining a set of variables called coefficients that will converge to optimal values for $x$. For each observation we compute a row of the $A$ matrix that has the following components:

Linear daily trend ($a_1$ = day number in the data set)
Seasonal variation: $\cos(2 \pi a_1 / 365.25)$ and $\sin(2 \pi a_1 / 365.25)$
Solar cycle variation: $\cos(2 \pi a_1 / (10.66 \times 365.25))$ and $\sin(2 \pi a_1 / (10.66 \times 365.25))$
Weekly variation: $\cos(2 \pi a_1 / 7)$ and $\sin(2 \pi a_1 / 7)$
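As a rough illustration of these components, here is a Python sketch that builds one row for a given day number and evaluates a fitted model as a dot product. It adds a leading constant term for the intercept, which is where the eighth coefficient comes from. The function names and the coefficients in the usage line are hypothetical.

```python
import math

SEASON = 365.25          # days per year
SOLAR = 10.66 * 365.25   # days per solar cycle
WEEK = 7                 # days per week

def constants(day):
    """Row of the A matrix for a day offset: intercept, trend, then cos/sin pairs."""
    terms = [1.0, float(day)]
    for period in (SEASON, SOLAR, WEEK):
        angle = 2 * math.pi * day / period
        terms += [math.cos(angle), math.sin(angle)]
    return terms

def predict(coefficients, day):
    """Fitted value: dot product of the coefficients and the day's row."""
    return sum(c * a for c, a in zip(coefficients, constants(day)))

# Hypothetical coefficients: an intercept of 6 and no other effects.
print(predict([6.0, 0, 0, 0, 0, 0, 0, 0], 123))
```

Each cosine and sine pair shares a period, which lets the fit choose both the amplitude and the phase of that cycle.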

We then add a linear variable representing the residual, or error, of the fitted model for each observation. Constraints enforce that these variables always take the absolute values of those errors.
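Written out, one standard way to express this as a linear program is shown here. The notation is an assumption for illustration: $a_i$ is the row of constants for observation $i$, $y_i$ is the observed happiness, and each residual is split into two nonnegative parts $\alpha_i$ and $\beta_i$.

```latex
\begin{aligned}
\min_{x,\,\alpha,\,\beta} \quad & \sum_{i=1}^{m} (\alpha_i + \beta_i) \\
\text{s.t.} \quad & \alpha_i - \beta_i = a_i^\intercal x - y_i, & i = 1, \dots, m \\
& \alpha_i,\ \beta_i \ge 0, & i = 1, \dots, m
\end{aligned}
```

If both $\alpha_i$ and $\beta_i$ were positive, we could reduce each by the smaller of the two without violating the equality, which lowers the objective. So at an optimum at most one of them is nonzero and $\alpha_i + \beta_i = \lvert a_i^\intercal x - y_i \rvert$.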

Minimizing the sum of those residuals gives us a set of eight coefficients for the model. We return these and a function that predicts the happiness level for an offset from the first data record. (Note that the first record is from Tuesday, September 9, 2008.)

using HiGHS, JuMP, LinearAlgebra

function train(df::DataFrame)
    m = Model(HiGHS.Optimizer)

    # Define a linear variable for each of our regression coefficients.
    # Note that by default, JuMP variables are unrestricted in sign.
    @variable(m, x[1:8])

    # Residuals are the absolute values of the error comparing our
    # observed and fitted values.
    #
    # If alpha - beta = residual and alpha, beta >= 0, then we can min
    # alpha + beta to get the absolute value of the residual.
    @variable(m, alpha[1:nrow(df)] >= 0)
    @variable(m, beta[1:nrow(df)] >= 0)

    # This builds rows for determining fitted values. The first value is
    # 1 since it is multiplied by our trend line's offset. The other
    # values correspond to the trends described above. Sinusoidal elements
    # have two variables with sine and cosine terms.
    function constants(i)
        [
            1,                                # Offset
            i,                                # Daily trend
            cos(2pi * i / 365.25),            # Seasonal variation
            sin(2pi * i / 365.25),            #
            cos(2pi * i / (10.66 * 365.25)),  # Solar cycle variation
            sin(2pi * i / (10.66 * 365.25)),  #
            cos(2pi * i / 7),                 # Weekly variation
            sin(2pi * i / 7)                  #
        ]
    end

    # This builds a linear expression as the dot product of a row's
    # constants and the coefficient variables.
    expression(i) = dot(constants(i), x)

    start = minimum(df.date)
    for (i, row) in enumerate(eachrow(df))
        days = (row.date - start).value
        @constraint(m, alpha[i] - beta[i] == expression(days) - row.happiness)
    end

    # Minimize the total sum of these residuals.
    @objective(m, Min, sum(alpha + beta))
    optimize!(m)

    # Return the model coefficients and a function that predicts happiness
    # for a given day, by index from the start of the data set.
    coefficients = value.(x)

    # Callers can broadcast this over ranges or vectors with predict.(...).
    predict(i) = dot(constants(i), coefficients)
    return coefficients, predict
end

coefficients, predictor = train(df)
coefficients

This gives us the optimal value of $x$. The second value is the change in happiness per day. We can see from this that there does seem to be a small negative trend.

8-element Vector{Float64}:
  6.056241434337748
 -1.2891297798930273e-5
 -0.004956377505740697
  0.00933036370632761
 -0.014231170085464805
 -0.01043882249306958
 -0.01121031443373725
 -0.003886782963711294

We can call our predictor function to obtain the fitted happiness level for any day number starting from September 9, 2008.

predictor(1000)

6.0206559094198635

Similarly, we can compute a range of fitted happiness values.

predictor.(1000:1009)

10-element Vector{Float64}:
 6.0206559094198635
 6.0133111627737295
 6.01441070655176
 6.0230645068990105
 6.032696070016563
 6.0359946962536455
 6.030420605574473
 6.020117462633424
 6.012792131062921
 6.013911265631212

Bootstrapping

We now have a set of coefficients and a predictive model. That’s nice, but we’d like to have some sense of a reasonable range on our model’s coefficients. For instance, how certain are we that our daily trend is really even negative? To deal with these uncertainties, we use a method called bootstrapping.

Bootstrapping involves building fake observed data based on our fitted model and its associated errors. We then fit the model to our fake data to determine new coefficients. If we repeat this enough times, we may be able to generate decent confidence intervals around our model coefficients.

First step: compute the errors between the observed and fitted data. We’ll construct a new data frame that contains everything we need to construct fake data.

# Compute fitted data corresponding to our observations and their associated errors.
start = minimum(df.date)
fitted = DataFrame(
    date=df.date,
    happiness=predictor.(map(d -> d.value, df.date .- start))
)
fitted.error = fitted.happiness - df.happiness

4×7 DataFrame
 Row │ variable  mean        min         median      max         nmissing  eltype   
     │ Symbol    Union…      Any         Any         Any         Int64     DataType 
─────┼──────────────────────────────────────────────────────────────────────────────
   1 │ date                  2008-09-09  2016-01-19  2023-05-26         0  Date
   2 │ observed  6.01506     5.628       6.016       6.376              0  Float64
   3 │ fitted    6.01858     5.95996     6.0245      6.06771            0  Float64
   4 │ error     0.00351745  -0.327932   0.0         0.353297           0  Float64

Note that the median for our errors is exactly zero. This is a good sign.

Now we build a function that creates a fake input data set using the fitted values with randomly selected errors. That is, for each observation, we add a randomly selected error with replacement to its corresponding fitted value. Once we’ve done that for every observation, we have a complete fake data set.

function fake_data(fitted::DataFrame)
    indices = rand(1:nrow(fitted), nrow(fitted))
    DataFrame(
        date=fitted.date,
        happiness=fitted.happiness + fitted.error[indices]
    )
end
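The same resampling step can be sketched in Python with plain lists. The values are made up, and the seeded generator is only there to make the example repeatable.

```python
import random

def fake_data(fitted, errors, rng=random):
    """Add an error drawn with replacement to each fitted value."""
    sampled = rng.choices(errors, k=len(fitted))
    return [f + e for f, e in zip(fitted, sampled)]

# Hypothetical fitted values and their errors.
fitted = [6.02, 6.01, 6.03, 6.04]
errors = [0.01, -0.02, 0.0, 0.03]

fake = fake_data(fitted, errors, random.Random(0))
print(fake)
```

Because sampling is with replacement, some errors may be used several times and others not at all in any one fake data set.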

Let’s plot some fake data to see if it looks similar.

plot_happiness(fake_data(fitted))

Visually, the plot of an arbitrary fake data set looks a lot like our original data, but not exactly.

Now we generate 10 fake data sets and run them through our optimization function above. This gives us 10 sets of model coefficients, from which we compute 95% (roughly $2\sigma$) confidence intervals.

The following code took a few minutes on my machine. If you’re intent on running it yourself, you may want to get some coffee in the meantime.

using HypothesisTests

coefficient_data = [train(fake_data(fitted))[1] for _ in 1:10]
confidence_intervals = map(
    i -> confint(OneSampleTTest([c[i] for c in coefficient_data])),
    1:length(coefficients)
)
confidence_intervals

8-element Vector{Tuple{Float64, Float64}}:
 (6.055873073964057, 6.056558575993962)
 (-1.304766046882304e-5, -1.2820860816666347e-5)
 (-0.005375474015660127, -0.004919187929729383)
 (0.009063427594182353, 0.009549696262121937)
 (-0.01457854811479927, -0.014061680921618894)
 (-0.010625952275843275, -0.010121904880077722)
 (-0.011441598179220986, -0.010964483538449823)
 (-0.004290285203751687, -0.003833687701331523)
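The code above builds a t-test interval around the mean of the bootstrap coefficients. A different and commonly used alternative is the percentile interval, which reads the bounds directly off the sorted bootstrap values. This Python sketch is that alternative and not the method used in this post. The helper name and sample data are hypothetical.

```python
from statistics import quantiles

def percentile_interval_95(samples):
    """Return the 2.5th and 97.5th percentiles of the bootstrap samples."""
    cuts = quantiles(samples, n=40)  # 39 cut points, one every 2.5%
    return cuts[0], cuts[-1]

# Hypothetical bootstrap values for a single coefficient.
samples = [float(v) for v in range(1, 101)]
low, high = percentile_interval_95(samples)
print(low, high)
```

Note that this measures the spread of the bootstrap coefficients themselves and not the uncertainty in their mean, so the two kinds of interval are not interchangeable. It also needs far more than a handful of replicates to be meaningful.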

Results

From the above output we can see that we appear to be trending slightly less happy over time, with a daily trend of -0.00001293 in Hedonometer units and a 95% confidence interval on that trend from approximately -0.00001305 to -0.00001282. Bummer.

Now we take a quick look at our model output. First, we plot the fitted happiness values for the same time period as the observed data. We can see that this resembles the same general trend minus the outliers. The width of the curve is due to weekly variation.

plot_happiness(fitted)

Now we take a look at what a typical week looks like in terms of its effect on our happiness. Day offsets start at 0 on Tuesday, September 9, 2008, so Sunday falls on offset 5. The weekly terms are the seventh and eighth coefficients.

daily(i) = coefficients[7]*cos(2pi*i/7) + coefficients[8]*sin(2pi*i/7)
plot(
    x = ["Sun", "Mon", "Tues", "Wed", "Thurs", "Fri", "Sat"],
    y = map(daily, [5, 6, 0, 1, 2, 3, 4]),
    Guide.xlabel("Day of the Week"),
    Guide.ylabel("Happiness"),
    Geom.line,
    Geom.point
)
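Because the weekly sine and cosine terms repeat every seven days, all that matters is which offset from the first record lands on which weekday. A small Python sketch for deriving that mapping from the start date, not by hand (the helper name is hypothetical):

```python
from datetime import date, timedelta

WEEKDAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

def weekday_offsets(start):
    """Map each weekday name to the first day offset (0-6) on which it occurs."""
    return {
        WEEKDAYS[(start + timedelta(days=k)).weekday()]: k
        for k in range(7)
    }

# First date in the data set.
offsets = weekday_offsets(date(2008, 9, 9))
print(offsets)
```

The resulting offsets can be passed to the weekly component in Sunday-to-Saturday order.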

The resulting graph highlights what I think we all already know about the work week.

That’s it for this analysis. We’ve learned that, for the time being at least, we seem to be trending less happy. When we initially did this analysis, almost 11 years ago, the opposite was true. The fitted data shows pretty clearly when that trend took a stark turn down.

Exercises

The particularly ambitious reader may find the following exercises interesting.

The code that reruns the model using randomly constructed fake data is eligible for parallelization. Rewrite the list comprehension that calls train so it runs concurrently.
According to Google, the lunar cycle is approximately 29.53 days. Add parameters for this to the LAD model above. Does it make sense to include the lunar cycle in the model? In other words, are we lunatics?
Some of the happier days in the Hedonometer data, such as Christmas, are recurring, and therefore not really outliers. How might one go about accounting for the effects of those days?
Try the same analysis using a least-squares model. Which model is better for this data?

Resources

are-we-getting-happier.jl contains all the code in this post
hedonometer.json contains the Hedonometer data as of May 16, 2025

🗺️ Preprocessing for Routing Problems - Part 2

Fri, 27 Jun 2014 00:00:00 +0000

In the previous post, we considered preprocessing for the vehicle routing problem where the vehicles have different starting locations. Our goal was to create potentially overlapping regions for the entire US which we could later use for route construction. We defined these regions using all 5-digit zip codes in the continental US for which one of our regional headquarters is the closest, or one of $n$ closest, headquarters in terms of Euclidean distance. The resulting regions gave us some flexibility in terms of how much redundancy we allow in our coverage of the country.

This post refines those regions, replacing the Euclidean distance with a more realistic metric: estimated travel time. Doing this should give us a better sense of how much space a given driver can actually cover. It should also divide the country up more equitably among our drivers.

Our approach here will be similar to that of the last post, but instead of ranking our headquarter-zip pairs by Euclidean distance, we’ll rank them by estimated travel time. The catch is that, while the former is easy to compute using the SpatialTools library, we have to request the latter from a third party. In this post, we’ll use the MapQuest Route Matrix, since it provides estimates based on OpenStreetMap data to us for free, and doesn’t cap the number of requests we can make.

To do this we’re going to need a lot of point estimates for location-to-location travel times. In fact, building a full data set for replacing our Euclidean distance ranks would require $37,341 \times 25 = 933,525$ travel time estimates. That’s a bit prohibitive. The good news is we don’t need all the data points unless we generate 25 levels of redundancy. We can just request enough travel time estimates to make us reasonably certain that we’ve got all the necessary data. In the last post we generated regions for 1, 2, and 3 levels of redundancy, so here we’ll get travel times for the 10 closest headquarters to each zip code, and take the leap of faith that the closest 3 headquarters by travel time for each zip will be among the 10 closest by Euclidean distance.
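The size of that saving is quick to check. The sketch below is in Python and not R, purely as a scratch calculation, and the constant names are made up.

```python
ZIP_COUNT = 37_341   # zip code locations, per the product in the text
HQ_COUNT = 25        # headquarters
HQS_PER_ZIP = 10     # closest headquarters requested per zip code

full_matrix = ZIP_COUNT * HQ_COUNT   # every zip to every HQ
reduced = ZIP_COUNT * HQS_PER_ZIP    # every zip to its 10 closest HQs
savings = 1 - reduced / full_matrix

print(full_matrix, reduced, savings)
```

Requesting only the 10 closest headquarters cuts the number of estimates by 60%.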

Let’s assume that you have just executed the code from the last post and have its variables in your current scope.¹ First, we define some constants we’re going to need in order to make MapQuest requests.

# Define some constants for making requests to MapQuest and determining
# when to save and what to request.
library(RCurl)
library(rjson)
library(utils)

MAPQUEST_API_KEY <- 'YOUR KEY HERE'
MAPQUEST_API_URL <- 'http://www.mapquestapi.com/directions/v2/routematrix?key=%s&json=%s'
ZIPS_BETWEEN_SAVE <- 250
HQ_RANK_MIN <- 1  # Min/max distance ranks for time estimates
HQ_RANK_MAX <- 10

Now we create a data frame to hold our HQ-to-zip travel estimates. The rows correspond to zip codes and the columns correspond to our headquarter locations. We initialize the data frame to contain no estimates and write it to a CSV file. Since it will take on the order of days for us to fill this file in, we’re going to write it out and read it back in periodically. That way we can pick up where we left off by simply rerunning the code in case of an error or loss of network connectivity.

# Write out a blank file containing our time estimates.
TIME_CSV_PATH <- 'hqs_to_zips_time.csv'
if (!file.exists(TIME_CSV_PATH)) {
    # Clear out everything except row and column names.
    empty <- as.data.frame(matrix(nrow=nrow(zips_deduped), ncol=nrow(largest_cities)+1))
    names(empty) <- c('zip', largest_cities$name)
    empty$zip <- zips_deduped$zip

    # This represents our current state.
    write.csv(empty, TIME_CSV_PATH, row.names=F)
}

# Read in our current state in case we are starting over.
hqs_to_zips_time <- read.csv(TIME_CSV_PATH)
hqs_to_zips_time$zip <- sprintf('%05d', hqs_to_zips_time$zip)

# Sanity check: If we have any times = 0, set them to NA so that we re-request them.
hqs_to_zips_time[hqs_to_zips_time <= 0] <- NA
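The save-and-resume pattern is independent of R. As a minimal sketch under stated assumptions, here is the same idea in Python using only the standard library. The function names, the tiny table, and the temporary path are all hypothetical.

```python
import csv
import os
import tempfile

def save(path, table, hq_names):
    """Write the current state, leaving missing estimates blank."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["zip", *hq_names])
        for z, times in table.items():
            writer.writerow([z] + ["" if times[h] is None else times[h] for h in hq_names])

def load_or_create(path, zips, hq_names):
    """Load saved estimates, creating a blank file first if none exists."""
    if not os.path.exists(path):
        save(path, {z: {h: None for h in hq_names} for z in zips}, hq_names)
    table = {}
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            times = {}
            for h in hq_names:
                t = float(row[h]) if row[h] else None
                # Non-positive times count as missing so they get re-requested.
                times[h] = t if t is not None and t > 0 else None
            # Zip codes stay strings, so leading zeros survive the round trip.
            table[row["zip"]] = times
    return table

path = os.path.join(tempfile.mkdtemp(), "hqs_to_zips_time.csv")
hqs = ["Dallas TX", "Houston TX"]
table = load_or_create(path, ["02134", "75201"], hqs)
table["75201"]["Dallas TX"] = 600.0
table["02134"]["Houston TX"] = 0.0  # a bogus zero estimate
save(path, table, hqs)

resumed = load_or_create(path, ["02134", "75201"], hqs)
print(resumed)
```

Rerunning the script after an interruption picks up whatever was last saved, and only the blank entries still need to be requested.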

With that file created, we can start making requests to MapQuest’s Route Matrix. For each zip code, we are going to request travel times for its 10 closest HQs. We’ll save our time estimates data frame every 250 zip codes. Also, we’re going to randomize the order of the zip codes so we fill out our data set more evenly as we go. That way we can generate maps during the process or otherwise inspect the data as we go.

# Now we start requesting travel times from MapQuest.
requests_until_save <- ZIPS_BETWEEN_SAVE
col_count <- ncol(hqs_to_zips_time)

# Randomize the zip code order so we fill in the map uniformly as we get more data.
# This will enable us to check on our data over time and make sure it looks right.
for (zip_idx in sample(1:nrow(zips_deduped))) {
    z <- zips_deduped$zip[zip_idx]
    z_lat <- zips_deduped$latitude[zip_idx]
    z_lon <- zips_deduped$longitude[zip_idx]

    # Find PODs for this zip that are in the rank range.
    which_hqs <- which(
        hqs_to_zips_rank[,zip_idx] >= HQ_RANK_MIN &
        hqs_to_zips_rank[,zip_idx] <= HQ_RANK_MAX
    )

    # We're only interested in records that aren't filled in yet.
    na_hqs <- is.na(hqs_to_zips_time[zip_idx, which_hqs+1])
    if (!any(na_hqs)) {
        next
    }

    # Request this block of PODs and fill them all in.
    print(sprintf('requesting: zip=%s rank=[%d-%d]', z, HQ_RANK_MIN, HQ_RANK_MAX))

    # Construct a comma-delimited string of lat/lons containing the locations of our
    # HQs. We will use this for our MapQuest requests below: for each zip code, we
    # make one request for its travel time to every HQ in our range.
    hq_locations <- paste(
        sprintf("'%f,%f'", largest_cities$lat[which_hqs], largest_cities$long[which_hqs]),
        collapse = ', '
    )

    # TODO: make sure we are requesting from location 1 to 2:n only
    request_json <- URLencode(sprintf(
        "{allToAll: false, locations: ['%f,%f', %s]}",
        z_lat,
        z_lon,
        hq_locations
    ))
    url <- sprintf(MAPQUEST_API_URL, MAPQUEST_API_KEY, request_json)
    result <- fromJSON(getURL(url))

    # If we get back 0s, they should be NA. Otherwise they'll mess up our
    # rankings and region drawing later.
    result$time[result$time <= 0] <- NA

    hqs_to_zips_time[zip_idx, which_hqs+1] <- result$time[2:length(result$time)]

    # See if we should save our current state.
    requests_until_save <- requests_until_save - 1
    if (requests_until_save < 1) {
        print('saving current state')
        write.csv(hqs_to_zips_time, TIME_CSV_PATH, row.names=F)
        requests_until_save <- ZIPS_BETWEEN_SAVE
    }
}

# Final save once we are done.
write.csv(hqs_to_zips_time, TIME_CSV_PATH, row.names=F)
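For readers who want to see the request construction on its own, here is a hedged Python sketch that builds the same kind of URL without sending it. It assumes the service accepts strict JSON with quoted keys in the same way as the unquoted form used in the R code. The helper name and the coordinates are hypothetical.

```python
import json
from urllib.parse import quote, unquote

API_URL = "http://www.mapquestapi.com/directions/v2/routematrix?key=%s&json=%s"

def route_matrix_url(api_key, origin, destinations):
    """Build a route matrix URL with the origin as the first location."""
    locations = ["%f,%f" % origin] + ["%f,%f" % d for d in destinations]
    # Assumption: strict JSON is accepted like the unquoted form in the R code.
    payload = json.dumps({"allToAll": False, "locations": locations})
    return API_URL % (quote(api_key, safe=""), quote(payload, safe=""))

# Hypothetical zip code location followed by two HQ locations (lat, lon).
url = route_matrix_url("YOUR KEY HERE", (32.78, -96.80), [(29.76, -95.37), (41.84, -87.68)])
decoded = json.loads(unquote(url.split("json=")[1]))
print(decoded)
```

The R code relies on allToAll being false to mean times from the first location to each of the others. As its TODO notes, that behavior still needs to be checked against the service.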

Now we generate our ranks based on travel time instead of distance. We have to be a bit more careful this time, as we might have incomplete data. We don’t want pairs with travel time of NA showing up in the rankings.

# Rank HQs by their travel time to each unique zip code location.
hqs_to_zips_rank2 <- matrix(nrow=nrow(largest_cities), ncol=nrow(zips_deduped))
for (i in 1:nrow(zips_deduped)) {
    not_na <- !is.na(hqs_to_zips_time[i,2:ncol(hqs_to_zips_time)])
    hqs_to_zips_rank2[not_na,i] <-
        rank(hqs_to_zips_time[i,2:ncol(hqs_to_zips_time)][not_na], ties.method='first')
}
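The handling of missing values is the only new part of this ranking step. A small Python sketch of the same idea, with None standing in for NA (the function name and times are hypothetical):

```python
def rank_with_missing(times):
    """Rank known times from 1 (fastest); missing entries stay unranked."""
    known = [i for i, t in enumerate(times) if t is not None]
    # sorted() is stable, so equal times keep their original order,
    # mirroring ties.method='first' in R's rank().
    order = sorted(known, key=lambda i: times[i])
    ranks = [None] * len(times)
    for r, i in enumerate(order, start=1):
        ranks[i] = r
    return ranks

# Hypothetical travel times from one zip code to four HQs.
print(rank_with_missing([30.0, None, 10.0, 10.0]))
```

HQs without an estimate never receive a rank, so they cannot show up as one of the closest headquarters by accident.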

We build our map for the Dallas, TX headquarter the same way as before.

# Now we draw regions for which Dallas is one of the closest 3 HQs by time.
hq_idx <- which(largest_cities$name == 'Dallas TX')
redundancy_levels <- c(3, 2, 1)
fill_alpha <- c(0.15, 0.30, 0.45)

map('state')
for (i in 1:length(redundancy_levels)) {
    # Find every zip for which this HQ is within n in time rank.
    # Zips without a time estimate have NA ranks, so exclude them.
    within_n <- hqs_to_zips_rank2[hq_idx,] <= redundancy_levels[i]
    within_n[is.na(within_n)] <- F

    # Convex hull of zip code points.
    hull_order <- chull(
        zips_deduped$longitude[within_n],
        zips_deduped$latitude[within_n]
    )
    hull_x <- zips_deduped$longitude[within_n][hull_order]
    hull_y <- zips_deduped$latitude[within_n][hull_order]
    polygon(hull_x, hull_y, border='blue', col=rgb(0, 0, 1, fill_alpha[i]))
}

# The other HQs.
other_hqs = 1:nrow(largest_cities) != hq_idx
points(
    largest_cities$long[other_hqs],
    largest_cities$lat[other_hqs],
    pch = 21,
    bg = rgb(0.4, 0.4, 0.4, 0.6),
    col = 'black',
    cex = 1.5
)

# This HQ.
points(
    largest_cities$long[hq_idx],
    largest_cities$lat[hq_idx],
    pch = 21,
    bg = rgb(1, 0, 0, .85),
    col = 'black',
    cex = 1.5
)

This shows the regions for which Dallas is among the closest headquarters for 1, 2, and 3 levels of redundancy. Compare this map to the one from the previous post, and you’ll see that it conforms better to the highway system. For instance, it takes into account I-20 which moves east and west across Texas, instead of pushing up into the Dakotas.

And now our map of the US, showing the regions for each HQ as the set of zip codes for which it is the closest.

# Map of regions where every zip is served only by its closest HQ.
map('usa')
for (hq_idx in 1:nrow(largest_cities)) {
    # Find every zip for which this HQ is the closest.
    within_1 <- hqs_to_zips_rank2[hq_idx,] == 1
    within_1[is.na(within_1)] <- F

    # Convex hull of zip code points.
    hull_order <- chull(
        zips_deduped$longitude[within_1],
        zips_deduped$latitude[within_1]
    )
    hull_x <- zips_deduped$longitude[within_1][hull_order]
    hull_y <- zips_deduped$latitude[within_1][hull_order]
    polygon(
        hull_x,
        hull_y,
        border = 'black',
        col = rgb(0, 0, 1, 0.25)
    )
}

# All HQs
points(
    largest_cities$long,
    largest_cities$lat,
    pch = 21,
    bg = rgb(1, 0, 0, .75),
    col = 'black',
    cex = 1.5
)

This gives us our new map. If we compare this with the original, it should better reflect the topology of the highway system. It also looks a bit less jagged.

Exercises for the reader:

Some of these regions overlap, even though they are supposed to be only composed of zip codes for which a given HQ is the closest. Why is that?
Say we want to limit our driver to given maximum travel times. Based on our data from MapQuest, draw concentric circles representing approximate 3, 5, and 7 hour travel time regions.

If you need it, you can find that code here. ↩︎

🗺️ Preprocessing for Routing Problems - Part 1

Wed, 28 May 2014 00:00:00 +0000

Consider an instance of the vehicle routing problem in which we have drivers that are geographically distributed, each in a unique location. Our goal is to deliver goods or services to a set of destinations at the lowest cost. It does not matter to our customers which driver goes to which destination, so long as the deliveries are made.

One can think of this problem as a collection of travelling salesman problems, where there are multiple salespeople in different locations and a shared set of destinations. We attempt to find the minimum cost schedule for all salespeople that visits all destinations, where each salesman can technically go anywhere.

We believe that sending a driver farther will result in increased cost. But, given a particularly good tour, we might do that anyway. On the other hand, there are plenty of assignments we would never consider. It would be madness to send a driver from Los Angeles to New York if we already have another person stationed near there. Thus there are a large number of scenarios that may be possible, but that we will never pursue.

Our ultimate goal is to construct a model that finds an optimal (or near-optimal) schedule. Before we get to that, we have a bit of preprocessing to do. We would like to create regions for our drivers that make some bit of sense, balancing constraints on travel time with redundant coverage of our customers. Once we have these regions, we will know where we can allow our drivers to go in the final schedule.

Let’s get started in R. We’ll assume that we have drivers stationed at our regional headquarters in the 25 largest US cities by population. We assume that every possible customer address will be in some five digit zip code in the continental US. We pull this information out of the standard R data sets and pare down to only unique locations, fixing a couple errors in the data along the way.

library(datasets)
library(zipcode)
data(zipcode)

# Alexandria, VA is not in Normandy, France.
zipcode[zipcode$zip=='22350', c('latitude', 'longitude')] <- c(38.863930, -77.055547)

# New York City, NY is not in Kyrgyzstan.
zipcode$longitude[zipcode$zip=='10200'] <- -zipcode$longitude[zipcode$zip=='10200']

# Pare down to zip codes in the continental US.
states_continental <- state.abb[!(state.abb %in% c('AK', 'HI'))]
zips_continental <- subset(zipcode, state %in% states_continental)
zips_deduped <- zips_continental[!duplicated(zips_continental[, c('latitude', 'longitude')]), ]

# Geographic information for top 25 cities in the country.
library(maps)
data(us.cities)
largest_cities <- subset(
    us.cities[order(us.cities$pop, decreasing=T),][1:25,],
    select = c('name', 'lat', 'long')
)

With this information we can get some sense of what we’re up against. We generate a map of all the zip code locations in blue and our driver locations in red.

# Plot our corporate headquarters and every unique zip code location.
map('state')
points(zips_deduped$longitude, zips_deduped$latitude, pch=21, col=rgb(0, 0, 1, .5), cex=0.1)
points(largest_cities$long, largest_cities$lat, pch=21, bg=rgb(1, 0, 0, .75), col='black', cex=1.5)

So how do we go about assigning zip codes to our drivers? One option is to draw circles of a given radius around our drivers and increase that radius until we have the coverage we need.

On second thought, that doesn’t work so well. By the time we have a large enough radius, there will be so much overlap the assignments won’t make much sense. It would be better if we started by assigning each zip code to the driver that is physically closest. We could then start introducing redundancy into our data by adding the second closest driver, and so on.

# Euclidean distance from each HQ to each zip code.
library(SpatialTools)
zips_to_hqs_dist <- dist2(
    matrix(c(zips_deduped$longitude, zips_deduped$latitude), ncol=2),
    matrix(c(largest_cities$long, largest_cities$lat), ncol=2)
)

# Rank HQs by their distance to each unique zip code location.
hqs_to_zips_rank <- matrix(nrow=nrow(largest_cities), ncol=nrow(zips_deduped))
for (i in 1:nrow(zips_deduped)) {
    hqs_to_zips_rank[,i] <- rank(zips_to_hqs_dist[i,], ties.method='first')
}
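To make the idea of redundancy levels concrete, this Python sketch picks the n closest headquarters for a single location. The coordinates are rough, made-up values in (longitude, latitude) order, and the distance is plain Euclidean distance in degrees, as in the R code above.

```python
import math

def closest_hqs(zip_point, hqs, n):
    """Return the names of the n HQs nearest to zip_point, closest first."""
    by_distance = sorted(hqs, key=lambda name: math.dist(zip_point, hqs[name]))
    return by_distance[:n]

# Hypothetical HQ locations as (longitude, latitude).
hqs = {
    "Dallas TX": (-96.77, 32.79),
    "Houston TX": (-95.39, 29.77),
    "Chicago IL": (-87.68, 41.84),
}
austin = (-97.74, 30.27)  # hypothetical customer location

print(closest_hqs(austin, hqs, 1))
print(closest_hqs(austin, hqs, 2))
```

With a redundancy level of 1 the location belongs to a single region. Raising the level to 2 places it in two regions at once.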

Let’s see what this looks like on the map. The following shows what the region for the Dallas, TX driver would be if she were only allowed to visit zip codes for which she is the closest, second closest, and third closest. We map these as polygons using the convex hull of their respective zip code locations.

# Now we draw regions for which Dallas is one of the closest 3 HQs.
hq_idx <- which(largest_cities$name == 'Dallas TX')
redundancy_levels <- c(3, 2, 1)
fill_alpha <- c(0.15, 0.30, 0.45)

map('state')
for (i in 1:length(redundancy_levels)) {
    # Find every zip for which this HQ is within n in distance rank.
    within_n <- hqs_to_zips_rank[hq_idx,] <= redundancy_levels[i]

    # Convex hull of zip code points.
    hull_order <- chull(
        zips_deduped$longitude[within_n],
        zips_deduped$latitude[within_n]
    )
    hull_x <- zips_deduped$longitude[within_n][hull_order]
    hull_y <- zips_deduped$latitude[within_n][hull_order]
    polygon(hull_x, hull_y, border='blue', col=rgb(0, 0, 1, fill_alpha[i]))
}

# The other HQs.
other_hqs = 1:nrow(largest_cities) != hq_idx
points(
    largest_cities$long[other_hqs],
    largest_cities$lat[other_hqs],
    pch = 21,
    bg = rgb(0.4, 0.4, 0.4, 0.6),
    col = 'black',
    cex = 1.5
)

# This HQ.
points(
    largest_cities$long[hq_idx],
    largest_cities$lat[hq_idx],
    pch = 21,
    bg = rgb(1, 0, 0, .85),
    col = 'black',
    cex = 1.5
)
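The regions above are convex hulls of zip code points. For readers curious about what that computation involves, here is a short Python sketch of one well-known hull algorithm, Andrew's monotone chain. It is not necessarily what R's chull does internally, and it returns hull points where chull returns indices into the input.

```python
def cross(o, a, b):
    """Z-component of (a - o) x (b - o); positive for a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    """Return the hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        # Drop points that would make a clockwise or straight turn.
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other.
    return lower[:-1] + upper[:-1]

# Hypothetical points: a unit square with one interior point.
hull = convex_hull([(0, 0), (1, 0), (1, 1), (0, 1), (0.5, 0.5)])
print(hull)
```

Interior points are discarded, which is why a region drawn this way can cover areas that contain none of its own zip codes.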

This makes a bit more sense. If we enforce a redundancy level of 1, then every zip code has exactly one person assigned to it. As we increase that redundancy level, we have more options in terms of driver assignment. And our optimization model will grow correspondingly in size.

The following produces a map of all our regions where each zip code is served only by its closest driver.

# Map of regions where every zip is served only by its closest HQ.
map('usa')
for (hq_idx in 1:nrow(largest_cities)) {
    # Find every zip for which this HQ is the closest.
    within_1 <- hqs_to_zips_rank[hq_idx,] == 1
    within_1[is.na(within_1)] <- F

    # Convex hull of zip code points.
    hull_order <- chull(
        zips_deduped$longitude[within_1],
        zips_deduped$latitude[within_1]
    )
    hull_x <- zips_deduped$longitude[within_1][hull_order]
    hull_y <- zips_deduped$latitude[within_1][hull_order]
    polygon(
        hull_x,
        hull_y,
        border = 'black',
        col = rgb(0, 0, 1, 0.25)
    )
}

# All HQs
points(
    largest_cities$long,
    largest_cities$lat,
    pch = 21,
    bg = rgb(1, 0, 0, .75),
    col = 'black',
    cex = 1.5
)

This is a good start. Our preprocessing step gives us a reasonable level of control over the assignments of drivers before we begin optimizing. So what’s missing?

One immediately apparent failure is that these regions are based on Euclidean distance. Travel time is not a simple function of that. It would be much better if we could create regions using estimated time, drawing them based on topology of the highway system. We’ll explore techniques for doing so in the next post.

⭕ Chebyshev Centers of Polygons with Gurobi

Mon, 03 Feb 2014 00:00:00 +0000

Note: This post was written before Gurobi supported nonlinear optimization. It has been updated to work with Python 3.

A common problem in handling geometric data is determining the center of a given polygon. This is not quite so easy as it sounds as there is not a single definition of center that makes sense in all cases. For instance, sometimes computing the center of a polygon’s bounding box may be sufficient. In some instances this may give a point on an edge (consider a right triangle). If the given polygon is non-convex, that point may not even be inside or on its boundary.

This post looks at computing Chebyshev centers for arbitrary convex polygons. We employ essentially the same model as in Boyd & Vandenberghe’s Convex Optimization text, but using Gurobi instead of CVXOPT.

Consider a polygon defined by the intersection of a finite number of half-spaces, $Au \le b$. We assume we are given the set of vertices, $V$, in clockwise order around the polygon. $E$ is the set of edges connecting these vertices. Each edge in $E$ defines a boundary of the half-space $a_i^\intercal u \le b_i$.

$$ V = \{(1,1), (2,5), (5,4), (6,2), (4,1)\}\\ E = \{((1,1),(2,5)), ((2,5),(5,4)), ((5,4),(6,2)), ((6,2),(4,1)), ((4,1),(1,1))\} $$

The Chebyshev center of this polygon is the center point $(x, y)$ of the maximum radius inscribed circle. That is, if we can find the largest circle that will fit inside our polygon without going outside its boundary, its center is the point we are looking for. Our decision variables are the center $(x, y)$ and the maximum inscribed radius, $r$.

In order to do this, we consider the edges independently. The long line segment below shows an arbitrary edge, $a_i^\intercal u \le b_i$. The short line connected to it is orthogonal to the edge, in the direction $a_i$. $(x, y)$ satisfies the inequality.

The shortest distance from $(x, y)$ to the edge will be in the direction of $a_i$. We’ll call this distance $r$. If we were to move the edge so it had the same slope but went through $(x, y)$, its right-hand side would drop from $b_i$ to $b_i - r\|a_i\|_2$. Thus we can add a constraint of the form $a_i^\intercal u + r\|a_i\|_2 \le b_i$ for each edge, where $u = (x, y)$, and maximize the value of $r$ as our objective function.

$$ \begin{align*} & \text{max} && r \\ & \text{s.t.} && (y_i-y_j)x + (x_j-x_i)y + r\sqrt{(x_j-x_i)^2 + (y_j-y_i)^2} \le (y_i-y_j)x_i + (x_j-x_i)y_i \\ & && \quad \forall \quad ((x_i,y_i), (x_j,y_j)) \in E \\ \end{align*} $$

As this is linear, we can solve it using any LP solver. The following code does so with Gurobi.

#!/usr/bin/env python3
from gurobipy import Model, GRB
from math import sqrt

vertices = [(1,1), (2,5), (5,4), (6,2), (4,1)]
edges = zip(vertices, vertices[1:] + [vertices[0]])

m = Model()
r = m.addVar()
x = m.addVar(lb=-GRB.INFINITY)
y = m.addVar(lb=-GRB.INFINITY)
m.update()

for (x1, y1), (x2, y2) in edges:
    dx = x2 - x1
    dy = y2 - y1
    m.addConstr((dx*y - dy*x) + (r * sqrt(dx**2 + dy**2)) <= dx*y1 - dy*x1)

m.setObjective(r, GRB.MAXIMIZE)
m.optimize()

print('r = %.04f' % r.x)
print('(x, y) = (%.04f, %.04f)' % (x.x, y.x))

The model output shows our center and its maximum inscribed radius.

$$ r = 1.7466\\ (x, y) = (3.2370, 2.7466) $$
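As a sanity check that does not need a solver, the following sketch measures the distance from the reported center to the line through each edge. It assumes the vertices are in clockwise order, as above, and uses the rounded solver output as input. The helper name `edge_distances` is made up for this illustration.

```python
from math import hypot

vertices = [(1, 1), (2, 5), (5, 4), (6, 2), (4, 1)]  # clockwise, as in the model


def edge_distances(vertices, point):
    """Distance from `point` to the line through each polygon edge.

    Positive values mean the point is on the interior side of that edge,
    assuming the vertices are listed in clockwise order.
    """
    px, py = point
    distances = []
    for (x1, y1), (x2, y2) in zip(vertices, vertices[1:] + vertices[:1]):
        dx, dy = x2 - x1, y2 - y1
        # Same half-space as the LP: dx*y - dy*x <= dx*y1 - dy*x1
        slack = (dx * y1 - dy * x1) - (dx * py - dy * px)
        distances.append(slack / hypot(dx, dy))
    return distances


center = (3.2370, 2.7466)  # rounded solver output reported above
distances = edge_distances(vertices, center)
print([round(d, 4) for d in distances])
print("smallest distance:", round(min(distances), 4))
```

Three of the five edges come out at roughly 1.7466 and the other two are farther away, which is what we expect from a maximum inscribed circle: it touches the edges that stop it from growing.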

Question for the reader: in certain circumstances, such as rectangles, the Chebyshev center is ambiguous. How might one get around this ambiguity?

✂️ Network Splitting

Mon, 29 Jul 2013 00:00:00 +0000

Note: A reader pointed out that Union-Find is a very efficient way to accomplish this task. Start there if you have the same problem!

Last week, Paul Rubin wrote an excellent post on Extracting a Connected Graph from an existing graph. Lately I’ve been performing related functions on data from OpenStreetMap, though without access to a solver. In my case I’m taking in arbitrary network data and splitting it into disconnected sub-networks. I thought it might be a good case study to show an algorithmic way of doing this and some of the performance issues I ran into.

A small example can be seen below. This shows a road network around the Las Vegas strip. There is one main (weakly) connected network in black. The roads highlighted in red are disconnected from the main network. We want code that will split these into connected sub-networks.

Say we have data that looks like the following. Instead of nodes, the numbers in quotes represent edges. Think of these as streets.

{
    "0": [1, 2, 3],
    "1": [9248, 9249, 9250],
    "2": [589, 9665, 9667],
    "3": [0, 5, 6],
    "4": [0, 5, 6],
    "5": [588],
    "6": [4, 8, 9],
    ...
}

Our basic strategy is the following:

Start with every edge alone in its own subnetwork.
For each connection, merge the networks of the source and destination edges.

#!/usr/bin/env python3
import json
import sys
import time

class hset(set):
    '''A hashable set. Note that it only hashes by the pointer, and not by the elements.'''
    def __hash__(self):
        return hash(id(self))

    def __eq__(self, other):
        return self is other

if __name__ == '__main__':
    try:
        inputfile = sys.argv[1]
    except IndexError:
        print('usage: %s network.json' % sys.argv[0])
        sys.exit()

    print(time.asctime(), 'parsing json input')
    connections = json.load(open(inputfile))

    edge_to_net = {} # Edge ID -> set([edges that are in the same network])
    nets = set()     # Set of known networks

    print(time.asctime(), 'detecting disconnected subgraphs')
    for i, (from_edge, to_set) in enumerate(connections.items()):
        from_edge = int(from_edge)

        try:
            from_net = edge_to_net[from_edge]
        except KeyError:
            from_net = edge_to_net[from_edge] = hset([from_edge])
            nets.add(from_net)

        if not (i+1) % (25 * 1000):
            print(time.asctime(), '%d edges processed / %d current subnets' % (i+1, len(nets)))

        for to in to_set:
            try:
                to_net = edge_to_net[to]

                # If we get here, merge the from_net into the to_net.
                if to_net is not from_net:
                    to_net.update(from_net)
                    for e in from_net:
                        edge_to_net[e] = to_net
                    nets.remove(from_net)
                    from_net = to_net

            except KeyError:
                from_net.add(to)
                edge_to_net[to] = from_net

    print(time.asctime(), len(nets), 'subnets found')

We run this against the network pictured above and it works reasonably quickly, finishing in about 7 seconds:

Mon Jul 29 12:22:38 2013 parsing json input
Mon Jul 29 12:22:38 2013 detecting disconnected subgraphs
Mon Jul 29 12:22:38 2013 25000 edges processed / 1970 current subnets
Mon Jul 29 12:22:44 2013 50000 edges processed / 124 current subnets
Mon Jul 29 12:22:45 2013 60 subnets found

However, when run against a road network for an entire city, the process continues for several hours. What is the issue?

The inefficiency occurs from lines 46 to 50. There we frequently reassign references to every element in a large set. Instead, it would be better to reassign as few references as possible. Therefore, instead of always merging from_net into to_net, we will determine which network is the smaller of the two and merge that one into the larger one. Note that this does not necessarily change the worst case time complexity of the algorithm, but it should make the code fast enough to be useful. The new version appears below.

#!/usr/bin/env python3
import json
import sys
import time

class hset(set):
    '''A hashable set. Note that it only hashes by the pointer, and not by the elements.'''
    def __hash__(self):
        return hash(id(self))

    def __eq__(self, other):
        return self is other

if __name__ == '__main__':
    try:
        inputfile = sys.argv[1]
    except IndexError:
        print('usage: %s network.json' % sys.argv[0])
        sys.exit()

    print(time.asctime(), 'parsing json input')
    connections = json.load(open(inputfile))

    edge_to_net = {} # Edge ID -> set([edges that are in the same network])
    nets = set()     # Set of known networks

    print(time.asctime(), 'detecting disconnected subgraphs')
    for i, (from_edge, to_set) in enumerate(connections.items()):
        from_edge = int(from_edge)

        try:
            from_net = edge_to_net[from_edge]
        except KeyError:
            from_net = edge_to_net[from_edge] = hset([from_edge])
            nets.add(from_net)

        if not (i+1) % (25 * 1000):
            print(time.asctime(), '%d edges processed / %d current subnets' % (i+1, len(nets)))

        for to in to_set:
            try:
                to_net = edge_to_net[to]

                # If we get here, merge the smaller network into the larger one.
                if to_net is not from_net:
                    # Update references to and remove the smaller set for speed.
                    if len(to_net) < len(from_net):
                        smaller, larger = to_net, from_net
                    else:
                        smaller, larger = from_net, to_net

                    larger.update(smaller)
                    for e in smaller:
                        edge_to_net[e] = larger
                    nets.remove(smaller)
                    edge_to_net[to] = larger
                    from_net = larger

            except KeyError:
                from_net.add(to)
                edge_to_net[to] = from_net

    print(time.asctime(), len(nets), 'subnets found')

Indeed, this is significantly faster. And on very large networks it runs in minutes instead of hours or days. On the small test case used for this post, it runs in under a second. While this could probably be done faster, that’s actually good enough for right now.

Mon Jul 29 12:39:55 2013 parsing json input
Mon Jul 29 12:39:55 2013 detecting disconnected subgraphs
Mon Jul 29 12:39:55 2013 25000 edges processed / 1970 current subnets
Mon Jul 29 12:39:55 2013 50000 edges processed / 124 current subnets
Mon Jul 29 12:39:55 2013 60 subnets found
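The note at the top of this post points to Union-Find. A minimal sketch of that approach, assuming the same JSON layout as above (string edge IDs mapping to lists of integer edge IDs), might look like the following. The names `split_network`, `find`, and `union` are made up for illustration, and this has not been timed against the OpenStreetMap data.

```python
def split_network(connections):
    """Group edge IDs into connected sub-networks with union-find."""
    parent = {}  # edge ID -> parent edge ID (roots point to themselves)
    size = {}    # root edge ID -> number of edges in its sub-network

    def find(e):
        # Register unseen edges as their own one-element sub-network.
        if e not in parent:
            parent[e] = e
            size[e] = 1
        # Walk up to the root, then point everything on the path at it.
        root = e
        while parent[root] != root:
            root = parent[root]
        while parent[e] != root:
            parent[e], e = root, parent[e]
        return root

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra == rb:
            return
        # Attach the smaller tree under the larger one.
        if size[ra] < size[rb]:
            ra, rb = rb, ra
        parent[rb] = ra
        size[ra] += size.pop(rb)

    for from_edge, to_edges in connections.items():
        from_edge = int(from_edge)
        find(from_edge)  # make sure edges with no connections are counted
        for to in to_edges:
            union(from_edge, to)

    nets = {}
    for e in parent:
        nets.setdefault(find(e), set()).add(e)
    return list(nets.values())


# A tiny hand-made example in the same shape as the JSON input.
example = {
    "0": [1, 2, 3],
    "3": [0, 5, 6],
    "4": [0, 5, 6],
    "5": [588],
    "6": [4, 8, 9],
    "100": [101],
    "200": [],
}
subnets = split_network(example)
print(len(subnets), "subnets found")
```

The design difference from the set-merging versions is that a merge only rewrites one parent pointer, so no set of edges is copied until the sub-networks are collected at the end.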

🏖️ Lagrangian Relaxation with Gurobi

Sat, 22 Sep 2012 00:00:00 +0000

Note: This post was updated to work with Python 3 and the 2nd edition of “Integer Programming” by Laurence Wolsey.

We’ve been studying Lagrangian Relaxation (LR) in the Advanced Topics in Combinatorial Optimization course I’m taking this term, and I had some difficulty finding a simple example covering its application. In case anyone else finds it useful, I’m posting a Python version for solving the Generalized Assignment Problem (GAP). This won’t discuss the theory of LR at all, just give example code using Gurobi.

Generalized assignment

The GAP as defined by Wolsey consists of a maximization problem subject to a set of set packing constraints followed by a set of knapsack constraints.

$$ \begin{align*} & \text{max} && \sum_i \sum_j c_{ij} x_{ij} \\ & \text{s.t.} && \sum_j x_{ij} \leq 1 && \forall i \\ & && \sum_i a_{ij} x_{ij} \leq b_j && \forall j \\ & && x_{ij} \in \{0, 1\} \end{align*} $$

Naive model

A naive version of this model using Gurobi might look like the following.

#!/usr/bin/env python

# This is the GAP per Wolsey, pg 208.
from gurobipy import Model, GRB, quicksum as qsum

m = Model("GAP per Wolsey")
m.modelSense = GRB.MAXIMIZE
m.setParam("OutputFlag", False)  # turns off solver chatter

b = [15, 15, 15]
c = [
    [6, 10, 1],
    [12, 12, 5],
    [15, 4, 3],
    [10, 3, 9],
    [8, 9, 5],
]
a = [
    [5, 7, 2],
    [14, 8, 7],
    [10, 6, 12],
    [8, 4, 15],
    [6, 12, 5],
]

# x[i][j] = 1 if i is assigned to j
x = [[m.addVar(vtype=GRB.BINARY) for _ in row] for row in c]

# sum j: x_ij <= 1 for all i
for x_i in x:
    m.addConstr(sum(x_i) <= 1)

# sum i: a_ij * x_ij <= b[j] for all j
for j, b_j in enumerate(b):
    m.addConstr(qsum(a[i][j] * x_i[j] for i, x_i in enumerate(x)) <= b_j)

# max sum i,j: c_ij * x_ij
m.setObjective(
    qsum(qsum(c_ij * x_ij for c_ij, x_ij in zip(c_i, x_i)) for c_i, x_i in zip(c, x))
)
m.optimize()

# Pull solution out of m.
print(f"z = {m.objVal}")
print("x = [")
for x_i in x:
    print(f"  {[1 if x_ij.x >= 0.5 else 0 for x_ij in x_i]}")
print("]")

The solver quickly finds the following optimal solution of this toy problem.

z = 46.0
x = [
  [0, 1, 0]
  [0, 1, 0]
  [1, 0, 0]
  [0, 0, 1]
  [0, 0, 0]
]
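The toy instance is small enough to check without a solver. The sketch below simply tries every way of assigning each of the five items to one of the three knapsacks, or to none, which is only 4^5 = 1024 cases. The function name `brute_force_gap` is made up for illustration; this is a cross-check, not something that scales.

```python
from itertools import product

b = [15, 15, 15]
c = [
    [6, 10, 1],
    [12, 12, 5],
    [15, 4, 3],
    [10, 3, 9],
    [8, 9, 5],
]
a = [
    [5, 7, 2],
    [14, 8, 7],
    [10, 6, 12],
    [8, 4, 15],
    [6, 12, 5],
]


def brute_force_gap(a, b, c):
    """Try every assignment of items to knapsacks (or to none)."""
    n_items, n_sacks = len(c), len(b)
    best_value, best_choice = 0, (None,) * n_items
    # choice[i] is the knapsack for item i, or None if i is unassigned.
    for choice in product([None, *range(n_sacks)], repeat=n_items):
        load = [0] * n_sacks
        value = 0
        for i, j in enumerate(choice):
            if j is not None:
                load[j] += a[i][j]
                value += c[i][j]
        fits = all(used <= cap for used, cap in zip(load, b))
        if fits and value > best_value:
            best_value, best_choice = value, choice
    return best_value, best_choice


best_value, best_choice = brute_force_gap(a, b, c)
print(f"z = {best_value}")
print(f"assignment = {best_choice}")
```

Allowing `None` as a choice for each item is what makes the set packing constraints hold automatically: every item lands in at most one knapsack.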

Lagrangian model

There are two sets of constraints we can dualize. It can be beneficial to apply Lagrangian Relaxation against problems composed of knapsack constraints, so we will dualize the set packing ones.

# sum j: x_ij <= 1 for all i
for x_i in x:
    m.addConstr(sum(x_i) <= 1)

We replace these with a new set of variables, penalties, which take the values of the slacks on the set packing constraints. We then modify the objective function, adding Lagrangian multipliers times these penalties.

Instead of optimizing once, we do so iteratively. An important consideration is we may get nothing more than a dual bound from this process. Any integer solution is not guaranteed to be primal feasible unless it satisfies complementary slackness conditions – for each dualized constraint either its multiplier or penalty must be zero.

We then set the initial multiplier values to 2 and use sub-gradient optimization with a step size of 1 / (iteration #) to adjust them.

#!/usr/bin/env python

# This is the GAP per Wolsey, pg 208, using Lagrangian Relaxation.
from gurobipy import Model, GRB, quicksum as qsum

m = Model("GAP per Wolsey with Lagrangian Relaxation")
m.modelSense = GRB.MAXIMIZE
m.setParam("OutputFlag", False)  # turns off solver chatter

b = [15, 15, 15]
c = [
    [6, 10, 1],
    [12, 12, 5],
    [15, 4, 3],
    [10, 3, 9],
    [8, 9, 5],
]
a = [
    [5, 7, 2],
    [14, 8, 7],
    [10, 6, 12],
    [8, 4, 15],
    [6, 12, 5],
]

# x[i][j] = 1 if i is assigned to j
x = [[m.addVar(vtype=GRB.BINARY) for _ in row] for row in c]

# As stated, the GAP has the set packing constraints shown above. We dualize
# these into penalties, using variables so we can easily extract their values.
penalties = [m.addVar() for _ in x]

# Dualized constraints: sum j: x_ij <= 1 for all i
for p, x_i in zip(penalties, x):
    m.addConstr(p == 1 - sum(x_i))

# sum i: a_ij * x_ij <= b[j] for all j
for j, b_j in enumerate(b):
    m.addConstr(qsum(a[i][j] * x_i[j] for i, x_i in enumerate(x)) <= b_j)

# u[i] = Lagrangian Multiplier for the set packing constraint i
u = [2.0] * len(x)

# Re-optimize until either we have run a certain number of iterations
# or complementary slackness conditions apply.
for k in range(1, 101):
    # max sum i,j: c_ij * x_ij
    m.setObjective(
        qsum(
            # Original objective function
            sum(c_ij * x_ij for c_ij, x_ij in zip(c_i, x_i))
            for c_i, x_i in zip(c, x)
        )
        + qsum(
            # Penalties for dualized constraints
            u_j * p_j
            for u_j, p_j in zip(u, penalties)
        )
    )
    m.optimize()

    print(
        f"iteration {k}: z = {m.objVal}, u = {u}, penalties = {[p.x for p in penalties]}"
    )

    # Test for complementary slackness
    stop = True
    eps = 10e-6
    for u_i, p_i in zip(u, penalties):
        if abs(u_i) > eps and abs(p_i.x) > eps:
            stop = False
            break

    if stop:
        print("primal feasible & optimal")
        break

    else:
        s = 1.0 / k
        for i in range(len(x)):
            u[i] = max(u[i] - s * (penalties[i].x), 0.0)

# Pull solution out of m.
print(f"z = {m.objVal}")
print("x = [")
for x_i in x:
    print(f"  {[1 if x_ij.x >= 0.5 else 0 for x_ij in x_i]}")
print("]")

Again, the example converges very quickly to an optimal solution.

iteration 1: z = 48.0, u = [2.0, 2.0, 2.0, 2.0, 2.000], penalties = [0.0, 0.0, 0.0, 0.0, 1.0]
iteration 2: z = 47.0, u = [2.0, 2.0, 2.0, 2.0, 1.000], penalties = [0.0, 0.0, 0.0, 0.0, 1.0]
iteration 3: z = 46.5, u = [2.0, 2.0, 2.0, 2.0, 0.500], penalties = [0.0, 0.0, 0.0, 0.0, 1.0]
iteration 4: z = 46.2, u = [2.0, 2.0, 2.0, 2.0, 0.167], penalties = [0.0, 0.0, 0.0, 0.0, 1.0]
iteration 5: z = 46.0, u = [2.0, 2.0, 2.0, 2.0, 0.000], penalties = [0.0, 0.0, 0.0, 0.0, 1.0]
primal feasible & optimal
z = 46.0
x = [
  [0, 1, 0]
  [0, 1, 0]
  [1, 0, 0]
  [0, 0, 1]
  [0, 0, 0]
]
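The multiplier values in the log can be reproduced by hand. The sketch below applies the same update rule as the script, with a step size of 1 / (iteration #), and assumes the penalties stay at the values shown in the log. The function name `subgradient_step` is made up for illustration.

```python
def subgradient_step(u, penalties, k):
    """One sub-gradient update with step size 1/k, clipped at zero."""
    step = 1.0 / k
    return [max(u_i - step * p_i, 0.0) for u_i, p_i in zip(u, penalties)]


u = [2.0] * 5
penalties = [0.0, 0.0, 0.0, 0.0, 1.0]  # penalty values taken from the log above

history = [u]
for k in range(1, 5):
    u = subgradient_step(u, penalties, k)
    history.append(u)

for iteration, u_k in enumerate(history, start=1):
    print(f"iteration {iteration}: u[4] = {u_k[4]:.3f}")
```

Only the last multiplier moves, because it is the only one whose constraint has slack. It reaches zero at the fifth iteration, at which point every dualized constraint has either a zero multiplier or a zero penalty and the loop stops.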

Exercise for the reader: change the script to dualize the knapsack constraints instead of the set packing constraints. What is the result of this change in terms of convergence?

🔲 Normal Magic Squares

Fri, 13 Jan 2012 00:00:00 +0000

Note: This post was updated to work with Python 3 and PySCIPOpt. The original version used Python 2 and python-zibopt. It has also been edited for clarity.

As a followup to the last post, I created another SCIP example for finding Normal Magic Squares. This is similar to solving a Sudoku problem, except that here the number of binary variables depends on the square size. In the case of Sudoku, each cell has 9 binary variables – one for each potential value it might take. For a normal magic square, there are $n^2$ possible values for each cell, $n^2$ cells, and one variable representing the row, column, and diagonal sums. This makes a total of $n^4$ binary variables and one continuous variable in the model.

However, there are no big-Ms.

I think the neat part of this code is in this section:

# Construct an expression for each cell that is the sum of
# its binary variables with their associated coefficients.
sums = []
for row in matrix:
    sums_row = []
    for cell in row:
        sums_row.append(sum((i + 1) * x for i, x in enumerate(cell)))
    sums.append(sums_row)

It creates sums of the $n^2$ variables for each cell with their appropriate coefficients ($1$ to $n^2$) and stores those expressions to make the subsequent constraint creation simpler.
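To see why that weighted sum recovers a cell's value, here is a solver-free sketch using plain 0/1 lists in place of the binary variables. It encodes a known 3x3 normal magic square, decodes it with the same `(i + 1) * x` expression, and checks the result. The helper names are made up for illustration, and the checker tests both diagonals, which is the usual definition of a magic square.

```python
def one_hot(value, n):
    """n*n indicators with a single 1 in position value - 1."""
    return [1 if i + 1 == value else 0 for i in range(n * n)]


def decode_cell(bits):
    """Value of a cell from its 0/1 indicators, as in the model."""
    return sum((i + 1) * x for i, x in enumerate(bits))


def is_normal_magic_square(square):
    """Check values 1..n^2 are each used once and all lines sum equally."""
    n = len(square)
    target = n * (n * n + 1) // 2  # magic constant for a normal square
    values = sorted(v for row in square for v in row)
    if values != list(range(1, n * n + 1)):
        return False
    lines = [sum(row) for row in square]
    lines += [sum(col) for col in zip(*square)]
    lines.append(sum(square[i][i] for i in range(n)))
    lines.append(sum(square[i][n - 1 - i] for i in range(n)))
    return all(total == target for total in lines)


lo_shu = [[2, 7, 6], [9, 5, 1], [4, 3, 8]]
encoded = [[one_hot(v, 3) for v in row] for row in lo_shu]
decoded = [[decode_cell(bits) for bits in row] for row in encoded]
print(decoded, is_normal_magic_square(decoded))
```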

Another interesting exercise for the reader: Change the code to minimize the sum of each column. How does that impact the solution time?

🔲 Magic Squares and Big-Ms

Thu, 12 Jan 2012 00:00:00 +0000

Note: This post was updated to work with Python 3 and PySCIPOpt. The original version used Python 2 and python-zibopt. It has also been edited for clarity.

Back in October of 2011, I started toying with a model for finding magic squares using SCIP. This is a fun modeling exercise and a challenging problem. First one constructs a square matrix of integer-valued variables.

from pyscipopt import Model

# [...snip...]

m = Model()

matrix = []
for i in range(size):
    row = [m.addVar(vtype="I", lb=1) for _ in range(size)]
    for x in row:
        m.addCons(x <= M)
    matrix.append(row)

Then one adds the following constraints:

All variables ≥ 1.
All rows, columns, and the diagonal sum to the same value.
All variables take different values.

The first two constraints are trivial to implement, and relatively easy for the solver. What I do is add a single extra variable then set it equal to the sums of each row, column, and the diagonal.

sum_val = m.addVar(vtype="M")
for i in range(size):
    m.addCons(sum(matrix[i]) == sum_val)
    m.addCons(sum(matrix[j][i] for j in range(size)) == sum_val)

m.addCons(sum(matrix[i][i] for i in range(size)) == sum_val)

It’s the third that messes things up. You can think of this as saying, for every possible pair of integer-valued variables $x$ and $y$:

$$ x \ge y + 1 \quad \text{or} \quad x \le y - 1 $$

Why is this hard? Because we can’t add both constraints to the model. That would make it infeasible. Instead, we write them in such a way that exactly one will be active for any given solution. This requires, for each pair of variables, an additional binary variable $z$ and a (possibly big) constant $M$. Thus we reformulate the above as:

$$ x \ge (y + 1) - M z \\ x \le (y - 1) + M (1-z) \\ z \in \{0,1\} $$

In code this looks like:

from itertools import chain

all_vars = list(chain(*matrix))
for i, x in enumerate(all_vars):
    for y in all_vars[i+1:]:
        z = m.addVar(vtype="B")
        m.addCons(x >= y + 1 - M*z)
        m.addCons(x <= y - 1 + M*(1-z))

However, here be dragons. We may not know how big (or small) to make $M$. Generally we want it as small as possible to make the LP relaxation of our integer programming model tighter. Different values of $M$ have unpredictable effects on solution time.
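The effect of $M$ on a single pair of values can be checked with plain integers. The sketch below evaluates both big-M constraints for each choice of $z$ and reports which choices are allowed. The function names and the sample values are made up for illustration, assuming a 3x3 square with values from 1 to 9.

```python
def satisfied(x, y, z, M):
    """Both big-M constraints for a given choice of the binary z."""
    return x >= (y + 1) - M * z and x <= (y - 1) + M * (1 - z)


def feasible_z(x, y, M):
    """Values of z in {0, 1} for which the pair (x, y) is allowed."""
    return [z for z in (0, 1) if satisfied(x, y, z, M)]


M = 9  # hypothetical bound: the largest value a cell may take
for x, y in [(7, 2), (2, 7), (5, 5)]:
    print(f"x={x}, y={y}, M={M}: z in {feasible_z(x, y, M)}")

# A value of M that is too small cuts off legitimate pairs.
print(f"x=9, y=1, M=3: z in {feasible_z(9, 1, 3)}")
```

With $M = 9$, distinct values always leave exactly one usable $z$ and equal values leave none, which is the behavior we want. With $M = 3$ the pair (9, 1) has no usable $z$ even though the values differ, so the model would wrongly exclude it.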

Which brings us to an interesting idea:

SCIP now supports bilinear constraints. This means that I can make $M$ a variable in the above model.

import sys

try:
    M = int(sys.argv[2])
except IndexError:
    M = m.addVar(vtype="M", lb=size * size)
else:
    assert M >= size * size

The magic square model linked to in this post provides both options. The first command line argument it requires is the matrix size. The second one, $M$, is optional. If not given, it leaves $M$ up to the solver.

An interesting exercise for the reader: Change the code to search for a minimal magic square, which minimizes either the value of $M$ or the sums of the columns, rows, and diagonal.

⏳️ Know Your Time Complexities - Part 2

Fri, 25 Nov 2011 00:00:00 +0000

In response to this post, Ben Bitdiddle inquires:

I understand the concept of using a companion set to remove duplicates from a list while preserving the order of its elements. But what should I do if these elements are composed of smaller pieces? For instance, say I am generating combinations of numbers in which order is unimportant. How do I make a set recognize that [1,2,3] is the same as [3,2,1] in this case?

There are a couple points that should help here.

While lists are unhashable and therefore cannot be put into sets, tuples are perfectly capable of this. Therefore I cannot do this.

s = set()
s.add([1,2,3])

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'list'

But this works just fine (extra space added for emphasis of tuple parentheses).

s.add( (1,2,3) )

(3,2,1) and (1,2,3) may not hash to the same thing, but tuples are easily sortable. If I sort them before adding them to a set, they look the same.

tuple(sorted( (3,2,1) ))

(1, 2, 3)

If I want to be a little fancier, I can use itertools.combinations. The following generates all unique 3-digit combinations of integers from 1 to 4:

from itertools import combinations
list(combinations(range(1,5), 3))

[(1, 2, 3), (1, 2, 4), (1, 3, 4), (2, 3, 4)]

Now say I want to only find those that match some condition. I can add a filter to return, say, only those 3-digit combinations of integers from 1 to 6 that multiply to a number divisible by 10:

list(filter(
    lambda x: not (x[0]*x[1]*x[2]) % 10,
    combinations(range(1, 7), 3)
))

[(1, 2, 5),
 (1, 4, 5),
 (1, 5, 6),
 (2, 3, 5),
 (2, 4, 5),
 (2, 5, 6),
 (3, 4, 5),
 (3, 5, 6),
 (4, 5, 6)]
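Putting the first two points together with the companion set Ben mentions gives something like the sketch below. It keeps the first appearance of each combination and drops later ones that contain the same elements in a different order. The function name `unique_combinations` is made up for illustration.

```python
def unique_combinations(items):
    """Keep the first appearance of each item, ignoring element order."""
    seen = set()
    unique = []
    for item in items:
        # A sorted tuple is hashable and identical for any ordering.
        key = tuple(sorted(item))
        if key not in seen:
            seen.add(key)
            unique.append(item)
    return unique


combos = [[1, 2, 3], [3, 2, 1], [2, 4], [4, 2], [5]]
print(unique_combinations(combos))
```

Sorting each item costs a little extra, but the membership test against `seen` stays fast, so the overall approach keeps the benefit of the companion set.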

⏳️ Know Your Time Complexities

Tue, 25 Oct 2011 00:00:00 +0000

This is based on a lightning talk I gave at the LA PyLadies October Hackathon.

I’m actually not going to go into anything much resembling algorithmic complexity here. What I’d like to do is present a common performance anti-pattern that I see from novice programmers about once every year or so. If I can prevent one person from committing this error, this post will have achieved its goal. I’d also like to show how an intuitive understanding of time required by operations in relation to the size of data they operate on can be helpful.

Say you have a Big List of Things. It doesn’t particularly matter what these things are. Often they might be objects or dictionaries of denormalized data. In this example we’ll use numbers. Let’s generate a list of 1 million integers, each randomly chosen from the first 100 thousand natural numbers:

import random

choices = range(100000)
x = [random.choice(choices) for i in range(1000000)]

Now say you want to remove (or aggregate, or structure) duplicate data while keeping them in order of appearance. Intuitively, this seems simple enough. A first solution might involve creating a new empty list, iterating over x, and only appending those items that are not already in the new list.

The Bad Way

order = []
for i in x:
    if i not in order:
        order.append(i)

Try running this. What’s wrong with it?

The issue is the conditional on line 3. In the worst case, it could look at every item in the order list for each item in x. If the list is big, as it is in our example, that wastes a lot of cycles. We can reason that we can improve the performance of our code by replacing this conditional with something faster.

The Good Way

Given that sets have near constant time for membership tests, one solution is to create a companion data structure, which we’ll call seen. Being a set, it doesn’t care about the order of the items, but it will allow us to test for membership quickly.

order = []
seen = set()
for i in x:
    if i not in seen:
        seen.add(i)
        order.append(i)

Now try running this. Better?

Not that this is the best way to perform this particular action. If you aren’t familiar with it, take a look at the groupby function from itertools, which is what I will sometimes reach for in a case like this.
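Another option worth knowing, assuming Python 3.7 or later where dictionaries preserve insertion order, is to let a dictionary play the role of both the list and the set. The sketch below compares it against the companion-set version on a smaller random list. The function names are made up for illustration.

```python
import random


def dedupe_with_set(x):
    """The companion-set approach from above."""
    order = []
    seen = set()
    for i in x:
        if i not in seen:
            seen.add(i)
            order.append(i)
    return order


def dedupe_with_dict(x):
    """Keys of a dict are unique and keep their insertion order."""
    return list(dict.fromkeys(x))


rng = random.Random(0)  # fixed seed so the run is repeatable
x = [rng.randrange(1000) for _ in range(10000)]

print(dedupe_with_dict([3, 1, 3, 2, 1]))
print(dedupe_with_set(x) == dedupe_with_dict(x))
```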

🎰 Deterministic vs. Stochastic Simulation

Sat, 11 Jun 2011 00:00:00 +0000

I find I have to build simulations with increasing frequency in my work and life. Usually this indicates I’m faced with one of the following situations:

The need for a quick estimate regarding the quantitative behavior of some situation.
The desire to verify the result of a computation or assumption.
A situation which is too complex or random to effectively model or understand.

Anyone familiar at all with simulation will recognize the last item as the motivating force of the entire field. Simulation models tend to take over when systems become so complex that understanding them is prohibitive in cost and time or entirely infeasible. In a simulation, the modeler can focus on individual interactions between entities while still hoping for useful output in the form of descriptive statistics.

As such, simulations are nearly always stochastic. The output of a simulation, whether it be the mean time to service upon entering a queue or the number of fish alive in a pond, is determined by a number of random inputs. It is estimated by looking at a sample of the entire, often infinite, problem space and therefore must be described in terms of mean and variance.

For me, simulation building usually follows a process roughly like this:

Work with a domain expert to understand the process under study.
Convert this process into a deterministic simulation (no randomness).
Verify the output of the deterministic simulation.
Analyze the inputs of the simulation to determine their probability distributions.
Convert the deterministic simulation to a stochastic simulation.

The reason for creating a simulation without randomness first is that it can be difficult or impossible to verify its correctness otherwise. Thus one may focus on the simulation logic first before analyzing and adding sources of randomness.

Where the procedure breaks down is after the third step. Domain experts are often happy to share their knowledge about systems to aid in designing simulations, and typically can understand the resulting abstractions. They are also invaluable in verifying simulation output. However, they are unlikely to understand why it is necessary to add randomness to a system that they already perceive as functional. Further, doing so can be just as difficult and time consuming as the initial model development and therefore requires justification.

This can be a quandary for the model builder. How does one communicate the need to incorporate randomness to decision makers who lack understanding of probability? It is trivially easy to construct simulations that use the same input parameters but yield drastically different outputs. Consider the code below, which simulates two events occurring and counts the number of times event b happens before event a.

import random

def sim_stochastic(event_a_lambda, event_b_lambda):
    # Returns 0 if event A arrives first, 1 if event B arrives first

    # Calculate next arrival time for each event randomly.
    event_a_arrival = random.expovariate(event_a_lambda)
    event_b_arrival = random.expovariate(event_b_lambda)

    return 0.0 if event_a_arrival <= event_b_arrival else 1.0

def sim_deterministic(event_a_lambda, event_b_lambda):
    # Returns 0 if event A arrives first, 1 if event B arrives first

    # Calculate next arrival time for each event deterministically.
    event_a_arrival = 1.0 / event_a_lambda
    event_b_arrival = 1.0 / event_b_lambda

    return 0.0 if event_a_arrival <= event_b_arrival else 1.0

if __name__ == '__main__':
    event_a_lambda = 0.3
    event_b_lambda = 0.5

    repetitions = 10000

    for sim in (sim_stochastic, sim_deterministic):
        output = [
            sim(event_a_lambda, event_b_lambda)
            for _ in range(repetitions)
        ]
        event_b_first = 100.0 * (sum(output) / len(output))
        print('event b is first %0.1f%% of the time' % event_b_first)

Both simulations use the same input parameters, but the second one is essentially wrong as b will always happen first. In the stochastic version, we use exponential distributions for the inputs and obtain an output that verifies our basic understanding of these distributions.

event b is first 63.0% of the time
event b is first 100.0% of the time
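The stochastic figure can be checked against theory. For two independent exponential arrival times with rates $\lambda_a$ and $\lambda_b$, the probability that b arrives first is $\lambda_b / (\lambda_a + \lambda_b)$, which is 0.625 for the rates used here. The sketch below compares that value with a seeded simulation. The function names and the seed are made up for illustration.

```python
import random


def prob_b_first(event_a_lambda, event_b_lambda):
    """Exact P(b arrives before a) for independent exponential arrivals."""
    return event_b_lambda / (event_a_lambda + event_b_lambda)


def estimate_b_first(event_a_lambda, event_b_lambda, repetitions, rng):
    """Monte Carlo estimate of the same probability."""
    hits = 0
    for _ in range(repetitions):
        a_arrival = rng.expovariate(event_a_lambda)
        b_arrival = rng.expovariate(event_b_lambda)
        hits += b_arrival < a_arrival
    return hits / repetitions


rng = random.Random(42)  # fixed seed so the run is repeatable
exact = prob_b_first(0.3, 0.5)
estimate = estimate_b_first(0.3, 0.5, 10000, rng)
print(f"exact = {exact:.3f}, simulated = {estimate:.3f}")
```

A closed form like this is only available because the example is so simple, but it makes a useful talking point: the deterministic model says 100%, while both theory and the stochastic model say something close to 62.5%.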

How about you? How do you discuss the need to model a random world with decision makers?

🔮 NetworkX and Python Futures

Thu, 19 May 2011 00:00:00 +0000

Note: This post was updated to work with NetworkX and for clarity.

It’s possible this will turn out like the day when Python 2.5 introduced coroutines. At the time I was very excited. I spent several hours trying to convince my coworkers we should immediately abandon all our existing Java infrastructure and port it to finite state machines implemented using Python coroutines. After a day of hand waving over a proof of concept, we put that idea aside and went about our lives.

Soon after, I left for a Python shop, but in the next half decade I still never found a good place to use this interesting feature.

But it doesn’t feel like that.

As I come to terms more with switching to Python 3.2, the futures module seems similarly exciting. I wish I’d had it years ago, and it’s almost reason in itself to upgrade from Python 2.7. Who cares if none of your libraries have been ported yet?

This library lets you take any function and distribute it over a process pool. To test that out, we’ll generate a bunch of random graphs and iterate over all their cliques.

Code

First, let’s generate some test data using the dense_gnm_random_graph function. Our data includes 1000 random graphs, each with 100 nodes and 100 * 100 edges.

import networkx as nx

n = 100
graphs = [nx.dense_gnm_random_graph(n, n*n) for _ in range(1000)]

Now we write a function to iterate over all cliques in a given graph. NetworkX provides a find_cliques function which returns a generator. Iterating over them ensures we will run through the entire process of finding all cliques for a graph.

def iterate_cliques(g):
    for _ in nx.find_cliques(g):
        pass

Now we just define two functions, one for running in serial and one for running in parallel using futures.

from concurrent import futures

def serial_test(graphs):
    for g in graphs:
        iterate_cliques(g)

def parallel_test(graphs, max_workers):
    with futures.ProcessPoolExecutor(max_workers=max_workers) as executor:
        executor.map(iterate_cliques, graphs)
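
One thing to know about executor.map: it returns a lazy iterator, and an exception raised in a worker only surfaces when its result is retrieved from that iterator. Since iterate_cliques returns nothing, discarding the iterator is harmless here, but for functions that return values it helps to collect them. Here’s a minimal sketch of that, assuming a made-up stand-in function count_divisors and a hypothetical helper parallel_results (neither is part of the original script). It defaults to ThreadPoolExecutor so it runs as a plain snippet; passing futures.ProcessPoolExecutor from inside an if __name__ == '__main__' block should give the process-based behavior.

```python
from concurrent import futures

def count_divisors(n):
    """Hypothetical CPU-bound stand-in for iterate_cliques."""
    return sum(1 for d in range(1, n + 1) if n % d == 0)

def parallel_results(fn, items, max_workers,
                     executor_cls=futures.ThreadPoolExecutor):
    """Map fn over items and collect the results in input order."""
    with executor_cls(max_workers=max_workers) as executor:
        # list() drains the iterator, so any worker exception is raised here
        return list(executor.map(fn, items))

results = parallel_results(count_divisors, [12, 7, 28], max_workers=2)
print(results)
```

The results come back in the same order as the inputs, regardless of which worker finishes first.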

Our __main__ simply generates the random graphs, samples from them, times both functions, and writes CSV data to standard output.

from csv import writer
import random
import sys
import time

if __name__ == '__main__':
    out = writer(sys.stdout)
    out.writerow(['num graphs', 'serial time', 'parallel time'])

    n = 100
    graphs = [nx.dense_gnm_random_graph(n, n*n) for _ in range(1000)]

    # Run with a number of different randomly generated graphs
    for num_graphs in range(50, 1001, 50):
        sample = random.choices(graphs, k = num_graphs)

        start = time.time()
        serial_test(sample)
        serial_time = time.time() - start

        start = time.time()
        parallel_test(sample, 16)
        parallel_time = time.time() - start

        out.writerow([num_graphs, serial_time, parallel_time])

The output of this script shows that we get a fairly linear speedup to this code with little effort.

I ran this on a machine with 8 cores and hyperthreading. Eyeballing the chart, it looks like the speedup is roughly 5x. My system monitor shows spikes in CPU usage across cores whenever the parallel test runs.

Resources

Output data
Full source listing

👉 Affine Scaling in R

Wed, 27 Apr 2011 00:00:00 +0000

I recently stumbled across an implementation of the affine scaling interior point method for solving linear programs that I’d coded up in R once upon a time. I’m posting it here in case anyone else finds it useful. There’s not a whole lot of thought given to efficiency or numerical stability, just a demonstration of the basic algorithm. Still, sometimes that’s exactly what one wants.

solve.affine <- function(A, rc, x, tolerance=10^-7, R=0.999) {
  # Affine scaling method
  while (T) {
    X_diag <- diag(x)

    # Compute (A * X_diag^2 * A^t)^-1 using Cholesky factorization.
    # This is responsible for scaling the original problem matrix.
    q <- A %*% X_diag**2 %*% t(A)
    q_inv <- chol2inv(chol(q))

    # lambda = q^-1 * A * X_diag^2 * c
    lambda <- q_inv %*% A %*% X_diag^2 %*% rc

    # c - A^t * lambda is used repeatedly
    foo <- rc - t(A) %*% lambda

    # We converge as s goes to zero
    s <- sqrt(sum((X_diag %*% foo)^2))

    # Compute new x
    x <- (x + R * X_diag^2 %*% foo / s)[,]

    # If s is within our tolerance, stop.
    if (abs(s) < tolerance) break
  }
  x
}

This function accepts a matrix A which contains all technological coefficients for an LP, a vector rc containing its reduced costs, and an initial point x interior to the LP’s feasible region. Optional arguments to the function include a tolerance, for detecting when the method is within an acceptable distance from the optimal point, and a value for R, which must be strictly between 0 and 1 and controls scaling.

The method works by rescaling the matrix A around the current solution x. It then computes a new x such that it remains feasible and interior, which is why R cannot be 0 or 1. It requires a feasible interior point to start and only projects to other feasible interior points, so the right hand side of the LP is not required (it is implicit from the starting point). The shadow prices for each iteration are captured in the vector lambda, so the gap between primal and dual solutions is easy to compute.
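
Written out, one pass through the loop computes the following, where X is diag(x), c is the vector rc, and r stands in for the variable foo (these symbol names are mine, chosen to mirror the code):

$$ \begin{align*} \lambda &= (A X^2 A^\top)^{-1} A X^2 c \\ r &= c - A^\top \lambda \\ s &= \lVert X r \rVert_2 \\ x &\leftarrow x + R \, \frac{X^2 r}{s} \end{align*} $$

Substituting the first line into the second gives $A X^2 r = A X^2 c - A X^2 A^\top \lambda = 0$, so each step leaves $A x$ unchanged. That is why the right hand side never has to be passed in.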

We run this function against a 3x3 LP with a known solution:

max z = 5x1 + 4x2 + 3x3
st      2x1 + 3x2 +  x3 <=  5
        4x1 +  x2 + 2x3 <= 11
        3x1 + 4x2 + 2x3 <=  8
        x1, x2, x3 >= 0

The optimal solution to this LP is:

z  = 13
x1 =  2
x2 =  0
x3 =  1

This problem can be run against the affine scaling function by defining A with all necessary slack variables, and using an arbitrary feasible interior point:

A <- matrix(c(
  2,3,1,1,0,0,
  4,1,2,0,1,0,
  3,4,2,0,0,1
), nrow=3, byrow=T)
rc <- c(5, 4, 3, 0, 0, 0)
x  <- c(0.5, 0.5, 0.5, 2, 7.5, 3.5)

solution <- solve.affine(A, rc, x)
print(solution)
print(sum(solution * rc))

This provides an output vector that is very close to the optimal primal solution shown above. Since interior point methods converge asymptotically to optimal solutions, it is important to note that we can only ever get (extremely) close to our final optimal objective and decision variable values.

> print(solution)
[1] 1.999998e+00 4.268595e-07 1.000002e+00 1.280579e-06 1.000005e+00
[6] 1.280579e-06

> print(sum(solution * rc))
[1] 13.00000
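
As a sanity check on the claim that the right hand side is implicit in the starting point, here’s a small Python sketch (plain lists, no libraries) that multiplies A by both the starting point and the known optimum. The helper matvec and the rounded slack values in optimum are my own assumptions for illustration; the slacks are rounded from the printed output above.

```python
# Constraint matrix with slack columns, as passed to solve.affine
A = [
    [2, 3, 1, 1, 0, 0],
    [4, 1, 2, 0, 1, 0],
    [3, 4, 2, 0, 0, 1],
]
costs = [5, 4, 3, 0, 0, 0]

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * v for a, v in zip(row, vector)) for row in matrix]

start = [0.5, 0.5, 0.5, 2, 7.5, 3.5]   # interior starting point from the post
optimum = [2, 0, 1, 0, 1, 0]           # x1..x3 plus slacks rounded from the output

rhs_from_start = matvec(A, start)
rhs_from_optimum = matvec(A, optimum)
objective = sum(c * v for c, v in zip(costs, optimum))
print(rhs_from_start, rhs_from_optimum, objective)
```

Both products recover the right hand side (5, 11, 8) of the original constraints, and the objective at the optimum is 13.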

🐪 Reformed JAPHs: Transpiler

Wed, 20 Apr 2011 00:00:00 +0000

Note: This post was edited for clarity.

For the final JAPH in this series, I implemented a simple transpiler that converts a small subset of Scheme programs to equivalent Python programs. It starts with a Scheme program that prints 'just another scheme hacker'.

(define (output x)
    (if (null? x)
        ""
        (begin (display (car x))
                (if (null? (cdr x))
                    (display "\n")
                    (begin (display " ")
                            (output (cdr x)))))))
(output (list "just" "another" "scheme" "hacker"))

The program then tokenizes that Scheme source, parses the token stream, and converts that into Python 3.

def output(x):
    if not x:
        ""
    else:
        print(x[0], end='')
        if not x[1:]:
            print("\n", end='')
        else:
            print(" ", end='')
            output(x[1:])

output(["just", "another", "python", "hacker"])

Finally it executes the resulting Python string using exec. Obfuscation is left as an exercise for the reader.

import re

def tokenize(input):
    '''Tokenizes an input stream into a list of recognizable tokens'''
    token_res = (
        r'\(',      # open paren -> starts expression
        r'\)',      # close paren -> ends expression
        r'"[^"]*"', # quoted string (don't support \" yet)
        r'[\w?]+'   # atom
    )
    return re.findall(r'(' + '|'.join(token_res) + ')', input)

def parse(stream):
    '''Parses a token stream into a syntax tree'''
    if not stream:
        return []

    else:
        # Build a list of arguments (possibly expressions) at this level
        args = []
        while True:
            # Get the next token
            try:
                x = stream.pop(0)
            except IndexError:
                return args

            # ( and ) control the level of the tree we're at
            if x == '(':
                args.append(parse(stream))
            elif x == ')':
                return args
            else:
                args.append(x)

def compile(tree):
    '''Compiles a Scheme Abstract Syntax Tree into near-Python'''
    def compile_expr(indent, expr):
        indent += 1

        lines = [] # these will have [(indent, statement), ...] structure
        while expr:
            # Two options: expr is a string like "'" or it is a list
            if isinstance(expr, str):
                return [(
                    indent,
                    expr.replace('scheme', 'python').replace('\n', '\\n')
                )]

            else:
                start = expr.pop(0)

                if start == 'define':
                    signature = expr.pop(0)
                    lines.append((indent,
                        'def %s(%s):' % (
                            signature[0],
                            ', '.join(signature[1:])
                        )
                    ))
                    while expr:
                        lines.extend(compile_expr(indent, expr.pop(0)))

                elif start == 'if':
                    # We don't support multi-clause conditionals yet
                    clause = compile_expr(indent, expr.pop(0))[0][1]
                    lines.append((indent, 'if %s:' % clause))

                    if_true_lines = compile_expr(indent, expr.pop(0))
                    if_false_lines = compile_expr(indent, expr.pop(0))

                    lines.extend(if_true_lines)
                    lines.append((indent, 'else:'))
                    lines.extend(if_false_lines)

                elif start == 'null?':
                    # Only supports conditionals of the form (null? foo)
                    if isinstance(expr[0], str):
                        condition = expr.pop(0)
                    else:
                        condition = compile_expr(indent, expr.pop(0))[0][1]
                    return [(indent, 'not %s' % condition)]

                elif start == 'begin':
                    # This is just a series of statements, so don't indent
                    while expr:
                        lines.extend(compile_expr(indent-1, expr.pop(0)))

                elif start == 'display':
                    arguments = []
                    while expr:
                        arguments.append(
                            compile_expr(indent, expr.pop(0))[0][1]
                        )
                    lines.append((
                        indent,
                        "print(%s, end='')" % (', '.join(arguments))
                    ))

                elif start == 'car':
                    lines.append((indent, '%s[0]' % expr.pop(0)))

                elif start == 'cdr':
                    lines.append((indent, '%s[1:]' % expr.pop(0)))

                elif start == 'list':
                    arguments = []
                    while expr:
                        arguments.append(
                            compile_expr(indent, expr.pop(0))[0][1]
                        )
                    lines.append((indent, '[%s]' % ', '.join(arguments)))

                else:
                    # Assume this is a function call
                    arguments = []
                    while expr:
                        arguments.append(
                            compile_expr(indent, expr.pop(0))[0][1]
                        )
                    lines.append((
                        indent,
                        "%s(%s)" % (start, ', '.join(arguments))
                    ))

        return lines

    return [compile_expr(-1, expr) for expr in tree]

if __name__ == '__main__':
    scheme = '''
        (define (output x)
            (if (null? x)
                ""
                (begin (display (car x))
                       (if (null? (cdr x))
                           (display "\n")
                           (begin (display " ")
                                  (output (cdr x)))))))
        (output (list "just" "another" "scheme" "hacker"))
    '''
    python = ''
    for expr in compile(parse(tokenize(scheme))):
        python += '\n'.join([(' ' * 4 * x[0]) + x[1] for x in expr]) + '\n\n'
    exec(python)
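
To make the first two stages concrete, here’s a sketch of what tokenizing and parsing produce for a single small expression. It re-declares compact versions of tokenize and parse so that it runs on its own; they are meant to behave like the ones above for this input, not to replace them.

```python
import re

def tokenize(source):
    """Split Scheme source into parens, quoted strings, and atoms."""
    token_res = (r'\(', r'\)', r'"[^"]*"', r'[\w?]+')
    return re.findall('|'.join(token_res), source)

def parse(stream):
    """Turn a flat token list into nested lists, one per expression."""
    args = []
    while stream:
        token = stream.pop(0)
        if token == '(':
            args.append(parse(stream))   # descend one level
        elif token == ')':
            return args                  # close the current level
        else:
            args.append(token)
    return args

tokens = tokenize('(display (car x))')
tree = parse(list(tokens))   # parse consumes its input, so pass a copy
print(tokens)
print(tree)
```

The nested list is what compile walks: the first element of each list names the form, and the rest are its arguments.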

🐪 Reformed JAPHs: Turing Machine

Mon, 18 Apr 2011 00:00:00 +0000

Note: This post was edited for clarity.

This JAPH uses a Turing machine. The machine accepts any string that ends in '\n' and allows side effects. This lets us print the value of the tape as it encounters each character. While the idea of using lambda functions as side effects in a Turing machine is a little bizarre on many levels, we work with what we have. And Python is multi-paradigmatic, so what the heck.

import re

def turing(tape, transitions):
    # The tape input comes in as a string.  We approximate an infinite
    # length tape via a hash, so we need to convert this to {index: value}
    tape_hash = {i: x for i, x in enumerate(tape)}

    # Start at 0 using our transition matrix
    index = 0
    state = 0
    while True:
        value = tape_hash.get(index, '')

        # This is a modified Turing machine: it uses regexen
        # and has side effects.  Oh well, I needed IO.
        for rule in transitions[state]:
            regex, next, direction, new_value, side_effect = rule
            if re.match(regex, value):
                # Terminal states
                if new_value in ('YES', 'NO'):
                    return new_value

                tape_hash[index] = new_value
                side_effect(value)
                index += direction
                state = next
                break

assert 'YES' == turing('just another python hacker\n', [
    # This Turing machine recognizes the language of strings that end in \n.

    # Regex rule, next state, left/right = -1/+1, new value, side effect.
    [ # State 0:
        [r'^[a-z ]$', 0, +1, '', lambda x: print(x, end='')],
        [r'^\n$', 1, +1, '', lambda x: print(x, end='')],
        [r'^.*$', 0, +1, 'NO', None],
    ],
    [ # State 1:
        [r'^$', 1, -1, 'YES', None]
    ]
])

Obfuscation again consists of converting the above code into lambda functions using Y combinators. This is a nice programming exercise, so I’ve left it out of this post in case anyone wants to try. The Turing machine has to return 'YES' to indicate that it accepts the string, thus the assertion. Our final obfuscated JAPH is a single expression.

assert'''YES'''==(lambda g:(lambda f:g(lambda arg:f(f)(arg)))(lambda f:g(
lambda arg: f(f)(arg))))(lambda f: lambda q:[(lambda g:(lambda f:g(lambda
arg:f(f)(arg)))(lambda f: g(lambda arg:f(f)(arg))))(lambda f: lambda x:(x
[0][0]if x[0] and __import__('re').match(x[0][0][0],x[1])else f([x[0][1:]
,x[1]]))) ([q[3][q[1]],q[2].get(q[0],'')])[4](q[2].get(q[0],'')), (lambda
g:(lambda f:g(lambda arg:f(f)(arg))) (lambda f:g(lambda arg:f(f)(arg))))(
lambda f:lambda x:(x[0][0]if x[0] and __import__('re').match(x[0][0][0],x
[1])else f([x[0][1:],x[1]])))([q[3][q[1]],q[2].get(q[0],'')])[3]if(lambda
g:(lambda f:g(lambda arg:f(f)(arg))) (lambda f:g(lambda arg:f(f)(arg))))(
lambda f:lambda x:(x[0][0]if x[0]and __import__('re').match(x[0][0][0],x[
1]) else f([x[0][1:],x[1]])))([q[3][q[1]],q[2].get(q[0],'')])[3]in('YES',
'NO')else f([q[0]+(lambda g:(lambda f:g(lambda arg:f(f)(arg)))(lambda f:g
(lambda arg:f(f)(arg))))(lambda f:lambda x:(x[0][0]if x[0]and __import__(
're').match(x[0][0][0],x[1])else f([x[0][1:], x[1]])))([q[3][q[1]], q[2].
get(q[0],'')])[2],(lambda g:(lambda f:g(lambda arg: f(f)(arg)))(lambda f:
g(lambda arg:f(f)(arg))))(lambda f:lambda x:(x[0][0]if x[0]and __import__
('re').match(x[0][0][0],x[1])else f([x[0][1:], x[1]])))([q[3][q[1]],q[2].
get(q[0],'')])[1],q[2],q[3]])][1])([0,0,{i:x for i,x in enumerate('just '
'another python hacker\n')}, [[[r'^[a-z ]$',0,+1,'',lambda x:print(x,end=
'')], [r'^\n$',1,+1,'',lambda x:print(x, end='')],[r'^.*$',0,+1,'''NO''',
lambda x:None]], [[r'''^$''',+1,-1,'''YES''', lambda x: None or None]]]])
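
The post only shows the machine accepting its input. Here’s a sketch of the same transition table with the printing side effects stripped out, so we can also watch it reject. The function name accepts and the RULES constant are my own; the rules themselves are copied from the readable version above minus the lambdas.

```python
import re

def accepts(tape, transitions):
    """Run the machine without side effects and return 'YES' or 'NO'."""
    cells = dict(enumerate(tape))
    index, state = 0, 0
    while True:
        value = cells.get(index, '')   # blank cell past the end of the input
        for regex, next_state, direction, new_value in transitions[state]:
            if re.match(regex, value):
                if new_value in ('YES', 'NO'):
                    return new_value
                cells[index] = new_value
                index += direction
                state = next_state
                break

# Regex rule, next state, direction, new value
RULES = [
    [  # State 0: consume lowercase letters and spaces until a newline
        (r'^[a-z ]$', 0, +1, ''),
        (r'^\n$', 1, +1, ''),
        (r'^.*$', 0, +1, 'NO'),
    ],
    [  # State 1: the newline must be followed by a blank cell
        (r'^$', 1, -1, 'YES'),
    ],
]

print(accepts('hacker\n', RULES))
print(accepts('hacker', RULES))
print(accepts('Hacker\n', RULES))
```

The second input falls off the end of the tape without ever seeing a newline, and the third hits a character outside [a-z ], so both land on the catch-all rule.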

🐪 Reformed JAPHs: Huffman Coding

Thu, 14 Apr 2011 00:00:00 +0000

Note: This post was edited for clarity.

At this point, tricking python into printing strings via indirect means got a little boring. So I switched to obfuscating fundamental computer science algorithms. Here’s a JAPH that takes in a Huffman coded version of 'just another python hacker', decodes, and prints it.

# Build coding tree
def build_tree(scheme):
    if scheme.startswith('*'):
        left, scheme = build_tree(scheme[1:])
        right, scheme = build_tree(scheme)
        return (left, right), scheme
    else:
        return scheme[0], scheme[1:]

def decode(tree, encoded):
    ret = ''
    node = tree
    for direction in encoded:
        if direction == '0':
            node = node[0]
        else:
            node = node[1]
        if isinstance(node, str):
            ret += node
            node = tree
    return ret

tree = build_tree('*****ju*sp*er***yct* h**ka*no')[0]
print(
    decode(tree, bin(10627344201836243859174935587).lstrip('0b').zfill(103))
)

The decoding tree is like a LISP-style sequence of pairs. '*' represents a branch in the tree while other characters are leaf nodes. This looks like the following.

(
    (
        (
            (
                ('j', 'u'), 
                ('s', 'p')
            ), 
            ('e', 'r')
        ), 
        (
            (
                ('y', 'c'), 
                't'
            ), 
            (' ', 'h')
        )
    ), 
    (
        ('k', 'a'), 
        ('n', 'o')
    )
)

The actual Huffman coded version of our favorite string gets about 50% smaller represented in base-2.

0000000001000100101011010111011101010111001000110110000110100001010111111110011001111010100110000100011
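
The post only covers decoding, so here’s a sketch of the other direction: walking the tree to build a code table, then encoding the string. The code_table helper is my own addition (0 for a left branch, 1 for a right branch, matching decode); build_tree is repeated so the snippet runs on its own.

```python
def build_tree(scheme):
    """Rebuild the coding tree from its prefix-notation string."""
    if scheme.startswith('*'):
        left, scheme = build_tree(scheme[1:])
        right, scheme = build_tree(scheme)
        return (left, right), scheme
    return scheme[0], scheme[1:]

def code_table(node, prefix=''):
    """Map each leaf character to its path of 0s and 1s from the root."""
    if isinstance(node, str):
        return {node: prefix}
    table = {}
    table.update(code_table(node[0], prefix + '0'))
    table.update(code_table(node[1], prefix + '1'))
    return table

tree = build_tree('*****ju*sp*er***yct* h**ka*no')[0]
table = code_table(tree)
encoded = ''.join(table[ch] for ch in 'just another python hacker')
print(len(encoded))
print(encoded)
```

Frequent characters like 'a', 'n', and 'o' get 3-bit codes while rare ones like 'j' get 5 bits, which is how 26 characters fit in 103 bits.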

There’s a catch here, which is that this is hard to obfuscate unless we turn it into a single expression. This means that we have to convert build_tree and decode into lambda functions. Unfortunately, they are recursive, and lambda functions can’t refer to themselves by name, so they don’t recurse naturally. Fortunately, we can use Y combinators to get around the problem. These are worth some study since they will pop up again in future JAPHs.

Y = lambda g: (
    (lambda f: g(lambda arg: f(f)(arg)))
    (lambda f: g(lambda arg: f(f)(arg)))
)

build_tree = Y(
    lambda f: lambda scheme: (
        (f(scheme[1:])[0], f(f(scheme[1:])[1])[0]),
        f(f(scheme[1:])[1])[1]
    ) if scheme.startswith('*') else (scheme[0], scheme[1:])
)

decode = Y(lambda f: lambda x: x[3]+x[1] if not x[2] else (
    f([x[0], x[0], x[2], x[3]+x[1]]) if isinstance(x[1], str) else (
        f([x[0], x[1][0], x[2][1:], x[3]]) if x[2][0] == '0' else (
            f([x[0], x[1][1], x[2][1:], x[3]])
        )
    )
))

tree = build_tree('*****ju*sp*er***yct* h**ka*no')[0]
print(
    decode([
        tree,
        tree,
        bin(10627344201836243859174935587).lstrip('0b').zfill(103), ''
    ])
)

The final version is a condensed (and expanded, oddly) version of the above.

print((lambda t,e,s:(lambda g:(lambda f:g(lambda arg:f(f)(arg)))(lambda f:
g(lambda arg: f(f)(arg))))(lambda f:lambda x: x[3]+x[1]if not x[2]else f([
x[0],x[0],x[2],x[3]+x[1]])if isinstance(x[1],str)else f([x[0],x[1][0],x[2]
[1:],x[3]])if x[2][0]=='0'else f([x[0],x[1][1],x[2][1:],x[3]]))([t,t,e,s])
)((lambda g:(lambda f:g(lambda arg:f(f)(arg)))(lambda f:g(lambda arg:f(f)(
arg))))(lambda f:lambda p:((f(p[1:])[0],f(f(p[1:])[1])[0]),f(f(p[1:])[1])[
1])if p.startswith('*')else(p[0],p[1:]))('*****ju*sp*er***yct* h**ka*no')[
0],bin(10627344201836243859179756385-4820798).lstrip('0b').zfill(103),''))
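
If the Y combinator used in this post is unfamiliar, a smaller case may help. Here’s a sketch applying the same Y to factorial, which is my own example rather than anything from the JAPH. The function passed to Y receives f, a stand-in for "myself", and returns the function that does the real work.

```python
# Same combinator as in the post, written on one line
Y = lambda g: (lambda f: g(lambda arg: f(f)(arg)))(lambda f: g(lambda arg: f(f)(arg)))

# The lambda never names itself; recursion goes through the parameter f
factorial = Y(lambda f: lambda n: 1 if n == 0 else n * f(n - 1))

print(factorial(5))
```

The inner lambda arg: f(f)(arg) delays the self-application until the recursive call is actually made, which is what keeps Python from recursing forever while building the function.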

🐪 Reformed JAPHs: Rolling Effect

Mon, 11 Apr 2011 00:00:00 +0000

Note: This post was updated to work with Python 3.12. It may not work with different versions.

Here’s a JAPH composed solely for effect. For each letter in 'just another python hacker' it loops over each of the characters ' abcdefghijklmnopqrstuvwxyz', printing each. Between characters it pauses for 0.05 seconds, backing up and moving on to the next if it hasn’t reached the desired one yet. This achieves a sort of rolling effect by which the final string appears on our screen over time.

import string
import sys
import time

letters = ' ' + string.ascii_lowercase
for l in 'just another python hacker':
    for x in letters:
        print(x, end='')
        sys.stdout.flush()
        time.sleep(0.05)

        if x == l:
            break
        else:
            print('\b', end='')

print()

We locate and print each letter in the string with a list comprehension. At the end we have an extra line of code (the eval statement) that gives us our newline.

[[(lambda x,l:str(print(x,end=''))+str(__import__(print.
__doc__[print.__doc__.index('stdout') - 4:print.__doc__.
index('stdout')-1]).stdout.flush()) + str(__import__(''.
join(reversed('emit'))).sleep(0o5*1.01/0x64))+str(print(
'\b',end='\x09'.strip())if x!=l else'*&#'))(x1,l1)for x1
in('\x20'+getattr(__import__(type('phear').__name__+'in'
'g'),dir(__import__(type('snarf').__name__+'ing'))[15]))
[:('\x20'+getattr(__import__(type('smear').__name__+'in'
'g'),dir(__import__(type('slurp').__name__+'ing'))[15]))
.index(l1)+1]]for l1 in'''just another python hacker''']
eval('''\x20\x09eval("\x20\x09eval('\x20 print()')")''')

🐪 Reformed JAPHs: ROT13

Wed, 06 Apr 2011 00:00:00 +0000

Note: This post was updated to work with Python 3.12. It may not work with different versions.

No series of JAPHs would be complete without ROT13. This is the example through which aspiring Perl programmers learn to use tr and its synonym y. In Perl the basic ROT13 JAPH starts as:

$foo = 'whfg nabgure crey unpxre';
$foo =~ y/a-z/n-za-m/;
print $foo;

Python has nothing quite so elegant in its default namespace. However, this does give us the opportunity to explore a little used aspect of strings: the translate method. If we construct a dictionary of ordinals we can accomplish the same thing with a touch more effort.

import string

table = {
    ord(x): ord(y) for x, y in zip(
        string.ascii_lowercase,
        string.ascii_lowercase[13:] + string.ascii_lowercase
    )
}

print('whfg nabgure clguba unpxre'.translate(table))

We obfuscate the construction of this translation dictionary and, for added measure, use getattr to find the print function off of __builtins__. This will likely only work in Python 3.2, since the order of attributes on __builtins__ matters.

getattr(vars()[list(filter(lambda _:'\x5f\x62'in _,dir
()))[0]], dir(vars()[list(filter(lambda _:'\x5f\x62'in
_, dir()))[0]])[list(filter(lambda _:_ [1].startswith(
'\x70\x72'),enumerate(dir(vars()[list(filter(lambda _:
'\x5f\x62'in _,dir()))[0]]))))[0][0]])(getattr('whfg '
+'''nabgure clguba unpxre''', dir('0o52')[0o116])({ _:
(_-0o124) %0o32 +0o141 for _ in range(0o141, 0o173)}))
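
For comparison with the readable version earlier in this post, here’s a sketch of two other ways to get the same translation from the standard library: str.maketrans, which builds the ordinal dictionary for us, and the rot13 text codec. Neither appears in the original post; they’re just the closest Python gets to Perl’s y///.

```python
import codecs
import string

lower = string.ascii_lowercase

# maketrans pairs each character with the one 13 places further along
table = str.maketrans(lower, lower[13:] + lower[:13])
decoded = 'whfg nabgure clguba unpxre'.translate(table)

# ROT13 is its own inverse, so encoding also decodes
via_codec = codecs.encode('whfg nabgure clguba unpxre', 'rot13')

print(decoded)
print(via_codec)
```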

🐪 Reformed JAPHs: Ridiculous Anagram

Sun, 03 Apr 2011 00:00:00 +0000

Here’s the second in my reformed JAPH series. It takes an anagram of 'just another python hacker' and converts it prior to printing. It sorts the anagram by the indices of another string, in order of their associated characters. This is sort of like a pre-digested Schwartzian transform.

x = 'upjohn tehran hectors katy'
y = '1D0HG6JFO9P5ICKAM87B24NL3E'

print(''.join(x[i] for i in sorted(range(len(x)), key=lambda p: y[p])))
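
The sort key has to come from somewhere. Here’s one way it could be generated, sketched under the assumption that we label each position of the anagram with the rank its character should have in the target. The make_key helper is hypothetical, and because repeated letters can be assigned in more than one way, it won’t necessarily reproduce the exact y above, only one that sorts to the same result.

```python
import string

def make_key(anagram, target):
    """Label each anagram position with a symbol giving its rank in target."""
    symbols = string.digits + string.ascii_uppercase   # ascending in ASCII
    if sorted(anagram) != sorted(target) or len(target) > len(symbols):
        raise ValueError('need an anagram of at most %d characters' % len(symbols))
    key = [None] * len(anagram)
    for rank, ch in enumerate(target):
        # Take the first unused position that holds this character
        pos = next(
            i for i, c in enumerate(anagram) if c == ch and key[i] is None
        )
        key[pos] = symbols[rank]
    return ''.join(key)

x = 'upjohn tehran hectors katy'
y = make_key(x, 'just another python hacker')
restored = ''.join(x[i] for i in sorted(range(len(x)), key=lambda p: y[p]))
print(y)
print(restored)
```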

Obfuscation consists mostly of using silly machinations to construct the string we use to sort the anagram.

print(''.join('''upjohn tehran hectors katy'''[_]for _ in sorted(range
(26),key=lambda p:(hex(29)[2:].upper()+str(3*3*3*3-3**4)+'HG'+str(sum(
range(4)))+'JFO'+str((1+2)**(1+1))+'P'+str(35/7)[:1]+'i.c.k.'.replace(
'.','').upper()+'AM'+str(3**2*sum(range(5))-3)+hex(0o5444)[2:].replace
(*'\x62|\x42'.split('|'))+'NL'+hex(0o076).split('x')[1].upper())[p])))

🐪 Reformed JAPHs: Alphabetic Indexing

Fri, 01 Apr 2011 00:00:00 +0000

Note: This post was edited for clarity.

Many years ago, I was a Perl programmer. Then one day I became disillusioned at the progress of Perl 6 and decided to import this.

This seems to be a fairly common story for Perl to Python converts. While I haven’t looked back much, there are a number of things I really miss about perl (lower case intentional). I miss having value types in a dynamic language, magical and ill-advised use of cryptocontext, and sometimes even pseudohashes because they were inexcusably weird. A language that supports so many ideas out of the box enables an extended learning curve that lasts for many years. “Perl itself is the game.”

Most of all I think I miss writing Perl poetry and JAPHs. Sadly, I didn’t keep any of those I wrote, and I’m not competent enough with the language anymore to write interesting ones. At the time I was intentionally distancing myself from a model that was largely implicit and based on archaic systems internals and moving to one that was (supposedly) explicit and simple.

After switching to Python as my primary language, I used the following email signature in a nod to this change in orientation (intended for Python 2):

print 'just another python hacker'

Recently I’ve been experimenting with writing JAPHs in Python. I think of these as “reformed JAPHs.” They accomplish the same purpose as programming exercises but in a more restricted context. In some ways they are more challenging. Creativity can be difficult in a narrowly defined landscape.

I have written a small series of reformed JAPHs which increase monotonically in complexity. Here is the first one, written in plain understandable Python 3.

import string

letters = string.ascii_lowercase + ' '
indices = [
     9, 20, 18, 19, 26,  0, 13, 14, 19, 7,  4, 17, 26,
    15, 24, 19,  7, 14, 13, 26,  7,  0, 2, 10,  4, 17
]

print(''.join(letters[i] for i in indices))

This is fairly simple. Instead of explicitly embedding the string 'just another python hacker' in the program, we assemble it using the index of its letters in the string 'abcdefghijklmnopqrstuvwxyz '. We then obfuscate through a series of minor measures:

Instead of calling the print function, we import sys and make a call to sys.stdout.write.
We assemble string.ascii_lowercase + ' ' by joining together the character versions of its respective ordinal values (97 through 122, and 32).
We join together the integer indices using 'l' and split that into a list.
We apply ''' liberally and rely on the fact that python concatenates adjacent strings.
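
Before they get mangled, the middle two steps above can be sketched in plain Python. The variable names here are mine; the values are the ones the obfuscated version ends up with.

```python
# Build 'abcdefghijklmnopqrstuvwxyz ' from ordinals instead of a literal
letters = ''.join(map(chr, range(97, 123))) + chr(32)

# Pack the indices into one string using 'l' as the separator...
packed = 'l'.join(
    str(letters.index(c)) for c in 'just another python hacker'
)

# ...and unpack them again the way the obfuscated code does
indices = [int(i) for i in packed.split('l')]

print(packed)
print(''.join(letters[i] for i in indices))
```

Using 'l' as the separator is purely cosmetic: next to all those 1s it makes the packed string harder to read at a glance.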

Here’s the obfuscated version:

eval("__import__('''\x73''''''\x79''''''\x73''').sTdOuT".lower()
).write(''.join(map(lambda _:(list(map(chr,range(97,123)))+[chr(
32)])[int(_)],('''9l20l18l19''''''l26l0l13l14l19l7l4l17l26l15'''
'''l24l19l7l14l1''''''3l26l7l0l2l10l4l17''').split('l')))+'\n',)

We could certainly do more, but that’s where I left this one. Stay tuned for the next JAPH.

📈 Simulating GDP Growth

Wed, 23 Feb 2011 00:00:00 +0000

I hope you saw “China’s way to the top” on the Post’s website recently. It’s a very clear presentation of their statement and is certainly worth a look.

So say you’re an economist and you actually do need to produce a realistic estimate of when China’s GDP surpasses that of the USA. Can you use such an approach? Not really. There are several simplifying assumptions the Post made that are perfectly reasonable. However, if the goal is an analytical output from a highly random system such as GDP growth, one should not assume the inputs are fixed. (I’m not saying I have any gripe with their interactive. This post has a different purpose.)

Why is this? The short answer is that randomness in any system can change its output drastically from one run to the next. Even if the mean from a deterministic analysis is correct, it tells us nothing about the variance of our output. We really need a confidence interval of years when China is likely to overtake the USA.

We’ll move in the great tradition of all simulation studies. First we prepare our input. A CSV of GDP in current US dollars for both countries from 1960 to 2009 is available from the World Bank data files. We read this into a data frame and calculate their growth rates year over year. Note that the first value for growth has to be NA.

gdp <- read.csv('gdp.csv')
gdp$USA.growth <- rep(NA, length(gdp$USA))
gdp$China.growth <- rep(NA, length(gdp$China))
for (i in 2:length(gdp$USA)) {
  gdp$USA.growth[i] <- 100 * (gdp$USA[i] - gdp$USA[i-1]) / gdp$USA[i-1]
  gdp$China.growth[i] <- 100 * (gdp$China[i] - gdp$China[i-1]) / gdp$China[i-1]
}
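
For readers who don’t speak R, here’s the same year-over-year calculation sketched in Python. The yoy_growth helper is hypothetical, and the three-value series is a toy input picked so the arithmetic is easy to check by hand. It is not World Bank data.

```python
def yoy_growth(values):
    """Percent change from each value to the next; the first entry has none."""
    growth = [None]   # plays the role of NA for the first year
    for prev, cur in zip(values, values[1:]):
        growth.append(100 * (cur - prev) / prev)
    return growth

toy_series = [100.0, 110.0, 121.0]   # made-up values growing 10% a year
print(yoy_growth(toy_series))
```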

We now analyze our inputs and assign probability distributions to the annual growth rates. In a full study this would involve comparing a number of different distributions and choosing the one that fits the input data best, but that’s well beyond the scope of this post. Instead, we’ll use the poor man’s way out: plot histograms and visually verify what we hope to be true, that the distributions are normal.

And they pretty much are. That’s good enough for our purposes. Now all we need are the distribution parameters, which are mean and standard deviation for normal distributions.

> mean(gdp$USA.growth[!is.na(gdp$USA.growth)])
[1] 7.00594

> sd(gdp$USA.growth[!is.na(gdp$USA.growth)])
[1] 2.889808

> mean(gdp$China.growth[!is.na(gdp$China.growth)])
[1] 9.90896

> sd(gdp$China.growth[!is.na(gdp$China.growth)])
[1] 10.5712

Now our input analysis is done. These are the inputs:

$$ \begin{align*} \text{USA Growth} &\sim \mathcal{N}(7.00594, 2.889808^2)\\ \text{China Growth} &\sim \mathcal{N}(9.90896, 10.5712^2) \end{align*} $$

This should make the advantage of such an approach much more obvious. Compare the standard deviations for the two countries. China is a lot more likely to have negative GDP growth in any given year. They’re also more likely to have astronomical growth.
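
We can put rough numbers on that comparison. Here’s a sketch using Python’s statistics.NormalDist with the fitted parameters above to get the probability of negative growth in a single year. It leans entirely on the normality assumption, so treat the output as illustrative.

```python
from statistics import NormalDist

# Fitted annual growth distributions (mean and standard deviation, in percent)
usa = NormalDist(mu=7.00594, sigma=2.889808)
china = NormalDist(mu=9.90896, sigma=10.5712)

# Probability that growth in any given year is below zero
p_usa_negative = usa.cdf(0)
p_china_negative = china.cdf(0)

print(round(p_usa_negative, 4), round(p_china_negative, 4))
```

Under these assumptions the USA comes out under 1% and China at roughly 17%, which is the gap the standard deviations hint at.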

We now build and run our simulation study. The more times we run the simulation the tighter we can make our confidence interval (to a point), so we’ll pick a pretty big number somewhat arbitrarily. If we want to, we can be fairly scientific about determining how many iterations are necessary after we’ve done some runs, but we have to start somewhere.

repetitions <- 10000

This is the code for our simulation. For each iteration, it starts both countries at their 2009 GDPs. It then iterates, changing GDP randomly until China’s GDP is at least the same value as the USA’s. When that happens, it records the current year.

results <- rep(NA, repetitions)
for (i in 1:repetitions) {
  usa <- gdp$USA[length(gdp$USA)]
  china <- gdp$China[length(gdp$China)]
  year <- gdp$Year[length(gdp$Year)]

  while (TRUE) {
    year <- year + 1

    usa.growth <- rnorm(1, 7.00594, 2.889808)
    china.growth <- rnorm(1, 9.90896, 10.5712)

    usa <- usa * (1 + (usa.growth / 100))
    china <- china * (1 + (china.growth / 100))

    if (china >= usa) {
      results[i] <- year
      break
     }
  }
}

From the results vector we see that, given the data and assumptions for this model, China should surpass the USA in 2058. We also see that we can be 95% confident that the mean year this will happen is between 2057 and 2059. This is not quite the same as saying we are confident this will actually happen between those years. The result of our simulation is a probability distribution and we are discovering information about it.

> mean(results)
[1] 2058.494

> mean(results) + (sd(results) / sqrt(length(results)) * qnorm(0.025))
[1] 2057.873

> mean(results) + (sd(results) / sqrt(length(results)) * qnorm(0.975))
[1] 2059.114
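
The two interval calculations above are the usual normal-approximation confidence interval for a mean. Here’s a sketch of the same formula in Python, assuming a hypothetical helper mean_confidence_interval and a handful of made-up years standing in for the results vector.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def mean_confidence_interval(samples, level=0.95):
    """Normal-approximation confidence interval for the mean of samples."""
    z = NormalDist().inv_cdf(0.5 + level / 2)   # like qnorm(0.975) for 95%
    half_width = z * stdev(samples) / sqrt(len(samples))
    center = mean(samples)
    return center - half_width, center + half_width

toy_results = [2055, 2057, 2058, 2060, 2065]   # placeholder years, not real output
low, high = mean_confidence_interval(toy_results)
print(low, high)
```

Working backwards from the R output, a half-width of about 0.62 years over 10,000 runs implies a standard deviation of roughly 32 years between individual runs. That is why the interval for the mean is so much tighter than the spread of years in which the crossover might actually happen.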

So what’s wrong with this model? Well, we had to make a number of assumptions:

We assume we actually used the right data set. This was more of a how-to than a proper analysis, so that wasn’t too much of a concern.
We assume future growth for the next 40-50 years resembles past growth from 1960-2009. This is a bit ridiculous, of course, but that’s the problem with forecasting.
We assume growth is normally distributed and that we don’t encounter heavy-tailed behaviors in our distributions.
We assume each year’s growth is independent of the year before it. See the last exercise.

Here are some good simulation exercises if you’re looking to do more:

Note how the outputs are quite a bit different from the Post graphic. I expect that’s largely due to the inclusion of data back to 1960. Try running the simulation for yourself using just the past 10, 20, and 30 years and see how that changes the result.
Write a simulation to determine the probability China’s GDP surpasses the USA’s in the next 25 years. Now plot the mean GDP and 95% confidence intervals for each country per year.
Assume that there are actually two distributions for growth for each country: one when the previous year had positive growth and another when it was negative. How does that change the output?

🧐 Data Fitting 2a - Very, Very Simple Linear Regression in R

Wed, 16 Feb 2011 00:00:00 +0000

Note: This post was updated to include an example data file.

I thought it might be useful to follow up the last post with another one showing the same examples in R.

R provides a function called lm, which is similar in spirit to NumPy’s linalg.lstsq. As you’ll see, lm’s interface is a bit more tuned to the concepts of modeling.

We begin by reading in the example CSV into a data frame:

responses <- read.csv('example_data.csv')
responses

  respondent vanilla.love strawberry.love chocolate.love dog.love cat.love
1     Aylssa            9               4              9        9        9
2       Ben8            8               6              4       10        4
3         Cy            9               4              8        2        6
4        Eva            3               7              9        4        6
5        Lem            6               8              5        2        5
6      Louis            4               5              3       10        3

A data frame is sort of like a matrix, but with named columns that we can refer to using the dollar sign. We are now ready to run least squares. We’ll create the model for predicting “dog love.” To create the “cat love” model, simply use that column name instead:

fit1 <- lm(
  responses$dog.love ~ responses$vanilla.love +
                       responses$strawberry.love + 
                       responses$chocolate.love
)

The syntax for lm is a little off-putting at first. This call tells it to create a model for “dog love” with respect to (the ~) a function of the form offset + x1 * vanilla love + x2 * strawberry love + x3 * chocolate love. Note that the offset is conveniently implied when using lm, so this is the same as the second model we created in Python. Now that we’ve computed the coefficients for our “dog love” model, we can ask R about it:

summary(fit1)

Call:
lm(formula = responses$dog.love ~ responses$vanilla.love + responses$strawberry.love + 
    responses$chocolate.love)

Residuals:
      1       2       3       4       5       6 
 3.1827  2.9436 -4.5820  0.8069 -1.9856 -0.3657 

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)
(Intercept)                20.9298    15.0654   1.389    0.299
responses$vanilla.love     -0.2783     0.9934  -0.280    0.806
responses$strawberry.love  -1.4314     1.5905  -0.900    0.463
responses$chocolate.love   -0.7647     0.8214  -0.931    0.450

Residual standard error: 4.718 on 2 degrees of freedom
Multiple R-squared:  0.4206, Adjusted R-squared:  -0.4485 
F-statistic: 0.484 on 3 and 2 DF,  p-value: 0.7272

This gives us quite a bit of information, including the coefficients for our “dog love” model and various error metrics. You can find the offset and coefficients under the Estimate column above. We quickly verify this using R’s vectorized arithmetic:

20.9298 - 
  0.2783 * responses$vanilla.love -
  1.4314 * responses$strawberry.love -
  0.7647 * responses$chocolate.love

[1]  5.8172  7.0562  6.5819  3.1928  3.9853 10.3655

You’ll notice the model is essentially the same as the one we got from NumPy. Our next step is to add in the squared inputs. We do this by adding extra terms to the modeling formula. The I() function tells R to evaluate its argument as ordinary arithmetic rather than as formula syntax, where operators such as ^ have a special meaning. That’s how we accomplish the squaring. We could alternatively add squared input values to the data frame, but using I() is more convenient and natural.

fit2 <- lm(responses$dog.love ~ responses$vanilla.love +
  I(responses$vanilla.love^2) + responses$strawberry.love +
  I(responses$strawberry.love^2) + responses$chocolate.love +
  I(responses$chocolate.love^2))

summary(fit2)

Call:
lm(formula = responses$dog.love ~ responses$vanilla.love + I(responses$vanilla.love^2) + 
    responses$strawberry.love + I(responses$strawberry.love^2) + 
    responses$chocolate.love + I(responses$chocolate.love^2))

Residuals:
ALL 6 residuals are 0: no residual degrees of freedom!

Coefficients: (1 not defined because of singularities)
                               Estimate Std. Error t value Pr(>|t|)
(Intercept)                    -357.444        NaN     NaN      NaN
responses$vanilla.love           72.444        NaN     NaN      NaN
I(responses$vanilla.love^2)      -6.111        NaN     NaN      NaN
responses$strawberry.love        59.500        NaN     NaN      NaN
I(responses$strawberry.love^2)   -5.722        NaN     NaN      NaN
responses$chocolate.love          7.000        NaN     NaN      NaN
I(responses$chocolate.love^2)        NA         NA      NA       NA

Residual standard error: NaN on 0 degrees of freedom
Multiple R-squared:      1, Adjusted R-squared:    NaN 
F-statistic:   NaN on 5 and 0 DF,  p-value: NA

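We can see that we get the same fitted values for “dog love” as the third Python version of the last post, although the coefficients differ. With seven coefficients and only six records the system is underdetermined: R drops the last term (the NA row above), while NumPy’s lstsq returns the minimum-norm solution. Again, we quickly verify that the output matches the observed data (minus some rounding errors):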

-357.444 + 
  72.444 * responses$vanilla.love -
  6.111 * responses$vanilla.love^2 +
  59.5 * responses$strawberry.love -
  5.722 * responses$strawberry.love^2 +
  7 * responses$chocolate.love

[1]  9.009 10.012  2.009  4.011  2.016 10.006

🧐 Data Fitting 2 - Very, Very Simple Linear Regression in Python

Tue, 15 Feb 2011 00:00:00 +0000

This post is based on a memo I sent to some former colleagues at the Post. I’ve edited it for use here since it fits well as the second in a series on simple data fitting techniques. If you’re among the many enlightened individuals already using regression analysis, then this post is probably not for you. If you aren’t, then hopefully this provides everything you need to develop rudimentary predictive models that yield surprising levels of accuracy.

Data

For purposes of a simple working example, we have collected six records of input data over three dimensions with the goal of predicting two outputs. The input data are:

$$ \begin{align*} x_1 &= \text{How much a respondent likes vanilla [0-10]}\\ x_2 &= \text{How much a respondent likes strawberry [0-10]}\\ x_3 &= \text{How much a respondent likes chocolate [0-10]} \end{align*} $$

Output data consist of:

$$ \begin{align*} b_1 &= \text{How much a respondent likes dogs [0-10]}\\ b_2 &= \text{How much a respondent likes cats [0-10]} \end{align*} $$

Below are anonymous data collected from a random sample of people.

respondent	vanilla ❤️	strawberry ❤️	chocolate ❤️	dog ❤️	cat ❤️
Alyssa P Hacker	9	4	9	9	9
Ben Bitdiddle	8	6	4	10	4
Cy D. Fect	9	4	8	2	6
Eva Lu Ator	3	7	9	4	6
Lem E. Tweakit	6	8	5	2	5
Louis Reasoner	4	5	3	10	3

Our input is in three dimensions. Each output requires its own model, so we’ll have one for dogs and one for cats. We’re looking for functions, dog(x) and cat(x), that can predict $b_1$ and $b_2$ based on given values of $x_1$, $x_2$, and $x_3$.

Model 1

For both models we want to find parameters that minimize their squared residuals (read: errors). There are a number of names for this. Optimization folks like to think of it as unconstrained quadratic optimization, but it’s more common to call it least squares or linear regression. It’s not necessary to entirely understand why for our purposes, but the closed-form solution that minimizes these errors is:

$$\beta = ({A^t}A)^{-1}{A^t}b$$

This is implemented for you in the numpy.linalg Python package, which we’ll use for examples. Much more information than you probably want can be found here.

Below is a first stab at a Python version. It runs least squares against our input and output data exactly as they are. You can see the matrix $A$ and outputs $b_1$ and $b_2$ (dog and cat love, respectively) are represented just as they are in the table.

# Version 1: No offset, no squared inputs

import numpy

A = numpy.vstack([
    [9, 4, 9],
    [8, 6, 4],
    [9, 4, 8],
    [3, 7, 9],
    [6, 8, 5],
    [4, 5, 3]
])

b1 = numpy.array([9, 10, 2, 4, 2, 10])
b2 = numpy.array([9, 4, 6, 6, 5, 3])

print('dog ❤️:', numpy.linalg.lstsq(A, b1, rcond=None)[0])
print('cat ❤️:', numpy.linalg.lstsq(A, b2, rcond=None)[0])

# Output:
# dog ❤️: [0.72548294      0.53045642     -0.29952361]
# cat ❤️: [2.36110929e-01  2.61934385e-05  6.26892476e-01]

The resulting model is:

dog(x) = 0.72548294 * x1 + 0.53045642 * x2 - 0.29952361 * x3
cat(x) = 2.36110929e-01 * x1 + 2.61934385e-05 * x2 + 6.26892476e-01 * x3

The coefficients before our variables correspond to beta in the formula above. Errors between observed and predicted data, shown below, are calculated and summed. For these six records, dog(x) has a total error of 20.76 and cat(x) has 3.74. Not great.

respondent	predicted b1	b1 error	predicted b2	b2 error
Alyssa P Hacker	5.96	3.04	7.77	1.23
Ben Bitdiddle	7.79	2.21	4.40	0.40
Cy D. Fect	6.25	4.25	7.14	1.14
Eva Lu Ator	3.19	0.81	6.35	0.35
Lem E. Tweakit	7.10	5.10	4.55	0.45
Louis Reasoner	4.66	5.34	2.83	0.17
Total error:		20.76		3.74
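
The error table above can be reproduced with a few lines of plain Python. This is only a sketch: it hard-codes the coefficients printed by Version 1 and the observed data from that code, and the helper names `predict` and `total_abs_error` are made up for illustration.

```python
# Observed inputs (vanilla, strawberry, chocolate), as used in Version 1
INPUTS = [
    (9, 4, 9), (8, 6, 4), (9, 4, 8),
    (3, 7, 9), (6, 8, 5), (4, 5, 3),
]
DOG_LOVE = [9, 10, 2, 4, 2, 10]
CAT_LOVE = [9, 4, 6, 6, 5, 3]

# Coefficients printed by Version 1
DOG_COEFS = [0.72548294, 0.53045642, -0.29952361]
CAT_COEFS = [2.36110929e-01, 2.61934385e-05, 6.26892476e-01]


def predict(coefs, row):
    '''Dot product of the coefficients and one input row.'''
    return sum(c * x for c, x in zip(coefs, row))


def total_abs_error(coefs, observed):
    '''Sum of |observed - predicted| over all six records.'''
    return sum(
        abs(b - predict(coefs, row))
        for row, b in zip(INPUTS, observed)
    )


if __name__ == '__main__':
    print('dog total error: %.2f' % total_abs_error(DOG_COEFS, DOG_LOVE))
    print('cat total error: %.2f' % total_abs_error(CAT_COEFS, CAT_LOVE))
```

Note that the table sums absolute errors, while least squares minimizes the sum of squared errors, so the two totals need not move in lockstep from one model to the next.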

Model 2

One problem with this model is that dog(x) and cat(x) are forced to pass through the origin. (Why is that?) We can improve it somewhat if we add an offset. This amounts to prepending 1 to every row in $A$ and adding a constant to the resulting functions. You can see the very slight difference between the code for this model and that of the previous:

# Version 2: Offset, no squared inputs

import numpy

A = numpy.vstack([
    [1, 9, 4, 9],
    [1, 8, 6, 4],
    [1, 9, 4, 8],
    [1, 3, 7, 9],
    [1, 6, 8, 5],
    [1, 4, 5, 3]
])

print('dog ❤️:', numpy.linalg.lstsq(A, b1, rcond=None)[0])
print('cat ❤️:', numpy.linalg.lstsq(A, b2, rcond=None)[0])

# Output:
# dog ❤️: [20.92975427  -0.27831197  -1.43135684  -0.76469017]
# cat ❤️: [-0.31744124   0.25133547   0.02978098   0.63394765]

This yields the second version of our models:

dog(x) = 20.92975427 - 0.27831197 * x1 - 1.43135684 * x2 - 0.76469017 * x3
cat(x) = -0.31744124 + 0.25133547 * x1 + 0.02978098 * x2 + 0.63394765 * x3

These models provide errors of 13.87 and 3.79. A little better on the dog side, but still not quite usable.

respondent	predicted b1	b1 error	predicted b2	b2 error
Alyssa P Hacker	5.82	3.18	7.77	1.23
Ben Bitdiddle	7.06	2.94	4.41	0.41
Cy D. Fect	6.58	4.58	7.14	1.14
Eva Lu Ator	3.19	0.81	6.35	0.35
Lem E. Tweakit	3.99	1.99	4.60	0.40
Louis Reasoner	10.37	0.37	2.74	0.26
Total error:		13.87		3.79

Model 3

The problem is that dog(x) and cat(x) are linear functions. Most observed data don’t conform to straight lines. Take a moment and draw the line $f(x) = x$ and the curve $f(x) = x^2$. The former makes a poor approximation of the latter.

Most of the time, people just use squares of the input data to add curvature to their models. We do this in our next version of the code by just adding squares of the input row values to our $A$ matrix. Everything else is the same. (In reality, you can add any function of the input data you feel best models the data, if you understand it well enough.)

# Version 3: Offset with squared inputs

import numpy

A = numpy.vstack([
    [1, 9, 9**2, 4, 4**2, 9, 9**2],
    [1, 8, 8**2, 6, 6**2, 4, 4**2],
    [1, 9, 9**2, 4, 4**2, 8, 8**2],
    [1, 3, 3**2, 7, 7**2, 9, 9**2],
    [1, 6, 6**2, 8, 8**2, 5, 5**2],
    [1, 4, 4**2, 5, 5**2, 3, 3**2]
])

b1 = numpy.array([9, 10, 2, 4, 2, 10])
b2 = numpy.array([9, 4, 6, 6, 5, 3])

print('dog ❤️:', numpy.linalg.lstsq(A, b1, rcond=None)[0])
print('cat ❤️:', numpy.linalg.lstsq(A, b2, rcond=None)[0])

# dog ❤️: [1.29368307  7.03633306  -0.44795498  9.98093332
#  -0.75689575  -19.00757486  1.52985734]
# cat ❤️: [0.47945896  5.30866067  -0.39644128 -1.28704188
#   0.12634295   -4.32392606  0.43081918]

This gives us our final version of the model:

dog(x) = 1.29368307 + 7.03633306 * x1 - 0.44795498 * x1**2 + 9.98093332 * x2 - 0.75689575 * x2**2 - 19.00757486 * x3 + 1.52985734 * x3**2
cat(x) = 0.47945896 + 5.30866067 * x1 - 0.39644128 * x1**2 - 1.28704188 * x2 + 0.12634295 * x2**2 - 4.32392606 * x3 + 0.43081918 * x3**2

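Adding curvature to our model eliminates all perceived error, at least within 1e-16. This may seem unbelievable, but when you consider that we only have six input records, it isn’t really: the model now has seven coefficients with which to fit six records, so it can pass through every observation exactly.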

respondent	predicted b1	predicted b2
Alyssa P Hacker	9	9
Ben Bitdiddle	10	4
Cy D. Fect	2	6
Eva Lu Ator	4	6
Lem E. Tweakit	2	5
Louis Reasoner	10	3
Total error:
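
The Version 3 matrix does not have to be typed out by hand. Below is a minimal sketch of building it from the raw inputs; `design_row` is a hypothetical helper name, not something from NumPy. It also makes the record and coefficient counts explicit.

```python
# Raw inputs (vanilla, strawberry, chocolate) for the six respondents
INPUTS = [
    (9, 4, 9), (8, 6, 4), (9, 4, 8),
    (3, 7, 9), (6, 8, 5), (4, 5, 3),
]


def design_row(values):
    '''Returns [1, x1, x1**2, x2, x2**2, ...] for one record.'''
    row = [1]  # offset term
    for v in values:
        row.extend([v, v ** 2])
    return row


A = [design_row(record) for record in INPUTS]

if __name__ == '__main__':
    print(len(A), 'records,', len(A[0]), 'coefficients')
    for row in A:
        print(row)
```

The resulting list of lists can be passed to numpy.vstack exactly as in the hand-written version.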

It should be fairly obvious how one can take this and extrapolate to much larger models. I hope this is useful and that least squares becomes an important part of your lives.

🗳 Off the Cuff Voter Fraud Detection

Tue, 30 Nov 2010 00:00:00 +0000

Consider this scenario: You run a contest that accepts votes from the general Internet population. In order to encourage user engagement, you record any and all votes into a database over several days, storing nothing more than the competitor voted for, when each vote is cast, and a cookie set on the voter’s computer along with their apparent IP addresses. If a voter already has a recorded cookie set they are denied subsequent votes. This way you can avoid requiring site registration, a huge turnoff for your users. Simple enough.

Unfortunately, some of the competitors are wily and attached to the idea of winning. They go so far as programming or hiring bots to cast thousands of votes for them. Your manager wants to know which votes are real and which ones are fake Right Now. Given very limited time, and ignoring actions that you could have taken to avoid the problem, how can you tell apart sets of good votes from those that shouldn’t be counted?

One quick-and-dirty option involves comparing histograms of interarrival times for sets of votes. Say you’re concerned that all the votes during a particular period of time or from a given IP address might be fraudulent. Put all the vote times you’re concerned about into a list, sort them, and compute their differences:

# times is a list of datetime instances from vote records
times.sort()  # ascending, so each difference below is nonnegative
interarrivals = [y - x for x, y in zip(times, times[1:])]

Now use matplotlib to display a histogram of these. Votes that occur naturally are likely to resemble an exponential distribution in their interarrival times. For instance, here are interarrival times for all votes received in a contest:

This subset of votes is clearly fraudulent, due to the near determinism of their interarrival times. This is most likely caused by the voting bot not taking random sleep intervals during voting. It casts a vote, receives a response, clears its cookies, and repeats:

These votes, on the other hand, are most likely legitimate. They exhibit a nice Erlang shape and appear to have natural interarrival times that one would expect:

Of course this method is woefully inadequate for rigorous detection of voting fraud. Ideally one would find a method to compute the probability that a set of votes is generated by a bot. This is enough to inform quick, ad hoc decisions though.
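
For readers without matplotlib handy, the same idea can be sketched with the standard library alone. Everything here is illustrative: the vote times are synthetic (exponential gaps with a 30 second mean), and `interarrival_seconds` and `text_histogram` are made-up helper names, not part of any voting system.

```python
import random
from collections import Counter
from datetime import datetime, timedelta


def interarrival_seconds(times):
    '''Gaps between consecutive vote times (after sorting), in seconds.'''
    ordered = sorted(times)
    return [
        (later - earlier).total_seconds()
        for earlier, later in zip(ordered, ordered[1:])
    ]


def text_histogram(gaps, bucket_width):
    '''Counts gaps per bucket; keys are bucket lower bounds in seconds.'''
    return Counter(int(g // bucket_width) * bucket_width for g in gaps)


# Synthetic data for illustration only: exponential gaps, mean 30 seconds.
rng = random.Random(42)
now = datetime(2010, 11, 30, 12, 0, 0)
times = []
for _ in range(200):
    now += timedelta(seconds=rng.expovariate(1 / 30))
    times.append(now)
rng.shuffle(times)  # vote records need not be stored in time order

gaps = interarrival_seconds(times)
for lower, count in sorted(text_histogram(gaps, 10).items()):
    print('%4d-%4d s | %s' % (lower, lower + 10, '#' * count))
```

A bot that votes at a fixed pace would pile nearly all of its gaps into one or two buckets, while organic traffic should tail off gradually.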

🧐 Data Fitting 1 - Linear Data Fitting

Tue, 23 Nov 2010 00:00:00 +0000

Note: This post was updated to work with Python 3 and PySCIPOpt. The original version used Python 2 and python-zibopt.

Data fitting is one of those tasks that everyone should have at least some exposure to. Certainly developers and analysts will benefit from a working knowledge of its fundamentals and their implementations. However, in my own reading I’ve found it difficult to locate good examples that are simple enough to pick up quickly and come with accompanying source code.

This article commences an ongoing series introducing basic data fitting techniques. With any luck they won’t be overly complex, while still being useful enough to get the point across with a real example and real data. We’ll start with a binary classification problem: presented with a series of records, each containing a set number of input values describing it, determine whether or not each record exhibits some property.

Model

We’ll use the cancer1.dt data from the proben1 set of test cases, which you can download here. Each record starts with 9 data points containing physical characteristics of a tumor. The second to last data point contains 1 if a tumor is benign and 0 if it is malignant. We seek to find a linear function we can run on an arbitrary record that will return a value greater than zero if that record’s tumor is predicted to be benign and less than zero if it is predicted to be malignant. We will train our linear model on the first 350 records, and test it for accuracy on the remaining rows.

This is similar to the data fitting problem found in Chvatal. Our inputs consist of a matrix of observed data, $A$, and a vector of classifications, $b$. In order to classify a record, we require another vector $x$ such that the dot product of $x$ and that record will be either greater or less than zero depending on its predicted classification.

A couple points to note before we start:

Most observed data are noisy. This means it may be impossible to locate a hyperplane that cleanly separates given records of one type from another. In this case, we must resort to finding a function that minimizes our predictive error. For the purposes of this example, we’ll minimize the sum of the absolute differences between the observed and predicted values. That is, we seek $x$ that attains $\min \sum_i{|a_i^T x-b_i|}$.
The slope-intercept form of a line, $f(x)=m^T x+b$, contains an offset. It should be obvious that this is necessary in our model so that our function isn’t required to pass through the origin. Thus, we’ll be adding an extra variable with the coefficient of 1 to represent our offset value.

In order to model this, we use two linear constraints for each absolute value. We minimize the sum of these. Our Linear Programming model thus looks like:

$$ \begin{align*} \min\quad & z = \sum_i{v_i}\\ \text{s.t.}\quad& v_i \geq x_0 + a_i^\intercal x - 1 &\quad\forall&\quad\text{benign tumors}\\ & v_i \geq 1 - x_0 - a_i^\intercal x &\quad\forall&\quad\text{benign tumors}\\ & v_i \geq x_0 + a_i^\intercal x - (-1) &\quad\forall&\quad\text{malignant tumors}\\ & v_i \geq -1 - x_0 - a_i^\intercal x &\quad\forall&\quad\text{malignant tumors} \end{align*} $$

Code

In order to do this in Python, we use SCIP and SoPlex. We start by setting constants for benign and malignant outputs and providing a function to read in the training and testing data sets.

# Preferred output values for tumor categories
BENIGN = 1
MALIGNANT = -1

def read_proben1_cancer_data(filename, train_size):
    '''Loads a proben1 cancer file into train & test sets'''
    # Number of input data points per record
    DATA_POINTS = 9

    train_data = []
    test_data = []

    with open(filename) as infile:
        # Read in the first train_size lines to a training data list, and the
        # others to testing data. This allows us to test how general our model
        # is on something other than the input data.
        for line in infile.readlines()[7:]: # skip header
            line = line.split()

            # Inputs are the first DATA_POINTS values. The offset (x0) is
            # prepended later, in the main block.
            inputs = [float(x) for x in line[:DATA_POINTS]]
            output = BENIGN if line[-2] == '1' else MALIGNANT
            record = {'input': inputs, 'output': output}

            # Determine what data set to put this in
            if len(train_data) >= train_size:
                test_data.append(record)
            else:
                train_data.append(record)

    return train_data, test_data

The next function implements the LP model described above using SoPlex and SCIP. It minimizes the sum of residuals for each training record. This amounts to summing the absolute value of the difference between predicted and observed output data. The following function takes in input and observed output data and returns a list of coefficients. Our resulting model consists of taking the dot product of an input record and these coefficients. If the result is greater than or equal to zero, that record is predicted to be a benign tumor, otherwise it is predicted to be malignant.

from pyscipopt import Model

def train_linear_model(train_data):
    '''
    Accepts a set of input training data with known output
    values.  Returns a list of coefficients to apply to
    arbitrary records for purposes of binary categorization.
    '''
    # Make sure we have at least one training record.
    assert len(train_data) > 0
    num_variables = len(train_data[0]['input'])

    # Variables are coefficients in front of the data points. It is important
    # that these be unrestricted in sign so they can take negative values.
    m = Model()
    x = [m.addVar(f'x{i}', lb=None) for i in range(num_variables)]

    # Residual for each data row
    residuals = [m.addVar(lb=None, ub=None) for _ in train_data]
    for r, d in zip(residuals, train_data):
        # r will be the absolute value of the difference between observed and
        # predicted values. We can model absolute values such as r >= |foo| as:
        #
        #   r >=  foo
        #   r >= -foo
        predicted = sum(xi * a for xi, a in zip(x, d['input']))
        m.addCons(predicted + r >= d['output'])
        m.addCons(predicted - r <= d['output'])

    # Find and return coefficients that min sum of residuals.
    m.setObjective(sum(residuals))
    m.setMinimize()
    m.optimize()

    solution = m.getBestSol()
    return [solution[xi] for xi in x]

We also provide a convenience function for counting the number of correct predictions by our resulting model against either the test or training data sets.

def count_correct(data_set, coefficients):
    '''Returns the number of correct predictions.'''
    correct = 0
    for d in data_set:
        result = sum(x*y for x, y in zip(coefficients, d['input']))

        # Do we predict the same as the output?
        if (result >= 0) == (d['output'] >= 0):
            correct += 1

    return correct

Finally we write a main method to read in the data, build our linear model, and test its efficacy.

from pprint import pprint

if __name__ == '__main__':
    # Specs for this input file
    INPUT_FILE_NAME = 'cancer1.dt'
    TRAIN_SIZE = 350

    train_data, test_data = read_proben1_cancer_data(
        INPUT_FILE_NAME,
        TRAIN_SIZE
    )

    # Add the offset variable to each of our data records
    for data_set in [train_data, test_data]:
        for row in data_set:
            row['input'] = [1] + row['input']

    coefficients = train_linear_model(train_data)
    print('coefficients:')
    pprint(coefficients)

    # Print % of correct predictions for each data set
    correct = count_correct(train_data, coefficients)
    print(
        '%s / %s = %.02f%% correct on training set' % (
            correct, len(train_data),
            100 * float(correct) / len(train_data)
        )
    )

    correct = count_correct(test_data, coefficients)
    print(
        '%s / %s = %.02f%% correct on testing set' % (
            correct, len(test_data),
            100 * float(correct) / len(test_data)
        )
    )

Results

The result of running this model against the cancer1.dt data set is:

coefficients:
[1.4072882449702786,
 -0.14014055927954652,
 -0.6239513714263405,
 -0.26727681774258882,
 0.067107753841131157,
 -0.28300216102808429,
 -1.0355594670918404,
 -0.22774451038152174,
 -0.69871243677663608,
 -0.072575089848659444]
328 / 350 = 93.71% correct on training set
336 / 349 = 96.28% correct on testing set

The accuracy is pretty good here against both the training and testing sets, so this particular model generalizes well. This is about the simplest model we can implement for data fitting, and we’ll get to more complicated ones later, but it’s nice to see we can do so well so quickly. The coefficients correspond to using a function of this form, rounding off to three decimal places:

$$ \begin{align*} f(x) =\ &1.407 - 0.140 x_1 - 0.624 x_2 - 0.267 x_3 + 0.067 x_4 - \\ &0.283 x_5 - 1.036 x_6 - 0.228 x_7 - 0.699 x_8 - 0.073 x_9 \end{align*} $$
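
Applying the fitted function to a new record is just a dot product and a sign check. The sketch below uses the printed coefficients rounded to three decimal places; the two records are hypothetical extremes chosen for illustration, not rows from cancer1.dt, and `classify` is a made-up name.

```python
# Offset followed by the nine coefficients, rounded from the results above
OFFSET = 1.407
COEFFICIENTS = [
    -0.140, -0.624, -0.267, 0.067, -0.283,
    -1.036, -0.228, -0.699, -0.073,
]


def f(record):
    '''Evaluates the fitted linear function on one nine-value record.'''
    return OFFSET + sum(c * v for c, v in zip(COEFFICIENTS, record))


def classify(record):
    '''Nonnegative scores are predicted benign, negative ones malignant.'''
    return 'benign' if f(record) >= 0 else 'malignant'


if __name__ == '__main__':
    # Hypothetical records: every measurement at 0, then every one at 1
    for record in ([0.0] * 9, [1.0] * 9):
        print('%.3f %s' % (f(record), classify(record)))
```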

Resources

cancer1.dt data file from proben1
Full source listing

🐍 Monte Carlo Simulation in Python

Thu, 08 Oct 2009 00:00:00 +0000

Note: This post was updated to work with Python 3.

One of the most useful tools one learns in an Operations Research curriculum is Monte Carlo Simulation. Its utility lies in its simplicity: one can learn vital information about nearly any process, be it deterministic or stochastic, without wading through the grunt work of finding an analytical solution. It can be used for off-the-cuff estimates or as a proper scientific tool. All one needs to know is how to simulate a given process and its appropriate probability distributions and parameters if that process is stochastic.

Here’s how it works:

Construct a simulation that, given input values, returns a value of interest. This could be a pure quantity, like time spent waiting for a bus, or a boolean indicating whether or not a particular event occurs.
Run the simulation a large number of times, each time with randomly generated input variables. Record its output values.
Compute sample mean and variance of the output values.

In the case of time spent waiting for a bus, the sample mean and variance are estimators of the mean and variance of one’s wait time. In the boolean case, the sample mean estimates the probability that the given event will occur.

One can think of Monte Carlo Simulation as throwing darts. Say you want to find the area under a curve without integrating. All you must do is draw the curve on a wall and throw darts at it randomly. After you’ve thrown enough darts, the area under the curve can be approximated using the percentage of darts that end up under the curve times the total area.
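
The dart-throwing picture translates almost directly into code. Here is a minimal sketch, assuming the curve is nonnegative and fits inside a known bounding rectangle; `area_under_curve` and its parameters are names invented for this example.

```python
import random


def area_under_curve(f, x_max, y_max, darts, seed=0):
    '''
    Estimates the area under f on [0, x_max] by throwing darts uniformly
    at the rectangle [0, x_max] x [0, y_max]. Assumes 0 <= f(x) <= y_max.
    '''
    rng = random.Random(seed)
    hits = 0
    for _ in range(darts):
        x = rng.uniform(0, x_max)
        y = rng.uniform(0, y_max)
        if y <= f(x):
            hits += 1

    # Fraction of darts under the curve, times the area of the wall
    return (hits / darts) * (x_max * y_max)


if __name__ == '__main__':
    # The exact area under x**2 on [0, 1] is 1/3.
    print('%.3f' % area_under_curve(lambda x: x * x, 1.0, 1.0, 100000))
```

The estimate tightens slowly: halving the error takes roughly four times as many darts.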

This technique is often performed using a spreadsheet, but that can be a bit clunky and may make more complex simulations difficult. I’d like to spend a minute showing how it can be done in Python. Consider the following scenario:

Passengers for a train arrive according to a Poisson process with a mean of 100 per hour. The time until the next train arrives is exponentially distributed with a rate of 5 per hour. How many passengers will be aboard the train?

We can simulate this using the fact that a Poisson process can be represented as a string of events occurring with exponential inter-arrival times. We use the sim() function below to generate the number of passengers for random instances of the problem. We then compute sample mean and variance for these values.

import random

PASSENGERS = 100.0
TRAINS     =   5.0
ITERATIONS = 10000

def sim():
    passengers = 0.0

    # Determine when the train arrives
    train = random.expovariate(TRAINS)

    # Count the number of passenger arrivals before the train
    now = 0.0
    while True:
        now += random.expovariate(PASSENGERS)
        if now >= train:
            break
        passengers += 1.0

    return passengers

if __name__ == '__main__':
    output = [sim() for _ in range(ITERATIONS)]

    total = sum(output)
    mean = total / len(output)

    sum_sqrs = sum(x*x for x in output)
    variance = (sum_sqrs - total * mean) / (len(output) - 1)

    print('E[X] = %.02f' % mean)
    print('Var(X) = %.02f' % variance)
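
This particular scenario is simple enough to check by hand, which gives a target for the simulation's output. The notation is introduced here only for the check: $N$ is the number of passengers, $T$ the train's arrival time, $\lambda = 100$ the passenger rate, and $\mu = 5$ the train rate. Conditioning on $T$, the passenger count is Poisson with mean $\lambda T$, so:

$$ \begin{align*} E[N] &= E[E[N \mid T]] = \lambda E[T] = \frac{\lambda}{\mu} = 20\\ \text{Var}(N) &= E[\text{Var}(N \mid T)] + \text{Var}(E[N \mid T]) = \lambda E[T] + \lambda^2 \text{Var}(T) = \frac{\lambda}{\mu} + \frac{\lambda^2}{\mu^2} = 420 \end{align*} $$

With 10,000 iterations the printed sample mean and variance should land close to these values, though not exactly on them.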

⚡️ On the Beauty of Power Sets

Fri, 27 Feb 2009 00:00:00 +0000

One of the difficulties we encounter in solving the Traveling Salesman Problem (TSP) is that, for even a small number of cities, a complete description of the problem requires an exponential number of constraints. This is apparent in the standard formulation used to teach the TSP to OR students. Consider a set of $n$ cities with the distance from city $i$ to city $j$ denoted $d_{ij}$. We attempt to minimize the total distance of a tour entering and leaving each city exactly once. We set $x_{ij} = 1$ if the edge from city $i$ to city $j$ is included in the tour, $0$ otherwise:

$$ \small \begin{align*} \min\quad & z = \sum_i \sum_{j\ne i} d_{ij} x_{ij}\\ \text{s.t.}\quad& \sum_{j\ne i} x_{ij} = 1 &\quad\forall&\ i & \text{leave each city once}\\ & \sum_{i\ne j} x_{ij} = 1 &\quad\forall&\ j & \text{enter each city once}\\ & x_{ij} \in \{0,1\} &\quad\forall&\ i,j \end{align*} $$

This appears to be a reasonable formulation until we solve it and see that our solution contains disconnected subtours. Suppose we have four cities, labeled $A$ through $D$. Connecting $A$ to $B$, $B$ to $A$, $C$ to $D$ and $D$ to $C$ provides a feasible solution to our formulation, but does not constitute a cycle. Here is a more concrete example of two disconnected subtours $\{(1,5),(5,1)\}$ and $\{(2,3),(3,4),(4,2)\}$ over five cities:

ampl: display x;
x [*,*]
:   1   2   3   4   5    :=
1   0   0   0   0   1
2   0   0   1   0   0
3   0   0   0   1   0
4   0   1   0   0   0
5   1   0   0   0   0
;

Realizing we just solved the Assignment Problem, we now add subtour elimination constraints. These require that any proper, non-empty subset $S$ of our $n$ cities is connected by at most $|S|-1$ active edges:

$$ \sum_{i \in S} \sum_{j \in S} x_{ij} \leq |S|-1 \quad\forall\ S \subset \{1, \ldots, n\}, S \ne \emptyset $$

Indexing subtour elimination constraints over a power set of the cities completes the formulation. However, this requires an additional $\sum_{k=2}^{n-1} \binom{n}{k}$ rows tacked on the end of our matrix and is clearly impractical for large $n$. The largest instance current computers can handle with this approach has around 19 cities. It remains an instructive tool for understanding the combinatorial explosion that occurs in problems like TSP and is worth translating into a modeling language. So how does one model it on a computer?
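
Before turning to the modeling question, it helps to see how quickly that row count grows. The sketch below simply evaluates the sum; `subtour_constraint_count` is a name chosen for this example.

```python
from math import comb


def subtour_constraint_count(n):
    '''Number of subsets of n cities with between 2 and n-1 members.'''
    return sum(comb(n, k) for k in range(2, n))


if __name__ == '__main__':
    for n in (4, 10, 19, 30):
        print(n, subtour_constraint_count(n))
```

The sum works out to $2^n - n - 2$, so every additional city roughly doubles the number of rows.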

Unfortunately, AMPL, the gold standard in mathematical modeling languages, is unable to index over sets. Creating a power set in AMPL requires going through a few contortions. The following code demonstrates power and index sets over four cities:

set cities := 1 .. 4 ordered;

param n := card(cities);
set indices := 0 .. (2^n - 1);
set power {i in indices} := {c in cities: (i div 2^(ord(c) - 1)) mod 2 = 1};

display cities;
display n;
display indices;
display power;

This yields the following output:

set cities := 1 2 3 4;

n = 4

set indices := 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15;

set power[0] := ; # empty
set power[1] := 1;
set power[2] := 2;
set power[3] := 1 2;
set power[4] := 3;
set power[5] := 1 3;
set power[6] := 2 3;
set power[7] := 1 2 3;
set power[8] := 4;
set power[9] := 1 4;
set power[10] := 2 4;
set power[11] := 1 2 4;
set power[12] := 3 4;
set power[13] := 1 3 4;
set power[14] := 2 3 4;
set power[15] := 1 2 3 4;
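
The same bit trick reads a little more plainly outside of AMPL. The following is a rough Python analogue of the set expression above, not a translation of AMPL semantics: subset i contains a city exactly when the corresponding binary digit of i is 1. `power_set` is a made-up name.

```python
def power_set(cities):
    '''
    Returns a list of subsets, where subset i contains the city at
    position pos (counting from 0) exactly when bit pos of i is set.
    '''
    n = len(cities)
    return [
        [c for pos, c in enumerate(cities) if (i // 2 ** pos) % 2 == 1]
        for i in range(2 ** n)
    ]


power = power_set([1, 2, 3, 4])

if __name__ == '__main__':
    for i, subset in enumerate(power):
        print('power[%d] = %s' % (i, subset))
```

Here `pos` plays the role of `ord(c) - 1`: integer division by `2 ** pos` shifts the bit of interest into the lowest position, and the remainder modulo 2 tests it.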

Note how the index set contains an index for each row in our power set. We can now generate the subtour elimination constraints:

var x {cities cross cities} binary;
s.t. subtours {i in indices: card(power[i]) > 1 and card(power[i]) < card(cities)}:
sum {(c,k) in power[i] cross power[i]: k != c} x[c,k] <= card(power[i]) - 1;

expand subtours;

subject to subtours[3]:  x[1,2] + x[2,1] <= 1;
subject to subtours[5]:  x[1,3] + x[3,1] <= 1;
subject to subtours[6]:  x[2,3] + x[3,2] <= 1;
subject to subtours[7]:  x[1,2] + x[1,3] + x[2,1] + x[2,3] + x[3,1] + x[3,2] <= 2;
subject to subtours[9]:  x[1,4] + x[4,1] <= 1;
subject to subtours[10]: x[2,4] + x[4,2] <= 1;
subject to subtours[11]: x[1,2] + x[1,4] + x[2,1] + x[2,4] + x[4,1] + x[4,2] <= 2;
subject to subtours[12]: x[3,4] + x[4,3] <= 1;
subject to subtours[13]: x[1,3] + x[1,4] + x[3,1] + x[3,4] + x[4,1] + x[4,3] <= 2;
subject to subtours[14]: x[2,3] + x[2,4] + x[3,2] + x[3,4] + x[4,2] + x[4,3] <= 2;

While this does work, the code for generating the power set looks like voodoo. Understanding it required piece-by-piece decomposition, an exercise I suggest you go through yourself if you have a copy of AMPL and 15 minutes to spare:

set foo {c in cities} := {ord(c)};
set bar {c in cities} := {2^(ord(c) - 1)};
set baz {i in indices} := {c in cities: i div 2^(ord(c) - 1)};
set qux {i in indices} := {c in cities: (i div 2^(ord(c) - 1)) mod 2 = 1};

display foo;
display bar;
display baz;
display qux;

This may be an instance where open source leads commercial software. The good folks who produce the SCIP Optimization Suite provide an AMPL-like language called ZIMPL with a few additional useful features. One of these is power sets. Compared to the code above, doesn’t this look refreshing?

set cities := {1 to 4};

set power[] := powerset(cities);
set indices := indexset(power);

📐 Uncapacitated Lot Sizing

Fri, 20 Feb 2009 00:00:00 +0000

Uncapacitated Lot Sizing (ULS) is a classic OR problem that seeks to minimize the cost of satisfying known demand for a product over time. Demand is subject to varying costs for production, set-up, and storage of the product. Technically, it is a mixed binary integer linear program – the key point separating it from the world of linear optimization being that production cannot occur during any period without paying that period’s fixed costs for set-up. Thus it has linear nonnegative variables for production and storage amounts during each period, and a binary variable for each period that determines whether or not production can actually occur.

For $n$ periods with per-period fixed set-up cost $f_t$, unit production cost $p_t$, unit storage cost $h_t$, and demand $d_t$, we define decision variables related to production and storage quantities:

$$ \small \begin{align*} x_t &= \text{units produced in period}\ t\\ s_t &= \text{stock at the end of period}\ t\\ y_t &= 1\ \text{if production occurs in period}\ t, 0\ \text{otherwise} \end{align*} $$

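One can minimize overall cost for satisfying all demand on time using the following model per Wolsey (1998), defined slightly differently here. $M$ is a sufficiently large constant, so the constraint $x_t \leq M y_t$ only permits production in periods where $y_t = 1$: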

$$ \small \begin{align*} \min\quad & z = \sum_t{p_t x_t} + \sum_t{h_t s_t} + \sum_t{f_t y_t}\\ \text{s.t.}\quad& x_1 = d_1 + s_1\\ & s_{t-1} + x_t = d_t + s_t &\quad\forall&\ t > 1\\ & x_t \leq M y_t &\quad\forall&\ t\\ & s_t, x_t \geq 0 &\quad\forall&\ t\\ & y_t \in \{0,1\} &\quad\forall&\ t \end{align*} $$
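To make the constraints concrete, here is a small Python sketch that checks a candidate plan against the model and totals its cost. It is not a solver. It assumes no initial stock ($s_0 = 0$), and the function name `evaluate_plan` and the tiny data set are invented for illustration.

```python
def evaluate_plan(x, y, d, p, h, f, M):
    """Return the total cost of a plan, or None if it breaks a constraint."""
    stock = 0  # assumption: no initial stock, s_0 = 0
    cost = 0
    for t in range(len(d)):
        # x_t >= 0, y_t binary, and production only when set-up is paid
        if x[t] < 0 or y[t] not in (0, 1) or x[t] > M * y[t]:
            return None
        stock = stock + x[t] - d[t]  # balance: s_{t-1} + x_t = d_t + s_t
        if stock < 0:  # s_t >= 0, demand must be met on time
            return None
        cost += p[t] * x[t] + h[t] * stock + f[t] * y[t]
    return cost


# Hypothetical 3-period instance.
d = [3, 2, 4]
p = [1, 1, 1]
h = [1, 1, 1]
f = [10, 10, 10]
M = sum(d)  # total demand is large enough for this instance

produce_once = evaluate_plan([9, 0, 0], [1, 0, 0], d, p, h, f, M)
produce_each = evaluate_plan([3, 2, 4], [1, 1, 1], d, p, h, f, M)
too_late = evaluate_plan([0, 5, 4], [0, 1, 1], d, p, h, f, M)
print(produce_once, produce_each, too_late)
```

With these numbers, producing everything up front trades one set-up cost for extra storage, producing every period does the opposite, and the third plan is rejected because period 1 demand goes unmet.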

According to Wolsey, page 11, given that $s_t = \sum_{i=1}^t (x_i - d_i)$ and defining new constants $K = \sum_{t=1}^n h_t(\sum_{i=1}^t d_i)$ and $c_t = p_t + \sum_{i=t}^n h_i$, the objective function can be rewritten as $z = \sum_t c_t x_t + \sum _t f_t y_t - K$. The book lacks a proof of this and it seems a bit non-obvious, so I attempt an explanation in somewhat painstaking detail here.

$$ \small \begin{align*} &\text{Proof}:\\ & & \sum_t p_t x_t + \sum_t h_t s_t + \sum_t f_t y_t &= \sum_t c_t x_t + \sum _t f_t y_t - K\\ &\text{1. Remove} \sum_t f_t y_t:\\ & & \sum_t p_t x_t + \sum_t h_t s_t &= \sum_t c_t x_t - K\\ &\text{2. Expand}\ K\ \text{and}\ c_t:\\ & & \sum_t p_t x_t + \sum_t h_t s_t &= \sum_t (p_t + \sum_{i=t}^n h_i) x_t - \sum_t h_t (\sum_{i=1}^t d_i)\\ &\text{3. Remove}\ \sum_t p_t x_t:\\ & & \sum_t h_t s_t &= \sum_t x_t (\sum_{i=t}^n h_i) - \sum_t h_t (\sum_{i=1}^t d_i)\\ &\text{4. Expand}\ s_t:\\ & & \sum_t h_t (\sum_{i=1}^t x_i) - \sum_t h_t (\sum_{i=1}^t d_i) &= \sum_t x_t (\sum_{i=t}^n h_i) - \sum_t h_t (\sum_{i=1}^t d_i)\\ &\text{5. Remove}\ \sum_t h_t (\sum_{i=1}^t d_i):\\ & & \sum_t h_t (\sum_{i=1}^t x_i) &= \sum_t x_t (\sum_{i=t}^n h_i) \end{align*} $$

The result from step 5 becomes obvious upon expanding its left and right-hand terms:

$$ h_1 x_1 + h_2 (x_1 + x_2) + \cdots + h_n (x_1 + \cdots + x_n) =\\ x_1 (h_1 + \cdots + h_n) + x_2 (h_2 + \cdots + h_n) + \cdots + x_n h_n $$

In matrix notation with $h$ and $x$ as column vectors in $\mathbf{R}^n$ and $L$ and $U$ being $n \times n$ lower and upper triangular matrices of ones, respectively, this can be written as:

$$ \small \begin{pmatrix} h_1 & \cdots & h_n \end{pmatrix} \begin{pmatrix} 1 & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 1 & \cdots & 1 \end{pmatrix} \begin{pmatrix} x_1 \\ \vdots \\ x_n \end{pmatrix} = \begin{pmatrix} x_1 & \cdots & x_n \end{pmatrix} \begin{pmatrix} 1 & \cdots & 1 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & 1 \end{pmatrix} \begin{pmatrix} h_1 \\ \vdots \\ h_n \end{pmatrix} $$

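or $h^T L x = x^T U h$. This holds because $U = L^T$ and $h^T L x$ is a scalar, so it equals its own transpose $x^T L^T h$.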
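As a sanity check on the algebra, the following Python sketch evaluates both forms of the objective, and both sides of the step 5 identity, on arbitrary integer data. The data is random and purely illustrative. Since the identity is algebraic, the sketch does not bother keeping $s_t$ nonnegative.

```python
import random

random.seed(0)
n = 6
p = [random.randint(1, 9) for _ in range(n)]   # unit production costs
h = [random.randint(1, 9) for _ in range(n)]   # unit storage costs
f = [random.randint(1, 9) for _ in range(n)]   # fixed set-up costs
d = [random.randint(0, 9) for _ in range(n)]   # demands
x = [random.randint(0, 20) for _ in range(n)]  # arbitrary production amounts
y = [1 if x_t > 0 else 0 for x_t in x]

# s_t = sum_{i=1}^t (x_i - d_i); may be negative here, which is fine
# for checking the identity but would be infeasible in the model.
s = [sum(x[: t + 1]) - sum(d[: t + 1]) for t in range(n)]

original = sum(p[t] * x[t] + h[t] * s[t] + f[t] * y[t] for t in range(n))

K = sum(h[t] * sum(d[: t + 1]) for t in range(n))
c = [p[t] + sum(h[t:]) for t in range(n)]
rewritten = sum(c[t] * x[t] + f[t] * y[t] for t in range(n)) - K

lhs = sum(h[t] * sum(x[: t + 1]) for t in range(n))  # h^T L x
rhs = sum(x[t] * sum(h[t:]) for t in range(n))       # x^T U h

print(original == rewritten, lhs == rhs)
```

Everything is integer arithmetic, so the comparisons are exact rather than approximate.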