
Monday, February 13, 2012

Transactional Memory in Intel Haswell: The Good, and a Possible Ugly

Note: Part of this post is being updated based on new information received after it was published. Please check back tomorrow, 2/18/12, for an update. 


Sorry... no update yet (2/21/12). Bad cold. Also seduction / travails of a new computer. Update soon.

I asked James Reinders at the Fall 2011 IDF whether the synchronization features he had from the X86 architecture were adequate for MIC. (Transcript; see the very end.) He said that they were good for the 30 or so cores in Knight’s Ferry, but when you got above 40, they would need to do something different.

Now Intel has announced support for transactional memory in Haswell, the chip which follows their Ivy Bridge chip that is just starting shipment as I write this. So I think I’d now be willing to take bets that this is what James was vaguely hinting at, and will appear in Intel’s MIC HPC architecture as it ships in Knight’s Corner product. I prefer to take bets on sure things.

There has been some light discussion of Intel’s “Transactional Synchronization Extensions” (TSE), as this is formally called, and a good example of its use from James Reinders. But now that an architecture spec has been published for TSE, we can get a bit deeper into what, exactly, Intel is providing, and where there might just be a skeleton in the closet.

First, background: What is this “transactional memory” stuff? Why is it useful? Then we’ll get into what Intel has, and the skeleton I believe is lurking.

Transactional Memory


The term “transaction” comes from contract law, was picked up by banking, and from there went into database systems. It refers to a collection of actions that all happen as a unit; they cannot be divided. If I give you money and you give me a property deed, for example, that happens as if it were one action – a transaction. The two parts can’t be (legally) separated; both happen, or neither. Or, in the standard database example: when I transfer money from my bank account to Aunt Sadie’s, the subtraction from my account and the addition to hers must either both happen, or neither; otherwise money is being either destroyed or created, which would be a bad thing. As you might imagine, databases have evolved a robust technology to do transactions where all the changes wind up on stable storage (disk, flash).

The notion of transactional memory is much the same: a collection of changes to memory is made all-or-nothing: Either all of them happen, as seen by every thread, process, processor, or whatever; or none of them happen. So, for example, when some program plays with the pointers in a linked list to insert or delete some list member, nobody can get in there when the update is partially done and follow some pointer to oblivion.

It applies as much to a collection of accesses – reads – as it does to changes – writes. The read side is necessary to ensure that a consistent collection of information is read and acted upon by entities that may be looking around while another is updating.

To do this, typically a program will issue something meaning “Transaction On!” to start the ball rolling. Once that’s done, everything it writes is withheld from view by all other entities in the system; and anything it reads is put on special monitoring in case someone else mucks with it. The cache coherence hardware is mostly re-used to make this monitoring work; cross-system memory monitoring is what cache coherence does.

This continues, accumulating things read and written, until the program issues something meaning “Transaction Off!”, typically called “Commit!” Then, hraglblargarolfargahglug! All changes are vomited at once into memory, and the locations read are forgotten about.

What happens if some other entity does poke its nose into those saved and monitored locations, changing something the transactor was depending on or modifying? Well, “Transaction On!” was really “Transaction On! And, by the way, if anything screws up go there.” On reaching there, all the recording of data read and changes made has been thrown away; and there is a block of code that usually sets things up to start again, going back to the “Transaction On!” point. (The code could also decide “forget it” and not try over again.) Quitting like this in a controlled manner is called aborting a transaction. It is obviously better if aborts don’t happen a lot, just like it’s better if a lock is not subject to a lot of contention. However, note that nobody else has seen any of the changes made since “On!”, so half-mangled data structures are never seen by anyone.

Why Transactions Are a Good Thing


What makes transactional semantics potentially more efficient than simple locking is that only the memory locations actually read or written at run time are kept consistent during the transaction. The consistency does not apply to memory locations that could be referenced, only to the ones that actually are referenced.

There are situations where that’s a distinction without a difference, since everybody who gets into some particular transaction-ized section of code will bash on exactly the same data every time. Example: A global counter of how many times some operation has been done by all the processors in a system. Transactions aren’t any better than locks in those situations.

But there are cases where the dynamic nature of transactional semantics can be a huge benefit. The standard example, also used by James Reinders, is a multi-access hash table, with inserts, deletions, and lookups done by many processes, etc.

I won’t go through this in detail – you can read James’ version if you like; he has a nice diagram of a hash table, which I don’t – but consider:

With the usual lock semantics, you could simply have one coarse lock around the whole table: Only one user, reader or writer, gets in at any time. This works, and is simple, but all access to the table is now serialized, which will cause a problem as you scale to more processors.

Alternatively, you could have a lock per hash bucket, for fine-grained rather than coarse locking. That’s a lot of locks. They take up storage, and maintaining them all correctly gets more complex.

Or you could do either of those – one lock, or many – but also get out your old textbooks and try once again to understand those multiple reader / single writer algorithms and their implications, and, by the way, did you want reader or writer priority? Painful and error-prone.

On the other hand, suppose everybody – readers and writers – simply says “Transaction On!” (I keep wanting to write “Flame On!”) before starting a read or a write; then does a “Commit!” when they exit. This is only as complicated as the single coarse lock (and sounds a lot like an “atomic” keyword on a class, hint hint).

Then what you can bank on is that the probability is tiny that two simultaneous accesses will look at the same hash bucket; if that probability is not tiny, you need a bigger hash table anyway. The most likely thing to happen is that nobody – readers or writers – ever accesses the same hash bucket, so everybody just sails right through, “Commit!”s, and continues, all in parallel, with no serialization at all. (Not really. See the skeleton, later.)

In the unlikely event that a reader and a writer are working on the same bucket at the same time, whoever “Commit!”s first wins; the other aborts and tries again. Since this is highly unlikely, overall the transactional version of hashing is a big win: it’s both simple and very highly parallel.

Transactional memory is not, of course, the only way to skin this particular cat. Azul Systems has published a detailed presentation on a Lock-Free Wait-Free Hash Table algorithm that uses only compare-and-swap instructions. I got lost somewhere around the fourth state diagram. (Well, OK, actually I saw the first one and kind of gave up.) Azul has need of such things. Among other things, they sell massive Java compute appliances, going up to the Azul Vega 3 7380D, which has 864 processors sharing 768GB of RAM. Think investment banks: take that, you massively recomplicated proprietary version of a Black-Scholes option pricing model! In Java! (Those guys don’t just buy GPUs.)

However, Azul only needs that algorithm on their port of their software stack to X86-based products. Their Vega systems are based on their own proprietary 54-core Vega processors, which have shipped with transactional memory – which they call Speculative Multi-address Atomicity – since the first system shipped in 2005 (information from Gil Tene, Azul Systems CTO). So, all these notions are not exactly new news.

Anyway, if you want this wait-free super-parallel hash table (and other things, obviously) without exploding your head, transactional memory makes it possible rather simply.

What Intel Has: RTM and HLE


Intel’s Transactional Synchronization Extensions come in two flavors: Restricted Transactional Memory (RTM) and Hardware Lock Elision (HLE).

RTM is essentially what I described above: There’s XBEGIN for “Transaction On!”, XEND for “Commit!”, and XABORT if you want to manually toss in the towel for some reason. XBEGIN must be given a “there” location to go to in case of an abort. When an abort occurs, the processor state is restored to what it was at XBEGIN, except that flags are set indicating the reason for the abort (in EAX).
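To make that concrete, here is a minimal sketch of the pattern using the _xbegin() / _xend() / _xabort() intrinsics that newer compilers expose for these instructions in immintrin.h (GCC needs -mrtm). The shared counter and the fallback spinlock are my own invented illustration, not anything from Intel’s spec; the fallback path is there because any transaction can abort, so you always need some other way to make progress.

    #include <immintrin.h>
    
    static volatile int fallback_lock;   /* 0 = free, 1 = held; both variables invented for illustration */
    static long counter;
    
    void transactional_increment(void)
    {
        unsigned status = _xbegin();          /* "Transaction On!" -- also names the abort target */
        if (status == _XBEGIN_STARTED) {
            if (fallback_lock)                /* read the lock so a real lock-holder forces an abort */
                _xabort(0xff);
            counter++;                        /* buffered write; nobody sees it until commit */
            _xend();                          /* "Commit!" -- all changes become visible at once */
            return;
        }
        /* Abort path: 'status' (delivered in EAX) says why; fall back to the plain spinlock. */
        while (__sync_lock_test_and_set(&fallback_lock, 1))
            _mm_pause();
        counter++;
        __sync_lock_release(&fallback_lock);
    }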

HLE is a bit different. All the documentation I’ve seen so far always talks about it first, perhaps because it seems like it is more familiar, or they want to brag (no question, it’s clever). I obviously think that’s confusing, so didn’t do it in that order.

HLE lets you take your existing, lock-based code and transactional-memory-ify it: Lock-based code now runs without blocking unless required, as in the hash table example, with a minimal, minuscule change that can probably be done with a compiler and the right flag.

I feel like adding “And… at no time did their fingers leave their hands!” It sounds like a magic trick.

In addition to being magical, it’s also clearly strategic for Intel’s MIC and its Knights SDK HPC accelerators. Those are making a heavy bet on people just wanting to recompile and run without the rewrites needed for accelerators like GPGPUs. (See my post MIC and the Knights.)

HLE works by setting a new instruction prefix – XACQUIRE – on any instruction you use to try to acquire a lock. Doing so causes there to be no change to the lock data: the lock write is “elided.” Instead it (a) takes a checkpoint of the machine state; (b) saves the address of the instruction that did this; (c) puts the lock location in the set of data that is transactionally read; and (d) does a “Transaction On!”

So everybody goes charging right through the lock without stopping, but now every location read is continually monitored, and every write is saved, not appearing in memory.

If nobody steps on anybody else’s feet – writes someone else’s monitored location – then when the instruction to release the lock is done, it uses an XRELEASE prefix. This does a “Commit!” hraglblargarolfargahglug flush of all the saved writes into memory, forgets everything monitored, and turns off transaction mode.

If somebody does write a location someone else has read, then we get an ABORT with its wayback machine: back to the location that tried to acquire the lock, restoring the CPU state, so everything is like it was just before the lock acquisition instruction was done. This time, though, the write is not elided: The usual semantics apply, and the code goes through exactly what it did without TSE, the way it worked before.
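To see just how minuscule the change can be, here is a minimal sketch of an HLE-ified spinlock using the HLE-flagged atomic builtins that newer GCC versions provide (compile with -mhle). The lock variable and the spin loop are my own illustration, not Intel’s or anyone’s library code; the point is that the only difference from an ordinary spinlock is the two HLE flags, and on hardware without TSE the prefixes are simply ignored and you get the ordinary spinlock back.

    #include <immintrin.h>   /* _mm_pause */
    
    static volatile int lock_var;   /* 0 = free, 1 = held; illustrative only */
    
    void hle_lock(void)
    {
        /* XACQUIRE prefix on the locking write: the write is elided and a transaction begins. */
        while (__atomic_exchange_n(&lock_var, 1, __ATOMIC_ACQUIRE | __ATOMIC_HLE_ACQUIRE))
            _mm_pause();    /* spinning here also ends a failed elision attempt */
    }
    
    void hle_unlock(void)
    {
        /* XRELEASE prefix on the unlocking write: "Commit!" -- buffered writes become visible. */
        __atomic_store_n(&lock_var, 0, __ATOMIC_RELEASE | __ATOMIC_HLE_RELEASE);
    }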

So, as I understand it, if you have a hash table and a read is under way, and a write to the same bucket happens, then both the read and the write abort. One of those two gets the lock and does its thing, followed by the other according to the original code. But other reads or writes that don’t have conflicts go right through.

This seems like it will work, but I have to say I’d like to see the data on real code. My gut tells me that anything which changes the semantics of parallel locking, which HLE does, is going to have a weird effect somewhere. My guess would be some fun, subtle, intermittent performance bugs.

The Serial Skeleton in the Closet


This is all powerful stuff that will certainly aid parallel efficiency in both MIC, with its 30-plus cores, and the Xeon line, with fewer but faster cores. (Fewer faster cores need it too, since serialization inefficiency gets proportionally worse with faster cores.) But don’t think for a minute that it eliminates all serialization.

I see no issue with the part of this that monitors locations read and written; I don’t know Intel’s exact implementation, but I feel sure it re-uses the cache coherence mechanisms already present, which operate without (too) much serialization.

However, there’s a reason I used a deliberately disgusting analogy when talking about pushing all the written data to memory on “Commit!” (XEND, XRELEASE). Recall that the required semantics are “all or nothing”: Every entity in the system sees all of the changes, or every entity sees none of them. (I’ve been saying “entity” because GPUs are now prone to directly access cache coherent memory, too.)

If the code has changed multiple locations during a transaction, probably on multiple cache lines, that means those changes have to be made all at once. If locations A and B both change, nobody can possibly see location A after it changed but location B before it changed. Nothing, anywhere, can get between the write of A and the write of B (or the making of both changes visible outside of cache).

As I said, I don’t know Intel’s exact implementation, so I could conceivably be wrong, but for me that implies that every “Commit!” requires a whole-system serialization event: Every processor and thread in the whole system has to be not just halted, but have its pipelines drained. Everything must come to a dead stop. Once that stop is done, then all the changes can be made visible, and everything restarted.

Note that Intel’s TSE architecture spec says nothing about these semantics being limited to one particular chip or region. This is very good; software exploitation would be far harder otherwise. But it implies that in a multi-chip, multi-socket system, this halt and drain applies to every processor in every chip in every socket. It’s a dead stop of everything.

Well, OK, but lock acquire and release instructions always did this dead stop anyway, so likely the aggregate amount of serialization is reduced. (Wait a minute, they always did this anyway?! What the… Yeah. Dirty little secret of the hardware dudes.)

But lock acquire and release only involve one cache line at a time. “Commit!” may involve many. Writes involve letting everybody else know you’re writing a particular line, so they can invalidate it in their cache(s). Those notifications all have to be sent out, serially, and acknowledgements received. They can be pipelined, and probably are, but the process is still serial, and must be done while at a dead stop.

So, if your transactional segment of code modifies, say, 128KB spread over 2K cache lines, you can expect a noticeable bit of serialization time when you “Commit!”. Don’t forget this issue now includes all your old-style locking, thanks to HLE, where the original locking involved updating just one cache line. This is another reason I want to see some real running code with HLE. Who knows what evil lurks between the locks?

But as I said, I don’t know the implementation. Could Intel folks have found a way around this? Maybe; I’m writing this, as I’ve indicated, speculatively. Perhaps real magic is involved. We’ll find out when Haswell ships.

Enjoy your bright, shiny, new, non-blocking transactional memory when it ships. It’ll probably work really well. But beware the dreaded hraglblargarolfargahglug. It bites.

Friday, December 4, 2009

Intel’s Single-Chip Clus… (sorry) Cloud

Intel's recent announcement of a 48-core "single-chip cloud" (SCC) is now rattling around several news sources, with varying degrees of boneheaded-ness and/or willful suspension of disbelief in the hype. Gotta set this record straight, and also raise a few questions I didn't find answered in the sources now available (the presentation, the software paper, the developers' video).

Let me emphasize from the start that I do not think this is a bad idea. In fact, it's an idea good enough that I've led or been associated with rather similar architectures twice, although not on a single chip (RP3, POWER 4 (no, not the POWER4 chip; this one was four original POWER processors in one box, apparently lost on the internet) (but I've still got a button…)). Neither was, in hindsight, a good idea at the time.

So, some facts:

SCC is not a product. It's an experimental implementation of which about a hundred will be made, given to various labs for software research. It is not like Larrabee, which will be shipped in full-bore product-scale quantities Some Day Real Soon Now. Think concept car. That software research will surely be necessary, since:

SCC is neither a multiprocessor nor a multicore system in the usual sense. They call it a "single chip cloud" because the term "cluster" is déclassé. Those 48 cores have caches (two levels), but cache coherence is not implemented in hardware. So, it's best thought of as a 48-node cluster of uniprocessors on a single chip. Except that those 48 cluster nodes all access the same memory. And if one processor changes what's in memory, the others… don't find out. Until the cache at random happens to kick the changed line out. Sometime. Who knows when. (Unless something else is done; but what? See below.)

Doing this certainly does save hardware complexity, but one might note that quite adequately scalable cache coherence does exist. It's sold today by Rackable (the part that was SGI); Fujitsu made big cache-coherent multi-x86 systems for quite a while; and there are ex-Sequent folks out there who remember it well. There's even an IEEE standard for it (SCI, Scalable Coherent Interface). So let's not pretend the idea is impossible and assume your audience will be ignorant. Mostly they will be, but that just makes misleading them more reprehensible.

To leave cache coherence out of an experimental chip like this is quite reasonable; I've no objection there. I do object to things like the presentation's calling this "New Data-Sharing Options." That's some serious lipstick being applied.

It also leads to several questions that are so far unanswered:

How do you keep the processors out of each others' pants? Ungodly uncontrolled race-like uglies must happen unless… what? Someone says "software," but what hardware does that software exercise? Do they, perhaps, keep the 48 separate from one another by virtualization techniques? (Among other things, virtualization hardware has to keep virtual machine A out of virtual machine B's memory.) That would actually be kind of cool, in my opinion, but I doubt it; hypervisor implementations require cache coherence, among other issues. Do you just rely on instructions that push individual cache lines out to memory? Ugh. Is there a way to decree whole swaths of memory to be non-cacheable? Sounds kind of inefficient, but whatever. There must be something, since they have demoed some real applications and so this problem must have been solved somehow. How?
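Just to make the "push individual cache lines out to memory" option concrete, here is a hypothetical sketch of what producer-side code might have to do on hardware with caches but no coherence. This is pure speculation on my part about one possible software scheme, written with generic x86 flush instructions (which SCC's Pentium-class cores may not even have); it is emphatically not a description of how SCC actually does it.

    #include <emmintrin.h>   /* _mm_clflush, _mm_mfence */
    #include <stddef.h>
    
    #define CACHE_LINE 64
    
    /* Producer publishes a buffer for a consumer on another core to read,
       with no hardware coherence to propagate the changes for us. */
    void publish(volatile char *shared, size_t len, volatile int *ready_flag)
    {
        for (size_t i = 0; i < len; i += CACHE_LINE)
            _mm_clflush((const void *)(shared + i));   /* force each dirty line out to memory */
        _mm_mfence();                                  /* make sure the flushes complete first */
        *ready_flag = 1;                               /* consumer polls this flag... */
        _mm_clflush((const void *)ready_flag);         /* ...which must also be pushed out */
    }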

What's going on with the operating system? Is there a separate kernel for each core? My guess: Yes. That's part of being a clus… sorry, a cloud. One news article said it ran "Rock Creek Linux." Never heard of it? Hint: The chip was called Rock Creek prior to PR.

One iteration of non-coherent hardware I dealt with used cluster single system image to make it look like one machine for management and some other purposes. I'll bet that becomes one of the software experiments. (If you don't know what SSI is, I've got five posts for you to read, starting here.)

Appropriately, there's mention of message-passing as the means of communicating among the cores. That's potentially fast message passing, since you're using memory-to-memory transfers in the same machine. (Until you saturate the memory interface – only four ports shared by all 48.) (Or until you start counting usual software layers. Not everybody loves MPI.) Is there any hardware included to support that, like DMA engines? Or protocol offload engines?

Finally, why does every Intel announcement of gee-whiz hardware always imply it will solve the same set of problems? I'm really tired of those flying cars. No, I don't expect to ever see an answer to that one.

I'll end by mentioning something in Intel's SCC (née Rock Creek) that I think is really good and useful: multiple separate power regions. Voltage and frequency can be varied separately in different areas of the chip, so if you aren't using a bunch of cores, they can go slower and/or draw less power. That's something that will be "jacks or better" in future multicore designs, and spending the effort to figure out how to build and use it is very worthwhile.

Heck, the whole thing is worthwhile, as an experiment. On its own. Without inflated hype about solving all the world's problems.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - -

(This has been an unusually topical post, brought to you courtesy of the author's head-banging-wall level of annoyance at boneheaded news stories. Soon we will resume our more usual programming. Not instantly. First I have to close on and move into a house. In Holiday and snow season.)

Friday, September 11, 2009

Of Muffins and Megahertz

Some readers have indicated, offline, that they liked the car salesman dialog about multicore systems that appeared in What Multicore Really Means. So I thought it might be interesting to relate the actual incident that prompted the miles- to muffins-per-hour performance switch I used.

If you haven't read that post, be warned that what follows will be more meaningful if you do.

It was inspired by a presentation to upper-level management of a 6-month-plus study of what was going on in the silicon concerning clock rate, and what if anything could be done about it. This occurred several years ago. I was involved, but not charged with providing any of the key slides. Well, OK, not one of my slides ended up being used.

It of course began with the usual "here's the team, here's how hard we worked" introduction.

First content chart: a cloud of data points collected from all over the industry that showed performance – specifically SPECINT, an integer benchmark – keeling over. It showed a big, obvious switch from the usual rise in performance, with a rough curve fit, breaking to a much lower predicted performance increase from now on. Pretty impressive. The obvious conclusion: Something major has happened. Things are different. There is big trouble.

Now, there's a rule for executive presentations: Never show a problem without proposing a solution. (Kind of like never letting a crisis go to waste.) So,

Second chart: a very similar-looking cloud of data points, sailing on at the usual growth rate for many years to come – labeled as multiprocessor (MP) results, what the industry would do in response. Yay, no problem! It's all right! We just keep on going! MP is the future! Lots of the rest of the pitch was about various forms of MP, from normal to bizarre.

Small print on second chart: It graphed SPECRATE. Not SPECINT.

SPECINT is a single-processor measure of performance. SPECRATE is, basically, how many completely separate SPECINTs you can do at once. Like, say, instead of the response time of PowerPoint, you get the incredibly useful measure of how many different PowerPoint slides you can modify at the same time. Or you change from miles per hour to muffins per hour.

Nothing on any slide or in any verbal statements referred to the difference. The chart makers - mostly high-level silicon technology experts - knew the difference, at least in theory. At least some of them did. I know others definitely did not in any meaningful sense.

At any event, throughout the entire rest of the presentation they displayed no inclination to inform anybody what it really meant. They didn't even distinguish the good result: typical server tasks can in general make really good use of parallelism. (See IT Departments Should NOT Fear Multicore.)

I was aghast. I couldn't believe that would be presented, like that, no matter what political positioning was going on. But I "knew better" than to say anything. Those charts were a result not just of data mining the industry for performance data but of data mining the company politically to get something that would reflect best on everybody involved. Speak up, and you get told that you don't know the big picture.

My opinion about feeding nonsense to anybody should be obvious from this blog. I don't think I'm totally blind in the political spectrum, but hey, guys, come on. That's blatant.

One hopes that the people who were the target knew enough to know the difference. I suspect that the whole point of the exercise, from their point of view, was just to really, firmly, nail down the point that the first chart – SPECINT keeling over – was a physical fact, and not just one of the regularly-scheduled pitches from the silicon folks for more development funds because they were in trouble. The target audience probably stopped paying attention after that first slide.

I don't mean to imply above that the gents who are responsible for the physical silicon don't regularly have real problems; they do. But this situation was a problem of a whole different dimension.

It still is.

Saturday, August 15, 2009

Today’s Graphics Hardware is Too Hard

Tim Sweeney recently gave a keynote at High Performance Graphics 2009 titled "The End of the GPU Roadmap" (slides). Tim is CEO and founder of Epic Games, producers of over 30 games including Gears of War, as well as the Unreal game engine used in hundreds of games. There are lots of really interesting points in that 74-slide presentation, but my biggest keeper is slide 71:

[begin quote]

Lessons learned: Today's hardware is too hard!

  • If it costs X (time, money, pain) to develop an efficient single-threaded algorithm, then…
    • Multithreaded version costs 2X
    • PlayStation 3 Cell version costs 5X
    • Current "GPGPU" version costs: 10X or more
  • Over 2X is uneconomical for most software companies!
  • This is an argument against:
    • Hardware that requires difficult programming techniques
    • Non-unified memory architectures
    • Limited "GPGPU" programming models

[end quote]

Judging from the prior slides, by '"GPGPU"' Tim apparently means the DirectX 10 pipeline with programmable shaders.

I'm not sure what else to make of this beyond rehashing Tim's words, and I'd rather point you to his slides than start doing that. The overall tenor somewhat echoes comments I made in one of my first posts; it continues to be the most hit-on page of this blog, so I must have said something useful there.

I will note, though, that Tim's estimates of effort are based on very extensive experience – with game programming. For low-ish levels of parallelism, like 4 or 8, multithreading adds zero cost to typical commercial applications already running under a competent transaction monitor. It just works, since they're already at that level of software multithreading for other reasons (like achieving overlap with IO waits). Of course, that's not at all universally true for commercial applications, particularly for high levels of parallelism, no matter how much cloud evangelists talk about elasticity.

Once again, thanks to my friend who is expert at finding things like this slide set (it's not on the conference web site) and doesn't want his name mentioned.

Short post this time.

Monday, July 20, 2009

Why Accelerators Now?

Accelerators have always been the professional wrestlers of computing. They're ripped, trash-talking superheroes, whose special signature moves and bodybuilder physiques promise to reduce diamond-hard computing problems to soft blobs quivering in abject surrender. Wham! Nvidia "The Green Giant" CUDA body-slams a Black-Scholes equation financial model! Shriek! Intel "bong-da-Dum-da-Dum" Larrabee cobra clutches a fast Fourier transform to agonizing surrender!

And they're "green"! Many more FLOPS/OPS/whatever per watt and per square inch of floor space than your standard server! And I'm using way too many exclamation points!!

Logical Sidebar: What is an accelerator, anyway? My definition: An accelerator is a device optimized to enhance the performance or function of a computing system. An accelerator does not function on its own; it requires invocation from host programs. This is by intention and design optimization, not physics, since an accelerator may contain general purpose system parts (like a standard processor), be substantially software or firmware, and (recursively) contain other accelerators. The strategy is specialization; there is no such thing as a "general-purpose" accelerator. Claims to the contrary usually assume just one application area, usually HPC, but there are many kinds of accelerators – see the table appearing later. The big four "general purpose" GPUs – IBM Cell, Intel Larrabee, Nvidia CUDA, ATI/AMD Stream – are just the tip of the iceberg. The architecture of accelerators is a glorious zoo that is home to the most bizarre organizations imaginable, a veritable Cambrian explosion of computing evolution.

So, if they're so wonderful, why haven't accelerators already taken over the world?

Let me count the ways:

Nonstandard software that never quite works with your OS release; disappointing results when you find out you're going 200 times faster – on 5% of the whole problem; lethargic data transfer whose overhead squanders the performance; narrow applicability that might exactly hit your specific problem, or might not when you hit the details; difficult integration into system management and software development processes; and a continual need for painful upgrades to the next, greatest version with its different nonstandard software and new hardware features; etc.

When everything lines up just right, the results can be fantastic; check any accelerator company's web page for numerous examples. But getting there can be a mess. Anyone who was a gamer in the bad old days before Microsoft DirectX is personally familiar with this; every new game was a challenge to get working on your gear. Those perennial problems are also the reason for a split reaction in the finance industry to computational accelerators. The quants want them; if they can make them work (and they're always optimists), a few milliseconds advantage over a competitor can yield millions of dollars per day. But their CIOs' reaction is usually unprintable.

I think there are indicators that this worm may well be turning, though, allowing many more types of accelerators to become far more mainstream. Which implies another question: Why is this happening now?

Indicators

First of all, vendors seem to be embracing actual industry software standards for programming accelerators. I'm referring here to the Khronos Group's OpenCL, which Nvidia, AMD/ATI, Intel, and IBM, among others, are supporting. This may well replace proprietary interfaces like Nvidia's CUDA API and AMD/ATI's CTM, and in doing so have an effect as good and simplifying as Microsoft's DirectX API series, which eliminated a plethora of problems for graphics accelerators (GPUs).

Another indicator is that connecting general systems to accelerators is becoming easier and faster, reducing both the CPU overhead and latency involved in transferring data and kicking off accelerator operations. This is happening on two fronts: IO, and system bus.

On the IO side, there's AMD developing and intermittently showcasing its Torrenza high-speed connection. In addition, the PCI-SIG will, presumably real soon now, publish PCIe version 3.0, which contains architectural features designed for lower-overhead accelerator attachment, like atomic operations and caching of IO data.

On the system bus side, both Intel and AMD have been licensing their inner inter-processor system busses as attachment points to selected companies. This is the lowest-overhead, fastest way to communicate that exists in any system; the latencies are measured in nanoseconds, not microseconds, and the data rates in gigabytes/second. This indicates a real commitment to accelerators, because foreign attachment directly to one's system bus was heretofore unheard-of, for very good reason. The protocols used on system busses, particularly the aspects controlling cache coherence, are mind-numbingly complex. They're the kinds of things best developed and used by a team whose cubes/offices are within whispering range of each other. When they don't work, you get nasty intermittent errors that can corrupt data and crash the system. Letting a foreign company onto your system bus is like agreeing to the most intimate unprotected sex imaginable. Or doing a person-to-person mutual blood transfusion. Or swapping DNA. If the other guy is messed up, you are toast. Yet, it's happening. My mind is boggled.

Another indicator is the width of the market. The vast majority of the accelerator press has focused on GPGPUs, but there are actually a huge number of accelerators out there, spanning an oceanic range of application areas. Cryptography? Got it. Java execution? Yep. XML processing? – not just parsing, but schema validation, XSLT transformations, XPaths, etc. – Oh, yes, that too. Here's a table of some of the companies involved in some of the areas. It is nowhere near comprehensive, but it will give you a flavor:

[Table image: accelerator companies by application area]

Beyond companies making accelerators, there are a collection of companies who are accelerator arms dealers – they live by making technology that's particularly good for creating accelerators, like semi-custom single-chip systems with your own specified processing blocks and/or instructions. Some names: Cavium, Freescale Semiconductor, Infineon, LSI Logic, Raza Microelectronics, STMicroelectronics, Teja, Tensilica, Britestream. That's not to leave out FPGA vendors who make custom hardware simple by providing chips that are seas of gates and functions you can electrically wire up as you like.

Why Now?

That's all fine and good, but. Tech centers around the world are littered with the debris of failed accelerator companies. There have always been accelerator companies in a variety of areas, particularly floating point computing and the offloading of communications protocols (chiefly TCP/IP); efforts date back to the early 1970s. Is there some fundamental reason why the present surge won't crash and burn like it always has?

A list can certainly be made of how circumstances have changed for accelerator development. There didn't used to be silicon foundries, for example. Or Linux. Or increasingly capable building blocks like FPGAs. I think there's a more fundamental reason.

Until recently, everybody has had to run a Red Queen's race with general purpose hardware. There's no point in obtaining an accelerator if by the time you convince your IT organization to allow it, order it, receive it, get it installed, and modify your software to use it, you could have gone faster by just sitting there on your butt, doing nothing, and getting a later generation general-purpose system. When the general-purpose system has gotten twice as fast, for example, the effective value of your accelerator has halved.

How bad a problem is this? Here's a simple graph that illustrates it:

[Graph: a 10X accelerator advantage eroding over time as general-purpose performance grows at 45% CAGR]

What the graph shows is this: Suppose you buy an accelerator that does something 10 times faster than the fastest general-purpose "commodity" system does, today. Now, assume GP systems increase in speed as they have over the last couple of decades, a 45% CAGR. After only two years, you're only 5 times faster. The value of your investment in that accelerator has been halved. After four years, it's nearly divided by 5. After five years, it's worthless; it's actually slower than a general purpose system.

This is devastating economics for any company trying to make a living by selling accelerators. It means they have to turn over designs continually to keep their advantage, and furthermore, they have to time their development very carefully – a schedule slip means they have effectively lost performance. They're in a race with the likes of Intel, AMD, IBM, and whoever else is out there making systems out of their own technology, and they have nowhere near the resources being applied to general purpose systems (even if they are part of Intel, AMD, and IBM).

Now look at what happens when the rate of increase slows down:

[Graph: the same 10X advantage holding up far longer when general-purpose performance grows at only 10%-15% CAGR]

Look at that graph, keeping in mind that the best guess for single-thread performance increases over time is now in the range of 10%-15% CAGR at best. Now your hardware design can actually provide value for five years. You have some slack in your development schedule.
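If you want to reproduce the shape of those two graphs yourself, the arithmetic is trivial; here is a little throwaway program that does it. The 10X starting advantage and the 45% and 12% growth rates are just the assumptions from the text, not measurements of anything.

    #include <math.h>
    #include <stdio.h>
    
    /* How a 10X accelerator advantage erodes as general-purpose systems improve
       at a given compound annual growth rate. Build with: cc erode.c -lm */
    int main(void)
    {
        const double start_advantage = 10.0;
        const double rates[] = { 0.45, 0.12 };    /* historical vs. slowed single-thread growth */
        for (int r = 0; r < 2; r++) {
            printf("At %.0f%% CAGR:\n", rates[r] * 100.0);
            for (int year = 0; year <= 6; year++)
                printf("  year %d: accelerator is %.1fX a general-purpose system\n",
                       year, start_advantage / pow(1.0 + rates[r], year));
        }
        return 0;
    }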

It means that the slowdown in single-thread performance growth makes accelerators economically viable to a degree they never have been before.

Cambrian explosion? I think it's going to be a Cambrian Fourth-of-July, except that the traditional finale won't end soon.

Objections and Rejoinders

I've heard a couple of objections raised to this line of thinking, so I may as well bring them up and try to shoot them down right away.

Objection 1: Total performance gains aren't slowing down, just single-thread gains. Parallel performance continues to rise. To use accelerators you have to parallelize anyway, so you just apply that to the general purpose systems and the accelerator advantage goes away again.

Response: This comes from the mindset that accelerator = GPGPU. GPGPUs all get their performance from explicit parallelism, and, as the "GP" part says, that parallelism is becoming more and more general purpose. But the world of accelerators isn't limited to GPGPUs; some use architectures that simply (hah! it isn't often simple) embed algorithms directly in silicon. The guts of a crypto accelerator aren't anything like a general-purpose processor, for example. Conventional parallelism on conventional general processors will lose out to it. And in any event, this is comparing past apples to present oranges: Previously you did not have to do anything at all to reap the performance benefit of faster systems. This objection assumes that you do have to do something – parallelize code – and that something is far from trivial. Avoiding it may be a major benefit of accelerators.

Objection 2: Accelerator, schaccelerator, if a function is actually useful it will get embedded into the instruction set of general purpose systems, so the accelerator goes away. SIMD operations are an example of this.

Response: This will happen, and has happened, for some functions. But how did anybody get the experience to know what instruction set extensions were the most useful ones? Decades of outboard floating point processing preceded SIMD instructions. AMD says it will "fuse" graphics functions with processors – and how many years of GPU development and experience will let it pick the right functions to do that with? For other functions, well, I don't think many CPU designers will be all that happy absorbing the strange things done in, say, XML acceleration hardware.

Sunday, February 15, 2009

What Multicore Really Means (and a Larrabee/Cell Example)

So, now everybody's staring in rapt attention as Intel provides a peek at its upcoming eight-core chip. When they're not speculating about Larrabee replacing Cell on PlayStation 4, that is.

Sigh.

I often wish the guts of computers weren't so totally divorced from everyday human experience.

Just imagine if computers could be seen, heard, or felt as easily as, for example, cars. That would make what has gone on over the last few years instantly obvious; we'd actually understand it. It would be as if a guy from the computer (car) industry and a consumer had this conversation:

"Behold! The car!" says car (computer) industry guy. "It can travel at 15 miles per hour!"

"Oh, wow," says consumer guy, "that thing is fantastic. I can move stuff around a lot faster than I could before, and I don't have to scoop horse poop. I want one!"

Time passes.

"Behold!" says the industry guy again, "the 30 mile-an-hour car!"

"Great!" says consumer guy. "I can really use that. At 15 mph, it takes all day to get down to town. This will really simplify my life enormously. Gimme, gimme!"

Time passes once more.

"Behold!" says you-know-who, "the 60 mph car!"

"Oh, I need one of those. Now we can visit Aunt Sadie over in the other county, and not have to stay overnight with her 42 cats. Useful! I'll buy it!"

Some more time.

"Behold!" he says, "Two 62-mph cars!"

"Say what?"

"It's a dual car! It does more!"

"What is that supposed to mean? Look, where's my 120 mph car?"

"This is better! It's 124 mph. 62 plus 62."

"Bu… Wha… Are you nuts? Or did you just arrived from Planet Meepzorp? That's crazy. You can't add up speeds like that."

"Sure you can. One can deliver 62 boxes of muffins per hour, so the two together can deliver 124. Simple."

"Muffins? You changed what mph means, from speed to some kind of bulk transport? Did we just drop down the rabbit hole? Since when does bulk transport have anything to do with speed?"

"Well, of course the performance doubling doesn't apply to every possible workload or use. Nothing ever really did, did it? And this does cover a huge range. For example, how about mangos? It can do 124 mph on those, too. Or manure. It applies to a huge number of cases."

"Look, even if I were delivering mangos, or muffins, or manure, or even mollusks …"

"Good example! We can do those, too."

"Yeah, sure. Anyway, even if I were doing that, and I'm not saying I am, mind you, I'd have to hire another driver, make sure both didn't try to load and unload at the same time, pay for more oil changes, and probably do ten other things I didn't have to do before. If I don't get every one of them exactly right, I'll get less than your alleged 124 whatevers. And I have to do all that instead of just stepping on the gas. This is an enormous pain."

"We have your back on those issues. We're giving Jeb here – say hello, Jeb –"

"Quite pleased to meet you, I'm sure. Be sure to do me the honor of visiting my Universal Mango Loading Lab sometime."

"…a few bucks to get that all worked out for you."

"Hey, I'm sure Jeb is a fine fellow, but right down the road over there, Advanced Research has been working on massively multiple loading for about forty years. What can Jeb add to that?"

"Oh, that was for loading special High-Protein Comestibles, not every day mangos and muffins. HPC is a niche market. This is going to be used by everybody!"

"That is supposed to make it easier? Come on, give me my real 120 mile per hour car. That's a mile, not a munchkin, a monkey, a mattock, or anything else, just a regular, old, mile. That's what I want. In fact, that's what I really need."

"Sorry, the tires melt. That's just the way it is; there is no choice. But we'll have a Quad Car soon, and then eight, sixteen, thirty-two! We've even got a 128-car in our labs!"

"Oh, good grief. What on God's Green Earth am I going to do with a fleet of 128 cars?"


Yeah, yeah, I know, a bunch of separate computers (cars) isn't the same as a multi-processor. They're different kinds of things, like a pack of dogs is different from a single multi-headed dog. See illustrations here. The programming is very different. But parallel is still parallel, and anyway Microsoft and others will just virtualize each N-processor chip into N separate machines in servers. I'd bet the high-number multi-cores ultimately morph into a cluster-on-a-chip as time goes on, anyway, passing through NUMA-on-a-chip on the way.

But it's still true that:

  • Computers no longer go faster. We just get more of them. Yes, clock speeds still rise, but it's like watching grass grow compared to past rates of increase. Lots of software engineers really haven't yet digested this; they still expect hardware to bail them out like it used to.
  • The performance metrics got changed out from under us. SPECrate is muffins per hour.
  • Various hardware vendors are funding labs at Berkeley, UIUC, and Stanford to work on using them better, of course. Best of luck with your labs, guys, and I hope you manage to do a lot better than was achieved by 40 years of DARPA/NSF funding. Oh, but that was a niche.

My point in all of this is not to protest the rising of the tide. It's coming in. Our feet are already wet. "There is no choice" is a phrase I've heard a lot, and it's undeniably true. The tires do melt. (I sometimes wonder "Choice to do what?" but that's another issue.)

Rather, my point is this: We have to internalize the fact that the world has changed – not just casually admit it on a theoretical level, but really feel it, in our gut.

That internalization hasn't happened yet.

We should have reacted to multi-core systems like consumer guy seeing the dual car and hearing the crazy muffin discussion, instantly recoiling in horror, recognizing the marketing rationalizations as somewhere between lame and insane. Instead, we hide the change from ourselves, for example letting companies call a multi-core system "a processor" (singular) because it's packaged on one chip, when they should be laughed at so hard even their public relations people are too embarrassed to say it.

Also, we continue to casually talk in terms that suggest a two-processor system has the power of one processor running twice as fast – when they really can't be equated, except at a level of abstraction so high that miles are equated to muffins.

We need to understand that we've gone down a rabbit hole. So many standard assumptions no longer hold that we can't even enumerate them.

To ground this discussion in real GHz and performance, here's an example of what I mean by breaking standard assumptions.

In a discussion on Real World Technologies' Forums about the recent "Intel Larrabee in Sony PS4" rumors, it was suggested that Sony could, for backward compatibility, just emulate the PS3's Cell processor on Larrabee. After all, Larrabee is several processor generations after Cell, and it has much higher performance. As I mentioned elsewhere, the Cell cranks out "only" 204 GFLOPS (peak), and public information about Larrabee puts it somewhere in the range of at least 640 GFLOPS (peak), if not 1280 GFLOPS (peak) (depends on what assumptions you make, so call it an even 1TFLOP).

With that kind of performance difference, making a Larrabee act like a Cell should be a piece of cake, right? All those old games will run just as fast as before. The emulation technology (just-in-time compiling) is there, and the inefficiency introduced (not much) will be covered up by the faster processor. No problem. Standard thing to do. Anybody competent should think of it.

Not so fast. That's pre-rabbit-hole thinking. Those are all flocks of muffins flying past, not simple speed. Down in the warrens where we are now, it's possible for Larrabee to be both faster and slower than Cell.

In simple speed, the newest Cell's clock rate is actually noticeably faster than expected for Larrabee. Cell has shipped for years at 3.2 GHz; the more recent PowerXCell version uses newer fabrication technology to lower power (heat), not to increase speed. Public Larrabee estimates say that when it ships (late 2009 or 2010) it will be somewhere around 2 GHz, so in that sense Cell is about 1.6X faster than Larrabee (both are in-order, both count FLOPS double by having a multiply-add).

Larrabee is "faster" only because it contains much more stuff – many more transistors – to do more things at once than Cell does. This is true at two different levels. First, it has more processors: Cell has 8, while Larrabee has at least 16 and may go up to 48. Second, while both Cell and Larrabee gain speed by lining up several numbers and operating on all of them at the same time (SIMD), Larrabee lines up more numbers at once than Cell: The GFLOPS numbers above assume Larrabee does 16 operations at once (512-bit vector registers), but Cell does only four operations at once (128-bit vector registers). To get maximum performance on both of them, you have to line up that many numbers at once. Unless you do, performance goes down proportionally.
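For the curious, the peak numbers above come from simple multiplication: cores, times SIMD lanes, times 2 for the multiply-add, times clock. Here is that arithmetic written out; the Larrabee core count and clock are the public estimates mentioned above, not official specifications.

    #include <stdio.h>
    
    /* Peak GFLOPS = cores x SIMD lanes x 2 (multiply-add) x GHz. Rough sketch only. */
    int main(void)
    {
        printf("Cell:     %7.1f GFLOPS peak\n", 8 * 4 * 2 * 3.2);    /* ~205: the "204" quoted above */
        printf("Larrabee: %7.1f GFLOPS peak\n", 16 * 16 * 2 * 2.0);  /* ~1024: "an even 1 TFLOP"     */
        /* Different core-count and clock assumptions are what produce the 640-1280 range quoted above. */
        return 0;
    }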

This means that to match today's and several years' ago Cell performance, next year's Larrabee would have to not just emulate it, but extract more parallelism than is directly expressed in the program being emulated. It has to find more things to do at once than were there to begin with.

I'm not saying that's impossible; it's probably not. But it's certainly not at all as straightforward as it would have been before we went down the rabbit hole. (And I suspect that "not at all as straightforward" may be excessively delicate phrasing.)

Ah, but how many applications really use all the parallelism in Cell – get all its parts cranking at once? Some definitely do, and people figure out how to do more every day. But it's not a huge number, in part because Cell does not have the usual, nice, maximally convenient programming model exhibited by mainstream systems, and claimed for Larrabee; it traded that off for all that speed (in part). The idea was that Cell was not for "normal" programming; it was for game programming, with most of the action in intense, tight, hand-coded loops doing image creation from models. That happened, but certainly not all the time, and anecdotally not very often at all.

Question: Does that make the problem easier, or harder? Don't answer too quickly, and remember that we're talking about emulating from existing code, not rewriting from scratch.

A final thought about assumption breaking and Cell's notorious programmability issues compared with the usual simpler-to-use organizations: We may, one day, look back and say "It sure was nice back then, but we no longer have the luxury of using such nice, simple programming models." It'll be muffins all the way down. I just hope that we've merely gone down the rabbit hole, and not crossed the Mountains of Madness.

Friday, January 23, 2009

Multi-Multicore Single System Image / Cloud Computing. A Good Idea? (5)

This is part 5 of a multi-post sequence on this topic which began here. This is the crux, the final part attempting to answer the question

Why Hasn't SSI Taken Over The World?

After last-minute productus interruptus pull-outs and half-hearted product introductions by so many major industry players – IBM, HP, Sun, Tandem, Intel, SCO – you have to ask what's wrong.

Semi-random events like business conditions, odd circumstances, and company politics always play a large part in whether any product sees the light of day, but over multiple attempts those should average out: It can't just be a massive string of bad luck. Can it? This stuff just seems so cool. It practically breathes breathless press-release hype. Why isn't it everywhere?

Well, I'm fully open to suggestions.

I will offer some possibilities here.

Marketing: A Tale of Focus Groups

In one of the last, biggest pushes for this within IBM, sometime around 2002, a real marketer got involved, got funding, and ran focus groups with several sets of customers in multiple cities to get a handle on who would buy it, and for how much. I was involved in the evaluation and listened to every dang cassette tape of every dang multi-hour session. I practically kicked them in frustration at some points.

The format: After a meet-and-greet with a round of donuts or cookies or whatever, a facilitator explained the concept to the group and asked them "Would you buy this? How much would you pay?" They then discussed it among themselves, and gave their opinions.

It turned out that the customers in the focus groups divided neatly into two completely separate groups:

Naives

These were the "What's a cluster?" people. They ran a small departmental server of their own, but had never even heard of the concept of a cluster. They had no clue what a single system image was or why anybody would conceivably want one, and the short description provided certainly didn't enlighten them. This was the part I was near to kicking about; I knew I could have done that better. Now, however, I realize that what they were told was probably fairly close to the snippets they would have heard in the normal course of marketing and trade press articles.

Result: They wouldn't buy it because they wouldn't understand what it was and why they should be interested.

Experts

These guys were sysadmins of their own clusters, and knew every last minute detail of the area. They'd built clusters (usually small ones), configured cluster databases, and kept them running come hell or high water with failover software. The Naives must have thought they were talking Swahili.

They got the point, they got it fast, and they got it deep. They immediately understood implications far beyond what the facilitator said. They were amazed that it was possible, and thought it was really neat, although some expressed doubt that anybody could actually make it work, as in "You mean this is actually possible? Geez."

After a short time, however, all of them, every last one, zeroed in on one specific flaw: Operating system updates are not HA, because you can't in many cases run a different fix level on different nodes simultaneously. Sometimes it can work, but not always. These folks did rolling OS upgrades all the time, updating one node's OS and seeing if it fell over, then moving to the next node, etc.; this is a standard way to avoid planned outages.

Result: They wouldn't buy it either, because, as they repeatedly said, they didn't want to go backwards in availability.

That was two out of two. Nobody would buy it. It's difficult to argue with that market estimate.

But what if those problems were fixed? The explanation can be fixed, for sure; as I said, I all but kicked the cassettes in frustration. Doing kernel updates without an outage is pretty hard, but in all but the worst cases it could also be done, with enough effort.

Even were that done, I'm not particularly hopeful, for the several other reasons discussed below.

Programming Model

As Seymour Cray said of virtual memory, "Memory is like an orgasm: It's better when you don't have to fake it." (Thank you, Eugene Miya and his comp.sys.super FAQ.)

That remains true when it's shared virtual memory bouncing between nodes, even despite a probable lack of disk accesses. Even were an application or framework written to scale up in shared-memory multiprocessor style over many multi-cores in many nodes – and significant levels of multiprocessor scaling will likely be achieved as the single-chip core count rises – that application is going to perform much better if it is rewritten to partition its data so each node can do nearly all accesses into local memory.

But hey, why bother with all that work? Lots of applications have already been set up to scale on separate nodes, so why not just run multiple instances of those applications, and tune them to run on separate nodes? It achieves the same purpose, and you just run the same code. Why not?

Because most of the time it won't work. They haven't been written to run multiple copies on the same OS. Apache is the poster child for this. Simple, silly things get in the way, like using the same names for temp files and other externally-visible entities. So you modify the file system, letting each have an instance-specific /tmp and other files… But now you've got to find all those cases.


The massively dominant programming model of our times runs each application on its own copy of the operating system. That has been a major cause of server sprawl and the resultant killer app for virtualization. The issue isn't just silly duplicate file names, although that is still there. The issue is also performance isolation, fault isolation, security isolation, and even inter-departmental political isolation. "Modern" operating systems simply haven't implemented hardening of the isolation between their separate jobs, not because it's impossible – mainframes OSs did it and still do – but because nobody cares any more. Virtualization is instead used to create isolated copies of the OS.

But virtualizing to one OS instance per node on top of a single-system-image OS that unifies the nodes, that's – I'm struggling for strong enough words here. Crazy? Circular? Möbius-strip-like? It would negate the whole point of the SSI.

Decades of training, tool development, and practical experience are behind the one application / one OS / one node programming model. It has enormous cognitive momentum. Multinode single system image is simply swimming upstream against a very strong current on this one, even if in some cases it may seem simpler, at least initially, to implement parallelism using shared virtual memory across multiple nodes. This alone seems to guarantee isolation to a niche market.

Scaling and Single-OS Semantics

Scaling up an application is a massive logical AND. The hardware must scale AND the software must scale. For the hardware to scale, the processing must scale AND the memory bandwidth must scale AND the IO must scale. For the software to scale, the operating system, middleware, AND application must all scale. If anything at all in the whole stack does not scale, you are toast: You have a serial bottleneck.
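To put a rough number on "serial bottleneck," here's a quick Amdahl's-law illustration in C; the serial fractions and node counts are mine, purely for illustration:

#include <stdio.h>

/* Amdahl's law: if a fraction s of the work can't scale, the speedup on
   n nodes is capped at 1 / (s + (1 - s)/n), no matter how well the rest
   of the hardware and software stack scales. */
int main(void) {
    double serial[] = { 0.001, 0.01, 0.05 };   /* non-scaling fraction */
    int    nodes[]  = { 16, 256, 4096 };

    for (int i = 0; i < 3; i++)
        for (int j = 0; j < 3; j++) {
            double s = serial[i];
            int    n = nodes[j];
            printf("serial fraction %.3f on %4d nodes -> speedup %.1f\n",
                   s, n, 1.0 / (s + (1.0 - s) / n));
        }
    return 0;
}

Even a 1% serial piece caps 4,096 nodes at a speedup of roughly 98.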

There are many interesting and useful applications that scale to hundreds or thousands of parallel nodes. This is true both of batch-like big single operations – HPC simulations, MapReduce over Internet-scale data – and the continuous massive river of transactions running through web sites like Amazon.com and eBay. So there are many applications that scale, along with their middleware, on hardware – piles of separate computers – that scales more-or-less trivially. When there's a separate operating system on each node, that part scales trivially, too.

But what happens when the separate operating systems are replaced with a kernel-level SSI operating system? The fact that LCC's code ran on the Intel Paragon seems to say that it can scale.

However, a major feature and benefit of the whole idea of SSI is that the single system image matches the semantics of a single-node OS exactly. Getting that exact match is not easy. At one talk I heard, Jerry Popek estimated that getting the last 5% of those semantics was over 80% of the work, but provided 95% of the benefit – because it guarantees that if any piece of code ran on one node, it will simply run on the distributed version. That guarantee is a powerful feature.

Unfortunately, single-node operating systems simply weren't designed with multi-node scaling in mind. The poster child for this one is Unix/Linux systems' file position pointer, which is associated with the file handle, not with the process manipulating the file. The intent, used in many programs, is that a parent process can open a file, read and process some, then pass the handle to a child, which continues reading and processing more of the file; when the child ends, the parent can pick up where the child left off. It's how command-line arguments traditionally get passed: The parent reads far enough to see what kind of child to start up, starts it, and implicitly passes the rest of the input file – the arguments – to the child, which slurps them up. On a single-node system, this is actually the simplest thing to do: You just use one index, associated with the handle, which everybody increments. For a parallel system, that one index is a built-in serial bottleneck. Spawn a few hundred child processes to work in parallel, and watch them contend for a common input command string. The file pointer isn't the only such case, but it's an obvious egregious example.
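For the curious, here's a minimal sketch in C of the behavior being described (the input file name is hypothetical): the offset lives with the open file description, not with the process, so after fork() the parent and child share one offset and the child picks up exactly where the parent left off.

#include <fcntl.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    char buf[8];
    int fd = open("input.txt", O_RDONLY);    /* hypothetical input file   */
    if (fd < 0) return 1;

    read(fd, buf, 4);                        /* parent consumes 4 bytes   */

    if (fork() == 0) {                       /* child inherits fd...      */
        ssize_t n = read(fd, buf, 4);        /* ...and the shared offset  */
        printf("child read %zd bytes starting at offset 4\n", n);
        _exit(0);
    }
    wait(NULL);

    /* Parent's next read starts wherever the child left off.  On an SSI
       system that one shared offset must stay coherent across nodes --
       a built-in serialization point. */
    read(fd, buf, 4);
    close(fd);
    return 0;
}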

So how did they make it work on the Intel Paragon? By sacrificing strict semantic compatibility. 

For a one-off high-end scientific supercomputer (not clear it was meant to be one-off, but that's life), it's not a big deal. One would assume a porting effort or the writing of a lot of brand new code. For a general higher-volume offering, cutting such corners would mean trouble; too much code wants to be reused.

Conclusions

Making a single image of an operating system span multiple computers sounds like an absolutely wonderful idea, with many clear benefits. It seems to almost magically solve a whole range of problems, cleanly and clearly. Furthermore, it very clearly can be done; it's appeared in a whole string of projects and low-volume products. In fact, it's been around for decades, in a stream of attempts that are a real tribute to the perseverance of the people involved, and to the seductiveness of the concept.

But it has never really gotten any traction in the marketplace. Why?

Well, maybe it was just bad luck. It could happen. You can't rule it out.

On the other hand, maybe (a) you can't sell it; (b) it's at right angles to the vastly dominant application programming model; (c) it scales less well than many applications you would like to run on it.

Bummer.


__________________________


(Postscript: Are there lessons here for cloud computing? I suspect so, but haven't worked it out, and at this point my brain is tired. See you later, on some other topic.)

Multi-Multicore Single System Image / Cloud Computing. A Good Idea? (4)

This is part 4 of a multi-post sequence on this topic which began here. This part discusses some implementation issues and techniques.

Implementation, Briefly

The only implementations I know much about were the Locus ones. I use the plural deliberately, since two different organizations were used over time.

Initially, the problem was approached the most straightforward way conceivable: Start at the root of the source tree and crawl through every line of code in the OS. Everywhere you see an assumption that some resource is only on one node, replace it with code that doesn't assume that, participates in cross-system consistency and recovery, and so on.


This is a massive undertaking, requiring a huge number of changes scattered throughout the source tree. It's a mess. And it's massively complicated. I am in awe that they actually got it to work. As you might imagine, after a few of their many ports, they seriously began looking for a better way, and found it in an analogy to the Unix/Linux vnode interface.

Vnode is what enables distributed file systems to easily plug into Unix/Linux. Its basic idea is simple: Anytime you want to do anything at all to a file, you funnel the request through vnode; you don't do it any other way. Vnode itself is a simple switch, with an effect that's roughly like this:

  If   <the vnode ID shows this is a native file system file>
  Then <just do it: call the native file system code>
  Else <ship the request to the implementer, and return its result>

If the implementer is on another computer, as it is with many distributed file implementations, this means that you ship the request off to the other computer, where it's done and the result passed back to the requestor.

What later implementations of the Locus line of code did was create a vproc interface analogous to vnode, but used for all manipulations of processes. If a process is local, just do it; otherwise, ship the request to wherever the process lives, do it there, and return the result. This neatly consolidates a whole lot of what has to be done into one place, enabling natural reuse and far greater consistency. It is definitely a win.
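To make that concrete, here's a minimal sketch of what a vproc-style dispatch might look like; the names and the RPC stub are hypothetical, not Locus's actual interface.

#include <signal.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

typedef struct vproc {
    int   node_id;   /* node where the process actually lives */
    pid_t pid;       /* its pid on that node                  */
} vproc_t;

static const int this_node = 0;   /* hypothetical: our node number */

/* Stand-in for the cross-node RPC layer a real implementation would have. */
static int ship_to_node(int node, const char *op, pid_t pid, int arg)
{
    printf("shipping %s(pid=%d, arg=%d) to node %d\n", op, (int)pid, arg, node);
    return 0;
}

int vproc_signal(const vproc_t *vp, int sig)
{
    if (vp->node_id == this_node)
        return kill(vp->pid, sig);                          /* local: just do it */
    return ship_to_node(vp->node_id, "signal", vp->pid, sig);  /* remote: ship it */
}

int main(void)
{
    vproc_t local  = { 0, getpid() };
    vproc_t remote = { 3, 1234 };        /* hypothetical remote process */
    vproc_signal(&local, 0);             /* signal 0: existence check, local path */
    vproc_signal(&remote, 0);            /* remote path: request gets shipped     */
    return 0;
}

The point is the shape of the thing: every process operation goes through one switch, so the local-versus-remote decision is made in exactly one place.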

Unfortunately, the rest of the kernel isn't written to use the vproc interface. So you still have to crawl through the OS and convert any code that directly manipulates a process into code that manipulates it through vproc. This is still a pain, but a much lesser one, and you get some better structure out of it. (I believe vproc got into the Linux kernel at some point.) However, this alone doesn't do the job for IO, and doesn't create a single namespace for sockets, shared segments, and so on; all that has to be done independently. But those aspects are generally more localized than process manipulation, so you have fixed a significant problem.

Other implementations, like the Virtual Iron one and, I believe, the one by ScaleMP, take a different tack. They concentrate on shared memory first, implementing a distributed shared memory model of some sort (exactly what I'm not sure). That gets you the function of shared memory across nodes, which is a necessity for any full kernel SSI implementation and does appear in Locus, too. But it doesn't cover everything, of course.

Enough with all the background. The next post gets to the ultimate point: If it's so great, why hasn't it taken over the world?

Multi-Multicore Single System Image / Cloud Computing. A Good Idea? (3)

This is part 3 of a multi-post sequence on this topic which began here. This part recounts some implementations and history.

History and Examples

The earliest version of this concept that I'm aware of was the Locus project at UCLA, led by Jerry Popek. It had a long, and ultimately frustrating, history.

The research project was successful. It created a version of UNIX that ran across several computers (VAXen) and as a system didn't go down for something over a year of use by students. The project was funded by IBM, and that connection led to key elements being ported experimentally to IBM's AIX in the late 1980s; I saw AIX processes migrating between nodes in 1989-90. Then an IBM executive, in a fit of magnanimity and/or cluelessness, gifted all the IP back to the Locus project. They went off and founded Locus Computing Corporation (LCC). From there it was used on an Intel Paragon (a massively parallel supercomputer), which says something about scalability. At one point it was also part of an attempt in IBM to make mainframes and desktops (PS/2s) together look like one big AIX (UNIX) system. Feel free to boggle at that concept. It did run.


But LCC ran into the boughts: They were bought by Tandem, where the code was productized as Tandem NonStop Clusters, which as far as I know shipped in small numbers. Tandem was bought by Compaq, which bought DEC, so the code was ported to DEC's Alpha, and almost used in DEC's brand of Unix before DEC was snuffed. Then Compaq was bought by HP, and shortly thereafter this code stream came perilously close to being part of HP-UX, HP's Unix: In 2004, there was a public announcement that it would be in the next major HP-UX release. About a month later, it was officially de-committed. It also saw use in a PRPQ (limited-edition product) from IBM that was called POWER4. At some point it also almost became part of SCO's UnixWare.

This is a portrait of frustration for all involved.

The Locus thread, however, isn't the only implementation or planned product of full single system image. There's an open-source implementation now called MOSIX and based on Linux that you can download and try, designed specifically to support HPC. It also has a long history, starting life as MOS in the late 70s at The Hebrew University of Jerusalem. It doesn't distribute absolutely every element of the OS, but does quite enough to be useful.

Sun Microsystems published a fair amount of material about a version for Solaris, which it called Full Moon, including a roadmap indicating complete single-system-image in 1999. It hasn't happened yet, obviously. The most recent news I found about it was that some current variation would go open source (CDDL) in 2007. The odd name, by the way, was chosen because full moons make wolf packs howl, and Microsoft's cluster support was codenamed Wolfpack. ("Wolfpack" referred to my analogy of clusters being like packs of dogs in In Search of Clusters. "Dogpack" wouldn't have had quite the same connotations. But there's no mythological multi-headed wolf to play the part of the multiprocessor in my analogy.)

ScaleMP today sells a multi-multicore single system image product, vSMP foundation, aiming to ride on the speed of modern higher-performance interconnects like InfiniBand (and likely Data Center Ethernet or equivalent), but my understanding is that vSMP is hardware-assisted, a somewhat different subject. Similarly, Virtual Iron was at one point also pushing an ability to glue multiple multi-cores into one larger SMP-like system, with hardware assist, but that has apparently been dropped. Both of these, unlike prior efforts, have some flavor of virtualization to them and/or their implementation.

I wouldn't be at all surprised to learn that there are others who have heard the seductive Siren call of SSI.

So much for history. Implementation comments in the next post.

Tuesday, January 20, 2009

Multi-Multicore Single System Image / Cloud Computing. A Good Idea? (2)

This is part 2 of a multi-post sequence on this topic that began here. This post describes in more detail what this concept actually is.

What It Is

The figure below illustrates the conventional approach to building a cluster / farm / cloud / data center / whatever you want to call a bunch of computers working more-or-less together.

[Figure: the conventional approach, with each system running its own full software stack above the hardware and interconnect]

Each system, above the hardware and its interconnect (black line), has its own independent copy of:

  • an operating system,
  • various networking basics (think DNS and the like),
  • some means of accessing common data storage (like a distributed file system),
  • assorted application-area tools (like Apache, MapReduce, MPI)
  • and various management tools to wrangle all the moving parts into some semblance of order.

Each of those has its own inter-instance communication (colored lines), which has to be made to work, too.

The order in which I've stacked the elements above the operating system can be debated; feel free to pick your own, since that's not the point. The point is that there is a collection of separate elements to collect, install, manage, and train administrators and developers to use, all the while hoping the elements will play nicely with each other in this particular combination. The pain of doing this has lessened over time as various stacks have become popular, and hence well-understood, but it's still building things out of an erector set each time, carefully keeping in mind how each piece has to slot into another. Intel even has a Cluster Ready specification and configuration checker whose sole purpose is to avoid many of the standard shoot-yourself-in-the-foot situations when building clusters.

That just leaves one question: Why did I use that cloudy shape to represent the operating system? Because the normal function of an operating system is to turn the content of one node into a pool of resources, which it manages. This is particularly obvious when each node system is a multicore (multiprocessor): The collection of cores and threads is obviously a resource pool. But then again, so is the memory. It's perhaps less obvious that this is happening when you just consider the seemingly singular items attached to a typical system, but look closer and you find it, for example in network bandwidth and disk space.

I also introduced a bit of cosmetic cloudiness to set things up for this figure, by comparison:

[Figure: the same collection of hardware, but with a single operating system instance spanning all the nodes]

Here, you have one single instance of an operating system that runs the whole thing, exactly like one operating system pools and allocates all the processors, memory, and other resources in a multicore chip or multiprocessor system. On that single operating system instance, you install, once, one (just one) set of the application tools, just like you would install it on one node. Since that tool set is then on the (singular) operating system, it can be used on any node by any application or applications.

The vision is of a truly distributed version of the original operating system, looking exactly the same as the single-node version as far as any administrator, user, or program can tell: It maintains all the system interfaces and semantics of the original operating system. That's why I called it a Kernel-level Single System Image in In Search of Clusters, as opposed to a single-system image at the file system level, or at lower hardware levels like memory.

Consider what that would mean (using some Linux/Unix terminology for the examples):

  • There's one file system with one root (including /dev), one process tree, one name space for sockets & shared segments, etc. They're all exactly the same as they are on the standard single-node OS. All the standard single-system tools for managing this just work.
  • There is inter-node process migration for load balancing, just like a multicore OS moves processes between cores; some just happen to be on other nodes. (Not as fast, obviously.) Since that works for load balancing, it can also be used to avoid planned outages: Move all the work over to this node, then bring down the just-emptied node.
  • Application code doesn't know or care about any of this. It never finds out it has been migrated, for example (even network connections). It just works, exactly the way it worked on a single system. The kernel interface (KPI) has a single system image, identical to the interface on a single-node operating system.

With such a system, you could:

  • add a user – once, for all nodes, using standard single-OS tools; there's no need to learn any new multi-system management tools.
  • install applications and patches – once, for all nodes, ditto.
  • set application parameters – once, for all nodes, ditto.

Not only that, it can be highly available under hardware failures. Believing this may take a leap of faith, but hang on for a bit and I'll cite the implementations which did it. While definitely not trivial, you can arrange things so that the system as a whole can survive the death of one, or more, nodes. (And one implementation, Locus, was known for doing rational things under the ugly circumstance of partitioning the cluster into multiple clusters.) Some folks immediately assume that one node has to be a master, and losing it gets you in big trouble. That's not the case, any more than one processor or thread in a multicore system is the master.

Note, however, that all this discussion refers to system availability, not application availability. Applications may need to be restarted if the node they were on got fried or drowned. But it provides plumbing on which application availability can be built.

So this technique provides major management simplification, high availability, and scale up in one blow. This seems particularly to win for the large number of small installations that don't have the budget to spend on an IT shop. Knowing how to run the basic OS is all you need. That SMB (Small and Medium Business) target market seems always to be the place for high growth, so this is smack dab on a sweet spot.

Tie me to the mast, boys, the Sirens are calling. This is a fantastically seductive concept.

Of course, it's not particularly wonderful if you make money selling big multiprocessors, like SGI, IBM, HP, Fujitsu, and a few other "minor" players. Then it's potentially devastating, since it appears that anybody can at least claim they can whip up a big SMP out of smaller, lower-cost systems.

Also, just look at that figure. It's a blinkin' cloud! All you need do is assume the whole thing is accessed by a link to the Internet. It turns a collection of computers into a uniform pool of resources that is automatically managed, directly by the (one) OS. You want elastic? We got elastic. Roll up additional systems, plug them in, it expands to cover them.

{Well, at least it's a cloud by some definitions of cloud computing; defining "cloud computing" is a cottage industry. Of course, for elasticity there's the little bitty problem that the applications have to elasticize, too, but that's just as true of every cloud computing scenario I'm aware of. The same issue applies to high availability.}

So, if it's so seductive, to say nothing of reminiscent of today's hot cloud buzz – and for many, but not all the same reasons – why hasn't it taken over the world?

That's the big question. Before tackling that, though, let's fill in some blanks. Can it actually be done? The answer to that is an unequivocal yes. It's been done, repeatedly. How is it done? Well, not trivial, but there are structuring techniques that help.

That's next. First, the history.

Multi-Multicore Single System Image / Cloud Computing. A Good Idea? (part 1)

Wouldn't it be wonderful if you could simply glue together several multicore systems with some software and make the result look like one bigger multicore system?

This sounds like something to make sysadmins salivate and venture capitalists trample each other in a rush to fund it. I spent a whole chapter of In Search of Clusters, both editions, explaining why I thought it was wonderful. Some of these reasons are echoed in today's Cloud Computing hubbub, particularly the ability to use a collection of computers as a seamless, single resource. The flavor is very different, and the implementation totally different, but that intent and the effects are the same.

Unfortunately, while this has been around for quite a while, it has never really caught on despite numerous very competent attempts. One must therefore ask the embarrassing question: Why? What kind of computer halitosis does it have to never have been picked up?

That's the subject of this series of posts. I'm going to start at the beginning, explaining what it really is (and why it's not unlike cloud computing); then describe some of the history of attempts to get it on the market; some of the technology that underlies it; and finally take a run at the central question: If it's so wonderful, can be implemented, and has appeared in products, why hasn't it taken over the world?

Doing this will span several posts; it's a long story. (This time I promise to link them in reading order.)

Note: This is about a purely software approach to multiplying the multi of multicore; it's done by distributing the operating system. Plain, ordinary, communications gear – Ethernet, likely, but whatever you like – is all that connects the systems. Using special hardware to pull the same trick will be the subject of later posts, as may techniques to distribute other entities to the same effect, like a JVM (Java Virtual Machine).

Next, a discussion of what this really is.

Sunday, January 18, 2009

A Redirection

I'm back. Been a while.

Sorry for the long delay. In large part this was due to a readjustment of my intentions and directions for this blog, and ultimately for a new book. It finally became clear to me that the overall negative message I was carrying had two undesirable effects:

  • "You are toast" is not a message likely to motivate most people to read a lot of material. Positive messages sell better.
  • It leaves me with less to talk about. The one focused negative theme made it difficult to justify talking about the range of topics I want to comment on, which is wider than just multicore. Realistically, I think this was the primary issue that turned me around.

You may have noticed my new blog subtitle, which I modified to reflect the new orientation. I didn't see any reason to change the title, though. Parallel is still perilous, but you'll be in less peril if you understand it, which is where I and this blog come in.

So you'll find a lot less haranguing about how multicore is a collective big wish that everything will still be just fine, and more about lessons learned and discussions of areas where I see broad confusion.

So, that's the intent from now on. You may, however, be interested or amused by a related random dollop of irony that I fetched up against.

Shortly after I decided to make this change, I just happened to catch The Dark Secret of Hendrik Schön on the Science Channel. Overall, the show is about how Schön was an uber-wunderkind of physics, publishing great discoveries at a ferocious rate, until he was fired in disgrace from Bell Labs (1). Saying more about Schön would be a spoiler and isn't particularly relevant to my point here, which is:

The show starts with a fairly long black-and-white segment that moves about a deserted, rundown city filled with ugly old industrial buildings, paper blowing in the streets, and an occasional apparently homeless person pushing a shopping cart stacked high with trash. While this dystopia rolls by, the narration breathlessly prophesies all the horrors that will unfold across all of modern society if one, specific, terrible technical tragedy occurs.

The tragedy?

Moore's Law runs out.

_________________________________________________________


 

1. I wasn't even aware that Bell Labs still existed. Apparently it does, still doing work in nanotechnology among other areas. This relates to why the show description in the link above goes gaga over Grey Goo. Grey Goo was not his "dark secret."

Thursday, September 18, 2008

Background: Me and This Blog

Some people still remember me as the author of In Search of Clusters, as I found out when I began posting to the Google group cloud-computing. A hearty Thank You! to all of them. For the rest of you, why not buy a copy? It's still in print, and I still get royalties. :-) (Really only a semi-smiley. Royalties are good.)

For the rest of you, and as a reminder, here’s a short bio:

I was until recently a Distinguished Engineer in the IBM Systems & Technology Group in Austin, Texas. I’m the author of In Search of Clusters, currently in its second edition and still occasionally referred to as “the bible of clusters” even seven years after its publication. I was also, back in the '80s, Chief Scientist of the RP3 massively-parallel computing project in IBM Research (joint with NYU). My job in Austin was to be a leader of future system architecture work, particularly in the areas of accelerators and appliances, and I was chair of the InfiniBand industry standard management workgroup. I've worked on parallel computing for over 30 years, and hold something around 30 patents in parallel computing and computer communications.

I’m currently retired, living in Austin, TX, and intending to move to Colorado (just North of Denver), where I've been invited to be part-time on the faculty of CSU - as soon as I can sell my house. That was supposed to be over and done three or four months ago, but things have ground to a trickle now that buyers find it virtually impossible to get a mortgage. Austin was pretty immune to housing slowdowns until that happened. Dang. Anyway.

In the meantime, I’m working on another book. Its working title is The End of Computing (As We Knew It), and my intention with it is to expose the hole into which the computer industry may be plunging by turning to explicit parallelism. I'm not implying there's any other realistic choice, mind you, but I got a snoot-full while in IBM of how (a) hardware engineers haven’t a clue what a bitch that is to program; (b) most software engineers haven’t a clue that this is being done to them; (c) many upper-management and analyst types have their heads firmly stuck in the sand on what this all will mean.

It particularly bugs me that people still blather on about how Moore's Law will keep on trucking for decades. Maybe it will, interpreted literally. But the Moore's Law that will keep on trucking has been castrated. It lacks a key element (frequency scaling) that drove the computing industry for the last four or more decades. This is a classic case of experts focussing on the veins in the leaves on the trees and ignoring the ravine they're about to fall into. 

More. I have this suspicion that many people who really understand how deep into the doodoo we're going are weasel-wording it deliberately. No point in frightening the hoi polloi, now, is there? Maybe there's a cure, who knows, we're not there yet, hm? Horsepuckey. 

Or else they're just scared to death and hoping like crazy they'll wake up one morning and somebody will have solved the problem. Unfortunately, that ignores 30 years of history, my friends.

That's the topic. Obviously, I feel strongly about it. This is good. It's motivational. 

But...

I am sorry, and somewhat ashamed, to have to say that I’ve made nearly no progress on actually writing the book over the last year. I've messed around starting a couple of chapters, and have an outline, but that's all. 

I have been spending a lot of time just keeping up with what’s being said all around this subject, and have amassed a huge number of web references and comments – 600+ MB of them, in fact. Stitching it all together is, however, requiring more focused continuous effort than retirement with a decent pension and waiting for a house to sell seems to induce. (Why do it today? There’s always tomorrow.)

It finally occurred to me, through a devious chain of events, that it might help make some of my data collection and musings public.

Hence this blog.

You can expect to find here comments on things like:

  • Semiconductor technology – in a simple way. I’m no expert in this area, but it’s the basis of the whole issue.
  • What it takes to make parallel computer hardware, like SMPs and clusters (farms, etc.), interconnects and communications issues in particular.
  • Issues and difficulties involved in programming that hardware, and what has been learned in about 40 years of experience in the HPC arena (a point seemingly lost in most discussions).
  • Ways to make use of that hardware that don’t require explicit parallel programming by the mass of programmers, like virtualization, transaction processing.
  • Yet Another Way to use all those transistors: Accelerators. And why they may now finally have legs and stay around for a while (hint: you can’t win against a 45% CAGR increase in clock speed).
  • Graphics accelerators and, in particular, Larrabee vs. Nvidia / ATI(AMD). Why it’s needed. Who’s betting on which characteristic, whether they know it or not. (Lessons from HPC.)
  • Possible killer apps for parallel systems, like virtual worlds, graphics, stream processing, cloud computing (sort of), grids (sort of).
  • What this all means to the industry.

Readers of In Search of Clusters will notice a lot of familiar material there, and a lot that's new. One big difference is that I'm going to try to aim for a more popular audience this time, trying to make it more accessible to more people. There are two reasons for this: First, I think the topic needs to be brought more out into the open, outside the confines of the industry. Second, that sells lots more copies of any book.

In this blog I’ll probably also grouch and rave about some things that have nothing to do with any of the above, just to clear my head now and again. Like a short lesson in why not to try to carry a Tai Chi (Taijiquan) jian, a three-foot straight sword, when flying from Beijing to Shanghai. Or my struggles with using Word to produce a book manuscript, which I apparently have to do these days.

I'm also newly twitter-ified (twitified?) as GregPfister, and will try to keep that channel stuffed, too. But this will be where the data is, of necessity.

Oh, and anybody know what the situation is on copyright for blog contents? Until I know otherwise, I’m going to have to be really conservative about posting excerpts of work in progress on the new book.

Anyway, Hi!

I'm here.

(tap, tap) This thing on?

Anybody listening?