The kernel lock validator
A race condition like the one described above may be a one-in-a-million possibility, but, with computers, it does not take too long to exercise a code path a million times. Sooner or later, a system containing this sort of bug will lock up, leaving its users wondering what is going on. To avoid this sort of situation, kernel developers try to define rules for the order in which locks should be acquired. But, in a system with many thousands of locks, defining a comprehensive set of rules is challenging at best, and enforcing them is even harder. So locking bugs creep into the kernel, lurk until some truly inconvenient time, and eventually surprise some unsuspecting user.
Over time, the kernel developers have made increasing use of automated code analysis tools as they have become available. The latest such tool is the first version of the lock validator patch, posted by Ingo Molnar. This patch (a 61-part set, actually) adds a complex infrastructure to the kernel which can then be used to prove that none of the locking patterns observed in a running system could ever deadlock the kernel.
To that end, the lock validator must track real locking patterns in the kernel. There is no point, however, in tracking every individual lock - there are thousands of them, but many of them are treated in exactly the same way by the kernel. For example, every inode structure contains a spinlock, as does every file structure. Once the kernel has seen how locking is handled for one inode structure, it knows how it will be handled for every inode structure. So, somehow, the lock validator needs to be able to recognize that all spinlocks contained within (for example) the inode structure are essentially the same.
To this end, every lock in the system (including rwlocks and mutexes, now) is assigned a specific key. For locks which are declared statically (for example, files_lock, which protects the list of open files), the address of the lock is used as the key. Locks which are allocated dynamically (as most locks embedded within structures are) cannot be tracked that way, however; there may be vast numbers of addresses involved, and, in any case, all locks associated with a specific structure field should be mapped to a single key. This is done by recognizing that these locks are initialized at run time, so, for example, spin_lock_init() is redefined as:
# define spin_lock_init(lock)					\
do {								\
	static struct lockdep_type_key __key;			\
								\
	__spin_lock_init((lock), #lock, &__key);		\
} while (0)
Thus, for each lock initialization, this code creates a static variable (__key) and uses its address as the key identifying the type of the lock. Since any particular type of lock tends to be initialized in a single place, this trick associates the same key with every lock of the same type.
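As a purely illustrative sketch (the structure and function below are made up; only spin_lock_init() itself comes from the kernel), consider a dynamically-allocated object with an embedded spinlock. Every object created at this one initialization site shares the single static __key generated by the macro, so the validator treats all of their locks as instances of the same type:

#include <linux/slab.h>
#include <linux/spinlock.h>

/* A hypothetical structure with an embedded lock. */
struct my_object {
	spinlock_t lock;
	int data;
};

static struct my_object *new_object(void)
{
	struct my_object *obj = kmalloc(sizeof(*obj), GFP_KERNEL);

	/* Every object allocated here gets the same key, because this is
	 * the single spin_lock_init() call site for this lock. */
	if (obj)
		spin_lock_init(&obj->lock);
	return obj;
}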
Next, the validator code intercepts every locking operation and performs a number of tests:
- The code looks at all other locks which are already held when a new
lock is taken. For each of those locks, the validator looks for a past
occurrence where any of them was taken after the new lock. Any such
occurrence indicates a violation of locking order rules, and an
eventual deadlock; a minimal example of such an inversion follows this list.
- A stack of currently-held locks is maintained, so any lock being
released should be at the top of the stack; anything else means that
something strange is going on.
- Any spinlock which is acquired by a hardware interrupt handler can
never be held when interrupts are enabled. Consider what happens when
this rule is broken. A kernel function, running in process context,
acquires a specific lock. An interrupt arrives, and the associated
interrupt handler runs on the same CPU; that handler then attempts to
acquire the same lock. Since the lock is unavailable, the handler
will spin, waiting for the lock to become free. But the handler has
preempted the only code which will ever free that lock, so it will
spin forever, deadlocking that processor.
To catch problems of this type, the validator records two bits of information for every lock it knows about: (1) whether the lock has ever been acquired in hardware interrupt context, and (2) whether the lock is ever held by code which runs with hardware interrupts enabled. If both bits are set, the lock is being used erroneously and an error is signaled.
- Similar tests are made for software interrupts, which present the same problems.
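To make the ordering check concrete, here is a minimal, hypothetical example (the locks and functions are invented for illustration): two code paths which acquire the same pair of locks in opposite orders are enough for the validator to complain, even if the test run never actually deadlocks.

#include <linux/spinlock.h>

static DEFINE_SPINLOCK(lock_a);
static DEFINE_SPINLOCK(lock_b);

void path_one(void)
{
	spin_lock(&lock_a);	/* A first ... */
	spin_lock(&lock_b);	/* ... then B: the A -> B ordering is recorded */
	/* ... critical section ... */
	spin_unlock(&lock_b);
	spin_unlock(&lock_a);
}

void path_two(void)
{
	spin_lock(&lock_b);	/* B first ... */
	spin_lock(&lock_a);	/* ... then A: the validator reports the inversion */
	/* ... critical section ... */
	spin_unlock(&lock_a);
	spin_unlock(&lock_b);
}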
The interrupt tests are relatively straightforward, requiring just four bits of information for each lock (though the situation is a little more complicated for rwlocks). But the ordering tests require a bit more work. For every known lock key, the validator maintains two lists. One of them contains all locks which have ever been held when the lock of interest (call it L) is acquired; it thus contains the keys of all locks which might be acquired before L. The other list (the "after" list) holds all locks acquired while L is held. These two lists thus encapsulate the order in which those other locks should be acquired relative to L.
Whenever L is acquired, the validator checks whether any lock on the "after" list associated with L is already held. It should not find any, since all locks on the "after" list should only be acquired after acquiring L. Should it find a lock which should not be held, an error is signaled. The validator code also takes the "after" list of L, connects it with the "before" lists of the currently-held locks, and convinces itself that there are no ordering or interrupt violations anywhere within that chain. If all the tests pass, the validator updates the various "before" and "after" lists and the kernel continues on its way.
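One way to picture the state the validator keeps for each key is something like the following sketch; the field names here are invented, and the real data structures in the patch differ, but the idea is the same:

#include <linux/list.h>

/* Illustrative per-key state; the validator's actual structures differ. */
struct lock_type_state {
	const void *key;		/* the static key identifying this lock type */
	struct list_head before;	/* types ever held when this one was acquired */
	struct list_head after;		/* types ever acquired while this one was held */
	unsigned int used_in_hardirq:1;	/* ever taken from a hardware interrupt handler */
	unsigned int hardirq_unsafe:1;	/* ever held with hardware interrupts enabled */
	unsigned int used_in_softirq:1;	/* the same pair of bits for software interrupts */
	unsigned int softirq_unsafe:1;
};
/* An error is signaled if both bits of either interrupt pair ever get set. */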
Needless to say, all this checking imposes a certain amount of overhead; it is not something which one will want to enable on production kernels. It is not quite as bad as one might expect, however. As the kernel does its thing, the lock validator maintains its stack of currently-held locks. It also generates a 64-bit hash value from that series of locks. Whenever a particular combination of locks is validated, the associated hash value is stored in a table. The next time that lock sequence is encountered, the code can find the associated hash value in the table and know that the checks have already been performed. This hashing speeds the process considerably.
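The caching can be sketched roughly as follows (the table, its size, and the function are illustrative only, and collision handling is ignored): the keys of the currently-held locks are hashed, and the full checks are skipped if that hash has already been seen.

#include <linux/types.h>

#define CHAIN_TABLE_SIZE 8192	/* purely illustrative size */

static u64 validated_chains[CHAIN_TABLE_SIZE];

/* Return nonzero if this held-lock sequence (identified by its hash)
 * has been validated before; otherwise remember it so that the full
 * checks need only run once per chain. */
static int chain_already_validated(u64 chain_hash)
{
	unsigned int slot = chain_hash % CHAIN_TABLE_SIZE;

	if (validated_chains[slot] == chain_hash)
		return 1;
	validated_chains[slot] = chain_hash;
	return 0;
}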
Of course, there are plenty of exceptions to the locking rules as understood by the validator. As a result, a significant portion of the validator patch set is aimed at getting rid of false error reports. For example, the validator normally complains if more than one lock with the same key is held at the same time - doing so is asking for deadlocks. There are situations, however, where this pattern is legitimate. For example, the block subsystem will often lock a block device, then lock a partition within that device. Since the partition also looks like a block device, the validator signals an error. To keep that from happening, the validator implements the notion of lock "subtypes." In this case, locks on partition devices can be marked with a different subtype, allowing their usage to be validated properly. This marking is done by using new versions of the locking functions (spin_lock_nested(), for example) which take a subtype parameter.
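Here is a hedged illustration of how the annotation might look, loosely modeled on the partition case (the structure and function are invented; spin_lock_nested() and SINGLE_DEPTH_NESTING come from the patch set and its follow-up fixes):

#include <linux/spinlock.h>

/* A hypothetical device structure; the whole disk and each partition embed
 * a lock initialized at the same place, so their locks share one key. */
struct blkdev_like {
	spinlock_t lock;
};

static void lock_disk_then_partition(struct blkdev_like *disk,
				     struct blkdev_like *part)
{
	spin_lock(&disk->lock);
	/* Same key as disk->lock; the subtype tells the validator that
	 * this particular nesting is intentional. */
	spin_lock_nested(&part->lock, SINGLE_DEPTH_NESTING);
	/* ... */
	spin_unlock(&part->lock);
	spin_unlock(&disk->lock);
}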
The lock validator was added to 2.6.17-rc5-mm1, so interested people can play with it. Waiting for another -mm release might not be a bad idea, however; there has since been a fairly long series of validator fixes posted.
The key point behind all of this is that deadlock situations can be found
without having to actually make the kernel lock up. By watching the
sequences in which locks are acquired, the validator can extrapolate a much
larger set of possible sequences. So, even though a particular deadlock
might only happen as the result of unfortunate timing caused by a specific
combination of strange hardware, a rare set of configuration options, 220V
power, a slightly flaky video controller, Mars transiting through Leo, an
old version of gcc, an application which severely stresses the
system (yum, say), and an especially bad Darl McBride hair day,
the validator has a good chance of catching it. So this code should result
in a whole class of bugs being eliminated from the kernel code base; that
can only be a good thing.
| Index entries for this article | |
|---|---|
| Kernel | Development tools/Kernel debugging |
| Kernel | Lockdep |
| Kernel | Spinlocks |
Posted May 31, 2006 3:28 UTC (Wed)
by walken (subscriber, #7089)
[Link] (2 responses)
That way, the lock ranks double as documentation of the lock ordering.
(I'm using this quite successfully at work for a kernel-type project, though not related to linux).
Posted May 31, 2006 6:22 UTC (Wed)
by dlang (guest, #313)
[Link] (1 responses)
remember basic and line numbers? no matter how you started your numbering scheme you eventually ran into trouble.
David Lang
Posted Jun 9, 2006 4:13 UTC (Fri)
by slamb (guest, #1070)
[Link]
I also work in a project which assigns ranks manually.
Disadvantage: especially given the number of locks in the Linux kernel, it'd be a large initial
effort to go through and create a ranking consistent with all the approved locking orders. (On the
other hand, not doing it probably means struggling with this list over and over without it actually
being written down anywhere.)
Advantage: you wouldn't even need to run through two code paths to know that they
can deadlock - just the one you've defined as wrong.
Advantage: it'd remove all the heaviness of the lock validator. There's no need for
the "before" and "after" lists; the rank defines them. The stack could be just a bit array - small
enough that it's probably practical to preallocate its full extent on thread creation. Thus, the lock
validator would not add potential out of memory failures, and it'd run quickly. There'd be no
reason not to always have it on.
(Actually, I'm not sure how strict the same-key locking stuff (spin_lock_nested
and such) is. To track those, a fancier stack might still be necessary.)
Posted May 31, 2006 6:25 UTC (Wed)
by Los__D (guest, #15263)
[Link] (5 responses)
What does that #lock in the call do?
Dennis
Posted May 31, 2006 6:33 UTC (Wed)
by Los__D (guest, #15263)
[Link]
Posted Jun 1, 2006 10:20 UTC (Thu)
by Octavian (guest, #7462)
[Link] (3 responses)
Posted Nov 29, 2006 3:05 UTC (Wed)
by zlo (guest, #41952)
[Link] (2 responses)
Posted Nov 30, 2006 17:27 UTC (Thu)
by bronson (subscriber, #4806)
[Link] (1 responses)
Posted Jul 24, 2012 14:29 UTC (Tue)
by chrisr (subscriber, #83108)
[Link]
Posted May 31, 2006 6:28 UTC (Wed)
by flewellyn (subscriber, #5047)
[Link] (14 responses)
We also learned that schedulers could not run in constant time.
That's twice now that Ingo Molnar has basically rewritten the literature on us.
Posted May 31, 2006 6:54 UTC (Wed)
by smurf (subscriber, #17840)
[Link] (12 responses)
Oh, and the scheduler doesn't run in constant time. It just delegates the job to small trivial chunks that run when each process starts+stops using the CPU, so technically it's still O(n), not O(1).
Posted May 31, 2006 7:17 UTC (Wed)
by mingo (guest, #31122)
[Link] (8 responses)
In other words, we've reduced a highly probabilistic parallel QA (and manual code review) task to a comparatively _much_ simpler task of serial code coverage.
But yes, to increase the 'reach' of the formal proof of correctness we need to increase code coverage during testing. Projects like LTP do that systematically, they use kernel instrumentation and visualization tools to see which code wasnt executed yet and create testcases to trigger it. Plus chaos helps us too: if LTP doesnt trigger it, users will eventually - and will likely not trigger the deadlock itself, but only the codepath that makes it possible. (hence there is much better debug info and a still working system.)
Put differently: it's not bad idea to run through every possible branch of code at least once anyway ;-) If some code is never triggered in practice, why is the code there and how was it tested to begin with? There is a world of a difference between code that is never executed even once and narrow 1-instruction races that need many CPUs to even trigger.
So i'd say "total locking correctness" is not an impossible dream anymore, but is attached to another effort (code coverage) that we want to make as complete as possible anyway.
Posted May 31, 2006 14:21 UTC (Wed)
by jreiser (subscriber, #11027)
[Link] (5 responses)
First, probably it never has been tested systematically; this problem is endemic to the Linux development process. Second, the problem is not only "branches of code [coverage of basic blocks]." The problem is conditional execution paths through multiple basic blocks, including re-convergent fanout (correlated conditions) and partial correctness with "don't care" conditions.
Posted May 31, 2006 14:34 UTC (Wed)
by mingo (guest, #31122)
[Link] (3 responses)
i think the answer to that is in the section you apparently overlooked:
Projects like LTP do that systematically, they use kernel instrumentation and visualization tools to see which code wasnt executed yet and create testcases to trigger it. Plus chaos helps us too: if LTP doesnt trigger it, users will eventually - and will likely not trigger the deadlock itself, but only the codepath that makes it possible. (hence there is much better debug info and a still working system.)
and thus this IMO misleading proposition results in a misleading conclusion:
this problem is endemic to the Linux development process.
;-)
Posted May 31, 2006 15:12 UTC (Wed)
by jreiser (subscriber, #11027)
[Link] (2 responses)
Posted May 31, 2006 15:55 UTC (Wed)
by mingo (guest, #31122)
[Link] (1 responses)
the LTP testsuite's 'testcases/' directory sports 7000+ files, most of which are individual testcases. Testcase files often contain more than 1 testcase. LTP is being run not "before each release" but on Linus' nightly GIT trees - and yes, it's all automated.
furthermore, there are random automated testing efforts as well like scrashme, which can (and do) hit bugs by chance as well.
while i dont claim that LTP is perfect (if it were we'd have no bugs in the kernel), it is certainly a lot more than "a drop in the bucket".
but i digress ...
Posted Jun 6, 2006 19:34 UTC (Tue)
by Blaisorblade (guest, #25465)
[Link]
nanosleep() tests aren't going to catch many bugs - nanosleep() uses are very simple.
But, say, run LTP on every possible drivers set, with a dmraid setup on "normal" RAID volumes on IDE and SCSI and SATA physical disks (since many bugs are in hardware drivers)...
Or combine networking tests with analysis of packet capture data and match sent data with on-wire data (supposing it's possible - you actually need a full TCP/IP stack to run on captured data, with additional correctness checking features)...
Then you'll find real bugs. However this starts being difficult.
Another possibility is to extract testcases from atypical applications.
UserModeLinux (on which I work) has been an excellent test-case against ptrace behaviour.
It found subtle bugs in ptrace existing in three kernel releases, in the recent past (one bug lived for 2.6.9 and 2.6.10, affecting security; the other affected x86-64 from 2.6.16.5 to 2.6.16.19).
But in this case, whoever coded the patches didn't run it.
Posted May 31, 2006 14:54 UTC (Wed)
by mingo (guest, #31122)
[Link]
yes - code coverage and testing is a lot more than just basic coverage of all existing code.
but coverage that triggers locking is a lot simpler than full coverage, because locks are almost always taken in simple ways, without too many branches. So while in theory code like this:
could exist in the kernel and would require 4 billion values of 'mask' to cycle through all the possible locking scenarios, in practice the kernel is full of much simpler locking constructs:
where covering the function once is probably enough to map its locking impact. In fact, "tricky", non-straight locking code is being frowned upon from a review and quality POV, which too works in favor of validation quality. To stay with the function1() example, such code will very likely be rejected at a very early review stage.
and finally, even theoretical full code coverage is a lot simpler than the possible combinations of codepaths on a multiprocessor system that could trigger deadlocks.
naturally, being a dynamic method, the lock validator can only claim correctness about codepaths it actually observes (any other code doesnt even exist for it), and thus it still depends on how frequently (and with what memory state) those codepaths get triggered.
Posted May 31, 2006 15:08 UTC (Wed)
by NAR (subscriber, #1313)
[Link]
If this feature is turned on in the user's kernel - isn't the performance hit too much for users to keep this configuration option turned on?
If some code is never triggered in practice, why is the code there and how was it tested to begin with?
Error handling (especially hardware-related error handling) might be triggered extremely rarely, but I don't think it should be thrown out. The code could be tested by manually setting some error flag, but obviously the setting of this flag is not included in the production kernel.
Posted Jun 8, 2006 8:53 UTC (Thu)
by jmansion (guest, #36515)
[Link]
Hardly. If you can legitimately *ever* lock two objects of the same type then there will be an exception that turns off the 'false positive' that will also turn off checking whether you acquire A then B and also on some code paths B then A. Which is exactly the sort of deadlock you normally get in a database server, for example.
Which is not to say that the approach isn't useful, just that it's not a panacea. It would have been handy to have a compare/contrast with FreeBSD's WITNESS facility.
Posted May 31, 2006 7:55 UTC (Wed)
by mingo (guest, #31122)
[Link] (1 responses)
this statement doesnt parse for me. Sure, if you execute an O(1) operation N times that makes it O(N). What matters is the overhead of context-switching and wakeup, and those are O(1)!
just look at an example: sys_getpid(), the simplest system-call in existence, does:
return current->tgid;
that's as O(1) as it gets - a handful of instructions. But by your argument, since applications may execute getpid() lots of times, in reality it's O(N)? By that argument there would be no code at all that would qualify as O(1) code!
Posted May 31, 2006 8:08 UTC (Wed)
by frankie (subscriber, #13593)
[Link]
Posted May 31, 2006 9:19 UTC (Wed)
by simlo (guest, #10866)
[Link]
Posted May 31, 2006 7:09 UTC (Wed)
by piman (guest, #8957)
[Link]
Posted May 31, 2006 6:59 UTC (Wed)
by mingo (guest, #31122)
[Link] (2 responses)
That means that the "special rules" added for the sake of lockdep (via the use of spin_lock_nested() APIs, etc.) act as a formal specification of the locking rules, and the validator checks those rules against actual behavior.
To stay with the partitions example: if for example some kernel code locks a component partition before a container partition then the validator will detect this deadlock scenario.
So subtypes are not a watering down of the "proof of correctness", they are more like a mechanism to "program by specification" and to let the validator check against that specification. For non-recursive locking, the validator can build the specification automatically.
(besides subtypes, explicit type keys are another way to specify locking semantics. This all gets us one step closer to a more formal programming model and to a more correct kernel.)
Posted May 31, 2006 9:35 UTC (Wed)
by nix (subscriber, #2304)
[Link] (1 responses)
Posted Jun 6, 2006 16:07 UTC (Tue)
by nix (subscriber, #2304)
[Link]
Posted May 31, 2006 8:11 UTC (Wed)
by pointwood (guest, #2814)
[Link] (2 responses)
Posted May 31, 2006 13:10 UTC (Wed)
by kirkengaard (guest, #15022)
[Link]
Posted Jun 1, 2006 5:10 UTC (Thu)
by walken (subscriber, #7089)
[Link]
Posted Jun 1, 2006 6:32 UTC (Thu)
by man_ls (guest, #15091)
[Link] (5 responses)
Posted Jun 1, 2006 13:17 UTC (Thu)
by corbet (editor, #1)
[Link]
Posted Jun 1, 2006 14:14 UTC (Thu)
by arjan (subscriber, #36785)
[Link] (3 responses)
I've lost track but we're well over the count of 20 with real bugs found so far (several of these got fixed in mainline before lockdep got posted the other day so it's a bit hard to count)
Posted Jun 2, 2006 0:33 UTC (Fri)
by linportal (guest, #38148)
[Link] (1 responses)
Posted Jun 3, 2006 11:59 UTC (Sat)
by mingo (guest, #31122)
[Link]
Posted Jun 4, 2006 3:54 UTC (Sun)
by nevets (subscriber, #11875)
[Link]
And as Ingo already pointed out, the fishy-but-happens-to-be-ok cases are now nicely documented by the workarounds in (in akpm's words) the make-lockdep-shut-up patches. ;)
The kernel lock validator
This may be a dumb question, but why not assign an explicit rank to every lock type?
The kernel lock validator
the list of locks is not static.
The kernel lock validator
The list of lock keys is static, and that's what would get assigned ranks. I quote from the article:
Locks which are allocated dynamically (as most locks embedded within structures are) [...] should be mapped to a single key. [...] Thus, for each lock initialization, this code creates a static variable (__key) and uses its address as the key identifying the type of the lock. Since any particular type of lock tends to be initialized in a single place, this trick associates the same key with every lock of the same type.
OT: The # operator?
I thought that I knew C pretty good, but I NEVER saw a # operator before (except for preprocessor statements).
OT: The # operator?
Doh, just preprocessor argument->string conversion, nevermind, too early in the morning for me ;)
aka "stringification"OT: The # operator?
Tnx very useful info ofr me!
OT: The # operator?
Spam comment
The parent post is clearly spam. Is this the first spammy comment to remain on LWN for any significant amount of time?
The kernel lock validator
Y'know, it's funny. Back in college, in my Operating Systems course, we learned that it was impossible to design code which would reliably detect all possible deadlock conditions in a kernel.
The kernel lock validator
Umm, the validator doesn't detect all *possible* deadlock conditions -- just those that, given current and past locking history, might occur. It won't find tomorrow's deadlock if the code that could possibly cause it didn't run today.
Re: detecting all possible deadlock conditions
Well, the validator does reduce the incredibly complex QA task of "trigger all possible locking races that can lead to deadlocks, assuming an arbitrary number of CPUs and interrupt sources" to "trigger all locking codepaths at least once".
Put differently: it's not bad idea to run through every possible branch of code at least once anyway ;-) If some code is never triggered in practice, why is the code there and how was it tested to begin with?
testing and code coverage
First, probably it never has been tested systematically;
testing and code coverage
LTP has a long way to go
I did not overlook "Projects like LTP do that systematically." The Linux Testing Project is a drop in the bucket: far too little. Given its size, the kernel should have over ten thousand individual tests that are run and analyzed [automation helps!] before each release.
LTP has a long way to go
The problem of LTP is that it, in itself, tests syscall semantics, and many tests are unit tests. They can help in some cases (especially for strange arches, like say UML) - and they're very useful when new functionality is introduced, especially if the implementation is complex (that's my experience - in one project I wrote the implementation together with a very complex testcase for it).
Second, the problem is not only "branches of code [coverage of basic blocks]." The problem is conditional execution paths through multiple basic blocks, including re-convergent fanout (correlated conditions) and partial correctness with "don't care" conditions.
testing and code coverage
void function1(unsigned long mask)
{
	if (mask & 0x00000001)
		spin_lock(&lock0);
	if (mask & 0x00000002)
		spin_lock(&lock1);
	if (mask & 0x00000004)
		spin_lock(&lock2);
	if (mask & 0x00000008)
		spin_lock(&lock3);
	if (mask & 0x00000010)
		spin_lock(&lock4);
	...
	if (mask & 0x80000000)
		spin_lock(&lock31);
}

void function2(unsigned long mask)
{
	spin_lock(&lock);
	...
	spin_unlock(&lock);
}
if LTP doesnt trigger it, users will eventually - and will likely not trigger the deadlock itself, but only the codepath that makes it possible. (hence there is much better debug info and a still working system.)
Re: detecting all possible deadlock conditions
Re: detecting all possible deadlock conditions
>In other words, we've reduced a highly probabilistic
>parallel QA (and manual code review) task to a
>comparatively _much_ simpler task of serial code coverage.
Oh, and the scheduler doesn't run in constant time. It just delegates the job to small trivial chunks that run when each process starts+stops using the CPU, so technically it's still O(n), not O(1).
Re: O(1) scheduler
Re: O(1) scheduler
The O(1) for the scheduler refers to the number of processes. The scheduling time is indeed not dependent on the number of processes. That's all.
The kernel lock validator
The old scheduler was O(number of processes on the run-queue). The scheduler in 2.6 is O(number of priorities) = O(140) = O(1).
The kernel lock validator
In fact it's trivial to identify all functions that could deadlock (if you heard otherwise, maybe you should've been paying more attention in class :). What's impossible is identifying all and only those functions that deadlock.
The kernel lock validator
Small additional thought on the subtype paragraph: while subtypes do fix false positives, their role isnt just to get around exceptions - their role is to correctly map recursive locking semantics.
The kernel lock validator
... and now reiser4 turns up with a tree of locks where each rank may be taken within the rank above safely, but where there are a potentially unbounded number of ranks...
The kernel lock validator
Hm, on rereading that's not what it does. Sorry.
The kernel lock validator
I didn't know that Darl McBride having a bad hair day could make my kernel deadlock - you learn something new every day :p
The kernel lock validator
Yes ... when he can't get it just so, he goes out and kills some butterflies out of frustration, changing the state of the universe's entropy pool, and unpredictably modifying system behavior. The occasional kernel deadlock is merely a symptom, though it poses a supremely difficult debugging challenge. ;)
The kernel lock validator
Just look how Darl can deadlock a court case...
I'm surprised that nobody has asked this before. Solutions that look good in theory are, well, just theoretical; but does it work in practice? I assume that someone, at least the author, has had it running for some time now. How does it behave? Has anyone found potential deadlocks in running code already? Otherwise, are there too few results, too many false positives?
Any results?
Yes, there have been some real deadlocks discussed on the kernel list. See also David Miller's posting, where he notes that it found a number of networking bugs while it was being developed.
Any results?
the ratio of "false positive" and real bugs is about 1:1 so far, where the false positives generally are doing some fishy-but-happens-to-be-ok locking most of the time. The false positives generally require more instrumentation in the code for the exact locking rules, which is good for documentation purposes anywayAny results?
What I have missed in all this is the address to forward detected bugs to. What about this one: kseriod deadlock? Is it real or false positive?
Any results?
That's a false positive, caused by recursive locking in the serio code. Arjan has posted the fix:
Any results?
--- linux/drivers/input/serio/libps2.c.orig
+++ linux/drivers/input/serio/libps2.c
@@ -177,7 +177,7 @@ int ps2_command(struct ps2dev *ps2dev, u
return -1;
}
- mutex_lock(&ps2dev->cmd_mutex);
+ mutex_lock_nested(&ps2dev->cmd_mutex, SINGLE_DEPTH_NESTING);
serio_pause_rx(ps2dev->serio);
ps2dev->flags = command == PS2_CMD_GETID ? PS2_FLAG_WAITID : 0;
the ratio of "false positive" and real bugs is about 1:1 so far, where the false positives generally are doing some fishy-but-happens-to-be-ok locking most of the time.
Any results?
