The kernel lock validator
A race condition like the one described above may be a one-in-a-million possibility, but, with computers, it does not take too long to exercise a code path a million times. Sooner or later, a system containing this sort of bug will lock up, leaving its users wondering what is going on. To avoid this sort of situation, kernel developers try to define rules for the order in which locks should be acquired. But, in a system with many thousands of locks, defining a comprehensive set of rules is challenging at best, and enforcing them is even harder. So locking bugs creep into the kernel, lurk until some truly inconvenient time, and eventually surprise some unsuspecting user.
Over time, the kernel developers have made increasing use of automated code analysis tools as they have become available. The latest such tool is the first version of the lock validator patch, posted by Ingo Molnar. This patch (a 61-part set, actually) adds a complex infrastructure to the kernel which can then be used to prove that none of the locking patterns observed in a running system could ever deadlock the kernel.
To that end, the lock validator must track real locking patterns in the kernel. There is no point, however, in tracking every individual lock - there are thousands of them, but many of them are treated in exactly the same way by the kernel. For example, every inode structure contains a spinlock, as does every file structure. Once the kernel has seen how locking is handled for one inode structure, it knows how it will be handled for every inode structure. So, somehow, the lock validator needs to be able to recognize that all spinlocks contained within (for example) the inode structure are essentially the same.
To this end, every lock in the system (including rwlocks and mutexes, now) is assigned a specific key. For locks which are declared statically (for example, files_lock, which protects the list of open files), the address of the lock is used as the key. Locks which are allocated dynamically (as most locks embedded within structures are) cannot be tracked that way, however; there may be vast numbers of addresses involved, and, in any case, all locks associated with a specific structure field should be mapped to a single key. This is done by recognizing that these locks are initialized at run time, so, for example, spin_lock_init() is redefined as:
# define spin_lock_init(lock)					\
do {								\
	static struct lockdep_type_key __key;			\
								\
	__spin_lock_init((lock), #lock, &__key);		\
} while (0)
Thus, for each lock initialization, this code creates a static variable (__key) and uses its address as the key identifying the type of the lock. Since any particular type of lock tends to be initialized in a single place, this trick associates the same key with every lock of the same type.
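As a purely illustrative sketch (the structure and function below are made up; only spin_lock_init() itself comes from the kernel), consider a dynamically-allocated object with an embedded spinlock. Every object created at this one initialization site shares the single static __key generated by the macro, so the validator treats all of their locks as instances of the same type:

#include <linux/slab.h>
#include <linux/spinlock.h>

/* A hypothetical structure with an embedded lock. */
struct my_object {
	spinlock_t lock;
	int data;
};

static struct my_object *new_object(void)
{
	struct my_object *obj = kmalloc(sizeof(*obj), GFP_KERNEL);

	/* Every object allocated here gets the same key, because this is
	 * the single spin_lock_init() call site for this lock. */
	if (obj)
		spin_lock_init(&obj->lock);
	return obj;
}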
Next, the validator code intercepts every locking operation and performs a number of tests:
- The code looks at all other locks which are already held when a new
lock is taken. For each of those locks, the validator looks for a past
occurrence where any of them was taken after the new lock. Any such
occurrence indicates a violation of locking order rules, and an
eventual deadlock; a minimal example of such an inversion follows this list.
- A stack of currently-held locks is maintained, so any lock being
released should be at the top of the stack; anything else means that
something strange is going on.
- Any spinlock which is acquired by a hardware interrupt handler can
never be held when interrupts are enabled. Consider what happens when
this rule is broken. A kernel function, running in process context,
acquires a specific lock. An interrupt arrives, and the associated
interrupt handler runs on the same CPU; that handler then attempts to
acquire the same lock. Since the lock is unavailable, the handler
will spin, waiting for the lock to become free. But the handler has
preempted the only code which will ever free that lock, so it will
spin forever, deadlocking that processor.
To catch problems of this type, the validator records two bits of information for every lock it knows about: (1) whether the lock has ever been acquired in hardware interrupt context, and (2) whether the lock is ever held by code which runs with hardware interrupts enabled. If both bits are set, the lock is being used erroneously and an error is signaled.
- Similar tests are made for software interrupts, which present the same problems.
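To make the ordering check concrete, here is a minimal, hypothetical example (the locks and functions are invented for illustration): two code paths which acquire the same pair of locks in opposite orders are enough for the validator to complain, even if the test run never actually deadlocks.

#include <linux/spinlock.h>

static DEFINE_SPINLOCK(lock_a);
static DEFINE_SPINLOCK(lock_b);

void path_one(void)
{
	spin_lock(&lock_a);	/* A first ... */
	spin_lock(&lock_b);	/* ... then B: the A -> B ordering is recorded */
	/* ... critical section ... */
	spin_unlock(&lock_b);
	spin_unlock(&lock_a);
}

void path_two(void)
{
	spin_lock(&lock_b);	/* B first ... */
	spin_lock(&lock_a);	/* ... then A: the validator reports the inversion */
	/* ... critical section ... */
	spin_unlock(&lock_a);
	spin_unlock(&lock_b);
}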
The interrupt tests are relatively straightforward, requiring just four bits of information for each lock (though the situation is a little more complicated for rwlocks). But the ordering tests require a bit more work. For every known lock key, the validator maintains two lists. One of them contains all locks which have ever been held when the lock of interest (call it L) is acquired; it thus contains the keys of all locks which might be acquired before L. The other list (the "after" list) holds all locks acquired while L is held. These two lists thus encapsulate the order in which those other locks should be acquired relative to L.
Whenever L is acquired, the validator checks whether any lock on the "after" list associated with L is already held. It should not find any, since all locks on the "after" list should only be acquired after acquiring L. Should it find a lock which should not be held, an error is signaled. The validator code also takes the "after" list of L, connects it with the "before" lists of the currently-held locks, and convinces itself that there are no ordering or interrupt violations anywhere within that chain. If all the tests pass, the validator updates the various "before" and "after" lists and the kernel continues on its way.
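One way to picture the state the validator keeps for each key is something like the following sketch; the field names here are invented, and the real data structures in the patch differ, but the idea is the same:

#include <linux/list.h>

/* Illustrative per-key state; the validator's actual structures differ. */
struct lock_type_state {
	const void *key;		/* the static key identifying this lock type */
	struct list_head before;	/* types ever held when this one was acquired */
	struct list_head after;		/* types ever acquired while this one was held */
	unsigned int used_in_hardirq:1;	/* ever taken from a hardware interrupt handler */
	unsigned int hardirq_unsafe:1;	/* ever held with hardware interrupts enabled */
	unsigned int used_in_softirq:1;	/* the same pair of bits for software interrupts */
	unsigned int softirq_unsafe:1;
};
/* An error is signaled if both bits of either interrupt pair ever get set. */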
Needless to say, all this checking imposes a certain amount of overhead; it is not something which one will want to enable on production kernels. It is not quite as bad as one might expect, however. As the kernel does its thing, the lock validator maintains its stack of currently-held locks. It also generates a 64-bit hash value from that series of locks. Whenever a particular combination of locks is validated, the associated hash value is stored in a table. The next time that lock sequence is encountered, the code can find the associated hash value in the table and know that the checks have already been performed. This hashing speeds the process considerably.
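The caching can be sketched roughly as follows (the table, its size, and the function are illustrative only, and collision handling is ignored): the keys of the currently-held locks are hashed, and the full checks are skipped if that hash has already been seen.

#include <linux/types.h>

#define CHAIN_TABLE_SIZE 8192	/* purely illustrative size */

static u64 validated_chains[CHAIN_TABLE_SIZE];

/* Return nonzero if this held-lock sequence (identified by its hash)
 * has been validated before; otherwise remember it so that the full
 * checks need only run once per chain. */
static int chain_already_validated(u64 chain_hash)
{
	unsigned int slot = chain_hash % CHAIN_TABLE_SIZE;

	if (validated_chains[slot] == chain_hash)
		return 1;
	validated_chains[slot] = chain_hash;
	return 0;
}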
Of course, there are plenty of exceptions to the locking rules as understood by the validator. As a result, a significant portion of the validator patch set is aimed at getting rid of false error reports. For example, the validator normally complains if more than one lock with the same key is held at the same time - doing so is asking for deadlocks. There are situations, however, where this pattern is legitimate. For example, the block subsystem will often lock a block device, then lock a partition within that device. Since the partition also looks like a block device, the validator signals an error. To keep that from happening, the validator implements the notion of lock "subtypes." In this case, locks on partition devices can be marked with a different subtype, allowing their usage to be validated properly. This marking is done by using new versions of the locking functions (spin_lock_nested(), for example) which take a subtype parameter.
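Here is a hedged illustration of how the annotation might look, loosely modeled on the partition case (the structure and function are invented; spin_lock_nested() and SINGLE_DEPTH_NESTING come from the patch set and its follow-up fixes):

#include <linux/spinlock.h>

/* A hypothetical device structure; the whole disk and each partition embed
 * a lock initialized at the same place, so their locks share one key. */
struct blkdev_like {
	spinlock_t lock;
};

static void lock_disk_then_partition(struct blkdev_like *disk,
				     struct blkdev_like *part)
{
	spin_lock(&disk->lock);
	/* Same key as disk->lock; the subtype tells the validator that
	 * this particular nesting is intentional. */
	spin_lock_nested(&part->lock, SINGLE_DEPTH_NESTING);
	/* ... */
	spin_unlock(&part->lock);
	spin_unlock(&disk->lock);
}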
The lock validator was added to 2.6.17-rc5-mm1, so interested people can play with it. Waiting for another -mm release might not be a bad idea, however; there has since been a fairly long series of validator fixes posted.
The key point behind all of this is that deadlock situations can be found
without having to actually make the kernel lock up. By watching the
sequences in which locks are acquired, the validator can extrapolate a much
larger set of possible sequences. So, even though a particular deadlock
might only happen as the result of unfortunate timing caused by a specific
combination of strange hardware, a rare set of configuration options, 220V
power, a slightly flaky video controller, Mars transiting through Leo, an
old version of gcc, an application which severely stresses the
system (yum, say), and an especially bad Darl McBride hair day,
the validator has a good chance of catching it. So this code should result
in a whole class of bugs being eliminated from the kernel code base; that
can only be a good thing.
| Index entries for this article | |
|---|---|
| Kernel | Development tools/Kernel debugging |
| Kernel | Lockdep |
| Kernel | Spinlocks |
Posted May 31, 2006 3:28 UTC (Wed)
by walken (subscriber, #7089)
[Link] (2 responses)
That way, the lock ranks double as documentation of the lock ordering.
(I'm using this quite successfully at work for a kernel-type project, though not related to linux).
Posted May 31, 2006 6:22 UTC (Wed)
by dlang (guest, #313)
[Link] (1 responses)
remember basic and line numbers? no matter how you started your numbering scheme you eventually ran into trouble.
David Lang
Posted Jun 9, 2006 4:13 UTC (Fri)
by slamb (guest, #1070)
[Link]
I also work in a project which assigns ranks manually.
Disadvantage: especially given the number of locks in the Linux kernel, it'd be a large initial
effort to go through and create a ranking consistent with all the approved locking orders. (On the
other hand, not doing it probably means struggling with this list over and over without it actually
being written down anywhere.)
Advantage: you wouldn't even need to run through two code paths to know that they
can deadlock - just the one you've defined as wrong.
Advantage: it'd remove all the heaviness of the lock validator. There's no need for
the "before" and "after" lists; the rank defines them. The stack could be just a bit array - small
enough that it's probably practical to preallocate its full extent on thread creation. Thus, the lock
validator would not add potential out of memory failures, and it'd run quickly. There'd be no
reason not to always have it on.
(Actually, I'm not sure how strict the same-key locking stuff (spin_lock_nested
and such) is. To track those, a fancier stack might still be necessary.)
Posted May 31, 2006 6:25 UTC (Wed)
by Los__D (guest, #15263)
[Link] (5 responses)
What does that #lock in the call do?
Dennis
Posted May 31, 2006 6:33 UTC (Wed)
by Los__D (guest, #15263)
[Link]
Posted Jun 1, 2006 10:20 UTC (Thu)
by Octavian (guest, #7462)
[Link] (3 responses)
Posted Nov 29, 2006 3:05 UTC (Wed)
by zlo (guest, #41952)
[Link] (2 responses)
Posted Nov 30, 2006 17:27 UTC (Thu)
by bronson (subscriber, #4806)
[Link] (1 responses)
Posted Jul 24, 2012 14:29 UTC (Tue)
by chrisr (subscriber, #83108)
[Link]
Posted May 31, 2006 6:28 UTC (Wed)
by flewellyn (subscriber, #5047)
[Link] (14 responses)
We also learned that schedulers could not run in constant time.
That's twice now that Ingo Molnar has basically rewritten the literature on us.
Posted May 31, 2006 6:54 UTC (Wed)
by smurf (subscriber, #17840)
[Link] (12 responses)
Oh, and the scheduler doesn't run in constant time. It just delegates the job to small trivial chunks that run when each process starts+stops using the CPU, so technically it's still O(n), not O(1).
Posted May 31, 2006 7:17 UTC (Wed)
by mingo (guest, #31122)
[Link] (8 responses)
In other words, we've reduced a highly probabilistic parallel QA (and manual code review) task to a comparatively _much_ simpler task of serial code coverage.
But yes, to increase the 'reach' of the formal proof of correctness we need to increase code coverage during testing. Projects like LTP do that systematically, they use kernel instrumentation and visualization tools to see which code wasnt executed yet and create testcases to trigger it. Plus chaos helps us too: if LTP doesnt trigger it, users will eventually - and will likely not trigger the deadlock itself, but only the codepath that makes it possible. (hence there is much better debug info and a still working system.)
Put differently: it's not bad idea to run through every possible branch of code at least once anyway ;-) If some code is never triggered in practice, why is the code there and how was it tested to begin with? There is a world of a difference between code that is never executed even once and narrow 1-instruction races that need many CPUs to even trigger.
So i'd say "total locking correctness" is not an impossible dream anymore, but is attached to another effort (code coverage) that we want to make as complete as possible anyway.
Posted May 31, 2006 14:21 UTC (Wed)
by jreiser (subscriber, #11027)
[Link] (5 responses)
First, probably it never has been tested systematically; this problem is endemic to the Linux development process. Second, the problem is not only "branches of code [coverage of basic blocks]." The problem is conditional execution paths through multiple basic blocks, including re-convergent fanout (correlated conditions) and partial correctness with "don't care" conditions.
Posted May 31, 2006 14:34 UTC (Wed)
by mingo (guest, #31122)
[Link] (3 responses)
i think the answer to that is in the section you apparently overlooked:
Projects like LTP do that systematically, they use kernel instrumentation and visualization tools to see which code wasnt executed yet and create testcases to trigger it. Plus chaos helps us too: if LTP doesnt trigger it, users will eventually - and will likely not trigger the deadlock itself, but only the codepath that makes it possible. (hence there is much better debug info and a still working system.)
and thus this IMO misleading proposition results in a misleading conclusion:
this problem is endemic to the Linux development process.
;-)
Posted May 31, 2006 15:12 UTC (Wed)
by jreiser (subscriber, #11027)
[Link] (2 responses)
Posted May 31, 2006 15:55 UTC (Wed)
by mingo (guest, #31122)
[Link] (1 responses)
the LTP testsuite's 'testcases/' directory sports 7000+ files, most of which are individual testcases. Testcase files often contain more than 1 testcase. LTP is being run not "before each release" but on Linus' nightly GIT trees - and yes, it's all automated.
furthermore, there are random automated testing efforts as well like scrashme, which can (and do) hit bugs by chance as well.
while i dont claim that LTP is perfect (if it were we'd have no bugs in the kernel), it is certainly a lot more than "a drop in the bucket".
but i digress ...
Posted Jun 6, 2006 19:34 UTC (Tue)
by Blaisorblade (guest, #25465)
[Link]
nanosleep() tests aren't going to catch many bugs - nanosleep() uses are very simple.
But, say, run LTP on every possible drivers set, with a dmraid setup on "normal" RAID volumes on IDE and SCSI and SATA physical disks (since many bugs are in hardware drivers)...
Or combine networking tests with analysis of packet capture data and match sent data with on-wire data (supposing it's possible - you actually need a full TCP/IP stack to run on captured data, with additional correctness checking features)...
Then you'll find real bugs. However this starts being difficult.
Another possibility is to extract testcases from atypical applications.
UserModeLinux (on which I work) has been an excellent test-case against ptrace behaviour.
It found subtle bugs in ptrace existing in three kernel releases, in the recent past (one bug lived for 2.6.9 and 2.6.10, affecting security; the other affected x86-64 from 2.6.16.5 to 2.6.16.19).
But in this case, whoever coded the patches didn't run it.
Posted May 31, 2006 14:54 UTC (Wed)
by mingo (guest, #31122)
[Link]
yes - code coverage and testing is a lot more than just basic coverage of all existing code.
but coverage that triggers locking is a lot simpler than full coverage, because locks are almost always taken in simple ways, without too many branches. So while in theory code like this:
could exist in the kernel and would require 4 billion values of 'mask' to cycle through all the possible locking scenarios, in practice the kernel is full of much simpler locking constructs:
where covering the function once is probably enough to map its locking impact. In fact, "tricky", non-straight locking code is being frowned upon from a review and quality POV, which too works in favor of validation quality. To stay with the function1() example, such code will very likely be rejected at a very early review stage.
and finally, even theoretical full code coverage is a lot simpler than the possible combinations of codepaths on a multiprocessor system that could trigger deadlocks.
naturally, being a dynamic method, the lock validator can only claim correctness about codepaths it actually observes (any other code doesnt even exist for it), and thus it still depends on how frequently (and with what memory state) those codepaths get triggered.
Posted May 31, 2006 15:08 UTC (Wed)
by NAR (subscriber, #1313)
[Link]
If this feature is turned on in the user's kernel - isn't the performance hit too much for users to keep this configuration option turned on?
If some code is never triggered in practice, why is the code there and how was it tested to begin with?
Error handling (especially hardware-related error handling) might be triggered extremely rarely, but I don't think it should be thrown out. The code could be tested by manually setting some error flag, but obviously the setting of this flag is not included in the production kernel.
Posted Jun 8, 2006 8:53 UTC (Thu)
by jmansion (guest, #36515)
[Link]
Hardly. If you can legitimately *ever* lock two objects of the same type then there will be an exception that turns off the 'false positive' that will also turn off checking whether you acquire A then B and also on some code paths B then A. Which is exactly the sort of deadlock you normally get in a database server, for example.
Which is not to say that the approach isn't useful, just that it's not a panacea. It would have been handy to have a compare/contrast with FreeBSD's WITNESS facility.
Posted May 31, 2006 7:55 UTC (Wed)
by mingo (guest, #31122)
[Link] (1 responses)
this statement doesnt parse for me. Sure, if you execute an O(1) operation N times that makes it O(N). What matters is the overhead of context-switching and wakeup, and those are O(1)!
just look at an example: sys_getpid(), the simplest system-call in existence, does:
return current->tgid;
that's as O(1) as it gets - a handful of instructions. But by your argument, since applications may execute getpid() lots of times, in reality it's O(N)? By that argument there would be no code at all that would qualify as O(1) code!
Posted May 31, 2006 8:08 UTC (Wed)
by frankie (subscriber, #13593)
[Link]
Posted May 31, 2006 9:19 UTC (Wed)
by simlo (guest, #10866)
[Link]
Posted May 31, 2006 7:09 UTC (Wed)
by piman (guest, #8957)
[Link]
Posted May 31, 2006 6:59 UTC (Wed)
by mingo (guest, #31122)
[Link] (2 responses)
That means that the "special rules" added for the sake of lockdep (via the use of spin_lock_nested() APIs, etc.) act as a formal specification of the locking rules, and the validator checks those rules against actual behavior.
To stay with the partitions example: if for example some kernel code locks a component partition before a container partition then the validator will detect this deadlock scenario.
So subtypes are not a watering down of the "proof of correctness", they are more like a mechanism to "program by specification" and to let the validator check against that specification. For non-recursive locking, the validator can build the specification automatically.
(besides subtypes, explicit type keys are another way to specify locking semantics. This all gets us one step closer to a more formal programming model and to a more correct kernel.)
Posted May 31, 2006 9:35 UTC (Wed)
by nix (subscriber, #2304)
[Link] (1 responses)
Posted Jun 6, 2006 16:07 UTC (Tue)
by nix (subscriber, #2304)
[Link]
Posted May 31, 2006 8:11 UTC (Wed)
by pointwood (guest, #2814)
[Link] (2 responses)
Posted May 31, 2006 13:10 UTC (Wed)
by kirkengaard (guest, #15022)
[Link]
Posted Jun 1, 2006 5:10 UTC (Thu)
by walken (subscriber, #7089)
[Link]
Posted Jun 1, 2006 6:32 UTC (Thu)
by man_ls (guest, #15091)
[Link] (5 responses)
Posted Jun 1, 2006 13:17 UTC (Thu)
by corbet (editor, #1)
[Link]
Posted Jun 1, 2006 14:14 UTC (Thu)
by arjan (subscriber, #36785)
[Link] (3 responses)
I've lost track but we're well over the count of 20 with real bugs found so far (several of these got fixed in mainline before lockdep got posted the other day so it's a bit hard to count)
Posted Jun 2, 2006 0:33 UTC (Fri)
by linportal (guest, #38148)
[Link] (1 responses)
Posted Jun 3, 2006 11:59 UTC (Sat)
by mingo (guest, #31122)
[Link]
Posted Jun 4, 2006 3:54 UTC (Sun)
by nevets (subscriber, #11875)
[Link]
And as Ingo already pointed out, the fishy-but-happens-to-be-ok cases are now nicely documented by the workarounds in (in akpm's words) the make-lockdep-shut-up patches. ;)
The kernel lock validator
This may be a dumb question, but why not assign an explicit rank to every lock type?
The kernel lock validator
the list of locks is not static.
The kernel lock validator
The list of lock keys is static, and that's what would get assigned ranks. I quote from the article:
Locks which are allocated dynamically (as most locks embedded within structures are) [...] should be mapped to a single key. [...] Thus, for each lock initialization, this code creates a static variable (__key) and uses its address as the key identifying the type of the lock. Since any particular type of lock tends to be initialized in a single place, this trick associates the same key with every lock of the same type.
OT: The # operator?
I thought that I knew C pretty good, but I NEVER saw a # operator before (except for preprocessor statements).
OT: The # operator?
Doh, just preprocessor argument->string conversion, nevermind, too early in the morning for me ;)
aka "stringification"OT: The # operator?
Tnx very useful info ofr me!
OT: The # operator?
Spam comment
The parent post is clearly spam. Is this the first spammy comment to remain on LWN for any significant amount of time?
The kernel lock validator
Y'know, it's funny. Back in college, in my Operating Systems course, we learned that it was impossible to design code which would reliably detect all possible deadlock conditions in a kernel.
The kernel lock validator
Umm, the validator doesn't detect all *possible* deadlock conditions -- just those that, given current and past locking history, might occur. It won't find tomorrow's deadlock if the code that could possibly cause it didn't run today.
Re: detecting all possible deadlock conditions
Well, the validator does reduce the incredibly complex QA task of "trigger all possible locking races that can lead to deadlocks, assuming an arbitrary number of CPUs and interrupt sources" to "trigger all locking codepaths at least once".
Put differently: it's not bad idea to run through every possible branch of code at least once anyway ;-) If some code is never triggered in practice, why is the code there and how was it tested to begin with?
testing and code coverage
First, probably it never has been tested systematically;
testing and code coverage
LTP has a long way to go
I did not overlook "Projects like LTP do that systematically." The Linux Testing Project is a drop in the bucket: far too little. Given its size, the kernel should have over ten thousand individual tests that are run and analyzed [automation helps!] before each release.
LTP has a long way to go
The problem of LTP is that it, in itself, tests syscall semantics, and many tests are unit tests. They can help in some cases (especially for strange arches, like say UML) - and they're very useful when new functionality is introduced, especially if the implementation is complex (that's my experience - in one project I wrote the implementation together with a very complex testcase for it).
Second, the problem is not only "branches of code [coverage of basic blocks]." The problem is conditional execution paths through multiple basic blocks, including re-convergent fanout (correlated conditions) and partial correctness with "don't care" conditions.
testing and code coverage
void function1(unsigned long mask)
{
	if (mask & 0x00000001)
		spin_lock(&lock0);
	if (mask & 0x00000002)
		spin_lock(&lock1);
	if (mask & 0x00000004)
		spin_lock(&lock2);
	if (mask & 0x00000008)
		spin_lock(&lock3);
	if (mask & 0x00000010)
		spin_lock(&lock4);
	...
	if (mask & 0x80000000)
		spin_lock(&lock31);
}

void function2(unsigned long mask)
{
	spin_lock(&lock);
	...
	spin_unlock(&lock);
}
if LTP doesnt trigger it, users will eventually - and will likely not trigger the deadlock itself, but only the codepath that makes it possible. (hence there is much better debug info and a still working system.)
Re: detecting all possible deadlock conditions
Re: detecting all possible deadlock conditions
>In other words, we've reduced a highly probabilistic
>parallel QA (and manual code review) task to a
>comparatively _much_ simpler task of serial code coverage.
Oh, and the scheduler doesn't run in constant time. It just delegates the job to small trivial chunks that run when each process starts+stops using the CPU, so technically it's still O(n), not O(1).
Re: O(1) scheduler
Re: O(1) scheduler
The O(1) for the scheduler refers to the number of processes. The scheduling time is indeed not dependent on the number of processes. That's all.
The kernel lock validator
The old scheduler was O(number of processes on the run-queue). The scheduler in 2.6 is O(number of priorities) = O(140) = O(1).
The kernel lock validator
In fact it's trivial to identify all functions that could deadlock (if you heard otherwise, maybe you should've been paying more attention in class :). What's impossible is identifying all and only those functions that deadlock.
The kernel lock validator
Small additional thought on the subtype paragraph: while subtypes do fix false positives, their role isnt just to get around exceptions - their role is to correctly map recursive locking semantics.
The kernel lock validator
... and now reiser4 turns up with a tree of locks where each rank may be taken within the rank above safely, but where there are a potentially unbounded number of ranks...
The kernel lock validator
Hm, on rereading that's not what it does. Sorry.
The kernel lock validator
I didn't know that Darl McBride having a bad hair day could make my kernel deadlock - you learn something new every day :p
The kernel lock validator
Yes ... when he can't get it just so, he goes out and kills some butterflies out of frustration, changing the state of the universe's entropy pool, and unpredictably modifying system behavior. The occasional kernel deadlock is merely a symptom, though it poses a supremely difficult debugging challenge. ;)
The kernel lock validator
Just look how Darl can deadlock a court case...
I'm surprised that nobody has asked this before. Solutions that look good in theory are, well, just theoretical; but does it work in practice? I assume that someone, at least the author, has had it running for some time now. How does it behave? Has anyone found potential deadlocks in running code already? Otherwise, are there too few results, too many false positives?
Any results?
Yes, there have been some real deadlocks discussed on the kernel list. See also David Miller's posting, where he notes that it found a number of networking bugs while it was being developed.
Any results?
the ratio of "false positive" and real bugs is about 1:1 so far, where the false positives generally are doing some fishy-but-happens-to-be-ok locking most of the time. The false positives generally require more instrumentation in the code for the exact locking rules, which is good for documentation purposes anywayAny results?
What I have missed in all this is the address to forward detected bugs to. What about this one: kseriod deadlock? Is it real or false positive?
Any results?
That's a false positive, caused by recursive locking in the serio code. Arjan has posted the fix:
Any results?
--- linux/drivers/input/serio/libps2.c.orig
+++ linux/drivers/input/serio/libps2.c
@@ -177,7 +177,7 @@ int ps2_command(struct ps2dev *ps2dev, u
return -1;
}
- mutex_lock(&ps2dev->cmd_mutex);
+ mutex_lock_nested(&ps2dev->cmd_mutex, SINGLE_DEPTH_NESTING);
serio_pause_rx(ps2dev->serio);
ps2dev->flags = command == PS2_CMD_GETID ? PS2_FLAG_WAITID : 0;
the ratio of "false positive" and real bugs is about 1:1 so far, where the false positives generally are doing some fishy-but-happens-to-be-ok locking most of the time.
Any results?
