Colonel Panic

Occasionally I’ll run across a movie or television show that uses the phrase “kernel panic.” I usually wonder what Joe (or Jo) Average thinks when they hear that…

So, first, let’s get some basic details out of the way.

  1. No, this phrase doesn’t refer to the mental or emotional state believed to be induced by the Greek god Pan amongst commissioned officers just below the rank of brigadier general. Neither does it refer to a specific colonel by the name of Panic.
  2. Yes, the word kernel refers to the inner, nutty part of the computer operating system. Okay, okay, so it’s not so nutty, but it is the inner part – it’s where all of the real magic happens. (You need help? –ed.)

People who have long used UNIX-based systems are probably quite familiar with the situation; some device or software subsystem fails catastrophically, and the kernel just doesn’t know what to do with itself. Rather than risk further damage, the kernel has a safety net, of sorts, that it invokes: the kernel panic. From this state, a skilled operator or administrator can usually determine what caused the condition so that it can be further investigated (and hopefully fixed).
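
To make the idea concrete, here is a tiny userland sketch of what the kernel is doing conceptually (this is not NetBSD’s actual panic() code, just an illustration): check an invariant, and if it’s violated, report as much as possible and stop rather than blunder on and corrupt more state.

    #include <stdio.h>
    #include <stdlib.h>
    #include <stdarg.h>

    /* Toy stand-in for a kernel panic: report the failure and halt.
     * A real kernel would also dump registers, drop into a debugger,
     * or write a crash dump for later analysis. */
    static void panic(const char *fmt, ...)
    {
        va_list ap;

        va_start(ap, fmt);
        fprintf(stderr, "panic: ");
        vfprintf(stderr, fmt, ap);
        fprintf(stderr, "\n");
        va_end(ap);

        abort();                /* stop here; don't risk further damage */
    }

    int main(void)
    {
        int free_pages = -1;    /* pretend a subsystem corrupted its state */

        if (free_pages < 0)
            panic("page accounting went negative (%d)", free_pages);

        return 0;
    }

The interesting part, as described above, is what happens after the stop: the message and any dump left behind are what the operator uses to work out what went wrong.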

Let’s rewind the clock to 2002. It was December, and that newfangled NetBSD 1.6 had just been released a few months before. One of my production servers (that is, one of the servers on active duty) was a dual-processor machine with 576 MB of RAM, installed in a 64 + 256 + 256 configuration. The problem was, the machine would run for about 12 or 13 hours and then crash – or, more precisely, the kernel would panic – with some bizarre nonsense that pointed to bad memory in the machine. Since removing the second 256 MB memory module “solved” the problem, I assumed that it was bad RAM, labeled the module as such, and left the machine to do its work.

Last spring I decided that it was time to start looking into the upcoming NetBSD 2.0 release. I had a machine with a full gigabyte of RAM and faster processors, and I was excited to see how it performed now that multiprocessor support was finally a standard feature. The problem was, the system would panic during installation of some of the optional parts of the system! Well, okay, they’re optional, so we can forgo them for now, right? And so I skipped that step and decided to try to boot anyway. The system ran well enough to start, but when I tried to install the optional parts again, it crashed. I gave up.

Fast forward to last week. Now, upgrading my machines to the newly released NetBSD 2.0 had become not just an interesting exercise but a necessity. I’ve got a critical business application that by itself needs as much memory as the system has installed. With the help of the virtual memory subsystem, the kernel can temporarily “swap out” inactive memory pages to the hard drive(s), then load them again when they’re needed. However, this works best when relatively small chunks of memory are inactive for relatively long periods of time. In this case my application woke up roughly once an hour and scanned the entire database, which it needs to keep in memory for performance reasons. This led to “thrashing” as the system shuffled pages between disk and memory, and it actually left the application down for 5-7 minutes out of every hour. What a mess! Consequently we upgraded the RAM to a full gigabyte (configured the same as the system from last spring).
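
If you want to see the failure mode in miniature, the hypothetical sketch below (sizes and sleep times are made up) allocates a “database” larger than physical RAM and walks the whole thing periodically. Each full scan forces the pager to haul every page back in from swap, which is exactly the thrashing described above.

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    /* Hypothetical thrashing demo: the "database" is bigger than physical
     * RAM, so every full scan forces the pager to pull each page back in
     * from swap.  The 2 GB size and 10-second "hour" are invented. */
    #define DB_SIZE   (2UL * 1024 * 1024 * 1024)
    #define PAGE_SIZE 4096UL

    int main(void)
    {
        unsigned char *db = malloc(DB_SIZE);
        unsigned long i, sum = 0;

        if (db == NULL)
            return 1;

        /* Dirty every page once so it is really backed by memory/swap. */
        for (i = 0; i < DB_SIZE; i += PAGE_SIZE)
            db[i] = (unsigned char)i;

        for (;;) {
            /* The periodic full scan: touch every page of the database. */
            for (i = 0; i < DB_SIZE; i += PAGE_SIZE)
                sum += db[i];

            printf("scan done (checksum %lu)\n", sum);
            sleep(10);      /* "once an hour", scaled down for the demo */
        }
    }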

Can you guess what happened next? That’s right. Boooom! “panic: uvm_fault: …” became a common refrain on the console of a secondary server that I upgraded first (to minimize the risk). This led to a couple of marathon sessions where I tested the memory and related system electronics to see what was going on. Every test I ran said that the memory was good. I even ran a couple of tests that took nearly an hour and a half each. You’d think that the NetBSD kernel would be convinced after watching all of that. I mean, I’m sure it was watching; the disks were spinning, right? Anyway, it promptly ejected its mind when I started it up, so I gave up, removed the “extra” memory modules to get the machine back up, and went back to the drawing board.

Wednesday evening, December 22, I had planned a short “day” (Malaysia time) because of the company clean-up day and office party scheduled for Thursday. As I sat there smacking my forehead over this issue, about to give up and go home, I spotted something I hadn’t noticed before. Hmm, this looks strange, I thought. The host bridge/controller chip controls memory access timing, among other things, and the original version and at least one later revision had a bug that required a specific setting. If you haven’t guessed where this is going: over the course of the next several hours I traced the history of the hardware problem, removed the “fix” that forced the setting needed by the older controller chips, tested it, produced a portable patch that disables the fix on newer chips, and submitted it to the NetBSD folks.
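
In rough terms – and with register names, bits, and revision numbers invented for illustration, since I’m not quoting the actual driver here – the change turned an unconditional workaround into one that is only applied on the chip revisions that actually need it:

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical sketch of the patch: the memory-timing workaround used
     * to be forced on every revision of the host bridge, which broke the
     * newer chips.  Gate it on the revision ID so only the buggy silicon
     * gets the conservative setting.  All names and values are made up. */
    #define BRIDGE_REV_BUGGY_MAX  0x02   /* last revision with the timing bug */
    #define TIMING_CONSERVATIVE   0x01   /* slow-but-safe memory timing bit */

    struct host_bridge {                 /* stand-in for the real device */
        uint8_t  revision;
        uint32_t timing_reg;
    };

    static void
    bridge_fixup_timing(struct host_bridge *hb)
    {
        /* Before the patch, this ran unconditionally on every revision. */
        if (hb->revision <= BRIDGE_REV_BUGGY_MAX)
            hb->timing_reg |= TIMING_CONSERVATIVE;
    }

    int main(void)
    {
        struct host_bridge older = { .revision = 0x01, .timing_reg = 0 };
        struct host_bridge newer = { .revision = 0x10, .timing_reg = 0 };

        bridge_fixup_timing(&older);
        bridge_fixup_timing(&newer);

        printf("rev 0x%02x timing 0x%02x, rev 0x%02x timing 0x%02x\n",
            (unsigned)older.revision, (unsigned)older.timing_reg,
            (unsigned)newer.revision, (unsigned)newer.timing_reg);
        return 0;
    }

The point of the design is simply that an errata workaround should be keyed to the hardware that exhibits the erratum, not applied blindly to every chip that shares a device ID.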

It’s been 10 days and the systems are both stable with a full complement of memory. I haven’t had that much fun in a while!
