simple debugging HOWTO
William Lee Irwin III <wli@holomorphy.com>
(1) always set up a serial cable and use serial console
things like oopses need to be logged
see Documentation/serial-console.txt in the kernel source
(2) try alt-sysrq if things go wrong
see Documentation/sysrq.txt in the kernel source
if it doesn't work over serial console, it's probably because
the serial cable is missing a wire. try it before things go
wrong, too, just to make sure it works. sysrq offers several
different dumps to look at (tasks, memory, registers). most of
them will generate too much info to read if you're not properly
logging.
(3) if things appear to deadlock, try the NMI oopser
see Documentation/nmi_watchdog.txt in the kernel source
beware of bad IBM BIOSes here; also, use nmi_watchdog=2 (local APIC mode)
(4) use kgdb!
available as part of -mm at
ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.5/2.5.*/2.5.*-mm*/broken-out/*kgdb*
(don't fetch _all_ matches to that, just the ones for your kernel version)
see Documentation/i386/kgdb/ from the patch
works great with the NMI oopser (modulo bad BIOSes as above)
(5) _always_ use CONFIG_KALLSYMS=y
if you don't (or can't), you'll have to use ksymoops on the oops.
for ksymoops(8) documentation, man ksymoops (install if necessary)
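a quick way to confirm the setting before booting is to scan the .config.
what follows is only a sketch: the function name is made up, and it assumes
the usual .config layout where an enabled option appears as OPTION=y and a
disabled one as "# OPTION is not set".

```python
def config_enabled(config_text, option="CONFIG_KALLSYMS"):
    """Return True if `option` is set to y in the given .config text."""
    wanted = option + "=y"
    for line in config_text.splitlines():
        # exact match, so CONFIG_KALLSYMS_ALL=y does not count as a hit
        if line.strip() == wanted:
            return True
    return False
```

usage would be something like config_enabled(open(".config").read()),
run from the top of the kernel tree that was actually built.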
(6) if the kernel hangs with no output, use an early printk patch
an old one of mine is at:
ftp://ftp.kernel.org/pub/linux/kernel/people/wli/early_printk/
there are others and/or updates from others around somewhere
also, the newer kgdb patches work much earlier than the old ones,
often early enough to obsolete early printk stuff.
(7) if you get OOMs, log bloatmeter's output
ftp://ftp.kernel.org/pub/linux/kernel/people/wli/bloatmeter/
logging periodic snapshots of /proc/meminfo and /proc/vmstat
(say, every 5s) is also good.
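the periodic snapshots can be taken with a few lines of script. this is a
minimal sketch, not a polished tool: the function names and the default
log file name (memlog.txt) are made up for illustration, and the fsync is
an added assumption so that the last records survive a hang or OOM kill.

```python
import os
import time

# the two files named above
SOURCES = ("/proc/meminfo", "/proc/vmstat")

def snapshot(paths=SOURCES):
    """Return one timestamped text record with the contents of each path."""
    parts = ["=== %s ===" % time.strftime("%Y-%m-%d %H:%M:%S")]
    for path in paths:
        parts.append("--- %s ---" % path)
        try:
            with open(path) as f:
                parts.append(f.read().rstrip("\n"))
        except OSError as err:
            # not every kernel provides every file; note it and keep going
            parts.append("(unreadable: %s)" % err)
    return "\n".join(parts) + "\n"

def log_snapshots(logfile="memlog.txt", interval=5.0, count=None, paths=SOURCES):
    """Append a snapshot every `interval` seconds; count=None means forever."""
    taken = 0
    with open(logfile, "a") as out:
        while count is None or taken < count:
            out.write(snapshot(paths))
            out.flush()
            os.fsync(out.fileno())  # push the record to disk right away
            taken += 1
            time.sleep(interval)
```

calling log_snapshots() with no arguments logs every 5s until killed.
logging to a different machine or disk than the one being exhausted is
safer when that's possible.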
(8) if a syscall mysteriously fails in a new kernel, use strace
for documentation, man strace (install it if need be)
log the failing program's strace output on a working kernel and
on a broken kernel; to see what I'll be looking at, just use
diff(1) on the two logs (but I'll want both of the whole logs anyway).
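the comparison can also be scripted. the sketch below stands in for
diff(1) using python's difflib; masking hex addresses is an added
assumption (they tend to differ between runs and drown out the real
change), and the function names are made up.

```python
import difflib
import re

HEX = re.compile(r"0x[0-9a-fA-F]+")

def normalize(line):
    """Replace hex values such as pointers with a fixed placeholder."""
    return HEX.sub("0xADDR", line)

def diff_logs(good, bad, mask_addresses=True):
    """Return unified-diff lines between two lists of strace log lines."""
    if mask_addresses:
        good = [normalize(line) for line in good]
        bad = [normalize(line) for line in bad]
    return list(difflib.unified_diff(good, bad,
                                     fromfile="working", tofile="broken",
                                     lineterm=""))
```

feed it the two logs as lists of lines, e.g.
diff_logs(open("good.log").read().splitlines(),
open("bad.log").read().splitlines()), where the file names are
placeholders. masking loses information, so keep the unmodified logs.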
(9) if a system comes up missing memory, devices, or cpus,
send in the bootlog and the .config used
(10) if a combination of patches doesn't work, bisect!
if there were 4 billion patches, you'd only need 32 boots
to find the bad one. for 1024 you'd only need 10 boots.
O(lg(n)) is good.
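the bisection itself is just a binary search over the patch series. a
minimal sketch, assuming the patches apply in a fixed order, the tree
with none of them applied works, the tree with all of them applied is
broken, and things stay broken once the bad patch is in. is_good is a
hypothetical callback standing in for "apply the first n patches, build,
boot, and test".

```python
def bisect_patches(patches, is_good):
    """Find the first bad patch in an ordered series.

    is_good(n) must return True if the kernel with the first n patches
    applied works. Returns (bad_patch, number_of_boots).
    """
    lo, hi = 0, len(patches)  # lo patches applied: good; hi applied: bad
    boots = 0
    while hi - lo > 1:
        mid = (lo + hi) // 2
        boots += 1
        if is_good(mid):
            lo = mid   # the bad patch is somewhere after the first mid
        else:
            hi = mid   # the bad patch is among the first mid
    return patches[hi - 1], boots
```

each boot halves the remaining range, which is where the 10 boots for
1024 patches comes from.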