simple debugging HOWTO
William Lee Irwin III <wli@holomorphy.com>
(1) always set up a serial cable and use a serial console
        things like oopses need to be logged, and a serial console gets
        them onto another machine even if this one hangs afterwards
        see Documentation/serial-console.txt in the kernel source

(2) try alt-sysrq if things go wrong
        see Documentation/sysrq.txt in the kernel source
        if it doesn't work over the serial console (where sysrq is sent
        as a BREAK followed by the command key), it's probably because
        the serial cable is missing a wire. try it before things go
        wrong, too, just to make sure it works. there are several
        different commands to look at here (e.g. t dumps task states,
        m memory info, p registers); most of them generate too much
        output to read unless you're logging properly.

(3) if things appear to deadlock, try the NMI oopser
        see Documentation/nmi_watchdog.txt in the kernel source
        beware of buggy IBM BIOSes here; also, boot with nmi_watchdog=2

(4) use kgdb!
        available as part of -mm at
        ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.5/2.5.*/2.5.*-mm*/broken-out/*kgdb*
        (don't fetch everything matching that pattern, just the patch
        for your kernel version)
        see Documentation/i386/kgdb/ from the patch
        works great with the NMI oopser (modulo the bad BIOSes above)

(5) _always_ use CONFIG_KALLSYMS=y
        with it, oopses print symbol names instead of bare addresses.
        if you don't (or can't) enable it, you'll have to decode the
        oops with ksymoops. for documentation, run man ksymoops
        (install it if necessary).
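        without kallsyms, the core step of decoding an oops is mapping
        each raw address to the closest symbol at or below it in
        System.map. the Python sketch below illustrates only that
        lookup; it is not ksymoops, which does more than this. the
        function names are hypothetical, and it assumes System.map's
        usual "<hex address> <type> <name>" line format.

```python
import bisect

def load_system_map(lines):
    """Parse System.map-style lines ("<hex addr> <type> <name>") into a
    list of (address, name) pairs sorted by address."""
    table = []
    for line in lines:
        parts = line.split()
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        table.append((int(parts[0], 16), parts[2]))
    table.sort()
    return table

def resolve(table, addr):
    """Return 'symbol+0xoffset' for addr, using the closest symbol at or
    below it, or None if addr lies below every symbol."""
    addrs = [a for a, _ in table]
    i = bisect.bisect_right(addrs, addr) - 1
    if i < 0:
        return None
    base, name = table[i]
    return f"{name}+{addr - base:#x}"
```

        for example, given a (made-up) map entry "c0105a20 T do_page_fault",
        resolve(table, 0xc0105a7c) yields "do_page_fault+0x5c".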

(6) if the kernel hangs with no output, use an early printk patch
        an old one of mine is at:
        ftp://ftp.kernel.org/pub/linux/kernel/people/wli/early_printk/
        other early printk patches, and updates to this one by other
        people, are also floating around.
        also, the newer kgdb patches work much earlier than the old ones,
        often early enough to make early printk unnecessary.

(7) if you get OOMs (out-of-memory kills), log bloatmeter's output
        ftp://ftp.kernel.org/pub/linux/kernel/people/wli/bloatmeter/
        logging periodic snapshots of /proc/meminfo and /proc/vmstat
        (say, every 5s) also helps.
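        a minimal Python sketch of that periodic logging, assuming a
        Linux /proc and a writable log file; the function names and
        defaults are hypothetical. each snapshot is fsync()ed so it has
        a better chance of surviving if the machine dies mid-run.

```python
import os
import time

SOURCES = ("/proc/meminfo", "/proc/vmstat")

def snapshot(sources=SOURCES):
    """Return one timestamped text block with the current contents of
    each source file."""
    parts = ["=== " + time.strftime("%Y-%m-%d %H:%M:%S") + " ==="]
    for src in sources:
        with open(src) as f:
            parts.append("--- " + src)
            parts.append(f.read().rstrip("\n"))
    return "\n".join(parts) + "\n"

def log_snapshots(logfile, interval=5.0, count=None, sources=SOURCES):
    """Append a snapshot every `interval` seconds; run forever if count
    is None, otherwise stop after `count` snapshots."""
    taken = 0
    with open(logfile, "a") as out:
        while count is None or taken < count:
            out.write(snapshot(sources))
            out.flush()
            os.fsync(out.fileno())  # push it to disk, not just the page cache
            taken += 1
            if count is None or taken < count:
                time.sleep(interval)
```

        left running alongside the workload (e.g. log_snapshots("mem.log")),
        the tail of the log shows how memory was trending before the OOM.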

(8) if a syscall mysteriously fails in a new kernel, use strace
        for documentation, run man strace (install it if need be).
        run the failing program under strace on both a working kernel
        and the broken one; diff(1) on the two logs shows what I'll be
        looking at, but send both complete logs anyway.
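        a plain diff of two strace logs can be swamped by pointer values
        and PIDs that differ on every run. below is a minimal Python
        sketch that masks those before diffing; the regular expressions
        are assumptions about what counts as noise (0x-prefixed hex and
        the PID prefixes strace -f adds), so treat its output as a
        reading aid and still send both raw logs.

```python
import difflib
import re

# Assumed noise: hex pointer values, plus the "[pid N]" or "N " prefixes
# that strace -f puts in front of each line.
_HEX = re.compile(r"0x[0-9a-f]+")
_PID = re.compile(r"^(\[pid\s+\d+\]|\d+)\s+")

def normalize(line):
    """Strip run-to-run noise from one strace line."""
    line = _PID.sub("", line.rstrip("\n"))
    return _HEX.sub("0xADDR", line)

def diff_strace(good_lines, bad_lines):
    """Unified diff of a working-kernel log against a broken-kernel log,
    after normalizing both."""
    good = [normalize(l) for l in good_lines]
    bad = [normalize(l) for l in bad_lines]
    return list(difflib.unified_diff(good, bad, "working", "broken",
                                     lineterm=""))
```

        the inputs can be open file objects for the two logs, since
        both are just iterated line by line.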

(9) if a system comes up missing memory, devices, or CPUs,
        send in the boot log and the .config you used

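(10) if a combination of patches doesn't work, bisect!
        apply the first half of the series and boot; if it breaks, the
        bad patch is in that half, otherwise it's in the rest. each boot
        halves the suspects, so n patches take about lg(n) boots: 4
        billion patches (just under 2^32) need only 32 boots to find the
        bad one, and 1024 (2^10) need only 10. O(lg(n)) is good.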
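        the halving can be sketched in Python, assuming the patches form
        an ordered series with a single bad patch, and that boots_ok(k)
        (a hypothetical callback: build with the first k patches, boot,
        report whether it works) is true for k = 0 and false for the
        whole series.

```python
from typing import Callable, Sequence, Tuple, TypeVar

T = TypeVar("T")

def bisect_patches(patches: Sequence[T],
                   boots_ok: Callable[[int], bool]) -> Tuple[T, int]:
    """Return (first bad patch, number of test boots used).

    Invariant: the first `good` patches boot fine and the first `bad`
    patches do not, so once they are adjacent the culprit is
    patches[bad - 1].
    """
    good, bad = 0, len(patches)
    boots = 0
    while bad - good > 1:
        mid = (good + bad) // 2
        boots += 1
        if boots_ok(mid):
            good = mid
        else:
            bad = mid
    return patches[bad - 1], boots
```

        with 1024 patches this uses exactly 10 boots; with a series of
        4 billion it uses at most 32.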