Summary:
--------
- System locked up due to race in timer code, open bug in bugme.osdl.org
    http://bugme.osdl.org/show_bug.cgi?id=4559

- Generated saved Kdump successfully using nmi_watchdog and used the kdump
  successfully to analysis the lockup.


Details:
--------
- The machine locked up after few minutes and nmi_watchdog triggered following
  oops

NMI Watchdog detected LOCKUP on CPU0, eip c04bd3b4, registers:

Modules linked in:
CPU:    0
EIP:    0060:[<c04bd3b4>]    Not tainted VLI
EFLAGS: 00000046   (2.6.12-rc2-mm3-I) 
EIP is at _spin_lock_irqsave+0x24/0xa0
eax: 00000001   ebx: c4710f60  ecx: 00000001   edx: 00000086
esi: c05e6000   edi: c05e7d80   ebp: ffffffff   esp: c05e7d58
ds: 007b   es: 007b   ss: 0068
Process swapper (pid: 0, threadinfo=c05e6000 task=c052cc00)
Stack: c4710f60 f7ff1880 c01292c4 c4710f60 c4708f60 f7ff1880 c0129340 f7ff1880 
       c05e7d80 c0622000 00000086 f7ff1800 00000014 f7ff1800 f7ff1800 c036632f 
       f7ff1880 00cef97c 00000292 00000292 f7fbbc40 00000000 c036676d f7ff1800 
Call Trace:
 [<c01292c4>] lock_timer_base+0x24/0x50
 [<c0129340>] __mod_timer+0x50/0xe0
 [<c036632f>] cfq_arm_slice_timer+0x9f/0x120
 [<c036676d>] cfq_select_queue+0x11d/0x130
 [<c0366a5e>] cfq_dispatch_requests+0x3e/0x90
 [<c0366b55>] cfq_next_request+0xa5/0xc0
 [<c0358adc>] elv_next_request+0x3c/0x170
 [<c02e2507>] kobject_get+0x17/0x20
 [<c039d770>] scsi_request_fn+0x1f0/0x3a0
 [<c03973b7>] scsi_put_command+0x87/0xd0
 [<c035b4fe>] blk_run_queue+0x4e/0x70
 [<c039c94d>] scsi_end_request+0xdd/0x110
 [<c039cd25>] scsi_io_completion+0x1b5/0x4b0
 [<c04bd7df>] _spin_unlock_irqrestore+0xf/0x30
 [<c03e182b>] sd_rw_intr+0xdb/0x2b0
 [<c04bd316>] _spin_lock+0x16/0x90
 [<c039c56a>] scsi_device_unbusy+0x4a/0x90
 [<c0397b6a>] scsi_finish_command+0x8a/0xb0
 [<c0397a6f>] scsi_softirq+0xbf/0xe0
 [<c0124eba>] __do_softirq+0x7a/0x100
 [<c0124f93>] do_softirq+0x53/0x60
 [<c0125084>] irq_exit+0x44/0x50
 [<c0105c18>] do_IRQ+0x28/0x40
 [<c0103ea2>] common_interrupt+0x1a/0x20
 [<c0101305>] mwait_idle+0x25/0x50
 [<c030f233>] acpi_processor_idle+0x0/0x25d
 [<c030f336>] acpi_processor_idle+0x103/0x25d
 [<c030f233>] acpi_processor_idle+0x0/0x25d
 [<c01010e1>] cpu_idle+0x41/0x80
 [<c05e89df>] start_kernel+0x16f/0x190
 [<c05e83c0>] unknown_bootoption+0x0/0x1e0
Code: 8d b4 26 00 00 00 00 83 ec 08 89 1c 24 89 c3 b8 01 00 00 00 89 74 24 04 
e8 da cf c5 ff be 00 e0 ff ff 21 e6 9c 5a fa 31 c0 86 03 <84> c0 7e 14 c7 43 
04 00 00 00 00 89 d0 8b 1c 24 8b 74 24 04 83 
console shuts up ...
 <0>Kernel panic - not syncing: Aiee, killing interrupt handler!


- The kernel panic initated booting of dump capture kernel and
  the dump is collected by copying /proc/vmcore.

- opened bug # 4559, at bugme.osdl.org.


- Initial dump analysis


root@x206g home]# gdb --core=vmcore vmlinux
GNU gdb Red Hat Linux (6.1post-1.20040607.62rh)
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "i386-redhat-linux-gnu"...Using host libthread_db
library "/lib/tls/libthread_db.so.1".


warning: exec file is newer than core file.
#0  crash_get_current_regs (regs=0xc05e7bf8) at arch/i386/kernel/crash.c:99
99      arch/i386/kernel/crash.c: No such file or directory.
        in arch/i386/kernel/crash.c
(gdb) bt
#0  crash_get_current_regs (regs=0xc05e7bf8) at arch/i386/kernel/crash.c:99
#1  0xc0115b0a in crash_save_self () at arch/i386/kernel/crash.c:107
#2  0xc0141c45 in crash_kexec () at kernel/kexec.c:1032
#3  0xc011ee30 in panic (fmt=0x0) at kernel/panic.c:78
#4  0xc01226d1 in do_exit (code=11) at kernel/exit.c:792
#5  0xc01052d1 in die_nmi (regs=0xc05e7d24, msg=0x0) at arch/i386/kernel/traps.c:616
#6  0xc0113ba2 in nmi_watchdog_tick (regs=0x0) at arch/i386/kernel/nmi.c:528
#7  0xc010535b in default_do_nmi (regs=0xc05e7d24) at arch/i386/kernel/traps.c:637
#8  0xc01054cd in do_nmi (regs=0xc05e7d24, error_code=Variable "error_code" is
not available.
) at arch/i386/kernel/traps.c:682
#9  0xc01040ee in nmi_stack_correct () at atomic.h:175
#10 0x00000001 in ?? ()
#11 0x00000086 in ?? ()
#12 0xc05e6000 in ?? () at swab.h:134
#13 0xc05e7d80 in init_thread_union ()
#14 0xffffffff in ?? ()
#15 0x00000001 in ?? ()
#16 0xc471007b in ?? ()
#17 0x0000007b in ?? ()
#18 0x00000001 in ?? ()
#19 0xc04bd3b4 in _spin_lock_irqsave (lock=Variable "lock" is not available.
) at spinlock.h:130
#20 0xc4710f60 in ?? ()
#21 0xf7ff1880 in ?? ()
#22 0xc01292c4 in lock_timer_base (timer=0x46, flags=0xc05e7d80) at
kernel/timer.c:201
#23 0xc0129340 in __mod_timer (timer=0xf7ff1880, expires=13564284) at
kernel/timer.c:221
#24 0xc036632f in cfq_arm_slice_timer (cfqd=0xf7ff1800, cfqq=0xf7fbbc40) at
timer.h:75
#25 0xc036676d in cfq_select_queue (cfqd=0xf7ff1800) at
drivers/block/cfq-iosched.c:1016
#26 0xc0366a5e in cfq_dispatch_requests (q=0xc01040ee, max_dispatch=4) at
drivers/block/cfq-iosched.c:1089
#27 0xc0366b55 in cfq_next_request (q=0xf7f954f0) at
drivers/block/cfq-iosched.c:1175
#28 0xc0358adc in elv_next_request (q=0xf7f954f0) at drivers/block/elevator.c:349
#29 0xc039d770 in scsi_request_fn (q=0xf7f954f0) at drivers/scsi/scsi_lib.c:1300
#30 0xc035b4fe in blk_run_queue (q=0xf7f954f0) at drivers/block/ll_rw_blk.c:1591
#31 0xc039c94d in scsi_end_request (cmd=0xec7ebe00, uptodate=1,
bytes=-1072676626, requeue=1) at drivers/scsi/scsi_lib.c:576
#32 0xc039cd25 in scsi_io_completion (cmd=0xec7ebe00, good_bytes=4096,
block_bytes=512) at drivers/scsi/scsi_lib.c:787
#33 0xc03e182b in sd_rw_intr (SCpnt=0xec7ebe00) at drivers/scsi/sd.c:960
#34 0xc0397b6a in scsi_finish_command (cmd=0xec7ebe00) at drivers/scsi/scsi.c:911
#35 0xc0397a6f in scsi_softirq (h=0xc05dc8a0) at drivers/scsi/scsi.c:822
#36 0xc0124eba in __do_softirq () at kernel/softirq.c:95
#37 0xc0124f93 in do_softirq () at kernel/softirq.c:129
#38 0xc0125084 in irq_exit () at kernel/softirq.c:169
#39 0xc0105c18 in do_IRQ (regs=0xc05e7f78) at arch/i386/kernel/irq.c:110
#40 0xc0103ea2 in common_interrupt () at atomic.h:175
#41 0xc05e6008 in init_thread_union ()
#42 0x00000000 in ?? ()
(gdb) info thread
* 2 process 0  crash_get_current_regs (regs=0xc05e7bf8) at
arch/i386/kernel/crash.c:99
* 1 process 0  crash_get_current_regs (regs=0xc05e7bf8) at
arch/i386/kernel/crash.c:99
(gdb)


<<< gdb somehow is not displaying the stack for second thread correctly, but 
<<< still I was able to locate the second cpu's registers and stack from
<<< the kdump file using readelf and od

ebx     c4ac401c
ecx     00000001
edx     00000001
esi     c4872000
edi     c4710f60
ebp     c03681d0
eax     00000001
xds     0000007b
xes     c487007b
fs      00000000
gs      00000000
orig_eax 00000001
eip     c04bd412  <<< _spin_lock_irq_save+0x130 >>>
xcs     00000060
eflags  00000202
esp     c4873ed0
xss     00000068

And from the second thread's stack dumped in gdb using esp. The following 
call trace is seen

Thread 0                                   Thread 1

blk_run_queue                           cfq_idle_slice_timer()
 <<< takes queue->queue_lock >>>        <<< spinning for  
                                             cfq_data->queue->queue_lock >>>
 scsi_request_fn
  elv_next_request
   cfq_next_request
    cfq_dispatch_requests
     cfq_select_queue
      cfq_arm_slice_timer
       __mod_timer
        lock_timer_base
         <<< spinning for timer->base->lock >>>
          -=-=-=- NMI oops -=-=-=-