Summary: -------- - System locked up due to race in timer code, open bug in bugme.osdl.org http://bugme.osdl.org/show_bug.cgi?id=4559 - Generated saved Kdump successfully using nmi_watchdog and used the kdump successfully to analysis the lockup. Details: -------- - The machine locked up after few minutes and nmi_watchdog triggered following oops NMI Watchdog detected LOCKUP on CPU0, eip c04bd3b4, registers: Modules linked in: CPU: 0 EIP: 0060:[] Not tainted VLI EFLAGS: 00000046 (2.6.12-rc2-mm3-I) EIP is at _spin_lock_irqsave+0x24/0xa0 eax: 00000001 ebx: c4710f60 ecx: 00000001 edx: 00000086 esi: c05e6000 edi: c05e7d80 ebp: ffffffff esp: c05e7d58 ds: 007b es: 007b ss: 0068 Process swapper (pid: 0, threadinfo=c05e6000 task=c052cc00) Stack: c4710f60 f7ff1880 c01292c4 c4710f60 c4708f60 f7ff1880 c0129340 f7ff1880 c05e7d80 c0622000 00000086 f7ff1800 00000014 f7ff1800 f7ff1800 c036632f f7ff1880 00cef97c 00000292 00000292 f7fbbc40 00000000 c036676d f7ff1800 Call Trace: [] lock_timer_base+0x24/0x50 [] __mod_timer+0x50/0xe0 [] cfq_arm_slice_timer+0x9f/0x120 [] cfq_select_queue+0x11d/0x130 [] cfq_dispatch_requests+0x3e/0x90 [] cfq_next_request+0xa5/0xc0 [] elv_next_request+0x3c/0x170 [] kobject_get+0x17/0x20 [] scsi_request_fn+0x1f0/0x3a0 [] scsi_put_command+0x87/0xd0 [] blk_run_queue+0x4e/0x70 [] scsi_end_request+0xdd/0x110 [] scsi_io_completion+0x1b5/0x4b0 [] _spin_unlock_irqrestore+0xf/0x30 [] sd_rw_intr+0xdb/0x2b0 [] _spin_lock+0x16/0x90 [] scsi_device_unbusy+0x4a/0x90 [] scsi_finish_command+0x8a/0xb0 [] scsi_softirq+0xbf/0xe0 [] __do_softirq+0x7a/0x100 [] do_softirq+0x53/0x60 [] irq_exit+0x44/0x50 [] do_IRQ+0x28/0x40 [] common_interrupt+0x1a/0x20 [] mwait_idle+0x25/0x50 [] acpi_processor_idle+0x0/0x25d [] acpi_processor_idle+0x103/0x25d [] acpi_processor_idle+0x0/0x25d [] cpu_idle+0x41/0x80 [] start_kernel+0x16f/0x190 [] unknown_bootoption+0x0/0x1e0 Code: 8d b4 26 00 00 00 00 83 ec 08 89 1c 24 89 c3 b8 01 00 00 00 89 74 24 04 e8 da cf c5 ff be 00 e0 ff ff 21 e6 9c 5a fa 31 c0 86 03 <84> c0 7e 14 c7 43 04 00 00 00 00 89 d0 8b 1c 24 8b 74 24 04 83 console shuts up ... <0>Kernel panic - not syncing: Aiee, killing interrupt handler! - The kernel panic initated booting of dump capture kernel and the dump is collected by copying /proc/vmcore. - opened bug # 4559, at bugme.osdl.org. - Initial dump analysis root@x206g home]# gdb --core=vmcore vmlinux GNU gdb Red Hat Linux (6.1post-1.20040607.62rh) Copyright 2004 Free Software Foundation, Inc. GDB is free software, covered by the GNU General Public License, and you are welcome to change it and/or distribute copies of it under certain conditions. Type "show copying" to see the conditions. There is absolutely no warranty for GDB. Type "show warranty" for details. This GDB was configured as "i386-redhat-linux-gnu"...Using host libthread_db library "/lib/tls/libthread_db.so.1". warning: exec file is newer than core file. #0 crash_get_current_regs (regs=0xc05e7bf8) at arch/i386/kernel/crash.c:99 99 arch/i386/kernel/crash.c: No such file or directory. in arch/i386/kernel/crash.c (gdb) bt #0 crash_get_current_regs (regs=0xc05e7bf8) at arch/i386/kernel/crash.c:99 #1 0xc0115b0a in crash_save_self () at arch/i386/kernel/crash.c:107 #2 0xc0141c45 in crash_kexec () at kernel/kexec.c:1032 #3 0xc011ee30 in panic (fmt=0x0) at kernel/panic.c:78 #4 0xc01226d1 in do_exit (code=11) at kernel/exit.c:792 #5 0xc01052d1 in die_nmi (regs=0xc05e7d24, msg=0x0) at arch/i386/kernel/traps.c:616 #6 0xc0113ba2 in nmi_watchdog_tick (regs=0x0) at arch/i386/kernel/nmi.c:528 #7 0xc010535b in default_do_nmi (regs=0xc05e7d24) at arch/i386/kernel/traps.c:637 #8 0xc01054cd in do_nmi (regs=0xc05e7d24, error_code=Variable "error_code" is not available. ) at arch/i386/kernel/traps.c:682 #9 0xc01040ee in nmi_stack_correct () at atomic.h:175 #10 0x00000001 in ?? () #11 0x00000086 in ?? () #12 0xc05e6000 in ?? () at swab.h:134 #13 0xc05e7d80 in init_thread_union () #14 0xffffffff in ?? () #15 0x00000001 in ?? () #16 0xc471007b in ?? () #17 0x0000007b in ?? () #18 0x00000001 in ?? () #19 0xc04bd3b4 in _spin_lock_irqsave (lock=Variable "lock" is not available. ) at spinlock.h:130 #20 0xc4710f60 in ?? () #21 0xf7ff1880 in ?? () #22 0xc01292c4 in lock_timer_base (timer=0x46, flags=0xc05e7d80) at kernel/timer.c:201 #23 0xc0129340 in __mod_timer (timer=0xf7ff1880, expires=13564284) at kernel/timer.c:221 #24 0xc036632f in cfq_arm_slice_timer (cfqd=0xf7ff1800, cfqq=0xf7fbbc40) at timer.h:75 #25 0xc036676d in cfq_select_queue (cfqd=0xf7ff1800) at drivers/block/cfq-iosched.c:1016 #26 0xc0366a5e in cfq_dispatch_requests (q=0xc01040ee, max_dispatch=4) at drivers/block/cfq-iosched.c:1089 #27 0xc0366b55 in cfq_next_request (q=0xf7f954f0) at drivers/block/cfq-iosched.c:1175 #28 0xc0358adc in elv_next_request (q=0xf7f954f0) at drivers/block/elevator.c:349 #29 0xc039d770 in scsi_request_fn (q=0xf7f954f0) at drivers/scsi/scsi_lib.c:1300 #30 0xc035b4fe in blk_run_queue (q=0xf7f954f0) at drivers/block/ll_rw_blk.c:1591 #31 0xc039c94d in scsi_end_request (cmd=0xec7ebe00, uptodate=1, bytes=-1072676626, requeue=1) at drivers/scsi/scsi_lib.c:576 #32 0xc039cd25 in scsi_io_completion (cmd=0xec7ebe00, good_bytes=4096, block_bytes=512) at drivers/scsi/scsi_lib.c:787 #33 0xc03e182b in sd_rw_intr (SCpnt=0xec7ebe00) at drivers/scsi/sd.c:960 #34 0xc0397b6a in scsi_finish_command (cmd=0xec7ebe00) at drivers/scsi/scsi.c:911 #35 0xc0397a6f in scsi_softirq (h=0xc05dc8a0) at drivers/scsi/scsi.c:822 #36 0xc0124eba in __do_softirq () at kernel/softirq.c:95 #37 0xc0124f93 in do_softirq () at kernel/softirq.c:129 #38 0xc0125084 in irq_exit () at kernel/softirq.c:169 #39 0xc0105c18 in do_IRQ (regs=0xc05e7f78) at arch/i386/kernel/irq.c:110 #40 0xc0103ea2 in common_interrupt () at atomic.h:175 #41 0xc05e6008 in init_thread_union () #42 0x00000000 in ?? () (gdb) info thread * 2 process 0 crash_get_current_regs (regs=0xc05e7bf8) at arch/i386/kernel/crash.c:99 * 1 process 0 crash_get_current_regs (regs=0xc05e7bf8) at arch/i386/kernel/crash.c:99 (gdb) <<< gdb somehow is not displaying the stack for second thread correctly, but <<< still I was able to locate the second cpu's registers and stack from <<< the kdump file using readelf and od ebx c4ac401c ecx 00000001 edx 00000001 esi c4872000 edi c4710f60 ebp c03681d0 eax 00000001 xds 0000007b xes c487007b fs 00000000 gs 00000000 orig_eax 00000001 eip c04bd412 <<< _spin_lock_irq_save+0x130 >>> xcs 00000060 eflags 00000202 esp c4873ed0 xss 00000068 And from the second thread's stack dumped in gdb using esp. The following call trace is seen Thread 0 Thread 1 blk_run_queue cfq_idle_slice_timer() <<< takes queue->queue_lock >>> <<< spinning for cfq_data->queue->queue_lock >>> scsi_request_fn elv_next_request cfq_next_request cfq_dispatch_requests cfq_select_queue cfq_arm_slice_timer __mod_timer lock_timer_base <<< spinning for timer->base->lock >>> -=-=-=- NMI oops -=-=-=-