tech-misc archive


kernel panics during disk tests



Hi!

We are currently testing new storage and ran some failover and outage
tests. I deployed 8 VMs running NetBSD/amd64 6.1.5 across four ESXi
hosts and had each of them generate I/O with bonnie++ against a
dedicated virtual disk (i.e. a second VMDK on each VM).

During the tests, everything performed as expected. The last test was
to cut off all SAN paths between the ESXi hosts and the storage, which
essentially halted the bonnies. When the storage became available
again, the bonnies stopped with errors, but the OS seemed unaffected.

After that, I restarted the bonnies, i.e. the OSes were not rebooted
in between. Five of the VMs completed their tests overnight, which
took about 11 hours. Three VMs panicked more or less simultaneously
after about 45 minutes:

Jan 22 18:08:09 bonnie-ham3 /netbsd: panic: lock error
Jan 22 18:08:09 bonnie-ham3 /netbsd: cpu0: Begin traceback...
Jan 22 18:08:09 bonnie-ham3 /netbsd: printf_nolog() at netbsd:printf_nolog
Jan 22 18:08:09 bonnie-ham3 /netbsd: lockdebug_abort() at netbsd:lockdebug_abort+0x3a
Jan 22 18:08:09 bonnie-ham3 /netbsd: rw_vector_enter() at netbsd:rw_vector_enter+0x281
Jan 22 18:08:09 bonnie-ham3 /netbsd: ufs_balloc_range() at netbsd:ufs_balloc_range+0xc5
Jan 22 18:08:09 bonnie-ham3 /netbsd: ffs_write() at netbsd:ffs_write+0x289
Jan 22 18:08:09 bonnie-ham3 /netbsd: VOP_WRITE() at netbsd:VOP_WRITE+0x37
Jan 22 18:08:09 bonnie-ham3 /netbsd: vn_write() at netbsd:vn_write+0xf9
Jan 22 18:08:09 bonnie-ham3 /netbsd: do_filewritev() at netbsd:do_filewritev+0x1fd
Jan 22 18:08:09 bonnie-ham3 /netbsd: syscall() at netbsd:syscall+0xc4
Jan 22 18:08:09 bonnie-ham3 /netbsd: cpu0: End traceback...
Jan 22 18:08:09 bonnie-ham3 /netbsd:
Jan 22 18:08:09 bonnie-ham3 /netbsd: dumping to dev 4,1 offset 2098055
Jan 22 18:08:09 bonnie-ham3 /netbsd: dump succeeded
Jan 22 18:08:09 bonnie-ham3 /netbsd:
Jan 22 18:08:09 bonnie-ham3 /netbsd:
Jan 22 18:08:09 bonnie-ham3 /netbsd: rebooting...

We are currently checking the logs of our storage. But the way the
affected VMs are distributed across the ESXi hosts and the storage
volumes does not point to the storage as the cause.

Is this a known issue (if the word "issue" fits, given the
circumstances)? Is there an explanation for why not all VMs were
affected?

As a funny side note: NetBSD on one of the rebooted VMs was convinced
for some time that it had an uptime of 167 days. "date" showed the
correct date and time, but "uptime" showed 167 days. By now, it shows
the correct uptime.

Joern

-- 
Joern Clausen
http://www.oe-files.de/photography/

