
Kernel errors and NMI

by Darrell Kingsley last modified Mar 13, 2014 02:12 PM
We've been getting weird kernel errors with CentOS 5.6 on our AMD-based server. Here's what we've discovered...

These are the errors we've been getting in /var/log/messages and on the command line.

May  2 07:22:24 lucolo0628 kernel: Uhhuh. NMI received for unknown reason 00.
May  2 07:22:24 lucolo0628 kernel: Do you have a strange power saving mode enabled?
May  2 07:22:24 lucolo0628 kernel: Dazed and confused, but trying to continue

These have been happening every few days, even though the server is under negligible load, as it is not yet a production server.
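To keep track of how often the message turns up, something like the following could be used. It is only a rough sketch: the helper names nmi_count and nmi_events are made up for illustration, and the default path assumes the messages are logged to /var/log/messages.

```shell
#!/bin/sh
# Hypothetical helpers for tallying the "unknown reason" NMI messages.
# Assumption: syslog writes kernel messages to /var/log/messages.

# Print how many NMI "unknown reason" messages the log file contains.
nmi_count() {
    logfile="${1:-/var/log/messages}"
    grep -c 'NMI received for unknown reason' "$logfile"
}

# Print the timestamp (month, day, time) of each such message.
nmi_events() {
    logfile="${1:-/var/log/messages}"
    grep 'NMI received for unknown reason' "$logfile" | awk '{ print $1, $2, $3 }'
}
```

Calling nmi_events with no argument reads the default log (root access is normally needed); passing a rotated file such as /var/log/messages.1 as the first argument checks older entries.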

Some research on the web suggests that this is a known problem and that it's related to the NMI watchdog using the High Precision Event Timer (HPET). It seems to be a non-fatal bug, although it can cause some boxes to hang on boot, or to crash under heavy load when lots of these errors are generated.
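One way to get a feel for whether NMIs are being delivered at all is to look at the NMI row of /proc/interrupts, which holds a counter per CPU. The sketch below adds those counters up; nmi_total is a made-up helper name, and reading a steadily growing total as "the NMI watchdog is running" is our assumption rather than something we have confirmed on this box.

```shell
#!/bin/sh
# Hypothetical helper: sum the per-CPU NMI counters from /proc/interrupts.
# A different file in the same format can be passed as the first argument.
nmi_total() {
    interrupts_file="${1:-/proc/interrupts}"
    awk '
        # Only the row labelled "NMI:" is of interest; add up its numeric columns
        # and ignore any trailing description text.
        $1 == "NMI:" { for (i = 2; i <= NF; i++) if ($i ~ /^[0-9]+$/) total += $i }
        END { print total + 0 }
    ' "$interrupts_file"
}
```

Running nmi_total twice, a minute or so apart, and comparing the two numbers shows whether the count is moving.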

Red Hat suggests either preventing the kernel from using the HPET, or switching off the NMI watchdog. Both of these are done with kernel boot parameters in /boot/grub/grub.conf.

We turned off HPET use by adding "nohpet" to the kernel line in the grub.conf file as follows:

title CentOS (2.6.18-308.4.1.el5)
        root (hd0,0)
        kernel /vmlinuz-2.6.18-308.4.1.el5 ro nohpet root=/dev/VolGroup00/LogVol00 rhgb quiet
        initrd /initrd-2.6.18-308.4.1.el5.img
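The change only takes effect after a reboot, so it seems worth confirming that the running kernel really was booted with the new parameter. A minimal sketch, assuming the command line is read from /proc/cmdline; has_kernel_param is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit status 0) if the parameter $1 appears as a
# whole word on the kernel command line in $2 (default: /proc/cmdline).
has_kernel_param() {
    param="$1"
    cmdline_file="${2:-/proc/cmdline}"
    # Put each parameter on its own line, then look for an exact, literal match.
    tr ' ' '\n' < "$cmdline_file" | grep -qxF -- "$param"
}
```

After the reboot, has_kernel_param nohpet && echo "nohpet is active" should print its message if the edit to grub.conf was picked up.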

I'll come back and let you know if that works. In the meantime, here are some of the pages we found which led us to this workaround:

HPET change makes no difference

Well, turning off HPET didn't work for us, as we got another one of the messages in the messages log on May 7th. So the next thing to try is changing the power regulation setting in the BIOS to "OS Control" (presumably from "Hardware control" or something similar) to see if Linux can control power saving modes in a way that doesn't annoy the NMI watchdog.

The next alternatives are, possibly, an upgrade from CentOS 5 to CentOS 6, or turning the NMI watchdog off.
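If we do end up turning the watchdog off, the kernel parameter for that is nmi_watchdog=0, added to the kernel line in grub.conf in the same way as nohpet. We haven't tried this yet. The sketch below, with a made-up helper name add_kernel_param, writes a modified copy of grub.conf to standard output instead of editing the real file, so the result can be checked first.

```shell
#!/bin/sh
# Hypothetical helper: read a grub.conf on stdin and write a copy to stdout with
# the parameter $1 appended to every "kernel" line that does not already have it.
# Nothing is edited in place.
add_kernel_param() {
    param="$1"
    awk -v p="$param" '
        $1 == "kernel" {
            present = 0
            for (i = 2; i <= NF; i++) if ($i == p) present = 1
            # Append to the original line so its indentation is preserved.
            if (!present) $0 = $0 " " p
        }
        { print }
    '
}
```

For example, add_kernel_param nmi_watchdog=0 < /boot/grub/grub.conf > /tmp/grub.conf.new produces a candidate file (the /tmp path is just an illustration) that can be compared against the original with diff before copying it into place.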

More as we get it...