[Beowulf] delayed savings time crashes
kewley at gps.caltech.edu
Wed Apr 12 13:42:31 EDT 2006
The reboots were due to a City of Pasadena power glitch at 9:17 that
morning. :) It was raining, and a 34kV city feeder line that runs between
the generating plant at the entrance of the 110 and a substation at Del Mar
& Los Robles faulted. The responsible breaker took 13 cycles to break,
during which time the single-phase voltage seen at Caltech dropped to about
This info comes from the responsible EE at Caltech. As for its effects,
believe me, I know about it the hard way, as it took down 2/3 of our
compute nodes, 1/3 of our disk shelves, and 3/4 of our fileservers. Our
UPS has been on bypass these past 6+ months as we wait for our UPS vendor
to install a fix so that the UPS can handle the tendency of our computer
power supplies' internal Power Factor Correction feedback circuitry to lock
up & induce massive 12Hz oscillations on the room's power lines.
As for the time glitch, that is probably induced by the fact that Daylight
Savings Time changes only take place on the "system" clock, and in a
standard Red Hat system those changes only get synced to the hardware clock
upon a clean shutdown. So if your machine crashes after a DST change, then
upon bootup syslogd gets its time from the hardware clock, which is wrong.
The system clock is only corrected later in the bootup sequence, when ntpd
starts. The best solution is probably to set the hardware clock to UTC
rather than local time. UTC doesn't undergo step changes like most
timezones in the U.S. do, so the compensation for DST happens dynamically
in software, rather than requiring a hardware clock change.
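To make the effect concrete, here is a minimal Python sketch of the two
cases. It is only a model, not what the Red Hat boot scripts actually run:
the function name time_at_boot, the fixed PST/PDT offsets, and the assumed
true boot instant (09:24:33 PDT on Apr 4) are all illustrative assumptions
of mine.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets for Pacific time (an assumption that keeps the sketch free of
# any time zone database dependency).
PST = timezone(timedelta(hours=-8), "PST")  # before the spring DST change
PDT = timezone(timedelta(hours=-7), "PDT")  # after the spring DST change


def time_at_boot(true_utc, rtc_holds_utc):
    """Hypothetical model of the wall-clock time a machine assumes at boot
    after a crash that follows a spring-forward DST change."""
    if rtc_holds_utc:
        # The hardware clock never saw DST; software applies the current offset.
        return true_utc.astimezone(PDT).replace(tzinfo=None)
    # The hardware clock was last set in local (PST) wall time and never
    # updated, because there was no clean shutdown. It is read back verbatim.
    return true_utc.astimezone(PST).replace(tzinfo=None)


# Assumed true instant: 09:24:33 PDT on Apr 4, 2006, i.e. 16:24:33 UTC.
true_utc = datetime(2006, 4, 4, 16, 24, 33, tzinfo=timezone.utc)

local_rtc = time_at_boot(true_utc, rtc_holds_utc=False)
utc_rtc = time_at_boot(true_utc, rtc_holds_utc=True)

print("hardware clock in local time:", local_rtc)
print("hardware clock in UTC:       ", utc_rtc)
print("error until ntpd corrects it:", utc_rtc - local_rtc)
```

With the hardware clock kept in local time, the modeled boot time comes out
one hour behind, which is the kind of backwards step that shows up in the
syslogd restart line quoted below. With the hardware clock in UTC the two
agree.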
On Wednesday 12 April 2006 09:05, David Mathog wrote:
> This is an odd one. I just realized that 9 of 20 nodes
> rebooted on Apr 4. (Since they all rebooted successfully everything
> was working and there was no reason to think that this had
> taken place.) This appears to be related to the daylight
> savings time change two days before. The reason I think that is
> that the nodes that rebooted have /var/log/messages files like:
> Apr 4 08:01:00 nodename CROND ... /cron/hourly
> Apr 4 09:01:00 nodename CROND ... /cron/hourly
> Apr 4 08:24:33 nodename syslogd 1.4.1; restart
> Notice the time shift backwards between the last normal
> record and the first reboot record.
> As if it finally caught on that the clock had changed and that
> somehow triggered a reboot. Unfortunately none of the log files
> contain a message that indicated exactly what it was that ordered
> the reboot.
> Unclear to me what piece of software could have triggered this.
> Presumably something that had its own clock stuck one hour off
> on the previous time standard and also has the ability to restart
> the system. ntpd? Ganglia? They were both running.
> David Mathog
> mathog at caltech.edu
> Manager, Sequence Analysis Facility, Biology Division, Caltech
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf