In this installment we look at Wake-on-LAN and processor benchmarking threads from the Beowulf mailing list and file system benchmarks from the Linux kernel mailing list.

Turning on Nodes Through the Network

On November 3, 2003, Mathias Brito posted to the Beowulf mailing list asking how he could boot the master node of his cluster and then have the slave nodes boot automatically. There were several responses to his request. The most common suggestion was to use the Wake-on-LAN (WOL) feature of the NIC (Network Interface Card) in each node so that it boots when it receives a signal. For WOL to work, the NIC must have a chipset capable of WOL, and typically you connect a cable from the NIC to the WOL connector on the motherboard. The NIC has a very low-power mode that monitors the network for a special data packet that wakes up the system, causing it to boot. Erwan Velu pointed out that you can use Scyld's ether-wake program (it's freely available; see the Resources sidebar) to cause the nodes to boot. He added that by simply running ether-wake in rc.local on the master node, the compute nodes will start booting while the master node is still booting. However, care must be taken with this approach so that anything the compute nodes need from the master node, such as NFS file systems, DHCP, or TFTP, is available when they boot.
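
For readers curious about what that special packet actually looks like, here is a minimal sketch in Python (not Scyld's ether-wake itself, just a from-scratch illustration); the MAC address is a placeholder for one of your own nodes. A WOL "magic packet" is simply six 0xFF bytes followed by the target NIC's MAC address repeated sixteen times, sent as a UDP broadcast.

    # Minimal Wake-on-LAN sketch; the MAC address below is a placeholder.
    import socket

    def wake(mac: str, broadcast: str = "255.255.255.255", port: int = 9) -> None:
        mac_bytes = bytes.fromhex(mac.replace(":", "").replace("-", ""))
        # Magic packet: 6 bytes of 0xFF followed by the MAC repeated 16 times.
        packet = b"\xff" * 6 + mac_bytes * 16
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
            s.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
            s.sendto(packet, (broadcast, port))

    wake("00:11:22:33:44:55")  # placeholder MAC of a compute node's NIC

A loop over the MAC addresses of all compute nodes, called from rc.local as Erwan described, would have much the same effect as running ether-wake for each node.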

Don Becker went on to state that he thought a more reliable and sophisticated approach was to use systems with IPMI (Intelligent Platform Management Interface) 1.5 support. IPMI is a specification that defines a standard, abstracted, message-based interface to intelligent platform management hardware. It is used for system health monitoring, chassis intrusion monitoring, and other aspects of server monitoring on systems that have the intelligent hardware. It is supported by Intel, Dell, HP, and NEC. Don mentioned that waking each node over the network is included in the IPMI specification. Don also mentioned that most motherboards equipped for IPMI need a Baseboard Management Controller (BMC), which adds about $25-$150 to the cost of the motherboard. There was a small discussion about the price, but the final consensus was in the range that Don had mentioned.
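
As a rough sketch of what the IPMI route looks like in practice, the snippet below shells out to the widely used ipmitool utility to power on a set of nodes through their BMCs; the BMC hostnames and credentials are placeholders, and your BMC may require a different interface option than the IPMI-over-LAN one shown here.

    # Hypothetical sketch: power on compute nodes through their BMCs using
    # ipmitool. The BMC hostnames and credentials below are placeholders.
    import subprocess

    NODES = ["node01-bmc", "node02-bmc"]   # placeholder BMC hostnames
    USER, PASSWORD = "admin", "secret"     # placeholder credentials

    for bmc in NODES:
        # "-I lan" selects the IPMI-over-LAN interface from the 1.5 specification
        cmd = ["ipmitool", "-I", "lan", "-H", bmc,
               "-U", USER, "-P", PASSWORD, "chassis", "power", "on"]
        result = subprocess.run(cmd, capture_output=True, text=True)
        print(bmc, result.stdout.strip() or result.stderr.strip())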

Better Benchmarking

There was a very interesting discussion that resulted from a question, asked by Gabriele Butti on October 28, 2003, as to whether the Itanium 2 (I2) CPU or the Opteron CPU was better for a new cluster. There were some initial responses to the question, but a reply from Richard Walsh started a lengthy discussion about benchmarks, mostly between Richard and Mark Hahn. Richard and Mark are both well known on the Beowulf mailing list.

Richard started by discussing the I2's SpecFP 2000 performance and the Opteron's HyperTransport bus, which gives each processor access to the full memory bandwidth of the system. SPEC (Standard Performance Evaluation Corporation) is a non-profit corporation founded to establish and maintain a relevant set of benchmarks for high-performance computers. While SPEC has several benchmarks, the primary ones, SpecINT 2000 and SpecFP 2000, test integer and floating-point performance. Each benchmark suite consists of several programs that test different aspects of the computer system. SPEC requires testers to use a standard set of compiler flags for the baseline results but allows any combination of compiler flags for the peak results.

Mark responded that he thought the SPEC results for the I2 indicated that the SPEC codes were well suited to the I2's large cache but did not necessarily test the I2 itself. Mark and Richard then had a very detailed discussion about benchmarking, in which Mark also pointed out that some CPUs have very high results on certain parts of the SPEC tests but weak results on other parts. Even though SPEC uses a geometric mean to average the scores, which should reduce the impact of a large score on any one test, a very strong result can still skew the overall SpecFP 2000 number. Mark and Richard also discussed cache effects on the SpecFP 2000 benchmark, with Richard stating that he thought a benchmark or two that are sensitive to cache effects are important because some real-world codes behave the same way.
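
To see why the averaging matters, here is a small illustration with made-up sub-scores (not actual SPEC results). The geometric mean damps a single outlier less than you might expect: the "skewed" machine below posts the higher composite even though it is slower on three of the four codes.

    # Made-up sub-scores (not real SPEC results) comparing arithmetic and
    # geometric means. One cache-friendly outlier lifts the composite number.
    from math import prod

    def arithmetic_mean(scores):
        return sum(scores) / len(scores)

    def geometric_mean(scores):
        return prod(scores) ** (1 / len(scores))

    balanced = [1000, 1000, 1000, 1000]
    skewed = [800, 800, 800, 4000]   # one very strong result

    for name, scores in [("balanced", balanced), ("skewed", skewed)]:
        print(f"{name}: arithmetic={arithmetic_mean(scores):.0f} "
              f"geometric={geometric_mean(scores):.0f}")

The skewed set comes out around 1200 on the geometric mean versus 1000 for the balanced set, which is exactly the kind of distortion Mark was pointing at.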

Eric Moore joined in the discussion with a very interesting look at the SpecFP 2000 benchmark and some of the codes that make up the results. Robert Brown also posted his thoughts, which were in line with both Richard's and Mark's. He suggested that one needs to look at the results of ALL of the components of the SpecFP 2000 benchmark to get a good idea of the performance, rather than look at the geometric mean alone. Robert provided some very well-written comments about benchmarking for HPC (High-Performance Computing). He pointed out that he likes benchmarks that address various problem sizes and different aspects of the hardware, including the interconnect. Mark originally brought up the idea of a database of benchmarks that one could search or combine to generate meaningful results. Robert seconded this idea. Now, if someone could find the time to do it...


File System Shootout

I haven't written about the kernel mailing list before, but something with a direct bearing on clusters recently came across the list. On October 24, 2003, Mike Benoit posted an email to the kernel mailing list announcing some updates to his file system shootout. Mike used two benchmarks, Bonnie++ and IOZone, that are designed to test hard drive and file system performance. Mike had previously posted to the kernel mailing list when he first had file system performance results. There were several suggestions for tuning specific file systems and a request to use better hardware. He made the suggested changes and retested, this time using an Opteron 240 system with 512 MB of RAM and a PII/450 with 512 MB of RAM for the shootout. He tested EXT2, EXT3 with several options, XFS, JFS, ReiserFS v3 with two options, and Reiser4 with two options, all using several recent versions of the 2.6 test kernel. He ran the tests three times each on both a SCSI disk and an IDE disk and presents all of the results in a nice tabular form with some results highlighted.
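
For readers who want to run something similar themselves, here is a rough sketch of how the two benchmarks might be driven from a script; the mount point, sizes, and user are placeholders, and the flags shown are only the common ones, so check each tool's documentation before trusting the numbers.

    # Hypothetical driver for a small file system test, loosely in the spirit
    # of Mike's shootout. The mount point, sizes, and user are placeholders.
    import subprocess

    TEST_DIR = "/mnt/fs_under_test"   # placeholder: the file system being tested

    # Bonnie++: -d test directory, -s file size in MB (ideally at least twice
    # the machine's RAM), -u user to run the test as
    subprocess.run(["bonnie++", "-d", TEST_DIR, "-s", "1024", "-u", "nobody"],
                   check=True)

    # IOZone: -a runs the automatic matrix of record and file sizes,
    # -g caps the maximum file size, -f names the temporary test file
    subprocess.run(["iozone", "-a", "-g", "1G",
                    "-f", f"{TEST_DIR}/iozone.tmp"],
                   check=True)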

Why are file systems so important to clusters? Many cluster applications read and write data to a file system. This file system can be local (i.e., on the node itself) or part of a central file system (e.g., NFS). Applications could also be using a distributed file system such as Lustre or a high-speed parallel file system such as PVFS (Parallel Virtual File System) or GPFS (General Parallel File System). In all of these configurations, applications that are I/O (Input/Output) bound spend a great deal of time reading from and writing to file systems, so file system performance matters a great deal to them.
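
As a toy illustration of the kind of thing these benchmarks measure (this is not a substitute for Bonnie++ or IOZone, just a crude sequential-write timing), the sketch below writes a file and reports throughput; the target path and size are placeholders.

    # Crude sequential-write timing -- a toy illustration only.
    # The target path and size below are placeholders.
    import os
    import time

    PATH = "/tmp/io_test.dat"      # placeholder: put this on the file system under test
    SIZE_MB = 256
    CHUNK = b"\0" * (1024 * 1024)  # 1 MB of zeroes

    start = time.time()
    with open(PATH, "wb") as f:
        for _ in range(SIZE_MB):
            f.write(CHUNK)
        f.flush()
        os.fsync(f.fileno())       # make sure the data actually reaches the disk
    elapsed = time.time() - start

    print(f"wrote {SIZE_MB} MB in {elapsed:.2f} s ({SIZE_MB / elapsed:.1f} MB/s)")
    os.remove(PATH)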

Mike made several interesting observations. First, in his opinion, based on his benchmark results, XFS and JFS give the best bang for the buck. That is, they are close to EXT2 in performance while using only a small amount of CPU. It's interesting to note that the journaled file systems are slower than the non-journaled EXT2. So if you don't mind an occasional, potentially long file system check (fsck), then EXT2 is still quite fast. It could be very useful for relatively small, read-only file systems.

For applications that are bound by I/O performance, he recommends Reiser4, XFS, or ReiserFS v3. Remember, though, that Reiser4 is still experimental and has not yet made it into the 2.6 kernel; according to Mike, however, the results are very encouraging. He also mentioned that if your file system holds lots of small files, then ReiserFS v3 is the way to go, while if it holds medium to large files, he recommends XFS. Mike goes on to say that if you are CPU limited, he recommends JFS.

Finally, Mike made some observations comparing SCSI disks to IDE disks. He ran the tests on a SCSI disk spinning at 10,000 RPM and an IDE disk spinning at 7,200 RPM. He found that Reiser4 got about a 50% boost in speed on the SCSI disk compared to the IDE disk. JFS and EXT3 benefited the least from the move to SCSI, gaining only about 5-20%. He also mentions that in one case JFS actually ran slower on the SCSI disk than on the IDE disk. He finally suggests that a five-times cost difference for SCSI drives may not be worth it if an average improvement of only about 20% over IDE drives is all that is seen.
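
Mike's closing point is easy to put in numbers; the prices below are placeholders, but the arithmetic shows why a roughly 20% speedup is hard to justify at five times the price.

    # Back-of-the-envelope price/performance comparison. The prices are
    # placeholders; the 20% speedup is Mike's rough average for SCSI over IDE.
    ide_price, scsi_price = 100.0, 500.0   # assumed ~5x cost difference
    ide_speed, scsi_speed = 1.0, 1.2       # normalized throughput (~20% faster)

    ide_value = ide_speed / ide_price
    scsi_value = scsi_speed / scsi_price
    print(f"IDE:  {ide_value:.4f} throughput per dollar")
    print(f"SCSI: {scsi_value:.4f} throughput per dollar")
    print(f"IDE delivers about {ide_value / scsi_value:.1f}x the throughput per dollar")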

Sidebar One: Links Mentioned in Column

Linux Kernel List Summary

PVFS2

Linux Kernel Mailing List

Beowulf Mailing List

ether-wake

Bonnie++

IOZone

This article was originally published in ClusterWorld Magazine. It has been updated and formatted for the web. If you want to read more about HPC clusters and Linux, you may wish to visit Linux Magazine.

Jeff Layton has been a cluster enthusiast since 1997 and spends far too much time reading mailing lists. He occasionally finds time to perform experiments on clusters in his basement.