Number of I/O Servers
A common architectural question is how many I/O servers to use and whether
to use the spare disk space on the compute nodes for PVFS (i.e., use
the compute nodes as I/O servers). Answering this question is
difficult because of the many configuration options available. However,
a few studies provide some insight into the issue. Three of the major
ones were by Kent Milfeld et al. at the Texas Advanced Computing Center
(TACC), Jens Mache et al. at Lewis & Clark College, and Monica Kashyap
et al. at Dell Computers.
In 2001, Dr. Mache and his associates used PVFS1 to get high
performance disk access from a PC cluster using IDE disks. Their goal
was to break the 1 GB/sec I/O barrier that ASCI Red had broken at a
cost of $1 million. Dr. Mache used 32 AMD Athlon 1.2 GHz nodes connected
with Gigabit Ethernet (GigE). Each node had two IDE disks that were
configured with software RAID-0 (striping). They experimented with varying
the number of client nodes and the number of I/O servers on a small 8
node system while running a ray tracing program to compute a number
of frames of a simple scene. The best configuration consisted of 2
I/O servers and 6 clients. However, when all 8 nodes were made both
clients and I/O servers, the overall completion time was 1.264 times
better than the 2 I/O server/6 client configuration even considering
that the nodes were computing as well as functioning as I/O servers.
Next, Dr. Mache and his team set up all 32 nodes as both clients and I/O
servers. They then ran a code that was a variation of a read/write test
program that comes with PVFS1. The code writes and then reads blocks
of integer data to and from a PVFS1 file. Each node adds 96 MB (Megabytes)
to the global file that has a total of (96*n) MB, where n is the
number of nodes used. They found that after 25 overlapping nodes they
achieved at least 1 GB/sec in read performance and after 29 nodes
they achieved at least 1 GB/sec in write performance.
The cost comparison is even more interesting. ASCI Red spent about
$1 million at the time to achieve 1 GB/sec I/O throughput. Dr. Mache
spent about $7,200, improving I/O price/performance by more than a
factor of 100!
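
You can check the "factor of 100" claim yourself. Since both systems
delivered roughly the same 1 GB/sec, the price/performance ratio reduces
to the ratio of the two costs. The dollar figures are the approximate ones
quoted above; the names are illustrative.

```python
SUPERCOMPUTER_COST_USD = 1_000_000  # approximate figure quoted in the text
CLUSTER_COST_USD = 7_200            # approximate figure quoted in the text


def cost_ratio(expensive_usd: float, cheap_usd: float) -> float:
    """How many times cheaper the second system is for the same throughput."""
    return expensive_usd / cheap_usd


if __name__ == "__main__":
    ratio = cost_ratio(SUPERCOMPUTER_COST_USD, CLUSTER_COST_USD)
    print(f"about {ratio:.0f}x better price/performance")
```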
Kent Milfeld and his associates at TACC have examined PVFS1 performance
in a cluster with a simple read/write code and a simulated workload
code. The first study focused on 16 Intel PIII/1 GHz single CPU nodes
connected with Fast Ethernet. They varied the number of nodes assigned
as I/O servers, with the remaining nodes assigned as clients, so that
the two always summed to 16. They found that 8 I/O servers
and 8 clients gave the best performance for the simple read/write test
code. They also found that the Fast Ethernet network handicapped the
throughput of PVFS1.
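
A back-of-the-envelope model hints at why a balanced split can win on a
link-limited network. This is not the analysis from the TACC study. It is
a sketch that assumes one Fast Ethernet link per node, a non-blocking
switch, and no protocol overhead, so aggregate throughput is capped by
whichever side (servers or clients) has fewer links.

```python
FAST_ETHERNET_MB_PER_SEC = 100 / 8  # 100 Mbit/sec line rate = 12.5 MB/sec per link


def link_bound_mb_per_sec(io_servers: int, total_nodes: int = 16) -> float:
    """Upper bound on aggregate throughput for a server/client split.

    Assumption: every node has exactly one Fast Ethernet link, and the
    side with fewer nodes is the bottleneck.
    """
    clients = total_nodes - io_servers
    return min(io_servers, clients) * FAST_ETHERNET_MB_PER_SEC


def best_split(total_nodes: int = 16) -> int:
    """Number of I/O servers that maximizes the link-bound ceiling."""
    return max(range(1, total_nodes),
               key=lambda s: link_bound_mb_per_sec(s, total_nodes))


if __name__ == "__main__":
    for servers in range(1, 16):
        print(servers, 16 - servers, link_bound_mb_per_sec(servers))
```

In this simplified model the ceiling peaks at 8 servers and 8 clients, and
even that peak is only 100 MB/sec, which is consistent with the
observation that Fast Ethernet held PVFS1 back.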
A second system with 32 dual PIII/1 GHz nodes connected with
Myrinet 2000 was also tested. In these tests, they allowed one of
the two CPUs to be used as a client and the other as an I/O
server. They found that splitting the functions on a dual CPU system
produced higher throughput than using dedicated nodes. The most
likely reason is that a portion of the I/O was local to the nodes.
They also found that an equal number of clients and I/O servers
produced the best performance. This result is essentially the same
overlapped-node configuration that Dr. Mache used.
Monica Kashyap et al. at Dell Computers performed a similar study.
They used 40 Dell 2650 nodes with dual 2.4 GHz Intel Xeon processors
connected with Myrinet. Up to 24 nodes were used as compute nodes
and up to 16 nodes as I/O servers. Each I/O server had five 33.6 GB SCSI
drives. They used a test code from ROMIO, called perf, that performs
concurrent read/write operations to the same file. They examined two
types of write access, without and with file synchronization
(MPI_File_sync), and two types of read access, without file
synchronization and after file synchronization.
In general, they found without synchronization they could achieve
very high levels of throughput for write operations. Interestingly,
for a small number of I/O servers, you could rapidly increase the
number of clients from 4 to 24 without too much impact on the
overall throughput. Including synchronization ensures that the data
is on the disk before the function call returns and, as
expected, reduced the throughput. However, the general observation
that a small number of I/O servers is somewhat insensitive to the
number of clients, up to the number tested, still held. The file
read access testing exhibited the same trends as the write performance.
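
To get a feel for what synchronization costs, you can try a local analogy
with Python's standard library. This is not the perf code and it does not
use MPI: here os.fsync stands in for the role MPI_File_sync plays in the
Dell tests, and the file name and payload size are arbitrary choices for
the sketch. Timings will vary widely by system.

```python
import os
import tempfile
import time


def timed_write(path: str, payload: bytes, sync: bool) -> float:
    """Write payload to path and return the elapsed time in seconds.

    With sync=True the function does not return until the OS has been
    asked to flush the data to the storage device.
    """
    start = time.perf_counter()
    with open(path, "wb") as fh:
        fh.write(payload)
        if sync:
            fh.flush()             # push Python's buffer down to the OS
            os.fsync(fh.fileno())  # ask the OS to flush to the device
    return time.perf_counter() - start


if __name__ == "__main__":
    data = b"\0" * (8 * 1024 * 1024)  # 8 MiB of zeros (arbitrary size)
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "sync_demo.bin")
        for sync in (False, True):
            secs = timed_write(target, data, sync)
            print(f"sync={sync}: {secs:.4f} s")
```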
The differing results across the three studies probably reflect how
complex it is to choose the best number of I/O servers and
configuration options.
There are two interesting things you can take away from these studies.
First, you can "dial-in" your desired performance by adding I/O
servers to a number of clients until you reach the desired throughput.
This option is cost effective because you only add the number of I/O
servers needed for a given level of performance. Second, the option
of using compute nodes as both clients and I/O servers has been
shown to be cost effective, but could also lead to some network
congestion if multiple jobs are running at the same time.
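
The "dial-in" idea can be sketched as a first estimate. The function below
assumes throughput scales linearly with the number of I/O servers, which
real systems rarely do perfectly, and the per-server rates in the example
are hypothetical numbers, not measurements. Treat the result as a starting
point and then measure.

```python
import math


def servers_needed(target_mb_per_sec: float, per_server_mb_per_sec: float) -> int:
    """First estimate of the I/O servers needed to hit a target throughput.

    Assumption: each added server contributes the same measured rate
    (linear scaling).
    """
    if target_mb_per_sec < 0 or per_server_mb_per_sec <= 0:
        raise ValueError("target must be >= 0 and per-server rate must be > 0")
    return math.ceil(target_mb_per_sec / per_server_mb_per_sec)


if __name__ == "__main__":
    # Hypothetical per-server rates in MB/sec for a 1000 MB/sec target.
    for rate in (35, 40, 50):
        print(rate, "MB/sec per server ->", servers_needed(1000, rate), "servers")
```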
Increased Disk Performance
In some cases, people have found that the underlying disk speed is the
primary bottleneck. You can tune the disk for improved performance.
People have been using the command hdparm for several years to
improve the performance of disk drives. See the Resources Sidebar
for more information.
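
Before tuning anything, it helps to record a baseline. hdparm's -t and -T
options time buffered device reads and cached reads, respectively. The
small wrapper below is just one way to script that; the device path
/dev/sda is an assumption, the helper names are made up for this example,
and running hdparm usually requires root privileges.

```python
import shutil
import subprocess
import sys


def hdparm_timing_command(device: str) -> list:
    """Build the hdparm command that times cached (-T) and buffered (-t) reads."""
    return ["hdparm", "-tT", device]


def run_timing(device: str) -> str:
    """Run the timing command and return hdparm's output as text."""
    if shutil.which("hdparm") is None:
        raise FileNotFoundError("hdparm is not installed or not on PATH")
    result = subprocess.run(hdparm_timing_command(device),
                            capture_output=True, text=True, check=True)
    return result.stdout


if __name__ == "__main__":
    # Assumed default device; pass a different one as the first argument.
    device = sys.argv[1] if len(sys.argv) > 1 else "/dev/sda"
    print("baseline command:", " ".join(hdparm_timing_command(device)))
```

Run the same command again after each tuning change so you can compare the
numbers.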
Additionally, there is an easy thing you can do to improve disk
performance, namely using RAID (Redundant Array of Inexpensive
Disks). There are several RAID levels you can use to improve
performance. At the simplest level you can use multiple disks
in a RAID-0 (striping) configuration. As part of their study,
Dell looked at 1 to 4 disks in a hardware RAID-0 configuration.
They found that while the number of disks had only a small impact
on read performance, increasing the number of disks in the RAID-0
set had a large impact on write performance, particularly for
the file synchronization case (synchronizing your data is always
a good idea).
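
If you have not seen how striping spreads data, the mapping is simple
enough to sketch. The function below shows the usual round-robin layout of
RAID-0 in a simplified form; the stripe size parameter and the function
name are illustrative, and real controllers and the Linux software RAID
driver have more details than this.

```python
def raid0_locate(logical_block: int, n_disks: int, stripe_blocks: int = 1):
    """Map a logical block number to (disk index, block offset on that disk).

    Blocks are grouped into stripes of stripe_blocks blocks, and
    consecutive stripes are placed on consecutive disks, round-robin.
    """
    if logical_block < 0 or n_disks < 1 or stripe_blocks < 1:
        raise ValueError("invalid RAID-0 parameters")
    stripe_index, within = divmod(logical_block, stripe_blocks)
    disk = stripe_index % n_disks
    offset = (stripe_index // n_disks) * stripe_blocks + within
    return disk, offset


if __name__ == "__main__":
    # Eight consecutive blocks spread across a 4-disk set.
    for block in range(8):
        print(block, raid0_locate(block, n_disks=4))
```

Because consecutive stripes land on different disks, a large sequential
write can keep several disks busy at once.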
You could also use RAID-5, which adds some fault tolerance
(we'll discuss this in the next column). For increased reliability
you could combine RAID-0 with RAID-1 (mirroring). Whatever RAID
level you select, you can use a dedicated hardware RAID controller
or the software RAID that is built into Linux.
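
The RAID levels mentioned here trade capacity for redundancy in different
ways. The sketch below uses the standard capacity rules for identical
disks; the level names and the example of five 33.6 GB drives are only for
illustration and do not describe a configuration from any of the studies.

```python
def usable_gb(level: str, n_disks: int, disk_gb: float) -> float:
    """Usable capacity of an array of identical disks for common RAID levels."""
    if level == "raid0":                      # striping, no redundancy
        if n_disks < 1:
            raise ValueError("RAID-0 needs at least 1 disk")
        return n_disks * disk_gb
    if level == "raid1":                      # every disk holds a full copy
        if n_disks < 2:
            raise ValueError("RAID-1 needs at least 2 disks")
        return disk_gb
    if level == "raid5":                      # one disk's worth of parity
        if n_disks < 3:
            raise ValueError("RAID-5 needs at least 3 disks")
        return (n_disks - 1) * disk_gb
    if level == "raid10":                     # striped mirrored pairs
        if n_disks < 4 or n_disks % 2:
            raise ValueError("RAID-10 needs an even number of disks, at least 4")
        return (n_disks // 2) * disk_gb
    raise ValueError(f"unknown RAID level: {level}")


if __name__ == "__main__":
    # Hypothetical example: five 33.6 GB drives.
    for level in ("raid0", "raid5"):
        print(level, usable_gb(level, 5, 33.6), "GB usable")
```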