[Beowulf] Home beowulf - NIC latencies
rossen at VerariSoft.Com
Tue Feb 8 09:23:57 EST 2005
Your questions about the actual cost (in terms of processor overhead)
of achieving the latency numbers posted by the network vendors are very
interesting and touch on important aspects that are often overlooked.
Warning: This posting is long and may be boring.
The ping-pong tests that are often used for measuring the communication
latency (from user level) are an extreme and often unrealistic mode of
operation of the parallel system. Moving bytes quickly across the
software layers and over the network is fundamental to fast computation,
but unless you also look at the cost and the likelihood (as Patrick
mentioned, "crossing the fingers") of actually achieving the best quoted
latencies, you don't get the whole picture.
Besides the network hardware/firmware, the implementation (and use) of
the low-level network messaging layer (GM, ELAN, VERBS, etc) and the MPI
library are also of great importance. The design space of parallel
applications (size of messages, frequency of messages, regularity in
space and time, synchrony, communication pattern, etc) is too large to
hope that any single mode of the entire system would always be optimal.
In this regard, the ping-pong latency test, which exercises only one of
these modes, gives you insufficient information to predict the behavior
of the communication sub-system in realistic scenarios.
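To make concrete what a ping-pong test actually exercises, here is a minimal sketch in Python. It is an illustration only, not MPI/Pro or any vendor's benchmark: it bounces a small message between two threads over a local socket pair, so the numbers it produces say nothing about a real interconnect. The function name ping_pong_latency_us and its parameters are made up for this example. Note that both sides do nothing except wait for the next message, which is exactly the unrealistic mode described above.

```python
import socket
import threading
import time


def _recv_exact(sock, n):
    """Read exactly n bytes from sock (recv may return fewer)."""
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the connection")
        buf.extend(chunk)
    return bytes(buf)


def ping_pong_latency_us(iterations=1000, size=8):
    """Return the mean one-way latency in microseconds.

    Assumption: one-way latency is taken as half the round-trip time,
    the usual convention for ping-pong tests.
    """
    if size < 1 or iterations < 1:
        raise ValueError("size and iterations must be >= 1")
    ping, pong = socket.socketpair()
    payload = b"x" * size

    def echo():
        # The "pong" side: receive a message and send it straight back.
        for _ in range(iterations):
            pong.sendall(_recv_exact(pong, size))

    worker = threading.Thread(target=echo)
    worker.start()
    start = time.perf_counter()
    for _ in range(iterations):
        ping.sendall(payload)
        _recv_exact(ping, size)
    elapsed = time.perf_counter() - start
    worker.join()
    ping.close()
    pong.close()
    return elapsed / (2 * iterations) * 1e6


if __name__ == "__main__":
    print(f"{ping_pong_latency_us():.2f} us one-way (local socket pair)")
```

Neither side has any other work to do here, so the figure reflects a best case and says nothing about the CPU time spent waiting.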
In order to address this issue, our MPI/Pro implementation (plug!) has
long had different modes of using the network and the low-level
messaging layer for all major high-speed networks as well as for TCP/IP
communication. We usually support at least 2 modes - one that optimizes
short message latency (as many of the other MPI implementations do), at
the expense of increased CPU overhead, and one that trades some latency
(communication overhead) for low CPU overhead, higher predictability,
and much better opportunity for overlapping and pipelining. We have
carried out studies to quantify the degree of overlapping that these
different modes can achieve (using only our MPI implementation, i.e.,
comparing apples to apples) and we have obtained some interesting results.
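The trade-off between the two kinds of modes can be sketched with a toy model. Everything in it is an assumption made for illustration: the function step_time_us, the single "overlap" fraction, and all of the numbers are hypothetical and are not measurements from MPI/Pro or any network. The model only says that the part of the message latency hidden behind computation does not add to the step time.

```python
def step_time_us(compute_us, latency_us, overlap):
    """Toy estimate of one compute-plus-communicate step, in microseconds.

    overlap is the fraction (0.0 to 1.0) of the message latency that the
    mode can hide behind computation. Only as much latency as there is
    computation can be hidden.
    """
    if not 0.0 <= overlap <= 1.0:
        raise ValueError("overlap must be between 0 and 1")
    hidden = overlap * min(compute_us, latency_us)
    return compute_us + latency_us - hidden


# Hypothetical modes: (latency in us, fraction of latency that can be hidden).
LATENCY_OPTIMIZED = (3.0, 0.0)    # CPU busy-waits, nothing is overlapped
OVERHEAD_OPTIMIZED = (10.0, 0.9)  # slower message, but the CPU stays free

if __name__ == "__main__":
    for compute_us in (2.0, 50.0):
        fast = step_time_us(compute_us, *LATENCY_OPTIMIZED)
        free = step_time_us(compute_us, *OVERHEAD_OPTIMIZED)
        print(f"compute={compute_us} us: latency mode {fast:.1f} us, "
              f"overhead mode {free:.1f} us")
```

With these made-up numbers the latency-optimized mode wins when there is little computation per message, and the low-overhead mode wins when there is enough computation to hide the transfer behind. That is the sense in which no single mode is always optimal.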
When you combine all of the complexities of the communication sub-system
(network hardware/firmware, messaging layer, MPI library), the
application, and the OS (let's only take the virtual memory system,
process/thread scheduling, and interrupt/signal handling) you get a
highly probabilistic system, which is hard to quantify and predict by a
single ping-pong latency number.
Our experiments have shown that using a different MPI/Pro mode on the
same application code, executed on the same parallel system, can
sometimes yield substantially different performance results. This shows that
the implementation and the use of the middleware alone can have a
substantial impact on your performance and scalability. Further, the
application code can be written (not always but often) to take advantage
of asynchrony, pipelining, and overlapping. Implementing these
mechanisms in your code (using MPI) often doesn't cost much, but can
speed up your application quite a bit on many parallel systems (running
middleware with the right design) and in the worst case give you no
benefit (on systems that don't provide adequate support for these
mechanisms).
So, if you really want to optimize the use of your cluster resources, in
addition to the network and compute nodes, you will need to also
consider the communication middleware and the design of your application
and how they all work together.
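The overlapping pattern mentioned above can be sketched as follows. This is a Python stand-in, not MPI code: in an MPI program the same shape is usually written with MPI_Irecv/MPI_Isend, some local computation, and then MPI_Wait. The names fake_transfer, local_work and overlapped_step are hypothetical, and the "transfer" is simulated with a sleep rather than a real network.

```python
from concurrent.futures import ThreadPoolExecutor
import time


def fake_transfer(data, delay_s=0.05):
    """Stand-in for a network transfer: wait, then deliver a copy of data."""
    time.sleep(delay_s)
    return list(data)


def local_work(n):
    """Stand-in for computation that does not depend on the incoming data."""
    return sum(i * i for i in range(n))


def overlapped_step(data, n):
    """Start the transfer, compute while it is in flight, then wait for it."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(fake_transfer, data)  # like posting a nonblocking receive
        partial = local_work(n)                     # useful work during the transfer
        received = pending.result()                 # like waiting on the request
    return received, partial


if __name__ == "__main__":
    received, partial = overlapped_step([1, 2, 3], 100_000)
    print(received, partial)
```

Whether this buys anything depends on the middleware making progress on the transfer while the application computes. If it does not, the wait simply takes as long as a blocking call would have, which is the "no benefit" worst case.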
Verari Systems Software, Inc.
Vincent Diepeveen wrote:
> At 21:27 5-2-2005 -0500, Patrick Geoffray wrote:
>>Vincent Diepeveen wrote:
>>>>>CPU's are 100% busy and after i know how many times a second the network
>>>>>can handle in theory requests i will do more probes per second to the
>>>>>hashtable. The more probes i can do the better for the game tree search.
>>>>With a gigE network that sounds like 40us or so. With Myrinet or IB
>>>>it's in the 4-6us range. If you bought dual opterons with the special
>>>At the quadrics and dolphin homepage they both claim 12+ us for Myrinet.
>>Seriously, here are MPI latencies with MX on F cards on Opteron (PCI-X),
>>that includes fibers and a switch in the middle:
>>   Length   Latency(us)   Bandwidth(MB/s)
>>        0         2.684             0.000
>>        1         2.874             0.336
>>        2         2.898             0.690
>>        4         2.978             1.343
>>        8         2.965             2.699
>>       16         2.993             5.347
>>       32         3.409             9.388
>>       64         3.563            17.960
>>      128         3.977            32.185
>>      256         5.699            44.916
>>Quadrics would be lower by a 1.5 us, I don't know about Dolphin, I
>>didn't hear about noticeable SCI clusters in a long time.
>>>I am very impressed by the quadrics and dolphin cards. Probably by
>>>infinipath too when i check them out. Will do.
>>>I'm not so impressed yet by myrinet actually, but if cluster builders can
>>>earn a couple of hundreds of dollars more on each node i'm sure they'll
>>>do it.
>>I don't think Myrinet would be the cheapest, I am sure you can get a
>>better deal from desperate interconnect vendors.
>>What does not impress you in Myrinet ?
> Thanks for your kind answer Patrick,
> Obviously i mentioned that number because i read it elsewhere.
> Well a number of points bother my mind from which majority is true for
> others as well. But first let me note that i'm not against myrinet in
> general. I am just trying to solve a very specific case. For that specific
> case i'm not so impressed.
> Note that so far i didn't find any desperate vendor. For sure quadrics
> doesn't look desperate to me, they aren't even selling old cards anymore
> though they must have still thousands of them lying at home from returned
> upgraded networks. Finding second hand highend cards seems to be very seldom.
> First of all i'm interested in how quick i can get 4-64 bytes from remote
> memory. So not from some kind of network card cache, as myrinet doesn't
> have some megabytes on chip, but just a few tens of kilobytes. The memory
> has to come therefore from the remote nodes main memory, at a random address
> in the main memory. No streaming at all happens. that 400 ns extra that the
> TLB gives is definitely not the problem i guess.
> The problem for me is to understand: "how do you get that memory at a
> A latency on paper says of course nothing when you can't actually get it
> within that time.
> "Paper supports everything."
> Arturo Ochoa (Caracas, Venezuela)
> I hope everyone realizes that an important consequence from beowulf
> clusters is that you actually want to *use* all those cpu's you have to
> your avail.
> So every cpu has a program running that eats 100% system time. Because if
> it wouldn't use 100% system time, you wouldn't need a cluster!
> From that 100% system time obviously you must be prepared to give away some
> to serve other nodes as quickly as possible doing a read.
> All latencies i see quoted at all hardware sites, it is very hard to figure
> for me out whether that's a latency that is supported by paper, or whether
> it's a practical latency i can take into account as a programmer with all
> software layers overhead when each cpu is 100% running a program.
> Secondly, but as i'm not a cluster expert i don't know how to avoid that,
> it's of course a big LOSS in sequential speed if my program each few
> instructions must check whether there is some MPI message to get handled.
> If i check a lot that will slow down my program 20 times. If i don't check
> a lot, other cpu's will have to wait longer and that defeats the purpose of
> a fast network card.
> Factor 20 is about the slowdown of the average 'old' supercomputer
> chessprograms which use MPI type solutions. Zugzwang (Paderborn-Siemens),
> P.Conners (Paderborn-Siemens), cilkchess (MIT). I've been playing with my
> own eyes against those programs in world champs and despite that it has
> happened that i played at the same hardware with a similar amount of cpu's
> and a program having factor 100 more chessknowledge (which slows down the
> program *considerable*), the actual speed at which the program searches
> nodes was up to factor 5-10 faster.
> Now a few years ago this was not a major problem because for example
> Cilkchess which obviously ran factor 20-40 times slower than it could, used
> 1800 processors for example in world champs 1995 (Hong kong) and 512
> processors in world champs 1999 (Paderborn). Of course because 1 processor
> was real real fast compared to the speed of 1 pc processor in those days,
> they practical were searching a lot deeper than pc programs (and both
> played excellent for its days, especially Don Dailey needs to get a big
> compliment for that).
> However if i show up with 2 pc's and 2 network cards, then it sure matters
> when i lose a lot of speed.
> Obviously for embarrassingly parallel software this is no issue, but usually
> for embarrassingly parallel software all you need is gigabit ethernet.
> There are so many MPI applications which are not exactly embarrassingly
> parallel from which you see that a decent programmer single cpu would be
> doing that 20 times faster. Or to quote someone who has been doing such
> rewriting work for some physical applications that run here and there: "I
> didn't blink my eyes when i managed to speedup an application factor 1000".
> So it is very interesting for us all and me especially to understand how
> *fast* you can get that memory under full load of all the logical cpu's.
> Third each pc has 2 cheapo k7 processors which are a lot slower than opterons.
> Second problem i have is that i can get easily dual k7 pc's from
> chessplayers and they can get bought cheap still. Dual k7 is practical same
> speed like a dual xeon 3.06Ghz Northwood with all memory slots filled with
> 2-2-2 DIMMS for DIEP. So just compare the price of such a system with a
> cheapo dual k7 with registered cas3 RAM.
> Those dual k7's have 64 bits 66Mhz slots, not pci-x as far as i know and
> also those who do have A64's or P4's usually don't have pci-x onboard
> either. Sure there is boards that have them and i'm sure that if you make a
> Dolphin can deliver 'bytes' they say at their homepage in 3.3 us at MPX
> mainboards and claim somewhere a paper latency of 1.x us.
> What is the achieved read speed to remote memory myrinet gets at 64 bits /
> 66Mhz in software, so ready to use 4-64 bytes for applications?
> I'm not asking it to be accurate within 400ns, as that's the delay you'll
> have from TLB trashing the remote node. But accuracy within 1.5 us would be
> quite nice.
> First of all for integer intensive applications i'm doing fastest processor
> is opteron, k7 comes second and P4 comes third. Exception is a P4 machine
> equipped with the most expensive stuff (2-2-2 ram and all banks filled)
> good mainboard and northwoods and overclocked at the mainboard. However for
> that price a dual opteron can get bought and it just blows away that P4
> Every year that new software gets released of course that P4 gets slower,
> because newer software only gets more and more complex with more options
> and will fit less perfectly in P4's small tiny caches, let alone when we
> get a lot of 64 bits programs. They won't fit at all in those tiny slow
> So until the dual core opterons arrive at low cost, obviously you can make
> dual k7 nodes for just a few hundreds of dollar a node.
> When adding new nodes which in the future no doubt are dual opteron, you
> still run further with those dual k7 nodes and want to mix them obviously
> with dual opterons. Is that possible?
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf