diskless node + g98?
tekka99 at libero.it
Thu Jan 23 17:28:04 EST 2003
I don't know g98, but would NBD (http://www.xss.co.at/linux/NBD/) or
ENBD (http://www.it.uc3m.es/~ptb/nbd/) be of any help here?
Might they also have other uses in Beowulf clusters?
Has anyone had good experiences with them?
----- Original Message -----
From: "Ken Chase" <math at velocet.ca>
To: <beowulf at beowulf.org>
Sent: Thursday, January 23, 2003 9:17 PM
Subject: Re: diskless node + g98?
> On Thu, Jan 23, 2003 at 10:29:31AM -0800, Martin Siegert's all...
> > On Thu, Jan 23, 2003 at 11:43:29AM -0600, lmathew at okstate.edu wrote:
> > > Beowulf list readers:
> > >
> > > I have a Beowulf cluster (12 diskless nodes, 1 fileserver/master) with
> > > 26 processors (total) that is configured to run computational
> > > in both parallel and serial (pretty standard for this list). I am
> > > interested in utilizing my cluster to run a series of serial g98
> > > calculations on each node. These calculations (as many of you know)
> > > require a "scratch" space. How can this scratch space be provided to
> > > diskless node? Here are a few options that I have identified.
> > I am running a 96 node (192 processor) cluster as a multi-purpose
> > research facility for a university. I have a lot of g98 jobs running
> > on that cluster. All of my nodes have /tmp on a local disk with 15GB of
> > scratch space.
> We started with no local disk on our clusters for G98, and it really
> depends on a bunch of things (as always).
> At the time, they weren't giving hard drives away in cereal boxes, so we
> put them in every node. Our per node cost was only 2x the cost of a drive
> at the time.
> So we could buy N nodes with N disks, or we could buy N*1.5 nodes and put
> everything on a big nfs server (granted, the NFS server was already
> at no cost to the cluster, which doesn't make this a fair comparison).
> We would normally lose 10-20% of performance for the types of jobs we were
> running waiting for 'slow' (12 MB/s across 100Mbps) scratching to occur.
> (We found the jobs would want to scratch at that speed for about 5-15% of
> their runtime, so sharing 6 nodes per 100Mbps works fine for total
> throughput on 100Mbps networks).
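A rough sketch of the arithmetic behind that sharing claim, in Python. It assumes each job scratches at the full 12 MB/s for a fixed fraction of its runtime and that the demands of the nodes on one link simply add up on average; the function name and this averaging model are illustrative assumptions, not measurements from the thread.

```python
def link_utilization(nodes, scratch_fraction, per_node_mb_s=12.0, link_mb_s=12.0):
    """Average fraction of a shared link consumed by scratch traffic.

    nodes: number of diskless nodes sharing one 100Mbps link.
    scratch_fraction: share of runtime a job spends scratching at full speed.
    """
    demand_mb_s = nodes * scratch_fraction * per_node_mb_s
    return demand_mb_s / link_mb_s


# The 5-15% range and the 6-nodes-per-link figure come from the message above.
for fraction in (0.05, 0.15):
    load = link_utilization(6, fraction)
    print(f"{fraction:.0%} scratching, 6 nodes: {load:.0%} of the link on average")
```

Under these assumptions the average load stays below the link capacity across the quoted range, though bursts from several nodes scratching at once would still queue.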
> 2 things:
> 1) 1.5N * 0.9 = 1.35N diskless throughput vs 1N for putting drives in.
> 2) 1.5N * 0.9 + 1.5N * 0.1 = 1.5N vs 1N
> (2) stems from the fact that while processes are in NFSIO state, we can
> actually run other jobs on that same node in the meantime, and recover the
> otherwise lost. (In our situation there are *ALWAYS* large low priority jobs
> that have to run for a long time around for us to slam onto a node at idle
> priority level 20 (FreeBSD idle priority ROCKS), so we can do this.)
> all your jobs are at equal priority so you can't do this. Or perhaps you
> Linux so you must give up 5% of your CPU for nice 19'd jobs and thrash
> cache and get hammered on context switches. So this might not work for
> Idle priority solves this when you have any job that can use 100% of the
> for LONG PERIODS OF TIME (anything over a second is a VERY LONG TIME on a
> But, since hard drives are at worst 1/4 the cost of a node (regardless
> of what kinds of nodes you use) this all might require a reanalysis of
> the situation. (As always, tho, as RGB and I keep repeating, people keep
> leaving the $ signs out of their 'performance' calculations.)
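The comparison in points 1) and 2) can be redone for any drive price with a few lines of Python. This is a minimal sketch under stated assumptions: the budget is fixed, money not spent on drives buys extra bare nodes, and "1/4 the cost of a node" is read as drive price divided by bare-node price. The function and parameter names are hypothetical.

```python
def diskless_throughput(disk_cost_ratio, efficiency=0.9, soak_idle_cpu=False):
    """Throughput of a diskless cluster relative to a same-budget cluster with disks.

    disk_cost_ratio: price of one drive divided by the price of one bare node.
    efficiency: fraction of CPU a job gets while scratching over NFS.
    soak_idle_cpu: True if low-priority jobs reclaim the CPU lost to NFS waits.
    """
    # Skipping N drives frees enough money for disk_cost_ratio * N extra nodes.
    node_multiplier = 1.0 + disk_cost_ratio
    return node_multiplier * (1.0 if soak_idle_cpu else efficiency)


# 0.5 reproduces the 1.5N case above; 0.25 is the "at worst 1/4" case.
for ratio in (0.5, 0.25):
    plain = diskless_throughput(ratio)
    soaked = diskless_throughput(ratio, soak_idle_cpu=True)
    print(f"drive at {ratio:.0%} of node cost: "
          f"{plain:.3f}N plain, {soaked:.3f}N soaked, vs 1N with disks")
```

With the ratio at 0.5 this gives the 1.35N and 1.5N figures from the list; at 0.25 the diskless advantage shrinks to 1.125N without soaking, which is why the cheaper drives call for redoing the numbers.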
> In fact, for a specific number of nodes on a cluster we built last year,
> upgrade was to put drives on half the nodes. (A superlinear Moore's law on
> hard drive sizes/$ from the last year has helped a lot to underline how
> *WAITING* to purchase and install parts of your cluster can be! $50 USD for
> 40GB drives?! haha!) This allows us to run frequency scan jobs on g98 much
> faster on those nodes. It really depends on what kinds of jobs you run
> G98 as well, and what models of theory you use. Frequency and scan jobs
> the worst for thrashing scratch - just buy a hard drive per node.
> But use it only as scratch - manage the cluster from a shared NFS root for
> > > 1). Mount a LARGE ram drive? (1GB in size if possible??)
> > Almost certainly not good enough: most of the g98 jobs that I see on
> > my cluster need more than 1GB of scratch space.
> RAM drives don't work so well because they have a fixed size. Plus, RAM
> is more expensive than disk. A better solution: swap-backed ramdisk.
> We swap over the network as required (g98 runs in wired core), and it's
> extremely fast (in fact, I find it faster than whatever method g98 uses to
> write its scratch files). And, you don't need to hammer the network til you
> run out of RAM. Perfect solution.
> However, Linux's swap-backed ramdisk stuff is far less mature than
> md device. We have had a lot of success with it on FreeBSD >4.5
> 512 Megs of ram on these boards when the jobs really only want (and only
> seem to take, even when forced) 128 or 256 megs means small scratch files
> can be dealt with quick, and only when they get large do you go to the
> > > 2). Install hard disk drives in each of the slave nodes?
> > By far the best solution.
> Yes, by far the FASTEST PERFORMANCE PER NODE. I have a box of extra $
> for your calculations here if you'd like to use them. Then we can all do a
> FASTEST AGGREGATE THROUGHPUT PER DOLLAR calculation. (Doesn't anyone care
> about this? Why? Are we all building ASCI colour superclusters by using
> backhoes to dig into our gravel pit full of money?)
> > > 3). Use a drive mounted via NFS/PVNFS? (large amount of
> > Very bad. I first (because I did not know anything about g98) had g98
> > configured such that it would write its scratch files to the user's home
> > directory over NFS. This did not only drive the performance of the g98
> > towards 0, but what is worse it made life miserable for everybody on the
> > cluster (NFS timeouts, etc.).
> We found no NFS timeouts. We designed things properly with g98 scratching
> in mind. The NFS server has 4 RAID 0 striped drives that are specifically
> to handle this scratch work. No problems at all, once everyone was warned
> have Nx1.5 nodes *BECAUSE* we have no disk, so do not bitch that you only
> 90% cpu usage for your jobs! submit another low priority job and soak it
> if you really care." -- worked well.
> Furthermore, since you are running jobs singly per node, total throughput
> all nodes is obviously very important to you (you have sequential jobs you
> mentioned but obviously you don't have just 1 node - you have parallel
> of work to perform across your # of nodes). So in this case, if there's
> way you can split streams up into more than the 26 cpus you have and
> prioritize them differently, then you can soak up the extra cpu you have
> left over from not having disks.
> At least that was the philosophy behind it all when disks were a lot more
> than they are now. Again, they're so cheap, it might not make sense
> As I said, get out the $ signs and do the math. It makes less sense with
> faster and more expensive nodes. (The ratio of node cost per disk cost is
> a key part of the calculation).
> Another solution we're actually using on that cluster that now has the
> is, since it's been hard to hammer into people's heads to use SPECIFIC
> for SPECIFIC types of jobs (following the concept of a 'cluster tuned
> specifically for the jobs it runs'), we've just mounted every odd node with
> disk onto every even diskless node. So you only have 1 nfs client per
> and it gets full 100Mbps performance. Yes, it uses a bit of CPU on the
> node, but it's worth it in the long run.
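The odd/even pairing described above can be written down as a small mapping. This is only a sketch: it assumes nodes are numbered from 1, that odd-numbered nodes hold the drives, and that each even-numbered node mounts scratch from the odd node just below it. The hostnames and the /scratch export path are hypothetical.

```python
def pair_nodes(node_count):
    """Map each even-numbered diskless node to the odd-numbered node whose disk it mounts."""
    return {even: even - 1 for even in range(2, node_count + 1, 2)}


# One NFS client per disk-serving node, so each client gets the whole link.
for client, server in pair_nodes(6).items():
    # Hostnames and the export path are placeholders, not taken from the thread.
    print(f"node{client:02d} mounts node{server:02d}:/scratch")
```

A trailing odd node (for example node 5 in a 5-node cluster) simply has no client and uses its own disk.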
> The big loss of hard drives is that we've probably instantly doubled the
> failure rate of any component on the cluster - more to go wrong now. :(
> > > Has anyone encountered this? If so...what was the workaround that was
> > > implemented? I am open to any suggestions and comments. :)
> Can do 1 disk per n nodes if you want, but it makes sense to RAID 1 them to
> avoid downtime/bitching. Maintenance on a raid 1 can wait hours or days
> (weeks?) before being critical. And, with a few simple scripts that can
> chomp .log outputs from g98 and restart jobs where they left off, without
> paying the 3-10% performance hit for the 'checkpoint' feature in g98,
> which hammers the disk even harder, downtime hardly matters anymore anyway as
> long as the node comes back up eventually - you don't lose all work to date
> on that node. (This works extremely well with my feelings about designing
> clusters that *CAN* have nodes fail without major impact, allowing one to
> use very cheap parts without any throughput loss.)
> > I am going to stick my head out here: configuring a multi-purpose
> > cluster with diskless nodes is a misconfiguration. Only if you know
> > that you'll never run a job with significant I/O on your cluster
> > you could consider going diskless. Otherwise: stay away from that.
> No, just get out your $ signs and do appropriate calculations. Just be
> a bloody hardnosed cold realist and calculate the numbers. Look at total
> throughput per dollar from each design.
> I even did some GNUplots of disk usage vs CPU usage as well as network
> bandwidth used (MRTG) to win the cluster contract that had no local
> Yes it was slower per node, but we had far more nodes to more than make up
> for it. (And as I keep saying, the extra cpu that's idle can be reabsorbed.)
> > (you could install a high-performance file server on your cluster - we
> > actually have a Netapp NFS server - but for g98 your network becomes
> > the bottleneck. Furthermore, this is definitely more expensive than
> > installing local disks ...)
> It can become a bottleneck at really large numbers of jobs, yes, that's
> true. The bottleneck is transactions on disk per second, not the raw disk
> bandwidth. I'd suggest a disk array per 20-30 CPUs for what we do, but it's
> hard to compare what we do with what you're doing.
> [ Besides, even if the disk is being hammered, until the disk usage
> reaches a plateau (such that it's hammered equally all the time) *AND* the
> performance loss is not worth the extra nodes (regardless of soaking up
> CPU with other jobs that may (or not) hit disk), it may still be worth it.
> Again, it depends on your applications - if you have other non g98 jobs
> to run that hardly touch disk at all, you're laughing here - you'll always
> be able to soak up extra cpu caused by slow disk or network. ]
> G98 has many different job parameters and uses the disk in very different
> ways. It _REALLY DEPENDS_. Run your tests now on a few nodes and then plot
> your results vs dollars spent.
> > Just my $0.02
> > Martin
> > ========================================================================
> > Martin Siegert
> > Academic Computing Services phone: (604) 291-4691
> > Simon Fraser University fax: (604) 291-4242
> > Burnaby, British Columbia email: siegert at sfu.ca
> > Canada V5A 1S6
> > ========================================================================
> > _______________________________________________
> > Beowulf mailing list, Beowulf at beowulf.org
> > To change your subscription (digest mode or unsubscribe) visit
> Ken Chase, math at velocet.ca * Velocet Communications Inc. * Toronto,
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf