[Beowulf] File server dual opteron suggestions?

Joe Landman landman at scalableinformatics.com
Fri Aug 4 09:12:34 EDT 2006

Hi Mike

Mike Davis wrote:
> Joe,
> I don't mean to hijack the thread, but if Dave's users can fit the db's 
> that they are running against (Blast, for instance) in /tmp on the 
> compute nodes, overall performance increases. 

Yes, I agree with that.  I like reminding customers that there is 
nothing as fast in aggregate as a local file system.
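
A minimal sketch of the "does it fit in /tmp" check, in Python. The 
function name, the 10% headroom, and the example sizes are my own 
assumptions for illustration, not anything Mike or Dave actually runs:

```python
import shutil

GB = 1024 ** 3

def fits(db_bytes, free_bytes, headroom=0.10):
    """True if the db fits while leaving `headroom` (a fraction) of the free space unused."""
    return db_bytes <= free_bytes * (1.0 - headroom)

def fits_in_scratch(db_bytes, scratch_dir="/tmp", headroom=0.10):
    """Same check, using the free space currently reported for scratch_dir."""
    free_bytes = shutil.disk_usage(scratch_dir).free
    return fits(db_bytes, free_bytes, headroom)
```

For example, fits(130 * GB, 100 * GB) is False, while fits(20 * GB, 55 * GB) 
is True.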

> This certainly doesn't 
> work with genbank (unless you have 130+GB of /tmp). But it does work well 
> with nr, uniprot, and the other protein db's. I run relatively large 
> /tmp filesystems on my nodes (55-100GB). But my nodes are more general 

When we build systems for workloads dominated by large block sequential 
access (read/write), we focus on building a faster local IO capability. 
Like RAM, compute node disk space is cheap, though for sequential access 
dominated loads, more spindles are again almost always better.
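
A back-of-the-envelope sketch of the aggregate argument, in Python. The 
node count and per-node disk rate below are hypothetical numbers picked 
for illustration, not measurements; the 500 MB/s server figure is just a 
round number for a fast file server:

```python
def aggregate_local_mb_s(nodes, per_node_mb_s):
    """Total read bandwidth when every node streams from its own local disk."""
    return nodes * per_node_mb_s

def per_client_share_mb_s(server_mb_s, clients):
    """Best-case share per client when all clients pull from one server at once."""
    return server_mb_s / clients

# Hypothetical cluster: 32 nodes, 50 MB/s sequential read per local disk.
local_total = aggregate_local_mb_s(32, 50)      # 1600 MB/s in aggregate
server_share = per_client_share_mb_s(500, 32)   # 15.625 MB/s per client
```

This ignores network limits and caching on both sides, so treat it as an 
upper bound on what the shared server can deliver per client.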

> purpose and may be running blast one day, Gaussian 03 or VASP the next, 
> and Fluent or abaqus after that.
> The performance increase will depend on the size of the db, the size of 
> client and server caches, and the number of spindles.


> Mike
> Joe Landman wrote:
>> Mark Hahn wrote:
>>>> I would recommend upping the memory.  Computing or not, large buffer 
>>>> caches on file servers are, with very rare exception, a preferred 
>>>> config.
>>> unclear.  the FS's memory does act as an excellent cache, but then 
>>> again,
>>> the client memory does too.  do you have a pattern of file accesses 
>>> in which
>>> the same files are frequently re-read and would fit in memory?  the 
>>> servers
>>> I've looked at closely have had mostly write and attribute activity,
>>> since the client's own cache already has a high hit-rate.  for 
>>> writes, of
>>> course, more FS memory is not important unless you have extremely high 
>> I was actually assuming read-dominated.  Dave does informatics as I 
>> remember, and most of the informatics we have dealt with tends to be 
>> read dominated.  Doesn't mean much without the workload info, 
>> though.  So I agree with the caution, though I humbly note that a 1GB 
>> stick costs about $120 +/- a bit these days.  I.e., it is not a large 
>> price, and the potential impact on performance is much higher than for 
>> 10k RPM drives.
>> FWIW I have a pair of 10k RPM SATA raptors and I am not all that 
>> impressed with them.
>>> bandwidth net and disks.  in fact, I've been using the following 
>>> sysctl.conf
>>> entries:
>>> # delay writing dirty blocks hoping to collect further writes (default 30s)
>>> vm.dirty_expire_centisecs = 1000
>>> # try writing back every 1s (default 500=5s)
>>> vm.dirty_writeback_centisecs = 100
>>> in short, don't bother working at write caching much.  with a lot of 
>>> memory,
>>> an untuned machine will exhibit unpleasant oscillations of delaying 
>>> writes
>>> then frantically flushing.
>> Yup.  I had my dirty around 250 for a long time.  Write caching is 
>> harder because if you really want to play it safe, you shouldn't cache 
>> the write ...
>>>> 2GB/socket minimum.  Nothing serves files faster than having them 
>>>> already sitting in ram.
>>> true, but is that actually your working set size?  it would be rather 
>>> embarrassing if 3 of the 4 GB were files read once a month...
>> Hmmm... again, this is a good workload problem.  If Dave's users are 
>> going through big "databases" from NCBI, lots of ram is a good thing. 
>> If it is just a buncha small files, yeah, could be overkill.
>> But if I had to spend extra $$ on ram versus 10kRPM drives, I know 
>> where I would spend it ...
>>>>> 4 x 74 GB disks Ultra320 (or make an argument for a particular SATA)
>>> SATA disks are SATA disks, of course.  dumb controllers are all pretty
>>> similar as well (cheap, fast, not-cpu-consuming).  if you have your
>>> heart set on HW raid, at least get a 3ware 9550, which is quite fast.
>>> (most other HW raid are surprisingly bad.)
>> The LSI SAS unit is pretty good.  I like the 3ware, the Areca, and a 
>> few others.  We just created a nice 500+ MB/s "file server" for a 
>> large customer out of an Areca card, 16 spindles and some tweaking.  I 
>> haven't seen production performance data for it yet, but our in house 
>> testing exceeded the 500 MB/s by a little bit.
>>>>> dual 10/100/1000 ethernet on the mobo
>>>> Careful on this... we and our customers have been badly bitten by 
>>>> tg3 and broadcom NICs.  If the MB doesn't have Intel NICs, get an 
>>>> Intel 1000/MT dual gigabit card.  You won't regret that, and it is 
>>>> money well spent.
>>> that's odd; I have quite a few of both tg3 and bcm nics, and can't 
>>> say I've had any complaints.  what are the problems?
>> Interrupted to death.  The tg3 doesn't seem to have NAPI turned on by 
>> default in the standard distro kernels.  Haven't tried the FC* with 
>> this, hopefully it is saner there.  Under heavy load, we see 
>> interrupts climb past 40k/s, and it context switches like mad.  Seen 
>> this from early 2.6 through 2.6.13 on SuSE and RHEL.  Makes using AOE 
>> (Coraid) nearly useless with Broadcom, formatting the unit with ext3 
>> renders the server unusable for hours.  Drop a nice Intel unit in 
>> there, do the same thing and it works great, server is responsive 
>> during formatting.  Same issues for file service and heavy load.
>> Seen this on Tyan, iWill, Arima?, MSI(ibm e32*), and others.
>>>>> case - 2U (big enough for adequate ventilation, right?)
>>>> Yeah, just make sure you have good airflow.
>>> 2U still requires a custom PS, doesn't it?  it's kind of nice to be 
>>> able to put in an ATX-ish PS.  and is 2U tall enough for stock/standard
>>> heatsink/fans?
>> Don't know if it is custom.  I like the redundant PS, but the small 
>> redundant PSes tend not to supply enough current to boot the system. 
>> Need a 3U case for that.
>> Best cooling designs I have seen involve baffles, and a pull or 
>> push-pull config.  We have used some units where under load the 
>> processors are happily working around 22-28C.  Fans are loud though. 
>> Case (1U) is very cool to the touch.
>> For 2U you still need to worry about flow.  I find it hard to believe 
>> that most people get efficient flow out the back grating on 2U and 
>> larger without a helper fan of some sort.
>>> _______________________________________________
>>> Beowulf mailing list, Beowulf at beowulf.org
>>> To change your subscription (digest mode or unsubscribe) visit 
>>> http://www.beowulf.org/mailman/listinfo/beowulf

Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web  : http://www.scalableinformatics.com
phone: +1 734 786 8423
fax  : +1 734 786 8452
cell : +1 734 612 4615