Large Number of Files
In some cases, the bioinformatics world has a need for handling large numbers of files. This need can be a problem when you are trying to address over 10,000 files in one directory! The people with large mp3 collections can sympathize. The bioclusters mailing list had a very interesting brief discussion about how to handle this. On Jan. 28, 2004, Dan Bolser posted a question looking for new information for an old problem - working with directories with over 10,000 files. Dan had some tools to get around the problem of handling this number of files in bash scripts, but felt that the filesystem was sluggish in working with the files. He said that the file systems used a linear, unindexed search of directories to find files. He said that he accidentally created a directory with more than 300,000 files which he referred to as a "... death trap for the system." He posted some quick thoughts about using a hash table to access the files with each node in the hash table being a directory. You would then follow the directory structure to find the file.
Elijah Wright posted that ReiserFS was designed to cope with exactly this problem (accessing files in directories with a large number of files). Joe Landman said that he liked XFS because it used B*-trees which could easily handle this situation. He said in theory that XFS can handle more than 10**7 files per directory. He thought JFS could handle on the order of 10**4 files per directory. Joe felt that none of the other file systems could handle this problem. Arnon Klein offered the possibility of using MySQL in a file system manner. In particular he mentioned LinFS which is a file system of sorts that uses MySQL as a backend.
Dan, the original poster, mentioned that he would try to persuade the administrators to try ReiserFS or XFS. Joe Landman offered the opinion that if they administrators would not switch, then using the hash table idea that Dan originally mentioned should work well. Joe also mentioned that he has been badly burned by ReiserFS in the past. Elijah Wright and Joe Landman also mentioned that XFS and ReiserFS are not really "new" file systems in that they have been around for several years. Joe Landman also posted some information about ext3. He said that under heavy journal pressure (performing lots of I/O to files) ext3 had problems. He said that the journal can become a liability because he felt it wasn't optimized yet. Joe said that he has several customers that are regularly seeing problems when using ext3 and software RAID.
To end the discussion Tim Cutts posted a nice short Perl script for hashing filenames. It has a hash depth of two directories and Tim said it was good for up to about 64 million files.
The discussion was interesting in that it shows how one can use file systems to improve performance of applications and if that doesn't work or is not possible, how one can use simple user-space scripts to get around problems. While writing scripts to handle problems may not be the most ideal solution to many people, it does allow you to solve your problems.
Hypothetical Situation
Brent Clements posted an interesting conundrum to the Beowulf mailing list. He has had requests from researchers to use a queuing/scheduling system to submit kernel builds and reboots. Preferably, a normal user could compile a customized kernel and boot a cluster node with it. When the job finished or if it failed to boot, reboot to the baseline kernel.
There were a variety of solutions proposed. Many thought UML (User Mode Linux -- Linux running Linux) might do the trick, but they were not sure how to incorporate into a batch system. Others thought diskless nodes and PXE DHCP booting was the way to go. After considering all the input, Brent proposed a series of "stock" kernels known to work with their cluster. The researchers could then modify the source and submit their job using a perl script they had developed. The script allows the users to reboot the the allocated nodes using the new kernel via DHCP and TFTP. If the nodes don't respond within 15 minutes, then the nodes are rebooted with a stock kernel.
This discussion is interesting for several reasons. The first reason is that you do see performance differences with kernels. I have seen differences between the RH series of kernels and the SLES series of kernels (never mind the different between 2.4 and 2.6). The second reason is that there may be some interest in having users run in a "sandbox" on the nodes so that if they crash the OS, they won't crash the node. There is likely to be some performance penalty to pay for this capability, but it does allow the node to stay up so that it doesn't require a reboot. The third reason is that it would be very simple to write a queuing/scheduling script that scheduled a job to be run that also installed an OS before the job is run. The nodes would run some base OS that is know to be quite stable, robust, with all of the latest security patches. Then when a job is run on the node, an OS is installed using something like Xen or UML before the job is run. If the "virtual" OS is fairly small, then the lost time is not too important and you ensure there is no OS skew among the nodes running the job (believe it or not, this is always a problem).
This is a very cool subject and one that we are likely to see more of in the future.
Sidebar One: Links Mentioned in Column |
This article was originally published in ClusterWorld Magazine. It has been updated and formatted for the web. If you want to read more about HPC clusters and Linux you may wish to visit Linux Magazine.
Jeff Layton has been a cluster enthusiast since 1997 and spends far too much time reading mailing lists. He can found hanging around the Monkey Tree at ClusterMonkey.net (don't stick your arms through the bars though).