[Beowulf] Torrents for HPC
Ellis H. Wilson III
ellis at cse.psu.edu
Wed Jun 13 08:07:07 EDT 2012
On 06/13/12 11:43, Peter wrote:
> I read the initial Q that the full data set may be required by any job
> so an upgrade to my personal filters may be required :). If this were
No, you are correct about that, or at least, that's what I understood it
to mean as well. So for instance, Job1 has Task1-30 and the 30GB
DataSet has Chunk1-30, each 1GB in size, spread over the entire cluster.
Hadoop just matches Task1 to the chunk it wants to work on. Yes, this
means that at least parts of the process must be embarrassingly
parallel, but that's pretty much taken for granted with big data
computation. The serial parts are typically handled by the shuffle and
reduce phases at the end.
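
To make the matching concrete, here is a minimal sketch in Python. It is
not Hadoop's actual scheduler; the function names, the random replica
placement, and the one-task-per-node rule are all assumptions made for
illustration. It spreads 3 replicas of each of 30 chunks over 30 nodes,
then gives each task a free node that already holds its chunk, falling
back to a remote read only when no such node is free.

```python
import random

# Hypothetical helpers for illustration only; not part of any Hadoop API.

def place_replicas(num_chunks, num_nodes, replication, seed=0):
    """Pick `replication` distinct nodes at random for every chunk."""
    rng = random.Random(seed)
    return {c: rng.sample(range(num_nodes), replication)
            for c in range(num_chunks)}

def assign_tasks(placement, num_nodes):
    """Greedily give each task (one per chunk) its own node.

    Prefers a free node that holds a replica of the chunk; otherwise
    takes any free node and marks the task as needing remote data.
    Returns {chunk: (node, is_local)}.
    """
    if len(placement) > num_nodes:
        raise ValueError("this sketch assumes at most one task per node")
    free = set(range(num_nodes))
    assignment = {}
    for chunk, holders in placement.items():
        local_candidates = [n for n in holders if n in free]
        node = local_candidates[0] if local_candidates else min(free)
        free.discard(node)
        assignment[chunk] = (node, node in holders)
    return assignment

placement = place_replicas(num_chunks=30, num_nodes=30, replication=3)
assignment = assign_tasks(placement, num_nodes=30)
local = sum(1 for _, is_local in assignment.values() if is_local)
print(f"{local} of {len(assignment)} tasks read their chunk locally")
```

Even this naive greedy pass tends to keep most tasks local with only 3
replicas per chunk, which is the intuition behind not needing a full
copy of the data set on every node.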
> Given that 30-60Gb is small enough copy everywhere, that sort of takes
I wouldn't expect much performance improvement going from 3 to all 30
chunks on a given node, unless you are incredibly unlucky or something
is terribly misconfigured with your Hadoop instance. While 30GB isn't
too bad to copy elsewhere, it's an incredibly poor use of storage
resources, having 30 copies of the data all over.
> The comment regarding the obscuring the replication process was directed
> more towards the user experience, they don't need to know it
> automagically happens BUT behind the scenes the copies are happening all
> the same, with the expected impact incurred on IO etc. So HDFS doesn't
> make the process impact free.
Making 30 copies of a 30GB dataset composed of 30 1GB files is quite
different from keeping 3 copies of each file, both in size and in the work
passed on to the user to manage. Even if you get unlucky and one of your tasks does
require remote data, Hadoop handles streaming it to the task while it
needs it and cleans up afterwards. It's going to be far more
considerate about storage resources than any human being will be.
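
The size difference is simple arithmetic. A small Python sketch, using
the numbers from this thread (30 files of 1GB each, 30 nodes) and a
helper name that is purely hypothetical:

```python
def total_storage_gb(num_files, file_size_gb, copies_per_file):
    """Cluster-wide storage consumed when every file is kept
    `copies_per_file` times."""
    return num_files * file_size_gb * copies_per_file

# Full copy of the data set on each of the 30 nodes.
copy_everywhere = total_storage_gb(num_files=30, file_size_gb=1,
                                   copies_per_file=30)

# Three replicas of each file, spread over the cluster.
three_replicas = total_storage_gb(num_files=30, file_size_gb=1,
                                  copies_per_file=3)

print(f"copy everywhere: {copy_everywhere} GB")   # 900 GB
print(f"3 replicas:      {three_replicas} GB")    # 90 GB
print(f"ratio:           {copy_everywhere // three_replicas}x")  # 10x
```

So copying everywhere costs ten times the storage of 3-way replication
for this data set, before counting the time spent pushing those copies
around.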
> If you are able to send more to the list regarding HDFS plan B that
> would be great and certainly something I'd be interested in hearing more
> about. Do you have a blog or similar with references regarding any of
> the above ? If so that would be much appreciated.
Not yet. Working on a website as well -- will let you know as soon as
Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf