<HTML>

<HEAD>

<TITLE>Re: [Beowulf] dedupe filesystem</TITLE>

</HEAD>

<BODY>

<FONT FACE="Calibri, Verdana, Helvetica, Arial"><SPAN STYLE='font-size:11pt'>Isn&#8217;t de-dupe just another flavor, conceptually, of a journaling (or, more precisely, versioning) file system, in the sense that in many systems only a small part of a file actually changes each time, so saving &#8220;diffs&#8221; lets one reconstruct any arbitrary version in much less space? <BR>

I guess de-dupe is a bit more aggressive than that, in that it can in theory look for common &#8220;stuff&#8221; between unrelated files, so maybe a better model is an on-the-fly &#8220;data compression&#8221; algorithm. &nbsp;And for that, it&#8217;s all about trading off the cost of storage space, retrieval time, and the computational effort to run the algorithm. &nbsp;(Reliability factors into it a bit. Compression removes redundancy, after all, but the de facto redundancy provided by having previous versions around isn&#8217;t a good &#8220;system&#8221; solution, even if it&#8217;s the one people use.)<BR>

<BR>

I think one can make the argument that computation is always getting cheaper, at a faster rate than storage density or speed improve (because of the physical limits on storage), so the &#8220;span&#8221; over which you can do compression can keep growing over time. TIFF and fax do compression over a few bits. Zip and its ilk do compression over kilobits or megabits (depending on whether they build a custom symbol table). &nbsp;Dedupe is doing compression over gigabits and terabits, presumably (although I assume there&#8217;s a granularity at some point: a dedupe system looks at symbols that are, say, 512 bytes long, as opposed to ZIP looking at 8-bit symbols, or Group 4 fax looking at 1-bit symbols).<BR>
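<BR>
To make the 512-byte granularity concrete, here is a minimal Python sketch of block-level dedupe. It is only an illustration, not how any particular product works: the DedupeStore class, the fixed 512-byte BLOCK_SIZE, the choice of SHA-256 as the block fingerprint, and the file names are all assumptions of mine.<BR>
<PRE>
```python
import hashlib

BLOCK_SIZE = 512  # bytes per "symbol"; an assumed granularity


class DedupeStore:
    """Toy content-addressed block store (hypothetical, for illustration)."""

    def __init__(self):
        self.blocks = {}  # SHA-256 hex digest -> block bytes, kept once
        self.files = {}   # file name -> ordered list of block digests

    def put(self, name, data):
        """Split data into fixed-size blocks and store only unseen ones."""
        digests = []
        for offset in range(0, len(data), BLOCK_SIZE):
            block = data[offset:offset + BLOCK_SIZE]
            digest = hashlib.sha256(block).hexdigest()
            self.blocks.setdefault(digest, block)  # no-op if already stored
            digests.append(digest)
        self.files[name] = digests

    def get(self, name):
        """Rebuild a file by concatenating its blocks in order."""
        return b"".join(self.blocks[d] for d in self.files[name])

    def stored_bytes(self):
        """Bytes actually held, counting each unique block once."""
        return sum(len(block) for block in self.blocks.values())


# Two unrelated files that happen to share one 512-byte block.
block_a = b"A" * BLOCK_SIZE
block_b = b"B" * BLOCK_SIZE
block_c = b"C" * BLOCK_SIZE

store = DedupeStore()
store.put("one.dat", block_a + block_b)
store.put("two.dat", block_b + block_c)

logical_bytes = sum(len(store.get(name)) for name in store.files)
print(logical_bytes, store.stored_bytes())
```
</PRE>
The two files add up to 2048 bytes but the store holds only 1536, because the shared block is kept once. A real system would also have to handle hash collisions, reference counting on delete, and probably variable-size chunking, none of which is shown here.<BR>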

<BR>

The hierarchical storage is really optimizing along a different axis than compression. &nbsp;It&#8217;s more like a cache than compression: it makes the &#8220;average time to get to the next bit you need&#8221; smaller, rather than making the number of bits smaller.<BR>

<BR>

Granted, for a lot of systems, &#8220;time to get a bit&#8221; is proportional to &#8220;number of bits&#8221;.<BR>
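<BR>
The caching view of tiered storage can be put as simple arithmetic. A rough sketch, assuming a two-tier store: the function name average_access_time is hypothetical, and the hit fractions and latencies below are made-up numbers for illustration, not measurements.<BR>
<PRE>
```python
def average_access_time(hit_fraction, fast_seconds, slow_seconds):
    """Expected time per request for a hypothetical two-tier store.

    hit_fraction is the share of requests served from the fast tier;
    everything else has to be fetched from the slow tier.
    """
    miss_fraction = 1.0 - hit_fraction
    return hit_fraction * fast_seconds + miss_fraction * slow_seconds


# Assumed latencies: 10 ms for online disk, 60 s to recall from tape.
mostly_online = average_access_time(0.90, 0.010, 60.0)
nearly_all_online = average_access_time(0.99, 0.010, 60.0)
print(mostly_online, nearly_all_online)
```
</PRE>
With these assumed numbers, raising the hit fraction from 0.90 to 0.99 cuts the average by roughly a factor of ten without storing a single bit fewer, which is the sense in which tiering and compression optimize along different axes.<BR>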

<BR>

On 6/5/09 8:00 AM, &quot;Joe Landman&quot; &lt;<a href="landman@scalableinformatics.com">landman@scalableinformatics.com</a>&gt; wrote:<BR>

<BR>

</SPAN></FONT><BLOCKQUOTE><FONT FACE="Calibri, Verdana, Helvetica, Arial"><SPAN STYLE='font-size:11pt'>John Hearns wrote:<BR>

&gt; 2009/6/5 Mark Hahn &lt;<a href="hahn@mcmaster.ca">hahn@mcmaster.ca</a>&gt;:<BR>

&gt;&gt; I'm not sure - is there some clear indication that one level of storage is<BR>

&gt;&gt; not good enough?<BR>

<BR>

I hope I pointed this out before, but Dedup is all about reducing the<BR>

need for the less expensive 'tier'. &nbsp;Tiered storage has some merits,<BR>

especially in the 'infinite size' storage realm. &nbsp;Take some things<BR>

offline, leave things you need online until they go dormant. &nbsp;Define<BR>

dormant on your own terms.<BR>

<BR>

</SPAN></FONT></BLOCKQUOTE>

</BODY>

</HTML>