[Beowulf] C vs C++ challenge (awk version)

Joe Landman landman at scalableinformatics.com
Fri Jan 30 20:31:16 EST 2004

On Fri, 2004-01-30 at 16:39, Selva Nair wrote:
> On Fri, 30 Jan 2004, Joe Landman wrote:
> > On Fri, 2004-01-30 at 11:37, Robert G. Brown wrote:
> > 
> > > Note that dwc finds 67505 "distinct words".  This is very interesting,
> > 
> > This is what I find in the Perl version as well.  I wish I could get
> > that Java code compiling on this platform so I could compare the two.
> The awk script I posted earlier counts tokens separated by [:blank:] and I
> guess that is what the C++ version also does -- tokens not words as rgb
> carefully pointed out. I was trying to mimic the C++ code, so case and

OK.  I changed mine slightly.  Very naive version here:

	Perl 5.8.3

	wrnpc11.txt	1.21 +/-  0.01 s
	shaks12.txt	2.09 +/-  0.04 s
	big.txt		14.31 +/- 0.09 s
	vbig.txt	1m11.49 +/- 0.52 s

I could easily optimize it more (do the work on a larger buffer at
once), but I think enough waste heat has been created here.  This is a
simple 2500+ Athlon XP box (nothing fancy) running 2.4.24-pre3.


	use strict;
	use warnings;

	my $filename = shift;
	die "Usage: wcount2.pl filename\n" if (!defined($filename));
	# open the input file read-only
	open( my $infile, '<', $filename ) or die "Cannot open $filename: $!\n";
	printf "opened file ...\n";
	my (%word, $buffer);
	# hash-slice every lower-cased \w+ token into %word
	while ( defined( $buffer = <$infile> ) )
	      { @word{ ( lc($buffer) =~ /(\w+)/g ) } = 1; }
	# the 1 + forces scalar context on keys (and matches the counts below)
	printf "Words: %s\n", 1 + keys %word;



> eloquently expressed by rgb. I cannot see any reason to learn or use C++
> simply for sets or hashes or trees in STL.

The view I have is to use the most appropriate tool for the task.  Some
tools are in search of tasks for which they are appropriate.  Marketing
has in some cases supplanted common sense.

> By the way, the performance of C++ STL appears to vary by up to a factor
> of 10 between different versions of g++. The number I quoted (2.8 s)  was
> the fastest I could get and that was with g++ 2.96. For some reason 
> version 3.0.x runs much slower here.

Alex Stepanov had a benchmark on the cost of the STL and the advanced
C++ features.  Some compilers did rather poorly as I remember.  You
might get better results from Intel's compiler or the PathScale
compilers (for the Opteron).

> Selva 
> Footnotes:
> 1. Googling shows that Shakespeare used 31,534 different
> words and a total of 884,647 words in his complete works 


time ./wcount2.pl shaks12.txt
opened file ...
Words: 23903
2.000u 0.020s 0:02.04 99.0%     0+0k 0+0io 360pf+0w

I am using /(\w+)/g, which grabs each match into a capture buffer.  A
"word" here is a run of alphanumeric characters plus underscore.  I
could build other regexes.  Using /(\S+)/g gives 59509.  Using
/\b(.*)\b/g I get 110557.

> (http://www-math.cudenver.edu/~wbriggs/qr/shakespeare.html)
> And some statistical projections indicate he probably
> knew that many more words but did not use them. In any case, the awk
> result is surprisingly close to 31,534 in spite of the
> much smaller sample we used -- there is some over counting
> due to numeric tokens and other "junk" in the Gutenberg text,
> but I am still surprised by the close agreement. Also I wonder 
> why the C version comes up with as high as 37,000+ unique words.

It comes down to the definition of "word".  I am lower-casing
everything.  Without lower-casing I get 29673.  I am not sure what the
issue is, but I am not willing to spend more time on it.

Of course this has little to do with Beowulf-ery.  Language choice is
largely dictated by the problem, the developers, etc.  Force-fitting
solutions is often a recipe for disaster (systems designed to fail often
do).

Joseph Landman, Ph.D.
Scalable Informatics LLC
email: landman at scalableinformatics.com
  web: http://scalableinformatics.com
phone: +1 734 612 4615

Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
