[Beowulf] C vs C++ challenge (awk version)

Robert G. Brown rgb at phy.duke.edu
Fri Jan 30 18:46:44 EST 2004

On Fri, 30 Jan 2004, Selva Nair wrote:

> 1. Googling shows that Shakespeare used 31,534 different
> words and a total of 884,647 words in his complete works 
> (http://www-math.cudenver.edu/~wbriggs/qr/shakespeare.html)
> And some statistical projections indicate he probably
> knew that many more words but did not use them. In any case, the awk
> result is surprisingly close to 31,534 in spite of the
> much smaller sample we used -- there is some overcounting
> due to numeric tokens and other "junk" in the Gutenberg text,
> but I am still surprised by the close agreement. Also I wonder 
> why the C version comes up with as high as 37,000+ unique words.

I didn't use toupper or tolower; I only said that one should in order to
get closer to a correct count of unique "words" (as opposed to tokens).
If I do (a trivial addition, now done) I get 31384 words, which is again
curious because a) I don't strip off the Gutenberg headers -- this is on
raw shaks12.txt, so it contains words like cdroms and etexts, which is of
course one reason the raw word count is high as well; and b) it still
miscounts hyphenations, apostrophes, and various other constructions
that I see when scanning the actual list of unique words.
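
For concreteness, the case-folding step might be sketched roughly as
below.  This is not the code from the C version discussed in this thread:
count_unique_words and cmp_str are hypothetical names, and the tokenizer
assumes that a "word" is a maximal run of alphabetic characters, so
hyphens, apostrophes and digits all act as separators (which is one way
to get exactly the kind of miscounting mentioned in b).

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* qsort comparator for an array of C strings. */
static int cmp_str(const void *a, const void *b)
{
    return strcmp(*(char *const *)a, *(char *const *)b);
}

/*
 * Count the distinct case-folded words in text.
 * Assumption: a word is a maximal run of alphabetic characters.
 * Returns the count, or -1 if an allocation fails.
 */
long count_unique_words(const char *text)
{
    size_t n = 0, cap = 1024, i;
    long unique = -1;                 /* stays -1 if an allocation fails */
    char **words = malloc(cap * sizeof *words);
    const char *p = text;

    if (words == NULL)
        return -1;

    while (*p != '\0') {
        const char *start;
        size_t len;
        char *w;

        /* Skip separators, then scan one run of letters. */
        while (*p != '\0' && !isalpha((unsigned char)*p))
            p++;
        start = p;
        while (isalpha((unsigned char)*p))
            p++;
        len = (size_t)(p - start);
        if (len == 0)
            break;                    /* only separators were left */

        /* Grow the word table when it is full. */
        if (n == cap) {
            char **tmp = realloc(words, 2 * cap * sizeof *words);
            if (tmp == NULL)
                goto done;
            words = tmp;
            cap *= 2;
        }

        /* Store a lowercased copy so "The" and "the" compare equal. */
        w = malloc(len + 1);
        if (w == NULL)
            goto done;
        for (i = 0; i < len; i++)
            w[i] = (char)tolower((unsigned char)start[i]);
        w[len] = '\0';
        words[n++] = w;
    }

    /* After sorting, duplicates are adjacent, so count the boundaries. */
    qsort(words, n, sizeof *words, cmp_str);
    unique = 0;
    for (i = 0; i < n; i++)
        if (i == 0 || strcmp(words[i], words[i - 1]) != 0)
            unique++;

done:
    for (i = 0; i < n; i++)
        free(words[i]);
    free(words);
    return unique;
}
```

Sorting and then counting boundaries is chosen here only because it is
short; a hash table or tree would avoid storing every token.  Under the
stated tokenizing assumption, "don't" counts as two words ("don" and
"t"), so any real comparison against a published figure would need a
more careful rule.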

The net effect of these errors appears to be negative: 31384 is 150 short
of the 31,534 quoted above.  OR, as always, I could have a bug.  Or the
count Google turned up could be wrong -- that depends on how it was made
and on the sources, after all.  If you (or anyone) care, I can put the
wordlist it generates on my website so you can grab it.


Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu
