[Beowulf] C vs C++ challenge (awk version)
Robert G. Brown
rgb at phy.duke.edu
Fri Jan 30 18:46:44 EST 2004
On Fri, 30 Jan 2004, Selva Nair wrote:
> 1. Googling shows that Shakespeare used 31,534 different
> words and a total of 884,647 words in his complete works
> (http://www-math.cudenver.edu/~wbriggs/qr/shakespeare.html)
> And some statistical projections indicate he probably
> knew about as many more words that he never used. In any case, the awk
> result is surprisingly close to 31,534 in spite of the
> much smaller sample we used -- there is some overcounting
> due to numeric tokens and other "junk" in the Gutenberg text,
> but I am still surprised by the close agreement. Also I wonder
> why the C version comes up with as high as 37,000+ unique words.
I didn't use toupper or tolower; I simply said that one should, to get
closer to a correct count of unique "words" (as opposed to tokens). If I
do (a trivial addition, now done) then I get 31384 words, which is again
curious because a) I don't strip off the Gutenberg headers -- this is on
raw shaks12.txt, so it contains words like cdroms and etexts, which is of
course one reason the raw word count is high as well; and b) yeah, it
still miscounts hyphenations and apostrophes and various other
constructions that I see when scanning the actual list of unique words.
It appears that the net effect of these errors is negative: 31384 falls
a bit short of the quoted 31,534. OR, as always, I could have a bug. Or
the Google-produced count could be wrong -- depends on how it was made
and the sources, after all. If you (or anyone) care I can put the
wordlist it generates on my website or something so you can grab it.
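To make the case-folding point concrete, here is a minimal sketch in Python (not the C or awk code from this thread, which isn't reproduced here). The tokenizer is an assumption: it treats any maximal run of ASCII letters as a word, so it splits hyphenations and contractions in roughly the way described above. The function name unique_words and the sample sentence are made up for illustration.

```python
import re

def unique_words(text, fold_case=True):
    """Return the set of distinct tokens found in text.

    Assumption: a token is a maximal run of ASCII letters, so
    "new-made" becomes "new" and "made", "King's" becomes "King"
    and "s", and purely numeric tokens are dropped.
    """
    tokens = re.findall(r"[A-Za-z]+", text)
    if fold_case:
        # Equivalent in spirit to applying tolower() to every
        # character before the word is looked up or stored.
        tokens = [t.lower() for t in tokens]
    return set(tokens)

# Hypothetical sample, not taken from shaks12.txt.
sample = "The king and the King's new-made CDROM; THE end."

print(len(unique_words(sample, fold_case=False)))  # 11
print(len(unique_words(sample, fold_case=True)))   # 8
```

Without folding, "The", "the" and "THE" count as three words; with folding they collapse to one, which is the direction of the drop from 37,000+ to 31384. Splitting on hyphens and apostrophes can push the count either way (it removes "new-made" as a word but adds fragments like "s"), so a sketch like this doesn't settle the sign of the remaining error.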
rgb
--
Robert G. Brown http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb at phy.duke.edu