[Beowulf] C vs C++ challenge (awk version)

Fri Jan 30 11:37:52 EST 2004

On Thu, 29 Jan 2004, Selva Nair wrote:

> But this one does not count unique words, does it?
> 
> Here is my version in awk. It beats C++ by 1 line in length and
> 1.5 times in speed (1.86s versus 2.83s elapsed time) with shaks12.txt as 
> input.
> 

A modified (hashing/unique word) version of the C yields:

rgb at lilith|T:245>/usr/bin/time wordcount shaks12.txt
Opening shaks12.txt for counting
Total Number of Words = 902303; total number of UNIQUE words = 37499
Command exited with non-zero status 69
1.15user 0.12system 0:01.29elapsed 97%CPU (0avgtext+0avgdata
0maxresident)k
0inputs+0outputs (100major+4093minor)pagefaults 0swaps

This is using gnu's hcreate, hsearch hash functions (ordinary POSIX
library functions, as it were) and simply adds a secondary loop through
all the words to do the count.  This is still not terribly close to
efficient -- the reason it has to be done in a secondary loop is because
realloc moves the base pointer to the primary data buffer where all the
words are stored, which makes hcreate/hsearch unhappy in midstream.  The
best way around this are to split the code into a segment to read the
whole file into memory efficiently (probably using read()/realloc in 32K
chunks) and a single loop to strsep() parse into words and to hsearch()
them into the hash.  I don't have a good feel for how much time this
would save, but at a guess one could get the overall time down to maybe
a second from 1.29 seconds wallclock, mostly on the I/O side.  Oh, and
one wouldn't need to maintain a secondary indexing vector at all since
the hash would do it all, although I'm sure it isn't horribly expensive.

Of course this STILL doesn't REALLY count unique "words", only unique
"tokens".  To do words we'd have to toupper() all the words before
saving/hashing them, as The and the aren't really unique.  In
Shakespeare, things are even worse as he uses his own characteristic
blend of "interesting" contractions, what say'st though?  Idiomatic, as
it were.  So lots of tokens included a ' and are contractions, others
are caps, still others are real mistakes -- hyphenated words split
across a line but recorded as "two unique words".  So the real count is
more likely a few thousand smaller.  Still, Shakespeare seems pretty
literate -- ~38K unique tokens when linux_words only has order of 45K
total.

I don't know that I'll bother to redo this one more time to speed it any
further as described above -- it IS a toy project.  The main point is
that adding hashing required about 13 lines of code, and in a rewrite
that eliminated the indexing vector I'd get those back and change.  So
using POSIX/Gnu functions (sloppily) it still isn't a horribly long
program, and is still fairly efficient (running in less than half the
time of C++).

One final point:

> [selva at scala distinct_words]$ /usr/bin/time ./dwc < shaks12.txt
> Words: 67505
> 2.79user 0.04system 0:02.83elapsed 100%CPU
> 
> Here is the script:
> 
> #!/bin/awk -f
> {
>   for(i = 1; i <= NF; i++) {
>     if (words[$i]) continue;
>     words[$i] = 1 ;
>     ++nwords;
>   }
> }
> END {
>   printf "Number of distinct words = %i\n", nwords;
> }

Note that dwc finds 67505 "distinct words".  This is very interesting,
as wordcount only finds 38000-odd distinct words, where "distinct" has
to do with just how words are parsed in the first place and where I can
watch, via inserted printf commands, to be dead certain that the process
of word selection and uniquifying is proceeding correctly.  In fact in C
this is ALL strictly under my control -- I can alter the list of word
delimiters as I wish, and I can even (with a bit of effort but not a
huge penalty in time or lines of code) "fix" the hyphenation problem:

 a) stop parsing on LF/CR (currently they are delimiters
 b) IF a word ends on LF and/or CR AND the preceeding character is a
hyphen
 c) THEN shift the following word left N spaces (to be computed) in the
buffer to reconnect the word halves.
 d) AND save the word as the reconnected halves (via offsets or hash or
both.

Fix >>this<< in C++ using the code already presented -- hmmm, not so
easy.  Fix it in awk (or even figure out what is wrong with the script
above that it is returning almost twice as many words as I count, where
my script and e.g. wc return precisely the same count on
/usr/share/dict/linux_words and where I can as noted watch it do what it
does and hence have a fair bit of confidence in it).  Fix it so that (in
both cases) one still has the entire text in memory, indexed by and
retrievable by words one at a time AND has access to the hash of unique
words only without significantly increasing the time or storage
requirement.

I bring these up simply to reiterate that real world code is rarely
simple.  This is hardly a "complex project" at first glance, but if one
examines what one actually GETS as unique words when one just reads in a
block of arbitrary (but real) text, parses it badly (or at least in a
way that is out of your control) one finds that it is pretty unlikely to
return what you really want if you use a one-size-fits-all library
function to do all the work.  To reassert control and fix special cases
without microscopic access to the raw data stream requires more and more
contortion, and can easily introduce multiple passes across the data to
do first one thing, then another, then another, when they could all be
done in a single loop if one did it "right".

It is precisely this that I was referring to when I said that using a
library, ESPECIALLY a powerful high level library, shapes your code,
sometimes in highly inefficient ways.  This isn't surprising -- a
variety of philosophers e.g. Dorothy Lee and Bronislaw Malinoski have
long argued that the very way we think is governed by the language we
think in, and this is obviously even more true when it comes to computer
"languages", where commands encapsulate entire algorithms with very
specific and complex results.  Similar tradeoffs are visible in
pictographic languages (where every "thought" has its own unique token
that must be learned in order to read it or write it) and phonetic
languages (where every thought has a nearly unique VERBAL token, but
where a relatively small set of building blocks can represent all the
phonemic encodings so that one can learn to read and write years
earlier, and where the language is far more flexible and capable of
growth and evolution at a far greater rate).

So my final conclusion (this thread has probably gone on long enough and
at least SOME people are likely bored with it by now:-) is my original
one -- raw C (or C++ written as procedural C with minimal and sensible
use of the extensions, a la Mark Hahn:-) will nearly always result in
the fastest execution times, and gives one COMPLETE control of your data
and execution loops.  Using objects and object-oriented methodology is
still entirely possible and I'd even say best practice in BOTH language
variants.  

C++ (including the STL library) has a wider set of "standard" tools,
although those tools come with a very distinct ease of programming vs
execution time tradeoff, and with another more subtle tradeoff in the
way that they shape your code (which may be in ways that are just fine,
or MAY be in ways that are moderately detrimental when the library calls
aren't terribly good fits to what you want to do but your reliance on
them has slowed the development of skills to do what you want without
them).  Again, this isn't an intrinsic criticism of C++ -- it is just an
empirical observation and I'd make an entirely similar observation about
many other high level tools including markup languages vs word
processors and more.  If you use a nail gun all the time doing rough
carpentry, don't be surprised if you hit your thumb and leave divots in
the work when it comes time to hammer finish nails into the trim or cut
nails into a cinder block wall with an actual hammer. (Don't you just
love metaphors?:-)

Other languages have similar tradeoffs -- few will beat C in executable
speed, although it is possible (even humble fortran might beat C in
certain tasks, partly because the fixed data language structure permits
some things to be done by the compiler very efficiently where C uses
library calls on more arbitrary pointer based data structures less
efficiently). Many have things C does not that make PROGRAMMING a lot
easier and may even enable certain things (such as writing a
self-modifying program) POSSIBLE that are horribly difficult in raw C.

SO I like C, but I also like PHP, Perl, might one day like C++ and
python.  I even like awk, although aside from very short awk programs to
do very specialized things perl is infinitely easier.  I therefore
remain highly ecumenical -- I might tease people for using other
languages and do think that one should match language to task with some
care, but I can hardly justify being overly zealous in advancing any
single language as being THE best as it is always a matter of which
tradeoffs one prefers or are led to by one's problem.

   rgb

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <strings.h>
#define _GNU_SOURCE
#include <search.h>

#define PAGE 4096
#define K 1024

static char delim[] = {
 ' ',       /* blank */
 (char) 9,  /* tab */
 ',',       /* comma */
 (char) 10, /* LF */
 (char) 13, /* CR */
 '.',       /* . */
 '!',       /* ! */
 '?',       /* ? */
 ';',       /* ; */
 '"',       /* " */
 '*',       /* * */
 '<',       /* < */
 '>',       /* > */
 '@',       /* @ */
 '[',
 ']',
 '(',
 ')',
 '{',
 '}',
 '+',
 '=',
 '/',
 '\\',
 '#',
 ':',
 (char) 0  /* terminator */
 };

 static char dummy[] = "aapvoinwe"; 

main(int argc, char **argv)
{

 char filename[128],*word,*words,*words_head,*words_tail,*words_last;
 int *word_offset;
 size_t offset,word_length,words_length,wordlist_length,
        word_count,unique_count,i,j,nel;
 FILE *infile;
 ENTRY item,*search_result;

 if(argc != 2){
   fprintf(stderr,"Usage: wordcount filename\n");
   exit(1);
 }
 argv++;
 sprintf(filename,"%s",*argv);
 printf("Opening %s for counting\n",filename);
 infile = fopen(filename,"r");

 /*
  * The word list as a single buffer.  We allocate two pages to start with, and
  * reallocate whenever the free space in the buffer drops below one page.  We
  * also maintain a pointer to the beginning of the unused part of this buffer.
  */
 words_length = 2*PAGE;
 words = (char *) malloc(words_length);
 words_head = words;
 words_tail = words + words_length;

 /*
  * We do the exact same thing here, sort of.  We allocate a PAGE of offsets
  * (relative to the words head!)  and reallocate whenever the number remaining
  * drops below K.  i will both index the wordlist and count the words as we go.
  */
 wordlist_length = PAGE;
 word_offset = (int *) malloc(wordlist_length*sizeof(int));
 word_count = 0;

 /*
  * Each pass through the while loop reads a single line from
  * infile into words at offset wptr (current tail).  We will
  * write the code to manage lines up to 1K in length.
  */
 word_offset[0] = 0;
 /*
  * Set up a hash table to extract/count only UNIQUE words.  300000
  * words could come close to doing the OED, at least OTC versions.
  * Not much speed penalty for making nel bigger, BIG penalty for
  * making it too small...
  */
 nel = 300000;
 hcreate(nel);
 unique_count = 0;
 while(fgets(words_head,(words_tail - words_head),infile)!= NULL){

   /*
    * Check to add a PAGE to the words vector/buffer.  Since realloc
    * may move pointer, we have to recompute offsets.
    */
   if((words_tail - words_head) < PAGE) {
     words_length += PAGE;

     /* save offset */
     offset = words_head - words;
     words = (char *) realloc(words,words_length);

     /* displace by saved offset */
     words_head = words + offset;
     words_tail = words + words_length;
     /* printf("reallocating words at %0x, new length = %d\n",words,words_length); */
   }

   /*
    * Check to add a PAGE to the wordlist vector/buffer.  Here we
    * don't have to worry about offsets.
    */
   if((wordlist_length - word_count) < K) {
     wordlist_length += PAGE;
     word_offset = (int *) realloc(word_offset,wordlist_length*sizeof(int));
     /* printf("reallocating words_offset at %0x, new length = %d\n",word_offset,wordlist_length); */
   }

   /*
    * Now we rip through the line.
    */
   while(1){
     /* Get a word */
     word = strsep(&words_head,delim);
     if(words_head == NULL) break;

     /* Save its address for after we are done with this line */
     words_last = words_head;

     /* Save its offset in the offset vector */
     if(word[0] != (char)0){
       word_offset[word_count] = word - words;

       /* for debugging offsets etc */
       /* printf("%d: word at %0x = %s, offset is %d\n",word_count,word,word,word_offset[word_count]); */

       word_count++;

     }
   }
   /*
    * strsep finishes parsing a line with a NULL.  Restore the right pointer.
    */
   words_head = words_last;

 }

 /*
  * Let's see, we can try sorting and uniq'ing here.  As expected, it is
  * gawdawful slow and scales terribly with size and even though we COULD
  * move the first loop up into the code above easily enough, this is a
  * poor way to do it.  We'll leave the fragment here, disabled.
 for(i=0;i<word_count;i++){
   if((words + word_offset[i])[0] != 1){
     unique_count++;
     for(j=i+1;j<word_count;j++){
       if(strcmp(words + word_offset[i],words + word_offset[j]) == 0){
         (words + word_offset[j])[0] = 1;
       }
     }
   }
 }
  */

 /*
  * This COULD be done in the loop above, and would be much faster if we
  * did.  However, it really uses the pointers in item, and since realloc
  * shifts the base several times in a long run, it segment faults when
  * this happens if one does this in the primary loop.
  *
  * A better overall approach would likely be to a) read the entire file
  * into a data block all at once using read and malloc'ing space in
  * much larger chunks (probably 32K, 8 pages at a time); b) do the
  * strsep() and hash insertion all at once -- in fact, we wouldn't
  * even need to mess with an indexing vector in this approach as the
  * hash table would manage all "indexing".
  */
 for(i=0;i<word_count;i++){
   item.key = words+word_offset[i];
   search_result = hsearch(item,FIND);
   if(search_result == 0){
     unique_count++;
     item.key = words+word_offset[i];
     item.data = words+word_offset[i];
     hsearch(item,ENTER);
     /* printf("unique word %d: %s\n",unique_count,item.data); */
   }
 }

 /*
  * That's about it.  We print the results.
  * (or not, in a speed trial).
 for(i=0;i<word_count;i++){
   printf("word[%d] = %s\n",i,words + word_offset[i]);
 }
  */
 printf("Total Number of Words = %d; total number of UNIQUE words = %d\n",word_count,unique_count);

}

_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf