[Beowulf] C vs C++ challenge (C version)

Thu Jan 29 15:07:08 EST 2004

On Thu, 29 Jan 2004, dc wrote:

> > I still guarantee you two things:
> > 1) Your code will be longer
> > 2) Your program will be slower
> >
> > As always, I love to be proven wrong  ;)
> 
> 
> Here is another try at that, this time in Java.
> 
> file            size        C++         j client    j server
...
> shaks12.txt     5582655     0m4.476s    0m3.321s    0m2.842s

And here is a version in C.  It is longer, no question.  It does its own
memory management, in detail, in pages (which should be nearly optimally
efficient).  It is moderately robust, and smart enough to recognize all
sorts of separators (when counting words, separators matter -- hence
this program will find more words than e.g. wc because it splits things
differently).

It's timing (compared to a REALLY efficient wc):

rgb at ganesh|T:406>/usr/bin/time wordcount shaks12.txt 
Opening shaks12.txt for counting
Total Number of Words = 902303
Command exited with non-zero status 31
0.46user 0.02system 0:00.47elapsed 101%CPU (0avgtext+0avgdata
0maxresident)k
0inputs+0outputs (93major+2334minor)pagefaults 0swaps
rgb at ganesh|T:407>/usr/bin/time wc shaks12.txt
 124456  901325 5582655 shaks12.txt
0.17user 0.02system 0:00.19elapsed 98%CPU (0avgtext+0avgdata
0maxresident)k
0inputs+0outputs (130major+21minor)pagefaults 0swaps

0.47 wall clock time is far, far faster than C++ or java.  It is still
2x and change wc, but it is far from optimally efficient, for many
reasons:

  a) It uses fgets() instead of raw read().  It would be easy to
change this but I'm lazy and haven't bothered as even the version I've
come up with blows away the competition.  At a guess, given an I/O
dominated process it would cut processing time roughly in half if I did.
  b) It uses strsep() to tokenize lines of input read directly into a
buffer (but a line at a time).  wc works with the character stream
directly (and splits words on fewer characters) and is much faster.
  c) It saves all the words as separate strings in the buffer and
creates a list of offsets so each word can be addressed by its own
index.  This meets the requirement that not only do I count the words, I
read them into a list where they can be accessed individually and
(presumably) processed further.
  d) As noted, I'm lazy.

If anyone complains about accessing the words via words+word_offset[i],
I should point out that it is absolutely trivial to repack the offsets
(now that we have them) in a char **wordlist so that the words can be
addressed as wordlist[i].  If anyone complains that we "just got a char
** list of words", I should ALSO point out that with trivial
modifications the code could allocate a 

struct {
 char *word;
 int index;
 int length;
} *wordlist;

and fill IT in so that each word is in a word "object".  Neither of
these steps would significantly increase the time required (more than a
few percent).

As for the code, well, 144 lines including all comments isn't horrible,
and if I cut the comments and empty lines it's probably about 80.

  rgb

Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <strings.h>

#define PAGE 4096
#define K 1024

static char delim[] = {
 ' ',       /* blank */
 (char) 9,  /* tab */
 ',',       /* comma */
 (char) 10, /* LF */
 (char) 13, /* CR */
 '.',       /* . */
 '!',       /* ! */
 '?',       /* ? */
 '"',       /* " */
 '*',       /* * */
 '<',       /* < */
 '>',       /* > */
 '@',       /* @ */
 '[',
 ']',
 '(',
 ')',
 '{',
 '}',
 '+',
 '=',
 '/',
 '\\',
 '#',
 ':',
 (char) 0  /* terminator */
 };

main(int argc, char **argv)
{

 char filename[128],*word,*words,*words_head,*words_tail,*words_last;
 int *word_offset;
 size_t offset,word_length,words_length,wordlist_length,word_count,i;
 FILE *infile;

 if(argc != 2){
   fprintf(stderr,"Usage: wordcount filename\n");
   exit(1);
 }
 argv++;
 sprintf(filename,"%s",*argv);
 printf("Opening %s for counting\n",filename);
 infile = fopen(filename,"r");

 /*
  * The word list as a single buffer.  We allocate two pages to start with, and
  * reallocate whenever the free space in the buffer drops below one page.  We
  * also maintain a pointer to the beginning of the unused part of this buffer.
  */
 words_length = 2*PAGE;
 words = (char *) malloc(words_length);
 words_head = words;
 words_tail = words + words_length;

 /*
  * We do the exact same thing here, sort of.  We allocate a PAGE of offsets
  * (relative to the words head!)  and reallocate whenever the number remaining
  * drops below K.  i will both index the wordlist and count the words as we go.
  */
 wordlist_length = PAGE;
 word_offset = (int *) malloc(wordlist_length*sizeof(int));
 word_count = 0;

 /*
  * Each pass through the while loop reads a single line from
  * infile into words at offset wptr (current tail).  We will
  * write the code to manage lines up to 1K in length.
  */
 word_offset[0] = 0;
 while(fgets(words_head,(words_tail - words_head),infile)!= NULL){

   /*
    * Check to add a PAGE to the words vector/buffer.  Since realloc
    * may move pointer, we have to recompute offsets.
    */
   if((words_tail - words_head) < PAGE) {
     words_length += PAGE;

     /* save offset */
     offset = words_head - words;
     words = (char *) realloc(words,words_length);

     /* displace by saved offset */
     words_head = words + offset;
     words_tail = words + words_length;
   }

   /*
    * Check to add a PAGE to the wordlist vector/buffer.  Here we
    * don't have to worry about offsets.
    */
   if((wordlist_length - word_count) < K) {
     wordlist_length += PAGE;
     word_offset = (int *) realloc(word_offset,wordlist_length*sizeof(int));
   }

   /*
    * Now we rip through the line.
    */
   while(1){
     /* Get a word */
     word = strsep(&words_head,delim);
     if(words_head == NULL) break;

     /* Save its address for after we are done with this line */
     words_last = words_head;

     /* Save its offset in the offset vector */
     if(word[0] != (char)0){
       word_offset[word_count++] = word - words;

       /* for debugging offsets etc */
       /* printf("%d: word at %0x = %s, offset is %d\n",word_count,word,word,word_offset[word_count - 1]); */
     }
   }
   /*
    * strsep finishes parsing a line with a NULL.  Restore the right pointer.
    */
   words_head = words_last;

 }

 /*
  * That's about it.  We print the results.
  * (or not, in a speed trial).
 for(i=0;i<word_count;i++){
   printf("word[%d] = %s\n",i,words + word_offset[i]);
 }
  */
 printf("Total Number of Words = %d\n",word_count);

}

_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf