[Beowulf] C vs C++ challenge (C version)
Robert G. Brown
rgb at phy.duke.edu
Thu Jan 29 15:07:08 EST 2004
On Thu, 29 Jan 2004, dc wrote:
> > I still guarantee you two things:
> > 1) Your code will be longer
> > 2) Your program will be slower
> >
> > As always, I love to be proven wrong ;)
>
>
> Here is another try at that, this time in Java.
>
> file size C++ j client j server
...
> shaks12.txt 5582655 0m4.476s 0m3.321s 0m2.842s
And here is a version in C. It is longer, no question. It does its own
memory management, in detail, in pages (which should be nearly optimally
efficient). It is moderately robust, and smart enough to recognize all
sorts of separators (when counting words, separators matter -- hence
this program will find more words than e.g. wc because it splits things
differently).
It's timing (compared to a REALLY efficient wc):
rgb at ganesh|T:406>/usr/bin/time wordcount shaks12.txt
Opening shaks12.txt for counting
Total Number of Words = 902303
Command exited with non-zero status 31
0.46user 0.02system 0:00.47elapsed 101%CPU (0avgtext+0avgdata
0maxresident)k
0inputs+0outputs (93major+2334minor)pagefaults 0swaps
rgb at ganesh|T:407>/usr/bin/time wc shaks12.txt
124456 901325 5582655 shaks12.txt
0.17user 0.02system 0:00.19elapsed 98%CPU (0avgtext+0avgdata
0maxresident)k
0inputs+0outputs (130major+21minor)pagefaults 0swaps
0.47 wall clock time is far, far faster than C++ or java. It is still
2x and change wc, but it is far from optimally efficient, for many
reasons:
a) It uses fgets() instead of raw read(). It would be easy to
change this but I'm lazy and haven't bothered as even the version I've
come up with blows away the competition. At a guess, given an I/O
dominated process it would cut processing time roughly in half if I did.
b) It uses strsep() to tokenize lines of input read directly into a
buffer (but a line at a time). wc works with the character stream
directly (and splits words on fewer characters) and is much faster.
c) It saves all the words as separate strings in the buffer and
creates a list of offsets so each word can be addressed by its own
index. This meets the requirement that not only do I count the words, I
read them into a list where they can be accessed individually and
(presumably) processed further.
d) As noted, I'm lazy.
If anyone complains about accessing the words via words+word_offset[i],
I should point out that it is absolutely trivial to repack the offsets
(now that we have them) in a char **wordlist so that the words can be
addressed as wordlist[i]. If anyone complains that we "just got a char
** list of words", I should ALSO point out that with trivial
modifications the code could allocate a
struct {
char *word;
int index;
int length;
} *wordlist;
and fill IT in so that each word is in a word "object". Neither of
these steps would significantly increase the time required (more than a
few percent).
As for the code, well, 144 lines including all comments isn't horrible,
and if I cut the comments and empty lines it's probably about 80.
rgb
Robert G. Brown http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb at phy.duke.edu
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <strings.h>
#define PAGE 4096
#define K 1024
static char delim[] = {
' ', /* blank */
(char) 9, /* tab */
',', /* comma */
(char) 10, /* LF */
(char) 13, /* CR */
'.', /* . */
'!', /* ! */
'?', /* ? */
'"', /* " */
'*', /* * */
'<', /* < */
'>', /* > */
'@', /* @ */
'[',
']',
'(',
')',
'{',
'}',
'+',
'=',
'/',
'\\',
'#',
':',
(char) 0 /* terminator */
};
main(int argc, char **argv)
{
char filename[128],*word,*words,*words_head,*words_tail,*words_last;
int *word_offset;
size_t offset,word_length,words_length,wordlist_length,word_count,i;
FILE *infile;
if(argc != 2){
fprintf(stderr,"Usage: wordcount filename\n");
exit(1);
}
argv++;
sprintf(filename,"%s",*argv);
printf("Opening %s for counting\n",filename);
infile = fopen(filename,"r");
/*
* The word list as a single buffer. We allocate two pages to start with, and
* reallocate whenever the free space in the buffer drops below one page. We
* also maintain a pointer to the beginning of the unused part of this buffer.
*/
words_length = 2*PAGE;
words = (char *) malloc(words_length);
words_head = words;
words_tail = words + words_length;
/*
* We do the exact same thing here, sort of. We allocate a PAGE of offsets
* (relative to the words head!) and reallocate whenever the number remaining
* drops below K. i will both index the wordlist and count the words as we go.
*/
wordlist_length = PAGE;
word_offset = (int *) malloc(wordlist_length*sizeof(int));
word_count = 0;
/*
* Each pass through the while loop reads a single line from
* infile into words at offset wptr (current tail). We will
* write the code to manage lines up to 1K in length.
*/
word_offset[0] = 0;
while(fgets(words_head,(words_tail - words_head),infile)!= NULL){
/*
* Check to add a PAGE to the words vector/buffer. Since realloc
* may move pointer, we have to recompute offsets.
*/
if((words_tail - words_head) < PAGE) {
words_length += PAGE;
/* save offset */
offset = words_head - words;
words = (char *) realloc(words,words_length);
/* displace by saved offset */
words_head = words + offset;
words_tail = words + words_length;
}
/*
* Check to add a PAGE to the wordlist vector/buffer. Here we
* don't have to worry about offsets.
*/
if((wordlist_length - word_count) < K) {
wordlist_length += PAGE;
word_offset = (int *) realloc(word_offset,wordlist_length*sizeof(int));
}
/*
* Now we rip through the line.
*/
while(1){
/* Get a word */
word = strsep(&words_head,delim);
if(words_head == NULL) break;
/* Save its address for after we are done with this line */
words_last = words_head;
/* Save its offset in the offset vector */
if(word[0] != (char)0){
word_offset[word_count++] = word - words;
/* for debugging offsets etc */
/* printf("%d: word at %0x = %s, offset is %d\n",word_count,word,word,word_offset[word_count - 1]); */
}
}
/*
* strsep finishes parsing a line with a NULL. Restore the right pointer.
*/
words_head = words_last;
}
/*
* That's about it. We print the results.
* (or not, in a speed trial).
for(i=0;i<word_count;i++){
printf("word[%d] = %s\n",i,words + word_offset[i]);
}
*/
printf("Total Number of Words = %d\n",word_count);
}
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
More information about the Beowulf
mailing list