[Beowulf] C vs C++ challenge (awk version)

Thu Jan 29 17:59:13 EST 2004

On Thu, 29 Jan 2004, rgb wrote:

> On Thu, 29 Jan 2004, dc wrote:
> 
> > > I still guarantee you two things:
> > > 1) Your code will be longer
> > > 2) Your program will be slower
> > >
> > > As always, I love to be proven wrong  ;)
> > 
> > 
> > Here is another try at that, this time in Java.
> > 
> > file            size        C++         j client    j server
> ...
> > shaks12.txt     5582655     0m4.476s    0m3.321s    0m2.842s
> 
> And here is a version in C.  It is longer, no question.  It does its own
> memory management, in detail, in pages (which should be nearly optimally
> efficient).  It is moderately robust, and smart enough to recognize all
> sorts of separators (when counting words, separators matter -- hence
> this program will find more words than e.g. wc because it splits things
> differently).

But this one does not count unique words, does it?

Here is my version in awk. It is one line shorter than the C++ version
and about 1.5 times faster (1.86s versus 2.83s elapsed time) with
shaks12.txt as input.

[selva@scala distinct_words]$ wc shaks12.txt
 124456  901325 5458199 shaks12.txt
This copy of shaks12.txt has been filtered by dos2unix.

Timings:
First, my awk script (with GNU awk 3.1.0):

[selva@scala distinct_words]$ /usr/bin/time ./dwc.awk shaks12.txt
Number of distinct words = 67505
1.82user 0.04system 0:01.86elapsed 99%CPU

Now the original C++ code (compiled with g++ 2.96):

[selva@scala distinct_words]$ /usr/bin/time ./dwc < shaks12.txt
Words: 67505
2.79user 0.04system 0:02.83elapsed 100%CPU

Here is the script:

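#!/bin/awk -f
{
  for(i = 1; i <= NF; i++) {     # each whitespace-separated field on the line
    if (words[$i]) continue;     # word already seen, skip it
    words[$i] = 1 ;              # mark the word as seen
    ++nwords;                    # count it once
  }
}
END {
  printf "Number of distinct words = %i\n", nwords;
}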
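
The quoted remark that separators matter applies here too: awk's default
field splitting leaves punctuation attached, so "world" and "world." count
as different words. A rough sketch of a stricter variant follows, assuming
that any run of non-alphanumeric characters is a separator and that case
should be folded. The function name distinct_words_strict is hypothetical,
and this is not the script that was timed above, so its count will likely
differ from 67505.

```shell
# Hypothetical helper, not the timed script: counts distinct words while
# treating every run of non-alphanumeric characters as a separator
# (an assumption) and folding case.
distinct_words_strict() {
  awk -F'[^A-Za-z0-9]+' '
    {
      for (i = 1; i <= NF; i++) {
        if ($i == "") continue      # empty field next to a leading or trailing separator
        w = tolower($i)             # fold case so "The" and "the" match
        if (!(w in words)) {        # first time this word is seen
          words[w] = 1
          ++nwords
        }
      }
    }
    END { printf "%d\n", nwords }   # prints 0 on empty input
  ' "$@"
}
```

It can be run as "distinct_words_strict shaks12.txt" or fed from a pipe.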

Selva
