Filtering the buzz

What's New Too! is an automatic announcement site that, unlike Yahoo, does not verify or filter its links. In exchange it is faster, promising a turnaround time of only 36 hours between entry and publication. Every day, the service produces about 500 K worth of announcements.

While trying to scan them, I kept running into the same buzzwords and buzzphrases, and kept promising myself to write a tool that would hit back: given a text file downloaded from What's New Too!, it would filter out those entries. I had expected to get by with a simple parametrization of agrep; in the end, Byron Rakitzis' version of Tom Duff's rc design turned out to be the best tool for the job.

#!/home/pub/bin/rc

nl='
'
beep='^G'

exclude_file=/home/kbs/jutta/etc/exclude
source_file=/home/kbs/jutta/tmp/newtoo/in

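# Split the source into entries: sed marks each empty line with control-G,
# and the backquote substitution then splits on that character.
source=``($beep){sed 's/^$/'$beep/ $source_file}
# One exclusion pattern per line of the exclude file.
exclude=``($nl){cat $exclude_file}

# Print every entry that contains none of the excluded phrases.
for (s in $source) if (! ~ $s * ^ $exclude ^ *) echo $s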
(The ^G above should be control-G, not circumflex G.)
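
For readers without rc at hand, the same idea can be sketched in Python. This is only a rough restatement of the script above, not a replacement for it: it assumes entries are separated by empty lines and that each line of the exclude file is a literal, case-sensitive phrase. The function names and the sample patterns ("!!!" and "award") are made up for illustration.

```python
import re


def split_entries(text):
    """Split announcement text into entries separated by empty lines."""
    chunks = re.split(r"\n{2,}", text)
    return [c.strip("\n") for c in chunks if c.strip("\n")]


def load_patterns(text):
    """Return one literal pattern per non-empty line of the exclude list."""
    return [line for line in text.split("\n") if line]


def filter_entries(entries, patterns):
    """Keep only the entries that contain none of the patterns."""
    return [e for e in entries if not any(p in e for p in patterns)]


if __name__ == "__main__":
    # Hypothetical sample data standing in for the downloaded file
    # and the exclude file.
    sample = (
        "New site!!! Visit now\nhttp://example.com/\n"
        "\n"
        "A quiet page about rc\n"
        "\n"
        "Best of the web award\n"
    )
    patterns = load_patterns("!!!\naward\n")
    for entry in filter_entries(split_entries(sample), patterns):
        print(entry)
```

As in the rc version, an entry is dropped as soon as any one phrase occurs anywhere inside it, including in the middle of a word.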

With my list of patterns, the filter took about 10 seconds to throw out one third of all entries from the 300 K file. And I'll never have to read a triple exclamation mark again.