
Examples 3

Write a bash script that collects web pages from the internet, starting from a root page, extracts the words in each page, and creates a list of words.

vim crawler.sh
crawler.sh
#!/bin/bash

# stop the program after running for this many seconds
RUNTIME=10

check() {
    # check dependencies
    hash lynx || { echo "Please install lynx" >&2; exit 1; }
    hash w3m || { echo "Please install w3m" >&2; exit 1; }

    # check that there is an argument
    [[ -z $1 ]] && { echo "Usage: $0 <url>" >&2; exit 1; }
}

start_timer() {
    {
        sleep $RUNTIME
        echo "...Timeout..."
        local nr_words=$(wc -l < tmp/words.txt)
        echo "The number of collected words: $nr_words"
        kill -9 $$ >/dev/null 2>&1
    } &
}

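# list all the links of the given page with lynx, keep only the http(s) ones,
# and strip any #fragment from the end of each url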
get_urls() {
    lynx "$1" -listonly -nonumbers -dump 2>/dev/null \
        | grep -E '^https?://' \
        | sed -e 's/#.*$//'
}

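# dump the rendered text of the page with w3m and reduce it to a sorted,
# deduplicated list of lowercase words, one per line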
get_words() {
    w3m -dump "$1" \
        | sed -e "s/[^[:alnum:]' -]\+//g" \
        | tr -cs A-Za-z\' '\n' \
        | tr A-Z a-z \
        | sort -u
}

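# alternative word extractor that uses wget and strips the html tags and
# entities with sed; it is not called by main() and is kept for reference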
get_words_1() {
    wget -q -O- "$1" \
        | tr "\n" ' ' \
        | sed -e 's/<[^>]*>/ /g' \
        | sed -e 's/&[^;]*;/ /g' \
        | tr -cs A-Za-z\' '\n' \
        | tr A-Z a-z \
        | sort -u
}

main() {
    check "$@"
    start_timer

    mkdir -p tmp/
    rm -f tmp/{todo,done}.txt
    touch tmp/{todo,done,words}.txt

    echo "$1" > tmp/todo.txt
    local url
    while true; do
        # pop the top url from todo.txt
        url=$(head -1 tmp/todo.txt)
        sed -i tmp/todo.txt -e 1d

        # skip entries that are not valid http(s) urls
        [[ $url =~ ^https?:// ]] || continue

        # check whether we have already processed this url
        grep -qF "$url" tmp/done.txt && continue
        echo "$url"

        # extract the links from it and append them to todo.txt
        get_urls "$url" >> tmp/todo.txt

        # extract all the words from this url
        get_words "$url" >> tmp/words.txt
        sort -u tmp/words.txt > tmp/words1.txt
        mv tmp/words1.txt tmp/words.txt

        # mark this url as processed
        echo "$url" >> tmp/done.txt
    done
}

# call the main function
main "$@"

The file tmp/todo.txt contains a list of URLs, one per line, that are to be visited. Initially we add to it the root URL that is given as an argument.

The file tmp/done.txt contains a list of URLs, one per line, that are already visited. The file tmp/words.txt contains a list of words that have been collected so far, one per line and sorted alphabetically.
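
While the crawler is running, you can watch these three files grow from another terminal, for example with watch and wc:

# refresh the line counts of the work files every second
watch -n 1 wc -l tmp/todo.txt tmp/done.txt tmp/words.txt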

There is an infinite loop in which we do these steps:

  1. Get the top URL from tmp/todo.txt (and delete it from the file).
  2. If it is not a valid link (it does not start with http:// or https://), continue with the next URL.
  3. If it is already listed in tmp/done.txt, skip it and continue with the next URL.
  4. Otherwise, extract all the URLs on this page and append them to tmp/todo.txt, in order to visit them later.
  5. Get all the words from this page and merge them into tmp/words.txt, keeping the list sorted and unique (see the sketch after this list).
  6. Append this URL to tmp/done.txt, so that we don't process it again.
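
The script performs the merge by appending the new words and then rewriting the file through a temporary copy. As a side note, GNU sort can write its output to one of its input files, so the same merge could be done in place, without the temporary file; a minimal sketch:

# merge the words of the current page into the sorted, unique word list in place
get_words "$url" | sort -u -o tmp/words.txt - tmp/words.txt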

The loop itself never terminates; instead, the timer started at the beginning of main() kills the program after it has been running for RUNTIME seconds.
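
A similar time limit could also be imposed from outside the script with the coreutils timeout command (the URL below is just a placeholder):

# stop the crawler after 5 seconds, regardless of the RUNTIME set inside the script
timeout 5 ./crawler.sh https://example.org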

./crawler.sh
sudo apt install -y w3m
./crawler.sh
url=https://en.wikipedia.org/wiki/Linux
./crawler.sh $url
less tmp/words.txt
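
Since tmp/words.txt contains one lowercase word per line, an exact-match grep is enough to check whether some particular word was collected (the word 'kernel' is just an example):

grep -x kernel tmp/words.txt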

Let's increase the RUNTIME:

sed -i crawler.sh -e '/^RUNTIME=/ c RUNTIME=100'
rm -rf tmp/
./crawler.sh $url
cat tmp/words.txt | wc -l
less tmp/words.txt
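
Editing the script with sed works, but another option would be to let an environment variable override the default, for example by changing the RUNTIME assignment in crawler.sh to the sketch below:

# use the value of RUNTIME from the environment, or fall back to 10 seconds
RUNTIME=${RUNTIME:-10}

With that change, the limit could be set per run, e.g. RUNTIME=100 ./crawler.sh $url, without modifying the file.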