
Examples 3

Write a bash script that collects web pages from the internet, starting from a root page, extracts the words in each page, and creates a list of words.

vim crawler.sh
crawler.sh
#!/bin/bash

# stop the program after running for this many seconds
RUNTIME=10

check() {
    # check dependencies
    hash lynx || { echo "Please install lynx" >&2; exit 1; }
    hash w3m || { echo "Please install w3m" >&2; exit 1; }

    # check that there is an argument
    [[ -z $1 ]] && { echo "Usage: $0 <url>" >&2; exit 1; }
}

start_timer() {
    {
        sleep $RUNTIME
        echo "...Timeout..."
        local nr_words=$(wc -l < tmp/words.txt)
        echo "The number of collected words: $nr_words"
        kill -9 $$ >/dev/null 2>&1
    } &
}

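# list all the links of the given page with lynx, keep only the http(s) ones,
# and strip any #fragment from the end of each url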
get_urls() {
    lynx "$1" -listonly -nonumbers -dump 2>/dev/null \
        | grep -E '^https?://' \
        | sed -e 's/#.*$//'
}

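# dump the rendered text of the page with w3m and reduce it to a sorted,
# deduplicated list of lowercase words, one per line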
get_words() {
    w3m -dump "$1" \
        | sed -e "s/[^[:alnum:]' -]\+//g" \
        | tr -cs A-Za-z\' '\n' \
        | tr A-Z a-z \
        | sort -u
}

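# alternative word extractor that uses wget and strips the html tags and
# entities with sed; it is not called by main() and is kept for reference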
get_words_1() {
    wget -q -O- "$1" \
        | tr "\n" ' ' \
        | sed -e 's/<[^>]*>/ /g' \
        | sed -e 's/&[^;]*;/ /g' \
        | tr -cs A-Za-z\' '\n' \
        | tr A-Z a-z \
        | sort -u
}

main() {
    check "$@"
    start_timer

    mkdir -p tmp/
    rm -f tmp/{todo,done}.txt
    touch tmp/{todo,done,words}.txt

    echo "$1" > tmp/todo.txt
    local url
    while true; do
        # pop the top url from todo.txt
        url=$(head -1 tmp/todo.txt)
        sed -i tmp/todo.txt -e 1d

        # skip entries that are not valid http(s) urls
        [[ $url =~ ^https?:// ]] || continue

        # check whether we have already processed this url
        grep -qF "$url" tmp/done.txt && continue
        echo "$url"

        # extract the links from it and append them to todo.txt
        get_urls "$url" >> tmp/todo.txt

        # extract all the words from this url
        get_words "$url" >> tmp/words.txt
        sort -u tmp/words.txt > tmp/words1.txt
        mv tmp/words1.txt tmp/words.txt

        # mark this url as processed
        echo "$url" >> tmp/done.txt
    done
}

# call the main function
main "$@"

The file tmp/todo.txt contains a list of URLs, one per line, that are to be visited. Initially we add to it the root URL that is given as an argument.

The file tmp/done.txt contains a list of URLs, one per line, that are already visited. The file tmp/words.txt contains a list of words that have been collected so far, one per line and sorted alphabetically.
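
While the crawler is running, you can watch these three files grow from another terminal, for example with watch and wc:

# refresh the line counts of the work files every second
watch -n 1 wc -l tmp/todo.txt tmp/done.txt tmp/words.txt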

There is an infinite loop in which we do these steps:

  1. Get the top URL from tmp/todo.txt (and delete it from the file).
  2. If it is not a valid link (it does not start with http:// or https://), continue with the next URL.
  3. If it is already listed in tmp/done.txt, skip it and continue with the next URL.
  4. Otherwise, extract all the URLs on this page and append them to tmp/todo.txt, in order to visit them later.
  5. Get all the words from this page and merge them into tmp/words.txt, keeping the list sorted and unique (see the sketch after this list).
  6. Append this URL to tmp/done.txt, so that we don't process it again.
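
The script performs the merge by appending the new words and then rewriting the file through a temporary copy. As a side note, GNU sort can write its output to one of its input files, so the same merge could be done in place, without the temporary file; a minimal sketch:

# merge the words of the current page into the sorted, unique word list in place
get_words "$url" | sort -u -o tmp/words.txt - tmp/words.txt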

The loop itself never terminates; instead, the timer started at the beginning of main() kills the program after it has been running for RUNTIME seconds.
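
A similar time limit could also be imposed from outside the script with the coreutils timeout command (the URL below is just a placeholder):

# stop the crawler after 5 seconds, regardless of the RUNTIME set inside the script
timeout 5 ./crawler.sh https://example.org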

./crawler.sh
sudo apt install -y w3m
./crawler.sh
url=https://en.wikipedia.org/wiki/Linux
./crawler.sh $url
less tmp/words.txt
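
Since tmp/words.txt contains one lowercase word per line, an exact-match grep is enough to check whether some particular word was collected (the word 'kernel' is just an example):

grep -x kernel tmp/words.txt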

Let's increase the RUNTIME:

sed -i crawler.sh -e '/^RUNTIME=/ c RUNTIME=100'
rm -rf tmp/
./crawler.sh $url
cat tmp/words.txt | wc -l
less tmp/words.txt
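
Editing the script with sed works, but another option would be to let an environment variable override the default, for example by changing the RUNTIME assignment in crawler.sh to the sketch below:

# use the value of RUNTIME from the environment, or fall back to 10 seconds
RUNTIME=${RUNTIME:-10}

With that change, the limit could be set per run, e.g. RUNTIME=100 ./crawler.sh $url, without modifying the file.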