Examples 3
Write a bash script that collects web pages from the internet, starting from a root page, extracts the words in each page, and creates a list of words.
vim crawler.sh
crawler.sh
#!/bin/bash
# stop the program after running for this many seconds
RUNTIME=10
check() {
    # check dependencies
    hash lynx || { echo "Please install lynx" >&2; exit 1; }
    hash w3m || { echo "Please install w3m" >&2; exit 1; }
    # check that there is an argument
    [[ -z $1 ]] && { echo "Usage: $0 <url>" >&2; exit 1; }
}
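# start a background timer that reports the number of collected words
# and kills the script after RUNTIME seconds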
start_timer() {
    {
        sleep $RUNTIME
        echo "...Timeout..."
        local nr_words=$(wc -l < tmp/words.txt)
        echo "The number of collected words: $nr_words"
        kill -9 $$ >/dev/null 2>&1
    } &
}
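# print the http/https links found on the given page, one per line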
get_urls() {
    lynx "$1" -listonly -nonumbers -dump 2>/dev/null \
        | grep -E '^https?://' \
        | sed -e 's/#.*$//'
}
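# print the words of the given page, lowercased, sorted and deduplicated, one per line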
get_words() {
    w3m -dump "$1" \
        | sed -e "s/[^[:alnum:]' -]\+//g" \
        | tr -cs A-Za-z\' '\n' \
        | tr A-Z a-z \
        | sort -u
}
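# an alternative implementation (not called by main) that uses wget and sed instead of w3m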
get_words_1() {
    wget -q -O- "$1" \
        | tr "\n" ' ' \
        | sed -e 's/<[^>]*>/ /g' \
        | sed -e 's/&[^;]*;/ /g' \
        | tr -cs A-Za-z\' '\n' \
        | tr A-Z a-z \
        | sort -u
}
main() {
    check "$@"
    start_timer
    mkdir -p tmp/
    rm -f tmp/{todo,done}.txt
    touch tmp/{todo,done,words}.txt
    echo "$1" > tmp/todo.txt
    local url
    while true; do
        # pop the top url from todo.txt
        url=$(head -1 tmp/todo.txt)
        sed -i tmp/todo.txt -e 1d
    
        # check whether we have already processed this URL
        grep -qxF "$url" tmp/done.txt && continue
        echo "$url"
    
        # extract the links from it and append them to todo.txt
        get_urls "$url" >> tmp/todo.txt
    
        # extract all the words from this url
        get_words "$url" >> tmp/words.txt
        sort -u tmp/words.txt > tmp/words1.txt
        mv tmp/words1.txt tmp/words.txt
    
        # mark this url as processed
        echo "$url" >> tmp/done.txt
    done
}
# call the main function
main "$@"
The file tmp/todo.txt contains a list of URLs, one per line, that
are to be visited. Initially we add to it the root URL that is given
as an argument.
The file tmp/done.txt contains a list of URLs, one per line, that
are already visited. The file tmp/words.txt contains a list of words
that have been collected so far, one per line and sorted
alphabetically.
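For example, while the crawler is running (or after it has stopped), we can check the progress of these files with standard commands like:
wc -l tmp/todo.txt tmp/done.txt tmp/words.txt
head -5 tmp/todo.txt
tail -5 tmp/words.txt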
There is an infinite loop in which we do these steps:
- Get the top URL from tmp/todo.txt (and delete it from the file).
- If it is not valid (does not start with http:// or https://), continue with the next URL.
- If it is valid, extract all the URLs on this page and append them to tmp/todo.txt, in order to visit them later.
- Get all the words from this page and merge them into tmp/words.txt.
- Append this URL to tmp/done.txt, so that we don't process it again.
The loop itself never ends on its own, but there is a timer that stops the program after it has been running for a certain number of seconds.
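The trick that makes the timer work is that $$ inside the background block still expands to the PID of the main script, so kill $$ terminates it. Here is a minimal standalone sketch of the same pattern, with a 5-second timeout and a dummy work loop chosen just for illustration:
#!/bin/bash
# minimal sketch of the background-timer pattern used in start_timer
{
    sleep 5
    echo "...Timeout..."
    kill -9 $$ >/dev/null 2>&1
} &
# dummy work that runs until the timer kills the script
while true; do
    echo "working..."
    sleep 1
done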
./crawler.sh
sudo apt install -y w3m
./crawler.sh
url=https://en.wikipedia.org/wiki/Linux
./crawler.sh $url
less tmp/words.txt
Let's increase the RUNTIME:
sed -i crawler.sh -e '/^RUNTIME=/ c RUNTIME=100'
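We can verify the change, for example with:
grep '^RUNTIME=' crawler.sh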
rm -rf tmp/
./crawler.sh $url
cat tmp/words.txt | wc -l
less tmp/words.txt
Download lesson15/part3.cast