Examples 3
Write a bash script that collects web pages from the internet, starting from a root page, extracts the words in each page, and creates a list of words.
vim crawler.sh
crawler.sh
#!/bin/bash

# stop the program after running for this many seconds
RUNTIME=10

check() {
    # check dependencies
    hash lynx || { echo "Please install lynx" >&2; exit 1; }
    hash w3m || { echo "Please install w3m" >&2; exit 1; }
    # check that there is an argument
    [[ -z $1 ]] && { echo "Usage: $0 <url>" >&2; exit 1; }
}

start_timer() {
    # run a watchdog in the background: after RUNTIME seconds
    # it prints a summary and kills the main script
    {
        sleep $RUNTIME
        echo "...Timeout..."
        local nr_words=$(cat tmp/words.txt | wc -l)
        echo "The number of collected words: $nr_words"
        kill -9 $$ >/dev/null 2>&1
    } &
}

get_urls() {
    # extract the links of the given page, keeping only http(s) URLs
    # and stripping any #fragment part
    lynx "$1" -listonly -nonumbers -dump 2>/dev/null \
        | grep -E '^https?://' \
        | sed -e 's/#.*$//'
}

get_words() {
    # extract the words of the given page: dump it as text,
    # keep only letters and apostrophes, one word per line,
    # lowercase, sorted and unique
    w3m -dump "$1" \
        | sed -e "s/[^[:alnum:]' -]\+//g" \
        | tr -cs A-Za-z\' '\n' \
        | tr A-Z a-z \
        | sort -u
}

get_words_1() {
    # alternative implementation that uses wget and strips the
    # HTML tags and entities with sed (not called by main)
    wget -q -O- "$1" \
        | tr "\n" ' ' \
        | sed -e 's/<[^>]*>/ /g' \
        | sed -e 's/&[^;]*;/ /g' \
        | tr -cs A-Za-z\' '\n' \
        | tr A-Z a-z \
        | sort -u
}

main() {
    check "$@"
    start_timer

    mkdir -p tmp/
    rm -f tmp/{todo,done}.txt
    touch tmp/{todo,done,words}.txt
    echo "$1" > tmp/todo.txt

    local url
    while true; do
        # pop the top url from todo.txt
        url=$(head -1 tmp/todo.txt)
        sed -i tmp/todo.txt -e 1d

        # check whether we have already processed this url
        grep -qF "$url" tmp/done.txt && continue
        echo "$url"

        # extract the links from it and append them to todo.txt
        get_urls "$url" >> tmp/todo.txt

        # extract all the words from this url
        get_words "$url" >> tmp/words.txt
        sort -u tmp/words.txt > tmp/words1.txt
        mv tmp/words1.txt tmp/words.txt

        # mark this url as processed
        echo "$url" >> tmp/done.txt
    done
}

# call the main function
main "$@"
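After saving the script we make it executable, so that it can be run as ./crawler.sh:
chmod +x crawler.sh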
The file tmp/todo.txt contains a list of URLs, one per line, that are to be visited. Initially we add to it the root URL that is given as an argument. The file tmp/done.txt contains a list of URLs, one per line, that have already been visited. The file tmp/words.txt contains the list of words that have been collected so far, one per line, sorted alphabetically.
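While the crawler is running, we can watch these files grow from another terminal (assuming we are in the same directory), for example:
wc -l tmp/todo.txt tmp/done.txt tmp/words.txt
tail -f tmp/done.txt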
There is an infinite loop in which we do these steps:
- Get the top URL from tmp/todo.txt (and delete it from the file).
- If we have already processed this URL (it is listed in tmp/done.txt), skip it and continue with the next one.
- If it is not valid (does not start with http:// or https://), continue with the next URL (in this script the grep in get_urls already filters such links out, so they never reach tmp/todo.txt).
- If it is valid, extract all the URLs of this page and append them to tmp/todo.txt, in order to visit them later.
- Get all the words from this page and merge them into tmp/words.txt.
- Append this URL to tmp/done.txt, so that we don't process it again.
This infinite loop is never stopped from within the loop itself; instead, the timer started by start_timer stops the program after it has been running for RUNTIME seconds.
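To see how this timer works in isolation, here is a minimal sketch of the same pattern that start_timer uses: a group of commands runs in the background, sleeps, and then kills the main script through its PID ($$). The messages and the value of RUNTIME here are just illustrative.
#!/bin/bash
RUNTIME=3
{
    sleep $RUNTIME
    echo "...Timeout..."
    kill -9 $$ >/dev/null 2>&1
} &
# the "infinite" loop below ends only when the background watchdog fires
while true; do
    echo "working..."
    sleep 1
done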
./crawler.sh
sudo apt install -y w3m
./crawler.sh
url=https://en.wikipedia.org/wiki/Linux
./crawler.sh $url
less tmp/words.txt
Let's increase the RUNTIME:
sed -i crawler.sh -e '/^RUNTIME=/ c RUNTIME=100'
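We can verify that the value was actually changed:
grep '^RUNTIME=' crawler.sh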
rm -rf tmp/
./crawler.sh $url
cat tmp/words.txt | wc -l
less tmp/words.txt
Download lesson15/part3.cast