Examples 1

  1. This simple script takes the URL of a web page as its argument and prints all the URLs found in that page:

    get_urls.sh
    #!/bin/bash
    # Get all the URLs inside a given HTML page.

    PAGE=$1

    if [[ -z $PAGE ]]; then
        echo "Usage: $0 <html-page>" >&2
        exit 1
    fi

    wget -qO- "$PAGE" \
    | grep -Eoi '<a [^>]+>' \
    | grep -Eo 'href="?([^\"]+)"?' \
    | grep -v 'mailto:' \
    | sed -e 's/"//g' -e 's/href=//'

    vim get_urls.sh
    chmod +x get_urls.sh
    ./get_urls.sh
    url=http://linuxcommand.org/
    ./get_urls.sh $url

    Let's see how it works:

    wget -qO- $url
    wget -qO- $url | grep -Eoi '<a [^>]+>'

    The -E option enables extended regular expression syntax, -o prints only the matching part of each line, and -i makes the match case-insensitive. This stage extracts all the anchor tags.

    wget -qO- $url | grep -Eoi '<a [^>]+>' | grep -Eo 'href="?([^\"]+)"?'

    This stage extracts the href attribute from each anchor tag.

    wget -qO- $url \
    | grep -Eoi '<a [^>]+>' \
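    | grep -Eo 'href="?([^\"]+)"?' \
    | grep -v 'mailto:' \
    | sed -e 's/"//g' -e 's/href=//'

    Finally, grep -v drops the mailto: links, and sed strips the double quotes and the href= prefix, leaving one URL per line.
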
  2. This simple script takes the URL of a web page as its argument and prints the 100 most frequently used words in it:

    get_words.sh
    #!/bin/bash
    # Return a list of the 100 most frequently used words inside a given page.

    PAGE=$1

    if [[ -z $PAGE ]]; then
        echo "Usage: $0 <html-page>" >&2
        exit 1
    fi

    wget -q -O- "$PAGE" \
    | tr "\n" ' ' \
    | sed -e 's/<[^>]*>/ /g' \
    | sed -e 's/&[^;]*;/ /g' \
    | tr -cs A-Za-z\' '\n' \
    | tr A-Z a-z \
    | sort \
    | uniq -c \
    | sort -k1,1nr -k2 \
    | sed 100q \
    | sed -E 's/^ +//' \
    | cut -d' ' -f2

    vim get_words.sh
    chmod +x get_words.sh
    ./get_words.sh
    url=https://en.wikipedia.org/wiki/Linux
    ./get_words.sh $url
    ./get_words.sh $url | less
    ./get_words.sh $url | wc -l
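
    Let's see how it works. The first stages join the page into a single line, replace tags and character entities with spaces, and split the text into lowercase words, one per line:

    wget -qO- $url \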
    | tr "\n" ' ' \
    | sed -e 's/<[^>]*>/ /g' \
    | sed -e 's/&[^;]*;/ /g' \
    | tr -cs A-Za-z\' '\n' \
    | tr A-Z a-z \
    | less

    Sorting the words and piping them through uniq -c gives the number of occurrences of each word:

    wget -qO- $url \
    | tr "\n" ' ' \
    | sed -e 's/<[^>]*>/ /g' \
    | sed -e 's/&[^;]*;/ /g' \
    | tr -cs A-Za-z\' '\n' \
    | tr A-Z a-z \
    | sort \
    | uniq -c \
    | less
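
    Sorting numerically by count in descending order (ties are ordered by the word itself) and keeping only the first 100 lines with sed 100q leaves the most frequent words. The last two stages of the script then strip the counts:

    wget -qO- $url \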
    | tr "\n" ' ' \
    | sed -e 's/<[^>]*>/ /g' \
    | sed -e 's/&[^;]*;/ /g' \
    | tr -cs A-Za-z\' '\n' \
    | tr A-Z a-z \
    | sort \
    | uniq -c \
    | sort -k1,1nr -k2 \
    | sed 100q \
    | less
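
Both pipelines can also be tried without network access by feeding them HTML on standard input. The following is a minimal sketch and not part of the original scripts: the function names `extract_urls` and `top_words`, the optional count argument, and the sample markup printed by `sample_html` are all assumptions made for illustration.

```shell
#!/bin/bash
# Offline sketch: the same stages as get_urls.sh and get_words.sh,
# wrapped in functions that read HTML from stdin instead of calling wget.
# All names below (sample_html, extract_urls, top_words) are hypothetical.

# Print a small, made-up HTML page to experiment with.
sample_html() {
    printf '%s\n' \
        '<html><body>' \
        '<p>The cat &amp; the dog. <a href="https://example.org/cat">The cat</a></p>' \
        '<p><a class="x" href="/dog.html">dog</a> <a href="mailto:me@example.org">mail</a></p>' \
        '</body></html>'
}

# Same stages as get_urls.sh, minus the download.
extract_urls() {
    grep -Eoi '<a [^>]+>' \
        | grep -Eo 'href="?([^\"]+)"?' \
        | grep -v 'mailto:' \
        | sed -e 's/"//g' -e 's/href=//'
}

# Same stages as get_words.sh, minus the download.
# $1 is the number of words to keep (assumed default: 100, as in the script).
top_words() {
    local n=${1:-100}
    tr '\n' ' ' \
        | sed -e 's/<[^>]*>/ /g' \
        | sed -e 's/&[^;]*;/ /g' \
        | tr -cs "A-Za-z'" '\n' \
        | tr A-Z a-z \
        | sort \
        | uniq -c \
        | sort -k1,1nr -k2 \
        | sed "${n}q" \
        | sed -E 's/^ +//' \
        | cut -d' ' -f2
}

sample_html | extract_urls
sample_html | top_words 3
```

On the sample page this prints the two links that are not mailto: addresses (https://example.org/cat and /dog.html), followed by the three most frequent words (the, cat, dog). Reading from stdin keeps each stage easy to inspect: any function can be cut short by piping its input through only the first few commands, as in the walk-throughs above.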