April 8, 2022

HTML Scripting Toolsa few free packages I’ve come to use

I wrote a couple scripts this week that got me thinking about all the programs I’ve found over the years for working with HTML documents on the command line. Each heading below is the name of the relevant Debian package, which you can install on Debian (and Ubuntu) systems with sudo apt install $name.

libxml2-utils

Here’s a snippet from a script that checks the website of the City of Oakland and sends me an e-mail if they’ve posted a new homeless encampment services schedule:

        url=$(curl --silent -L "https://www.oaklandca.gov/resources/homeless-encampment-cleanup-schedule" | xmllint --html --nowarning --xpath 'string(//a[@aria-label="Access Homeless Encampment Clean Up Schedule"]/@href)' - 2>/dev/null)

      

curl fetches the webpage. xmllint has an --xpath flag that allows extracting specific content. Its --html flag allows it to parse HTML, even fairly “dirty” HTML, rather than strict XML. By common convention, - as the input path means /dev/stdin.

XPath is one of those arcane standard syntaxes that only starts to look normal when you’ve seen too much of it. We can read this particular one as “give the string value of the href attribute of the a element with aria-label attribute equal to Access Homeless Encampment Clean Up Schedule”. If you right-click the button on the the source of the City of Oakland webapge and choose “Inspect”, or view the page’s source, you’ll see they helpfully set Aria attributes to meet government accessibility requirements.

The Chrome and Firefox webpage inspector features also allow searching by XPath. I often draft my XPaths in the inspector, interactively, and then copy into a script.

html-xml-utils

Here’s a snippet from a script I use to look up words in Russian. It combines results from two websites, gramota.ru, which has the latest revision of Kuznetsov’s Big Explanatory Dictionary, and Russian Wiktionary, which has all word forms, with phonological stress, according to Zaliznyak’s morphological dictionary, and a lot more:

        {
# gramota.ru
curl -L -s "http://www.gramota.ru/slovari/dic/?word=$word&all=x" \
  | hxclean \
  | hxextract '.block-content' - \
  | iconv -f cp1251 -t utf8 \
  | sed 's!<span class="accent">\([^<]\+\)</span>!\1\&#x0301;!g' \
  | hxclean \
  | hxremove "script, form"

echo "<hr>"

# Wiktionary
curl -L -s "https://ru.wiktionary.org/wiki/$word" \
  | hxclean \
  | hxextract '.mw-parser-output' - \
  | hxclean \
  | hxremove '.clear, .gap-saver, .toc'
} >> "$tmp"

      

The hx- tools all come from the html-xml-utils package. hxclean cleans up malformed HTML pages, which are very common. hxextract pulls out particular elements by name or CSS class. Note that it expects a file path input, where - denotes /dev/stdin. hxremove removes elements by CSS selector.

iconv is often necessary with Russian-language websites written in cp1251, the old Windows character encoding for Russian, rather than utf8.

If you’re scratching your head about that sed call: gramota.ru wraps stressed vowels in <span class="accent"> tags, which are then styled with red text. That sed call replaces them with Unicode combining accent characters.

lynx

The first part of the dictionary lookup script above compiles an HTML page with snippets from gramota.ru and Wiktionary. The script then displays that file in the terminal:

        lynx -width=60 -dump "$tmp" | less

      

lynx is actually a full-featured, interactive terminal web browser. Here I’m using the -dump flag, so it just displays the HTML file on /dev/stdout and quits. The default display is a bit wide for my taste in terminal size, so -width=60 tells it to keep to sixty columns. Dictionary lookups don’t tend to fit on one screen, so I pipe to less as pager.

tidy

How about writing HTML? Here’s a formathtml script I use to fix up and indent HTML written or edited by hand:

        #!/bin/sh
exec tidy \
  -quiet \
  -indent \
  -utf8 \
  --indent-spaces 2 \
  --wrap 0 \
  --tidy-mark no \
  --show-info no \
  --show-warnings no \
  --show-errors 0

      

In Vim, I bind this to <leader>f:

        augroup html
  autocmd!
  " ...snip...
  autocmd FileType html noremap <buffer> <leader>f <Esc>mq:%!formathtml<CR>`q
augroup END

autocmd BufNewFile,BufRead *.html set filetype=html

      

When I’m done hacking up an HTML file, I can hit <leader>f and have the buffer formatted for me.

Update: pup

Thanks to Anton Semjonov, today I learned that Debian repositories now distribute Eric Chiang’s pup in binary form, so Debian-based distro users can install with sudo apt install pup. I’ve also used pup for many scripts, and expect that many folks would find it easier to work with than the hx- tools.

        $ pup -h
Usage
    pup [flags] [selectors] [optional display function]
# ...

      

I hesitated to recommend pup in this post if installing pup meant installing a Go language build chain. But that’s not longer necessary, at least on Debian-based systems.

Your thoughts and feedback are always welcome by e-mail.