the empty quarter

Remove HTML tags with sed

October 17th, 2011 | Tags: commandline, sed

Sed can be used to strip out all HTML or XML tags from a file, leaving only the plain text. Suppose you have a file gnulinux.html with the following contents:

<p>The combination of <a href="/gnu/linux-and-gnu.html">GNU and Linux</a> is the <strong>GNU/Linux operating system</strong>, now used by millions and sometimes incorrectly called simply "Linux".</p>

Tempting but incorrect: .* is greedy, so sed finds the longest possible match on each line. In this case that is the entire line, from the first < to the last >, so the command prints only an empty line:
$ sed -e 's/<.*>//g' gnulinux.html

Correct version:
$ sed -e 's/<[^>]*>//g' gnulinux.html
The combination of GNU and Linux is the GNU/Linux operating system, now used by millions and sometimes incorrectly called simply "Linux".
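
To see the difference between the two patterns side by side, here is a small sketch that runs both on the same line. The input is a shortened stand-in for gnulinux.html, and the variable names (html, greedy, stripped) are only illustrative.

```shell
# Sample input: a shortened, hypothetical stand-in for gnulinux.html.
html='<p>The combination of <a href="/gnu/linux-and-gnu.html">GNU and Linux</a> is the <strong>GNU/Linux operating system</strong>.</p>'

# Greedy pattern: .* runs from the first '<' to the last '>' on the line.
greedy=$(printf '%s\n' "$html" | sed -e 's/<.*>//g')

# Negated bracket expression: [^>]* cannot cross a '>', so each tag is matched on its own.
stripped=$(printf '%s\n' "$html" | sed -e 's/<[^>]*>//g')

# Brackets make the empty result of the greedy pattern visible.
printf 'greedy:   [%s]\n' "$greedy"
printf 'stripped: [%s]\n' "$stripped"
```

The greedy result should be empty, while the second pattern should leave only the text between the tags. Because sed processes its input one line at a time, this approach only handles tags that open and close on the same line.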
