Remove HTML tags with sed

October 17th, 2011

Sed can strip all HTML or XML tags from a file, leaving only the plain text. Suppose you have a file gnulinux.html with the following contents:


<p>The combination of <a href="/gnu/linux-and-gnu.html">GNU and Linux</a> is the <strong>GNU/Linux operating system</strong>, now used by millions and sometimes incorrectly called simply “Linux”.</p>

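Tempting but incorrect: sed works line by line, and .* finds the longest possible match, which runs from the first < to the last > on the line. Because this line starts with <p> and ends with </p>, the match is the entire line, so the command prints only an empty line: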

$ sed -e 's/<.*>//g' gnulinux.html
 
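The greedy match is easier to see on a line that does not begin and end with a tag. The following is a minimal sketch using a made-up sample string (the variable names and the sample text are illustrative, not part of gnulinux.html):

```shell
# Hypothetical sample line, chosen only for illustration.
line='<b>bold</b> and <i>italic</i> text'

# Greedy pattern: .* extends the match from the first '<'
# to the last '>' on the line, swallowing the text in between.
greedy=$(printf '%s\n' "$line" | sed -e 's/<.*>//g')

# Brackets make the surviving leading space visible.
printf '[%s]\n' "$greedy"
```

This prints `[ text]`: everything from `<b>` through `</i>` is removed as one match, including the words between the tags.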

Correct version:

$ sed -e 's/<[^>]*>//g' gnulinux.html
The combination of GNU and Linux is the GNU/Linux operating system, now used by millions and sometimes incorrectly called simply “Linux”.
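
One caveat: because sed processes its input one line at a time, the pattern above cannot match a tag that is split across two lines. A possible workaround, sketched here on a made-up two-line input and assuming the file is small enough to hold in memory, is to join all lines in the pattern space before substituting. The behaviour of [^>] matching a newline is what GNU sed does; other sed implementations may differ.

```shell
# Hypothetical input: the opening <a ...> tag is split across two lines.
input='<p>See <a
href="/gnu/linux-and-gnu.html">GNU and Linux</a>.</p>'

# Line by line: "<a" on the first line has no closing ">" on that line,
# so pieces of the split tag are left in the output.
line_by_line=$(printf '%s\n' "$input" | sed -e 's/<[^>]*>//g')

# Loop: while not on the last line, append the next line (N) and branch
# back to label a. Once all input is in the pattern space, substitute once.
whole_input=$(printf '%s\n' "$input" |
  sed -e ':a' -e '$!{' -e 'N' -e 'ba' -e '}' -e 's/<[^>]*>//g')

printf '%s\n' "$line_by_line"
printf '%s\n' "$whole_input"
```

With GNU sed, the first command leaves `<a` and `href="/gnu/linux-and-gnu.html">` in its output, while the second prints `See GNU and Linux.`.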
