Fun With Regex

We replaced the old ticketing system with a new one. Naturally there are people who are concerned about losing access to old tickets. So we looked at exporting all the tickets. My coworker had the better method of getting out the data with one issue.

Because the old sytem used an HTML editor for a specific textarea, the content in them was difficult to read without expertise in HTML. Fine for a former Webmaster like myself, but few people who will need this read it like they do English.

My first thought was to look for products that clean up HTML. I even got excited when I notice HTML Tidy comes with our Linux OS, but that just converted the HTML to standardized format of HTML. (And trashed the plain-text portions of the ticket.) I did not find options for removing the HTML with Tidy.

So, my next thought was to try Regular Expressions (Regex). Certainly it ought to be doable. Just Regex is hard. No, difficult. No, turn your hair gray at 22. But, it can do anything if you put your mind to it. And I ran across RegExr which really simplified the process by showing how my pattern worked in sample content.

In the end I mad a simple shell script to clean up the files.

#!/bin/bash
#############################################################
# Convert HTML to plaintext using sed.
# Created by Ezra Freelove, email
#############################################################
# Variables
WORKINGDIR=/stage/$1
if [ -d $WORKINGDIR ] ; then echo “… found dir; continuing” ; else echo “… missing dir ; bailing” ; exit; fi
DESTDIR=${WORKINGDIR}/fixed
# Make a list of files to convert.
cd $WORKINGDIR
WORKINGLIST=`ls *.txt`
# Fix the files
mkdir -p $DESTDIR
for WORKINGFILE in $WORKINGLIST
do
sed -e ‘s|<br[\ \/]*>|\n|g’ -e ‘s/<[^!>]*>//g’ -e ‘s/&nbsp;/ /g’ -e ‘s/&lt;/</g’ -e ‘s/&gt;/>/g’ $WORKINGFILE > ${DESTDIR}/fixed_${WORKINGFILE}
done

The regexes are:

  • s|<br[\ \/]*>|\n|g which means match HTML <br> tags and replace with a newline character . The <br> tag tells a web browser to go to the next line.
  • s/<[^!>]*>//g which means match a less than (<) out to the next greater than but exclude an exclamation point. Delete everything between. This handle the HTML elements and their attributes. This like <p class=”MsoPlainText”> or </span>. For some reason the date and username of the person who updated the ticket are stored as <! 2017-02-03 username>, so I had to figure out how to keep them.
  • ‘s/&nbsp;/ /g’ which means match the text “&nbsp;” which is a non-breaking space it with a normal space.
  • ‘s/&lt;/</g’ which means replace the text “&lt;” with a “<“. And finally the same thing but for greater than.

An easy way to match all of these latter ones would be pretty cool, but I think dealing with the most common ones is good enough.

Initially I was going to remove all the character codes like &nbsp;. In the end, I decided that the ones I handled should help people. The more rare ones can be determined easily if someone runs across them.

1 thought on “Fun With Regex

Leave a Reply

%d bloggers like this: