Fun With Regex

We replaced the old ticketing system with a new one. Naturally there are people who are concerned about losing access to old tickets. So we looked at exporting all the tickets. My coworker had the better method for getting the data out, with one issue.

Because the old system used an HTML editor for a specific textarea, the content of that field was difficult to read without expertise in HTML. Fine for a former Webmaster like myself, but few of the people who will need this read HTML like they do English.

My first thought was to look for products that clean up HTML. I even got excited when I noticed HTML Tidy comes with our Linux OS, but that just converted the HTML to a standardized format of HTML. (And trashed the plain-text portions of the ticket.) I did not find options for removing the HTML with Tidy.

So, my next thought was to try Regular Expressions (Regex). Certainly it ought to be doable. Just Regex is hard. No, difficult. No, turn your hair gray at 22. But, it can do anything if you put your mind to it. And I ran across RegExr which really simplified the process by showing how my pattern worked in sample content.

In the end I made a simple shell script to clean up the files.

#!/bin/bash
#############################################################
# Convert HTML to plaintext using sed.
# Created by Ezra Freelove, email
#############################################################
# Variables
WORKINGDIR=/stage/$1
if [ -d "$WORKINGDIR" ] ; then echo "… found dir; continuing" ; else echo "… missing dir ; bailing" ; exit 1; fi
DESTDIR=${WORKINGDIR}/fixed
# Make a list of files to convert.
cd $WORKINGDIR
WORKINGLIST=`ls *.txt`
# Fix the files
mkdir -p $DESTDIR
for WORKINGFILE in $WORKINGLIST
do
sed -e 's|<br[\ \/]*>|\n|g' -e 's/<[^!>]*>//g' -e 's/&nbsp;/ /g' -e 's/&lt;/</g' -e 's/&gt;/>/g' "$WORKINGFILE" > "${DESTDIR}/fixed_${WORKINGFILE}"
done

The regexes are:

  • s|<br[\ \/]*>|\n|g which means match HTML <br> tags and replace each with a newline character. The <br> tag tells a web browser to go to the next line.
  • s/<[^!>]*>//g which means match from a less than (<) out to the next greater than (>), unless an exclamation point appears in between, and delete the whole match, brackets included. This handles the HTML elements and their attributes, things like <p class="MsoPlainText"> or </span>. For some reason the date and username of the person who updated the ticket are stored as <! 2017-02-03 username>, so I had to figure out how to keep them.
  • s/&nbsp;/ /g which means match the text "&nbsp;", which is a non-breaking space, and replace it with a normal space.
  • s/&lt;/</g which means replace the text "&lt;" with a "<". And finally the same thing but for greater than.
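To see the expressions work on sample content outside of RegExr, here is a minimal sketch that runs the same five of them against one made-up line. The sample text and the function name strip_html are my own assumptions for illustration, not something from a real ticket, and the \n replacement relies on GNU sed as shipped with Linux.

```shell
#!/bin/bash
# Hypothetical sample line, invented for illustration only.
sample='<! 2017-02-03 username><p class="MsoPlainText">Disk&nbsp;usage &gt; 90%<br />Please check.</p>'

# The same five expressions as the script, reading stdin and writing stdout.
# strip_html is an assumed name, not part of the original script.
strip_html() {
  sed -e 's|<br[\ \/]*>|\n|g' \
      -e 's/<[^!>]*>//g' \
      -e 's/&nbsp;/ /g' \
      -e 's/&lt;/</g' \
      -e 's/&gt;/>/g'
}

printf '%s\n' "$sample" | strip_html
# Prints two lines:
#   <! 2017-02-03 username>Disk usage > 90%
#   Please check.
```

The order matters: the tags are stripped before &lt; and &gt; are decoded, so a decoded "<" in the ticket text is never mistaken for the start of a tag.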

An easy way to match all of these character codes at once would be pretty cool, but I think dealing with the most common ones is good enough.

Initially I was going to remove all the character codes like &nbsp;. In the end, I decided that the ones I handled should help people. The rarer ones can be figured out easily if someone runs across them.
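If the rarer codes ever do become a problem, one possible extension is a few more sed expressions. This is only a sketch: decode_entities is a hypothetical helper that is not part of the script above, and the choice of extra codes (&quot;, &#39;, &amp;) is my assumption about what might turn up.

```shell
#!/bin/bash
# Sketch only: decode_entities is a hypothetical helper covering a handful
# of common character codes. It reads stdin and writes stdout.
decode_entities() {
  sed -e 's/&nbsp;/ /g' \
      -e 's/&lt;/</g' \
      -e 's/&gt;/>/g' \
      -e 's/&quot;/"/g' \
      -e "s/&#39;/'/g" \
      -e 's/&amp;/\&/g'   # last, so "&amp;lt;" becomes "&lt;" and not "<"
}

printf '%s\n' 'Tom &amp; Jerry said &quot;x &lt; y&quot; &amp;lt;' | decode_entities
# Prints: Tom & Jerry said "x < y" &lt;
```

In a sed replacement a bare & stands for the matched text, which is why the last expression writes \& to get a literal ampersand.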

Teeshirts

Occasionally people ask about my teeshirts. (One can also find them on my Flickr teeshirt tag or Teeshirts I Own Pinterest board.)

Thinkgeek Shirts
got root?
/Everyone stand back/ I know regular expressions
You Are Dumb in binary
Rays cast from this shirt travel at over 670,000,000 MPH
Do or do not. There is no try. (in shell)
98% Chimp
Reverse Engineer
I’m blogging this.
I failed the Turing test
Come to the dark side, we have cookies -V
There’s more than one way to do it.
There are 10 types of people in the world: those who understand binary and those who don’t
1UP Mushroom
You read my t-shirt. That’s enough social interaction for now.
There’s no place like ::1
<BODY>
/* No Comment */
i never finish anyth
if you are not part of the solution, then you are part of the precipitate.
I see dead pixels
OBEY GRAVITY it’s the LAW!
Threadless Shirt
Video games ruined my life. Good thing I have two extra lives.
Shirt.Woot Shirts
Shutterbug
@-@ Imperial Walker (AT-AT)
Misc Other Printer Shirt
sudo Make me a sandwich
Blackborg
New Kids Under the Block
WordPress Galaxy Tee

P.S. There are way too many photos of me wearing the Turing test shirt. I’m gonna have to bench it for a while.