Fun With Regex

We replaced the old ticketing system with a new one. Naturally there are people who are concerned about losing access to old tickets. So we looked at exporting all the tickets. My coworker had the better method of getting the data out, with one issue.

Because the old system used an HTML editor for a specific textarea, the content in it was difficult to read without expertise in HTML. Fine for a former Webmaster like myself, but few of the people who will need this read HTML like they read English.

My first thought was to look for products that clean up HTML. I even got excited when I noticed HTML Tidy comes with our Linux OS, but that just converted the HTML to a standardized format of HTML. (And trashed the plain-text portions of the ticket.) I did not find options for removing the HTML with Tidy.

So, my next thought was to try Regular Expressions (Regex). Certainly it ought to be doable. Just Regex is hard. No, difficult. No, turn your hair gray at 22. But, it can do anything if you put your mind to it. And I ran across RegExr which really simplified the process by showing how my pattern worked in sample content.

In the end I made a simple shell script to clean up the files.

#!/bin/bash
#############################################################
# Convert HTML to plaintext using sed.
# Created by Ezra Freelove, email
#############################################################
# Variables
WORKINGDIR=/stage/$1
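if [ -d "$WORKINGDIR" ] ; then echo "… found dir; continuing" ; else echo "… missing dir ; bailing" ; exit 1; fi
DESTDIR=${WORKINGDIR}/fixed
# Move into the working directory so the *.txt glob below finds the files.
cd "$WORKINGDIR" || exit 1
# Fix the files
mkdir -p "$DESTDIR"
for WORKINGFILE in *.txt
do
# Skip the unexpanded pattern when no .txt files exist.
[ -e "$WORKINGFILE" ] || continue
# Tags are stripped before &lt; and &gt; are decoded, so decoded brackets survive.
# Relies on GNU sed treating \n in the replacement as a newline.
sed -e 's|<br[\ \/]*>|\n|g' -e 's/<[^!>]*>//g' -e 's/&nbsp;/ /g' -e 's/&lt;/</g' -e 's/&gt;/>/g' "$WORKINGFILE" > "${DESTDIR}/fixed_${WORKINGFILE}"
done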

The regexes are:

  • s|<br[\ \/]*>|\n|g which means match HTML <br> tags and replace with a newline character. The <br> tag tells a web browser to go to the next line.
  • s/<[^!>]*>//g which means match from a less than (<) out to the next greater than (>), as long as there is no exclamation point in between, and delete the whole match, brackets included. This handles the HTML elements and their attributes, things like <p class="MsoPlainText"> or </span>. For some reason the date and username of the person who updated the ticket are stored as <! 2017-02-03 username>, so I had to figure out how to keep them.
  • s/&nbsp;/ /g which means match the text “&nbsp;”, which is a non-breaking space, and replace it with a normal space.
  • s/&lt;/</g which means replace the text “&lt;” with a “<”. And finally s/&gt;/>/g does the same thing but for greater than.
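
To see the five expressions working together, here is a small sketch that runs them against a single made-up line instead of a whole directory. The function name strip_html and the sample text are my own inventions for illustration, not part of the real export, and it assumes GNU sed (where \n in the replacement becomes a newline).

```shell
#!/bin/bash
# Sketch: the same five sed expressions as the script, wrapped in a
# function that filters stdin. strip_html is a hypothetical name.
# Assumes GNU sed, where \n in the replacement inserts a newline.
strip_html() {
  sed -e 's|<br[\ \/]*>|\n|g' \
      -e 's/<[^!>]*>//g' \
      -e 's/&nbsp;/ /g' \
      -e 's/&lt;/</g' \
      -e 's/&gt;/>/g'
}

# Made-up ticket content, not taken from the real export.
SAMPLE='<! 2017-02-03 username><p class="MsoPlainText">Disk&nbsp;usage &gt; 90%<br />Please check.</p>'
printf '%s\n' "$SAMPLE" | strip_html
```

With GNU sed the <! 2017-02-03 username> marker should come through untouched, the <p> and </p> tags should disappear, and the <br /> should split the text onto two lines. Because sed works one line at a time, a tag that is split across two lines in the source file would not be matched by this approach.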

An easy way to match all of these character entities would be pretty cool, but I think dealing with the most common ones is good enough.

Initially I was going to remove all the character codes like &nbsp;. In the end, I decided that the ones I handled should help people. The rarer ones can be worked out easily if someone runs across them.
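
If a few more entities turn out to be common, the same approach extends with one sed expression each. This is only a sketch, not something I ran against the real tickets: the function name decode_entities and the extra entities (&quot;, &#39;, &amp;) are assumptions about what might show up. Two details matter: an unescaped & in a sed replacement means "the whole match", so a literal ampersand has to be written as \&, and &amp; should be decoded last so that text like &amp;lt; ends up as the literal &lt; and not as <.

```shell
#!/bin/bash
# Sketch: decode a handful of common entities with sed.
# decode_entities is a hypothetical name; the entity list is an assumption.
decode_entities() {
  sed -e 's/&nbsp;/ /g' \
      -e 's/&lt;/</g' \
      -e 's/&gt;/>/g' \
      -e 's/&quot;/"/g' \
      -e "s/&#39;/'/g" \
      -e 's/&amp;/\&/g'   # last on purpose; \& is a literal ampersand
}

# Made-up sample line.
printf '%s\n' 'Tom &amp; Jerry said &quot;1 &lt; 2&quot; &amp;lt; done' | decode_entities
# prints: Tom & Jerry said "1 < 2" &lt; done
```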

Interactive Archives

My jaw dropped at the end of this blog post, Cloud Hosting and Academic Research.

There is a value in keeping significant old systems around, even if they no longer have active user bases.  A cloud hosting model seems so right to me–it’s scalable and robust. It just makes sense. But the hosting costs are a problem. Even if the total amount of money is small, grants are for specific work and have end dates. I can still be running a 10+ year old UNIX box, but I can’t still be paying hosting fees for a research project whose funding ended years ago, no matter how small that bill is.  Grants end–there’s no provision for “long term hosting.”  Our library can help us archive data, but they are not yet ready to “archive” an interactive system.  I hope companies that provide hosting services will consider donating long-term hosting for research.

Opening up a new area of digital archives by preserving the really cool works of the faculty seems like something I might enjoy.

My mentor in web design and server administration might have been described as a pack rat. He… Well, I guess, we kept around versions of web pages a decade old. Nothing really got deleted. The public just could not see it because of file permissions.

When building my portfolio, my mistake was not gathering up all the files needed to replicate the sites I designed. I’m no longer doing web design or even programming. So it is okay.

A professor in Geology had a pretty cool Virtual Museum for Fossils. The site moved around a few times, eventually ending up on the main web server that also hosts WWW. Of course, HTML, images, and Flash files are easy to archive. Take the files and place them on a web server. Since they are static, it is easy to keep them around for a long time. As long as the standards remain honored, they should be good. Developers of web browsers are under pressure to go for the new, which could eventually mean abandoning the old.

Scripted web sites using Perl, PHP, ASP, JSP, JavaScript, or AJAX require a working interpreter. Even then, some things might not be backwards compatible.

About a year ago my mother ran across 8mm video film. An uncle found a place that converted it to DVD. Will we even be using DVDs in a decade? Maybe the 8mm needs to go on Blu-ray?

Going back to the scripted web sites, should an archived web site’s code be updated to work on the new version of the interpreter? Maybe. If makers of the interpreters allowed for running in a backwards compatible mode, then all would be good. Even better would be the ability to add to a script a variable that tells the interpreter which back version to pretend to be. Administrators could then have the programmers check non-working scripts by just telling the interpreter to simulate an older version.

Apple Trying To Poach IE6 Users

I attempted to watch the Transformers 3 trailer, but apparently Chrome on Linux was a no-go for the JavaScript which hides the web site and displays the trailer. Fancy but broken. So I thought I would look at the HTML and get the .mov file. I found this snippet of code in the HTML quite interesting.

<!--[if lt IE 7]>
<div id="ie6-message">
<h2>You are currently using an outdated browser.</h2>
<p>Please upgrade to a <a href="http://www.apple.com/safari/">modern browser</a> to fully experience this site.<p>
</div>

Where most places would have someone upgrade to a newer version of the software they are currently using, Apple is trying to poach Microsoft users. Bravo! Bravo!