Fun With Regex

We replaced the old ticketing system with a new one. Naturally there are people who are concerned about losing access to old tickets. So we looked at exporting all the tickets. My coworker had the better method of getting out the data with one issue.

Because the old system used an HTML editor for a specific textarea, the content in it was difficult to read without expertise in HTML. Fine for a former Webmaster like myself, but few of the people who will need this can read HTML like they do English.

My first thought was to look for products that clean up HTML. I even got excited when I noticed HTML Tidy comes with our Linux OS, but that just converted the HTML to a standardized format of HTML. (And trashed the plain-text portions of the ticket.) I did not find options for removing the HTML with Tidy.

So, my next thought was to try Regular Expressions (Regex). Certainly it ought to be doable. Just Regex is hard. No, difficult. No, turn your hair gray at 22. But, it can do anything if you put your mind to it. And I ran across RegExr which really simplified the process by showing how my pattern worked in sample content.

In the end I made a simple shell script to clean up the files.

#!/bin/bash
#############################################################
# Convert HTML to plaintext using sed.
# Created by Ezra Freelove, email
#############################################################
# Variables
WORKINGDIR="/stage/$1"
if [ -d "$WORKINGDIR" ] ; then echo "… found dir; continuing" ; else echo "… missing dir ; bailing" ; exit 1; fi
DESTDIR="${WORKINGDIR}/fixed"
# Make a list of files to convert.
cd "$WORKINGDIR" || exit 1
WORKINGLIST=`ls *.txt`
# Fix the files
mkdir -p "$DESTDIR"
for WORKINGFILE in $WORKINGLIST
do
sed -e 's|<br[\ \/]*>|\n|g' -e 's/<[^!>]*>//g' -e 's/&nbsp;/ /g' -e 's/&lt;/</g' -e 's/&gt;/>/g' "$WORKINGFILE" > "${DESTDIR}/fixed_${WORKINGFILE}"
done

The regexes are:

  • s|<br[\ \/]*>|\n|g matches HTML <br> tags and replaces them with a newline character. The <br> tag tells a web browser to go to the next line.
  • s/<[^!>]*>//g matches from a less-than (<) out to the next greater-than (>), as long as there is no exclamation point in between, and deletes the whole match. This handles the HTML elements and their attributes, things like <p class="MsoPlainText"> or </span>. For some reason the date and username of the person who updated the ticket are stored as <! 2017-02-03 username>, so I had to figure out how to keep them.
  • s/&nbsp;/ /g matches the text "&nbsp;", which is a non-breaking space, and replaces it with a normal space.
  • s/&lt;/</g replaces the text "&lt;" with a "<". And finally s/&gt;/>/g does the same thing for greater than.
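To see the expressions working together, here is a small sketch that runs them over one made-up line. The strip_html function name and the sample text are my own inventions for illustration, not part of the real export, and the \n in the replacement assumes GNU sed (the one that ships with most Linux distributions).

```shell
#!/bin/bash
# Hypothetical helper wrapping the same sed expressions as the script above.
# Reads HTML-ish text on stdin and writes the cleaned text to stdout.
strip_html() {
  sed -e 's|<br[\ \/]*>|\n|g' \
      -e 's/<[^!>]*>//g' \
      -e 's/&nbsp;/ /g' \
      -e 's/&lt;/</g' \
      -e 's/&gt;/>/g'
}

# Made-up sample line; not taken from a real ticket.
sample='<! 2017-02-03 username><p class="MsoPlainText">Disk&nbsp;usage &gt; 90%<br />Please check.</p>'

cleaned=$(printf '%s\n' "$sample" | strip_html)
printf '%s\n' "$cleaned"
```

With GNU sed this should print the <! 2017-02-03 username> marker untouched, followed by "Disk usage > 90%", and then "Please check." on its own line. The entity replacements run after the tag removal on purpose, so a decoded "<" is never mistaken for the start of a tag.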

An easy way to match all of these latter ones would be pretty cool, but I think dealing with the most common ones is good enough.

Initially I was going to remove all the character codes like &nbsp;. In the end, I decided that the ones I handled should help people. The more rare ones can be determined easily if someone runs across them.

March of the Machines (Automation)

Saw a tweet about an interesting piece in ABC News Australia, "Digital disruption: How science and the human touch can help employees resist the march of the machines." Basically, many jobs are going away due to automation. WIRED has a similar story: "Robots Will Steal Our Jobs, But They’ll Give Us New Ones."

One of the longest struggles I have pushed in my career is automation. My approach falls along the line of: if it is going to be done more than once or will take a really long time by hand, then it needs to be automated. This is hard to do. The temptation is to do it by hand once, see how it went, then write a script which does it for the next time. The trouble is that once the first one is done, there is little incentive to write the script before the second one comes along. Best is to make the automation part of doing it the first time; the second time can include any remediation necessary to make it better.

All this automation makes us more effective employees. My team of three managed hundreds of web servers and dozens of database servers for ten sites. Without automation that would have been a nightmare. The replacement product was more difficult to automate so with fewer servers we needed more people. Yet the drive to better automation is making lives easier. (Technically I left that program about a year ago when my replacement was hired and took over my spot in the on-call rotation.)

A fear I hear about automation is that people will lose their jobs. It reminds me of globalization and manufacturing moving overseas to China. Highly repetitive, mind-numbing jobs were the most at risk, and as those work forces got better, what was at risk moved up the complexity ladder.

The fear of both globalization and automation led to books like A Whole New Mind. The idea is that if your job is highly repetitive or analytical, then it is at risk to these forces. Becoming the person who designs, describes, coordinates, or finds meaning in stuff (aka “right brain” activities) is the way to survive the coming storm. This book greatly influenced how I started thinking about my work.

Back in 2003, I automated everything I could because I was overwhelmed with work and had few resources beyond great computers and my own skill to make it better. My supervisees focused on meeting with the clients to talk about the web site they wanted and building that. I wrote code to report on or fix problems so that people would not need to call or email about them.

Where I wish we would head is more like You Really Don’t Need To Work So Much. I meant to send this to my boss. (Maybe he’s reading this blog?) All our efficiencies should mean we have less to do, not more, so why do we work so hard?

The past fifty years have seen massive gains in productivity, the invention of countless labor-saving devices, and the mass entry of women into the formal workforce. If we assume that there is, to a certain degree, a fixed amount of work necessary for society to function, how can we at once be more productive, have more workers, and yet still be working more hours? Something else must be going on.

From my experience, the to-do list gets ever larger. Not because there is more to do, but because more is possible. I’d just rather spend more of my time on solving hard problems than easy repetitive tasks.

P.S. This post really only exists because I loved the phrase “March of the Machines” enough that I wanted it as a title for something on this blog.

Check Backups

I get a daily report about backups for each of the projects. One particular one has shown some odd results. The report has columns for: Completed, Successful, Partial, Failed, Missed, and Active. The particular problem is that a backup shows up in none of those columns when it is actually still running. (So it should show up as Active?)

So the other day I wrote a Bash script to check some things. In the future I can quickly assess if this is the SNAFU (Situation Normal All F***ed Up) or something else. Really, the other day I had to dig back through my shell history for this information to do it manually, which is stupid.

#!/bin/bash
###########################################################
# Check backup status.
# 2015-JUL-24 Ezra Freelove, email@domain.com
###########################################################
# Look for running processes
echo "… Backup processes"
ps -ef | grep '[p]rocname'
# Report logs
echo "… Logs"
ls -ltr /path/to/agent.log /session/path/to/clientlogs/*.log
echo "=========="
# Specific lines excluding spam and blank
tail -100 /path/to/agent.log | grep -vf /home/me/myscripts/backups/exclude/ckrunning.txt | grep -v "^ $"

If the backup is still running, then the ps will show the PID and time started.
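The brackets in [p]rocname are there so the grep does not report itself. Here is a rough sketch of why that works, using simulated ps lines instead of real ones. The PIDs, users, times, and the /path/to/procname location are all made up for the illustration.

```shell
#!/bin/bash
# Simulated `ps -ef` lines. "procname" is the placeholder from the script;
# every other field here is invented.
agent_line='root  4242     1  0 01:00 ?      02:13:07 /path/to/procname'
plain_grep_line='me    5151  5000  0 09:30 pts/0  00:00:00 grep procname'
bracket_grep_line='me    5151  5000  0 09:30 pts/0  00:00:00 grep [p]rocname'

# A plain pattern matches the agent AND the grep command line itself.
plain_count=$(printf '%s\n%s\n' "$agent_line" "$plain_grep_line" | grep -c 'procname')

# The bracketed pattern still means "procname" as a regex, but the literal
# text "[p]rocname" on the grep command line does not contain "procname".
bracket_count=$(printf '%s\n%s\n' "$agent_line" "$bracket_grep_line" | grep -c '[p]rocname')

echo "plain=$plain_count bracketed=$bracket_count"
```

The plain pattern counts two lines and the bracketed one counts only the agent, which is why the script can skip the usual extra grep -v grep.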

In addition the logs are reported with the most recently changed at the end.

The agent talks to a central service to find out when it should be doing stuff. The exclude/ckrunning.txt file uses the below entries to ignore spammy lines where the agent is checking but not told to do anything. When everything is operating normally, this ignores about 70 of the 100 lines my check looks at.

  • Sleeping
  • Workorder received: sleep
  • Requesting work

Long-term, I guess I should ask why the long-running backup does not appear in the Active category when it obviously should.