Fun With Regex

We replaced the old ticketing system with a new one. Naturally there are people who are concerned about losing access to old tickets. So we looked at exporting all the tickets. My coworker had the better method of getting the data out, but it had one issue.

Because the old system used an HTML editor for one specific textarea, the content in that field was difficult to read without expertise in HTML. Fine for a former Webmaster like myself, but few of the people who will need this read HTML like they do English.

My first thought was to look for products that clean up HTML. I even got excited when I noticed HTML Tidy comes with our Linux OS, but that just converted the HTML to a standardized form of HTML. (And trashed the plain-text portions of the ticket.) I did not find options for removing the HTML with Tidy.

So, my next thought was to try Regular Expressions (Regex). Certainly it ought to be doable. Just Regex is hard. No, difficult. No, turn your hair gray at 22. But, it can do anything if you put your mind to it. And I ran across RegExr which really simplified the process by showing how my pattern worked in sample content.

In the end I made a simple shell script to clean up the files.

#!/bin/bash
#############################################################
# Convert HTML to plaintext using sed.
# Created by Ezra Freelove, email
#############################################################
# Variables
WORKINGDIR=/stage/$1
if [ -d "$WORKINGDIR" ] ; then echo "… found dir; continuing" ; else echo "… missing dir ; bailing" ; exit 1; fi
DESTDIR=${WORKINGDIR}/fixed
# Change to the working directory; the loop below picks up every *.txt file there.
cd "$WORKINGDIR" || exit 1
# Fix the files
mkdir -p "$DESTDIR"
for WORKINGFILE in *.txt
do
sed -e 's|<br[\ \/]*>|\n|g' -e 's/<[^!>]*>//g' -e 's/&nbsp;/ /g' -e 's/&lt;/</g' -e 's/&gt;/>/g' "$WORKINGFILE" > "${DESTDIR}/fixed_${WORKINGFILE}"
done

The regexes are:

  • s|<br[\ \/]*>|\n|g matches HTML <br> tags (including variants such as <br/> and <br />) and replaces each with a newline character. The <br> tag tells a web browser to go to the next line.
  • s/<[^!>]*>//g matches a less than (<), then any run of characters that are neither an exclamation point nor a greater than, up to the closing greater than (>), and deletes the whole match. This handles the HTML elements and their attributes, things like <p class="MsoPlainText"> or </span>. For some reason the date and username of the person who updated the ticket are stored as <! 2017-02-03 username>, so I had to figure out how to keep them. Excluding the exclamation point does that.
  • s/&nbsp;/ /g matches the text “&nbsp;”, which is a non-breaking space, and replaces it with a normal space.
  • s/&lt;/</g replaces the text “&lt;” with a “<”. And finally s/&gt;/>/g does the same thing for greater than.
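To see the expressions work together, here is a minimal sketch that wraps them in a function and runs them over one made-up line. The function name strip_html and the sample text are my own inventions for illustration, and it assumes GNU sed (for the \n in the replacement).

```shell
#!/bin/bash
# strip_html: hypothetical helper wrapping the same sed expressions as the script.
# Reads HTML on stdin and writes plain text to stdout.
# Assumes GNU sed, which turns \n in the replacement into a newline.
strip_html() {
  sed -e 's|<br[\ \/]*>|\n|g' \
      -e 's/<[^!>]*>//g' \
      -e 's/&nbsp;/ /g' \
      -e 's/&lt;/</g' \
      -e 's/&gt;/>/g'
}

# Made-up sample line in the style of a ticket update.
sample='<! 2017-02-03 username><p class="MsoPlainText">Disk&nbsp;usage &gt; 90%<br />Please check.</p>'

# printf '%s\n' keeps the literal % in the sample from being read as a format.
printf '%s\n' "$sample" | strip_html
```

The <! 2017-02-03 username> marker survives because of the excluded exclamation point, the <br /> becomes a line break, and the entities come back as plain characters.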

An easy way to match all of these latter ones would be pretty cool, but I think dealing with the most common ones is good enough.

Initially I was going to remove all the character codes like &nbsp;. In the end, I decided that converting the ones I handled should help people. The rarer ones can be worked out easily if someone runs across them.
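If you want to know which character codes are still left over after the cleanup, something like the following sketch would count them. The function name list_entities and the throwaway sample file are assumptions for illustration. Real use would point it at the fixed_*.txt files.

```shell
#!/bin/bash
# list_entities: hypothetical helper that counts the HTML character
# codes (such as &amp; or &#39;) still present in the given files.
list_entities() {
  # -o prints only the matching part, -h drops the file name prefix,
  # -E enables the + quantifier.
  grep -o -h -E '&[#A-Za-z0-9]+;' "$@" | sort | uniq -c | sort -rn
}

# Example with a throwaway file.
tmpfile=$(mktemp)
printf '%s\n' 'Tom &amp; Jerry &amp; friends &copy; 2017' > "$tmpfile"
list_entities "$tmpfile"
rm -f "$tmpfile"
```

The most frequent codes come out first, which makes it easy to decide whether any of them deserves its own sed expression.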

Another Way to Verify Cookie Domain

Just finished an Oracle WebLogic Server 11g: Administration Essentials class today. So there are lots of things floating about in my head I want to try. (Thankfully we have lots of development clusters for me to break beyond repair. Kidding. Sorta.)

One of the common support questions Blackboard asks those of us CE/Vista clients running a cluster is whether we have changed the cookie domain in weblogic.xml. This setting specifies where the JSESSIONIDVISTA cookie is valid. By default the value in the weblogic.xml file is set to .webct.com, which is not valid anywhere (not even Blackboard.com). One of the install steps for a cluster is to go to the WebLogic domain directory on the administrator node, run some commands to extract weblogic.xml from the WAR file, edit it, then run some commands to add it back. Placing an empty file named “REFRESH” on all the managed nodes deletes the staged and cached copies of the WAR.

No big deal and easy.

Except when it isn’t?

Occasionally someone will distrust your work and want you to verify the right setting is there. Normally they say to extract the weblogic.xml again and verify it is correct there. I had a thought. Why not verify that each managed node’s cache has the correct value?

It is easier than it sounds. In the WebLogic domain directory (where setEnv.sh is located), change directories to

$WL_DOMAIN/servers/node_name/tmp/_WL_user/webct

(NOTE: Anything I put in bold is specific to your environment, so I cannot anticipate what you would use there.)

Here I just used these greps to look for my domain. If I get results for the first one, then all is well. If I don’t get results for the first, then the second one should confirm the world is falling because we are missing the cookie domain.

grep ".domain.edu" */war/WEB-INF/weblogic.xml
grep ".webct.com" */war/WEB-INF/weblogic.xml
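The two greps can be folded into one check that says which case you are in. This is only a sketch: check_cookie_domain is a name I made up (it is not part of WebLogic or CE/Vista), and the path and domain in the usage comment are placeholders. It uses grep -F so the dots in the domain are matched literally instead of as regex wildcards.

```shell
#!/bin/bash
# check_cookie_domain: hypothetical helper, not part of WebLogic or CE/Vista.
# $1 = the node's cache directory (the .../tmp/_WL_user/webct directory)
# $2 = the cookie domain you expect, e.g. .domain.edu (placeholder)
check_cookie_domain() {
  local cache_dir=$1 expected=$2
  # -q only sets the exit status; -F treats the pattern as a fixed string.
  if grep -qF -- "$expected" "$cache_dir"/*/war/WEB-INF/weblogic.xml 2>/dev/null; then
    echo "OK: $expected found"
  elif grep -qF -- ".webct.com" "$cache_dir"/*/war/WEB-INF/weblogic.xml 2>/dev/null; then
    echo "PROBLEM: still the default .webct.com"
    return 1
  else
    echo "UNKNOWN: neither value found"
    return 2
  fi
}

# Usage (path and domain are placeholders for your own values):
# check_cookie_domain "$WL_DOMAIN/servers/node_name/tmp/_WL_user/webct" ".domain.edu"
```

The non-zero return codes make it easy to call from another script and stop when a node has the wrong value.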

Since we use dsh for a lot of this kind of thing, I would use our regex for the node name and add on the path pieces in common. I have not yet studied the pieces between webct and war to know for certain how they are derived, except to say they appear to be 6 characters long and sufficiently random as to not repeat. Any [ejw]ar exploded into the cache appears to get a unique one. So this might work?

grep ".domain.edu" $WL_DOMAIN/servers/node_name_regex/tmp/_WL_user/webct/??????/war/WEB-INF/weblogic.xml

If not, then try:

cd $WL_DOMAIN/servers/node_name_regex/tmp/_WL_user/webct/ && pwd && grep ".domain.edu" */war/WEB-INF/weblogic.xml
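For a host that holds more than one managed node under the same domain directory, a loop over the cached copies might look like the sketch below. The function name report_nodes is hypothetical, and I am assuming every node follows the servers/node_name/tmp/_WL_user/webct layout shown earlier. Running it across machines with dsh is left out.

```shell
#!/bin/bash
# report_nodes: hypothetical helper that checks every cached weblogic.xml
# found under one WebLogic domain directory on the local host.
# $1 = domain directory (what the post calls $WL_DOMAIN)
# $2 = expected cookie domain, e.g. .domain.edu (placeholder)
report_nodes() {
  local wl_domain=$1 expected=$2 xml
  for xml in "$wl_domain"/servers/*/tmp/_WL_user/webct/*/war/WEB-INF/weblogic.xml; do
    # If the glob matched nothing it stays literal, so skip non-files.
    [ -f "$xml" ] || continue
    if grep -qF -- "$expected" "$xml"; then
      echo "OK      $xml"
    else
      echo "MISSING $xml"
    fi
  done
}

# Usage (placeholders for your own values):
# report_nodes "$WL_DOMAIN" ".domain.edu"
```

Printing the full path on each line shows which node and which cache directory a result came from, much like the pwd in the command above.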

I’m envisioning using this method to verify a number of different things on the nodes. In particular, it confirms that the managed node received what I expected, not just that the admin node has the correct something.