Fun With Regex

We replaced the old ticketing system with a new one. Naturally there are people who are concerned about losing access to old tickets. So we looked at exporting all the tickets. My coworker had the better method of getting the data out, with one issue.

Because the old system used an HTML editor for a specific textarea, the content in it was difficult to read without expertise in HTML. Fine for a former Webmaster like myself, but few of the people who will need this read HTML like they do English.

My first thought was to look for products that clean up HTML. I even got excited when I noticed HTML Tidy comes with our Linux OS, but that just converted the HTML to a standardized form of HTML (and trashed the plain-text portions of the ticket). I did not find options for removing the HTML with Tidy.

So, my next thought was to try Regular Expressions (Regex). Certainly it ought to be doable. Just Regex is hard. No, difficult. No, turn your hair gray at 22. But, it can do anything if you put your mind to it. And I ran across RegExr which really simplified the process by showing how my pattern worked in sample content.

In the end I made a simple shell script to clean up the files.

#!/bin/bash
#############################################################
# Convert HTML to plaintext using sed.
# Created by Ezra Freelove, email
#############################################################
# Variables
WORKINGDIR=/stage/$1
if [ -d "$WORKINGDIR" ] ; then echo "… found dir; continuing" ; else echo "… missing dir ; bailing" ; exit 1; fi
DESTDIR=${WORKINGDIR}/fixed
# Work from the directory that holds the files to convert.
cd "$WORKINGDIR" || exit 1
# Fix the files
mkdir -p "$DESTDIR"
# Loop over the .txt files directly rather than parsing ls output.
for WORKINGFILE in *.txt
do
sed -e 's|<br[\ \/]*>|\n|g' -e 's/<[^!>]*>//g' -e 's/&nbsp;/ /g' -e 's/&lt;/</g' -e 's/&gt;/>/g' "$WORKINGFILE" > "${DESTDIR}/fixed_${WORKINGFILE}"
done

The regexes are:

  • s|<br[\ \/]*>|\n|g which means match HTML <br> tags and replace each with a newline character. The <br> tag tells a web browser to go to the next line.
  • s/<[^!>]*>//g which means match from a less than (<) out to the next greater than (>), unless an exclamation point appears in between, and delete the whole match. This handles the HTML elements and their attributes, things like <p class="MsoPlainText"> or </span>. For some reason the date and username of the person who updated the ticket are stored as <! 2017-02-03 username>, so I had to figure out how to keep them.
  • s/&nbsp;/ /g which means match the text “&nbsp;”, which is a non-breaking space, and replace it with a normal space.
  • s/&lt;/</g which means replace the text “&lt;” with a “<”. And finally the same thing but for greater than.
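To see the five expressions work together, here is a small sketch run against one made-up line. The sample text and the strip_html function name are my own inventions for illustration; only the sed expressions come from the script above. It assumes GNU sed, which is what a Linux box normally has, because other sed versions may not turn \n in the replacement into a newline.

```shell
#!/bin/bash
# Hypothetical sample line; the markup mimics what the post describes.
sample='<p class="MsoPlainText">Disk&nbsp;full<br />usage &gt; 90%</p><! 2017-02-03 username>'

# strip_html is an illustrative wrapper around the same sed expressions.
strip_html() {
  sed -e 's|<br[\ \/]*>|\n|g' -e 's/<[^!>]*>//g' -e 's/&nbsp;/ /g' -e 's/&lt;/</g' -e 's/&gt;/>/g'
}

cleaned=$(printf '%s\n' "$sample" | strip_html)
printf '%s\n' "$cleaned"
# Prints:
# Disk full
# usage > 90%<! 2017-02-03 username>
```

The <p> and </p> tags disappear, the <br /> becomes a line break, and the <! ...> marker survives because the character right after its < is an exclamation point.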

An easy way to match all of these latter ones would be pretty cool, but I think dealing with the most common ones is good enough.

Initially I was going to remove all the character codes like &nbsp;. In the end, I decided that the ones I handled should help people. The rarer ones can be worked out easily if someone runs across them.

March of the Machines (Automation)

Saw a tweet about an interesting piece in ABC News Australia: Digital disruption: How science and the human touch can help employees resist the march of the machines. Basically, many jobs are going away due to automation. WIRED has a similar story: Robots Will Steal Our Jobs, But They’ll Give Us New Ones.

One of the longest struggles I have pushed in my career is automation. My approach falls along the line of: if it is going to be done more than once or will take a really long time by hand, then it needs to be automated. This is hard to do. The temptation is to do it by hand once, see how it went, then write a script which does it for the next time. The trouble is that once the first one is completed and the second has not yet arrived, there is little incentive to write the script. Best is to make the automation part of doing it the first time; the second time can include any remediation necessary to improve it.

All this automation makes us more effective employees. My team of three managed hundreds of web servers and dozens of database servers for ten sites. Without automation that would have been a nightmare. The replacement product was more difficult to automate so with fewer servers we needed more people. Yet the drive to better automation is making lives easier. (Technically I left that program about a year ago when my replacement was hired and took over my spot in the on-call rotation.)

A fear I hear about automation is that people will lose their jobs. It reminds me of globalization and manufacturing moving overseas to China. Highly repetitive, mind-numbing jobs were the most at risk and as those work forces got better, what was at risk moved up the complexity ladder.

The fear of both globalization and automation led to books like A Whole New Mind. The idea is that if your job is highly repetitive or analytical, then it is at risk to these forces. Becoming the person who designs, describes, coordinates, or finds meaning in stuff (aka “right brain” activities) is the way to survive the coming storm. This book greatly influenced how I started thinking about my work.

Back in 2003, I automated everything I could because I was overwhelmed with work and had few resources beyond great computers and my own skill to make it better. My supervisees focused on meeting with the clients to talk about the web site they wanted and building that. I wrote code to report about or fix problems so people would not need to call or email about them.

Where I wish we would head is more like You Really Don’t Need To Work So Much. I meant to send this to my boss. (Maybe he’s reading this blog?) All our efficiencies should mean we have less to do, not more, so why do we work so hard?

The past fifty years have seen massive gains in productivity, the invention of countless labor-saving devices, and the mass entry of women into the formal workforce. If we assume that there is, to a certain degree, a fixed amount of work necessary for society to function, how can we at once be more productive, have more workers, and yet still be working more hours? Something else must be going on.

From my experience, the to-do list gets ever larger. Not because there is more to do, but because more is possible. I’d just rather spend more of my time on solving hard problems than easy repetitive tasks.

P.S. This post really only exists because I loved the phrase “March of the Machines” enough I wanted it as a title for something on this blog.

Check Backups

I get a daily report about backups for each of the projects. One particular one has shown some odd results. The report has columns for: Completed, Successful, Partial, Failed, Missed, and Active. The particular problem is that backups show up in none of those columns when they are actually still running. (So they should show up as Active?)

So the other day I wrote a Bash script to check some things. In the future I can quickly assess if this is the SNAFU (Situation Normal All F***ed Up) or something else. Really, the other day I had to dig back through my history for this information to do it manually, which is stupid.

#!/bin/bash
###########################################################
# Check backup status.
# 2015-JUL-24 Ezra Freelove, email@domain.com
###########################################################
# Look for running processes
echo "… Backup processes"
ps -ef | grep '[p]rocname'
# Report logs
echo "… Logs"
ls -ltr /path/to/agent.log /session/path/to/clientlogs/*.log
echo "=========="
# Specific lines excluding spam and blank
tail -100 /path/to/agent.log | grep -vf /home/me/myscripts/backups/exclude/ckrunning.txt | grep -v '^ *$'

If the backup is still running, then the ps will show the PID and time started.
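The brackets in the grep pattern are a common trick to keep grep from reporting its own process. The regex [p]rocname matches the text procname, but grep's own command line contains the literal text [p]rocname, which the regex does not match. A small sketch with made-up ps-style lines (the user names, PIDs, and --daemon flag are invented; procname is the same placeholder the script uses):

```shell
#!/bin/bash
# Two hypothetical lines standing in for ps -ef output:
# the real backup process and the grep that is searching for it.
ps_like_output='user 101 1 procname --daemon
user 202 150 grep [p]rocname'

# Only the first line contains the contiguous text "procname".
matches=$(printf '%s\n' "$ps_like_output" | grep '[p]rocname')
printf '%s\n' "$matches"
# Prints: user 101 1 procname --daemon
```

Quoting the pattern also keeps the shell from treating [p]rocname as a filename glob if a file named procname happens to sit in the current directory.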

In addition the logs are reported with the most recently changed at the end.

The agent talks to a central service to find out when it should be doing stuff. The exclude/ckrunning.txt file uses the entries below to ignore spammy lines where the agent is checking in but not told to do anything. In my check, this ignores about 70 of the 100 lines when everything is operating normally.

  • Sleeping
  • Workorder received: sleep
  • Requesting work

Long-term, I guess I should ask why the long-running backup does not appear in the Active category when it obviously should.

TED Talk: The Internet’s Immune System

I really enjoyed this TED Talk on hacktivists the first couple times I watched it a year ago and a few months ago. Not sure why I have not yet posted it.

The beauty of hackers, says cybersecurity expert Keren Elazari, is that they force us to evolve and improve. Yes, some hackers are bad guys, but many are working to fight government corruption and advocate for our rights. By exposing vulnerabilities, they push the Internet to become stronger and healthier, wielding their power to create a better world.

The Loss of Tech Support

I found a statement in Twitter is your IT support interesting:

For reasons I won’t go in to, I haven’t been able to get [a WordPress install with the FeedWordPress plugin] done at the Open University, despite trying since last July. I’ve spoken to people at others unis and it isn’t isolated to the OU, it seems to be this low-level, experimental type of IT support is increasingly difficult to find.

Do you know who I think the culprit is? The VLE. As universities installed VLEs they became experts at developing enterprise level solutions. This is serious business and I have a lot of respect for people who do it. The level of support, planning and maintenance required for such systems is considerable. So we developed a whole host of processes to make sure it worked well. But along the way we lost the ability to support small scale IT requests that don’t require an enterprise level solution. In short, we know how to spend £500,000 but not how to spend £500.

(For those of you non-British/European readers, VLEs are Virtual Learning Environments, which are often also called Learning Management Systems on this side of the Atlantic.)

It is true that higher education IT has changed with online class systems, but I think that is part of the symptom and not the cause. Chief Information Officers, Chief Academic Officers, and presidents all get recognition for big things. Enterprise level solutions are sexy because they make these leaders look decisive and effective. Employees who report to them know this, so enterprise level solutions have the priority. Everything else fits into the dwindling extra work time.

What extra time?

The good news though is the small things have gotten much easier for anyone to go off and do on their own. At my last job, I sat as an ex-officio member of the Faculty Senate technology committee. One of the hot topics one year was that a couple of faculty members taught students how to use the LMS adopted by another college system in the state. It was two courses. Should we spend $20,000/yr and take up a significant amount of my time running a second LMS? Or should they continue to pay $800/yr for Blackboard to do it? The answer ultimately was to continue with Blackboard. Nowadays, they probably would be directed to CourseSites. At the time my to-do list was several pages long and hundred-plus-hour weeks were not uncommon just to keep top and high priority items done on time. The ETA for anything not top or high priority was over a year.

I prefer working with innovative technologies. Custom solutions that require creative thinking and problem solving make me feel like I accomplished something special. They give the biggest rush. Enterprise level software is steak and potatoes, so it is the core. The enterprise is the minimum. I just wish I had more time to devote to going beyond the minimum than I did. Well, do. This is a top level decision. Improve staffing and flexible team management so that people can spend time working on the things that make them happier.

The Hacker Manifesto

UPDATE: I hung this on a pinboard in my work cube on May 4, 2007. I am surprised no one has asked me to pull it.

by
+++The Mentor+++
Written January 8, 1986

Another one got caught today, it’s all over the papers. “Teenager Arrested in Computer Crime Scandal”, “Hacker Arrested after Bank Tampering”…

Damn kids. They’re all alike.

But did you, in your three-piece psychology and 1950’s technobrain, ever take a look behind the eyes of the hacker? Did you ever wonder what made him tick, what forces shaped him, what may have molded him?

I am a hacker, enter my world…

Mine is a world that begins with school… I’m smarter than most of the other kids, this crap they teach us bores me…

Damn underachiever. They’re all alike.

I’m in junior high or high school. I’ve listened to teachers explain for the fifteenth time how to reduce a fraction. I understand it. “No, Ms. Smith, I didn’t show my work. I did it in my head…”

Damn kid. Probably copied it. They’re all alike.

I made a discovery today. I found a computer. Wait a second, this is cool. It does what I want it to. If it makes a mistake, it’s because I screwed it up. Not because it doesn’t like me… Or feels threatened by me.. Or thinks I’m a smart ass.. Or doesn’t like teaching and shouldn’t be here…

Damn kid. All he does is play games. They’re all alike.

And then it happened… a door opened to a world… rushing through the phone line like heroin through an addict’s veins, an electronic pulse is sent out, a refuge from the day-to-day incompetencies is sought… a board is found. “This is it… this is where I belong…” I know everyone here… even if I’ve never met them, never talked to them, may never hear from them again… I know you all…

Damn kid. Tying up the phone line again. They’re all alike…

You bet your ass we’re all alike… we’ve been spoon-fed baby food at school when we hungered for steak… the bits of meat that you did let slip through were pre-chewed and tasteless. We’ve been dominated by sadists, or ignored by the apathetic. The few that had something to teach found us willing pupils, but those few are like drops of water in the desert.

This is our world now… the world of the electron and the switch, the beauty of the baud. We make use of a service already existing without paying for what could be dirt-cheap if it wasn’t run by profiteering gluttons, and you call us criminals. We explore… and you call us criminals. We seek after knowledge… and you call us criminals. We exist without skin color, without nationality, without religious bias… and you call us criminals. You build atomic bombs, you wage wars, you murder, cheat, and lie to us and try to make us believe it’s for our own good, yet we’re the criminals.

Yes, I am a criminal. My crime is that of curiosity. My crime is that of judging people by what they say and think, not what they look like. My crime is that of outsmarting you, something that you will never forgive me for.

I am a hacker, and this is my manifesto. You may stop this individual, but you can’t stop us all… after all, we’re all alike.

Black Box Magic

[Photo: black boxes ttv]

With a black box system a person working with it sees what goes in and what comes out. The machine’s decision making process is obfuscated. Theories are made based on incomplete evidence about the behavior. Confirming the behavior with more data points across more situations is my way of becoming more comfortable that the theory is correct. Sometimes we lack the time or conscientiousness or even access to ensure the theory is correct. This leads to magical thinking like labeling the software in human-like terms, especially insane or stupid or seeking revenge.

With a white box system, a person working with it can see the machine’s logic used to make decisions. Theories can be made based on more complete evidence due to investigating the code to see what it is intended to do. The evidence is far more direct than testing more.

Systems today are so complex they tend to have many parts interacting with each other. Some will be of each type.

Then there are Application Programming Interfaces (APIs), which expose vendor-supported methods to interact with a black box by disclosing how those methods work.

Proprietary systems tend towards a black box model from the perspective of clients. This black box philosophy depends on the experts, employees of the company, to design the system so it works well and to resolve the issues with it. So there is no need for clients to know what it is doing. Where the idea breaks down is that clients who run the systems need to understand how they work to solve problems themselves. Sure the company helps. However, the client will want to achieve expertise to manage minor and moderate issues as much as possible. They want to involve the vendor as little as reasonably possible. Communities arise because peers have solved the client issues and an answer out of the vendor tends to be formulaic, an inaccurate company line, or suspect. Peers become the best way to get answers.

Open source systems tend toward a white box model from the perspective of clients. This white box philosophy depends on clients to take initiative figuring out issues and solutions to resolve them. Clients become the experts who design the system so it works well. Where the idea breaks down is some clients just want something that works and not to have to solve the problems themselves. Sure the open source community helps. Companies have arisen to take the role of the vendor for proprietary systems to give CIOs “someone to yell at about the product”. Someone else is better to blame than myself.

Cases of both the black and the white box will be present in either model. That is actually okay. Anyone can manage both. Really it is about personal preference.

I prefer open source. But that is only because I love to research how things work, engage experts, and the feel of dopamine when I get close to solving an issue. My personality is geared towards it. My career is based around running web services in higher education. Running something is going to be my preference. (Bosses should take note that when I say not to run something, this means it is so bad I would rather risk being obsolete than run it.)

This post came about by discussing how to help our analysts better understand how to work with our systems. It is hard to figure out how to fix something when you cannot look at the problem, the data about the problem, or do anything to fix it. So a thought was to give our analysts more access to test systems so they get these experiences solving problems.

Photo credit: black boxes ttv from Adam Graham at Flickr.

The Cause

Found I develop free software because of CUNY and Blackboard following the Blackboard security issues story. It is a really good blog post. This conclusion made me smile. I am certain there are plenty of people in the system I support who strongly agree. I just wish there was an easier way of finding and applauding them.

As long as our IT departments are dominated by Microsoft-trained technicians and corporate-owned CIOs, perhaps the best way to advance the cause – the cause of justice in the way that student money is spent – is to create viable alternatives to Blackboard and its ilk, alternatives that are free (as in speech) and cheap (as in beer). This, more than anything else, is why I develop free software, the idea that I might play a role in creating the viable alternatives. In the end, it’s not just about Blackboard, of course. The case of Blackboard and CUNY is a particularly problematic example of a broader phenomenon, where vulnerable populations are controlled through proprietary software. Examples abound: Facebook, Apple, Google. (See also my Project Reclaim.) The case of Blackboard and its contracts with public institutions like CUNY is just one instance of these exploitative relationships, but it’s the instance that hits home the most for me, because CUNY is such a part of me, and because the exploitation is, in this case, so severe and so terrible.

The training plan is to make me one of those “Microsoft-trained technicians”. It makes me feel stupider just thinking about it.

Notepad++ load langs.xml failed

Notepad++ is my Windows text editor. Work has a site license for UltraEdit. I bought a personal license for EditPlus back in 2002-ish. Notepad++ does what I need it to do without having to track down a license key.

This week I started getting this error when it starts.

Load langs.xml failed!

Apparently this happens enough that Google was able to suggest the search and pulled up a solution in the first result. Rob3C at Superuser.com recommends renaming the langs.xml out of the way and copying the langs.model.xml into place. It was thoughtful of the developers to provide a known-good default version on which to fall back.
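As a rough sketch, the rename-and-copy fix looks like this from a bash-style shell such as Git Bash on Windows. A temporary directory and stand-in file contents take the place of the real Notepad++ configuration folder, whose location I am not assuming here, and the langs.xml.broken name is just my choice for the set-aside copy.

```shell
#!/bin/bash
# A temp directory stands in for the real Notepad++ folder (location not assumed).
npp_dir=$(mktemp -d)
# Stand-in files: a truncated langs.xml and a well-formed langs.model.xml.
printf '<NotepadPlus>\n  <Languages>\n' > "$npp_dir/langs.xml"
printf '<NotepadPlus>\n  <Languages/>\n</NotepadPlus>\n' > "$npp_dir/langs.model.xml"

# Move the damaged file out of the way, keeping it for later inspection.
mv "$npp_dir/langs.xml" "$npp_dir/langs.xml.broken"
# Copy the default into place under the name Notepad++ loads.
cp "$npp_dir/langs.model.xml" "$npp_dir/langs.xml"

ls "$npp_dir"
```

Keeping the broken copy around instead of deleting it is what made it possible to look at what went wrong afterwards.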

I was curious what was wrong with the file. Turns out it just ended in the middle of line 100, which is the entry for ini, a style I would not have modified. (I mainly use SQL and bash.) Also, the bad one uses the wrong end style for the language element despite it being correct on adjacent lines. So the file is no longer valid XML. Very odd.