Rote Loading

With this specific application, we can import data, but there are limitations due to its 2000-era handling of XML files.

  1. An HTML form uploading a file has to…
    1. Get all the packets received by the server.
    2. Have the server process the file while the browser connection is still open.
    3. Have the server tell the browser everything was received and is done.
  2. All of this has to happen as one action within a five-minute window.

A better method would allow just uploading the files to a page. Background processes would monitor that location and process the files independent of the browser. Notifications could then be sent to alert the user that the processing is done.
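As a rough sketch of what I mean, assuming inotify-tools is available; the drop directory, the import_xml.sh helper, and the notification address are all placeholders, not anything this application actually has:

#!/bin/bash
# Sketch of a drop-folder watcher, not the application's actual design.
# Requires inotify-tools; every path and helper here is hypothetical.
inotifywait -m -e close_write --format '%f' /stage/incoming |
while read -r FILE
do
    case "$FILE" in
        *.xml)
            # import_xml.sh is an imaginary loader for one file.
            ./import_xml.sh "/stage/incoming/$FILE" \
                && echo "Processed $FILE" | mail -s "Import done" user@example.edu
            ;;
    esac
done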

Or… have the application recognize an echoed XML file when the upload took too long, and either prevent it from being loaded or remove the data.

I ended up figuring out that if we split the files at about 5,000 records, processing should take about half of the five-minute window. For most files I have seen, that holds true. About one in ten takes so much longer that, had I cheated and aimed closer to the full five minutes, I would now be deleting duplicates. (The variation comes from file size; some files are 50MB and others 100MB.)
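For what it is worth, the splitting itself can be scripted. A minimal sketch, assuming each record sits on its own line in the export (the real format may need smarter handling of any XML header and footer, and the file names here are made up):

#!/bin/bash
# Hypothetical splitter: chunk a large export at 5,000 records per file.
# Assumes one record per line; records.xml and the chunk names are invented.
awk -v max=5000 '
    NR % max == 1 { file = sprintf("chunk_%04d.xml", ++n) }
    { print > file }
' records.xml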

The grumbling of this post is that I am on the 25th of 58 files. This is tedious. I am lamenting not having created a curl script to do this part for me. Automation is perfect for things like this.
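Something along these lines is what I should have written. A sketch only, since the real URL and form field name would have to come from the application's upload page; app.example.edu and importfile are placeholders:

#!/bin/bash
# Hypothetical upload loop; the URL and form field name are made up.
for FILE in chunk_*.xml
do
    echo "Uploading $FILE ..."
    curl --fail --max-time 300 -F "importfile=@${FILE}" \
        https://app.example.edu/import || break
done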


Fun With Regex

We replaced the old ticketing system with a new one. Naturally, there are people who are concerned about losing access to old tickets, so we looked at exporting all the tickets. My coworker had the better method of getting the data out, with one issue.

Because the old system used an HTML editor for a specific textarea, the content in it was difficult to read without expertise in HTML. Fine for a former Webmaster like myself, but few of the people who will need this can read it like they do English.

My first thought was to look for products that clean up HTML. I even got excited when I noticed HTML Tidy comes with our Linux OS, but that just converted the HTML to a standardized format of HTML. (And trashed the plain-text portions of the ticket.) I did not find options for removing the HTML with Tidy.

So, my next thought was to try Regular Expressions (Regex). Certainly it ought to be doable. It's just that Regex is hard. No, difficult. No, turn-your-hair-gray-at-22 hard. But it can do anything if you put your mind to it. And I ran across RegExr, which really simplified the process by showing how my pattern worked against sample content.

In the end I made a simple shell script to clean up the files.

#!/bin/bash
#############################################################
# Convert HTML to plaintext using sed.
# Created by Ezra Freelove, email
#############################################################
# Variables
WORKINGDIR=/stage/$1
# Bail out early if the staging directory does not exist.
if [ -d "$WORKINGDIR" ] ; then echo "… found dir; continuing" ; else echo "… missing dir; bailing" ; exit 1 ; fi
DESTDIR=${WORKINGDIR}/fixed
# Make a list of files to convert.
cd "$WORKINGDIR"
WORKINGLIST=`ls *.txt`
# Fix the files
mkdir -p "$DESTDIR"
for WORKINGFILE in $WORKINGLIST
do
  sed -e 's|<br[\ \/]*>|\n|g' -e 's/<[^!>]*>//g' -e 's/&nbsp;/ /g' -e 's/&lt;/</g' -e 's/&gt;/>/g' "$WORKINGFILE" > "${DESTDIR}/fixed_${WORKINGFILE}"
done
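Usage looks like this, assuming the script is saved as fixhtml.sh (my name for it, not anything official) and the exported tickets were dropped in /stage/tickets:

chmod +x fixhtml.sh
./fixhtml.sh tickets
ls /stage/tickets/fixed/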

The regexes are:

  • s|<br[\ \/]*>|\n|g, which matches HTML <br> tags and replaces each with a newline character. The <br> tag tells a web browser to go to the next line.
  • s/<[^!>]*>//g, which matches from a less-than (<) out to the next greater-than (>), so long as nothing in between is an exclamation point, and deletes everything matched. This handles the HTML elements and their attributes, like <p class="MsoPlainText"> or </span>. For some reason the date and username of the person who updated the ticket are stored as <! 2017-02-03 username>, so I had to figure out how to keep them.
  • s/&nbsp;/ /g, which matches the text "&nbsp;", a non-breaking space, and replaces it with a normal space.
  • s/&lt;/</g, which replaces the text "&lt;" with a "<". And finally the same thing but for greater than.

An easy way to match all of these character entities at once would be pretty cool, but I think dealing with the most common ones is good enough.

Initially I was going to remove all the character codes like &nbsp;. In the end, I decided that the ones I handled should help people. The rarer ones can be deciphered easily enough if someone runs across them.
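If a few more entities ever matter, extra expressions can be appended to the same sed command. A sketch with the ones I would guess come up next (the file name is hypothetical; note the \& — in a sed replacement, a bare ampersand means the whole match):

# Hypothetical additions for a few more common entities.
sed -e 's/&amp;/\&/g' -e 's/&quot;/"/g' -e "s/&#39;/'/g" fixed_ticket.txt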

Interactive Archives

My jaw dropped at the end of this blog post, Cloud Hosting and Academic Research:

There is a value in keeping significant old systems around, even if they no longer have active user bases.  A cloud hosting model seems so right to me–it’s scalable and robust. It just makes sense. But the hosting costs are a problem. Even if the total amount of money is small, grants are for specific work and have end dates. I can still be running a 10+ year old UNIX box, but I can’t still be paying hosting fees for a research project whose funding ended years ago, no matter how small that bill is.  Grants end–there’s no provision for “long term hosting.”  Our library can help us archive data, but they are not yet ready to “archive” an interactive system.  I hope companies that provide hosting services will consider donating long-term hosting for research.

Opening up a new area of digital archives by preserving the really cool works of the faculty seems like something I might enjoy.

My mentor in web design and server administration might have been described as a pack rat. He… Well, I guess, we kept around versions of web pages a decade old. Nothing ever really got deleted; the public just never saw it, thanks to file permissions.

When building my portfolio, my mistake was not gathering up all the files needed to replicate the sites I designed. I'm no longer doing web design or even programming, so it is okay.

A professor in Geology had a pretty cool Virtual Museum for Fossils. The site moved around a few times, eventually ending up on the main web server that also hosts WWW. Of course, HTML, images, and Flash files are easy to archive: take the files and place them on a web server. Since they are static, they are easy to keep around for a long time. As long as the standards remain honored, they should be good, though web browser developers are under pressure to chase the new, which can eventually abandon the old.

Scripted web sites using Perl, PHP, ASP, JSP, JavaScript, or AJAX require a working interpreter, and even then some things might not be backwards compatible.

About a year ago my mother ran across some 8mm video film. An uncle found a place that converted it to DVD. Will we even be using DVDs in a decade? Maybe the 8mm needs to go on Blu-ray?

Going back to the scripted web sites, should an archived web site's code be updated to work on the new version of the interpreter? Maybe. If makers of the interpreters allowed running in a backwards-compatible mode, then all would be good. Even better would be the ability to add a variable to a script telling the interpreter which back version to pretend to be. Administrators could then have programmers check non-working scripts just by telling the interpreter to simulate an older version.

Apple Trying To Poach IE6 Users

Attempted to watch the Transformers 3 trailer, but apparently Chrome on Linux was a no-go for the JavaScript that hides the web site and displays the trailer. Fancy, but broken. So I thought I would look at the HTML and grab the .mov file. I found this snippet of code in the HTML quite interesting.

<!--[if lt IE 7]>
<div id="ie6-message">
<h2>You are currently using an outdated browser.</h2>
<p>Please upgrade to a <a href="http://www.apple.com/safari/">modern browser</a> to fully experience this site.</p>
</div>
<![endif]-->

Where most places would have someone upgrade to a newer version of the software they are currently using, Apple is trying to poach Microsoft users. Bravo! Bravo!