Slackers and IT

Go read “Science Fiction Is for Slackers.”

As a rule, science fiction may be the laziest of all genres, not because the stories themselves are too facile—they can be just as sophisticated and challenging as those of any other genre—but because they often revel in easy solutions: Why walk when you can warp? Why talk when you’re a telepath? Technology in such stories typically has more to do with workarounds than it does with work.

I do love science fiction, from robots and AI to star travel to virtual reality. I love it all. I may even love it BECAUSE of the laziness. I’d love to have all these things to make my life better. And much of science fiction influences technologists to make the fiction a reality.

The mockumentary How Shatner Changed the World talks about the technologies of Star Trek and how scientists work toward making them reality. Faster-than-light travel and cybernetics are still aspirational, but cell phones and personal computers were shaped by technologists familiar with the show and movies.

At times I worry about automation putting me out of a job, but then I remember my career goal has always been to replace myself with a tiny shell script. Why click when I can script? Why script when I can tell an AI to handle it? Sure, it takes away some of my responsibilities, but what I am supposed to do has always changed anyway. And I get better, more challenging work when I free myself from mundane tasks.

Guess this is why I told Puppet Labs my job title is Automation Evangelist. The enthusiasm is not universal, though. I have allies, but convincing people of the good in automation is much like changing their religion.

Back in college I was encouraged to become a librarian. More specifically, people thought I should become an automation librarian. I guess the automation part stuck?

Listing Lists

A mini project is to hand over the course packages from the prior product to each of our clients. A good idea was to include a list of the files so that down the road, if something turns up missing, we can point to the list in the ticket showing exactly what they received.

So I wrote this shell script to make the lists for me. (Well, really, the analyst doing the hard work wanted to know if he should make the list. I told him I could do it really easily through Linux.) This is because I am talking about 385,528 courses and 37 targets. The first step generates a list of the clients (schools) involved. Next, the path to where the files are stored contains two directory names I need (the cluster and the school), so I pull them out of the path. The list itself is generated with a find command, stripping out the “./” at the beginning and writing the results to a file. Finally, I check the size and number of lines of each file.

# Find the bak directory for each school under the base directory.
SCHOOLLIST=`find /${BASEDIR} -name bak`
for SCHOOLDIR in $SCHOOLLIST
do
    cd $SCHOOLDIR
    # The cluster and school names are the third and fourth parts of the path.
    SCHOOL=`pwd | awk -F\/ '{print $4}'`
    CLUSTER=`pwd | awk -F\/ '{print $3}'`
    # List every backup package, strip the leading "./", and write the list file.
    find . -name "*.bak" | sed -e 's|^\./||g' > /${BASEDIR}/${CLUSTER}/${SCHOOL}/course_list_${SCHOOL}.txt
    head /${BASEDIR}/${CLUSTER}/${SCHOOL}/course_list_${SCHOOL}.txt
done
# Check the size and the number of lines of each list.
ls -lh /${BASEDIR}/*/*/course_list*
wc -l /${BASEDIR}/*/*/course_list*

Since each course is on its own line, I can compare these numbers to other known numbers of courses.
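
Something like the following works as a quick total check, though it is just a sketch; wc -l prints a grand total line when handed multiple files, so the tail and awk simply pull out that number to compare against the 385,528 mentioned above.

TOTAL=`wc -l /${BASEDIR}/*/*/course_list* | tail -1 | awk '{print $1}'`
echo "Listed ${TOTAL} courses; expecting 385528"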

So nice to get the computer to work for me. Purely by hand this would have taken days. It took about half an hour to craft the core and make sure it looked right, then another half hour to get the loop working right.

Of course, I need to figure out how to do this in PowerShell. 🙂

Content Migration Progress Tracking

When moving hundreds of thousands of courses from WebCT Vista to Desire2Learn, keeping track of what made it through which stage seems like an obvious thing to do in hindsight. I added that last bit because we only started tracking once the things falling through the cracks began to pile up. The basic process…

    1. Through Oracle SQL we populate a middleware database with those courses which meet acceptable criteria.
      CRACK: With the wrong criteria, courses are not even selected.
    2. Through Oracle SQL we generate XML listings of the courses in sets of 50.
      CRACK: Only a subset of the data loaded into the database may get extracted.
    3. A shell script automates running the WebCT command-line backup process to create a package for each course.
      CRACK: The command-line backup fails on some courses.
    4. Desire2Learn scripts pick up the files and convert WebCT formatted packages to Desire2Learn.
      CRACKS: Packages that are too big fail. Paths that are too long fail. This step can also fail to create the CSV files needed for the next step.
    5. Converted packages are placed in a queue to be imported into Desire2Learn.
      CRACKS: WebCT Vista courses can have 1,000 characters in the name, and D2L fails if there are more than 50. A course named the same as a previously loaded one but with a different file name loads both into the same course.

So, there are apparently five different stages and eight potential failures to track, with no end-to-end tracking to even know what is missing. Which means inventing something to check the logs for errors.

First thing, I tried writing SQL to create an ordered list of the courses that are available.

The WebCT backups were a little tougher to convert into a useful list. The file names follow the format of Course_Parent_1234567890.bak. They were also in various directories, so I ended up doing something like this to get a list of the files, strip off the parents & time stamps, strip off the directories, and order it.

ls */*.bak | awk -F_ '{print $1}' | awk -F\/ '{print $2}' | sort

So I have two ordered lists. Anything in the D2L one and not in the WebCT one ought to be the work of clients making their own stuff. Anything in the WebCT one and not in the D2L one ought to be my missing ones. Except almost every line was a mismatch.
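
Something like comm can split out the differences between two sorted lists; the file names here are made up for illustration.

comm -23 webct_list.txt d2l_list.txt    # only in the WebCT list: candidates for my missing courses
comm -13 webct_list.txt d2l_list.txt    # only in the D2L list: probably client-created courses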

Visually comparing them, I realized the same courses had in effect different names. All the spaces were converted to underscores. So I ended up adding a sed to convert spaces to underscores after the sort.
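
With that bolted on, the earlier one-liner becomes something like this (the output file name is just for illustration):

ls */*.bak | awk -F_ '{print $1}' | awk -F\/ '{print $2}' | sort | sed -e 's/ /_/g' > webct_list.txt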

Then I wrote some more scripts.

    • Go through the logs from #4 and #5 to display the errors. With that I was able to compare my list of missing courses against the errors and confirm why they did not come through.
    • Some of the cracks can be addressed by making new import files. So I wrote a script to add those lines to a redo.csv file. Touch up the problems and go. (A rough sketch of both scripts follows this list.)
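
Neither script is worth posting in full, so here is only a rough sketch; the log location, the error text being grepped for, and the fixable_orgcodes.txt list are placeholders rather than the real thing.

LOGDIR=/path/to/conversion/logs
IMPORTCSV=/path/to/original_import.csv

# Show the errors from the step 4 and step 5 logs.
grep -i error ${LOGDIR}/*.log

# For the cracks fixable by re-importing, copy the matching lines from the
# original import file into redo.csv to touch up and resubmit.
for ORGCODE in `cat fixable_orgcodes.txt`
do
     grep "^${ORGCODE}," $IMPORTCSV >> redo.csv
done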

Basically, at this point this only covers steps 3-5. At some point I am going to have to check all of steps 1-5.

OrgCode Duplicate Filter

I was asked to work my “Unix magic”.

The problem? Duplicate courses were spooled and converted from the WebCT format to the Desire2Learn format. The conversion process creates an import file using the WebCT SourcedId.Id as the OrgCode. The first time an OrgCode is used, it creates a course. The next and subsequent times, it duplicates content into that same course. So these duplicate converted courses left us, in a word, screwed.

Fortunately our partners at Desire2Learn intercepted the problem before it got worse.

Out of 1,505 still to be imported, there were 468 duplicates. Yes, 31% duplicates.

D2L asked me to filter the imports to remove the duplicates. I said I was too much of a n00b with Windows to pull it off.

The reply was to use Unix.

Boy, do I love Bash shell scripting. In two hours I solved it. Though, coming off the high of solving in two hours something I had no idea how to write this morning, I keep thinking there must be something wrong with it.

My general idea was to read the file line by line and write a line to a filtered.csv file only if its OrgCode had not already been written. To start, I just tried to exactly duplicate my existing file in another file by reading it line by line.

A while loop reads each line and records the whole line in a variable.

INPUTFILE=/path/to/file.txt
# Point the script's standard input at the file, then read it line by line.
exec < $INPUTFILE
while read LINE
do
     # ... do stuff with each $LINE here ...
done

I quickly discovered, though, that because Windows paths use the backslash, echo would not write every line to the file exactly as it appeared; the backslash escapes the next character. Neither double nor single quotes helped the situation. Oops. So I decided to use sed to make a temporary copy of the file with the backslashes doubled. The first backslash escapes the next character, in this case the second backslash.

sed -e 's|\\|\\\\|g'

As an error check, the last thing the script does is a diff -u to compare the source and new files. At this stage, no output means a perfect copy. I like -u because it gives me easier-to-read results.

So I was able to get an exact copy of the original. All that was left was to get the OrgCode, check it against my filtered file, and, if it did not exist, add it to the end of the filtered file.

# The OrgCode is the first comma-separated field of the line.
ORGCODE=`echo "$LINE" | awk -F, '{print $1}'`
# Only write the line out if that OrgCode is not already in the filtered file.
IF_EXISTS=`grep "$ORGCODE" $FILTERFILE`
if [ -z "$IF_EXISTS" ] ; then
     echo "$LINE" >> $FILTERFILE
fi
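
Put together, the whole thing looks roughly like this; the file names are placeholders, and the real script differs in detail.

INPUTFILE=/path/to/imports.csv
TMPFILE=/path/to/imports_escaped.csv
FILTERFILE=/path/to/filtered.csv

# Double the backslashes so the Windows paths survive the read/echo round trip.
sed -e 's|\\|\\\\|g' $INPUTFILE > $TMPFILE

# Start with an empty filtered file.
> $FILTERFILE

exec < $TMPFILE
while read LINE
do
     # The OrgCode is the first comma-separated field.
     ORGCODE=`echo "$LINE" | awk -F, '{print $1}'`
     # Keep the line only if this OrgCode has not been written yet.
     IF_EXISTS=`grep "$ORGCODE" $FILTERFILE`
     if [ -z "$IF_EXISTS" ] ; then
          echo "$LINE" >> $FILTERFILE
     fi
done

# The diff now shows exactly which duplicate lines were dropped.
diff -u $INPUTFILE $FILTERFILE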

Easy. Too easy?

The checks against my work confirmed it worked.

    1. A sorted version of the source, run through this and compared with diff -u, consistently showed the correct lines were excluded.
    2. The count of duplicates matches the difference in line counts between the source and the filtered file.
    3. A check for duplicate OrgCodes returns nothing on the filtered file. (A sketch of that check follows.)
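
The duplicate-OrgCode check is a one-liner; sort and uniq -d print any OrgCode that appears more than once, so silence is a pass. The count check is just wc -l on both files. The file names match the sketch above.

awk -F, '{print $1}' filtered.csv | sort | uniq -d

wc -l imports.csv filtered.csv    # the difference should match the 468 duplicates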