Rants, Raves, and Rhetoric v4

OrgCode Duplicate Filter

I was asked to work my “Unix magic”.

The problem? Duplicate courses were spooled and converted from the WebCT format to the Desire2Learn. The conversion process creates an import file using the WebCT SourcedId.Id as the OrgCode. The first time the OrgCode is used, it creates a course. The next and subsequent times, it duplicates content. So these duplicate converted courses gave us a situation where we were screwed.

Fortunately our partners at Desire2Learn intercepted the problem before it got worse.

Out of 1,505 still to be imported, there were 468 duplicates. Yes, 31% duplicates.

D2L asked me to filter the imports to remove the duplicates. I said I am too much of a n00b with Windows to pull it off.

The reply was to use Unix.

Boy do I love Bash shell scripting. In two hours I solved it, though after the high of solving something I had no idea how to write this morning in two hours, there must be something wrong with it.

First, my general idea was to read the file line by line and write those lines with OrgCodes that do not yet exist to a filtered.csv file. I started out looking to exactly duplicate my existing file in another file by reading it line by line.

A while loop which reads each line and records the whole line in a variable.

INPUTFILE=/path/to/file.txt
exec<$INPUTFILE
while read LINE
do
     stuff...
done

I quickly discovered though that since Windows uses the backslash, that foiled the ability of echo to exactly write every line to a file. The backslash escapes the next character. Neither double nor single quotes helped the situation. Oops. So I decided to use sed to make a temporary copy to duplicate the backslashes. A first backslash escapes the next character, in this case a second backslash.

sed -e 's|\\|\\\\|g'

As an error check, the last thing the script does is a diff -u to compare source and new files. At this stage nothing means perfect. I like the -u to give me easier to read results.

So I was able to get an exact copy of the original. All that was next was to get the OrgCode, check it against my filtered file, and if it did not exist, then add it to the end of the filtered file.

ORGCODE=`cat $LINE | awk -F, '{print $1}'`
IF_EXISTS=`grep $ORGCODE $FILTERFILE`
if [ -z IF_EXISTS ] ; then
     echo $LINE >>  $FILTERFILE
fi

Easy. Too easy?

The checks against my work confirmed it worked.

    1. A sorted version of the source run through this and compared in diff -u consistently showed the correct lines were excluded.
    2. Counts for the number of duplicates and the difference of lines missing works.
    3. A check for the number of duplicate OrgCodes returns nothing on the filtered file.

Posted

in

by

Comments

Leave a Reply