Counting

In the beginning I used TOTAL=`ls /d2lmig/*/*/bak/*/* | wc -l` to get a total count. All was good.

Until at around 55,000 files I got: -bash: /bin/ls: Argument list too long.

Then I used TOTAL=`find /d2lmig/*/*/bak/*/* -name '*.bak' | wc -l` to get a total count. All was good.

Until at around 90,000 files I got: -bash: /usr/bin/find: Argument list too long.
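
The culprit is the shell, not ls or find: it expands those globs into one huge argument list before the command ever runs, and there is a hard limit on how big that list can be. A simpler route (which I did not take) is to hand find just the top directory and let it do the walking. A minimal sketch, assuming everything relevant lives under /d2lmig:

# No globs for the shell to expand; find traverses the tree itself
find /d2lmig -name '*.bak' | wc -l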

It happened that I was already using a for/do/done loop to get a count for each bak directory. So I added this within the for loop:

TOTAL=`expr $TOTAL + $COUNT`
if [ "$RUNTYPE" = "INTERNAL" ] ; then echo " Running total = $TOTAL" ; fi

(INTERNAL is a value I pass at the command line that controls whether the email is sent to me or other parties.)
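
Put together, the counting part amounts to something like this (a sketch; the real loop does more per directory than just count):

TOTAL=0
for DIR in /d2lmig/*/*/bak ; do
  COUNT=`find $DIR -name '*.bak' | wc -l`
  TOTAL=`expr $TOTAL + $COUNT`
  if [ "$RUNTYPE" = "INTERNAL" ] ; then echo " Running total = $TOTAL" ; fi
done

The glob in the for line stays inside the shell, so the argument list limit never comes into play.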

Comment out the TOTAL line that used find. Voila.

I know, I should have done it this way in the beginning. It is sloppy to go back and bolt on the total afterward. Maybe because I have this post here, I will reference it and not be so sloppy in the future.

Content Migration Progress Tracking

When moving hundreds of thousands of courses between WebCT Vista and Desire2Learn, keeping track of what made it through each stage seems, in hindsight, like an obvious thing to do. I add the hindsight part because we only noticed once the things falling between the cracks started to pile up. The basic process…

    1. Through Oracle SQL we populate a middleware database with those courses which meet acceptable criteria.
      CRACK: With the wrong criteria, courses are not even selected.
    2. Through Oracle SQL we generate XML listings of the courses in sets of 50.
      CRACK: Only a subset of the data loaded into the database may be extracted.
    3. A shell script automates running the WebCT command-line backup process to create a package for each course.
      CRACK: The command-line backup fails on some courses.
    4. Desire2Learn scripts pick up the files and convert WebCT formatted packages to Desire2Learn.
      CRACKS: Packages that are too big fail. Paths that are too long fail. This step can also fail to create the CSV files needed for the next step.
    5. Converted packages are placed in a queue to be imported into Desire2Learn.
      CRACKS: WebCT Vista course names can have 1,000 characters and D2L fails if there are more than 50. A course named the same as a previously loaded one but with a different file name gets loaded into the same existing course.

So, there are apparently five different stages and eight potential failures to track, and no end-to-end tracking to even know what is missing. Which means inventing something to check the logs for errors.

First thing, I tried writing SQL to create an ordered list of the courses that are available.

The WebCT backups were a little tougher to convert into a useful list. The file names follow the format of Course_Parent_1234567890.bak. They were also in various directories, so I ended up doing something like this to get a list of the files, strip off the parents & time stamps, strip off the directories, and order it.

ls */*.bak | awk -F_ '{print $1}' | awk -F\/ '{print $2}' | sort

So I have two ordered lists. Anything in the D2L one and not in the WebCT one ought to be the work of clients making their own stuff. Anything in the WebCT one and not in the D2L one ought to be my missing ones. Only, almost every line was a mismatch.

Visually comparing them, I realized the same courses had in effect different names. All the spaces were converted to underscores. So I ended up adding a sed to convert spaces to underscores after the sort.
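
For the record, the pipeline with the sed bolted on, plus a comparison step, ends up something like this. The comm commands and file names are my illustration (the actual comparison started out by eye); comm wants both inputs sorted, which they are:

ls */*.bak | awk -F_ '{print $1}' | awk -F\/ '{print $2}' | sort | sed 's/ /_/g' > webct.list
# d2l.list is the ordered list produced by the SQL earlier
comm -23 webct.list d2l.list   # only in WebCT: my missing courses
comm -13 webct.list d2l.list   # only in D2L: clients making their own stuff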

Then I wrote some more scripts.

    • Go through the logs from #4 and #5 to display the errors. With that I was able to compare my list of missing courses against the errors and confirm why they did not come through.
    • Some of the cracks can be addressed by making new import files. So I wrote a script to add those lines to a redo.csv file (sketched below). Touch up the problems and go.
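
The redo script is nothing clever; something like this, where the log location, error text, and CSV column layout are invented placeholders rather than D2L's actual formats:

# Pull the course IDs out of failed-import log lines, then copy each
# course's row from the original import CSV into redo.csv.
grep 'ERROR' import.log | awk '{print $2}' | while read COURSE ; do
  grep "^$COURSE," import.csv >> redo.csv
done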

Basically, at this point it only covers steps 3-5. At some point I am going to have to check steps 1-5 end to end.

Why Ten

The question of why we run ten clusters came up recently. Off the top of my head, the answer was okay. Here is my more thoughtful response.

Whenever I have been in a conversation with a BEA (more recently Oracle) person about Weblogic, the number of nodes we run has invariably surprised them. Major banks serve ten times the number of simultaneous users we have on a half dozen managed nodes or less. We have 130 managed nodes for production. Overkill?

They do have some advantages over us.

  1. Better control over the application. WebCT hacked together an install process very much counter to the way BEA would have done it. BEA would have had one install the database and the web servers, then deploy the application using either the console or the command line. WebCT instead created an installer which does all of this in the background, out of sight and mind of the administrator. They also created start and stop scripts which drive Weblogic from the command line to start the application. Great for automation and for making things simple for administrators. It also lobotomized the console, making many advanced things one could normally do there risky. So now the console is only useful for some minor configuration management and monitoring.
  2. Better control over the code. When there is a performance issue, they can find the cause and improve the efficiency of the code. The best I can do is point out the inefficiencies to a company whose priority is a completely different codebase. If you do not have control over the code, then you give the code more resources.
  3. As good as Weblogic is at juggling multiple managed nodes, more nodes does not always equal better. Every node has to keep track of the others. The heartbeats are communicated through multicast: every node sends out its own and listens for the same from all the others. At around twenty nodes they would miss occasional beats on their own. Throw in a heavy workload and an overwhelmed node can miss enough beats that the others mark it as unavailable. That is usually the point when the monitors started paging me about strange values in the diagnostics. Reducing the number of nodes helped. (There is a sanity checker for this; see the sketch after this list.)
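
For what it is worth, BEA ships a multicast sanity checker inside weblogic.jar that helps confirm this kind of thing. A sketch, run simultaneously on each suspect node; the node name, address, and port are placeholders for the cluster's actual multicast settings:

# Each instance sends on the multicast address and reports which
# peers it hears; silence from a peer points at network or load trouble.
java -classpath $CLASSPATH utils.MulticastTest -n NODE_NAME -a 237.0.0.1 -p 7001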

More resources means more nodes. We had two clusters of about 22 nodes each (44 total) when we hit a major performance wall. They were split into four clusters of 15 nodes each (60 total). Eventually these grew to over 22 nodes each again. At that point upgrading was out of the question. A complete overhaul with all new databases and web servers meant we could do whatever we wished.

The ideal plan was a cluster per client. Licenses being so expensive scrapped that plan.

Ten clusters with 13 managed nodes each was a reasonable compromise: more total nodes while also keeping clusters smaller. Empty databases also gave us a better restarting point. The databases have still grown to the point that certain transactions run slowly after just 4 terms. (I was hoping for 6.) Surviving the next two years will be a challenge, to say the least. I wish we got bonuses for averting disasters.

Missing Shutdowns

A Weblogic managed node for a development cluster failed to shut down when our shutdown script requested it. The last managed node to shut down becomes the JMS node and triggers a rewrite of the config.xml. We have scripts in place to check for the config.xml changing and alert us. Since I am on call this week, I received the page.

I thought it would be good enough to copy the config.xml into place. Since the node would be restarted that night by the usual shutdown script, the cluster would pick up the new config.xml and all would be well again. Ha! Normally a node has an entry in WebCTServer.99999999999.log stating something like "Server shutdown has been requested by system." for each shutdown, and then the log ends. These entries were completely missing. That intrigued me, as I expected them to be present along with some reason why the shutdown failed. Instead, it was as if no request was ever sent.

The shutdown script log showed this error where it ought to show the node was shutdown successfully. (Actual names replaced with CLUSTER_NAME and NODE.)

Error:CLUSTER_NAME:Name=NODE,Location=NODE,Type=ServerConfig.

The shutdown script calls Blackboard's stopWebCTServer.sh, which just calls another script that takes various inputs to ultimately call:

java -classpath $CLASSPATH weblogic.Admin -url t3://$HOSTNAME:$PORT -username $WL_USER -password $WL_PASS SHUTDOWN

CLASSPATH= can be found in the start scripts and has multiple entries set in setEnv.sh. Source it (". setEnv.sh") to set this for your session.
HOSTNAME= server hostname
PORT= HTTP port where Weblogic listens
WL_USER= Weblogic user
WL_PASS= Weblogic user's password

Instead of shutting down the node, the command just returned an irrelevant status. This one node gave a Weblogic command-line response which seems to mean a botched connection, but one where something was listening. The not-listening or wrong-address error looks like this instead:

Failed to connect to t3://$HOSTNAME:$PORT: Destination unreachable; nested exception is: java.net.ConnectException: Connection refused; No available router to destination.
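
To tell the two cases apart before killing anything, a couple of quick probes help. PING is another weblogic.Admin command; the netstat line is just a generic is-anything-listening check:

java -classpath $CLASSPATH weblogic.Admin -url t3://$HOSTNAME:$PORT -username $WL_USER -password $WL_PASS PING
netstat -an | grep $PORT   # is anything listening on the port at all?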

Yay, another case to figure out to correctly handle.

I killed the process. For good measure I did a "touch REFRESH" in $WL_DOMAIN so the node would dump anything it had cached and download fresh copies. Since it started up this morning in the normal restart, I think it is fixed.

Other than the JMS migration failing, I don’t think this caused any problems for users. Just so very odd.

P.S. weblogic.Admin is deprecated in Weblogic 9, so it is interesting Blackboard still makes use of it.

LC Oddities

IMS XML for Blackboard Vista 8:

Say Division1 exists. We want to create Group1 inside Division1. What works: ignore that Division1 already exists and write XML to create it again, then create Group1 with relationship tag info pointing at Division1.

Starting with just Group1 does not work unless the command line overrides the starting learning context to be Division instead of Group.

Luminis XML for Blackboard Vista 8:

Starting with Group1 works fine because divisions are unsupported.

Don’t ever use Luminis XML as a model for IMS. Ever!

CE / Vista Undocumented Workspaces

On the WebCT Users email list (hosted by Blackboard) there is a discussion about a mysterious directory called unmarshall which suddenly appeared. We found it under the same circumstances as others did: investigating why a node consumed so much disk space. Failed command-line restores end up in this unmarshall directory.

Unmarshalling in Java jargon means:

converting the byte-stream back to its original data or object [1]

This sounds suspiciously like what a decryption process would use to convert a .bak file into a .zip so something can open the file.

This is the fourth undocumented workspace where failed files sit for a while causing problems, with no forewarning from the vendor.

Previous ones are:

  1. Failed UI backups end up in the weblogic81 (Vista 3, does this still happen in Vista 8?) directory.
  2. Failed tracking data files end up in WEBCTDOMAIN/tracking (Vista 3, apparently no longer stored this way in Vista 4/8 according to CSU-Chico and Notre Dame)
  3. Web Services content ends up in /var/tmp/ with names like Axis####axis. These files are caused by a bug in DIME (like MIME) for Apache Axis. No one is complaining about the content failing to arrive, so we presume the files just pile up on the system.

#3 was the hardest to diagnose because there was no way to tie the data back to user activity.
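
All of these amount to disk quietly filling up, so a periodic check costs little. A sketch: the tracking path is from the list above, and the unmarshall path is an assumption about where it lands under the domain on your nodes:

# Report how much space each known failure workspace is consuming
du -sh $WEBCTDOMAIN/unmarshall $WEBCTDOMAIN/tracking /var/tmp/Axis* 2>/dev/null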

Is this all of them? I need to do testing to see which of these I can cross off my list going forward in Vista 8. Failed restores are on it indefinitely for now.
🙁

References:

  1. http://www.jguru.com/faq/view.jsp?EID=560072

CE/Vista and Banner Integration

This is the second time I have worked on making Vista integration work with Banner. The first was in 2005 with Vista 3.0.3 at Valdosta State. The production setup here at GeorgiaVIEW was done by Harold, Jill, and Amy years ago, integrated into the install scripts or carried along as part of the cloned databases.

So now I am working on getting it to work in Vista 8. The IMS imports worked the first time like a charm. When I turned to using the Luminis adapter, the person records worked fine, but the group contexts failed in Vista 8 while they worked fine in Vista 3. So the "siapi.sh luminis import restrict" command itself works fine.

Command-line

We currently have 41 institutions in Vista 3. So imports are automated to some degree to preserve the sanity of Jill (and, to a lesser degree, Amy and myself). Rather than put all the settings in the UI, we have a properties file defining the location, glcid, sourcedid.source, and sourcedid.id for each institution. This allows us to easily pass the values when importing at the command line; a sketch follows.
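
The shape of that automation, with the file format and names invented here for illustration (the actual siapi.sh arguments are whatever your version documents):

# institutions.list: one line per institution, e.g.
# LOCATION|GLCID|SOURCEDID_SOURCE|SOURCEDID_ID
while IFS='|' read LOCATION GLCID SRC ID ; do
  echo "Importing $LOCATION (glcid=$GLCID sourcedid=$SRC/$ID)"
  # call the siapi.sh import here with these values
done < institutions.list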

My first approach was to leave the settings identical to what I used to create person and group records with IMS. This essentially uses the glcid of the institution and the sourcedid of the institution. This is what resulted in the person records working and the groups not. Fail.

I realized my error in logic must be the lack of a division-to-group relationship, as the error said the groups could not be related to an institution. So I changed the properties to use the division values for the sourcedid. Fail.

So I went looking in “Guide to Integration with the SunGard Luminis Data Integration Suite” for what I ought to use at the command-line. I didn’t find a solution. Just the same command-line lacking even the glcid and sourcedid.
🙁

XML

Giving up on the command-line approach for now, I added the relationship element to the XML so the group would become a child of one of the divisions I created with IMS. It sorta worked! The groups all imported, but the courses failed with the exact same error the groups had formerly received. To add insult to injury, simply running the import again on the exact same file made the courses import.

Mistakes

A mistake I made was reading the documentation: “Guide to Integration with the SunGard Luminis Data Integration Suite”.

Sungard Libraries:

  1. Page 8 says imq.jar and mbclient.jar do not come with CE/Vista and must be obtained from Sungard. All three of us thought these were automatically placed in Vista 3.x, so we never needed to place them ourselves. Best I can tell, they were installed by Vista: $WEBCTDOMAIN/customconfig/startup.properties references both files in CUSTOM_CLASSPATH, and setEnv.sh references CUSTOM_CLASSPATH. (The document has notes for what CE customers need to do, with no note about CE users needing to go get them from Sungard.)
  2. Those who believe the last note would keep reading and find, on page 9, instructions to deposit the files in $WEBCTDOMAIN/serverlibs/. Assuming I am wrong about item #1, startup.properties expects them in $WEBCTDOMAIN/serverlibs/luminis/ and would not find them where the document says to put them. (A quick check is sketched below.)
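
Checking which story is right takes two commands, using only the files named above:

# Where does the custom classpath expect the jars?
egrep 'imq.jar|mbclient.jar' $WEBCTDOMAIN/customconfig/startup.properties
# Are the jars actually there?
ls -l $WEBCTDOMAIN/serverlibs/luminis/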

Bb 9 is NextGen?

UPDATE: Have another source who disagrees with Badge, who made the claim below.

———

Hopefully this is true. I’d hate to be spreading a false rumor. Anyone willing to confirm? 😀

According to J. L. Badge, Blackboard 9 will be the “NextGen” product we are… *cough* eagerly… *ahem* waiting to see. John Fontaine was in our building Thursday. He mentioned NextGen several times. I must have fallen asleep and missed it if a version number was mentioned when he talked about NextGen. It makes me wonder if Jan Posten Day’s recent blog post re: the Blackboard Idea Exchange is meant to elicit early feedback on NextGen.

Certainly, if Bb9 is The One and people can get a first look at BbWorld ’08, then the conference will be mobbed. Lovely.

The “Too little, too late.” comment is funny. Apparently he is a professor who has moved on to hanging out by himself in Second Life. A little ahead of the curve.

My interests in Blackboard products are 1) stability, 2) deployments, and 3) doing tier 3+ support well. These early looks are only going to tell me maybe a bit about #3. I really won’t know what I want to know until I talk to the Bb Performance Team and look at the dirt from the command line. Of course, the elimination of all but a few Java applets deserves for us to lift John Fontaine up on our shoulders and parade him around all during BbWorld ’09.