What the Heck Is a Hot Fix?

At the BbWorld Developers’ Conference (Thursday afternoon and Friday morning after BbWorld), John Fontaine gave a session called What the Heck is a Hotfix? (PPT, audio recording). I’d been meaning to look for it on the Bb Connections web site, where the conference presentations were uploaded. Instead, I found it through a Bb knowledge base link to eduGarage, which apparently is the new home of the Blackboard Developers Network.

  • Ad Hoc Patch – fixes a single issue
  • Hot Fix – multiple, usually related, code changes (5-6 issues)
  • Service Pack – many code changes (50-60 issues)
  • New Release (either Application Pack or new version number) – new features and large-scale code changes

BbWorld Presentation Redux Part II – Monitoring

Much of what I might write in these posts about Vista is knowledge accumulated from the efforts of my coworkers.

This is part two in a series of blog posts on our presentation at BbWorld ’07, on behalf of the Georgia VIEW project, Maintaining Large Vista Installations (2MB PPT).

Part one covered automation of Blackboard Vista 3 tasks. Next, let’s look at monitoring.

We have written several scripts to collect data. One special script connects to WebLogic on each node to capture data from several MBeans. Other scripts watch for problems with the hardware, the operating system, the database, and even logging in to Vista. Each server (node or database) has, I think, 30-40 monitors. A portion of the items we monitor is in the presentation. Every level of our clusters is watched for issues. The data from these scripts are collected into two applications.

  1. Nagios sends us alerts when values from the monitoring scripts fall outside the criteria we expect. Green means good; yellow means warning; red means bad. Thankfully none in our group are colorblind. Nagios can also send email and pages for alerts. Finding the sweet spot where we get alerted for a problem but avoid false positives is perhaps the most difficult part.
  2. An AJAX application created by two excellent members of our Systems group, known internally as Stats, graphs the same monitored data. Nagios tells us a node failed a test. Stats tells us when the problem started, how long it lasted, and whether other nodes displayed similar issues. We can also use Stats to watch trends. For example, we know there are two daily peaks because we watch WIO usage rise to a noonish peak, slough off by ~20%, and peak again in the evening, fairly consistently over weeks and months.
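The green/yellow/red idea can be sketched as a small threshold check. This is only an illustration, not our actual monitoring scripts: the check name and the threshold values are made up, and the only borrowed convention is the standard Nagios plugin exit status (0 for OK, 1 for WARNING, 2 for CRITICAL).

```python
# Nagios plugin convention: the exit status tells Nagios the state.
OK, WARNING, CRITICAL = 0, 1, 2
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}


def classify(value, warn_at, crit_at):
    """Map a measured value to a state, assuming higher values are worse."""
    if value >= crit_at:
        return CRITICAL
    if value >= warn_at:
        return WARNING
    return OK


def check_line(name, value, warn_at, crit_at):
    """Return (state, text): the text is the one-line summary a plugin prints."""
    state = classify(value, warn_at, crit_at)
    return state, f"{name} {LABELS[state]} - value={value}"


if __name__ == "__main__":
    import sys

    # Hypothetical check: load of 7 against warn=5, crit=10 thresholds.
    state, text = check_line("load", 7, 5, 10)
    print(text)
    sys.exit(state)
```

Finding the sweet spot is then a matter of tuning warn_at and crit_at per monitor: too low and the pages are false positives, too high and the alert arrives late.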

We also use AWStats to provide web server log summary data. Web server logs show activity of the users: where they go, how much, etc.

In summary, Nagios gives us a heads up there is a problem. Stats allows us to trend performance of nodes and databases. AWStats allows us to trend overall user activity.

Coradiant TrueSight was featured in the vendor area at BbWorld. This product looks promising for determining where users encounter issues. Blackboard is working with them, but I suspect it’s likely for Vista 4 and CE 6.

We have fantastic data. Unfortunately, interpreting the data proves more complex. Say the load on a server hosting a node starts climbing, reaches the point where we get pages, and continues to climb. What does one do? Remove it from the cluster? Restart it? Restarting it will simply shift the work to another node in the cluster. Say the same happens with the database. Restarting the database will kick all the users out of Vista. Unfortunately, Blackboard does not provide a playbook on what to do with every support possibility. Also, if you ask three DBAs, then you will likely get three answers. 😀

It’s important to balance underreaction and overreaction. When things go wrong, people want us to fix the problem. Vista is capable of handling many faults, yet it can fail to handle very similar ones. The linked example was a failed firewall upgrade. I took a similar tack with another firewall problem earlier this week. I ultimately had to restart the cluster that evening because it didn’t recover.

Part three will discuss the node types.

On the Fourth through Sixth Loops of Ready 2 Wear

I really have to stop listening to the same song played over and over. It may affect my thinking….

We had another node crash due to the Sun JVM issue. Our start script failed to make a file in /var so the node did not become fully operational as expected. While waiting for those with permission to delete some stuff to free up space, I went looking for what I could delete myself. Naturally /var/tmp seemed a likely place. I found 1,171 files named Axis#####axis. (Replace the #s with well… numbers.) They used up only 42MB. Most were small. Looking across all our machines there are thousands of these dating back to February of this year.

I love the Unix file command. It will tell you what kind of files are there. So I used file | sort -k 2 to sort by the type. Almost all of the files were plain text, JPEGs, or GIFs. One file, identified as a “c program file,” turned out to be JavaScript (whose syntax is based on C). I downloaded a JPEG file locally, renamed it to have the .jpg extension, and opened it in an image viewer. It opened correctly. Seems it’s a graphic of a table.
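For anyone without file handy, here is a rough sketch of the same idea: look at the first bytes of each file and group the names by type. It is far cruder than the real file command (it knows only the JPEG and GIF signatures and a plain ASCII test), and the Axis*axis glob is just my shorthand for the file names described above.

```python
from collections import defaultdict
from pathlib import Path


def sniff(data: bytes) -> str:
    """Guess a file type from its leading bytes (a tiny subset of what file does)."""
    if not data:
        return "empty"
    if data.startswith(b"\xff\xd8\xff"):
        return "JPEG image"
    if data.startswith((b"GIF87a", b"GIF89a")):
        return "GIF image"
    try:
        data.decode("ascii")
    except UnicodeDecodeError:
        return "data"
    return "ASCII text"


def group_by_type(directory, pattern="Axis*axis"):
    """Return {type: [file names]} for the files in directory matching pattern."""
    groups = defaultdict(list)
    for path in sorted(Path(directory).glob(pattern)):
        if path.is_file():
            with path.open("rb") as fh:
                groups[sniff(fh.read(512))].append(path.name)
    return dict(groups)
```

Calling group_by_type("/var/tmp") would give one list of names per type, which is the same view the sort by the second column gave me.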

It would seem our Blackboard Vista 3 has been collecting these files for months. They do not take up very much space. There are not nearly enough files to represent a download of content by all users. Our /var would fill up hourly in that case.

Axis is an Apache SOAP project. Vista’s exposed APIs use Axis, I believe. So, the running hypothesis is that several of our campuses are using a product which contacts the APIs to upload content. It’s spread out enough that all four clusters are affected. It’s something that started about February.

Suspect #1 Respondus – Chosen because we know it hits the APIs to upload content. Discounted because the content is lecture materials. Respondus works with assessments (aka quizzes, tests, exams).

Suspect #2 Impatica – Chosen because the JavaScript file references PPT. Impatica compacts PowerPoint (aka PPT) files and allows them to play without needing a PPT player. Their support pages teach users how to use the Campus Edition 4 user interface to upload content into a course. O-kay….

Suspects #n Softchalk, Diploma, Microsoft .Learn, etc. – I haven’t really investigated any of these. They are just names to me at the moment.


UPDATE: So… There is a bug in Axis which dumps these files into the file system. The files can be deleted as long as they are not current.

BbWorld Presentation Redux Part I – Automation

Much of what I might write in these posts about Vista is knowledge accumulated from the efforts of my coworkers.

I’ve decided to do a series of blog posts on our presentation at BbWorld ’07, on behalf of the Georgia VIEW project, Maintaining Large Vista Installations (2MB PPT). I wrote the bit about tracking files a while back in large part because of the blank looks we got when I mentioned in our presentation at BbWorld that these files exist. For many unanticipated reasons, these may not be made part of the tracking data in the database.

Automation in this context essentially is the scheduling of tasks to run without a human needing to intercede. Humans should spend time on analysis, not typing commands into a shell.

Rolling Restarts

This is our internal name for restarting a subset of the nodes in our clusters. The idea is to restart all managed nodes except the JMS node, usually one at a time. Such restarts are conducted for one of two reasons: 1) to have the node pick up a setting or 2) to have Java discard everything from memory. The latter is why we restart the nodes once weekly.

Like many, I was skeptical of the value of restarting the nodes in the cluster once weekly. Then, as part of the Daylight Saving Time patching, we handed our nodes over to our Systems folks (hardware and operating systems) and forgot to re-enable the Rolling Restarts for one batch. Those nodes started complaining about issues in the second week. Putting the Rolling Restarts back into place eliminated the issues. So… Now I am a believer!

One of my coworkers created a script which 1) detects whether or not Vista is running on the node, 2) shuts down the node only if Vista is running, 3) once it is down, starts the node back up, and 4) finally checks that it is running. It’s pretty basic.
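The four steps can be sketched like this. It is not my coworker’s script: is_running, stop, and start are hypothetical callables standing in for whatever actually checks, stops, and starts Vista on a node, and the timeout and poll values are made up.

```python
import time


def _wait_for(condition, timeout, poll, sleep):
    """Poll condition() until it is true or timeout seconds have been waited."""
    waited = 0
    while True:
        if condition():
            return True
        if waited >= timeout:
            return False
        sleep(poll)
        waited += poll


def restart_node(is_running, stop, start, timeout=300, poll=5, sleep=time.sleep):
    """Restart one node and return a short status string."""
    # 1) Detect whether Vista is running on the node.
    if not is_running():
        return "skipped: not running"
    # 2) Only then shut it down, and wait until it is really down.
    stop()
    if not _wait_for(lambda: not is_running(), timeout, poll, sleep):
        return "failed: did not stop"
    # 3) Once down, start it back up.
    start()
    # 4) Finally, check that it is running again.
    if not _wait_for(is_running, timeout, poll, sleep):
        return "failed: did not start"
    return "restarted"
```

A rolling restart is then just a loop over the managed nodes (skipping the JMS node) that calls restart_node on each in turn and stops at the first status that is not "restarted".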

Log cleanup to preserve space

We operate on a relatively small space budget. Accumulating logs ad infinitum strikes us as unnecessary. So, we keep a month’s worth of certain logs. Others are rolled by Log4j to keep a certain number. Certain activities can mean only a day’s worth are kept, so we have on occasion increased the number kept for diagnostics. Log4j is so easy and painless.

We use Unix’s find with mtime to look for files 30 days old with specific file names. We delete the ones which match the pattern.
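A rough Python equivalent of that find-and-delete step, as a sketch only: the directory and file name pattern are placeholders, and the age test here is a plain comparison of modification time against a cutoff, which is close to but not exactly how find rounds -mtime.

```python
import fnmatch
import os
import time

SECONDS_PER_DAY = 86400


def old_files(directory, pattern, days=30, now=None):
    """List files in directory matching pattern and last modified over days ago."""
    now = time.time() if now is None else now
    cutoff = now - days * SECONDS_PER_DAY
    matches = []
    for entry in os.scandir(directory):
        if (entry.is_file(follow_symlinks=False)
                and fnmatch.fnmatch(entry.name, pattern)
                and entry.stat(follow_symlinks=False).st_mtime < cutoff):
            matches.append(entry.path)
    return sorted(matches)


def delete_old_files(directory, pattern, days=30, dry_run=True):
    """Delete the matching old files; with dry_run=True, only report them."""
    victims = old_files(directory, pattern, days)
    if not dry_run:
        for path in victims:
            os.remove(path)
    return victims
```

Defaulting to dry_run=True is deliberate: the first run of any cleanup job should show what it would delete before it deletes anything.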

UPDATE 2007-SEP-18: The axis files in /var/tmp will go on this list, but we will delete any more than a day old.

Error reporting application, tracking, vulnerabilities

Any problems we have encountered, we expect to encounter again at some point. We send ourselves reports to stay on top of potentially escalating issues. Specifically, we monitor for WebLogic’s unmarshalled exception and for tracking files that failed to upload, and we used to collect instances of a known vulnerability in Vista. Now that it’s been patched, we are not looking for it anymore.

Thread dumps

Blackboard at some point will ask for thread dumps from the time the error occurred. Replicating a severe issue strikes us as bad for our users. We have the thread dumps running every 5 minutes and can collect them to provide to Blackboard on demand. No messing with the users for us.

Sync admin node with backup

We use rsync to keep a spare admin node in sync with the admin node for each production cluster. Should the admin node fail, we have a hot spare.

LDIS batch integration

Because we do not run a single cluster per school and the Luminis Data Integration Suite does not work with multiple schools for Vista 3 (rumor is Utah has it working for Vista 4), we have to import our Banner data in batches. The schools we host send the files; our expert reviews them and puts them in place. A script finds the files and uploads each in turn. Our expert can sleep at night.
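The find-and-upload loop might look something like the sketch below. Everything named here is an assumption for illustration: upload is a hypothetical callable standing in for whatever actually imports a file into Vista, and the incoming and processed directories are placeholders.

```python
import shutil
from pathlib import Path


def upload_in_turn(incoming, processed, upload, pattern="*"):
    """Upload each reviewed file in name order, then move it out of the way."""
    processed = Path(processed)
    processed.mkdir(parents=True, exist_ok=True)
    uploaded = []
    for path in sorted(Path(incoming).glob(pattern)):
        if not path.is_file():
            continue
        # upload() is expected to raise on failure, which stops the batch
        # and leaves the remaining files in place for the next run.
        upload(path)
        shutil.move(str(path), str(processed / path.name))
        uploaded.append(path.name)
    return uploaded
```

Moving each file only after its upload succeeds means a failed run can simply be rerun without importing anything twice.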

Very soon, we will automate the running of the table analysis.

Anyone have ideas on what we should automate?
