Big Bad Blip

I was at lunch last week when I saw pages about failed monitoring checks on one of our sites. My coworkers were working on CE/Vista SP6 upgrades, and this site was one upgraded the day before. When I returned to the office, I asked about it. Exactly 24 hours to the second after the license check during the previous day's final startup, the JMS node failed a license check four times, about a minute apart. On the fourth failure, it began shutting itself down. Other nodes in the cluster did the same.

Fortunately, a coworker caught it soon enough to restart the nodes before enough of them were down for the load balancer to stop sending us traffic. Also, this was between terms, so we did not have a normal workload.

Still, JMS migrated. That made WebLogic edit the config.xml and probably left the cluster in a weird state. So I set cron to shut down the cluster at 4am, copy a known good config.xml into place, check the config with our monitor script (which pages us if it is bad), and start the cluster. That was a disaster. Various nodes failed their early starts: the admin node came up, but the JMS node failed to start. So I was paged about it still being down when it ought to have been running.
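
For the curious, here is roughly what that 4am cron cycle looked like. Everything in it (the domain path, the stop/start/check wrapper scripts) is a placeholder for our own tooling, not anything shipped with WebLogic or CE/Vista; consider it a sketch under those assumptions.

    #!/bin/bash
    # Run from cron at 4am: stop the cluster, restore a known good config.xml,
    # verify it, then start the cluster. All paths and wrapper script names are
    # placeholders for our own tooling.
    set -e

    DOMAIN=/u01/app/nodeA/weblogic81/config/mydomain     # hypothetical domain directory
    GOOD_CONFIG=/u01/app/backups/config.xml.known-good   # the copy we trust

    /u01/app/scripts/stopCluster.sh                      # our wrapper around the WebLogic stop scripts
    cp -p "$GOOD_CONFIG" "$DOMAIN/config.xml"            # put the known good config back

    # The monitor script pages us if the config looks wrong; bail out rather
    # than start the cluster on a bad config.
    /u01/app/scripts/check_config.sh "$DOMAIN/config.xml" || exit 1

    /u01/app/scripts/startCluster.sh                     # bring the admin and managed nodes back up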

My 6:30am starts failed for the same reason: a bad encrypted password in boot.properties. My only idea for fixing it came from a coworker who had mentioned having to reinstall an admin node after a security error. So I called her, explained the problem, and described the solution I really did not want to take. She looked at the error and thought about it for a while. She decided it might work to replace the boot.properties with an unencrypted version, because WebLogic would encrypt it once discovered. She also suggested removing the servers directory and placing a REFRESH file, which would prompt the node to download a new copy of the files it needs from the admin node.
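
For anyone hitting the same thing, this is roughly what the recovery looked like on the broken node. The domain path, server name, and credentials are made up, the exact boot.properties and REFRESH locations vary by WebLogic version, and WebLogic re-encrypts the plain-text values on the next successful start.

    #!/bin/bash
    # Sketch of recovering a node with a bad encrypted password in boot.properties.
    # SERVER and DOMAIN are hypothetical; adjust for your own layout.
    SERVER=nodeA
    DOMAIN=/u01/app/$SERVER/weblogic81/config/mydomain

    # 1) Put a plain-text boot.properties in place. WebLogic encrypts these
    #    values the next time the server starts successfully.
    cat > "$DOMAIN/boot.properties" <<'EOF'
    username=weblogic
    password=CHANGEME
    EOF

    # 2) Move the servers directory aside and drop a REFRESH marker so the
    #    managed server pulls fresh copies of its files from the admin node.
    #    (The exact marker location may differ; this is the gist.)
    mv "$DOMAIN/servers" "$DOMAIN/servers.old.$(date +%Y%m%d)"
    touch "$DOMAIN/REFRESH"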

That worked; the nodes started correctly. Everything was fine during the normal maintenance on Friday. Looks like we are in the clear.

That afternoon I brought it up on our normal check-in call with Blackboard. An unable-to-find-license-file issue was why Blackboard pulled CE/Vista SP4, which was also a WebLogic upgrade.

webctbackup

John made a good point… While telling Blackboard about this is pointless, the community at large ought to be aware of another undocumented workspace issue. I found an 8GB .bak file in /u01/app/nodeA/weblogic81/webctbackup on the active JMS node. Taking out a user-accessible node is okay in my book: with 18-20 of them in our clusters, we can lose one and no client would ever know. The JMS node is a different matter. Mail, chat, learning context administration, and other services in CE/Vista fail without a functional JMS node.

An administrator did a template reassignment with “Force archive before template reassignment” set to true. For some reason the archive file was placed on the JMS node. It should have been deleted afterward, but it was not. I only caught it in time because I was at my desk working (not in meetings, at home, or asleep); another large file was dropped within ten minutes of me deleting the first.

This came within one GB of completely filling the file system. We do not have huge hard drives on these nodes, just three times the size we normally need, and we do not allow the nodes to accrue a ton of logs or junk.

Maybe Blackboard has resolved this for future versions like Vista 4 or 8. Maybe one day we will have official or unofficial documentation about this kind of stuff.

The answers I anticipate from Blackboard:

  1. This is functioning as designed. I bet composing the archive requires something from the JMS node, so it must reside there. The JVM is too small, as is /var/tmp, so the file system is the best place.
  2. Use a bigger hard drive.
  3. Set “Force archive before template reassignment” to false.

Even if Blackboard agrees this is bad, it might only get fixed in Vista 8. It certainly will not get fixed in the officially supported Vista 3.
🙁

If you want to confirm whether you have the potential for this problem, look for a $NODENAME/weblogic81/webctbackup or $NODENAME/weblogic92/webctbackup directory. We have one on each of our four JMS nodes and have also seen them on four (out of 76) other nodes; the other 72 nodes lack the directory. While you are at it, make sure you know about the other undocumented work spaces I have mentioned.
🙂
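
If you want a quick way to check, something like this works, assuming your node installs live under /u01/app the way ours do; adjust the paths for your layout.

    # Do any nodes have the webctbackup workspace?
    ls -d /u01/app/*/weblogic81/webctbackup /u01/app/*/weblogic92/webctbackup 2>/dev/null

    # Anything over ~1GB parked in it is worth a look before it fills the disk.
    find /u01/app/*/weblogic*/webctbackup -type f -size +1073741824c -print 2>/dev/null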

BbWorld Presentation Redux Part I – Automation

Much of what I might write in these posts about Vista is knowledge accumulated from the efforts of my coworkers.

I’ve decided to do a series of blog posts about our presentation at BbWorld ’07 on behalf of the Georgia VIEW project, Maintaining Large Vista Installations (2MB PPT). I wrote the bit about tracking files a while back in large part because of the blank looks we got when I mentioned during the presentation that these files exist. For many unanticipated reasons, these files may not be made part of the tracking data in the database.

Automation in this context is essentially scheduling tasks to run without a human needing to intercede. Humans should spend their time on analysis, not on typing commands into a shell.

Rolling Restarts

This is our internal name for restarting a subset of the nodes in our clusters. The idea is to restart all managed nodes except the JMS node, usually one at a time. We conduct such restarts for one of two reasons: 1) to have a node pick up a setting, or 2) to have Java discard everything in memory. The latter is why we restart the nodes once a week.

Like many, I was skeptical of the value of restarting the nodes in the cluster once a week. Then, as part of the Daylight Saving Time patching, we handed our nodes over to our Systems folks (hardware and operating systems) and forgot to re-enable the rolling restarts for one batch. Those nodes started complaining about issues into the second week. Putting the rolling restarts back in place eliminated the issues. So… now I am a believer!

One of my coworkers created a script which 1) detects whether Vista is running on the node, 2) shuts the node down only if Vista is running, 3) starts the node back up once it is down, and 4) finally checks that it is running again. It’s pretty basic.
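
I won’t reproduce her script, but a bare-bones sketch of the same four steps might look like the following. The pgrep pattern and the stop/start wrappers are assumptions about our layout, not anything Blackboard ships.

    #!/bin/bash
    # Rolling-restart sketch for one managed node: restart only if Vista is up,
    # then verify it came back.
    NODE_HOME=/u01/app/nodeA/weblogic81      # hypothetical node install
    PATTERN="weblogic.Server"                # how the Vista JVM shows up in ps

    if pgrep -f "$PATTERN" > /dev/null; then # 1) is Vista running here?
        "$NODE_HOME/stopNode.sh"             # 2) shut it down only if so
    fi

    "$NODE_HOME/startNode.sh"                # 3) start it back up

    sleep 120                                # give the JVM time to come up
    if pgrep -f "$PATTERN" > /dev/null; then # 4) confirm it is running
        echo "node restarted OK"
    else
        echo "node failed to start" >&2
        exit 1
    fi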

Log cleanup to preserve space

We operate on a relatively small space budget, and accumulating logs ad infinitum strikes us as unnecessary. So we keep a month’s worth of certain logs; others are rolled by Log4j to keep a set number of files. Certain activities can mean only a day’s worth is kept, so we have on occasion increased the number kept for diagnostics. Log4j makes this easy and painless.

We use Unix’s find with -mtime to look for files with specific names that are more than 30 days old, and we delete the ones which match the pattern.

UPDATE 2007-SEP-18: The axis files in /var/tmp will go on this list, but we will delete any that are more than a day old.
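
Both cleanups boil down to find one-liners along these lines; the paths and file name patterns are placeholders for our actual log locations.

    # Delete named logs that have not been modified in over 30 days.
    find /u01/app/nodeA/logs -type f -name 'WebCTServer.*.log' -mtime +30 -exec rm -f {} \;

    # Per the update above: axis files in /var/tmp (name pattern assumed) only get a day.
    find /var/tmp -type f -name 'axis*' -mtime +1 -exec rm -f {} \;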

Error reporting: application, tracking, vulnerabilities

Any problem we have encountered, we expect to encounter again at some point, so we send ourselves reports to stay on top of potentially escalating issues. Specifically, we monitor for the WebLogic unmarshalled exception and for tracking files that failed to upload, and we used to collect instances of a known vulnerability in Vista. Now that it has been patched, we are not looking for it anymore.
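
The reports themselves are nothing fancy. Here is a sketch of the idea, where the log path, the grep patterns, and the address are placeholders, not the exact strings we match on.

    #!/bin/bash
    # Daily error-report sketch: grep the node log for the patterns we care about
    # and mail ourselves whatever turns up. Path, patterns, and address are placeholders.
    LOG=/u01/app/nodeA/logs/WebCTServer.log

    MATCHES=$(grep -i -e 'UnmarshalException' -e 'failed to upload tracking file' "$LOG")

    if [ -n "$MATCHES" ]; then
        echo "$MATCHES" | mailx -s "Vista error report: $(hostname)" vista-admins@example.edu
    fi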

Thread dumps

At some point Blackboard will ask for thread dumps from the time an error occurred. Replicating a severe issue just to capture them strikes us as bad for our users, so we have thread dumps running every 5 minutes and can collect them to provide to Blackboard on demand. No messing with the users for us.
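
The mechanics are simple: kill -3 (SIGQUIT) tells a Java process to write a thread dump to its stdout log without disturbing it, so a five-minute cron entry covers us. The pgrep pattern below is an assumption about how the Vista JVM shows up in ps.

    # crontab sketch: thread-dump the Vista JVM every five minutes.
    # The dump lands in whatever file the node's stdout is redirected to.
    */5 * * * *  for pid in $(pgrep -f weblogic.Server); do kill -3 $pid; done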

Sync admin node with backup

We use rsync to keep a spare admin node in sync with the admin node for each production cluster. Should the admin node fail, we have a hot spare.
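
The sync itself is a single rsync run from cron; the hostname and domain path below are placeholders for our own.

    # Keep the hot spare's domain directory in sync with the production admin node.
    # --delete keeps the spare from drifting as files are removed on production.
    rsync -az --delete \
        /u01/app/admin/weblogic81/config/mydomain/ \
        spare-admin.example.edu:/u01/app/admin/weblogic81/config/mydomain/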

LDIS batch integration

Because we do not run a single cluster per school and the Luminis Data Integration Suite does not work with multiple schools for Vista 3 (rumor is Utah has it working for Vista 4), we have to import our Banner data in batches. The schools we host send the files; our expert reviews them and puts them in place. A script then finds the files and uploads each in turn. Our expert can sleep at night.
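
A stripped-down sketch of that batch loop follows. The drop directory and the import wrapper are placeholders; the real upload goes through Vista’s own import tooling.

    #!/bin/bash
    # Batch-import sketch: feed the reviewed Banner/IMS files to the import tool
    # one at a time, filing each as done or failed.
    DROPDIR=/u01/app/imports/ready           # hypothetical drop directory

    for f in "$DROPDIR"/*.xml; do
        [ -e "$f" ] || continue              # nothing to do if no files are waiting
        if /u01/app/scripts/run_import.sh "$f"; then
            mv "$f" "$DROPDIR/../done/"
        else
            mv "$f" "$DROPDIR/../failed/"
        fi
    done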

Very soon, we will automate the running of the table analysis.

Anyone have ideas on what we should automate?

Balancing the Underreaction and Overreaction

Last night was supposed to be easy: shut down the Vista application, wait for a call once some other work finished, and bring the app back up. It was easy… at first. I got the app back online and continued poking around after letting others know it was back in service.

Satisfied all was well, I went to bed at 3am or so. Well… I got a phone call. There might be a problem with that earlier maintenance, which was being addressed, but was my application affected? Well, no one could log in, so… yeah. Only, the tools I use most often to find issues were also unavailable. Oops.

So I improvised. With netstat, I could see nodes in the application cluster were talking to each other and the database. Logs on one node showed it was failing to contact the database. So, the response should have been to shut down the application, right? As Amy says, “When all you have is a hammer, every problem starts to look like a nail.”
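
For what it’s worth, the check was nothing fancier than this; 7001 and 1521 are the usual WebLogic and Oracle listener ports, which may not match your install.

    # Are the cluster nodes still talking to each other and to the database?
    # Look for established connections on the WebLogic (7001) and Oracle (1521) ports.
    netstat -an | grep ESTABLISHED | egrep '7001|1521'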

I didn’t make that choice. I decided it would be an overreaction. Instead, I reassured the people asking me what was wrong. When the network stabilized, the application did as well, though it took over half an hour for the app to finally resolve the issues. As a result of this funk, it appeared the JMS services might have migrated, so I migrated them back. Hindsight being 20/20, I think even that might have been an overreaction, but it made me feel better to have done something.

Doing something feels better than doing nothing. It just doesn’t resolve the cause. It can easily become the cause of another issue. Two for the price of one?

P.S. I finally got to bed at about 7am, called in to a 1pm meeting, and crashed again after that meeting. I’m still tired. Let’s do it again in a couple of days.