Big Bad Blip

I was at lunch last week when I saw pages about a failed monitoring checks on one of our sites. My coworkers were working on CE/Vista SP6 upgrades. Though it was one upgraded yesterday. When I returned to the office, I asked about it. Exactly 24 hours to the second after checking the license in yesterday’s final start, the JMS node failed a license check four times about a minute apart. On the fourth failure, it started a shutdown of the node. Others in the cluster did as well.

Fortunately, a coworker caught it soon enough to start them again so not enough were shut down the load balancer would stop sending us traffic. Also, this was between terms so we did not have a normal work load.

Still, JMS migrated. That made Weblogic edit the config.xml and probably left the cluster in a weird state. So I set cron to shutdown the cluster at 4am, copy a known good config.xml into place, check the config with our monitor script (pages if bad), and start the cluster. That was a disaster. Various nodes failed their early The startup started the admin node, but the JMS failed to start. So I was paged about it still being down when it ought to have been running.

My 6:30 am starts failed for the same reason: bad encrypted password in boot.properties. My only idea how to fix this was a coworker had mentioned having to re-install an admin node for a security error. So I called the coworker. I explained the problem and the solution I really did not want to take. She looked at the error and thought about it some. She decided it might work to replace the boot.properties with an unencrypted version because Weblogic would encrypted it when discovered. She also suggested removing the servers directory and placing a REFRESH file which would prompt the node to download a new copy of the files it needs from the admin node.

That worked to getting the nodes to start correctly. It was fine during the normal maintenance on Friday. Looks like we are in the clear.

That afternoon I brought it up on our normal check-in call with Blackboard. An unable to find license file issue was why Blackboard pulled CE/Vista SP4. It also was a Weblogic upgrade.

Missing Shutdowns

A Weblogic managed node for a development cluster failed to shutdown when our shutdown script requested. The last managed node to shutdown becomes the JMS node and triggers a rewrite of the config.xml. We have scripts in place to check for the config.xml changing and alert us. Since I am the on call this week, I received the page.

I thought it would be good enough to copy the config.xml into place. Since it would be restarted that night by the usual shutdown script, the cluster would pick up the new config.xml and all would be well again. Ha! Normally a node has an entry in WebCTServer.99999999999.log stating something like “Server shutdown has been requested by system.” for each shutdown then the log ends. These are completely missing. That intrigued me as expected these to be present but some reason for why it failed. Instead, it was like no request was sent.

The shutdown script log showed this error where it ought to show the node was shutdown successfully. (Actual names replaced with CLUSTER_NAME and NODE.)

Error:CLUSTER_NAME:Name=NODE,Location=NODE,Type=ServerConfig.

The shutdown script call Blackboard’s stopWebCTServer.sh which just calls another script which takes various inputs to ultimately call:

java -classpath $CLASSPATH weblogic.Admin -url t3://$HOSTNAME:$PORT -username $WL_USER -password $WL_PASS SHUTDOWN

CLASSPATH= can be found in the start scripts and has multiple entries in setEnv.sh. Just run “sh setEnv.sh” to set this for your session.
HOSTNAME= server hostname
PORT= HTTP port where Weblogic listens
WL_USER= Weblogic user
WL_PASS= Weblogic user’s password

Instead of shutting down the node it just gave an irrelevant status. This one node gave a Weblogic command-line response which seems to mean a botched connection but one where something was listening. The not listening or wrong address error is:

Failed to connect to t3://$HOSTNAME:$PORT: Destination unreachable; nested exception is: java.net.ConnectException: Connection refused; No available router to destination.)

Yay, another case to figure out to correctly handle.

I killed the process. For safe measure I did a “touch REFRESH” in $WL_DOMAIN so the node would dump anything it had cached and download new these things. Since it started up this morning in the normal restart, I think it is fixed.

Other than JMS migrating and failing to do so, I don’t think this caused any problems for users. Just so very odd.

P.S. weblogic.Admin is deprecated in Weblogic 9, so it is interesting Blackboard still makes use of it.