Weblogic

Let’s never mind why I am working on this in the first place. Namely…

  1. the Blackboard Learning Environment Connector introduced using the hostname and port for applet URLs in Blackboard Vista 8,
  2. Blackboard dropped WebCT’s support for using a different port for an application when behind a load balancer.
So we found out we could use port 443 as the SSL listen port. Because we terminate SSL on the load balancer, Weblogic would not bind to port 443, but the Vista application would be tricked into displaying to the end user what we wish.
In the past week, we have put the correct config.xml in place multiple times and found it reverts to an older version with the port we don’t want. The first time, I was lazy and did not shut down the Weblogic admin server because… well… that was the lazy practice I had used in Weblogic 8.1 without a problem. My shell record shows the file was correct then. Within hours it wasn’t correct anymore.
So, we found a few things…
  1. a copy of the config.xml is stored in WEBCTDOMAIN/servers/domain_bak/config_prev/,
  2. all files in WEBCTDOMAIN/config/ are pushed to the nodes,
  3. to change this value in the Weblogic console requires turning on a feature to bind to the SSL listen port.
Additionally, we think research into this would show Weblogic stores this information in memory. It will then write changes it makes to the file back to disk on the admin node (destroying our change). Managed nodes will then pick up the change.
The latest shot at this is to purge #1 and #2 on both the admin server and the managed nodes, put the right file in place on the admin node, and see if it reverts again.
So now I’ve got to write a script to periodically check if the nodes have the wrong listen port and email us should it change.
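
A minimal sketch of what such a check might look like, in Python. It assumes the config.xml keeps each server's SSL port in a listen-port element inside an ssl element, next to the server's name element; that layout, the function names, and the mail addresses are my assumptions for illustration, so compare against a real config.xml before trusting it.

```python
import smtplib
import xml.etree.ElementTree as ET
from email.message import EmailMessage


def local_name(tag):
    """Strip any '{namespace}' prefix ElementTree puts on a tag."""
    return tag.rsplit("}", 1)[-1]


def ssl_listen_ports(config_xml_text):
    """Return {server name: SSL listen port} for servers that define one.

    Assumed layout: <server><name>..</name><ssl><listen-port>..</listen-port></ssl></server>
    """
    ports = {}
    root = ET.fromstring(config_xml_text)
    for server in root.iter():
        if local_name(server.tag) != "server":
            continue
        name, port = None, None
        for child in server:
            if local_name(child.tag) == "name":
                name = (child.text or "").strip()
            elif local_name(child.tag) == "ssl":
                for item in child:
                    text = (item.text or "").strip()
                    if local_name(item.tag) == "listen-port" and text.isdigit():
                        port = int(text)
        if name is not None and port is not None:
            ports[name] = port
    return ports


def wrong_ports(ports, expected=443):
    """Return only the servers whose SSL listen port is not the expected one."""
    return {name: port for name, port in ports.items() if port != expected}


def build_alert(bad, sender, recipient):
    """Build (but do not send) an email listing the servers with the wrong port."""
    msg = EmailMessage()
    msg["Subject"] = "SSL listen port reverted on %d server(s)" % len(bad)
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content("\n".join("%s: %d" % item for item in sorted(bad.items())))
    return msg


def send_alert(msg, smtp_host="localhost"):
    """Hand the message to a local mail relay (hypothetical host)."""
    with smtplib.SMTP(smtp_host) as smtp:
        smtp.send_message(msg)
```

Run from cron on the admin node, this would read WEBCTDOMAIN/config/config.xml, pass the text to ssl_listen_ports, and only call send_alert when wrong_ports returns something.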

It has been a hectic week. A recap…

Java certificate fix - Yesterday, August 23rd, the certificate distributed in various Java applets expired. The community discovered the issue and informed Blackboard, who put out a fix for the more current products on August 15th. Many customers are leery of having so little lead time to test, verify, and install a fix. Vista 3.0.7.17 was also reported to have the problem, but Blackboard didn’t provide a fix until the 20th, after I got my TSM to verify on the 18th that it really still is a problem. (The corrected 3.0.7.17.8 version was provided August 21st. Why is in the next paragraph.)

The fix for Vista 3 required us to be on 3.0.7.17.8 (hotfix 8, which we had not yet applied), had references to the “webctapp” directory (in Vista 3 it is applications), and distributed a webct.sh script to add updateWar, which didn’t work with Vista 3. FAIL. Thankfully we have modified WAR files in the past, so adding the updates was more work but was accomplished before Blackboard provided a corrected version.

To see the Java certificates in Windows: Control Panel > Java > Security > Certificates. The Blackboard ones are verified by Thawte (the Certificate Authority). The old one is issued to Blackboard. The new one is issued to dc.blackboard.com.

Vista 3.0.7.17.8 - This hotfix was released a couple of weeks ago. However, since the priority has been the migration to Vista 8, it was on hold. The previous problem made us step up and throw it into production. The testers went to heroic efforts to get this and the certificate fix tested. The test results were mixed.

  1. Losing session cookie because of Office 2007 in Internet Explorer. Happened less often post fix, but still happens in some cases.
  2. Autosignon MAC2. The mode to allow insecure MAC works, which gives the one school using it time to update their portal to use MAC2. Originally the plan was to let them work out MAC2 in test.

Slammed by our users…

  1. systemIntegrationApi.dowebct - The school using the autosignon wanted to have the correct consortiaId to create the MAC. Some time back in January they started calling this any time users tried to log in because a handful (the guess was ~12) have had their username changed, which made the autosignon fail. Yes, they sent us 25,000 requests in a busy day (about 20% of the queues were working on these during the day) to handle a potential 12 problems in a term. FAIL.
  2. pmSelfRegister.dowebct - One of the clusters started to have issues. Two nodes went crappy. I looked at the Weblogic console and found all of the failing nodes had no free spots in the queues. 90% of the queues were working on these requests. Much of this is because the requests were hanging around for at least 4800 seconds (an hour is 3600 seconds). At about 6000 seconds the cluster recovered when the queues cleared. I think the queues cleared because I changed a couple of settings to false:
    • Allow users to register themselves as a Student in a section = false
    • Allow users to register themselves as an Auditor in a section = false

    As I recall, we only had about 22 queue spots open (out of 308) across the whole cluster. We got lucky.
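
Numbers like the 25,000 requests a day can be pulled from the web server access logs. Here is a rough sketch in Python that tallies requests per .dowebct endpoint. It assumes log lines carry a quoted request in the usual "METHOD path HTTP/x.y" form; the function name and the sample paths are made up for illustration.

```python
import re
from collections import Counter

# Matches the quoted request portion of an access log line,
# e.g. "GET /webct/pmSelfRegister.dowebct?x=1 HTTP/1.1"
REQUEST_RE = re.compile(r'"[A-Z]+ (\S+) HTTP/[\d.]+"')


def endpoint_counts(lines):
    """Count requests per *.dowebct endpoint, ignoring query strings."""
    counts = Counter()
    for line in lines:
        match = REQUEST_RE.search(line)
        if not match:
            continue  # not a request line we understand
        path = match.group(1).split("?", 1)[0]
        endpoint = path.rsplit("/", 1)[-1]
        if endpoint.endswith(".dowebct"):
            counts[endpoint] += 1
    return counts
```

Feeding it a day of log lines and calling most_common() on the result puts the noisiest endpoints at the top, which is the kind of view that would have shown these two calls crowding out everything else.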

webctbackup

John made a good point… While telling Blackboard about this is pointless, the community at large ought to be aware of another undocumented workspace issue. I found an 8GB .bak in /u01/app/nodeA/weblogic81/webctbackup on the active JMS node. Taking out user accessible nodes is okay in my book: with 18-20 of them in our clusters, we can lose one and no client would ever know. The JMS node is different. Mail, chat, learning context administration and other services in CE/Vista fail without a functional JMS node.

An administrator did a template reassignment with “Force archive before template reassignment” set to true. For some reason the file was placed on the JMS node. It should have been deleted. However, it was not. I caught it in time, as another large file was dropped within 10 minutes of me deleting the first. I only caught it in time because I was at my desk working (not in meetings, at home, or asleep).

This came within one GB of completely filling up the file system. We do not have huge hard drives on these nodes, just 3 times the size we need except for this. Nor do we allow the nodes to accrue a ton of logs or junk.

Maybe this is something Blackboard has resolved for future versions like Vista 4 or 8. Maybe one day we will have official or unofficial documentation about this kind of stuff.

The answers I anticipate from Blackboard:

  1. This is functioning as designed. I bet composing the archive requires something from the JMS node, so it must reside there. The JVM is too small as is /var/tmp, so the file system is the best place.
  2. Use a bigger hard drive.
  3. Set “Force archive before template reassignment” to false.

Even if Blackboard agrees this is bad, it might get fixed in Vista 8. Certainly it will not get fixed in the officially supported Vista 3.
:(

If you want to confirm whether you have the potential for this problem, look for a $NODENAME/weblogic81/webctbackup or a $NODENAME/weblogic92/webctbackup directory. We have them on all four JMS nodes, and have also seen them on four of the 76 other nodes. The remaining 72 nodes lack this directory. While you are at it, make sure you know about the other undocumented work spaces I have mentioned.
:)
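
For anyone who would rather script that check, here is a rough Python sketch. It looks under a node's home directory for the two webctbackup locations named above and lists any files over a size threshold. The function names and the threshold are mine, not anything Blackboard ships.

```python
import os


def webctbackup_dirs(node_home):
    """Return the webctbackup directories that exist under a node's home."""
    candidates = [
        os.path.join(node_home, version, "webctbackup")
        for version in ("weblogic81", "weblogic92")
    ]
    return [path for path in candidates if os.path.isdir(path)]


def large_files(directory, min_bytes):
    """Return (path, size) pairs for files of at least min_bytes, biggest first."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(directory):
        for filename in filenames:
            path = os.path.join(dirpath, filename)
            try:
                size = os.path.getsize(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if size >= min_bytes:
                found.append((path, size))
    return sorted(found, key=lambda item: item[1], reverse=True)
```

Calling large_files on each directory returned by webctbackup_dirs, with a threshold of something like 1024 ** 3 bytes, would have flagged the 8GB .bak long before the file system filled.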

On the WebCT Users email list (hosted by Blackboard) there is a discussion about a mysterious directory called unmarshall which suddenly appeared. We found it under similar circumstances as others by investigating why a node consumed so much disk space. Failed command-line restores end up in this unmarshall directory.

Unmarshalling in Java jargon means:

converting the byte-stream back to its original data or object 1

This sounds suspiciously like what a decryption process would use to convert a .bak file into a .zip so something can open the file.

This is the fourth undocumented work space where failed files sit for a while and cause problems, with no forewarning from the vendor.

Previous ones are:

  1. Failed UI backups end up in the weblogic81 directory. (Vista 3. Does this still happen in Vista 8?)
  2. Failed tracking data files end up in WEBCTDOMAIN/tracking. (Vista 3. Apparently they are no longer stored this way in Vista 4/8, according to CSU-Chico and Notre Dame.)
  3. Web Services content ends up in /var/tmp/ in files named Axis####axis. These are caused by a bug in DIME (like MIME) for Apache Axis. No one is complaining about the content failing to arrive, so we presume the files just end up on the system.

#3 was the hardest to diagnose because we had no way to tie the data back to user activity.

Is this all there is? I need to do testing to see which of these I can cross off my list going forward in Vista 8. Failed restores are on it indefinitely for now.
:(

References:

  1. http://www.jguru.com/faq/view.jsp?EID=560072

Much of what I might write in these posts about Vista is knowledge accumulated from the efforts of my coworkers.

This is part two in a series of blog posts on our presentation at BbWorld ‘07, on behalf of the Georgia VIEW project, Maintaining Large Vista Installations (2MB PPT).

Part one covered automation of Blackboard Vista 3 tasks. Next, let’s look at monitoring.

Several scripts we have written are in place to collect data. One of the special scripts connects to Weblogic on each node to capture data from several MBeans. Other scripts watch for problems with hardware, the operating system, the database, and even logging in to Vista. Each server (node or database) has, I think, 30-40 monitors. A portion of the items we monitor is in the presentation. Every level of our clusters is watched for issues. The data from these scripts is collected into two applications.

  1. Nagios sends us alerts when values from the monitoring scripts fall outside of our expectations on specific criteria. Green means good; yellow means warning; red means bad. Thankfully none in our group are colorblind. Nagios can also send email and pages for alerts. Finding the sweet spot where we get alerted for a problem but avoid false positives is perhaps the most difficult part.
  2. An AJAX application created by two excellent members of our Systems group, internally called Stats, creates graphs of the same monitored data. Nagios tells us a node failed a test. Stats tells us when the problem started, how long it lasted, and if others also displayed similar issues. We also can use Stats to watch trends. For example, we know there are two daily peaks from watching WIO usage rise to a noonish peak, slough off by ~20%, and peak again in the evening, fairly consistently over weeks and months.
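
The green/yellow/red states map onto the exit codes a Nagios plugin returns: 0 for OK, 1 for WARNING, 2 for CRITICAL, and 3 for UNKNOWN. Below is a bare-bones Python sketch of a threshold check in that style. It is not one of our actual monitors; the function names, the metric name, and the thresholds are placeholders.

```python
import sys

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def evaluate(value, warning, critical):
    """Map a measured value to a state; higher values are worse."""
    if value is None:
        return UNKNOWN  # the monitor could not read the metric
    if value >= critical:
        return CRITICAL
    if value >= warning:
        return WARNING
    return OK


def format_result(name, value, warning, critical):
    """Return (exit code, status line with performance data after the '|')."""
    code = evaluate(value, warning, critical)
    if code == UNKNOWN:
        return code, "%s - %s could not be read" % (LABELS[code], name)
    line = "%s - %s is %s | %s=%s;%s;%s" % (
        LABELS[code], name, value, name, value, warning, critical)
    return code, line


def run_plugin(name, value, warning, critical):
    """Print the status line and exit with the matching code, as a plugin does."""
    code, line = format_result(name, value, warning, critical)
    print(line)
    sys.exit(code)
```

Tuning the warning and critical arguments is where the sweet-spot problem lives: set them too low and the pager never stops, too high and the first alert arrives after the users have already noticed.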

We also use AWStats to provide web server log summary data. Web server logs show activity of the users: where they go, how much, etc.

In summary, Nagios gives us a heads up there is a problem. Stats allows us to trend performance of nodes and databases. AWStats allows us to trend overall user activity.

Coradiant TrueSight was featured in the vendor area at BbWorld. This product looks promising for determining where users encounter issues. Blackboard is working with them, but I suspect it’s likely for Vista 4 and CE 6.

We have fantastic data. Unfortunately, interpreting the data proves more complex. Say the load on a server hosting a node starts climbing, hits the point where we get pages, and continues to climb. What does one do? Remove it from the cluster? Restart it? Restarting it will simply shift the work to another node in the cluster. Say the same happens with the database. Restarting the database will kick all the users out of Vista. Unfortunately, Blackboard does not provide a playbook on what to do for every support possibility. Also, if you ask three DBAs, then you will likely get three answers.
:D

It’s important to strike a balance between underreaction and overreaction. When things go wrong, people want us to fix the problem. Vista is capable of handling many faults, yet fails to handle very similar faults. The linked example was a failed firewall upgrade. I took a similar tack with another firewall problem earlier this week. I ultimately had to restart the cluster that evening because it didn’t recover.

Part three will discuss the node types.