Weblogic

You are currently browsing articles tagged Weblogic.

Rather than depend on end users to accurately report the browser used, I look for the user-agent in the web server logs. (Yes, I know it can be spoofed. Power users would be trying different things to resolve their own issues not coming to us.)

Followers of this blog may recall I changed the Weblogic config.xml to record user agents to the webserver.log.

One trick I use is the double quotes in awk to identify just the user agent. This information is then sorting by name to count (uniq -c) how many of each is present. Finally, I sort again by number with the largest at the top to see which are the most common.

grep <term> webserver.log | awk -F\” ‘{print $2}’ | sort | uniq -c | sort -n -r

This is what I will use looking for a specific user. If I am looking at a wider range, such as the user age for hits on a page, then I probably will use the head command to look at the top 20.

A “feature” of this is getting the build (Firefox 3.011) rather than just the version (Firefox 3). For getting the version, I tend to use something more like this to count the found version out of the log.

grep <term> webserver.log | awk -F\” ‘{print $2}’ | grep -c ‘<version>’

I have yet to see many CE/Vista URIs with the names of web browsers. So these are the most common versions one would likely find (what to grep – name – notes):

  1. MSIE # – Microsoft Internet Explorer – I’ve seen 5 through 8 in the last few months.
  2. Firefox # – Mozilla Firefox – I’ve seen 2 through 3.5. There is enough difference between 3 and 3.5 (also 2 and 2.5) I would count them separately.
  3. Safari – Apple/WebKit – In searching for this one, I would add to the search a ‘grep -v Chrome’ or to eliminate Google Chrome user agents.
  4. Chrome # – Google Chrome – Only versions 1 and 2.

Naturally there many, many others. It surprised me to see iPhone and Android on the list.

I noticed one the nodes in a development cluster was down. So I started it again. The second start failed, so I ended up looking at logs to figure out why. The error in the WebCTServer.000000000.log said:

weblogic.diagnostics.lifecycle.DiagnosticComponentLifecycleException: weblogic.store.PersistentStoreException: java.io.IOException: [Store:280036]Missing the file store file “WLS_DIAGNOSTICS000001.DAT” in the directory “$VISTAHOME/./servers/$NODENAME/data/store/diagnostics”

So I looked to see if the file was there. It wasn’t.

I tried touching a file at the right location and starting it. Another failed start with a new error:

There was an error while reading from the log file.

So I tried copying to WLS_DIAGNOSTICS000002.DAT to WLS_DIAGNOSTICS000001.DAT and starting again. This got me a successful startup. Examination of the WLS files revealed the the 0 and 1 files have updated time stamps while the 2 file hasn’t changed since the first occurance of the error.

That suggests to me Weblogic is unaware of the 2 file and only aware of the 0 and 1 files. Weird.

At least I tricked the software into running again.

Some interesting discussion about these files.

  1. Apparently I could have just renamed the files. CONFIRMED
  2. The files capture JDBC diagnostic data. Maybe I need to look at the JDBC pool settings. DONE (See comment below)
  3. Apparently these files grow and add a new file when it reaches 2GB. Sounds to me like we should purge these files like we do logs. CONFIRMED
  4. There was a bug in a similar version causing these to be on by default.

Guess that gives me some work for tomorrow.
:(

Some support tickets are more easily solved by knowing both user behavior and environment. An often helpful piece of information is what web browser they used. To add this, shut down the cluster, edit /VISTA_HOME/config/config.xml with to include the cs(User-Agent), and start the cluster. This line will need to appear for every node. At startup, the nodes will download a new copy of the file.

<elf-fields>date time time-taken c-ip x-weblogic.servlet.logging.ELFWebCTSession sc-status cs-method cs-uri-stem cs-uri-query bytes cs(User-Agent) x-weblogic.servlet.logging.ELFWebCTExtras</elf-fields>

Command:
cp config.xml config.xml.bak
sed -s s/bytes x-/bytes cs(User-Agent) x-/g config.xml.bak > config.xml

Probably this could be edited in the Weblogic 9.2 console. I haven’t looked yet.

Let’s nevermind why I am working on this in the first place. Namely…

  1. the Blackboard Learning Environment Connector introduced using the hostname and port for applet URLs in Vista 8 Blackboard,
  2. Blackboard dropped WebCT’s support for using a different port for an application when behind a load balancer.
So we found out we could use port 443 as the SSL listen port because we terminate SSL on the load balancer, Weblogic would not bind to port 443, but the Vista application would be tricked into displaying to the end user what we wish.
In the past week, we have put the correct config.xml in place multiple times and found it reverts back to an older version with the port we don’t want. The first time, I was lazy and did not shut down the Weblogic admin server because… well… that was the lazy practice I had used in Weblogic 8.1 and had not had a problem. My shell record shows it was correct then. Within hours it wasn’t correct anymore.
So, we found a few things…
  1. a copy of the config.xml is stored WEBCTDOMAIN/servers/domain_bak/config_prev/,
  2. all files in WEBCTDOMAIN/config/ are pushed to the nodes,
  3. to change this value in the Weblogic console requires turning on a feature to bind to the SSL listen port.
Additionally, we think research into this would show Weblogic stores this information in memory. It will then write changes it makes to the file back to disk on the admin node (destroying our change). Managed nodes will then pick up the change.
The latest shot at this is to purge the #1 and #2 on both the admin server and managed nodes, put the right file in place on the admin nodes, and see if it reverts again.
So now I’ve got to write a script to periodically check if the nodes have the wrong listen port and email us should it change.

It has been a hectic week. A recap…

Java certificate fix – Yesterday, August 23rd, the certificate distributed in various Java applets expired. The community discovered the issue and informed Blackboard who put out a fix for the more current products on August 15th. Many customers are leery of having such little lead time to test, verify, and install a fix. Well, Vista 3.0.7.17 was also reported to have the problem, but Blackboard didn’t provide a fix until the 20th after I got my TSM to verify it really still is a problem on the 18th. (The corrected 3.0.7.17.8 version was provided August 21st. Why is in the next paragraph.)

The fix for Vista 3 required us to be on 3.0.7.17.8 (hotfix 8 which we had not yet applied), had references to the “webctapp” directory (in Vista 3 it is applications), and distributed a webct.sh script to add updateWar which didn’t work with Vista 3. FAIL. Thankfully we have modified War files in the past, so adding the updates was more work and accomplished before Blackboard provided a corrected version.

To see the Java certificates in Windows: Control Panel > Java > Security > Certificates. The Blackboard ones are verified by Thawte (the Certificate Authority). The old one is issued to Blackboard. The new one is issued to dc.blackboard.com.

Vista 3.0.7.17.8 – This hotfix was released a couple weeks ago. However, since the priority has been the migration to Vista 8, this was on hold. The previous problem made us step up and throw this into production. The testers went to heroic efforts to get this and the certifcate fix tested. Testing was mixed.

  1. Losing session cookie because of Office 2007 in Internet Explorer. Happened less often post fix, but still happens in some cases.
  2. Autosignon MAC2. Mode to allow insecure MAC works to give the one school using it time to correct update their portal to use MAC2. Originally the plan was to let them work out MAC2 in test.

Slammed by our users…

  1. systemIntegrationApi.dowebct – The school using the autosignon wanted to have the correct consortiaId to create the MAC. Some time back in January they started calling this any time users tried to login because a handful (guess was ~12) have had their username changed. So the autosignon failed. Yes, they were sent us 25,000 requests in a busy day (about 20% of the queues were working on these during the day) to handle potential 12 problems in a term. FAIL.
  2. pmSelfRegister.dowebt – One of the clusters started to have issues. Two nodes went crappy. I looked at the Weblogic console and found all of the failing nodes had no free spots in the queues. 90% of the queues were working on these. Much of this is because the requests were hanging around for at least 4800 seconds (an hour is 3600 seconds). At about 6000 seconds the cluster recovered when the queues cleared.I think the queues cleared because I changed to false a couple settings:
    • Allow users to register themselves as a Student in a section = false
    • Allow users to register themselves as an Auditor in a section = false

    As I recall, we only had about 22 queue spots open (out of 308) across the whole cluster. We got lucky.

John made a good point… While telling Blackboard about this is pointless, the community at large ought to be aware of another undocumented workspace issue. I found an 8GB .bak in the /u01/app/nodeA/weblogic81/webctbackup on the active JMS node. Taking out user accessible nodes is okay in my book as with 18-20 of them in our clusters, we can lose one and no client would ever know. Mail, chat, learning context administration and other services in CE/Vista fail without a functional JMS node.

An administrator did a template reassignment with “Force archive before template reassignment” set to true. For some reason the file was placed on the JMS node. It should have been deleted. However, it was not. I caught it in time as another large file was dropped within 10 minutes of me deleting the first. I only caught it time because I was at my desk working (not in meetings, at home, or asleep).

This came within one GB of completely filling up the file system. We do not have huge hard drives on these nodes, just 3 times the size we need except for this. Nor do we allow the nodes accrue a ton of logs or junk.

Maybe this is something Blackboard has resolved this for future versions like Vista 4 or 8. Maybe one day we will have official or unofficial documentation about this kind of stuff.

The answers I anticipate from Blackboard:

  1. This is functioning as designed. I bet composing the archive requires something from the JMS node, so it must reside there. The JVM is too small as is /var/tmp, so the file system is the best place.
  2. Use a bigger hard drive.
  3. Set “Force archive before template reassignment” to false.

Even if Blackboard agrees this is bad, then it might get fixed on Vista 8. Certainly it will not get fixed in the officially supported  Vista 3.
:(

If you want to confirm if you have the potential for this problem, then you should have a $NODENAME/weblogic81/webctbackup or a $NODENAME/weblogic92/webctbackup directory. We only have them on all four JMS nodes, but have have seen them on four (out of 76) other nodes. The other 72 nodes lack this directory. While you are at it, make sure you know about the other undocumented work spaces I have mentioned.
:)

On the WebCT Users email list (hosted by Blackboard) there is a discussion about a mysterious directory called unmarshall which suddenly appeared. We found it under similar circumstances as others by investigating why a node consumed so much disk space. Failed command-line restores end up in this unmarshall directory.

Unmarshalling in Java jargon means:

converting the byte-stream back to its original data or object 1

This suspiciously sounds like what a decryption process would use to convert a .bak file into a .zip so something can open the file.

This is fourth undocumented work space where failed files site for a while and cause problems and no forewarning from the vendor.

Previous ones are:

  1. Failed UI backups end up in the weblogic81 (Vista 3, does this still happen in Vista 8?) directory.
  2. Failed tracking data files end up in WEBCTDOMAIN/tracking (Vista 3, apparently no longer stored this way in Vista 4/8 according to CSU-Chico and Notre Dame)
  3. Web Services content ends up in /var/tmp/ and are named Axis####axis. These are caused by a bug in DIME (like MIME) for Apache Axis. No one is complaining about the content failing to arrive, so we presume the files just end up on the system.

#3 were the hardest to diagnose because of a lack of an ability to tie the data back to user activity.

Is this all there are? I need to do testing to see which of these I can cross off my list goring forward in Vista 8. Failed restores are on it indefinitely for now.
:(

References:

  1. http://www.jguru.com/faq/view.jsp?EID=560072

Yesterday a Blackboard Vista node went unresponsive but recovered. Then a second and third did the same. This was odd as only the test which logs into the node reported a failure. Normally, in severe unresponsive cases the Weblogic monitor either reports either one of the parameters we watch is very high or fails because it was unable to connect to the Weblogic.

The graphs for CPU showed 4 periods of sustained activity near 100% for longer than 10 minutes. In each case, when the CPU dropped there were multiple instances of editAssignmentSubmission.dowebct (not in the PowerSight tracked actions?) just completed. Normally when these are browser resubmissions the logged times are near identical but the time-to-process is about 300 seconds apart. In these cases the logged times and time-to-process are within seconds of each other.

I suspect this means the users (in this case students) are hitting the Submit button repeatedly. Maybe resources devoted to handling process0 are diverted to handling processes 1-5 and causing it to take longer not shorter.

Another possibility is the instructor is having the students upload something massive which takes forever to load into the database. Movies and music take up a ton of space. Maybe I could look at the graphs on storage space and see how quickly the space was consumed compared to control time frames (similar amount of activity).

.

Much of what I might write in these posts about Vista is knowledge accumulated from the efforts of my coworkers.

This is part two in a series of blog posts on our presentation at BbWorld ‘07, on the behalf of the Georgia VIEW project, Maintaining Large Vista Installations (2MB PPT).

Part one covered automation of Blackboard Vista 3 tasks. Next, let’s look at monitoring.

Several scripts we have written are in place to collect data. One of the special scripts connects to Weblogic on each node to capture data from several MBeans. Other scripts watch for problems with hardware, the operating system, database, and even login to Vista. Each server (node or database) has, I think, 30-40 monitors. A portion of items we monitor is in the presentation. Every level of our clusters are watched for issues. The data from these scripts are collected into two applications.

  1. Nagios sends us alerts when values from the monitoring scripts on specific criteria fall outside of our expectations. Green means good; yellow means warning; red means bad. Thankfully none in our group are colorblind. Nagios can also send email and pages for alerts. Finding the sweet spot where we get alerted for a problem but avoid false positives perhaps is the most difficult.
  2. An AJAX application two excellent members of our Systems group created called internallyl Stats creates graphs of the same monitored data. Nagios tells us a node failed a test. Stats tells us when the problem started, how long it lasted, and if others also displayed similar issues.We also can use stats to watch trends. For example, we know two peaks by watching WIO usage rise to a noonish peak slough by ~20% and peak again in the evening fairly consistently over weeks and months.

We also use AWStats to provide web server log summary data. Web server logs show activity of the users: where they go, how much, etc.

In summary, Nagios gives us a heads up there is a problem. Stats allows us to trend performance of nodes and databases. AWStats allows us to trend overall user activity.

Coradiant TrueSight was featured in the vendor area at BbWorld. This product looks promising for determining where users encounter issues. Blackboard is working with them, but I suspect its likely for Vista 4 and CE 6.

We have fantastic data. Unfortunately, interpreting the data proves more complex. Say the load on a server hosting a starts climbing, its the point we get pages and continues to climb. What does one do? Remove it from the cluster? Restart it? Restarting it will simply shift the work to another node in the cluster. Say the same happens with the database. Restarting the database will kick all the users out of Vista. Unfortunately, Blackboard does not provide a playbook on what to do with every support possibility. Also, if you ask three DBAs, then you will likely get three answers.
:D

Its important to balance the underreaction and overreaction. When things go wrong, people want us to fix the problem. Vista is capable of handling many faults and not handling very similar faults. The link example was a failed firewall upgrade. I took a similar tact with another firewall problem earlier this week. I ultimately had to restart the cluster that evening because it didn’t recover.

Part three will discuss the node types.

« Older entries