Host Hopping Cookies

It started with a tweet on Saturday morning.

@MsIngalls: When I go to the Checklist in @Desire2Learn -when I am logged in – I get an error message that says my log in has expired – ideas? #d2l

This sounded like an issue we had in the WebCT Vista product that I called Failed Sessions. FSes occur when the user is actively working in the product and suddenly gets dumped to the login page. They are not to be confused with the Login Loop, which is providing the correct username and password but never getting a valid session. I hated working either issue because they were never repeatable. The problem could involve any piece of software that could somehow touch a cookie on the user’s computer. Occasionally they were the fault of Weblogic too.

I recommended Amy open a Desire2Learn ticket through our portal on behalf of the professor. I also started my own investigation.

First, I poked around in the logging database for errors involving the checklist. I found errors for different courses, but not for the one involved.

Second, I searched our load balancer logs for the id number of the course. That yielded plenty of data showing the problem.

I added these to the ticket and suggested capturing the HTTP headers should others be unable to repeat the issue. Of course, the support agent was not able to repeat it. The headers clearly showed the cookies were not sent.

The professor, of course, poked holes all through my suggestion of tracking down which of the many pieces of software was involved. Different software, hardware, networks, and browsers meant the cause was probably not something residing on the computers. But the issue definitely was still that all of these browsers, in a wide variety of places, chose not to send the cookies. This is also when he dropped the next bomb: the problem only occurs on links in a specific widget.

Checking the code behind the widget, I only saw simple absolute URLs. That made me shudder, because earlier this week absolute URLs in the login page for a development site put me in production without my being aware of it for several minutes.

PSA: Only use absolute URLs when sending a visitor to another web server. Say you are here, at www.ezrasf.com, and you want someone to see another of my blog entries. Drop http://www.ezrasf.com from the URL and start the path with / (a relative URL). Should I change the host name to blog.ezrasf.com or www.ezrafreelove.com, the link then has a better chance of still working.
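A minimal sketch of why the relative form survives a host name change, using Python's standard urljoin to resolve links the way a browser would. The page and entry paths here are made up for illustration.

```python
from urllib.parse import urljoin

# Hypothetical pages: the same blog page served under two host names.
page_on_www = "http://www.ezrasf.com/some/page"
page_on_blog = "http://blog.ezrasf.com/some/page"

# The same link written two ways (the path is invented for this example).
relative_link = "/another-entry"
absolute_link = "http://www.ezrasf.com/another-entry"


def resolve(page_url, link):
    """Resolve a link against the page it appears on, as a browser does."""
    return urljoin(page_url, link)


# The relative link follows whichever host served the page.
print(resolve(page_on_www, relative_link))
print(resolve(page_on_blog, relative_link))

# The absolute link always goes to the host name baked into it.
print(resolve(page_on_blog, absolute_link))
```

The first two calls stay on the host that served the page. The third jumps back to www.ezrasf.com no matter where the visitor started.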

It turns out the problem is that the professor used the pre-production host name for the web application. The widget’s absolute URL links used a different host name, the one for production. Both resolve to the same servers. But cookies are tied to a specific host name. So being logged into one host and following a link to the variant means the session is not valid at the variant.
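A rough sketch of the rule the browsers were following. It assumes the session cookie is host-only (set without a Domain attribute), in which case it is only sent back to the exact host name that set it. Both host names and the path below are hypothetical stand-ins, not our real ones.

```python
from urllib.parse import urlsplit


def browser_sends_cookie(cookie_host, request_url):
    """Host-only cookie rule: send only when the request host matches exactly."""
    request_host = urlsplit(request_url).hostname or ""
    return request_host.lower() == cookie_host.lower()


# Hypothetical host names that both resolve to the same servers.
logged_in_host = "preprod.example.edu"
widget_link = "http://learn.example.edu/checklist"  # absolute URL in the widget
relative_target = "http://preprod.example.edu/checklist"  # where a relative link lands

print(browser_sends_cookie(logged_in_host, widget_link))
print(browser_sends_cookie(logged_in_host, relative_target))
```

The first check is False, which is the "No Login" case: the servers are the same, but the browser has no cookie for that name. The second is True because a relative link never leaves the host the user logged into.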

At least the workaround and fix are easy.

The workaround for the professor is to stop using the pre-production URL.

The fix is for the widget designer to turn the absolute URLs into relative URLs, since they point to the same location.

Also, it would be nice to have a better error message than:

No Login

Either you have failed to login, or your login has expired.

First, the comma is bad grammar. Second, if I am a normal user who encounters this problem, what can I do to fix it myself? This is not the error someone sees if their password or username is wrong. It is also not what a user normally sees when idle too long. But then again, there are lots and lots of potential causes and solutions.

Big Bad Blip

I was at lunch last week when I saw pages about failed monitoring checks on one of our sites. My coworkers were working on CE/Vista SP6 upgrades, though this site was one upgraded yesterday. When I returned to the office, I asked about it. Exactly 24 hours to the second after checking the license in yesterday’s final start, the JMS node failed a license check four times about a minute apart. On the fourth failure, it started a shutdown of the node. Others in the cluster did as well.

Fortunately, a coworker caught it soon enough to start them again, so not enough nodes were down for the load balancer to stop sending us traffic. Also, this was between terms, so we did not have a normal workload.

Still, JMS migrated. That made Weblogic edit the config.xml and probably left the cluster in a weird state. So I set cron to shut down the cluster at 4am, copy a known good config.xml into place, check the config with our monitor script (which pages if it is bad), and start the cluster. That was a disaster. Various nodes failed their early ... The startup started the admin node, but the JMS node failed to start. So I was paged about it still being down when it ought to have been running.

My 6:30 am starts failed for the same reason: a bad encrypted password in boot.properties. My only idea for fixing this came from a coworker who had mentioned having to re-install an admin node for a security error. So I called the coworker. I explained the problem and the solution I really did not want to take. She looked at the error and thought about it some. She decided it might work to replace the boot.properties with an unencrypted version, because Weblogic would encrypt it when it discovered the file. She also suggested removing the servers directory and placing a REFRESH file, which would prompt the node to download a new copy of the files it needs from the admin node.

That got the nodes to start correctly. It was fine during the normal maintenance on Friday. Looks like we are in the clear.

That afternoon I brought it up on our normal check-in call with Blackboard. An “unable to find license file” issue was why Blackboard pulled CE/Vista SP4. It was also a Weblogic upgrade.

Session Oddities

One of the clients we host complained about losing their session. Blackboard recommended we switch how our load balancer is handling the session persistence. Before agreeing to do that, we decided to use Blackboard’s script to determine if there is a problem before trying to fix something which may or may not exist.

An acceptable share of sessions showing on multiple nodes of a cluster is less than 5%. When I ran the test, I found 35.8% matched this criterion. But wait just a second, this seemed like an extraordinarily high number. I ran a second test for an identically configured cluster on the same hardware and found only 4.3%. Why are these so different?

Most cases of this “duplicated session” I spot checked were 1 hit for autosignon on another node. Blackboard confirmed these happen before the user has logged in, so they could appear on the other node. So I ran the test again ignoring these autosignon requests and found we were down to 7.2%. Close to acceptable but not quite.

Similar to autosignon, the editonpro.js appeared in the majority of the cases I spot checked as the sole hit on another node. Once I removed those from the test, I was down to 0.7%. My control cluster was down to 1.4%.
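Here is a small sketch of the kind of calculation I mean. It is not Blackboard’s script. It assumes the load balancer log has already been boiled down to (session id, node, request path) tuples, and the sample paths and node names are made up.

```python
from collections import defaultdict

# Requests that can legitimately land on another node (known false positives).
IGNORED_MARKERS = ("autosignon", "editonpro.js")


def duplicate_session_percent(hits, ignored=()):
    """Percent of sessions seen on more than one node, skipping ignored requests."""
    nodes_by_session = defaultdict(set)
    for session_id, node, path in hits:
        if any(marker in path for marker in ignored):
            continue  # do not let a false-positive hit count as a second node
        nodes_by_session[session_id].add(node)
    if not nodes_by_session:
        return 0.0
    duplicated = sum(1 for nodes in nodes_by_session.values() if len(nodes) > 1)
    return 100.0 * duplicated / len(nodes_by_session)


# Hypothetical sample: four sessions, only session C truly hops nodes.
sample_hits = [
    ("A", "node1", "/course/page"), ("A", "node2", "/autosignon"),
    ("B", "node1", "/course/page"),
    ("C", "node1", "/course/page"), ("C", "node2", "/course/quiz"),
    ("D", "node1", "/course/page"), ("D", "node2", "/js/editonpro.js"),
]

print(duplicate_session_percent(sample_hits))
print(duplicate_session_percent(sample_hits, IGNORED_MARKERS))
```

On this sample the raw figure is 75% and the filtered figure is 25%. One design note: a session whose only hits are ignored requests drops out of the total entirely, which seemed like the right call for something that never really logged in.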

One would hope the script used to determine the number of duplicate sessions would ignore, or remove from the data set, the known false positive log entries.

One would also hope the script instructions (requires login to Blackboard help site) would help users account for these false positives. I did leave a comment on the instructions to hopefully help the next person who has to do this.

Forcing Weblogic’s Config.xml

Let’s never mind why I am working on this in the first place. Namely…

  1. the Blackboard Learning Environment Connector introduced using the hostname and port for applet URLs in Vista 8 Blackboard,
  2. Blackboard dropped WebCT’s support for using a different port for an application when behind a load balancer.

So we found out we could use port 443 as the SSL listen port. Because we terminate SSL on the load balancer, Weblogic would not bind to port 443, but the Vista application would be tricked into displaying to the end user what we wish.

In the past week, we have put the correct config.xml in place multiple times and found it reverts to an older version with the port we don’t want. The first time, I was lazy and did not shut down the Weblogic admin server because… well… that was the lazy practice I had used in Weblogic 8.1 without a problem. My shell record shows it was correct then. Within hours it wasn’t correct anymore.

So, we found a few things…

  1. a copy of the config.xml is stored in WEBCTDOMAIN/servers/domain_bak/config_prev/,
  2. all files in WEBCTDOMAIN/config/ are pushed to the nodes,
  3. to change this value in the Weblogic console requires turning on a feature to bind to the SSL listen port.

Additionally, we think research into this would show Weblogic stores this information in memory. It will then write changes it makes to the file back to disk on the admin node (destroying our change). Managed nodes will then pick up the change.

The latest shot at this is to purge #1 and #2 on both the admin server and the managed nodes, put the right file in place on the admin node, and see if it reverts again.

So now I’ve got to write a script to periodically check whether the nodes have the wrong listen port and email us should it change.
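A first rough sketch of that check, not the finished script. It assumes each server in config.xml is a server element holding a name and an ssl element with a listen-port child, and it ignores XML namespaces so the exact schema version does not matter. The sample XML and its namespace are invented for illustration. The emailing is left out; printing from cron and letting cron mail the output would be one simple option.

```python
import xml.etree.ElementTree as ET

EXPECTED_SSL_PORT = "443"  # the port we want Vista to show to end users


def local_name(tag):
    """Strip any '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def wrong_ssl_ports(config_text, expected=EXPECTED_SSL_PORT):
    """Return (server name, ssl listen port) pairs that do not match expected."""
    root = ET.fromstring(config_text)
    problems = []
    for server in root.iter():
        if local_name(server.tag) != "server":
            continue
        name, port = None, None
        for child in server:
            tag = local_name(child.tag)
            if tag == "name":
                name = (child.text or "").strip()
            elif tag == "ssl":
                for item in child:
                    if local_name(item.tag) == "listen-port":
                        port = (item.text or "").strip()
        if port != expected:  # a missing port is reported as None
            problems.append((name, port))
    return problems


# Invented sample in the assumed layout; the namespace is a placeholder.
SAMPLE_CONFIG = """
<domain xmlns="http://example.invalid/domain">
  <server><name>node1</name><ssl><listen-port>443</listen-port></ssl></server>
  <server><name>node2</name><ssl><listen-port>8443</listen-port></ssl></server>
</domain>
"""

for name, port in wrong_ssl_ports(SAMPLE_CONFIG):
    print(f"{name}: ssl listen port is {port}, expected {EXPECTED_SSL_PORT}")
```

Against the real file, the text would come from reading WEBCTDOMAIN/config/config.xml on each node. If the real layout differs from what is assumed here, the element names in the loop are the part to adjust.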
