Why Ten

The question of why we run ten clusters came up recently. Off the top of my head, the answer was okay. Here is my more thoughtful response.

Whenever I have been in a conversation with a BEA (more recently Oracle) person on Weblogic, the number of nodes we run has invariably surprised them. Major banks serve ten times the number simultaneous users we have on a half dozen managed nodes or less. We have 130 managed nodes for production. Overkill?

There are some advantages they have.

  1. Better control over the application. WebCT hacked together an install process very much counter to the way BEA would have done it. BEA would have had one install the database, the web servers, and then deploy the application using either the console or command-line. WebCT created an installer which does all this in the background out of sight and mind of the administrator. They also created start and stop scripts which do command-line interaction of Weblogic to start the application. Great for automation and making it simple for administrators. It also lobotomies the console making many advanced things one could normally do risky. So now the console is only useful for some minor configuration management and monitoring.
  2. Better control over the code. When there is a performance issue, they can find what is the cause and improve the efficiency of the code. The best I can do is point out the inefficiencies to a company who chose as a priority a completely different codebase. If you do not have control over the code, then you give the code more resources.
  3. As good as Weblogic is at juggling multiple managed nodes, more nodes does not always equal better. Every node has to keep track of the others. The heart beats communicate through multicast. Every node sends out its own and listens for the same from all the others. Around twenty nodes they would miss occasional beats on their own. Thrown in a heavy work load and an overwhelmed node can miss enough missed beats it becomes marked as unavailable by the others. Usually at this point is when the monitors started paging me about strange values in the diagnostics. Reducing the number of nodes helped.

More resources means more nodes. We had two clusters with about 22 nodes (44 total) each when we hit a major performance wall. They were split into four clusters with 15 nodes each (60 total). Eventually these grew to over 22 nodes each again. At this point upgrading was out of the question. A complete overhaul with all new databases and web servers meant we could do whatever we wished.

The ideal plan was a cluster per client. Licenses being so expensive scrapped that plan.

Ten clusters with 13 managed nodes each was a reasonable compromise. More nodes while also using smaller clusters achieved both needs well. Empty databases also gave us a better restarting point. The databases still have grown to the point certain transactions run slowly just for 4 terms later. (I was hoping for 6.) Surviving the next two years will be a challenge to say the least. I wish we got bonuses for averting disasters.

Comments

Leave a Reply