BbWorld Presentation Redux Part II – Monitoring

Much of what I might write in these posts about Vista is knowledge accumulated from the efforts of my coworkers.

This is part two in a series of blog posts on our presentation at BbWorld ’07, given on behalf of the Georgia VIEW project, Maintaining Large Vista Installations (2MB PPT).

Part one covered automation of Blackboard Vista 3 tasks. Next, let’s look at monitoring.

We have written several scripts that are in place to collect data. One of the special scripts connects to Weblogic on each node to capture data from several MBeans. Other scripts watch for problems with hardware, the operating system, the database, and even logging in to Vista. Each server (node or database) has, I think, 30-40 monitors. A portion of the items we monitor is in the presentation. Every level of our clusters is watched for issues. The data from these scripts are collected into two applications.

  1. Nagios sends us alerts when values from the monitoring scripts fall outside the criteria we have set for them. Green means good; yellow means warning; red means bad. Thankfully none in our group are colorblind. Nagios can also send email and pages for alerts. Finding the sweet spot where we get alerted for a problem but avoid false positives is perhaps the most difficult part.
  2. Stats, as we call it internally, is an AJAX application created by two excellent members of our Systems group. It graphs the same monitored data. Nagios tells us a node failed a test. Stats tells us when the problem started, how long it lasted, and whether other nodes displayed similar issues. We can also use Stats to watch trends. For example, we know there are two daily peaks because we have watched WIO usage rise to a noonish peak, slough off by ~20%, and peak again in the evening, fairly consistently over weeks and months.
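To make the green/yellow/red idea concrete, here is a minimal sketch of how a monitoring script might map a measured value to a Nagios state. This is not one of our actual scripts. The function names and the thresholds are made up for illustration; the only thing taken from Nagios itself is the plugin exit code convention (0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN).

```python
# Nagios plugin exit codes (standard plugin convention).
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def classify(value, warn, crit):
    """Map a measured value to a Nagios state, assuming higher is worse.

    `warn` and `crit` are hypothetical thresholds chosen by the admin.
    """
    if value is None:
        return UNKNOWN  # the script could not collect the value
    if value >= crit:
        return CRITICAL  # red
    if value >= warn:
        return WARNING  # yellow
    return OK  # green


def status_line(name, value, warn, crit):
    """Return (exit_code, one-line message) in the usual plugin style."""
    state = classify(value, warn, crit)
    return state, f"{name} {LABELS[state]} - value={value}"
```

A real check would print the message and exit with the code. Tuning `warn` and `crit` is exactly the sweet-spot problem: set them too low and the false positives pile up, set them too high and the page arrives late.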

We also use AWStats to provide web server log summary data. Web server logs show activity of the users: where they go, how much, etc.
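AWStats does the real work for us, but the kind of summary it produces can be sketched in a few lines. The example below is only an illustration and says nothing about how AWStats works internally. It assumes the web server writes Common Log Format, and the `summarize` function and its regular expression are my own invention.

```python
import re
from collections import Counter

# Common Log Format: host ident user [time] "METHOD path PROTO" status bytes
LINE = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+)[^"]*" (\d{3}) (\S+)')


def summarize(lines):
    """Count hits per requested path and per client host."""
    hits_by_path = Counter()
    hits_by_host = Counter()
    for line in lines:
        m = LINE.match(line)
        if m is None:
            continue  # skip lines that do not parse
        host, _time, _method, path, _status, _size = m.groups()
        hits_by_path[path] += 1
        hits_by_host[host] += 1
    return hits_by_path, hits_by_host
```

Calling `hits_by_path.most_common(10)` on the result gives a rough "where they go, how much" view.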

In summary, Nagios gives us a heads-up that there is a problem. Stats allows us to trend performance of nodes and databases. AWStats allows us to trend overall user activity.

Coradiant TrueSight was featured in the vendor area at BbWorld. This product looks promising for determining where users encounter issues. Blackboard is working with them, but I suspect it's likely for Vista 4 and CE 6.

We have fantastic data. Unfortunately, interpreting the data proves more complex. Say the load on a server hosting a node starts climbing, hits the point where we get pages, and continues to climb. What does one do? Remove it from the cluster? Restart it? Restarting it will simply shift the work to another node in the cluster. Say the same happens with the database. Restarting the database will kick all the users out of Vista. Unfortunately, Blackboard does not provide a playbook on what to do for every support possibility. Also, if you ask three DBAs, then you will likely get three answers.
😀

It's important to strike a balance between underreaction and overreaction. When things go wrong, people want us to fix the problem. Vista is capable of handling many faults, yet it can fail to handle very similar faults. The linked example was a failed firewall upgrade. I took a similar tack with another firewall problem earlier this week. I ultimately had to restart the cluster that evening because it didn’t recover.

Part three will discuss the node types.

Germane

Maybe it's intrinsic to human nature to seek our relevance. To our family. To our friends. To the world. We label those who fail to care about the impact of their behaviors on others as sociopaths. That is a bad thing in case you didn’t know.
🙂

I’ve heard people are happiest in jobs where what they do has meaning to the organization. These employees must feel germane to the organization to have satisfaction. Languishing in a job with no idea how what one is doing helps anyone engenders a feeling of uselessness. Maybe even paranoia about termination could arise. By contrast, knowing the organization completely depends upon every decision made by an individual dispels fear. So many people want to work for Google because Google makes software millions of people use. We provide facilities for thousands of students to conduct their higher education at my work. It's no Google, but I am content.

Mythology, cosmogony, cosmology, and especially religion help define for us where we are in the world and especially what we can do to improve the world around us. We can even find pertinence on the Internet. The popularity of blogs, I think, lies in two things: 1) hoping others find the posts useful in some way and 2) the pertinent comments others leave in feedback.

I think for me, personally, I have not done such a good job understanding my relevance to individuals in my life. Nor have I considered the relevance of other individuals to me. Has anyone systematically done this?

Overheard In The Office

DBA: Some of you use stage in all lower case, some in all upper case, and some in mixed case. You have to all use one naming convention. The way you have written this, it's not going to work.
Programmer: That is why I am here. I need you to fix their stuff so it will.
DBA: Do you know what a “Catch-22” is?
Programmer: No, but I have a feeling you are going to tell me.
DBA: It's when you do it one way which causes this to break. When you do it this other way it causes this other thing to break.
Programmer: Right, I just want to do it my way and it to work.
DBA: By breaking everyone else’s stuff?

Who are broadband users?

Go to the article for more. Summary is below.

Bush Broadband Goal Gored – US Broadband Penetration Breaks 70% Among Active Internet Users – Broadband Study Highlights Two-Speed Europe – May 2006 Bandwidth Report

President Bush’s goal of universal broadband access for all Americans by 2007 appears to be in doubt, according to a recent GAO report. Between 42% and 48% of online Americans subscribe to a broadband service, according to two surveys. Among active Internet users, US broadband penetration broke 70% for the first time in April 2006. In Europe, slow adoption among new member states has created a two-speed European Union.

Stop. Now reverse it…. If 48% have broadband, then that means 52% do not. You are more likely to have broadband if you are a college graduate with a good income.