Comment Spam Resumes

Have spammers figured out how to pick reCAPTCHA’s lock? All of a sudden, Akismet is blocking hundreds of spam comments. When I added reCAPTCHA, spam dropped to a few a month. Now it is 409 in a week.

Guess this is why layers of security are good.

UPDATE: Scanned through for false positives. The first word of many of them was a Xanth character: Bink, Chameleon, Dolph, Iris, Smash, Goldy, Grundy, Cherie, Chester, Roogna, Imbri.

TED Talk: Picking apart the puzzle of racism in elections

By Nate Silver

A less than convincing point… The list of states whose voters reported a racial bias matches the Obama-Clinton difference map only because Nate draws the audience’s attention to the states he’s picking on: Arkansas, Louisiana, Tennessee, Kentucky, and West Virginia (5 hits). He ignores that the strong racial bias reported in South Carolina, Alaska, Missouri, and Indiana didn’t translate into more votes against Obama (4 false negatives). Also, Wyoming and Oklahoma both reported no racial bias yet voted more heavily against Obama (2 false positives).
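The tally above amounts to a small confusion matrix. A sketch of the count, using the state lists from the post (the hit/miss labels are my own framing, not Nate’s):

```python
# States with reported racial bias, split as described above.
hits = {"AR", "LA", "TN", "KY", "WV"}             # bias reported, voted against Obama
reported_bias = hits | {"SC", "AK", "MO", "IN"}   # all states with reported bias
false_negatives = reported_bias - hits            # bias reported, no vote shift
false_positives = {"WY", "OK"}                    # no reported bias, yet voted against Obama

print(len(hits), len(false_negatives), len(false_positives))  # 5 4 2
```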

Session Oddities

One of the clients we host complained about losing their session. Blackboard recommended we switch how our load balancer is handling the session persistence. Before agreeing to do that, we decided to use Blackboard’s script to determine if there is a problem before trying to fix something which may or may not exist.

An acceptable rate of sessions showing on multiple nodes of a cluster is less than 5%. When I ran the test, I found 35.8% met this criterion. But wait just a second: this seemed like an extraordinarily high number. I ran a second test on an identically configured cluster on the same hardware and found only 4.3%. Why are these so different?

Most cases of this “duplicated session” I spot checked were a single autosignon hit on another node. Blackboard confirmed these happen before the user has logged in, so they could appear on the other node. So I ran the test again, ignoring these autosignon requests, and found we were down to 7.2%. Close to acceptable, but not quite.

Similar to autosignon, editonpro.js appeared in the majority of the cases I spot checked as the sole hit on another node. Once I removed those from the test, I was down to 0.7%. My control cluster was down to 1.4%.
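A minimal sketch of the kind of filtering described above, assuming a simple (session, node, path) log layout — the entries and field format here are invented for illustration, not Blackboard’s actual script:

```python
from collections import defaultdict

# Hypothetical log entries: (session_id, node, request_path).
entries = [
    ("s1", "node1", "/webct/urw/page"),
    ("s1", "node2", "/webct/autosignon"),       # known pre-login false positive
    ("s2", "node1", "/webct/urw/page"),
    ("s2", "node2", "/webct/js/editonpro.js"),  # known false positive
    ("s3", "node1", "/webct/urw/a"),
    ("s3", "node2", "/webct/urw/b"),            # genuine duplicated session
]

IGNORE = ("autosignon", "editonpro.js")

nodes_per_session = defaultdict(set)
for session, node, path in entries:
    if any(token in path for token in IGNORE):
        continue  # drop the known false-positive requests before counting
    nodes_per_session[session].add(node)

duplicates = [s for s, nodes in nodes_per_session.items() if len(nodes) > 1]
pct = 100.0 * len(duplicates) / len(nodes_per_session)
print(duplicates, round(pct, 1))  # ['s3'] 33.3
```

With the two known false positives filtered out, only s3 counts as a session genuinely active on multiple nodes.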

One would hope the script used to determine the number of duplicate sessions would ignore, or remove from the data set, the known false-positive log entries.

One would also hope the script instructions (requires login to Blackboard help site) would help users account for these false positives. I did leave a comment on the instructions to hopefully help the next person who has to do this.

BbWorld Presentation Redux Part II – Monitoring

Much of what I might write in these posts about Vista is knowledge accumulated from the efforts of my coworkers.

This is part two in a series of blog posts on our presentation at BbWorld ’07, on the behalf of the Georgia VIEW project, Maintaining Large Vista Installations (2MB PPT).

Part one covered automation of Blackboard Vista 3 tasks. Next, let’s look at monitoring.

We have written several scripts to collect data. One of them connects to Weblogic on each node to capture data from several MBeans. Other scripts watch for problems with the hardware, the operating system, the database, and even logging in to Vista. Each server (node or database) has, I think, 30-40 monitors. A portion of the items we monitor is in the presentation. Every level of our clusters is watched for issues. The data from these scripts are collected into two applications.

  1. Nagios sends us alerts when values from the monitoring scripts fall outside our expectations for specific criteria. Green means good; yellow means warning; red means bad. Thankfully, no one in our group is colorblind. Nagios can also send email and pages for alerts. The most difficult part is finding the sweet spot where we get alerted for real problems but avoid false positives.
  2. An AJAX application created by two excellent members of our Systems group, internally called Stats, graphs the same monitored data. Nagios tells us a node failed a test. Stats tells us when the problem started, how long it lasted, and whether other nodes displayed similar issues. We also use Stats to watch trends. For example, we know WIO usage rises to a noonish peak, sloughs off by ~20%, and peaks again in the evening, fairly consistently over weeks and months.
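The green/yellow/red alerting above follows the standard Nagios plugin convention, which can be sketched as a tiny check function (the metric and thresholds here are illustrative, not our actual values):

```python
def check(load_pct, warn=70.0, crit=90.0):
    """Return (exit_code, message) per Nagios plugin conventions:
    exit 0 = OK (green), 1 = WARNING (yellow), 2 = CRITICAL (red)."""
    if load_pct >= crit:
        return 2, f"CRITICAL - load {load_pct}%"
    if load_pct >= warn:
        return 1, f"WARNING - load {load_pct}%"
    return 0, f"OK - load {load_pct}%"

print(check(55.0))  # (0, 'OK - load 55.0%')
print(check(95.0))  # (2, 'CRITICAL - load 95.0%')
```

Tuning the `warn` and `crit` values per metric is exactly the “sweet spot” problem: too tight and we drown in false alerts, too loose and we miss real failures.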

We also use AWStats to provide web server log summary data. Web server logs show activity of the users: where they go, how much, etc.

In summary, Nagios gives us a heads up there is a problem. Stats allows us to trend performance of nodes and databases. AWStats allows us to trend overall user activity.

Coradiant TrueSight was featured in the vendor area at BbWorld. This product looks promising for determining where users encounter issues. Blackboard is working with them, but I suspect it’s likely for Vista 4 and CE 6.

We have fantastic data. Unfortunately, interpreting the data proves more complex. Say the load on a server hosting a node starts climbing, passes the point where we get paged, and continues to climb. What does one do? Remove it from the cluster? Restart it? Restarting it will simply shift the work to another node in the cluster. Say the same happens with the database. Restarting the database will kick all the users out of Vista. Unfortunately, Blackboard does not provide a playbook for every support possibility. Also, if you ask three DBAs, you will likely get three answers.

It’s important to balance underreaction and overreaction. When things go wrong, people want us to fix the problem. Vista is capable of handling some faults yet failing to handle very similar ones. The linked example was a failed firewall upgrade. I took a similar tack with another firewall problem earlier this week; I ultimately had to restart the cluster that evening because it didn’t recover.

Part three will discuss the node types.

False Positives

It’s often argued that the high false positive rate proves the system is poorly run or even useless. This is not necessarily the case. In running a system like this, we necessarily trade off false positives against false negatives. We can lower either kind of error, but doing so will increase the other kind. The optimal policy will balance the harm from false positives against the harm from false negatives, to minimize total harm. If the consequences of a false positive are relatively minor…, but the consequences of a false negative are much worse…, then the optimal choice is to accept many false positives in order to drive the false negative rate way down. In other words, a high false positive rate is not by itself a sign of bad policy or bad management. You can argue that the consequences of error are not really so unbalanced, or that the tradeoff is being made poorly, but your argument can’t rely only on the false positive rate.
— Ed Felten — Why So Many False Positives on the No-Fly List?
(Bolding my own.)

This quote is about the No Fly List, whose purpose is to help the airlines identify who is not allowed to fly. False positives have come up at work lately in the context of catching “bad people”. In our case, great differences of opinion exist about whether the false positives are relatively minor. We all agree that false negatives are very bad.

A concern about the false positives is that a lot of time and resources are spent investigating the possibles, only to determine they are not really a “bad person”. The more false positives we get, the more we doubt the usefulness of the tools we have to identify “bad people”.
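Felten’s tradeoff can be made concrete with a toy expected-harm calculation. Every cost and rate curve below is invented purely for illustration; the point is only that when false negatives are far more costly than false positives, the harm-minimizing threshold tolerates many false positives:

```python
FP_COST = 1.0    # cost of investigating one false positive (toy number)
FN_COST = 500.0  # cost of one missed "bad person" (toy number)

def expected_harm(threshold):
    # Toy rate curves: a looser threshold flags more people,
    # giving more false positives but fewer false negatives.
    fp_rate = 1.0 - threshold   # fraction of innocents flagged
    fn_rate = threshold ** 2    # fraction of real cases missed
    return fp_rate * FP_COST + fn_rate * FN_COST

# Search a grid of thresholds for the one minimizing total harm.
best = min((t / 100 for t in range(101)), key=expected_harm)
print(best)  # 0.0 — with these toy numbers, flag nearly everyone
```

The same arithmetic run with a small FN_COST would push the optimum the other way, which is Felten’s point: the false positive rate alone tells you nothing without the relative costs.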