10.2 Upgrade

Later this week both of our production systems will get upgraded from Desire2Learn 9.4.1 to 10.2. This will be an epic effort. Over several days a few dozen people will work on various aspects of the databases, application servers, and testing. Good ol’ clean IT fun just in time for Christmas. Finding the magical gap between classes and end of term was hard enough when our upgrades took a day. These multi-day ones are much tougher.

Back at FUSION a coworker and I attended a presentation by the team lead for D2L’s Automation group. Thankfully the excellent work from there will be used as part of the upgrade to identify problems earlier. Also, our client testing appears to be taking a leap forward toward automated testing. (We need to take a light-year leap forward with application-level monitoring.)

Six months of work will be complete.

Then we get to turn around and do it again for May.

Security Inside Out #USGRockEagle13

Eddie Carter and Orrin Char, Oracle

    • Identity management and security and access management.
    • Eddie wore a UGA shirt. Guy in front of me made fun of him obviously not wanting to sell to Georgia Tech. Turns out he’s from Kennesaw. The GT-UGA rivalry knows no bounds. Love it!
    • Handout: A database firewall is more about auditing and ACLs than an enterprise firewall, which controls access to many hosts.
    • 67% records breached from servers. 76% breached through weak or stolen credentials. Discovered by an external party. 97% preventable with basic controls. Source: 2013 Data Breach Investigations Report.
    • Pre-1997: security issues mistakes. 1998-2007: Privilege abuse. Curiosity. Leakage. 2008-2009: Malicious. Social engineering. Sophisticated attacks. Business data theft. Loss of reputation.
    • Can be fined. Buy services for people affected by the breach.
    • DBAs are the targets. Phishing to get credentials.
    • Change is where gaps are opened. Being more available means more highly privileged users. Consultants and vendors claim they need DBA level access.
    • 80% of IT security programs do not address db security. They address outside computers such as with firewalls. More and more attacks exploit legitimate access applications and user credentials.
    • Supports SQL Server and MySQL.
    • Preventative
      • encryption : If data is stolen in encrypted form, then do not have to report the breach? Application should not even know it is encrypted. Network encryption now free to us. Autonegotiates with destination. No application changes. Little overhead. Integrated with Oracle technologies. Key management has 2 layers. Master key in hardware module or in a wallet. Wallet can be tied to hardware and accessed at restart. Data encrypted with table or column key. Table and column keys encrypted with master key.
      • redaction : Use ACLs to determine who can see what. It replaces text such as credit card numbers and SSNs, so a user sees only a full, partial, or fixed redaction.
      • data masking for nonproduction use : copy of production data in test with test being less secure. Masking means the data is no longer valuable. Finds sensitive columns through templates and converts the data so it is meaningless. Shuffle salaries. ID numbers randomized, even partially. Randomize all but first two characters of last name. Can be two-way, so change the data for sending to a partner for processing but then revert it when returned.
      • privileged user controls : Compartmentalization of commands. Prevent consultants from querying certain tables. Creates protective zones around schema objects.
    • Detective
      • activity monitoring :
      • database firewall : sits on the network. Parses SQL to determine the intent. Whitelist and Blacklist and exception list. If none, then alerts security to it and potentially added to a list. Have a learning and blocking mode. Can return empty result list to a hacker so thinks there are no records.
      • auditing and reporting : analyze audit-event data. Central audit repository so hacker unaware. Default and custom reports.
      • conditional auditing framework : if-this-then-that
    • Administrative
      • privilege analysis : privilege capture mode. Reports on which privileges and roles are actually used. Revoke the unnecessary ones.
      • sensitive data discovery : scan Oracle for sensitive fields. data definitions.
      • configuration management : discover and classify databases. scan for secure config.
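
The data masking bullet above is the easiest of these to picture in code. This is only a minimal sketch of the two transformations mentioned (shuffling salaries, randomizing all but the first two characters of a last name). It is not Oracle's implementation, the row layout and function names are made up for illustration, and it is one-way only, so it does not cover the reversible masking mentioned in the talk.

```python
import random
import string


def shuffle_column(values, rng):
    """Return the same values in random order, so rows no longer match people."""
    shuffled = list(values)
    rng.shuffle(shuffled)
    return shuffled


def mask_last_name(name, rng, keep=2):
    """Keep the first `keep` characters and replace the remaining letters."""
    head, tail = name[:keep], name[keep:]
    masked_tail = "".join(
        rng.choice(string.ascii_lowercase) if ch.isalpha() else ch for ch in tail
    )
    return head + masked_tail


def mask_rows(rows, seed=None):
    """Mask a list of {'last_name', 'salary'} dicts (a hypothetical layout)."""
    rng = random.Random(seed)
    salaries = shuffle_column([row["salary"] for row in rows], rng)
    return [
        {"last_name": mask_last_name(row["last_name"], rng), "salary": salary}
        for row, salary in zip(rows, salaries)
    ]


# Made-up sample rows standing in for a copy of production data.
rows = [
    {"last_name": "Carter", "salary": 52000},
    {"last_name": "Char", "salary": 61000},
    {"last_name": "O'Neil", "salary": 47000},
]
masked = mask_rows(rows, seed=13)
```

The set of salaries is unchanged, so reports run against test still look realistic, but no salary can be tied back to a person.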

Automated Testing

On a call today, our new vendor asked that we verify every web site works before having them apply service packs. Our analyst said, “We can do that.” I pointed out the problem causing the present concern happened one in ten times on one site on one server of the instance. Therefore to catch it, they would need 10 views of the login page for 30 servers for each of 18 sites. That is 5,400 page views.
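
The arithmetic is worth spelling out, because 10 views per site does not even guarantee a catch. A quick sketch, assuming each page view fails independently with the one-in-ten rate from the call (the independence is my assumption, not something we measured):

```python
views_per_site = 10   # views of the login page per site per server
servers = 30
sites = 18

total_views = views_per_site * servers * sites

failure_rate = 0.1    # the problem showed up roughly one time in ten
# Chance that at least one of the views on an affected site/server fails.
p_catch = 1 - (1 - failure_rate) ** views_per_site

print(f"{total_views} page views, {p_catch:.0%} chance of catching the problem")
```

So even 5,400 page views would miss an affected site about a third of the time under that assumption.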

The conundrum came up because when the service pack was applied to test, some sites on one server failed this check. Over time they cleared and returned. We have monitoring in place to check a single site on each server works with a login and logout. This check is super-sensitive to changes. Originally this check was on a functional evaluation site, but it broke every other week because someone changed a color, icon, etc. That was with 7. With 111, we would go mad.

Clearly, I am going to have to develop automated testing to verify sites on each of their servers before and after service pack application. Too bad the vendor does not make sure everything works after they make changes to our systems.
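
A rough sketch of what that check could look like. The assumptions here are all mine: that each server can be reached directly by hostname, that the site is selected with the Host header, and that a healthy login page contains some marker text. The hostnames, the marker, and the function names are hypothetical.

```python
import urllib.request


def http_fetch(server, site, path="/", timeout=10):
    """Request `path` from one server, selecting the site with the Host header."""
    req = urllib.request.Request(f"http://{server}{path}", headers={"Host": site})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status, resp.read().decode("utf-8", errors="replace")


def run_checks(servers, sites, fetch=http_fetch, attempts=10, marker="Log In"):
    """Return a list of (server, site, attempt) tuples for every failed view."""
    failures = []
    for server in servers:
        for site in sites:
            for attempt in range(1, attempts + 1):
                try:
                    status, body = fetch(server, site)
                    ok = status == 200 and marker in body
                except Exception:
                    # Timeouts, refused connections, and HTTP errors all count.
                    ok = False
                if not ok:
                    failures.append((server, site, attempt))
    return failures
```

Run it once before the service pack and once after, then compare the two failure lists. Passing `fetch` as a parameter keeps the looping logic testable without touching a real server.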

Rock Eagle Debrief

GeorgiaVIEW

  1. SMART (Section Migration Archive and Restore Tool) created for us by the Georgia Digital Innovation Group seemed well received. I’m glad. DIG worked tirelessly on it on an absurdly short schedule.
  2. Information is strewn about in too many places. There isn’t one place to go for information. Instead, between Blackboard, VistaSWAT, and GeorgiaVIEW there are about 29. I am amazed I find information at all.
  3. Blackboard NG 9 is too tempting for some.
  4. Vista does DTD validation but not very well. We need to do XML validation before our XML files are run. As we do not control the source of these files and errors by those creating the files cause problems, we run them in test before running in production. I am thinking of something along the lines of validating the file, finding the errors, and reporting the problems in the file to the submitter. Also, it should do XML schema validation so we can ensure the data is as correct as possible before we load it.
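
Something like the following is what I have in mind for the pre-check in item 4. It is a minimal sketch using only the Python standard library, which checks well-formedness and a few required fields but does not do DTD or XML schema validation (a third-party library such as lxml would be needed for that). The `person` and `id` element names are placeholders, not the real format of our files.

```python
import xml.etree.ElementTree as ET


def check_xml(text, record_tag="person", required_children=("id",)):
    """Return a list of human-readable problems; an empty list means none found."""
    try:
        root = ET.fromstring(text)
    except ET.ParseError as exc:
        line, column = exc.position
        return [f"not well-formed at line {line}, column {column}: {exc}"]

    problems = []
    for index, record in enumerate(root.iter(record_tag), start=1):
        for child in required_children:
            node = record.find(child)
            if node is None or not (node.text or "").strip():
                problems.append(f"{record_tag} #{index}: missing or empty <{child}>")
    return problems


good = "<batch><person><id>42</id></person></batch>"
missing_id = "<batch><person><name>Pat</name></person></batch>"
broken = "<batch><person><id>42</person></batch>"

for sample in (good, missing_id, broken):
    print(check_xml(sample) or "ok")
```

The returned list is meant to go straight back to whoever submitted the file, so they can fix it before it ever runs in test.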
Yaketystats
  1. If you run *nix servers, then you need Yaketystats. I have been using it for 2 years. It revolutionized how I go about solving problems. If you are familiar with my Monitoring post, then this is the #2 in that post.
That is all for now. I am sure I will post more later.

Coradiant TrueSight

Several of us saw a demo of Coradiant Truesight yesterday (first mentioned in the BbWorld Monitoring post). I spent most of the demo trying to come up with the name Jeff Goldblum, as one of the team giving the demo had the voice and mannerisms of the actor’s characters. Had he mentioned a butterfly, then I definitely would have clapped. The other reminded me of John Hodgman.

Something I had not noticed at the time, but a recurring point of having Truesight is to tell our users, “Here is evidence the problem is on your end and not ours.” This assumes the users are rational or will even believe the evidence. Their first preference is that the problem never occurred; a resolution comes second. Preventing every problem, especially issues outside our domain, probably is outside the scope of the budget we receive. So, we are left with resolving the issues. Especially scary are the users who take evidence the problem is on their end or their ISP’s end to mean, “This is all your fault.”

Resolutions we can offer are:

  1. Hardware change – We can replace or alter the configuration of the hardware components of the network, storage, database, or application.
  2. Software change – We can alter the configuration of the software components of the network, storage, database, or application.
  3. Request a code change from a vendor – We can work with our vendors to get a code change. These take forever to implement.
  4. Suggest a user resolve the issue
    1. We can provide a work around (grudgingly accepted, remember the preferred wish is the problem never occurred).
    2. We suggest configuration changes the user can make to resolve the problem.

Truesight provides us information to help us try to resolve issues. Describing the information provided as “facts” was a nice touch. At Valdosta State, I gave up on users reporting their browsers accurately and captured the information from the User-Agent header. Similarly, at the USG, I’ve found what users report disagrees with the User-Agent string about the browser version ~30% of the time. Heck, they have errors in the name of the class ~40% of the time. My favorite is a report that something took 15 minutes when all I could find was that it took four minutes. Ugh. Because Truesight is capturing the header info, it ought to be much easier to confirm what users were doing and where problems occurred more accurately than the users can describe.
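
For what it is worth, comparing what a user reports against the User-Agent header is not much code. This is a simplified sketch, not what Truesight or our own scripts actually do. The patterns cover only a handful of browsers, real User-Agent strings are messier than this, and the function names are made up.

```python
import re

# Checked in order; Chrome comes before Safari because Chrome's string also
# mentions Safari.
PATTERNS = [
    ("Internet Explorer", re.compile(r"MSIE (\d+(?:\.\d+)*)")),
    ("Firefox", re.compile(r"Firefox/(\d+(?:\.\d+)*)")),
    ("Chrome", re.compile(r"Chrome/(\d+(?:\.\d+)*)")),
    ("Safari", re.compile(r"Version/(\d+(?:\.\d+)*).*Safari/")),
]


def browser_from_user_agent(user_agent):
    """Return (browser name, version) guessed from a User-Agent string."""
    for name, pattern in PATTERNS:
        match = pattern.search(user_agent)
        if match:
            return name, match.group(1)
    return "unknown", ""


def disagreement_rate(reports):
    """Fraction of (reported name, reported version, user agent) rows that
    disagree on the browser name or major version."""
    reports = list(reports)
    if not reports:
        return 0.0
    mismatches = 0
    for name, version, user_agent in reports:
        actual_name, actual_version = browser_from_user_agent(user_agent)
        same_name = name.strip().lower() == actual_name.lower()
        same_major = version.split(".")[0] == actual_version.split(".")[0]
        if not (same_name and same_major):
            mismatches += 1
    return mismatches / len(reports)
```

Feed it the help desk tickets alongside the logged headers and the disagreement percentage falls out directly.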

After receiving all the “facts”, we still have to determine the cause. Truesight helps us understand the scope of the problem by how many users, how many web servers, and how many pages are affected by slowness to what degree. As a DBA and administrator, my job identifying cause ought to be easier, though quantifying how much easier probably is difficult to say.

Part of why: (Mostly speculation.) Problems identified as a spike in anything other than “Host” are external causes. These are causes in front of the device. Causes behind the device are “Host”. If these were more narrowly broken down, then maybe we could better determine cause. That would require information web browsers typically would not have, like the server processing time, query processing time, or even the health of the servers.

tag: Blackboard Inc, Coradiant, user agent

BbWorld Presentation Redux Part II – Monitoring

Much of what I might write in these posts about Vista is knowledge accumulated from the efforts of my coworkers.

This is part two in a series of blog posts on our presentation at BbWorld ’07, on the behalf of the Georgia VIEW project, Maintaining Large Vista Installations (2MB PPT).

Part one covered automation of Blackboard Vista 3 tasks. Next, let’s look at monitoring.

Several scripts we have written are in place to collect data. One of the special scripts connects to Weblogic on each node to capture data from several MBeans. Other scripts watch for problems with hardware, the operating system, database, and even login to Vista. Each server (node or database) has, I think, 30-40 monitors. A portion of the items we monitor is in the presentation. Every level of our clusters is watched for issues. The data from these scripts is collected into two applications.

  1. Nagios sends us alerts when values from the monitoring scripts on specific criteria fall outside of our expectations. Green means good; yellow means warning; red means bad. Thankfully none in our group are colorblind. Nagios can also send email and pages for alerts. Finding the sweet spot where we get alerted for a problem but avoid false positives perhaps is the most difficult.
  2. An AJAX application two excellent members of our Systems group created, called Stats internally, creates graphs of the same monitored data. Nagios tells us a node failed a test. Stats tells us when the problem started, how long it lasted, and if others also displayed similar issues. We also can use Stats to watch trends. For example, we know there are two daily peaks from watching WIO usage rise to a noonish peak, slough off by ~20%, and peak again in the evening fairly consistently over weeks and months.
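
To make the green/yellow/red idea concrete, here is a minimal sketch of the threshold logic behind a Nagios-style check. The exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) are the standard Nagios plugin convention. The metric name and thresholds are made up and are not our real settings.

```python
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATE_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def classify(value, warn, crit):
    """Map a measured value to a Nagios state using warn/crit thresholds."""
    if value is None:
        return UNKNOWN          # the monitoring script could not get a reading
    if value >= crit:
        return CRITICAL         # red
    if value >= warn:
        return WARNING          # yellow
    return OK                   # green


def check_line(label, value, warn, crit):
    """Return (exit code, status line) the way a plugin would report it."""
    state = classify(value, warn, crit)
    return state, f"{STATE_NAMES[state]} - {label}={value}"


# Hypothetical thresholds for a load average check.
for sample in (1.2, 5.0, 9.7, None):
    print(check_line("load1", sample, warn=4.0, crit=8.0))
```

Finding the sweet spot mentioned above comes down to picking `warn` and `crit` so that real problems page us and ordinary peaks do not.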

We also use AWStats to provide web server log summary data. Web server logs show activity of the users: where they go, how much, etc.

In summary, Nagios gives us a heads up there is a problem. Stats allows us to trend performance of nodes and databases. AWStats allows us to trend overall user activity.

Coradiant TrueSight was featured in the vendor area at BbWorld. This product looks promising for determining where users encounter issues. Blackboard is working with them, but I suspect it’s likely for Vista 4 and CE 6.

We have fantastic data. Unfortunately, interpreting the data proves more complex. Say the load on a server hosting a node starts climbing, hits the point where we get pages, and continues to climb. What does one do? Remove it from the cluster? Restart it? Restarting it will simply shift the work to another node in the cluster. Say the same happens with the database. Restarting the database will kick all the users out of Vista. Unfortunately, Blackboard does not provide a playbook on what to do with every support possibility. Also, if you ask three DBAs, then you will likely get three answers.
😀

It’s important to strike a balance between underreaction and overreaction. When things go wrong, people want us to fix the problem. Vista is capable of handling many faults and not handling very similar faults. The linked example was a failed firewall upgrade. I took a similar tack with another firewall problem earlier this week. I ultimately had to restart the cluster that evening because it didn’t recover.

Part three will discuss the node types.


Terror Attack

Yay! A terrorist plot was foiled. I guess the fighting over in Iraq hasn’t exactly kept the terrorists from attacking the USA?

FBI disrupts New York City tunnel plot – Yahoo! News:

Authorities have disrupted planning by foreign terrorists for an attack on New York City tunnels, two law enforcement officials said Friday.

FBI agents monitoring Internet chat rooms used by extremists learned in recent months of the plot to strike a blow at the city’s economy by destroying vital transportation networks, one official said.

Hot Ash Doesn’t Feel Good on Skin

Why is it that with every foreseeable natural disaster, the people who could be affected are not willing to get out of the way? I guess living in abject poverty is worse than death?

Red alert for Indonesia volcano

Thousands of people living on the slopes of Mount Merapi in Indonesia are being taken to safety, because of fears the volcano may be about to erupt.

The elderly, women and children have been taken to emergency shelters after officials monitoring the volcano raised the threat status to the highest level.

The volcano has been rumbling for weeks but is becoming more volatile.

Streams of lava have been flowing down one side of the mountain, which is also spewing out hot volcanic ash and smoke.

However some villagers have refused to move because they do not want to leave their crops and livestock.