Snow Storm Effect On Traffic

Tuesday afternoon a bit of snow hit Georgia. We get them occasionally even down here in the southern United States. Usually they hit the mountains. Everywhere else, we just shut down for a few days. Unfortunately, we do not have all of our Atlanta schools on one system or all of our south Georgia schools on one system. Each system has some of each.

These graphs of current connections to Desire2Learn (application servers to database, so only an inference about end users) tell these stories:

  1. Some Atlanta universities shut down around noon. Starting just before noon, the number of users dropped quickly.
  2. While campuses were closed, some faculty kept their due dates and had students turn in work via Desire2Learn.
  3. Wednesday was a delayed opening for several campuses, so the first peak of the day was around 4pm instead of the normal 11am.
  4. Based on Twitter rumblings, a bit of our traffic was students checking whether or not class was canceled.

My conclusion is we kept 60-70% of our normal traffic while around 75% of our campus user base was closed.

We will have fully made it when these closings push our numbers higher than normal because students, once they get home, come into our system to keep working.

Snowstorm Q Site
Snowstorm X Site

P.S. If we had gone with the desired technical architecture of a database/application stack dedicated to each school, then I could show these numbers per school not just per site.

10.2 Upgrade

Later this week both of our production systems will get upgraded from Desire2Learn 9.4.1 to 10.2. This will be an epic effort. Over several days a few dozen people will work on various aspects of the databases, application servers, and testing. Good ol’ clean IT fun just in time for Christmas. Finding the magical gap between classes and end of term was hard enough when our upgrades took a day. These multi-day ones are much tougher.

Back at FUSION a coworker and I attended a presentation by the team lead for D2L’s Automation group. Thankfully the excellent work from there will be used as part of the upgrade to identify problems earlier. Also, our client testing appears to be taking a leap forward toward automation. (We need to take a light year leap forward with application level monitoring.)

Six months of work will be complete.

Then we get to turn around and do it again for May.

Each D2L Site Broke 25 Million Hits

Night School is the last time I posted stats here. So this is probably due.

This Yaketystats graph records the daily hits for each site in green and pink. The gray is a total of the two. I had it go back to April first to show our spring load compared to how we are starting this fall.

Each of our Desire2Learn instances broke 25 million hits. Our schools set their own calendars, so start dates ranged from the 12th to the 26th. The amount of activity surged on start dates for the largest schools. A small bump on Wednesday of last week (normally there is a small fall off on Wednesdays) is due to a smaller school’s first day of classes.

The green line has three very large schools (1st, 2nd, 4th) who all started on the 19th. It peaked at 27.59M on the 20th.

The pink line’s bump on the 26th was due to its only very large school starting then. It peaked at 25.61M on the 27th.

Tuesday’s peak could be the highest we get until finals in early December. Since the schools have a Drop/Add period, the first few days usually see slightly higher activity.

Hits From 2013-APR-01 to 2013-AUG-29

Back in April, I saw:

@kfrisch #D2LRUF Reported that MnSCU’s D2L gets 25 million hits a day! We’re the largest self-hosted client of D2L in the United States.

Our largest client still runs around 70% on Blackboard Vista. Their plan is to be all in D2L come January. That should push the green one past the point where it is bigger than MnSCU. But even just the normal increase in usage should make the pink one fairly comparable to MnSCU.

We should do something like a billion hits a month.

I would prefer using a pageview metric. Maybe one of these days when I can make better progress on the to-do list.

Host Hopping Cookies

It started with a tweet on Saturday morning.

@MsIngalls: When I go to the Checklist in @Desire2Learn -when I am logged in – I get an error message that says my log in has expired – ideas? #d2l

This sounded like an issue we had in the WebCT Vista product that I called Failed Sessions. FSes occur when the user is actively working in the product and suddenly gets dumped to the login page. Not to be confused with Login Loop, which is providing the correct username and password but never getting a valid session. I hated working either issue because they were never repeatable. The problem could involve any piece of software that could somehow touch a cookie on the user’s computer. Occasionally they were the fault of Weblogic too.

I recommended Amy open a Desire2Learn ticket through our portal on behalf of the professor. I also started my own investigation.

First, I poked around in the logging database for errors involving the checklist. I found errors for different courses, but not the one involved.

Second, I searched our load balancer logs for the id number of the course. That yielded plenty of data showing the problem.

I added these to the ticket and suggested capturing the HTTP headers should the issue not be repeatable by others. Of course, the support agent was not able to repeat it. The headers clearly showed the cookies were not sent.

The professor of course poked holes all through my suggestions for tracking down which of the many pieces of software was involved. Different software, hardware, networks, and browsers meant the cause was probably not something residing on the computers. But the issue definitely was still that all of these browsers, in a wide variety of places, chose not to send the cookies. This is also when he dropped the next bomb: the problem only occurs on links in a specific widget.

Checking the code behind the widget, I only saw simple absolute URLs. Which made me shudder because earlier this week absolute URLs in the login page for a development site put me in production without me being aware for several minutes.

PSA: Only use absolute URLs when sending a visitor to another web server. Say you are here, at www.ezrasf.com, and you want someone to see another of my blog entries. Drop http://www.ezrasf.com from the URL and start the path with / (a relative URL). Should I change the host name to blog.ezrasf.com or www.ezrafreelove.com, then the link has a better chance of working.

It turns out the problem is that the professor used the pre-production host name for the web application. The widget’s absolute URL links used a different host name, the one for production. Both resolve to the same servers. But cookies are tied to a specific host name. So being logged into one host and following a link to the variant means the session is not valid at the variant.

At least the workaround and fix are easy.

The workaround for the professor is to stop using the pre-production URL.

The fix is for the widget designer to turn the absolute URLs into relative URLs since they point to the same location.
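
The mismatch can be sketched in Python. The host names and paths below are hypothetical (the real ones are not given here), and the check assumes host-only session cookies, meaning the browser returns them only to the exact host that set them.

```python
from urllib.parse import urljoin, urlsplit

# Hypothetical values for illustration; the real host names are not in the post.
CURRENT_PAGE = "https://preprod.example.edu/d2l/home/1234"
ABSOLUTE_LINK = "https://prod.example.edu/d2l/checklist/1234"
RELATIVE_LINK = "/d2l/checklist/1234"


def target_host(current_page, link):
    """Resolve a link the way a browser would and return the host it lands on."""
    return urlsplit(urljoin(current_page, link)).hostname


def session_cookie_sent(cookie_host, current_page, link):
    """Assumes a host-only cookie: it goes back only to the host that set it."""
    return target_host(current_page, link) == cookie_host


# The professor logged in on the pre-production host, so the cookie lives there.
cookie_host = urlsplit(CURRENT_PAGE).hostname

print(session_cookie_sent(cookie_host, CURRENT_PAGE, ABSOLUTE_LINK))  # False
print(session_cookie_sent(cookie_host, CURRENT_PAGE, RELATIVE_LINK))  # True
```

The relative link inherits whatever host the user is already on, which is why it keeps the session no matter which name was used to log in.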

Also, it would be nice for a better error message than:

No Login

Either you have failed to login, or your login has expired.

First, the comma is bad grammar. Second, if I am a normal user who encounters this problem, then what can I do to fix it myself? This is not an error someone sees if their password or username is wrong. This is also not what a user normally sees when they are idle too long. But then again, there are lots and lots of potential causes and solutions.

LMS Brackets

I don’t really follow basketball. So, it is odd that I even registered this tweet from Dennis Kavelman, @dkavelman:

Dennis Kavelman  Team D2L has 4 schools in sweet sixteen! Go #marquette, #arizona, #msu, #osu! @desire2learn

He is excited because four universities who are clients of his company are in the major NCAA basketball championship contest. If only four are Desire2Learn customers, then that means the other twelve are not. That made me wonder: what Learning Management Systems do these sixteen schools use?

    1. Louisville : Blackboard
    2. Oregon : Blackboard
    3. Michigan State : D2L
    4. Duke : Sakai
    5. Wichita State : Blackboard
    6. La Salle : Blackboard
    7. Arizona : D2L
    8. Ohio State : D2L
    9. Kansas : Blackboard
    10. Michigan : Sakai
    11. Florida : Sakai
    12. Florida Gulf Coast : Angel (Blackboard) and Canvas in 39 days
    13. Indiana : Sakai
    14. Syracuse : Blackboard
    15. Marquette : D2L
    16. Miami (FL) : Blackboard

Looks like the breakdown is:

Blackboard: 8
Desire2Learn: 4
Sakai: 4

This is an interesting grouping. I kind of knew Sakai tended to be the product of choice for well-off schools with the money to spend on customization. So, schools with strong athletics are probably more likely to have something like Sakai. Of course, I expected Canvas to be better represented too, as it is hot of late (really just FGCU, and not for another 39 days). While Moodle tends to be favored by really small schools without a budget, I still figured it would have some representation.

Which LMSes will be involved with those in the Final Four?

Bulk User Management

The Desire2Learn conversion process strips our Blackboard Vista sections of students and instructors. Our clients naturally want instructors and designers enrolled in the migrated courses (BbVista sections are D2L courses). So obviously we had to enroll them.

The options were the XML (Holding Tank) format or the CSV (Bulk User Management) format.

Do or do not. There is no try. I helped a coworker write the SQL to generate IMS standard XML for a content migration between Vista 3 and 8. In that case, it created the hierarchy of groups and courses for the section restores. Writing XML to create enrollments was definitely feasible. The hesitation was a sense that Holding Tank was pretty demanding.

Back in March we used Bulk User Management to seed the golden master instance with accounts. BUM also has an okay UI process which shows what it is going to do before it actually does it. That came in handy, especially while setting up this process.

Either was clearly viable for what we wanted to do. Basically that was to write SQL against the Vista databases to generate Section Instructor and Section Designer enrollments. CSV was the easiest to create. (Jr and III in last names were very rare. I’ve only seen a couple dozen.)
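
If the worry with suffixes such as Jr and III is an embedded comma (an assumption; it is not spelled out above), a CSV library that quotes fields avoids hand-fixing those rows. This is a minimal sketch: the column layout and sample rows are hypothetical and are not the real Bulk User Management format.

```python
import csv
import io

# Hypothetical layout: username, last name, role, course code.
# The real Bulk User Management columns are not shown in this post.
ROWS = [
    ["jsmith", "Smith, Jr", "Section Instructor", "12345.201308"],
    ["adoe", "Doe", "Section Designer", "67890.201308"],
]


def to_csv(rows):
    """Write rows as CSV text; fields containing a comma get quoted."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


def from_csv(text):
    """Read the CSV text back into a list of rows."""
    return list(csv.reader(io.StringIO(text)))


csv_text = to_csv(ROWS)
print(csv_text)
```

Reading the text back with `from_csv` returns the original rows, so the comma inside the last name survives the round trip.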

Some problems and their solutions:

    1. Missing users. The student information system creates most users for us. This data changes so generating new files under the new format still could miss users. Plus, there are users created in Vista that never existed in the SIS. The best way to match users to the migrated courses is to use the same data source: Vista. Solution: I created a CSV file to make the users.
    2. Inconsistent codes. Courses created by the SIS have a simple code of an id number + a term code. The term usually looks like year plus the month. Courses created by Vista have a long sequence of numbers and letters that do not really match to anything. The D2L conversion process replaces the Vista codes with the title of the course. Solution: When producing the actual file, loop through the SIS sourced ones and come back through and make the Vista sourced ones. Concatenate the files later if a single file is needed.
    3. Special characters. My favorite of these is the single quote. The D2L conversion process lets these through as this character does not affect the course code. It does, however, keep Bulk User Management from enrolling users into the course. Oops. Solution: Instead of just fixing the enrollment data, one has to remove the single quotes in course codes. Which means looking through the database because 200,000 courses is too many to deal with one-by-one. And finding a single quote in SQL where it has special meaning is one of those special level of Hell annoyances.
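
One way around the single quote annoyance is to pass the quote as a bind parameter instead of typing it into the SQL text. The sketch below uses Python's built-in sqlite3 as a stand-in for the real database, and the table name, column name, and course codes are all hypothetical.

```python
import sqlite3

# sqlite3 stands in for the real database; the table and data are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE courses (id INTEGER PRIMARY KEY, code TEXT)")
conn.executemany(
    "INSERT INTO courses (code) VALUES (?)",
    [("Intro to Biology",), ("Women's Studies",), ("Teacher's Workshop 'A'",)],
)

# Find codes containing a single quote. The quote travels as a bound value,
# so it never has to be escaped inside the SQL string itself.
with_quotes = [
    row[0]
    for row in conn.execute(
        "SELECT code FROM courses WHERE code LIKE ? ORDER BY id", ("%'%",)
    )
]

# Strip the quotes from the affected codes in one statement.
conn.execute(
    "UPDATE courses SET code = REPLACE(code, ?, '') WHERE code LIKE ?",
    ("'", "%'%"),
)
cleaned = [row[0] for row in conn.execute("SELECT code FROM courses ORDER BY id")]
print(with_quotes)
print(cleaned)
```

When the quote has to be written as a literal in standard SQL, it is escaped by doubling it, as in `LIKE '%''%'`.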


Automated Testing

On a call today, our new vendor asked that we verify every web site works before having them apply service packs. Our analyst said, “We can do that.” I pointed out the problem causing the present concern happened one in ten times on one site on one server of the instance. Therefore to catch it, they would need 10 views of the login page for 30 servers for each of 18 sites. That is 5,400 page views.
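
The arithmetic, plus a rough look at what ten views per site actually buys, can be sketched as follows. The detection estimate assumes each page view fails independently with a 1 in 10 chance, which is an assumption on top of the "one in ten times" observation.

```python
VIEWS_PER_SITE = 10   # login page views per site on each server
SERVERS = 30
SITES = 18

# Total manual page views needed for one full pass.
total_views = VIEWS_PER_SITE * SERVERS * SITES
print(total_views)  # 5400

# Assumption: each view of a broken site fails independently 1 time in 10.
FAILURE_RATE = 1 / 10
chance_caught = 1 - (1 - FAILURE_RATE) ** VIEWS_PER_SITE
print(round(chance_caught, 2))  # 0.65
```

Under that assumption, ten views would catch a broken site only about two times in three, so even 5,400 manual page views would not be a guarantee.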

The conundrum came up because when the service pack was applied to test, some sites on one server failed this check. Over time they cleared and returned. We have monitoring in place to check a single site on each server works with a login and logout. This check is super-sensitive to changes. Originally this check was on a functional evaluation site, but it broke every other week because someone changed a color, icon, etc. That was with 7. With 111, we would go mad.

Clearly, I am going to have to develop automated testing to verify sites on each of their servers before and after service pack application. Too bad the vendor does not make sure everything works after they make changes to our systems.
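
A first pass at that automation might look like the sketch below. It is only an outline: `fetch_ok` is a hypothetical function that would request a site's login page from one specific server and report whether it looked right, and the server and site names are invented.

```python
from itertools import product


def check_sites(servers, sites, attempts, fetch_ok):
    """Try every site on every server several times and count the failures.

    fetch_ok(server, site) is a caller-supplied (hypothetical) function that
    returns True when the login page came back as expected.
    """
    failures = {}
    for server, site in product(servers, sites):
        failed = sum(1 for _ in range(attempts) if not fetch_ok(server, site))
        if failed:
            failures[(server, site)] = failed
    return failures


# Stand-in for a real HTTP check: pretend one site is broken on one server.
def fake_fetch_ok(server, site):
    return (server, site) != ("app07", "site03")


servers = [f"app{n:02d}" for n in range(1, 31)]   # 30 made-up server names
sites = [f"site{n:02d}" for n in range(1, 19)]    # 18 made-up site names
print(check_sites(servers, sites, 10, fake_fetch_ok))
```

Running the same check before and after the service pack and comparing the two results would show which failures the change introduced.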

The Long Problem

Wait Time

We encountered a vexing issue where sections did not appear in the class list for students. We confirmed the students were properly enrolled. We confirmed the current date was between the dates for learning contexts and terms. We confirmed the access was granted to the students. Still, the sections were not showing, and the usual suspects of administrator or instructor error were not the culprits. Someone eventually figured out that the sections showed up in a student’s editor for showing hidden classes, but not on the classlist. Moving the section higher in the list caused other sections to disappear.

We found strange errors in the webct.log: “com.webct….SettingsDAOException: Generic DB exception(SMS)-SettingsDAOImpl::loadSettingValue java.sql.SQLException: Numeric Overflow”. The stack trace pointed at oracle.jdbc.driver.NumberCommonAccessor.getLong. For the Oracle JDBC driver to react like this, we knew it was bad. Open-a-ticket-with-Blackboard-ASAP-at-high-severity bad.

Then we noticed these errors: “com.webct….NewMemberEJB METHOD: isRoleHasAccessToLc Failed to get sms setting for lcId =9999999999999 and setting = restrictAccessSection_SSTU”. This restrict access had to be related to the students not being able to get into their sections. Failing to look up the value was the curious part.

In Java’s primitive data types, there is one called the long data type. It has a maximum value of 2^63-1, aka 9,223,372,036,854,775,807. (Yeah, 9 quintillion.) According to Blackboard, Inc., it should take a hundred years to use up the possible values. This supposes that values are added by incrementing them by 1.

It turns out the section copy tool adds a digit each time the section is copied. So the original has about ten digits. The copy has eleven. The copy of the copy has twelve digits. This is our third year in the product with the capability. We have three terms a year. So that is nine potential generations. So ten digit values become nineteen in three years not a hundred. Doh!
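
Here is a rough sketch of how quickly that runs out. Exactly how the copy tool builds the new id is not described, so the code simply assumes each copy appends one digit (multiplies by ten), and the starting ids are made up.

```python
JAVA_LONG_MAX = 2**63 - 1  # 9,223,372,036,854,775,807


def copies_until_overflow(section_id):
    """Count copy generations until the id no longer fits in a Java long.

    Assumption: each copy appends one digit, modeled as multiplying by 10.
    """
    copies = 0
    while section_id <= JAVA_LONG_MAX:
        section_id *= 10
        copies += 1
    return copies


# Two hypothetical ten-digit starting ids.
print(copies_until_overflow(9_999_999_999))  # 9
print(copies_until_overflow(1_234_567_890))  # 10
```

A nineteen-digit value only fits when it is below roughly 9.22 quintillion, so the ninth or tenth generation is where a ten-digit id stops fitting.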

Blackboard’s workaround is to divide the numbers exceeding Long by a thousand. Of course, there is a sanity check to make sure none will conflict with existing values. Ugh. In testing I found that changing any setting so it differs from a higher context adds a new value to the database. So basically we have to fix every setting change our clients make, not just access to sections for students.
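
The shape of that workaround, as I understand it, is sketched below. This is not Blackboard's script: the function name, the integer division, and the sample ids are all assumptions made for illustration.

```python
JAVA_LONG_MAX = 2**63 - 1


def shrink_oversized(ids, divisor=1000):
    """Map each id that exceeds a Java long to id // divisor.

    Raises ValueError when a shrunken id would collide with an id already
    in use, which is the sanity check mentioned above.
    """
    taken = set(ids)
    mapping = {}
    for old in sorted(taken):
        if old <= JAVA_LONG_MAX:
            continue
        new = old // divisor
        if new in taken:
            raise ValueError(f"{new} (from {old}) collides with an existing id")
        taken.add(new)
        mapping[old] = new
    return mapping


# Hypothetical ids: one too large for a long, one ordinary.
print(shrink_oversized([9_999_999_999_000_000_000, 1_234_567_890]))
```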

Our legendary analyst figured out changing the setting at the institution level and applying to child contexts removes the settings. Of course, that removes the customization of each course or section.

Content Migration Progress Tracking

When moving hundreds of thousands of courses between WebCT Vista and Desire2Learn, keeping track of what made it through which stage seems like an obvious thing to do, in hindsight. I added that last bit because we started to notice the things that fell between the cracks piling up. The basic process…

    1. Through Oracle SQL we populate a middleware database with those courses which meet acceptable criteria.
      CRACK: With the wrong criteria, courses are not even selected.
    2. Through Oracle SQL we generate XML listings of the courses in sets of 50.
      CRACK: Only a subset of the data loaded into the database may be extracted.
    3. A shell script automates running the WebCT command-line backup process to create a package for each course.
      CRACK: The command-line backup fails on some courses.
    4. Desire2Learn scripts pick up the files and convert WebCT formatted packages to Desire2Learn.
      CRACKS: Packages that are too big fail. Paths that are too long fail. This step can fail to create CSV files for the next step.
    5. Converted packages are placed in a queue to be imported into Desire2Learn.
      CRACKS: WebCT Vista courses can have 1,000 characters in the name and D2L fails if there are more than 50. A course named the same as a previously loaded one but with a different file name gets loaded into the same course.

So, there are apparently five different stages and eight potential failures to track and no end-to-end tracking to even know what is missing. Which means inventing something to check logs for errors. 
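
Two of the step 5 cracks can be caught from the course names alone, before anything reaches the import queue. This is a minimal sketch with made-up course names; only the 50 character limit comes from the list above.

```python
from collections import Counter

D2L_NAME_LIMIT = 50  # D2L fails when a course name is longer than this


def precheck(course_names):
    """Return (names that are too long, names that appear more than once)."""
    too_long = [name for name in course_names if len(name) > D2L_NAME_LIMIT]
    repeats = sorted(
        name for name, seen in Counter(course_names).items() if seen > 1
    )
    return too_long, repeats


# Hypothetical course names for illustration.
names = [
    "Biology 101",
    "Biology 101",
    "A Very Long Course Title That Keeps Going Well Past Fifty Characters",
]
print(precheck(names))
```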

First thing, I tried writing SQL to create an ordered list of the courses that are available in D2L.

The WebCT backups were a little tougher to convert into a useful list. The file names follow the format of Course_Parent_1234567890.bak. They were also in various directories, so I ended up doing something like this to get a list of the files, strip off the parents & time stamps, strip off the directories, and order it.

ls */*.bak | awk -F_ '{print $1}' | awk -F\/ '{print $2}' | sort

So I have two ordered lists. Anything in the D2L one and not in the WebCT one ought to be the work of clients making their own stuff. Anything in the WebCT one and not in the D2L one ought to be my missing ones. Only almost every line is a mismatch.

Visually comparing them, I realized the same courses had in effect different names. All the spaces were converted to underscores. So I ended up adding a sed to convert spaces to underscores after the sort.
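
The comparison itself can be sketched in Python. The course names are hypothetical, and the only normalization applied is the space to underscore conversion described above.

```python
def normalize(name):
    """Apply the same space-to-underscore conversion seen in the file names."""
    return name.strip().replace(" ", "_")


def compare(webct_names, d2l_names):
    """Return (in WebCT but not D2L, in D2L but not WebCT), both sorted."""
    webct = {normalize(name) for name in webct_names}
    d2l = {normalize(name) for name in d2l_names}
    return sorted(webct - d2l), sorted(d2l - webct)


# Hypothetical lists for illustration.
webct_list = ["Intro_to_Biology", "World_History", "Calculus_I"]
d2l_list = ["Intro to Biology", "Calculus I", "Sandbox for Dr Smith"]

missing, client_made = compare(webct_list, d2l_list)
print(missing)      # ['World_History']
print(client_made)  # ['Sandbox_for_Dr_Smith']
```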

Then I wrote some more scripts.

    • Go through the logs from #4 and #5 to display the errors. With it I was able to compare my list of missing with the errors and confirm why they did not come through.
    • Some of the cracks can be addressed by making new import files. So I wrote a script to add those lines to a redo.csv file. Touch up the problems and go.

Basically at this point it only covers steps 3-5. At some point I am going to have to check all of steps 1-5.

New World

Here is an attempt at a positive post with our new vendor, Desire2Learn.

In the Old World, WebCT/Blackboard, my role was to install the application and databases, monitor for problems, and automate systems to run without the need of humans or make them simple for humans to run.

In the New World, Desire2Learn, my role is to install database software, monitor for problems, and automate systems to run without the need of humans or make them simple for humans to run. Desire2Learn installs the application. Temporarily I am doing content migration work.

In both worlds I do reverse engineering. An understanding of the principles behind the technology helps me determine when and where the problems are. We can then hopefully prevent them from happening, detect the problems early, or solve them quickly.

A web server is a web server is a web server, right? Both Weblogic and IIS listen on ports 80 and 443. Files sitting in directories are either served to web browsers or execute code whose results are then served. Both Oracle and SQL Server have tables, views, indexes, and data files. Beyond the basic principles, though, things get hairy. The details matter quite a bit.

Compiling Apache, PHP, and MySQL from source led to understanding the intimate details of how they worked. Automating Weblogic into silent installs and later working with cloning the install, led to understanding the intimate details of how it works. With IIS delivered to me and Desire2Learn installing the application, I feel very lost.

Throw in that this is a very new operating system, database, programming language, and scripting language for me.

Basically I feel like I am reverse engineering blindfolded while building the boat I am using to cross the ocean, and the world is flat. The whole time I wonder when I will sail over the edge.