Mail Delivery Background Jobs

Only 8 years into running this product and I still learn something new about it.

Monday there was an event. Two nodes became unresponsive at about the same time. The other ten nodes did their jobs and transferred session information to the nodes taking over those sessions. Most were so busy they did not respond to monitor requests. There was a lot of slowness. But we did not lose sessions. Nor did we lose the cluster.

Somehow we did lose the Mail tool. (Think internal email, but it can forward messages to email.)

In WebCT Vista 3 we diagnosed this by going to Weblogic, finding the email queues, and restarting some things so email would start flowing again. I was not able to find it that way this time. Apparently now we go to Background Jobs as a server administrator. The waiting mail jobs show up in the Pending Jobs view.

Once I restarted the cluster, the blocking mail job changed to Retried as soon as the JMS node came online. Retried only shows up in the All Jobs view; none of the other views show it. Which makes sense, because each view only shows jobs with the status in its name. So the Cancelled Jobs view only shows jobs with the Cancelled status. Any job with a Retried status shows only in the (non-existent) Retried Jobs and (existing) All Jobs views. It was a bad assumption on my part that every potential status has a view.

Hindsight being 20/20, what we need is a Nagios monitor to alert when the number of Pending jobs exceeds maybe 20-50. Normally this table appears empty, but I could see cases where it grows fast and then quickly clears.
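If I ever write it, the check itself is the easy part; getting the count is the hard part. Here is a minimal sketch of a Nagios-style check, where get_pending_count is a hypothetical placeholder for however you would count the Pending jobs in your environment (probably a query against whatever backs that view):

#!/bin/bash
# check_pending_jobs.sh - minimal Nagios-style check sketch.
# WARN/CRIT default to the 20-50 range mentioned above.

WARN=${1:-20}
CRIT=${2:-50}

get_pending_count() {
    # Hypothetical placeholder: replace with the real count command
    # or query for your environment.
    echo 0
}

COUNT=$(get_pending_count)

if [ "$COUNT" -ge "$CRIT" ]; then
    echo "CRITICAL - $COUNT pending mail jobs"
    exit 2
elif [ "$COUNT" -ge "$WARN" ]; then
    echo "WARNING - $COUNT pending mail jobs"
    exit 1
fi
echo "OK - $COUNT pending mail jobs"
exit 0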

But then again, we have less than a year on this product. What are the odds this will happen again?

OrgCode Duplicate Filter

I was asked to work my “Unix magic”.

The problem? Duplicate courses were spooled and converted from the WebCT format to the Desire2Learn format. The conversion process creates an import file using the WebCT SourcedId.Id as the OrgCode. The first time an OrgCode is used, it creates a course. The next and subsequent times, it duplicates the content into that course. So these duplicate converted courses left us screwed.

Fortunately our partners at Desire2Learn intercepted the problem before it got worse.

Out of 1,505 still to be imported, there were 468 duplicates. Yes, 31% duplicates.

D2L asked me to filter the imports to remove the duplicates. I said I am too much of a n00b with Windows to pull it off.

The reply was to use Unix.

Boy do I love Bash shell scripting. I solved it in two hours, though after the high of solving something this morning that I had no idea how to write, I keep suspecting there must be something wrong with it.

First, my general idea was to read the file line by line and write the lines whose OrgCodes did not yet exist to a filtered.csv file. I started out by trying to exactly duplicate my existing file into another file by reading it line by line.

A while loop reads each line and records the whole line in a variable.

INPUTFILE=/path/to/file.txt
exec < "$INPUTFILE"
while read LINE
do
     : # stuff... (process "$LINE" here)
done

I quickly discovered, though, that the Windows data being full of backslashes foiled echo's ability to write every line to the file exactly; the backslash escapes the next character. Neither double nor single quotes helped the situation. Oops. So I decided to use sed to make a temporary copy with the backslashes doubled. The first backslash escapes the next character, in this case a second backslash.

sed -e 's|\\|\\\\|g'

As an error check, the last thing the script does is a diff -u to compare the source and new files. At this stage, no output means a perfect copy. I like -u because it gives me easier-to-read results.

So I was able to get an exact copy of the original. All that was left was to get the OrgCode, check it against my filtered file, and if it did not exist there, append the line to the end of the filtered file.

ORGCODE=`echo "$LINE" | awk -F, '{print $1}'`
IF_EXISTS=`grep "$ORGCODE" $FILTERFILE`
if [ -z "$IF_EXISTS" ] ; then
     echo "$LINE" >> $FILTERFILE
fi

Easy. Too easy?

The checks against my work confirmed it worked.

    1. A sorted version of the source, run through this and compared with diff -u, consistently showed the correct lines were excluded.
    2. The count of duplicates matches the number of lines missing from the filtered file.
    3. A check for duplicate OrgCodes returns nothing on the filtered file.
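For the record, here is roughly how the pieces fit together as one script. The paths are placeholders, and I have swapped in read -r and printf for the sed/echo backslash workaround, since they leave backslashes alone; otherwise it follows the steps described above.

#!/bin/bash
# Sketch of the OrgCode duplicate filter described above.
# INPUTFILE and FILTERFILE are placeholder paths.

INPUTFILE=/path/to/file.csv
FILTERFILE=/path/to/filtered.csv

> "$FILTERFILE"    # start with an empty filtered file

while IFS= read -r LINE
do
    # The first comma-separated field is the OrgCode.
    ORGCODE=$(printf '%s\n' "$LINE" | awk -F, '{print $1}')
    # Keep the line only if this OrgCode is not already in the output.
    if ! grep -q "$ORGCODE" "$FILTERFILE" ; then
        printf '%s\n' "$LINE" >> "$FILTERFILE"
    fi
done < "$INPUTFILE"

# Checks: what was excluded, and whether any OrgCode still repeats.
diff -u "$INPUTFILE" "$FILTERFILE"
awk -F, '{print $1}' "$FILTERFILE" | sort | uniq -d

An awk one-liner like awk -F, '!seen[$1]++' file.csv > filtered.csv would do the same first-occurrence filter in one shot, but the long way made the checks above easier to reason about.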

One of Many

The Learning Management System (LMS) has been a despised technology to some ever since I started working with one, WebCT, in 1999. At the time it was deemed crappy technology that had to improve or die. So today in 2012, about 13 years later, I have to roll my eyes at pundits writing that the current technology has not significantly changed in a decade (really more than a decade), still offers the same tools, and will die unless it adapts.

In my first few years working at GeorgiaVIEW, 2006-2010, our active user counts doubled every 1.5 years. We plateaued at around 290,000 and now grow a few thousand a year. The number of actions in the system still doubles every 1.5 years. That is insane growth, and growth unlikely to be fueled by people despising the tool. Right now we are getting pressure to migrate Summer 2012 content into Desire2Learn (see Note 1) for the Fall 2012 start because instructors roll their classes over from term to term. That speaks of long-term, consistent, loyal use, not grudging use-only-when-forced. For something on the verge of death, it is hard enough keeping the users happy.

I am a database administrator, not a faculty member (or dean or vice president for academic affairs or provost). It seems to me, though, that no one would say, “When you teach a class, the whiteboard in the room is the only tool you can use.” Instead, the push would be to add to the available tools in a never-ending pursuit of better ones. So we see pressure to integrate the LMS with a variety of specialized services. Many are textbook replacements or supplementary services designed specifically for student needs. Others are social media. More and more the LMS is just a portal: a place to organize where students really go to learn.

Also, as an IT guy, I think it is important to have a plan B. Things sometimes fail. As a student I was always annoyed when the instructor had to leave the room for 20% of the class to track down a piece of chalk because the remaining pieces were too small to write with. I applauded once in my junior year because the instructor happened to have a piece of chalk in her purse just for that contingency. Similarly, faculty members and even students should think about what to do when the LMS is not there. Heck, what should they do if everything university IT runs, the web sites, email, portal, and network, all disappears? It can happen.

When the university bureaucracy selects and administers a tool, it will adhere to university policy, which adheres to higher education law. When a faculty member selects and administers a tool, they should do the same. Unfortunately, that means the faculty member becoming familiar with policy and law. Another challenge is running into different interpretations. An example: a user following @VSUENGL1101 on Twitter could reasonably be expected to be a Valdosta State University student enrolled in English 1101. Some say that violates the Family Educational Rights and Privacy Act. Some disagree, so it is being debated. The law is old and likely did not anticipate social media, so naturally there is movement toward an update.

I doubt the LMS will simply die because there is something better. Instead it will remain one of many tools for years to come. Like the land line, television, JavaScript, still camera, WiFi, non-smartphone, and (God forbid) pagers.

Note 1: Desire2Learn objects to their product being called an LMS. They prefer Learning Environment on the grounds it integrates with so many other tools.

P.S. This totally is from a sustaining technology perspective. Guess I should write this from a disruptive technology perspective.

Why Ten

The question of why we run ten clusters came up recently. My answer off the top of my head was okay. Here is a more thoughtful response.

Whenever I have been in a conversation about Weblogic with a BEA (more recently Oracle) person, the number of nodes we run has invariably surprised them. Major banks serve ten times our number of simultaneous users on a half dozen managed nodes or fewer. We have 130 managed nodes for production. Overkill?

They do have some advantages over us.

  1. Better control over the application. WebCT hacked together an install process very much counter to the way BEA would have done it. BEA would have had one install the database and the web servers and then deploy the application using either the console or the command line. WebCT created an installer which does all of this in the background, out of sight and mind of the administrator. They also created start and stop scripts which drive Weblogic from the command line to start the application. Great for automation and for making things simple for administrators. It also lobotomizes the console, making many advanced things one could normally do there risky. So now the console is only useful for some minor configuration management and monitoring.
  2. Better control over the code. When there is a performance issue, they can find the cause and improve the efficiency of the code. The best I can do is point out the inefficiencies to a company whose priority is a completely different codebase. If you do not have control over the code, then you give the code more resources.
  3. As good as Weblogic is at juggling multiple managed nodes, more nodes does not always equal better. Every node has to keep track of the others. The heartbeats communicate through multicast: every node sends out its own and listens for the same from all the others. Around twenty nodes, they would miss occasional beats on their own. Throw in a heavy workload and an overwhelmed node can miss enough beats that the others mark it as unavailable. That is usually when the monitors start paging me about strange values in the diagnostics. Reducing the number of nodes helped.

More resources means more nodes. We had two clusters of about 22 nodes each (44 total) when we hit a major performance wall. They were split into four clusters with 15 nodes each (60 total). Eventually these grew to over 22 nodes each again. At that point upgrading was out of the question. A complete overhaul with all new databases and web servers meant we could do whatever we wished.

The ideal plan was a cluster per client. Licenses being so expensive scrapped that plan.

Ten clusters with 13 managed nodes each was a reasonable compromise: more total nodes and smaller clusters at the same time. Empty databases also gave us a better restarting point. Still, the databases have grown to the point that certain transactions run slowly only 4 terms in. (I was hoping for 6.) Surviving the next two years will be a challenge, to say the least. I wish we got bonuses for averting disasters.

Apples to Oranges

My web hosting service, Dreamhost, happens to have a one-click installer for Moodle. So I installed one as my own personal sandbox. In looking at the available roles, it suddenly occurred to me: comparing any LMS to another is like comparing an apple to an orange. The industry is like the Tower of Babel. Each product has its own jargon covering much of the same ground in absurdly different ways. How could you have an Internet if all the servers talked to each other so differently? Yet in technology created for higher education, every system has a different name for the person who teaches a class, or even if the name is the same, the capabilities differ.

Sure, there are some commonalities (even apples and oranges are both fruit), but the developers had different conceptual models in mind. The same word meaning different things really is quite annoying. For example, a course in Vista is a container for the places where learning takes place, whereas in Learn a course is where teaching takes place. Teaching in Vista takes place in sections. So… Vista : section :: Learn : course.

These different conceptual models are why faculty get so irate about change. It is hard enough to learn new places to click and new ways to accomplish what you used to accomplish. For some period of time they have to hold two vocabularies, and maybe even years later they still cannot call things by the correct term. (WebCT CE/SE called the place where teaching takes place a course, and both former and current coworkers still call sections “courses” 6 years later.)

One would think standards organizations like the IMS Global Learning Consortium would help solve this. Every product adhering to a standard should end up adopting consistent terminology and conceiving of objects and processes similarly. That would make comparisons easier. Except the standards appear to be adopted in the integration components or the database, not the main product.

I really feel bad for the instructional technologist who has to support more than one learning management system.

Also, selecting a new LMS seems like an insanely difficult task when you have to learn a dozen vocabularies well enough to ascertain whether a product has what you need.

Another Way to Verify Cookie Domain

Just finished an Oracle WebLogic Server 11g: Administration Essentials class today. So there are lots of things floating about in my head I want to try. (Thankfully we have lots of development clusters for me to break beyond repair. Kidding. Sorta.)

One of the common support questions Blackboard asks those of us CE/Vista clients running a cluster is whether we have changed the cookie domain in weblogic.xml. This controls where the JSESSIONIDVISTA cookie is valid. By default the value in weblogic.xml is set to .webct.com, which is not valid anywhere (not even Blackboard.com). One of the install steps for a cluster is to go to the Weblogic domain directory on the administrator node, run some commands to extract weblogic.xml from the WAR, edit it, then run some commands to add it back to the WAR file. Placing an empty REFRESH file on all the managed nodes deletes the staged and cached copies of the WAR.
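For reference, that extract/edit/repack cycle looks roughly like this. The WAR name and the REFRESH location are assumptions on my part, so follow the actual install docs rather than this sketch.

# In the Weblogic domain directory on the admin node.
cd $WL_DOMAIN

# Pull weblogic.xml out of the application WAR (WAR name is an assumption).
jar xf webct.war WEB-INF/weblogic.xml

# Change the cookie domain from .webct.com to your domain.
vi WEB-INF/weblogic.xml

# Put the edited copy back into the WAR.
jar uf webct.war WEB-INF/weblogic.xml

# On every managed node, an empty REFRESH file (location per the
# install docs) makes it discard the staged and cached copies.
touch REFRESH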

No big deal and easy.

Except when it isn’t?

Occasionally someone will distrust your work and want you to verify the right setting is there. Normally they say to extract the weblogic.xml again and verify it is correct there. I had a thought: why not verify that each managed node's cache has the correct value?

It is easier than it sounds. In the Weblogic domain directory (where setEnv.sh is located), change directories to

$WL_DOMAIN/servers/node_name/tmp/_WL_user/webct

(NOTE: Anything I put in bold is custom to your install; I cannot anticipate what you would use there.)

Here I just used these greps to look for my domain. If I get results for the first one, then all is well. If I don’t get results for the first, then the second one should confirm the world is falling because we are missing the cookie domain.

grep ".domain.edu" */war/WEB-INF/weblogic.xml
grep ".webct.com" */war/WEB-INF/weblogic.xml

Since we use dsh for a lot of this kind of thing, I would use our regex for the node names and add on the common path pieces. I have not yet studied the piece between webct and war to know for certain how it is derived, except to say it appears to be 6 characters long and sufficiently random as to not repeat. Any [ejw]ar exploded into the cache appears to get a unique one. So this might work?

grep ".domain.edu" $WL_DOMAIN/servers/node_name_regex/tmp/_WL_user/webct/??????/war/WEB-INF/weblogic.xml

If not, then try:

cd $WL_DOMAIN/servers/node_name_regex/tmp/_WL_user/webct/ \
&& pwd && grep ".domain.edu" */war/WEB-INF/weblogic.xml

I'm envisioning this method to verify a number of different things in the nodes. It especially confirms the managed node received what I expected, not just that the admin node has the correct something.
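We would push this through dsh, but a plain ssh loop shows the idea just as well. The node names, ssh user, domain directory, and cookie domain string are all placeholders for your environment.

# Placeholder values throughout.
NODES="node01 node02 node03"
WL_DOMAIN=/path/to/weblogic/domain

for NODE in $NODES ; do
    echo "== $NODE =="
    # The glob expands on the remote side; the path expands locally.
    ssh weblogic@$NODE \
        "grep '.domain.edu' $WL_DOMAIN/servers/*/tmp/_WL_user/webct/*/war/WEB-INF/weblogic.xml"
done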

xmllint

This Linux tool is my new best friend. We get thousands of XML files from our clients for loading user, class, and enrollment information. Some of these clients customize our software or write their own software for generating the XML.

This means we frequently get oddities in the files which cause problems. Thankfully I am not the person who has to verify these files are good. I just get to answer the questions that person has about why a particular file failed to load.

The CE/Vista import process will stop if its validator finds invalid XML. Unfortunately, the error “An exception occurred while obtaining error messages.  See webct.log” doesn’t sound like invalid XML.

Usage is pretty simple:

xmllint --valid /path/to/file.xml | head

  1. If the file is valid, then the whole file is in the output.
  2. If there are warnings, then they precede the whole file.
  3. If there are errors, then only the errors are displayed.

I use head here because our files can be up to 15MB, so this prevents the whole file from going on the screen for the first two situations.

I discovered this while researching how to handle the first error below. It came up again today. So this has been useful for catching errors in client-supplied files when a file fails to load.

1: parser error : XML declaration allowed only at the start of the document
 <?xml version="1.0" encoding="UTF-8"?>

162: parser error : EntityRef: expecting ‘;’
<long>College of Engineering &amp&#059; CIS</long>

(Bolded the errors.) The number before the colon is the line number. The caret it uses to indicate where on the line an error occurred isn't accurate, so I ignore it.

My hope is to get this integrated into our processes to validate these files before they are loaded and save ourselves headaches the next morning.
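A first pass at that integration could be as simple as looping over the drop directory before the load runs. The path is a placeholder, and --noout just keeps xmllint from printing the whole document.

# Validate each incoming XML file before the import touches it.
# /path/to/incoming is a placeholder for wherever the client files land.
for FILE in /path/to/incoming/*.xml ; do
    if ! xmllint --valid --noout "$FILE" 2> "$FILE.errors" ; then
        echo "INVALID: $FILE (see $FILE.errors)"
    else
        rm -f "$FILE.errors"
    fi
done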

Weblogic Diagnostics

I noticed one of the nodes in a development cluster was down. So I started it again. The second start failed, so I ended up looking at logs to figure out why. The error in WebCTServer.000000000.log said:

weblogic.diagnostics.lifecycle.DiagnosticComponentLifecycleException: weblogic.store.PersistentStoreException: java.io.IOException: [Store:280036]Missing the file store file “WLS_DIAGNOSTICS000001.DAT” in the directory “$VISTAHOME/./servers/$NODENAME/data/store/diagnostics”

So I looked to see if the file was there. It wasn’t.

I tried touching a file at the right location and starting it. Another failed start with a new error:

There was an error while reading from the log file.

So I tried copying WLS_DIAGNOSTICS000002.DAT to WLS_DIAGNOSTICS000001.DAT and starting again. This got me a successful startup. Examination of the WLS files revealed the 0 and 1 files have updated time stamps while the 2 file hasn't changed since the first occurrence of the error.

That suggests to me Weblogic is unaware of the 2 file and only aware of the 0 and 1 files. Weird.

At least I tricked the software into running again.
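In shell terms, the fix that finally worked was roughly this, with the node shut down; $VISTAHOME and the node name are whatever your install uses.

# Run with the node shut down. $VISTAHOME and NODENAME are placeholders.
cd $VISTAHOME/servers/$NODENAME/data/store/diagnostics

# Touching an empty WLS_DIAGNOSTICS000001.DAT was not enough;
# copying an existing store file into place was.
cp WLS_DIAGNOSTICS000002.DAT WLS_DIAGNOSTICS000001.DAT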

Some interesting discussion about these files:

  1. Apparently I could have just renamed the files. CONFIRMED
  2. The files capture JDBC diagnostic data. Maybe I need to look at the JDBC pool settings. DONE (See comment below)
  3. Apparently these files grow, and a new file is added when one reaches 2GB. Sounds to me like we should purge these files like we do logs (sketch below). CONFIRMED
  4. There was a bug in a similar version causing these to be on by default.
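For the purge in item 3, something along the lines of our log cleanup would probably do. The retention window and paths are guesses, and I would only run it against a node that is shut down.

# Sketch only: remove old diagnostic store files for a stopped node.
# $VISTAHOME, $NODENAME, and the 7-day window are assumptions.
find $VISTAHOME/servers/$NODENAME/data/store/diagnostics \
    -name 'WLS_DIAGNOSTICS*.DAT' -mtime +7 -print -delete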

Guess that gives me some work for tomorrow.
🙁

Odd Tracking File Recording

Every time a Vista 3 node is shut down without going through the initiated shutdown process, there is a chance of incorrect data being written to the tracking files (in NodeA/tracking/). Normally it leaves strange characters or partial lines at the end of the file. This is the first time I have seen it write the contents of another log instead of the tracking data.

click – 1.0 – 1244228052889 – 1135588340001 – “nova.view.usg.edu-1244227762853-6288” – SSTU – discussion – “compiled-message-viewed” – “page name” – 558711383 –

click – 1.0 – 1244228052891 – 15.0; .NET CLR 1.1.4322)”

2009-04-23      20:58:35        0.0030  xxx.xxx.xxx.xxx    JxH1zg4fZT1LTGcpmyNW    200     GET     /webct/libraryjs.dowebct        locale=en_US    0       “Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; .NET CLR 1.1.4322)”

Even better. The node went down on June 5th at around 3pm. The lines from the other log were from April 23rd at 8:58pm.

Why am I surprised to see new incorrect behavior? Especially when the node was really confused?