Bulk User Management

The Desire2Learn conversion process strips our Blackboard Vista sections of students and instructors. Our clients naturally want instructors and designers enrolled in the migrated courses (BbVista sections are D2L courses). So obviously we had to enroll them.

The options were the XML (Holding Tank) format or the CSV (Bulk User Management) format.

Do or do not. There is no try. I helped a coworker write the SQL to generate IMS standard XML for a content migration between Vista 3 and 8. In that case, it created the hierarchy of the groups and courses for the restores of sections. Writing XML to create enrollments was definitely feasible. The hesitation was a sense that Holding Tank was pretty demanding.

Back in March we used Bulk User Management to seed the golden master instance with accounts. BUM also has an okay UI process which shows what it is going to do before it actually does it. That came in handy, especially while setting up this process.

Either was clearly viable for what we wanted to do. Basically that was to write SQL against the Vista databases to generate Section Instructor and Section Designer enrollments. CSV was the easiest to create. (Jr and III in last names were very rare. I’ve only seen a couple dozen.)

Some problems and their solutions:

    1. Missing users. The student information system creates most users for us. This data changes so generating new files under the new format still could miss users. Plus, there are users created in Vista that never existed in the SIS. The best way to match users to the migrated courses is to use the same data source: Vista. Solution: I created a CSV file to make the users.
    2. Inconsistent codes. Courses created by the SIS have a simple code of an id number + a term code. The term usually looks like year plus the month. Courses created by Vista have a long sequence of numbers and letters that do not really match to anything. The D2L conversion process replaces the Vista codes with the title of the course. Solution: When producing the actual file, loop through the SIS sourced ones and come back through and make the Vista sourced ones. Concatenate the files later if you need a single one.
    3. Special characters. My favorite of these is the single quote. The D2L conversion process lets these through as this character does not affect the course code. It does, however, prevent Bulk User Management from enrolling users into the course. Oops. Solution: Instead of just fixing the enrollment data, one has to remove the single quotes in course codes. Which means looking through the database because 200,000 courses is too much to deal with one-by-one. And finding a single quote in SQL, where it has special meaning, is one of those special-level-of-Hell annoyances.
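The single quote hunt in item 3 can be sketched like this. It is only an illustration: it uses Python's built-in sqlite3 instead of the Oracle database behind the real systems, and the table name courses and column name code are hypothetical. In standard SQL a literal single quote inside a string is written by doubling it, and a bound parameter avoids the escaping altogether.

```python
import sqlite3

def find_quoted_codes(conn):
    """Return course codes that contain a single quote."""
    # '%''%' is the SQL literal %'% : the quote is escaped by doubling it.
    rows = conn.execute(
        "SELECT code FROM courses WHERE code LIKE '%''%' ORDER BY code"
    ).fetchall()
    return [row[0] for row in rows]

def strip_quotes(conn):
    """Remove single quotes from every affected course code."""
    # Passing the quote as a bound parameter sidesteps the escaping entirely.
    cur = conn.execute(
        "UPDATE courses SET code = REPLACE(code, ?, '') WHERE instr(code, ?) > 0",
        ("'", "'"),
    )
    conn.commit()
    return cur.rowcount

def demo():
    # Hypothetical sample data standing in for migrated course codes.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE courses (code TEXT)")
    conn.executemany(
        "INSERT INTO courses VALUES (?)",
        [("Intro to Biology",), ("Women's Studies",), ("Child's Play 101",)],
    )
    before = find_quoted_codes(conn)
    changed = strip_quotes(conn)
    after = find_quoted_codes(conn)
    return before, changed, after
```

Running the SELECT first gives a list to review before the UPDATE touches anything, which matters when the same codes also have to be fixed in the enrollment file.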

 

The Long Problem

Wait Time

We encountered a vexing issue where sections did not appear in the class list for students. We confirmed the students were properly enrolled. We confirmed the current date was between the dates for learning contexts and terms. We confirmed the access was granted to the students. Still, the sections were not showing, but the usual suspects of administrator or instructor error were not the culprits. Someone eventually figured out that the sections showed up in a student’s editor for showing hidden classes, but not on the classlist. Moving the section higher in the list caused other sections to disappear.

We found strange errors in the webct.log: “com.webct….SettingsDAOException: Generic DB exception(SMS)-SettingsDAOImpl::loadSettingValue java.sql.SQLException: Numeric Overflow”. The stack trace mentioned oracle.jdbc.driver.NumberCommonAccessor.getLong. For the Oracle JDBC driver to react like this, we knew it was bad. Open-a-ticket-ASAP-with-Blackboard-at-high-severity bad.

Then we noticed these errors: “com.webct….NewMemberEJB METHOD: isRoleHasAccessToLc Failed to get sms setting for lcId =9999999999999 and setting = restrictAccessSection_SSTU”. This restrict access had to be related to the students not being able to get into their sections. Failing to look up the value was the curious part.

In Java’s primitive data types, there is one called the long data type. It has a maximum value of 2^63-1, aka 9,223,372,036,854,775,807. (Yeah, 9 quintillion.) According to Blackboard, Inc., it should take a hundred years to use up the possible values. This supposes that values are added by incrementing them by 1.

It turns out the section copy tool adds a digit each time the section is copied. So the original has about ten digits. The copy has eleven. The copy of the copy has twelve digits. This is our third year in the product with the capability. We have three terms a year. So that is nine potential generations. So ten digit values become nineteen in three years not a hundred. Doh!
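The arithmetic can be checked with a short sketch. It assumes, as described above, that the original ID has ten digits and each copy adds exactly one, and it uses the worst case where every digit is 9. Not every nineteen digit value overflows a long, only those above 9,223,372,036,854,775,807.

```python
JAVA_LONG_MAX = 2**63 - 1  # 9,223,372,036,854,775,807 (19 digits)

def generations_until_overflow(start_digits=10, max_generations=20):
    """Return (generation, digits) at which an all-9s ID first exceeds a long."""
    digits = start_digits
    for generation in range(max_generations + 1):
        largest_id = 10**digits - 1  # worst case: every digit is 9
        if largest_id > JAVA_LONG_MAX:
            return generation, digits
        digits += 1  # assumption: each copy adds one digit
    return None  # no overflow within max_generations
```

With the defaults, the ninth generation is the first that can overflow, which lines up with three terms a year for three years.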

Blackboard’s workaround is to divide the numbers exceeding Long by a thousand. Of course, there is a sanity check to make sure none will conflict with existing values. Ugh. In testing I found that changing any setting so it differs from a higher context adds a new value to the database. So basically we have to fix every setting change our clients make, not just access to sections for students.
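A rough sketch of that workaround's logic, as I understand it, follows. This is not Blackboard's script. The function name and the idea of working from a plain list of IDs are assumptions for illustration only.

```python
JAVA_LONG_MAX = 2**63 - 1

def plan_id_fixes(ids):
    """Map each ID exceeding a Java long to ID // 1000, flagging conflicts."""
    existing = set(ids)
    fixes = {}       # old ID -> proposed new ID
    conflicts = []   # old IDs whose proposed value is already taken
    for old in sorted(existing):
        if old <= JAVA_LONG_MAX:
            continue  # fits in a long, leave it alone
        new = old // 1000
        # Sanity check: the new value must not collide with an existing ID
        # or with another proposed fix.
        if new in existing or new in fixes.values():
            conflicts.append(old)
        else:
            fixes[old] = new
    return fixes, conflicts
```

Anything that lands in conflicts would need a human decision rather than an automatic rewrite.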

Our legendary analyst figured out changing the setting at the institution level and applying to child contexts removes the settings. Of course, that removes the customization of each course or section.

Mail Delivery Background Jobs

Only 8 years into running this product and I still learn something new about it.

Monday there was an event. Two nodes became unresponsive at about the same time. The other ten nodes did their jobs and transferred session information to the nodes taking on the sessions. Most were so busy they did not respond to monitor requests. There was lots of slowness. But we did not lose sessions. Nor did we lose the cluster.

Somehow we did lose the Mail tool. (Think internal email, but it can forward messages to email.)

In WebCT Vista 3 we diagnosed this by going to Weblogic, finding the email queues, and restarting some things so email would start flowing again. I was not able to find it that way. Apparently now, we go to the Background Jobs as a server administrator. The waiting mail jobs show up in the Pending Jobs view.

Once I restarted the cluster, the blocking mail job was changed to Retried as soon as the JMS node came online. Retried only shows up in the All Jobs view. All the other views do not show it. Which makes sense because each view shows the status in the view name. So the Cancelled Jobs view only shows jobs with the Cancelled status. Any jobs with a Retried status should only show in the (non-existent) Retried Jobs and (existing) All Jobs views. It was a bad assumption on my part that all potential statuses have a view.

Hindsight being 20/20, what we need is a Nagios monitor to detect if Pending jobs exceeds maybe 20-50 jobs. Normally this table appears empty. But I could see cases where it normally grows fast then quickly clears.
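The threshold logic for such a monitor might look like the sketch below. The exit codes are the standard Nagios plugin ones. How the pending count gets read (a database query, most likely) is left out because I have not worked that part out, and the function name is my own.

```python
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3  # standard Nagios plugin exit codes

def check_pending_jobs(pending, warn=20, crit=50):
    """Return (exit_code, status_line) for a pending background job count."""
    if pending is None or pending < 0:
        return UNKNOWN, "UNKNOWN - could not read pending job count"
    # Performance data after the pipe: label=value;warn;crit
    perfdata = f"pending={pending};{warn};{crit}"
    if pending > crit:
        return CRITICAL, f"CRITICAL - {pending} pending jobs | {perfdata}"
    if pending > warn:
        return WARNING, f"WARNING - {pending} pending jobs | {perfdata}"
    return OK, f"OK - {pending} pending jobs | {perfdata}"
```

A wrapper script would print the status line and exit with the code. Because the table can grow fast and then clear, it probably makes sense to have Nagios require a few consecutive failures before paging.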

But then again, we have less than a year on this product. What are the odds this will happen again?

Project Gutenberg Lorem Ipsum

There is a Java vulnerability where an attacker can exploit the hash predictability. The exploit is apparently easier when the content is larger. So the workaround is to limit the size of HTTP POST requests. Weblogic’s 10.3 config.xml has a max-post-size which does this. The handling of when the condition is reached is pathetic. It closes the connection.

In the case a legitimate user encounters this max POST size, their web browser will say our web server closed the connection. Which is perfectly true. I would prefer the web server to respond with some kind of error message to let the user know it was because too much data was sent in the form submission.

My idea for where to get enough text was for the analysts to pick something from Project Gutenberg. Next to any file is an indicator of the size. This makes it easy for them to pick one large or small enough. The plain text version of A Princess of Mars is 390KB.

After the fact, I suspected I should have just sent them to Lorem Ipsum Generator. Unfortunately it maxed out at 71KB, with no obvious warning that it did not give me the 200KB I requested. Procato Publishing’s Lorem Ipsum Generator maxed out at 54KB. Blindtext’s Lorem Ipsum Generator maxed out at 100KB. Had I suggested this route, I would have had to do more research to figure out which generator would work for them. Or suggest they paste multiple times, which requires trusting that the pasting was done correctly before believing an unexpected result.

From these results, I think Project Gutenberg will remain my go-to resource for extremely large test texts.
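Another option, which we did not use, would be to generate the text locally at an exact size. A minimal sketch, assuming plain ASCII filler so that characters and bytes are the same count:

```python
def make_payload(size_bytes, seed_text="Lorem ipsum dolor sit amet, consectetur adipiscing elit. "):
    """Return ASCII text of exactly size_bytes bytes by repeating seed_text."""
    if size_bytes < 0:
        raise ValueError("size_bytes must not be negative")
    # Repeat enough times to cover the target, then trim to the exact size.
    repeats = size_bytes // len(seed_text) + 1
    return (seed_text * repeats)[:size_bytes]
```

For example, make_payload(200 * 1024) gives a 200KB string to paste into a form, with no guessing about how many pastes were done.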

Big Bad Blip

I was at lunch last week when I saw pages about failed monitoring checks on one of our sites. My coworkers were working on CE/Vista SP6 upgrades, though this site was one upgraded yesterday. When I returned to the office, I asked about it. Exactly 24 hours to the second after checking the license in yesterday’s final start, the JMS node failed a license check four times about a minute apart. On the fourth failure, it started a shutdown of the node. Others in the cluster did as well.

Fortunately, a coworker caught it soon enough to start them again, so not enough were shut down that the load balancer would stop sending us traffic. Also, this was between terms so we did not have a normal work load.

Still, JMS migrated. That made Weblogic edit the config.xml and probably left the cluster in a weird state. So I set cron to shut down the cluster at 4am, copy a known good config.xml into place, check the config with our monitor script (pages if bad), and start the cluster. That was a disaster. Various nodes failed early on. The startup started the admin node, but the JMS node failed to start. So I was paged about it still being down when it ought to have been running.

My 6:30 am starts failed for the same reason: a bad encrypted password in boot.properties. My only idea for a fix came from a coworker who had mentioned having to re-install an admin node for a security error. So I called the coworker. I explained the problem and the solution I really did not want to take. She looked at the error and thought about it some. She decided it might work to replace the boot.properties with an unencrypted version because Weblogic would encrypt it when discovered. She also suggested removing the servers directory and placing a REFRESH file, which would prompt the node to download a new copy of the files it needs from the admin node.

That worked to get the nodes to start correctly. It was fine during the normal maintenance on Friday. Looks like we are in the clear.

That afternoon I brought it up on our normal check-in call with Blackboard. An “unable to find license file” issue was why Blackboard pulled CE/Vista SP4. It also was a Weblogic upgrade.

Back Door Restore

Humans make mistakes. Our clients’ administrators sometimes do very bad things without malicious intent. The “Deny Access” button is too close to the “Delete” one. About 160 student accounts were deleted.

The hypothesis came to me that sections keep data when a student is removed. Maybe it keeps the data when a student’s account is deleted. If I can trick the system into thinking the same student came back, then maybe it will relink the data. Everyone is happy.

To test this hypothesis, I…

  • Exported a copy of the grade book for my test student account in a test CE/Vista 8.0.6 system. Should the test go bad, then I could at least restore the grades.
  • Copied the account’s profile to a text file for the user name, sourcedid.source, and sourcedid.id.
  • Created a new account, gave it the same user name, sourcedid.source, and sourcedid.id (and first, last, password).
  • Enrolled the account into the original class as a student.

The grades were missing. Clearly my hypothesis was wrong. Data is not kept around for deleted students like it is for unenrolled students. Which sucks.

In my retest, I…

  • Unenrolled the same account. The grade book showed the student’s data in red, meaning the account was unenrolled but the data was still there.
  • Deleted the same account. The grade book still showed the student’s data in red.
  • Created a new account with a 2 in the user name and added it to the section. The grade book showed the new account not the one I deleted.

I hope this means I still saw the data post-delete because of the cache services. Changing the enrollment changed what was stored in the cache so the old account disappeared at that point. A couple more tries confirmed the behavior of the student appearing in the grade book post-delete.

Still, it is disconcerting that deleted users appear in the grade book.

DSID-0C090334

Working with our clients on LDAP configuration almost invariably starts with SSL certificates. Self-signed and intermediate certificates take up a while. The two tools, openSSL and keytool, have become my friends. Working with a network admin for the client, I finally saw the legitimate certificate correctly signed by the intermediate certificate, not the self-signed one. This means I finally saw this new error I have never seen before.

javax.naming.AuthenticationException: [LDAP: error code 49 – 80090308: LdapErr: DSID-0C090334, comment: AcceptSecurityContext error, data 525, user@host.domain.tld:
    at com.sun.jndi.ldap.LdapCtx.mapErrorCode(LdapCtx.java:3041)

Research on the error code DSID-0C090334 led to indications the LDAP search username was incorrect. The Blackboard CE/Vista LDAP client lacks capabilities many clients have to make it easier to use such as searching deeper into a tree or across branches. In this case our clients configured the user as “cn=account”. We looked at other clients who had something like “cn=account,ou=group,dc=domain,dc=edu”. When presented with this discrepancy as likely a problem, the client suggested a path for us to try like the latter. I entered it, tried our test user.

It worked. They also confirmed it worked. Something to add to the wiki, I guess.
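For the wiki, the useful part of that error is the “data” value. The sketch below pulls it out and looks it up in a table of commonly documented Active Directory sub-codes for LDAP error 49. It assumes the directory is Active Directory, which the AcceptSecurityContext wording suggests, and the function name is my own.

```python
import re

# Commonly documented Active Directory sub-codes for LDAP error 49.
AD_DATA_CODES = {
    "525": "user not found",
    "52e": "invalid credentials",
    "530": "not permitted to log on at this time",
    "531": "not permitted to log on at this workstation",
    "532": "password expired",
    "533": "account disabled",
    "701": "account expired",
    "773": "user must reset password",
    "775": "account locked out",
}

_DATA_RE = re.compile(r"AcceptSecurityContext error, data ([0-9a-fA-F]+)")

def explain_ldap_error(message):
    """Return (sub_code, meaning) from an AD bind error, or None if absent."""
    match = _DATA_RE.search(message)
    if not match:
        return None
    code = match.group(1).lower()
    return code, AD_DATA_CODES.get(code, "unknown sub-code")
```

Here data 525 maps to “user not found”, which fits the bare “cn=account” not resolving until the full path was supplied.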

HtmlSecurity.config

If you are a CE/Vista admin, then you should probably be aware of $WLDOMAIN/serverconfs/HtmlSecurity.config.

This file holds the regular expressions that block user input intended to exploit forms. Say a student wants to write a mail message to another student with JavaScript to execute malicious code to hijack a session. One of the regexes here would reject the message on Submit with an error, so it would never make it into the database.
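The general idea looks something like the sketch below. The patterns are hypothetical ones written for illustration and are not copied from HtmlSecurity.config, whose actual expressions and syntax differ.

```python
import re

# Hypothetical patterns for illustration; not copied from HtmlSecurity.config.
BLOCKED_PATTERNS = [
    re.compile(r"<\s*script\b", re.IGNORECASE),   # opening script tags
    re.compile(r"\bon\w+\s*=", re.IGNORECASE),    # inline handlers such as onclick=
    re.compile(r"javascript\s*:", re.IGNORECASE), # javascript: URLs
]

def validate_submission(text):
    """Return (accepted, message); reject input matching any blocked pattern."""
    for pattern in BLOCKED_PATTERNS:
        if pattern.search(text):
            return False, f"Rejected: input matches {pattern.pattern!r}"
    return True, "Accepted"
```

Rejecting on the first match means the input never reaches storage. The trade-off is false positives on innocent text that happens to look like markup.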

The config file makes for interesting reading. Especially at the end where an administrator has the option of turning on items to block images, background images, anchor links, and (my personal favorite) any URL to an external portal “since it would be possible for students to trick instructors into unknowingly making requests to that system.”

 

Pick Up Line

(I will never use.)

My name’s Vista. Can I crash at your place tonight?

Noticed at geekpickuplines.

Especially funny for me because the product I run is the Blackboard Learning Management System Vista Enterprise. We just call it “Vista”. (Yes, very confusing when Windows Vista users want to know the compatibility of Vista with Vista. The answer: barely.)

OpenSSL Handshake

Chain

One of the questions we ask our clients initiating an engagement to help them set up external authentication from our LMS to their server is, “What is the certificate authority for your SSL certificate?” We have been burned by people purchasing certificates from authorities Java does not support. (And the support is indeed limited compared to say, Mozilla.)

We were given the name of an intermediate certificate which set off warning klaxons. There are none of these in the cacerts file, the list of root CAs Java uses.

So the clients set up to test. Failures. The error:

javax.naming.CommunicationException: hostname.domain.tld:port [Root exception is javax.net.ssl.SSLPeerUnverifiedException: peer not authenticated

From what I was able to find, the error meant the certificate was not understood. Framed into thinking the intermediate CA was the cause, I started looking at how to make it work. The two potential routes were to get the client to add the intermediate CA to their server or to test ways to complete the chain by adding the intermediate to my client.

More failures.

Amy suggested looking at the certificate on the foreign server by connecting with openssl to get a better idea where it said there was a problem. The command looks like:

openssl s_client -connect hostname:port

The return was pretty clear that it could not understand or trust a self-signed certificate. The “i:” in the last line below is the Issuer. This made it clear the certificate was not signed by the intermediate CA we were told. It was a self-signed certificate. Doh!

depth=0 /CN=hostname.domain.tld
verify error:num=20:unable to get local issuer certificate
verify return:1
depth=0 /CN=hostname.domain.tld
verify error:num=27:certificate not trusted
verify return:1
depth=0 /CN=hostname.domain.tld
verify error:num=21:unable to verify the first certificate
verify return:1
---
Certificate chain
 0 s:/CN=hostname.domain.tld
   i:/DC=tld/DC=domain/CN=domain-NAME-CA

It is clear I need to make checking the certificate on the foreign host part of the standard practice. I did some spot checking of previous setups to test against LDAP and every one has a good certificate chain.
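To make that check routine, the s_client output could be captured and parsed instead of read by eye. The sketch below only handles the output format shown above. The function name is my own, and newer OpenSSL versions may format these lines differently.

```python
import re

def parse_s_client(output):
    """Extract verify errors and chain subject/issuer pairs from s_client text."""
    errors = re.findall(r"verify error:num=(\d+):(.+)", output)
    chain = []
    subject = None
    for line in output.splitlines():
        line = line.strip()
        # Chain entries look like "0 s:<subject>" followed by "i:<issuer>".
        match = re.match(r"(\d+)\s+s:(.+)", line)
        if match:
            subject = match.group(2)
            continue
        match = re.match(r"i:(.+)", line)
        if match and subject is not None:
            chain.append({"subject": subject, "issuer": match.group(1)})
            subject = None
    return {
        "errors": [(int(num), text.strip()) for num, text in errors],
        "chain": chain,
    }
```

Comparing chain[0]["issuer"] with the CA name the client gave us would have exposed the mismatch before any time was spent on the intermediate certificate.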