Email Harvesters

Good Sign I missed the story about brothers convicted of harvesting emails the first time. Well, I noticed a followup.

Back around 2001, the CIO received complaints about performance for the web server. So, I went log trolling to see what the web server was doing. A single IP dominated the HTTP requests. This one IP passed various last names into the email directory. Some quick research revealed Apache could block requests from that IP. That calmed things down enough for me to identify the owner of the IP. The CIO then bullied the ISP to provide contact information for the company involved.
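That kind of log trolling can be sketched in a few lines. This is a minimal illustration rather than what I actually ran back then: it assumes the access log is in the common log format, where the client address is the first field, and the sample lines and addresses are made up.

```python
from collections import Counter


def top_clients(log_lines, limit=5):
    """Count requests per client address and return the busiest clients."""
    counts = Counter()
    for line in log_lines:
        fields = line.split()
        if fields:  # skip blank lines
            # Assumption: common log format, client address is the first field.
            counts[fields[0]] += 1
    return counts.most_common(limit)


# Hypothetical sample lines, not taken from any real log.
sample = [
    '192.0.2.10 - - [01/Mar/2001:10:00:00 -0500] "GET /directory?name=smith HTTP/1.0" 200 512',
    '192.0.2.10 - - [01/Mar/2001:10:00:01 -0500] "GET /directory?name=jones HTTP/1.0" 200 498',
    '198.51.100.7 - - [01/Mar/2001:10:00:02 -0500] "GET /index.html HTTP/1.0" 200 2048',
    '192.0.2.10 - - [01/Mar/2001:10:00:03 -0500] "GET /directory?name=brown HTTP/1.0" 200 530',
]

busiest = top_clients(sample)
print(busiest)
```

A single address far ahead of everyone else in that list is the kind of pattern that stood out.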

Previous little adventures like this landed me a permanent job, so I jumped at similar challenges.

Well, a few years later, it happened again. This time my boss had made me develop a script for the dissemination of the anti-virus software package to home users. Basically, it used email authentication to verify whether someone could get the download link. So, I applied the same technique to the email directory. Well, this upset some people who legitimately needed email addresses. So the human workers would provide email addresses to people with a legitimate need.

I’m glad that, since I left, VSU no longer looks up email addresses for people. (I thought some of the requests questionable.) Also, my little email authentication script was before LDAP was available to the university. I think the new solution is much better.

One of the more vocal complainers about my having stopped non-VSU access to the email directory was my current employer. We apparently list email addresses for employees freely. Which makes me wonder how much of the spam we get is due to the brothers described at the beginning of this story? Or other email harvesters? Just hitting the send button potentially exposes the email address.

No worries. I’m sure Glenn is protecting me. 🙂

Smart Boys

George and I talked about this some last night.

Nature vs Nurture… I tend to think of both as bottlenecks for human development. To me, the debate about which does more makes as much sense as debating which is better for a web application: Apache or MySQL? Both are involved and affect the end results. The debate should be about how to leverage the synergy of both, but that is another blog post.

We humans have 46 chromosomes, 23 from each parent, which come in pairs. Males have an XY pair. Females have an XX pair. Brain Rules was the first place I’ve read that ~1500 brain-related genes are on the X and ~100 on the Y (and losing ~5 every million years). So the X chromosome is quite important for determining brain development.

For boys, the one X they have comes from the mother. Girls inherit an X chromosome from both their mother and father. To set up the strong potential of great genes for boys, look to women who are really intelligent. That tells you there is a 50% shot for the boy to get a good X. If both of the woman’s parents are intellectuals, even better.

Be smart about it though… Don’t make an IQ score for the parents part of a prenuptial agreement.

My mother has occasionally said things I enjoy remind her of her father. That’s a biased sample. 

CE / Vista Undocumented Workspaces

On the WebCT Users email list (hosted by Blackboard) there is a discussion about a mysterious directory called unmarshall which suddenly appeared. We found it under similar circumstances as others by investigating why a node consumed so much disk space. Failed command-line restores end up in this unmarshall directory.

Unmarshalling in Java jargon means:

converting the byte-stream back to its original data or object 1

This suspiciously sounds like what a decryption process would use to convert a .bak file into a .zip so something can open the file.

This is the fourth undocumented work space where failed files sit for a while and cause problems, with no forewarning from the vendor.

Previous ones are:

  1. Failed UI backups end up in the weblogic81 (Vista 3, does this still happen in Vista 8?) directory.
  2. Failed tracking data files end up in WEBCTDOMAIN/tracking (Vista 3, apparently no longer stored this way in Vista 4/8 according to CSU-Chico and Notre Dame)
  3. Web Services content ends up in /var/tmp/ in files named Axis####axis. These are caused by a bug in DIME (like MIME) for Apache Axis. No one is complaining about the content failing to arrive, so we presume the files just end up on the system.

#3 was the hardest to diagnose because we had no way to tie the data back to user activity.

Are these all there are? I need to do testing to see which of these I can cross off my list going forward in Vista 8. Failed restores are on it indefinitely for now.



Tale of Defeating the Crazy Woman

Babies are fascinated by me. When the two of us are in a room, they often find me the most interesting thing in the room. Usually, it is mutual.

So, a mutual friend of a friend, Mojan, has a fantastic blog. The past year or so has been about being pregnant and most recently figuring out how to be a parent for the first time. Well, a crazy woman set up a “blog” which hotlinks images from Mojan’s blog and falsely represents the child in the photos. Ick. I offered to help with this identity theft issue.

Once upon a time, I was annoyed with people taking images from my last employer’s web site. Since I was the campus web designer, I created an image which said, “All your image are belong to VSU.” Also, as the web server administrator, I figured out how to defeat hotlinking with .htaccess by using mod_rewrite to give them my annoyance rather than their content. For the next couple days I watched the perpetrators try and figure out what was wrong. The hate mail I got was fantastic! I recommended Mojan do the same. When she agreed, I went researching to do what I did once upon a time. This is the .htaccess file I recommended she try.

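# Basics
Options +FollowSymLinks
RewriteEngine On

# Conditions are true when a referer is present and it is any host other than yours
RewriteCond %{HTTP_REFERER} !^$
RewriteCond %{HTTP_REFERER} !^http://(www\.)?mojansami\.com/ [NC]

# What to do with gif, jpg, png requests from other hosts. The original sent them
# to a target which does not exist. That target is missing from this copy, so as an
# assumption this line uses "-" (no substitution) with F to refuse the request.
RewriteRule .*\.(gif|jpg|png)$ - [F,NC]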

My directions were not all that specific. So the next thing I know, her site is sporting an Internal Server Error. *headdesk* She used Dreamweaver to create the .htaccess file and upload it to her site. She reported the file she uploaded disappeared. Eventually, it did occur to me to look for the error.log and see what it said. The log complained about DOCTYPE in the .htaccess file in the home directory. A file which did not show in the FTP listing. So, replacing the bad .htaccess file with a blank one fixed the Internal Server Error.

The .htaccess file in the right place, of course, resolved the issue with the crazy woman hotlinking.
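For anyone who finds mod_rewrite conditions hard to read, here is a rough sketch of the decision those rules make, written as a plain function. It only mirrors the two referer conditions and the image pattern from the file above. The function name and the test paths are made up for illustration, and a real server sees more than this (for example, referers are easy to fake or omit).

```python
import re

# Same patterns as the .htaccess file, matched case-insensitively like the nc flag.
OWN_SITE = re.compile(r"^http://(www\.)?mojansami\.com/", re.IGNORECASE)
IMAGE = re.compile(r".*\.(gif|jpg|png)$", re.IGNORECASE)


def is_hotlink(path, referer):
    """Return True when the rewrite rule would apply to this request."""
    if not IMAGE.match(path):
        return False  # the rule only covers gif, jpg, and png requests
    if referer == "":
        return False  # first condition: an empty referer is let through
    # Second condition: anything not coming from the site itself is a hotlink.
    return OWN_SITE.match(referer) is None


print(is_hotlink("/photos/baby.JPG", "http://example.com/blog/"))
print(is_hotlink("/photos/baby.jpg", "http://www.mojansami.com/2008/"))
```

Letting the empty referer through is deliberate: some browsers and proxies strip it, and blocking those would break images for legitimate visitors.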

Nothing can fix the pain of another person committing identity theft against you or your loved ones. I really hope Mojan doesn’t become discouraged and abandon blogging entirely. Between moderation and authentication she might find a better balance.

Do you have any stories of online identity theft?

A More Usable Usability

Previously I have seen usability described as the ease of using a web site. These four essences of usability are interesting.

I believe that to satisfy customers, a Web site must fulfill four distinct needs:

  • Availability: A site that’s unreachable, for any reason, is useless.
  • Responsiveness: Having reached the site, pages that download slowly are likely to drive customers to try an alternate site.
  • Clarity: If the site is sufficiently responsive to keep the customer’s attention, other design qualities come into play. It must be simple and natural to use – easy to learn, predictable, and consistent.
  • Utility: Last comes utility — does the site actually deliver the information or service the customer was looking for in the first place?

Web Usability: A Simple Framework

The first two items deal with system administration issues like the network, server(s), database, or application. Redundancy and proactively dealing with problems before they impact the system hopefully maximize availability. Optimization for performance hopefully maximizes responsiveness. An unhealthy database could fail to deliver information.

The last two items deal with design issues. More utility issues are likely based in design than tuning.

UPDATE: In my past life as a “Webmaster,” my fingers were dirty in all four aspects of usability. These were my servers and while not my design, I certainly influenced it by cleaning up the HTML and presentation. We created in-house everything except some outsourced photography and the Apache web server.

Blackboard’s Vista is a proprietary application with decent opportunities for instructional designers to provide clarity and utility. As much as it provides, clients often purchase or create additional applications to integrate with Vista to fill in holes Blackboard left. Okay, technically, WebCT left those holes, but Blackboard took the same model with Academic Suite. Blackboard doesn’t really intend to fill in those holes. They should for issues affecting most of their customers on each platform. This is the same approach taken by open source products, with the caveat that third party companies are not filling in the holes; customers are developing their own solutions and providing them back to the community.

The declining responsiveness of Vista over time definitely seems to create one frustrating difficulty for some clients. As the database tables get larger, responsiveness of the sites declines. Ouch. Delete it all… Oh, wait… Can we really do that?

On the Fourth through Sixth Loops of Ready 2 Wear

I really have to stop listening to the same song played over and over. It may affect my thinking….

We had another node crash due to the Sun JVM issue. Our start script failed to make a file in /var so the node did not become fully operational as expected. While waiting for those with permission to delete some stuff to free up space, I went looking for what I could delete myself. Naturally /var/tmp seemed a likely place. I found 1,171 files named Axis#####axis. (Replace the #s with well… numbers.) They used up only 42MB. Most were small. Looking across all our machines there are thousands of these dating back to February of this year.

I love the Unix file command. It will tell you what kind of files are there. So I used file | sort -k 2 to sort by the type. Almost all of the files were either plain text, JPEGs, or GIFs. One file, called a “c program file,” turned out to be a JavaScript file (JavaScript syntax is based on C). I downloaded a JPEG file locally, renamed it to have the .jpg extension, and opened it in an image viewer. It opened correctly. Seems it’s a graphic of a table.
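To show the idea behind that kind of sorting, here is a toy sketch that guesses a type from the first bytes of each file and groups the names by type. It checks only three cases (the JPEG and GIF signatures, then plain ASCII), which is a tiny subset of what the real file command knows. The file names and contents are invented for the example.

```python
def sniff(data):
    """Guess a coarse type from leading bytes (far less thorough than file)."""
    if data.startswith(b"\xff\xd8\xff"):
        return "JPEG image"
    if data.startswith((b"GIF87a", b"GIF89a")):
        return "GIF image"
    try:
        data.decode("ascii")
    except UnicodeDecodeError:
        return "data"
    return "ASCII text"


def group_by_type(named_blobs):
    """Map each guessed type to the sorted list of names that have it."""
    groups = {}
    for name, data in named_blobs.items():
        groups.setdefault(sniff(data), []).append(name)
    return {kind: sorted(names) for kind, names in sorted(groups.items())}


# Hypothetical names and contents standing in for the files in /var/tmp.
sample = {
    "Axis00003axis": b"\xff\xd8\xff\xe0\x00\x10JFIF",
    "Axis00001axis": b"lecture notes, week one\n",
    "Axis00002axis": b"GIF89a\x01\x00\x01\x00",
    "Axis00004axis": b"function show() { return 1; }\n",
}

grouped = group_by_type(sample)
for kind, names in grouped.items():
    print(kind, names)
```

Note that the JavaScript lands under plain text here; telling source code apart takes the kind of heuristics file uses, which is also how it ended up calling that one a C program.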

It would seem our Blackboard Vista 3 has been collecting these files for months. They do not take up very much space. There are not nearly enough files to represent a download of content by all users. Our /var would fill up hourly in that case.

Axis is an Apache SOAP project. Vista’s exposed APIs use Axis, I believe. So, the running hypothesis is several of our campuses are using a product which is contacting the APIs to upload content. It’s spread out enough that all four clusters are affected. It’s something that started about February.

Suspect #1 Respondus – Chosen because we know it hits the APIs to upload content. Discounted because the content is lecture materials. Respondus works with assessments (aka quizzes, tests, exams).

Suspect #2 Impatica – Chosen because the JavaScript file references PPT. Impatica compacts PowerPoint (aka PPT) files and allows them to play without needing a PPT player. Their support pages teach users how to use the Campus Edition 4 user interface to upload content into a course. O-kay….

Suspects #n Softchalk, Diploma, Microsoft .Learn, etc. – I haven’t really investigated any of these. They are just names to me at the moment.

UPDATE: So… There is a bug in Axis which dumps these files into the file system. The files can be deleted as long as they are not current.

BbWorld Presentation Redux Part I – Automation

Much of what I might write in these posts about Vista is knowledge accumulated from the efforts of my coworkers.

I’ve decided to do a series of blog posts on our presentation at BbWorld ’07, on behalf of the Georgia VIEW project, Maintaining Large Vista Installations (2MB PPT). I wrote the bit about tracking files a while back in large part because of the blank looks we got when I mentioned in our presentation at BbWorld that these files exist. For many unanticipated reasons, these may not be made part of the tracking data in the database.

Automation in this context essentially is the scheduling of tasks to run without a human needing to intercede. Humans should spend time on analysis not typing commands into a shell.

Rolling Restarts

This is our internal name for restarting a subset (consisting of nodes) of our clusters. The idea is to restart all managed nodes except the JMS node, usually one at a time. Such restarts are conducted for one of two reasons: 1) have the node pick up a setting or 2) have Java discard everything from memory. The latter is why we restart the nodes once weekly.

Like many, I was skeptical of the value of restarting the nodes in the cluster once weekly. Until, as part of the Daylight Saving Time patching, we provided our nodes to our Systems folks (hardware and operating systems) and forgot to re-enable the Rolling Restarts for one batch. Those nodes started complaining about issues into the second week. Putting the Rolling Restarts back into place eliminated the issues. So… Now I am a believer!

One of my coworkers created a script which 1) detects whether or not Vista is running on the node, 2) shuts down the node only if Vista is running, 3) once down, starts up the node, and 4) finally checks that it is running. It’s pretty basic.
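The four steps can be sketched like this. It is not my coworker’s script, just an outline of the same flow under some assumptions: the checks and commands are passed in as functions (is_running, stop, start are placeholder names), and FakeNode is a stand-in so the example runs without a Vista node.

```python
import time


def wait_for(condition, attempts, pause):
    """Poll condition() up to `attempts` times, sleeping `pause` seconds between tries."""
    for _ in range(attempts):
        if condition():
            return True
        time.sleep(pause)
    return False


def rolling_restart(is_running, stop, start, attempts=3, pause=0.0):
    """Restart one node following the four steps and report what happened."""
    if not is_running():                      # step 1: is Vista running?
        return "skipped"
    stop()                                    # step 2: shut down only if it was running
    if not wait_for(lambda: not is_running(), attempts, pause):
        return "stop failed"
    start()                                   # step 3: once down, start it up
    if not wait_for(is_running, attempts, pause):
        return "start failed"                 # step 4: confirm it is running
    return "restarted"


class FakeNode:
    """Hypothetical stand-in for a node so the sketch can run anywhere."""

    def __init__(self, running):
        self.running = running
        self.events = []

    def is_running(self):
        return self.running

    def stop(self):
        self.running = False
        self.events.append("stop")

    def start(self):
        self.running = True
        self.events.append("start")


node = FakeNode(running=True)
print(rolling_restart(node.is_running, node.stop, node.start), node.events)

idle = FakeNode(running=False)
print(rolling_restart(idle.is_running, idle.stop, idle.start), idle.events)
```

Skipping a node that is already down matters: if someone took it out of service on purpose, the weekly restart should not bring it back.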

Log cleanup to preserve space

We operate on a relatively small space budget. Accumulating logs ad infinitum strikes us as unnecessary. So, we keep a month’s worth of logs for certain ones. Others are rolled by Log4j to keep a certain number. Certain activities can mean only a day’s worth are kept, so we have on occasion increased the number kept for diagnostics. Log4j is so easy and painless.

We use Unix’s find with mtime to look for files 30 days old with specific file names. We delete the ones which match the pattern.
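For readers without find handy, the same cleanup can be approximated as below. This is a sketch, not our production job: the function names, the "*.log" pattern, and the file names in the demo are made up, and the age test is a plain comparison in seconds, so it will not match find’s whole-day rounding exactly.

```python
import fnmatch
import os
import tempfile
import time

SECONDS_PER_DAY = 86400


def old_files(directory, pattern, days, now=None):
    """Return regular files matching pattern that were modified more than `days` days ago."""
    now = time.time() if now is None else now
    cutoff = now - days * SECONDS_PER_DAY
    matches = []
    for entry in os.scandir(directory):
        if entry.is_file() and fnmatch.fnmatch(entry.name, pattern):
            if entry.stat().st_mtime < cutoff:
                matches.append(entry.path)
    return sorted(matches)


def delete_old_files(directory, pattern, days, now=None):
    """Delete the files old_files() finds and return their paths."""
    removed = old_files(directory, pattern, days, now)
    for path in removed:
        os.remove(path)
    return removed


# Demo in a throwaway directory with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    forty_days_ago = time.time() - 40 * SECONDS_PER_DAY
    for name, mtime in [("old.log", forty_days_ago), ("old.txt", forty_days_ago), ("new.log", None)]:
        path = os.path.join(tmp, name)
        with open(path, "w") as handle:
            handle.write("sample\n")
        if mtime is not None:
            os.utime(path, (mtime, mtime))  # backdate access and modification times
    removed_names = [os.path.basename(p) for p in delete_old_files(tmp, "*.log", 30)]
    remaining = sorted(os.listdir(tmp))

print(removed_names, remaining)
```

Running old_files on its own first is a cheap dry run before letting anything delete.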

UPDATE 2007-SEP-18: The axis files in /var/tmp will go on this list, but we will delete any more than a day old.

Error reporting application, tracking, vulnerabilities

Any problems we have encountered, we expect to encounter again at some point. We send ourselves reports to stay on top of potentially escalating issues. Specifically, we monitor for the unmarshalled exception for WebLogic and for tracking files that failed to upload, and we used to collect instances of a known vulnerability in Vista. Now that it’s been patched, we are not looking for it anymore.

Thread dumps

Blackboard at some point will ask for thread dumps at the time the error occurred. Replicating a severe issue strikes us as bad for our users. We have the thread dumps running every 5 minutes and can collect them to provide Blackboard on demand. No messing with the users for us.

Sync admin node with backup

We use rsync to keep a spare admin node in sync with the admin node for each production cluster. Should the admin node fail, we have a hot spare.

LDIS batch integration

Because we do not run a single cluster per school and the Luminis Data Integration Suite does not work with multiple schools for Vista 3 (rumor is Utah has it working for Vista 4), we have to import our Banner data in batches. The schools we host send the files, our expert reviews the files and puts them in place. A script finds the files and uploads each in turn. Our expert can sleep at night.

Very soon, we will automate the running of the table analysis.

Anyone have ideas on what we should automate?
