xmllint

This Linux tool is my new best friend. We get thousands of XML files from our clients for loading user, class, and enrollment information. Some of these clients customize our software or write their own software for generating the XML.

This means we frequently get oddities in the files which cause problems. Thankfully I am not the person who has to verify these files are good. I just get to answer the questions that person has about why a particular file failed to load.

The CE/Vista import process will stop if its validator finds invalid XML. Unfortunately, the error “An exception occurred while obtaining error messages. See webct.log” doesn’t point to invalid XML as the cause.

Usage is pretty simple:

xmllint --valid /path/to/file.xml | head

  1. If the file is valid, then the whole file is in the output.
  2. If there are warnings, then they precede the whole file.
  3. If there are errors, then only the errors are displayed.

I use head here because our files can be up to 15MB, so this prevents the whole file from going on the screen for the first two situations.

I discovered this while researching how to handle the first error below, and it came up again today. It has been useful for catching errors in client-supplied files that failed to load.

1: parser error : XML declaration allowed only at the start of the document
 <?xml version="1.0" encoding="UTF-8"?>

162: parser error : EntityRef: expecting ‘;’
<long>College of Engineering &amp CIS</long>

The number before the colon is the line number. The caret xmllint prints to indicate where on the line the error occurred isn’t accurate, so I ignore it.

My hope is to get this integrated into our processes to validate these files before they are loaded and save ourselves headaches the next morning.
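A pre-load check could be built around the same command. Here is a minimal sketch, assuming a POSIX shell; the function name and directory are mine, not part of our process. `--noout` keeps xmllint from echoing the whole file (add `--valid` to also validate against the DTD):

```shell
# Sketch: validate every client XML file in a directory before the
# nightly import runs. Prints each invalid file and returns non-zero
# if any fail, so the load script can bail out early.
validate_imports() {
    dir=$1
    status=0
    for f in "$dir"/*.xml; do
        [ -e "$f" ] || continue   # glob matched nothing; empty directory
        # --noout suppresses echoing the document; errors would go to a
        # log in practice rather than /dev/null
        if ! xmllint --noout "$f" 2>/dev/null; then
            echo "INVALID: $f"
            status=1
        fi
    done
    return $status
}

# Hypothetical usage: validate_imports /data/imports/incoming && run_load
```

Anything this flags could be bounced back to the client instead of failing in CE/Vista overnight.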

Useful User Agents

Rather than depend on end users to accurately report the browser used, I look for the user agent in the web server logs. (Yes, I know it can be spoofed. But power users, the ones likely to spoof it, would be trying different things to resolve their own issues rather than coming to us.)

Followers of this blog may recall I changed the Weblogic config.xml to record user agents to the webserver.log.

One trick I use is splitting on the double quotes with awk to isolate just the user agent. The results are then sorted by name so uniq -c can count how many of each are present. Finally, I sort again numerically, largest first, to see which are the most common.

grep <term> webserver.log | awk -F\" '{print $2}' | sort | uniq -c | sort -n -r

This is what I use when looking for a specific user. If I am looking at a wider range, such as the user agents for hits on a page, then I will probably pipe through head to look at the top 20.

A “feature” of this is getting the build (Firefox 3.0.11) rather than just the version (Firefox 3). To count hits for a particular version, I tend to use something more like this:

grep <term> webserver.log | awk -F\" '{print $2}' | grep -c '<version>'

I have yet to see many CE/Vista URIs containing the names of web browsers, so grepping for a browser name rarely produces false matches. These are the most common ones to look for (what to grep – name – notes):

  1. MSIE # – Microsoft Internet Explorer – I’ve seen 5 through 8 in the last few months.
  2. Firefox # – Mozilla Firefox – I’ve seen 2 through 3.5. There is enough difference between 3 and 3.5 (also 2 and 2.5) that I would count them separately.
  3. Safari – Apple/WebKit – In searching for this one, I would add a ‘grep -v Chrome’ to eliminate Google Chrome user agents, since Chrome’s user agent string also contains Safari.
  4. Chrome # – Google Chrome – Only versions 1 and 2.

Naturally there are many, many others. It surprised me to see iPhone and Android on the list.
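The greps above can be rolled into one loop that reports a count per browser family. This is only a sketch: the function name is mine, and it assumes, per the config.xml change mentioned earlier, that the user agent is the second double-quoted field in webserver.log.

```shell
# Sketch: count hits per browser family in a log whose second
# double-quoted field is the user agent. Function name is hypothetical.
browser_counts() {
    log=$1
    for browser in MSIE Firefox Chrome Safari; do
        if [ "$browser" = Safari ]; then
            # Chrome's user agent also says "Safari", so exclude it
            n=$(awk -F\" '{print $2}' "$log" | grep Safari | grep -vc Chrome)
        else
            n=$(awk -F\" '{print $2}' "$log" | grep -c "$browser")
        fi
        echo "$browser $n"
    done
}

# Hypothetical usage: browser_counts webserver.log
```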

Trusting Social Networks

Sunday at brunch we had an interesting conversation about Facebook.

Establishing the appropriate privacy levels so the various constituents see only appropriate material is hard. So hard it takes pages of text and screenshots just to paint a picture of what to review for the top 10 Facebook privacy settings.

We were discussing how to make the Facebook world we touched more private. How to keep those we supervise or those who supervise us at bay once accepted into our social circle. Few of us only post things our grandmothers would find acceptable, so how do we ensure grandma will never see that picture? This meant banning grandma from seeing the Wall or photo albums or tagged photos.

I had heard we would soon be able to change the privacy levels of individual posts. This privacy granularity comes at a price, according to the New York Times:

By default, all your messages on Facebook will soon be naked visible to the world. The company is starting by rolling out the feature to people who had already set their profiles as public, but it will come to everyone soon.

People like walled gardens. Taking a term from Seth Godin, interacting with just the handpicked few forms a tribe.

If sunlight is the best disinfectant, then social networking on Facebook will die should it be exposed to the world (or too hard to remain private). The most common criticism of blogging is the whole world is in your business. People like the faux-protection of participating online where Google cannot archive it for posterity. This is why Facebook experienced such explosive growth.

Hopefully users will be able to deal with keeping everything as private as they like. Otherwise, we’ll be looking for another walled garden. Maybe I’ll even end up back on my private Twitter account?

Bottle v Tap

It’s funny what people think about something we take for granted. Brown municipal tap water was stated as the reason for drinking bottled water. Is it a corporate v government thing? Is it because bottled water is so much more expensive than tap water that it must be better?

From Coca-Cola’s letter to the state of California about what is in DASANI water:

Most facilities that purify and bottle DASANI procure water from municipal water systems. At a few plants, however, water is obtained from protected groundwater sources managed by the bottling plant, with approvals from local authorities.
— DASANI® Bottled Water Report as required by California SB 220 (PDF)

It goes on to describe what they do to purify the water they procure: activated carbon filtration, reverse osmosis, ultraviolet light disinfection, re-mineralization, and ozonation. So the municipalities get the water to within EPA standards but not FDA standards; companies selling bottled water have to adhere to the FDA standard, not the EPA’s. Maybe it’s a good thing: “Generally, over the years, the FDA has adopted EPA standards for tap water as standards for bottled water.” FDA Consumer magazine: Bottled Water: Better Than the Tap? (Should we be worried the same overworked agency which lets us get hit with all kinds of bacteria is protecting us from bad water?)

Athens-Clarke County Public Utilities Department has a similar report where they list how the water exceeds the EPA standards.

Personally, I would love everyone producing water to publish reports about their water quality with the amounts of detected contaminants listed, as is shown in this DASANI analysis example. Too bad it’s just an example of a typical analysis. Anyone know where the real DASANI quality reports might be found?

LMS Security

This morning there was a flurry of effort to locate an article called “Hacking WebCT.” My coworker was able to locate it. We were disappointed. 

The main points of the article were:

  1. Lazy administrators make compromising user accounts easy.
  2. Lazy instructors make getting questions for assessments easy.

These apply to any LMS. So, here is some advice to counter the issues raised in this article.


Accounts

Default passwords are the bane of any system. Make users change them. (Yes, this increases support tickets.) Default passwords usually linger because the administrators did not integrate the LMS authentication with LDAP, Kerberos, or CAS, which allow for central management of accounts. Central management means fewer accounts sit around with easily guessed, initially imposed credentials.

Linking many services together also raises the exposure should one account be compromised. Enforce decently strong passwords. Passwords that are too strong and changed too frequently will push users toward means of remembering them that defeat the point. Passwords probably should not ever be just birthdays.

I am not sure what advice to provide about the potential of a student installing a keylogger on a classroom computer.


Assessment Cheating

A long availability period (like a week) gives enterprising students time to exploit the password issues above to see and research questions in advance. A quiz with a short availability period, like an hour, leaves less time to log into another account, record the questions, research them, and then go back into the proper account to take the assessment.

Instructors should use custom questions. Students can obtain questions provided by publishers in ePacks or with textbooks from previous students, from the same textbooks the instructor received, or even from web sites which sell the information.

High stakes testing ensures students will look to cheat. When the value of the questions is high, these methods, easier than knowing the material, ensure a war between students and instructors over cheating. Of course, lowering the value of the questions increases the workload of the instructor.
🙁

Recovering Pictures

William borrowed my camera to go on his honeymoon. He also lost the photos to a poorly timed crash and drive reformat. So he wants to borrow the card and recover the data. Thankfully I have not used the camera since he returned it, despite thinking I should.

Luckily I ran across “A Computer Repair Utility Kit You Can Run From a Thumb Drive.”

I didn’t like the setup of Photorec, as it runs through the command line. Navigating the tree was confusing at best. Still, it recovered 1,166 photos / 3.62GB for me.

Not trusting a single method, I also tried Recuva. That worked a little better. It reported 1,395 files found. However, 177 were unrecoverable. Getting 1,218 pictures / 3.78GB back was 52 / 160MB better than Photorec. Though many of the “recovered” pictures just say: Invalid Image. Maybe they really are Raw?

While trying to use Restoration, it crashed the first time. Not sure why. It was fine the next time, though it only found 4 photos.

Filename: Photorec doesn’t restore files with anything like the original name. Recuva and Restoration do.

Metadata: OSes and image editors know about the EXIF data in pictures. All of the Photorec pictures have a date taken; most of the Recuva pictures do. I guess I could check whether only 52 pictures are missing the EXIF data? That might explain why Photorec lost some of them.
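One way to check that theory would be to count how many recovered files lack a date-taken tag. A sketch, assuming the exiftool utility is installed; the function name and directory argument are hypothetical:

```shell
# Sketch: count recovered JPEGs with no EXIF DateTimeOriginal tag.
# Assumes exiftool is installed; the directory argument is hypothetical.
missing_exif_dates() {
    dir=$1
    missing=0
    for f in "$dir"/*.jpg "$dir"/*.JPG; do
        [ -e "$f" ] || continue
        # -s3 prints only the tag value; EXIF dates look like "2009:06:..."
        if ! exiftool -s3 -DateTimeOriginal "$f" 2>/dev/null \
                | grep -q '^[0-9]\{4\}:'; then
            missing=$((missing + 1))
        fi
    done
    echo "$missing"
}

# Hypothetical usage: missing_exif_dates ~/recovered-recuva
```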

All in all, it was a fun experiment. Now I am curious how these stack up against the proprietary software. Why pay $40 when these are better?

The Digital Switch

The Long Tail claims that consumers, given more options, will reflect their widely varied interests. Physical stores cannot fill all of the demand, so bytes stored on disk are the fastest, cheapest method for getting stuff to consumers. We see a mostly complete example of this in the shift to digital music.

Vinyl records were the first physical music media form I used. Later, cassette tapes (1980s) and compact disc (1990s) achieved dominance. In 2001, I started the transition to digital music. There were some stumbles along the way because of technology changes and trusting vendors saying Digital Rights Management is good for consumers. At present, I only listen to digital music when using my own collection.

Digital video seems more complicated. Streaming web sites and on-demand television have the potential to fit the Long Tail model, where consumers have access to insanely varied content when they want it. DVRs fix neither the when (they just shift the airing to another time) nor the insanely varied content. Movie rental distributors like Blockbuster and Netflix are moving toward distributing digital movies and TV shows in setups similar to on-demand. Nothing has yet come close to winning.

Digital books may yet get some traction. Computer screens cause eye strain. Laptops don’t feel like a book. PDAs, Blackberrys, and other handhelds with small screens require a ton of scrolling. A recent solution to this is “epaper,” which doesn’t constantly refresh. The Amazon Kindle, Barnes & Noble Nook, and Sony Reader are the biggest players. (The Long Tail is not available for the Kindle but is for the Reader. WTH?)

Remaining issues for me:

  1. Ownership is dying.
    • I really like the idea of playing music on my iPod or from CDs. I play DVDs on my computer because I can’t play my DVR stuff in a hotel. So streaming and on-demand only solutions bother me as long-term solutions. If it is easy for distributors to store it because it is just bytes, then it is easy for me to do so as well.
    • I have books from 20 years ago I can still read. Technology changes too much to depend on something I buy today working tomorrow. So maybe “renting” is a way better approach for digital media?
  2. The black markets for music and movies prove consumers want everything any time. Companies must embrace consumer demand and make it easier for consumers or suffer. I think companies changing to accommodate consumer demand is the only reason the music companies have survived. Litigation cannot solve it.
  3. Hardware investment gets expensive every few years.

My solution? Wait and see.

Higher Ed Twitter List

Karlyn Morissette posted her Master Higher Ed Twitter List. Other than @eironae and @barbaranixon, I didn’t know anyone on the list. So I thought I would post a list of the higher education professionals I follow, categorized by primary expertise.

Blackboard twitterers might be another post.

Those in bold are coworkers.

College / University / Departments

@atsu_its – A.T. Still University – IT Help Desk & Support
@BC_Bb – Butte College Blackboard System
@CTLT – Center for Teaching, Learning, and Technology @ Goucher College
@GeorgiaSouthern – Georgia Southern University
@ucblackboard – University of Cincinnati Blackboard Support

CE/Vista

@amylyne – Amy Edwards – CE/Vista DBA
@corinnalo – Corrina Lo – CE/Vista Admin
@elrond25 – Carlos Araya – CE/Vista Admin, Dr. C
@jdmoore90 – Janel Moore – CE/Vista Admin
@jlongland – Jeff Longland – CE/Vista Programmer
@lgekeler – Laura Gekeler – CE/Vista Admin
@ronvs – Ron Santos – CE/Vista Analyst
@sazma – Sam Rowe – YaketyStats
@skodai – Scott Kodai – former Vista Admin now manager
@tehmot – George Hernandez – CE/Vista DBA
@ucblackboard – UC Blackboard Admins

Faculty

@academicdave – David Parry – Emerging Media and Communications
@amberhutchins – Amber Hutchins – PR and Persuasion
@barbaranixon – Barbara Nixon – Public Relations
@captain_primate – Ethan Watrall – Cultural Heritage Informatics
@doctorandree – Andree Rose – English
@KarenRussell – Karen Russell – Public Relations
@mwesch – Mike Wesch – Anthropology
@prof_chuck – Chuck Robertson – Psychology

Information Technologist / Support

@aaronleonard – Aaron Leonard
@Autumm – Autumm Caines
@bwatwood – Britt Watwood
@cscribner – Craig Scribner
@dontodd – Todd Slater
@ECU_Bb_Info – Matt Long
@ekunnen – Eric Kunnen
@heza – Heather Dowd
@hgeorge – Heather George
@masim – ???
@mattlingard – Matt Lingard
@meeganlillis – Meegan Lillis
@soul4real – Coop

Assessment / Library / Research

@alwright1 – Andrea Wright – Librarian
@amylibrarian – Amy Springer – Librarian
@amywatts – Amy Watts – Librarian
@elwhite – Elizabeth White – Librarian
@kimberlyarnold – Kimberly Arnold – Educational Assessment Specialist
@mbogle – Mike Bogle – Research

Web Design / UI

@eironae – Shelley Keith

Director

@aduckworth – Andy Duckworth
@garay – Ed Garay
@grantpotter – Grant Potter
@IDLAgravette – Ryan Gravette
@Intellagirl – Sarah B. Robbins
@tomgrissom – Tom Grissom


Separate Populations?

What are my neighbors doing? Curiosity about that question resulted in some conflicting data. The services below are ordered by when I added their RSS feeds.

  1. search.twitter.com for “Athens GA” – results are full of people talking about Athens, GA, not people in Athens, GA. Useful for people coming into town for an event.
  2. TweetLocal search for “Athens, GA” (or 30605; same results) within 20 miles – Over the last 24 hours the RSS feed has given me 12 posts. First 5 users in search before 9pm: JeremyAce4 in Athens, GA; justdandelions in athens, ga; bozaf in Néa Smírni, Europe/Athens; aaronbarton in Athens, GA; elbee103 in Athens, GA (last @ 7pm). The hit on Europe/Athens is pretty disappointing.
  3. search.twitter.com for “near:AHN within:20mi” (or 30605 or AthensGA; same results) – Over the same 24 hour period, its RSS feed has given me 53 posts. First 5 users in search before 9pm: ThePicMan, julieteaston, ryan_lafountain, RyanHague, alester (last @ 7pm).
No overlap. How is that possible when they supposedly are coming from the same population (time, space, and active)? Both services look for their data on Twitter. Both are looking at the self-identified location for Twitter users. Both have the same range. So, why do they have such different results?
Looking specifically for the TweetLocal users in search.twitter.com reveals them in the results. Searching on a user, though, doesn’t reveal the location. The profile shows the right location, so they should have been in both results.
Both fail in my opinion.