information


xmllint

This Linux tool is my new best friend. We get thousands of XML files from our clients for loading user, class, and enrollment information. Some of these clients customize our software or write their own software for generating the XML.

This means we frequently get oddities in the files which cause problems. Thankfully I am not the person who has to verify these files are good. I just get to answer the questions that person has about why a particular file failed to load.

The CE/Vista import process will stop if its validator finds invalid XML. Unfortunately, the error “An exception occurred while obtaining error messages.  See webct.log” doesn’t sound like invalid XML.

Usage is pretty simple:

xmllint --valid /path/to/file.xml | head

  1. If the file is valid, then the whole file is in the output.
  2. If there are warnings, then they precede the whole file.
  3. If there are errors, then only the errors are displayed.

I use head here because our files can be up to 15MB, so this prevents the whole file from going on the screen for the first two situations.

I discovered this in researching how to handle the first situation below. It came up again today. So this has been useful to catch errors in the client supplied files where the file failed to load.

1: parser error : XML declaration allowed only at the start of the document
 <?xml version="1.0" encoding="UTF-8"?>

162: parser error : EntityRef: expecting ';'
<long>College of Engineering &amp&#059; CIS</long>

(Bolded the errors.) The number before the colon is the line number. The caret it uses to indicate where on the line an error occurred isn’t accurate, so I ignore it.

My hope is to get this integrated into our processes to validate these files before they are loaded and save ourselves headaches the next morning.
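A rough sketch of what that pre-load check might look like, assuming the files carry a DOCTYPE that xmllint can resolve. The function name check_xml_files is made up for illustration, and the paths are whatever you pass in. It uses --noout so the document itself is never echoed, which means head is only needed to trim the error output.

```shell
# Sketch only: check_xml_files is a hypothetical helper, not part of any tool.
# Prints OK or FAIL per file and returns non-zero if any file failed.
check_xml_files() {
    status=0
    for f in "$@"; do
        # --noout suppresses the echoed document; 2>&1 captures the errors.
        if errors=$(xmllint --noout --valid "$f" 2>&1); then
            echo "OK   $f"
        else
            echo "FAIL $f"
            # Show only the first few lines of what xmllint reported.
            printf '%s\n' "$errors" | head -n 5
            status=1
        fi
    done
    return "$status"
}
```

Something like check_xml_files /path/to/*.xml could then run ahead of the nightly load, with the non-zero return used to hold back the import.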

Self-Reporting

When I read something like this, I start to question the validity of the method.

Psychologist Sam Gosling analyzed the Facebook profiles of 236 college-aged people, who were also asked to fill out personality questionnaires… surveys that were designed to assess not only how study participants viewed themselves in reality, but also what their personalities would be like if they had all of their ideal traits.
The Psychology of Facebook Profiles | TIME

The better experiment here is to have half the participants maintain a normal Facebook profile. The other half would create a profile demonstrating their ideal self. Then compare those against the Big Five questionnaire looking at both. The list of personality traits in the article, “openness, agreeableness, conscientiousness, extraversion and neuroticism,” gives away the test used even though it is not explicitly named. Of course, I’m no fan of the Big Five.

Should the results match, you can say Facebook reveals whatever the Big Five measures. However, I’d be uncomfortable saying any instrument measuring self-reported information accurately reflected anything about a person’s real personality.

Rather than depend on end users to accurately report the browser used, I look for the user-agent in the web server logs. (Yes, I know it can be spoofed. Power users would be trying different things to resolve their own issues, not coming to us.)

Followers of this blog may recall I changed the Weblogic config.xml to record user agents to the webserver.log.

One trick I use is making the double quote the awk field separator to pull out just the user agent. The output is then sorted by name so uniq -c can count how many of each is present. Finally, I sort again by number, with the largest at the top, to see which are the most common.

grep <term> webserver.log | awk -F\" '{print $2}' | sort | uniq -c | sort -n -r

This is what I will use when looking for a specific user. If I am looking at a wider range, such as the user agent for hits on a page, then I probably will use the head command to look at the top 20.

A “feature” of this is getting the build (Firefox 3.0.11) rather than just the version (Firefox 3). To get the version, I tend to use something more like this to count the matching entries in the log.

grep <term> webserver.log | awk -F\" '{print $2}' | grep -c '<version>'

I have yet to see many CE/Vista URIs with the names of web browsers. So these are the most common versions one would likely find (what to grep – name – notes):

  1. MSIE # – Microsoft Internet Explorer – I’ve seen 5 through 8 in the last few months.
  2. Firefox # – Mozilla Firefox – I’ve seen 2 through 3.5. There is enough difference between 3 and 3.5 (also 2 and 2.5) I would count them separately.
  3. Safari – Apple/WebKit – In searching for this one, I would add a ‘grep -v Chrome’ to the search to eliminate Google Chrome user agents.
  4. Chrome # – Google Chrome – Only versions 1 and 2.
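The list above could be rolled into one pass over the log. This is only a sketch: count_browsers is a made-up name, and it assumes the user agent is the second double-quoted field, the same assumption as the awk commands earlier. Chrome is tested before Safari because Chrome user agents also contain the word Safari.

```shell
# Sketch only: count_browsers is a hypothetical helper name.
# Reads log lines from the named files (or stdin) and prints "count family",
# largest count first.
count_browsers() {
    awk -F'"' '
        { ua = $2 }                                  # user agent field
        ua ~ /MSIE/    { n["MSIE"]++;    next }
        ua ~ /Firefox/ { n["Firefox"]++; next }
        ua ~ /Chrome/  { n["Chrome"]++;  next }      # must come before Safari
        ua ~ /Safari/  { n["Safari"]++;  next }
                       { n["Other"]++ }              # everything else
        END { for (b in n) print n[b], b }
    ' "$@" | sort -n -r
}
```

With no file argument it reads standard input, so it can sit at the end of the same kind of pipeline: grep <term> webserver.log | count_browsers | head -n 20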

Naturally there are many, many others. It surprised me to see iPhone and Android on the list.

CE/Vista Reports and Tracking displays summaries of activity. If an instructor seeks to know who clicked on a specific file, then Reports and Tracking falls down on the job.

Course Instructor can produce a report of the raw tracking data. However, access to the role falls under the Administration tab so people running the system need to make a user specifically to enroll themselves at the course level to get the reports. (Annoying.)

Instead, the administrators for my campuses pass requests for these reports up to my level of support. To provide them, I have SQL that produces a report. This example is for users who clicked on a specific file. Anything in bold is what the SQL composer will need to alter.

set lines 200 pages 9999
col user format a20
col action format a32
col pagename format a80

clear breaks computes
break on User skip 1
compute count of Action on User

select tp.user_name "User",ta.name "Action",
      to_char(tua.event_time,'MM/DD/RR HH24:MI:SS') "Time",
      NVL(tpg.name,'--') "PageName"
  from trk_person tp, trk_action ta, trk_user_action tua,
      trk_page tpg, learning_context lc
  where tp.id = tua.trk_person_id
    and ta.id = tua.trk_action_id
    and tua.trk_page_id = tpg.id (+)
    and tua.trk_learning_context_id = lc.id
    and lc.id = 1234567890
    and tpg.name like '%filename.doc%'
  order by tp.user_name,tua.event_time
/

Output

  • User aka tp.user_name – This is the student’s account.
  • Action aka ta.name – This is an artifact of the original script. You might drop it as meaningless from this report.
  • Time aka tua.event_time – Day and time the action took place.
  • PageName aka tpg.name – Confirmation of the file name. Keep it if you filter with like on this column.

Considerations

I use the learning context id (lc.id aka learning_context.id) because in my multi-institution environment, the same section name could be used in many places. This id ensures I do not get data from multiple sections.

The tricky part is identifying the file name. HTML files generally will show up as the name in the title tag (hope the instructor never updates it). Office documents generally will show as the file name. Here are a couple of approaches to determining how to use tpg.name (aka trk_page.name).

  1. Look at the file in the user interface.
  2. Run the report without limiting results to any tpg.name. Identify in the results the name you wish to search for and use: tpg.name = 'page name'

Most tracked actions do have a page name. However, some actions do not. This SQL is designed to print a “--” in those cases.

On the BLKBRD-L email list is a discussion about proving students are cheating. Any time the topic comes up, someone says a human in a room is the only way to be sure. Naturally, someone else responds with the latest and greatest technology to detect cheating.

In this case, Acxiom offers identity verification:

By matching a student’s directory information (name, address, phone) to our database, we match the student to our database. The student then must answer questions to verify their identity, which may include name, address and date of birth.


The institution never releases directory information so there are no Family Educational Rights and Privacy Act (FERPA) violations.

However, to complete the course work the student is forced to hand over the information to Acxiom, an unknown and potentially untrusted party. Why should students trust Acxiom when institutions cannot be trusted?

Due to the decentralized nature of IT departments, higher education leads all industries in numbers data breach events. Acxiom’s verification capabilities were designed so that student and instructor privacy is a critical feature of our solution. Institutions never receive the data Acxiom uses in this process. They are simply made aware of the pass/fail rates.

In other words, higher education institutions cannot be trusted to handle this information. No reason was provided as to why Acxiom can be better trusted. Guess the people reading this would never check to see whether Acxiom has also had data breaches.

This Electronic Frontier Foundation response to Acxiom’s claim that their method is more secure was interesting:

True facts about your life are, by definition, pre-compromised. If the bio question is about something already in the consumer file, arguably the best kind of question is about something that is highly unlikely to be in one’s consumer file and even useless commercially–like my pet’s name.

Answering these kinds of questions feels more like a violation than a preservation of privacy.

Good Sign

I missed the story about brothers convicted of harvesting emails the first time. Well, I noticed a followup.

Back around 2001, the CIO received complaints about performance for the web server. So, I went log trolling to see what the web server was doing. A single IP dominated the HTTP requests. This one IP passed various last names into the email directory. Some quick research revealed Apache could block requests from that IP. That calmed things down enough for me to identify the owner of the IP. The CIO then bullied the ISP to provide contact information for the company involved.

Previous little adventures like this landed me a permanent job, so I jumped at similar challenges.

Well, a few years later, it happened again. This time my boss had made me develop a script for the dissemination of the anti-virus software package to home users. Basically, it used email authentication to verify whether someone could get the download link. So, I applied the same technique to the email directory. Well, this upset some people who legitimately needed email addresses. So the human workers would provide email addresses to people with a legitimate need.

I’m glad that, since I left, VSU no longer looks up email addresses for people. (I thought some of the requests questionable.) Also, my little email authentication script was written before LDAP was available to the university. I think the new solution is much better.

One of the more vocal complainers about my having stopped non-VSU access to the email directory was my current employer. We apparently list email addresses for employees freely. Which makes me wonder how much of the spam we get is due to the brothers described at the beginning of this story? Or other email harvesters? Just hitting the send button potentially exposes the email address.

No worries. I’m sure Glenn is protecting me. :)

The tumult in Iran is huge news of late. As a Baha’i, I have seen news of the persecution of Baha’is in Iran step up because of the Internet. Stories crossed the ocean through email. News agencies almost never picked up these stories. As fast as the Iranian government could shut down CNN, NYT, and BBC reporters, the same government cannot seem to stop dozens of people who have no press credentials or passports to revoke from sharing the message. So the idea of several thousand people sharing a similar message and evading the same government doesn’t seem all that surprising to me.

[The Iran unrest] is the first revolution that has been catapulted onto a global stage and transformed by social media. This is it. The big one.

Calling this unrest a revolution seems premature. Still, all this information making it overseas is interesting to watch.

(This started out as a blog comment for Sania’s post Facebook Killed Your Blog. I’m posting it here first.)

We share blogs with the whole world. So our blogs get lost in the noise, bolstering the need for a whole industry optimizing getting found in search engines. It’s a concerted effort just to get noticed. That’s because blog readers have to seek out blogs, subscribe to the feed, and follow. Finding the best blogs to read is sometimes difficult and comes more from word of mouth than anything search engines provide.

Blogs also tend to have a lot of information to digest. Social networks have just a line or two with maybe a link to more information. Blog readers typically are designed around the idea of collecting all the posts and letting the user pick which to read. Social networks typically are designed around the idea of just showing recent posts and letting the users choose how far back in time to read.

As technologies lower the costs to express ideas (aka get easier), blogs will get left behind as they have become upside down in value. The costs of writing, reading, subscribing to, and commenting on blogs are higher compared to micro-blogging or status updates.

Why blog when hanging out on social networks is so much easier? Blogs can only survive as long as they have worthwhile information.

Why blog when readers are no longer reading? Posting blog entries on social networks does help sustain traffic levels somewhat by getting exposure.

As bloggers providing valuable expression leave blogging, the value of blogs decreases. People will still blog. It just won’t be the popular thing to do.

The claim that Blackboard’s Learn 9 provides a Web 2.0 experience has bothered me for a while now. First, it was the drag-n-drop. While cool, that isn’t Web 2.0 in my opinion. A little more on track is the claim:

The all-new Web 2.0 experience in Release 9 makes it easy to meaningfully combine information from different sources.
The Challenges Are Real, But So Are the Solutions

Integrating with a social network like Facebook is a start, but again, in my opinion, it still isn’t Web 2.0.

So, what is Web 2.0? I did some digging. I think the Tim O’Reilly approach meets my expectation best. He quotes Eric Schmidt’s “Don’t fight the Internet.” as well as providing his own more in-depth definition.

Web 2.0 is the business revolution in the computer industry caused by the move to the Internet as platform, and an attempt to understand the rules for success on that new platform. Chief among those rules is this: Build applications that harness network effects to get better the more people use them. (This is what I’ve elsewhere called “harnessing collective intelligence.”)
Web 2.0 Compact Definition: Trying Again

Users expect a site on the Internet to meet their needs or they eventually move on to a site which does. There are so many web sites out there providing equivalent features to those commonly found in an LMS. There is the danger of irrelevance. This is why every LMS company or group strives to continually add new features (aka innovating). The bar continually gets raised, so LMS software continually needs to meet this higher standard.

Tim additionally provides some other rules which you can see at the above link.

When an LMS reaches the point where the resources of the Internet help people learn, then it will be Web 2.0. As long as an expert or leader imparts knowledge on students, the LMS is still something different than Web 2.0. Sorry…. The irony? This is exactly what Michael Wesch and PLE advocates preach.

Found an interesting comment on an article about the state of Georgia observing Confederate Memorial Day….

The truth of history means very little to those who are dead set against learning anything from it. No matter what the history books used in our public school system say, most will never believe anything other than their own opinion about the Civil War. History revisionist are the celebs of the day. As long as people like Rev. Wright, and David Duke exist, history’s truth will be filtered through lies and distortions.
Few observe Confederate Memorial Day: UGA to display original constitution; state offices closed

Truth may very well be completely relative. Back during the US Presidential election, I ran across an interesting article in the Washington Post discussing research John Bullock did about the effects of misinformation and its ties to ideological bias. I used to think it had to do with a handful of people stuck in their green, Second Amendment, pro-life, pro-choice, capitalist, regulation views. My favorite pastime in college was assuming positions contrary to others even when I agreed with the others.

I doubt the effect solely affects conservatives as was proposed in the article. More likely everyone has some blind spots in determining truth from myth or fiction, kind of like optical illusions. (Yes, even myself.) We have to choose which information to believe any time we interact with information. Many of the rules in philosophy and science are built around combatting the biases we have.

Rather than force ideas on others, I think we should be teaching children from an early age to recognize when others and most especially themselves are operating under a bias. It’s the only way to find detachment.
