DOJ, Dreamhost, and DisruptJ20

The government has no interest in records relating to the 1.3 million IP addresses that are mentioned in DreamHost’s numerous press releases and opposition brief.

Basically, the Department of Justice served DreamHost with this warrant, asking for:

  1. the code backing the web site,
  2. the HTTP request and error logs,
  3. logs about backend connections used to upload files to the server,
  4. databases,
  5. email account metadata and contents, and
  6. account information for the site owner.

DreamHost resisted the warrant as overly broad, and the DOJ is now backing off the HTTP logs and unpublished draft posts.

If the site is using certain WordPress plugins to track visitors, then it is possible that visitor IPs are in the database. Or, if the DOJ looked at the public HTML and noticed a Google Analytics JavaScript snippet, then they know they can issue a warrant to Google to get the visitor information. Would Google resist handing it over as hard as DreamHost has?
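
Spotting that kind of analytics snippet from the outside is trivial. A quick sketch, assuming GNU grep and substituting the site's own address for example.org:

curl -s https://example.org/ | grep -io 'google-analytics\.com[^"]*'

If that turns up the ga.js or analytics.js loader, the visitor data lives with Google rather than with the hosting provider.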

 

Universities and SIS “Innovation”

Several years ago, while I worked at a medium-sized university, there was an incident very similar to what happened in Student Is Sanctioned for Creating Class-Registration Web Site. A student wanted into a full class, so he built an application to routinely check whether a seat had opened up in the Student Information System. The database administrator for the SIS noticed an unusual amount of traffic from this user while looking into why the system was working so hard. My impression was that the traffic was nowhere near denial-of-service levels, but it still needed to be addressed to improve the experience for everyone else.
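
The idea itself is harmless; the trouble is how often the check runs. A minimal sketch of a gentler seat-checker, where the URL, the CRN, and the page text are all made-up placeholders:

# poll every 15 minutes instead of hammering the SIS
while true; do
  if curl -s 'https://sis.example.edu/courses?crn=12345' | grep -q 'Seats Available: [1-9]'; then
    echo "A seat may have opened up in CRN 12345"
  fi
  sleep 900
done

The sleep is the whole point. The difference between a handy itch-scratcher and the DBA wondering why the system is working so hard is mostly the polling interval.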

The CIO had a chat with the student. At the time it amused me, because years before I had been in the same chair as that kid. Afterward, the CIO joked about wondering whether to hire the student, the same as he had once joked about me and a friend.

The student went on to develop a student organization site and other good things. He found the right outlet for scratching his itches in code. Personally, I think this is good for the students and good for the university. However, a close eye needs to be kept on these students to ensure they build secure, stable, and long-term viable products, and when a student graduates, there needs to be a plan for someone to take over the upkeep.

Since then, I have run across even professionals making these students' mistake of slamming a system with traffic. One administrator wanted to check whether we were up, so he wrote a JavaScript web page that would repeatedly hit the development site we provided. That site had only two machines, so when five people had the page open at the same time, four of them somehow landed on the same machine, which croaked under that kind of load. Weblogic, in my experience, does not handle receiving the same transaction for the same session before the first has completed. Each subsequent transaction takes even longer than the one before, until requests that should take a fraction of a second are taking minutes to complete.

In general, developers will contact us about developing something to work against our system, and we try to be helpful and advise them on directions likely to succeed. There are still mavericks who will write something that causes a problem, leaving us to track down who is slamming our systems. I consider it part of the job of running a system people want to use. Someone will always try to accomplish things outside the normal enter a URL, type in a username, type in a password, click, click, click…. Heck, we write scripts to get around this ourselves.

These events are all opportunities to meet and educate developers.

Want to Work With Me?

There are a bunch of new positions which were just posted. We need analysts, database administrators, and an operating system / hardware specialist.

The list:

We have a great team. So you should come work with us.

Contact me if you are interested or want to know more. (Staff directory and search for ezra)

To Blog or to Share?

This blog has suffered from my sharing on social media. Where I used to post every day, even just one-liners pointing to a web site or a story, that activity now all happens on Facebook, Twitter, and Google+. HackEducation does a weekly post of news. I am thinking about doing something similar for the things I would normally just share.

First, other sites tend to die. Ping.fm screwed me because I did not understand their technology: by using it to cross-post, every link and every image went through their shortened URLs, so when they lost the database, every link and image broke. I think ifttt.com works better, so I have it making backups of this blog at ezrasf.wordpress.com and sneezypb.posterous.com. (Well, except the tags do not carry over.)

Second, I can control the formatting and quotes better on this blog than on social media. Sometimes I wish I had quoted more of an article before it disappeared behind a paywall, moved, or was removed.

Finally, it would be good for me to spend more time thinking about things before I post. About a tenth of the things I intend to post on this blog I give up on posting and instead share on social media. I feel like more thought and intention goes into a blog post.

P.S. Originally this post started before Christmas. I had it scheduled for today. Setting the goal for the year ought to help.

Why Ten

The question of why we run ten clusters came up recently. The answer I gave off the top of my head was okay. Here is my more thoughtful response.

Whenever I have been in a conversation with a BEA (more recently Oracle) person about Weblogic, the number of nodes we run has invariably surprised them. Major banks serve ten times the number of simultaneous users we have on a half dozen managed nodes or fewer. We have 130 managed nodes for production. Overkill?

There are some advantages they have over us.

  1. Better control over the application. WebCT hacked together an install process very much counter to the way BEA would have done it. BEA would have had one install the database and the web servers, then deploy the application using either the console or the command line. WebCT created an installer which does all this in the background, out of sight and mind of the administrator. They also created start and stop scripts which interact with Weblogic on the command line to start the application. Great for automation and for making things simple for administrators. It also lobotomizes the console, making many advanced things one could normally do risky. So now the console is only useful for some minor configuration management and monitoring.
  2. Better control over the code. When there is a performance issue, they can find the cause and improve the efficiency of the code. The best I can do is point out the inefficiencies to a company that chose a completely different codebase as its priority. If you do not have control over the code, then you give the code more resources.
  3. As good as Weblogic is at juggling multiple managed nodes, more nodes does not always equal better. Every node has to keep track of the others: the heartbeats communicate through multicast, with every node sending out its own and listening for the same from all the others. Around twenty nodes, they would miss occasional beats on their own. Throw in a heavy workload, and an overwhelmed node can miss enough beats that the others mark it as unavailable. That is usually the point when the monitors started paging me about strange values in the diagnostics. Reducing the number of nodes helped. (A multicast sanity check is sketched just after this list.)
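
WebLogic bundles a small multicast tester in weblogic.jar that can confirm whether nodes actually hear one another. A sketch, where the node name, multicast address, and port are placeholders for the cluster's real settings:

# run simultaneously on two or more nodes; each should report the others' messages
java -cp $WL_HOME/server/lib/weblogic.jar utils.MulticastTest -n node1 -a 239.192.0.10 -p 7001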

More resources means more nodes. We had two clusters with about 22 nodes each (44 total) when we hit a major performance wall. They were split into four clusters with 15 nodes each (60 total). Eventually these grew to over 22 nodes each again. At that point, upgrading was out of the question. A complete overhaul with all new databases and web servers meant we could do whatever we wished.

The ideal plan was a cluster per client, but the expense of the licenses scrapped that plan.

Ten clusters with 13 managed nodes each was a reasonable compromise: more nodes overall while also using smaller clusters met both needs well. Empty databases also gave us a better point from which to restart. Still, the databases have grown to the point that certain transactions run slowly just four terms later. (I was hoping for six.) Surviving the next two years will be a challenge, to say the least. I wish we got bonuses for averting disasters.

Report Just Usernames

Occasionally I will want to see just the usernames that used something like a particular user-agent or were doing something during a range of time. Rather than report all the log lines and pick the usernames out of the data, I use this against the webserver.log, which Blackboard (or maybe BEA) added.

Note: we've added user-agents to the webserver.log. The double quote I use as my delimiter in the first awk comes from us adding the user-agent to the webserver logs. If you have not set up your logs to do this, then you'll either need to do so or figure out which position is appropriate for you with a space delimiter. The colon in the second awk is there because, just after the username, the log records the reads and writes to the database.

| awk -F\" '{print $3}' | awk -F\: '{print $1}' | sort | uniq

An example usage: a case was escalated to me where a student had trouble taking an assessment. That student was, of course, using Internet Explorer 7, a web browser which was supported prior to CE/Vista 8.0.4. Now it is not. (It could well be that issues like this are the reason Blackboard stopped supporting it.) So I was curious how many users are still trying to use this browser.
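
Here is a sketch of that check, assuming IE7 announces itself as "MSIE 7.0" in the user-agent field and the log layout matches the note above:

grep 'MSIE 7.0' webserver.log | awk -F\" '{print $3}' | awk -F\: '{print $1}' | sort | uniq | wc -l

Dropping the wc -l leaves the actual usernames to follow up with.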

I Write Like Me

Check which famous writer you write like with this statistical analysis tool, which analyzes your word choice and writing style and compares them with those of the famous writers.

Not trusting a single sample, I tested fifteen writing samples including stories and blog posts (excluding those with block quotes). The Cory Doctorow result was the most common at six.

I write like
Cory Doctorow

I Write Like by Mémoires, Mac journal software. Analyze your writing!

I also received David Foster Wallace (3), Arthur Conan Doyle (3), J.K. Rowling (2), and Isaac Asimov (1).

There was a clear pattern to the results.

  1. Cory Doctorow: Topic was work. The analyzer probably keyed on the dispassionately objective word choice.
  2. David Foster Wallace: Topic was my personal life. The analyzer probably keyed on my portraying the absurdities.
  3. Arthur Conan Doyle: Topic was an adventure story that originated in high school. I probably thought too much like Sherlock Holmes then.
  4. J.K. Rowling: Topic was another adventure story, composed in early college. I probably thought too much like Harry Potter then.
  5. Isaac Asimov: Topic was science. It is hard not to use scientific jargon when writing about science.

That there would be a difference between my high school and college story writing was interesting. The difference depending on whether I was writing about work, personal life, or science was also interesting. I would have liked to see almost every sample of my writing reflect a single author. Otherwise, it seems the results skewed towards word choice, not style.

From the developer, Dmitry Chestnykh, on how this works:

Actually, the algorithm is not a rocket science, and you can find it on every computer today. It’s a Bayesian classifier, which is widely used to fight spam on the Internet. Take for example the “Mark as spam” button in Gmail or Outlook. When you receive a message that you think is spam, you click this button, and the internal database gets trained to recognize future messages similar to this one as spam. This is basically how “I Write Like” works on my side: I feed it with “Frankenstein” and tell it, “This is Mary Shelley. Recognize works similar to this as Mary Shelley.” Of course, the algorithm is slightly different from the one used to detect spam, because it takes into account more stylistic features of the text, such as the number of words in sentences, the number of commas, semicolons, and whether the sentence is a direct speech or a quotation.

Bayesian filters I have seen give an item a score for how likely the item is to be something. I would like to see the strength of the scores, including distributions, and a comparison of a given result to other close results. Guess I am just someone who wants to know why.

Protected Post Password

I imported all my LiveJournal posts here. Other than posting pictures there from Flickr, I don't really use LJ anymore. I rarely even read my friends' blogs there. Too bad. I still have the t-shirt.

Most of my LJ posts are protected. For this site, I’d rather have them set to private. So the section of WordPress (Tools > Import > LiveJournal) saying this seemed relevant:

If you have any entries on LiveJournal which are marked as private, they will be password-protected when they are imported so that only people who know the password can see them.

If you don’t enter a password, ALL ENTRIES from your LiveJournal will be imported as public posts in WordPress.

Password protected seemed better than not, so I set a 30-character password, and the form accepted all 30 characters. When the password didn't work, I logged in as the administrator user and looked at Publish > Visibility > Password protected, where the stored password turned out to be shorter than the one the form had accepted.

In my opinion, web forms in general should prevent the user from entering more characters than the application or database will take (the HTML maxlength attribute exists for exactly this). Passwords are very exact, so forms for creating them definitely should not allow extraneous characters.

Turnitin.com

I'm surprised I have not blogged here about the student lawsuit against Turnitin.com. An anti-plagiarism service, Turnitin has students or faculty members upload papers into its database. By comparing new papers to that database, it rates how likely it is that a student plagiarized.

Now the search goes out for any student who has a paper that’s being held by TurnItIn that they did not upload themselves. Students Settle with TurnItIn

In theory, I could be someone in this situation. Back in 2005, a coworker asked my mother if someone by my name was related to her. This coworker was taking some classes at the university I attended, and Turnitin had thrown up a cautionary flag on her Originality Report because her paper was somewhat similar to something with my name on it. The problem is that this product came into use at the university after my time as a student, so I never submitted anything to it. The department from which I got my degree kept copies of my papers (many submitted by email) and was using this product at the time.

Another possibility is this tidbit about the product: Over 11 Billion Web Pages Crawled & Archived. I was actively blogging before and at the time of the incident. Assuming it could identify my name out of all that content, this match could have come from my blogging.

When I contacted Turnitin about this back in 2005, they told me I would have to remove my paper. I re-explained that I did not submit the paper. So Turnitin explained that whoever did put the paper in the system would have to remove it. The guy acknowledged the difficulty of identifying who had posted it.

Stalking Students

On the BLKBRD-L email list is a discussion about proving students are cheating. Any time the topic comes up, someone says a human in a room is the only way to be sure. Naturally, someone else responds with the latest and greatest technology to detect cheating.

In this case, Acxiom offers identity verification:

By matching a student’s directory information (name, address, phone) to our database, we match the student to our database. The student then must answer questions to verify their identity, which may include name, address and date of birth.


The institution never releases directory information so there are no Family Educational Rights and Privacy Act (FERPA) violations.

However, to complete the course work the student is forced to hand over the information to Acxiom, an unknown and potentially untrusted party. Why should students trust Acxiom when institutions cannot be trusted?

Due to the decentralized nature of IT departments, higher education leads all industries in the number of data breach events. Acxiom's verification capabilities were designed so that student and instructor privacy is a critical feature of our solution. Institutions never receive the data Acxiom uses in this process. They are simply made aware of the pass/fail rates.

In other words, higher education institutions cannot be trusted to handle this information. No reason was provided as to why Acxiom can be better trusted. Guess the people reading this would never check to see whether Acxiom has also had data breaches.

This Electronic Frontier Foundation response to Acxiom's claim that their method is more secure was interesting:

True facts about your life are, by definition, pre-compromised. If the bio question is about something already in the consumer file, arguably the best kind of question is about something that is highly unlikely to be in one’s consumer file and even useless commercially–like my pet’s name.

Answering these kinds of questions feels like more of a violation of privacy than a preservation of it.