Search Within Files

I keep some logs in a directory just in case I need to reference them later. The kind of data that has saved my bacon a handful of times. Of course, it has over 3,000 files (16GB) in the directory. Of those, fewer than a hundred were potentially relevant. And in the end only a couple dozen had the data I sought.

Windows Explorer used to make this easy to search. I could put in the pattern I wanted and tell it to search the text within them. It would give me the file name for each containing the search string. For whatever reason Windows 7 had to make it more difficult.

So, I wrote the easiest of PowerShell scripts:

Get-Content $filelist | Select-String -Pattern "Search String"

Good thing too, because apparently I need to go through my Indexing Options and identify every file extension I want to search to index file contents. What a royal pain. My guess is doing so would also blow up the EDB data file from its current 2GB to something way larger. 10GB? 50? 100? Yuck.
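
For comparison, here is a minimal sketch of the same "which files contain this string" search from a Unix-style shell. The post itself uses PowerShell on Windows, so the shell environment is an assumption, and the function name files_containing is made up for illustration.

```shell
# files_containing STRING DIR
# Prints the name of each regular file directly inside DIR whose
# contents include STRING. (Hypothetical helper, not from the post.)
files_containing() {
    needle=$1
    dir=$2
    # -l prints only the names of matching files.
    # -F treats the search string literally instead of as a regex.
    # -maxdepth 1 keeps the search to the one directory (GNU/BSD find).
    find "$dir" -maxdepth 1 -type f -exec grep -l -F -- "$needle" {} +
}
```

Something like `files_containing "Search String" /path/to/logs` would then list only the files worth opening.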


Almost Foiled by Facebook

A coworker asked me how long a certain song was playing in the lobby. I responded that I posted about it on Facebook and can find out from that.

Only I could not find it. I went to my page and hit End until I was too far back in time. Then I used the browser search to look for keywords. Nothing. More keywords. Other songs about which I recalled getting lots of comments. Nothing.

Then I remembered Facebook defaults to showing Highlights. I had to change it to All Stories and do it again. That worked.

It is like they do not want us to be able to find anything.

Graph Search is okay. What would really be nice is being able to find that specific status update I want to reference. Maybe I need to go find a more archivist-centric social network?

Goodreads Context Menu Search

At some point when I am not too lazy, I should empower my future laziness. I really need an easier way to look up books on Goodreads when I see them mentioned on other pages.

For example, say I go to BBC’s Big Read and want to add some to my wish list. At present I would highlight the title, Ctrl+C to copy, Alt+Tab to a window/tab with Goodreads, click in the search box, Ctrl+V to paste, and hit Enter. Probably would take no more than a couple seconds.

What I want is to be able to highlight the title, right-click to bring up the context menu, and click on search for the title on Goodreads. Probably would take less than a second.

Interestingly enough, with Google Chrome I can make the default search Goodreads. That achieves the above, with the drawback that all searches go through it. So not a great solution, as I do love being able to search from the address bar. But it’ll work for now as long as I remember to put it back.

Writing an extension appears to be the solution. This is where my laziness is the barrier.

Why I Love The Internet

Everything is out there. From the most profound to the most mundane, whatever I need to know when I need to know it.

Last week I set my DVR to record a series. I knew it was in re-runs and British. The DVR sucks in the sense it gives an original air date but not an episode number. The first episode I got was not called “Pilot”. At this point I had no idea whether I had the first, the sixth, or the eleventh.

So I toss the show title with episode list into a Google search. It pulls up several sites with episode titles and their dates. I could have just gone to Turns out I had the third. (Plus there are places offering to let me watch the series online.)

Probably I search too much instead of going to specific sites I know first.

There is something rewarding between hitting the button and seeing results. It feels so good.

Me Social Media

Dan Schultz doesn’t like Facebook or Twitter because they are too focussed on individual expression rather than the community.

That may be because he is using them wrong. I liked photography as a kid, but I didn’t know any photographers. Flickr happened to come into my life just after I bought my first digital camera. My participation in photography exploded. Not because I had a way to post my photos but because I had a way to find other local photographers for mutual encouragement. Even better was forming local groups to encourage people to meet. The value of Flickr is developing the community.

Worldwide Photowalk Panorama

Similarly, I got into Twitter because my community, peers at other universities running the same software as myself, were seeking help there. We go wherever the people with answers to the problems we face are watching. Twitter was the place to get the attention of the right people, not a forum like phpBB. (There are already lots of email lists.) My other community, people using the software I run, are also on Twitter. I’ve resolved issues for many clients by finding their public complaints and offering solutions. I stopped liking Twitter when my focus shifted away from using it for the community.

Personally, I have yet to find much sense of community in phpBB, Google Wave, or Ning. So I find it strange these are the exemplars of community applications. They seem fractured, so one finds dozens of groups covering the same interest. Sometimes this is because some moderator upset a portion of the community with draconian behavior, causing people to form an alternative community. Bad blood exists for a while. Other times people set up a new community unaware others exist.

Useful User Agents

Rather than depend on end users to accurately report the browser used, I look for the user-agent in the web server logs. (Yes, I know it can be spoofed. Power users would be trying different things to resolve their own issues not coming to us.)

Followers of this blog may recall I changed the Weblogic config.xml to record user agents to the webserver.log.

One trick I use is splitting on the double quotes in awk to identify just the user agent. This information is then sorted by name to count (uniq -c) how many of each is present. Finally, I sort again by number with the largest at the top to see which are the most common.

grep <term> webserver.log | awk -F\" '{print $2}' | sort | uniq -c | sort -n -r

This is what I will use when looking for a specific user. If I am looking at a wider range, such as the user agents for hits on a page, then I probably will use the head command to look at the top 20.
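
The pipeline and the top-20 variant can be wrapped into one small function. This is only a sketch: the function name top_user_agents and its argument order are made up for illustration, and it assumes, as the command above does, that the user agent is the first double-quoted field on each log line.

```shell
# top_user_agents TERM LOGFILE [COUNT]
# Lists the COUNT most common user agents (default 20) on the log
# lines that match TERM. (Hypothetical helper, not from the post.)
top_user_agents() {
    term=$1
    logfile=$2
    count=${3:-20}
    # Splitting on the double quote puts the first quoted field in $2.
    # uniq -c needs sorted input; the second sort ranks by that count.
    grep -- "$term" "$logfile" |
        awk -F'"' '{print $2}' |
        sort | uniq -c | sort -n -r |
        head -n "$count"
}
```

For a single user, something like `top_user_agents jdoe webserver.log` would do, where jdoe stands in for whatever term identifies the user in the log.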

A “feature” of this is getting the build (Firefox 3.0.11) rather than just the version (Firefox 3). To get just the version, I tend to use something more like this, which counts the log lines matching that version.

grep <term> webserver.log | awk -F\" '{print $2}' | grep -c '<version>'

I have yet to see many CE/Vista URIs with the names of web browsers. So these are the most common versions one would likely find (what to grep – name – notes):

  1. MSIE # – Microsoft Internet Explorer – I’ve seen 5 through 8 in the last few months.
  2. Firefox # – Mozilla Firefox – I’ve seen 2 through 3.5. There is enough difference between 3 and 3.5 (also 2 and 2.5) I would count them separately.
  3. Safari – Apple/WebKit – In searching for this one, I would add a 'grep -v Chrome' to the search to eliminate Google Chrome user agents.
  4. Chrome # – Google Chrome – Only versions 1 and 2.

Naturally there are many, many others. It surprised me to see iPhone and Android on the list.
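
The version count, including the Chrome exclusion needed for Safari, can be sketched as one function. The name count_browser and its optional third argument are assumptions for illustration, and it counts across the whole log rather than filtering on a term first.

```shell
# count_browser PATTERN LOGFILE [EXCLUDE]
# Counts log lines whose user agent contains PATTERN, skipping any
# user agent that also contains EXCLUDE. (Hypothetical helper.)
count_browser() {
    pattern=$1
    logfile=$2
    exclude=${3:-}
    # Assumes the user agent is the first double-quoted field.
    awk -F'"' '{print $2}' "$logfile" |
        if [ -n "$exclude" ]; then grep -v -- "$exclude"; else cat; fi |
        grep -c -- "$pattern"
}
```

So `count_browser 'MSIE 7' webserver.log` counts one Internet Explorer version, while `count_browser Safari webserver.log Chrome` keeps Chrome user agents, which also mention Safari, out of the Safari count.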

Name Collisions

Blackboard has a conference they call BbWorld. I noticed there are some odd tweets with the same #bbworld hashtag lately. These appear to be about a BlackBerry conference to be held next month.

Collisions on names are common enough. For example, here are a couple names our clients use to brand their sites which other places also use.

My own project, GeorgiaVIEW, is not immune. Some time ago I noticed the GeorgiaView Consortium (geological remote sensing) at the University of West Georgia.

I guess it is a good thing one BbWorld is in July and the other is in September.

For now I’ll just drop my RSS feed for the hashtag.

I’m surprised I have not blogged here about the student lawsuit against Turnitin. An anti-plagiarism service, Turnitin has students or faculty members upload papers into the database. By comparing new papers to the database, it gives ratings as to whether it is likely a student plagiarized.

Now the search goes out for any student who has a paper that’s being held by TurnItIn that they did not upload themselves. Students Settle with TurnItIn

In theory I could be someone in this situation. Back in 2005, a coworker asked my mother if someone by my name was related to her. This coworker was taking some classes at the university I attended. Turnitin had thrown up a cautionary flag on the Originality Report because it was somewhat similar to something with my name on it. The problem is this product came into use at the university after the time I was a student. So I never submitted anything to it. The department from which I got my degree kept a copy of my papers (many submitted by email) and used this product at the time.

Another possibility is this tidbit about the product: Over 11 Billion Web Pages Crawled & Archived. I was actively blogging before and at the time of the incident. Assuming it could identify my name out of all that content, this match could have come from my blogging.

When I contacted Turnitin about this back in 2005, they told me I would have to remove my paper. I re-explained that I didn’t submit the paper. So Turnitin explained that whoever did put the paper in the system would have to remove it. The guy acknowledged the difficulty of the situation in identifying who posted it.

Racial Profiling

Walking home from the bus in high school, I saw police cars and officers in front of my house. Their presence made me extremely apprehensive. The only little assurance was my father talking to the officers. Someone broke into the house and stole some of our stuff.

We felt violated. Our own home was unsafe.

At the time, however, the people with guns, who tended to keep their hands near them, were much more threatening than some anonymous teen who wanted some quick cash.

Police officers are the good guys.

Take this scenario:

  1. You’ve spent almost a full day on a plane or in airports flying from Shanghai to Boston so you are extremely jet-lagged.
  2. (SUGGESTED ADDITION) You picked up the flu while in China (remember Avian Bird Flu?).
  3. Your front door won’t open when you get home, so you end up gaining access to the house from the back door. Eventually with help you do get it opened.
  4. While calling someone to come fix the door, a police officer shows up to question you about being the owner of the house. (Let’s ignore that Harvard owns it. You just reside there.)

This is like Alexander and the Terrible, Horrible, No Good, Very Bad Day: “Nothing at all was right.” Except… This state of mind was interpreted by the police officer this way:

“From the time he opened the door it seemed that he was very upset, very put off that I was there in the first place,” Sergeant Crowley told the station, WEEI. “Not just what he said, but the tone in which he said it, just seemed very peculiar — even more so now that I know how educated he is.” NYT

This seems like the perfect opportunity to ask questions about Dr. Gates’ day to establish something of a rapport to ascertain why he might be so upset. It’s not so peculiar when the context is known. I bet if all this had been placed in context at the time, then this would not be front page news.

Email Harvesters

I missed the story about brothers convicted of harvesting emails the first time. Well, I noticed a followup.

Back around 2001, the CIO received complaints about performance for the web server. So, I went log trolling to see what the web server was doing. A single IP dominated the HTTP requests. This one IP passed various last names into the email directory. Some quick research revealed Apache could block requests from that IP. That calmed things down enough for me to identify the owner of the IP. The CIO then bullied the ISP to provide contact information for the company involved.
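
Spotting a single dominant IP like this can be sketched with a short count over the access log. The function name requests_per_ip is hypothetical, and it assumes the client IP is the first whitespace-separated field on each line, as in Apache's common log format. The actual log layout from 2001 is not given here.

```shell
# requests_per_ip LOGFILE [COUNT]
# Lists the COUNT busiest client IPs (default 10) with their
# request counts. (Hypothetical helper, not from the post.)
requests_per_ip() {
    logfile=$1
    count=${2:-10}
    # Field 1 is assumed to be the client IP on each request line.
    awk '{print $1}' "$logfile" | sort | uniq -c | sort -n -r | head -n "$count"
}
```

An IP hammering the email directory would show up at the top of that list with a count far above the rest.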

Previous little adventures like this landed me a permanent job, so I jumped at similar challenges.

Well, a few years later, it happened again. This time my boss had made me develop a script for the dissemination of the anti-virus software package to home users. Basically, it used email authentication to verify whether someone could get the download link. So, I applied the same technique to the email directory. Well, this upset some people who legitimately needed email addresses. So the human workers would provide email addresses to people with a legitimate need.

I’m glad that, since I left, VSU no longer looks up email addresses for people. (I thought some of the requests questionable.) Also, my little email authentication script was before LDAP was available to the university. I think the new solution is much better.

One of the more vocal complainers about my having stopped non-VSU access to the email directory was my current employer. We apparently list email addresses for employees freely. Which makes me wonder how much of the spam we get is due to the brothers described at the beginning of this story? Or other email harvesters? Just hitting the send button potentially exposes the email address.

No worries. I’m sure Glenn is protecting me. 🙂