Convert Little-endian UTF-16 to ASCII


I generated some text files working with Get-Acl in Powershell, but I did not know how to get Powershell to do some advanced features. (Basically, I wanted Select-String to include the next 2 lines and see whether a specific group was in that list. And maybe some exclusions.) So, I copied the files over to my Linux home to check there.

The most basic grep? Nothing.

I used ls -l and confirmed they have data. I used less to confirm I can see it.

I copied a string and did a grep for it. Nothing.

I did a dos2unix. That didn’t fix it. Finally, I did:

file filename.txt

That revealed the files had types of:

  1. Original: Little-endian UTF-16 Unicode text, with CRLF line terminators
  2. dos2unix converted: Little-endian UTF-16 Unicode text

Basically, this told me that dos2unix fixed one problem but not both. The “with CRLF line terminators” part reflects that Windows and Unix differ on how to end a line of text: Windows uses a carriage return plus a line feed, and Unix uses a line feed alone.

Little-endian is a geeky homage to Gulliver’s Travels. It has to do with the order in which the bytes of each character are stored. But it isn’t really the big problem here. UTF-16 is the problem because, apparently, I need it to be UTF-8 for grep to read it. So, the fix is to use an encoding converter:

iconv -f utf-16 -t utf-8 filename.txt > filename_new.txt
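The encoding conversion and the line-ending cleanup can also be chained so the result is ready for grep in one pass. This is a minimal sketch, not the original commands: utf16_to_unix is a made-up helper name, and it assumes the file starts with a byte order mark so that iconv can work out the endianness on its own.

```shell
# utf16_to_unix is a hypothetical helper, not part of any tool.
# It converts UTF-16 text to UTF-8 and strips carriage returns,
# covering both problems that `file` reported.
utf16_to_unix() {
    iconv -f UTF-16 -t UTF-8 "$1" | tr -d '\r'
}

# Build a tiny UTF-16LE sample: byte order mark, then "hi" and CRLF,
# two bytes per character.
sample=$(mktemp)
printf '\377\376h\000i\000\r\000\n\000' > "$sample"

# After conversion, a plain grep finds the text.
utf16_to_unix "$sample" | grep 'hi'
rm -f "$sample"
```

Piping through tr does the same job as dos2unix here, so the converted output can go straight into grep or a new file.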

Separating Bash Histories

Years ago, maybe 2011 based on the age of files, our Systems group moved our Linux home directories to a central system. My only real complaint about this move was finding anything I needed in my Bash history. See, I am terrible at remembering things and often make typos. It is easier to go back in my history to a prior command and either run it or modify it and run that. The same home directory across all these systems complicated things by co-mingling commands. I was able to find things, just eventually. That seemed inefficient.

Eventually, this situation annoyed me to the point I decided to fix it. And the fix was so simple it is amazing that I did not immediately address it rather than suffering with it for a couple years. (Well, actually, we picked Desire2Learn before the change so 90% of my server responsibilities were on Windows. Only when I was promoted to a Technology Strategist and returned to majority work in Linux did it get annoying enough to address.)

The fix? Add hostname to the HISTFILE variable in .bash_profile.

export HISTFILE="${HOME}/.bash_history.$(hostname)"

Apparently I made the change back on December 18th. In the six months since, I have not noticed any oddities with the history. This morning I noticed that I have about twenty different host-named history files of various sizes and dates.

Given the number of files, while writing this post, I decided to re-organize these into a directory. (An organized home directory is a happy home directory. Heh.)

export HISTFILE="${HOME}/.bash_history/.bash_history.$(hostname)"

Then I ran these.

mv .bash_history .bash_history.old   # destination name is a guess; the original command omitted it
mkdir .bash_history
mv .bash_history.* .bash_history

Then I exited which dumped that session’s history into a file in the old location. I logged in again and used cat and the output redirect to append those new lines to the correct file in the new location.

Exited again and logged in again. And everything still looks good.
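The same setup can be made self-healing, so that a host seen for the first time does not depend on the directory already existing. This is only a sketch under assumptions: set_host_histfile is a hypothetical function name, and the directory layout simply follows the one described above.

```shell
# set_host_histfile is a hypothetical helper for .bash_profile.
# It takes the home directory as an argument (defaulting to $HOME)
# and points HISTFILE at a per-host file inside .bash_history/.
set_host_histfile() {
    histdir="${1:-$HOME}/.bash_history"
    # mkdir -p fails if a plain file named .bash_history is still in
    # the way; in that case HISTFILE is left untouched.
    mkdir -p "$histdir" 2>/dev/null || return 1
    export HISTFILE="${histdir}/.bash_history.$(hostname)"
}
```

In .bash_profile this would be called as set_host_histfile "$HOME". If the call fails, Bash keeps using whatever history file it was already using.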

Listing Lists

A mini project is to hand over the course packages for the prior product to each of our clients. A good idea was to include a list of the files so that, down the road, if something is missing, we can point to the list in the ticket as the record of what they received.

So I wrote this shell script to make the lists for me. (Well, really the analyst doing the hard work wanted to know if he should make the list. I told him I could do it really easily in Linux.) This is because I am talking about 385,528 courses and 37 targets. The first step generates a list of the clients (schools) involved. Next, the path to where the files are stored has two subdirectories, so I pull them out of the path. The list is generated with a find command, stripping out the “./” at the beginning and writing the results to a file. Finally, I check the size and number of lines in the file.

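# Every school has a "bak" directory somewhere under the base directory.
SCHOOLLIST=$(find "/${BASEDIR}" -name bak)
# The working directory is /BASEDIR/CLUSTER/SCHOOL, so fields 3 and 4 are the two subdirectories.
SCHOOL=$(pwd | awk -F/ '{print $4}')
CLUSTER=$(pwd | awk -F/ '{print $3}')
# List the course packages, dropping the leading "./" from each path.
find . -name "*.bak" | sed -e 's|^\./||' > "/${BASEDIR}/${CLUSTER}/${SCHOOL}/course_list_${SCHOOL}.txt"
head "/${BASEDIR}/${CLUSTER}/${SCHOOL}/course_list_${SCHOOL}.txt"
# Check the size and line count of every list.
ls -lh /${BASEDIR}/*/*/course_list*
wc -l /${BASEDIR}/*/*/course_list*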

Since each course is on its own line, I can compare these numbers to other known numbers of courses.

So nice to get the computer to work for me. Purely by hand this would have taken days. It took about half an hour to craft the core and make sure it looked right. Then another half hour for the loop to work right.
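The loop itself is not shown above, so here is one way it could look. This is a sketch under assumptions, not the original script: write_course_lists is a hypothetical name, the layout is taken to be BASEDIR/CLUSTER/SCHOOL/bak, and the find is run from inside each bak directory.

```shell
# write_course_lists is a hypothetical helper. It assumes the layout
# BASEDIR/CLUSTER/SCHOOL/bak and writes one list file per school.
write_course_lists() {
    basedir=$1
    for bakdir in "$basedir"/*/*/bak; do
        [ -d "$bakdir" ] || continue    # skip when the glob matched nothing
        schooldir=$(dirname "$bakdir")
        school=$(basename "$schooldir")
        list="$schooldir/course_list_${school}.txt"
        # Run find from inside bak so the listed paths are relative,
        # then strip the leading "./" from each one.
        ( cd "$bakdir" && find . -name '*.bak' | sed -e 's|^\./||' ) > "$list"
        # Report the line count next to the file name for a quick check.
        printf '%s %s\n' "$(wc -l < "$list" | tr -d ' ')" "$list"
    done
}
```

It would be called with the base directory as its only argument, for example write_course_lists "/${BASEDIR}". Only 37 or so directories match the glob, so the for loop stays well clear of any argument-length limit.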

Of course, I need to figure out how to do this in Powershell. 🙂


In the beginning I used TOTAL=`ls /d2lmig/*/*/bak/*/* | wc -l` to get a total count. All was good.

Until at around 55,000 files I got: -bash: /bin/ls: Argument list too long.

Then I used TOTAL=`find /d2lmig/*/*/bak/*/* -name *.bak | wc -l` to get a total count. All was good.

Until at around 90,000 files I got: -bash: /usr/bin/find: Argument list too long.

It happened that I was already using a for do done loop and getting a count for each bak directory. So I added within the for loop:

if [ "$RUNTYPE" = "INTERNAL" ] ; then echo " Running total = $TOTAL" ; fi

(INTERNAL is a value I pass at the command line that controls whether the email is sent to me or other parties.)

Comment out the TOTAL using find. Voila.

I know, I should have done it this way in the beginning. It was sloppy code to go back and get the total afterward. Maybe because I have this post here, I will reference it and not be so sloppy in the future.
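Both errors come from the shell expanding the wildcards into one enormous argument list before ls or find ever runs. A running total avoids that by handing find a single directory at a time and letting it walk the tree itself. The following is a sketch of that idea, not the original script: count_bak_files is a hypothetical name, and sending the running total to standard error is a choice made here to keep the final number clean.

```shell
# count_bak_files is a hypothetical helper. It assumes the layout
# BASEDIR/CLUSTER/SCHOOL/bak and prints the total number of .bak files.
count_bak_files() {
    basedir=$1
    total=0
    for bakdir in "$basedir"/*/*/bak; do
        [ -d "$bakdir" ] || continue
        # find receives one directory and the quoted pattern, so the
        # argument list stays short no matter how many files exist.
        count=$(find "$bakdir" -type f -name '*.bak' | wc -l)
        total=$((total + count))
        if [ "${RUNTYPE:-}" = "INTERNAL" ]; then
            echo " Running total = $total" >&2
        fi
    done
    echo "$total"
}
```

Quoting '*.bak' matters as well: it stops the shell from expanding the pattern against the current directory before find sees it.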

Convert Webserver.log to CSV

A security guy at a campus wanted our web server log file in the CSV format. The original file has lines which look something like:

webserver.log13646,2010-11-30        11:08:32        0.0010  999.999.999.999    b7tPM1hTgGYMn90bLTM1    200     GET     /webct/urw/lc987189066271.tp1333853785371/blank.html    -       262     "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_5; en-us) AppleWebKit/533.19.4 (KHTML, like Gecko) Version/5.0.3 Safari/533.19.4" username:0:0

Turns out I only need three sed edits to make it look the way I want:

sed 's|:2009-|,2009-|g' testfile.txt | sed 's|\t|,|g' | sed 's|: |,|g'

The first converts the colon between the end of the file name and the year into a comma. The second converts all the tabs into commas, and the last changes the colon-space between the host name and webserver.log into a comma.

Easy enough. That line from the web server log now looks like:

,webserver.log13646,2010-11-30,11:08:32,0.0010,999.999.999.999,b7tPM1hTgGYMn90bLTM1,200,GET, /webct/urw/lc987189066271.tp1333853785371/blank.html,-,262, "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_5; en-us) AppleWebKit/533.19.4 (KHTML, like Gecko) Version/5.0.3 Safari/533.19.4",username:0:0
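The three edits can also run in a single sed call, with the year matched by pattern so the command does not have to change every January. This is a sketch with assumptions: log_to_csv is a hypothetical name, the sample line is shortened, and host1 is a made-up host name standing in for the real one.

```shell
# log_to_csv is a hypothetical helper that reads log lines on stdin.
log_to_csv() {
    # A literal tab, because \t inside a sed pattern is a GNU extension.
    tab=$(printf '\t')
    # 1. colon before a four-digit year becomes a comma
    # 2. every tab becomes a comma
    # 3. colon-space after the host name becomes a comma
    sed -e 's|:\([0-9][0-9][0-9][0-9]-\)|,\1|' \
        -e "s|${tab}|,|g" \
        -e 's|: |,|g'
}

# Shortened sample line; host1 is a made-up host name.
printf 'host1: webserver.log13646:2010-11-30\t11:08:32\t0.0010\t200\tGET\n' | log_to_csv
```

The times such as 11:08:32 are left alone because the first expression only matches a colon followed by four digits and a hyphen.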

I love regular expressions.

I have a feeling I’ll need to make a primer for this guy too. 🙁

Hostname, Log Name, Date, Time, Seconds to Process, Load Balancer IP, Session ID, HTTP Response Code, HTTP Method, URI, URI Parameters, Bytes Returned, User Agent, Username:Transactions Read:Transactions Written

Special Characters Are Meaningful Too

Dear Google,

When you treat special characters such as underscores, colons, and hyphens as a space, you corrupt my search for a single term into multiple terms, aka not what I sought, so I get too many useless results. Function names, class names, or file names ought to be treated as a single word, not several words. Even when I place quotes around them, you treat them as two adjacent words, not a single word.

Please correct your algorithms or at least give me the option to have your product work correctly. Maybe like Google Book Search you should have Google Code Search? Software is information too.



Notepad++ load langs.xml failed

Notepad++ is my Windows text editor. Work has a site license for UltraEdit. I bought a personal license for EditPlus back in 2002-ish. Notepad++ does what I need it to do without having to track down a license key.

This week I started getting this error when it starts.

Load langs.xml failed!

Apparently this happens often enough that Google was able to suggest the search and pulled up a solution in the first result. Rob3C recommends renaming langs.xml out of the way and copying langs.model.xml into place. It was thoughtful of the developers to provide a known-good default version on which to fall back.
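From a Unix-style shell on Windows (Git Bash or Cygwin, which is an assumption here), the rename-and-copy fix could be scripted roughly like this. restore_langs is a hypothetical helper, the .broken suffix is an arbitrary choice, and the directory holding langs.xml has to be passed in because it varies by install.

```shell
# restore_langs is a hypothetical helper. Pass it the directory that
# holds langs.xml and langs.model.xml.
restore_langs() {
    dir=$1
    if [ ! -f "$dir/langs.model.xml" ]; then
        echo "no langs.model.xml in $dir" >&2
        return 1
    fi
    # Keep the damaged file for later inspection instead of deleting it.
    if [ -f "$dir/langs.xml" ]; then
        mv "$dir/langs.xml" "$dir/langs.xml.broken"
    fi
    cp "$dir/langs.model.xml" "$dir/langs.xml"
}
```

Keeping the broken copy around makes it possible to look at what went wrong afterward.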

I was curious what was wrong with the file. Turns out it just ended in the middle of line 100, which is ini, a style I would not have modified. (I mainly use SQL and bash.) Also, the bad one uses the wrong closing style for the language element, despite the adjacent lines being correct. So the file is no longer valid XML. Very odd.

Selected Quotes About Computers & Software – The Core Memory

Ran across Selected Quotes About Computers & Software at a site called The Core Memory. I have the tee shirt for this first one. The rest are for inspiration.

There are 10 types of people. Those that understand binary and those that do not.

— Ray Roton

Old programmers never die… They just decompile.

— Peter Dick

I haven’t lost my mind, I have it backed up on tape somewhere.

— Unknown

Beware of bugs in the above code; I have only proved it correct, not tried it.

— Donald Knuth

Beware of programmers who carry screw drivers.

— Leonard Brandwein

Programming today is a race between software engineers striving to build bigger and better idiot-proof programs, and the Universe trying to produce bigger and better idiots. So far, the Universe is winning.

— Rich Cook

Computers are useless. They can only give you answers.

— Pablo Picasso

If it’s there and you can see it – it’s real.
If it’s not there and you can see it – it’s virtual.
If it’s there and you can’t see it – it’s transparent.
If it’s not there and you can’t see it – you erased it!

— Scott Hammer

I have a spelling checker,
It came with my PC;
It plainly marks four my revue
Mistakes I cannot sea.
I’ve run this poem threw it,
I’m sure your pleased too no,
Its letter perfect in it’s weigh,
My checker tolled me sew.

— Janet Minor

Programming is like sex, one mistake and you have to support it for the rest of your life.

— Michael Sinz

Computers are not intelligent. They only think they are.

— Unknown

I’d love to change the world, but they won’t give me the source code!

— Unknown

If a train station is where a train stops, what’s a workstation?

— Unknown

As soon as we started programming, we found to our surprise that it wasn’t as easy to get programs right as we had thought. Debugging had to be discovered. I can remember the exact instant when I realized that a large part of my life from then on was going to be spent in finding mistakes in my own programs.

— Maurice Wilkes

They have computers, and they may have other weapons of mass destruction.

— Janet Reno

Computers are like Old Testament gods; lots of rules and no mercy.

— Joseph Campbell

The most likely way for the world to be destroyed, most experts agree, is by accident. That’s where we come in; we’re computer professionals. We cause accidents.

— Nathaniel Borenstein

The real danger is not that computers will begin to think like men, but that men will begin to think like computers.

— Sydney J. Harris

Man is the best computer we can put aboard a spacecraft … and the only one that can be mass produced with unskilled labor.

— Wernher von Braun

Man is a slow, sloppy and brilliant thinker; the machine is fast, accurate and stupid.

— William M. Kelly

Google Chrome on Linux

I was excited to read today that a Google Chrome Beta is now available on Linux. Gmail and Google Reader have weird font issues for me in Firefox on both Linux and Windows. So I tend to split my browser load based on where the sites work best for me.

Making the Linux switch meant leaving Chrome behind unless I went for the unstable version. I was willing to wait for a beta. I just expected to wait a few more months. Whew.

So far so good!