Rants, Raves, and Rhetoric v4

Convert Little-endian UTF-16 to ASCII

I generated some text files working with Get-Acl in PowerShell, but I did not know how to get PowerShell to do some advanced features. (Basically, I wanted Select-String to include the next 2 lines and see whether a specific group was in that list. And maybe some exclusions.) So, I copied the files over to my Linux home to check there.

The most basic grep? Nothing.

I used ls -l and confirmed the files had data. I used less to confirm I could read it.

I copied a string and did a grep for it. Nothing.

I did a dos2unix. That didn’t fix it. Finally, I did:

file filename.txt

That revealed the files had types of:

  1. Original: Little-endian UTF-16 Unicode text, with CRLF line terminators
  2. dos2unix converted: Little-endian UTF-16 Unicode text

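Basically, this told me that dos2unix fixed one problem but not both. The “with CRLF line terminators” part reflects a philosophical difference between Windows and Unix in how to end a line of text: Windows uses a carriage return followed by a line feed (CRLF), while Unix uses a line feed alone.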
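
To make the line-ending difference concrete, here is a small sketch that writes a sample file and dumps its raw bytes. The file names are made up for illustration, and tr -d '\r' is only a simplified stand-in for what dos2unix does to a plain single-byte file, not a description of how dos2unix is implemented.

```shell
workdir=$(mktemp -d)

# Windows-style text: each line ends with carriage return + line feed (\r\n).
printf 'one\r\ntwo\r\n' > "$workdir/crlf_sample.txt"

# od -c prints every byte; the \r before each \n is the "CRLF" that file reports.
od -c "$workdir/crlf_sample.txt"

# Simplified stand-in for dos2unix: delete the carriage returns, leaving Unix-style \n.
tr -d '\r' < "$workdir/crlf_sample.txt" > "$workdir/lf_sample.txt"
od -c "$workdir/lf_sample.txt"
```

The second dump is two bytes shorter, one \r per line.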

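Little-endian is a geeky homage to Gulliver’s Travels. It has to do with the order in which the two bytes of each 16-bit unit are stored: little-endian puts the low-order byte first. But, it isn’t really the big problem here. UTF-16 is the problem because grep compares my search string byte for byte, and in UTF-16 every plain English character takes two bytes, so the string never lines up. I need the file to be UTF-8 for grep to read it. So, the fix is to use an encoding converter: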

iconv -f utf-16 -t utf-8 filename.txt > filename_new.txt
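
The whole episode can be reproduced on a throwaway file. The sketch below is an illustration under assumptions: the sample text, the group name, and the file names are invented stand-ins for the real Get-Acl output, and the byte-order mark is written by hand so the sample looks like what file reported. The last command uses grep -A 2, which prints each match plus the two lines after it, the "include the next 2 lines" behavior described at the top.

```shell
workdir=$(mktemp -d)

# Build a UTF-16LE sample with CRLF endings. \377\376 is the byte-order mark (FF FE).
# The path and group names are made up for illustration.
{
    printf '\377\376'
    printf 'Path: D:/Shares/Finance\r\nOwner: Administrators\r\nAccess: Finance-Users Allow\r\n' |
        iconv -f UTF-8 -t UTF-16LE
} > "$workdir/acl_utf16.txt"

# First four bytes: ff fe (the mark), then 50 00, which is "P" with its low byte first.
od -An -tx1 -N 4 "$workdir/acl_utf16.txt"

# grep looks for the bytes of "Finance-Users" back to back. In UTF-16 each of those
# characters is followed by a 00 byte, so nothing matches.
grep -c 'Finance-Users' "$workdir/acl_utf16.txt" || true

# Convert to UTF-8 (iconv reads the mark to pick the byte order) and drop the \r bytes.
iconv -f UTF-16 -t UTF-8 "$workdir/acl_utf16.txt" | tr -d '\r' > "$workdir/acl_utf8.txt"

# Now the match works; -A 2 also prints the two lines that follow each match.
grep -A 2 'Path:' "$workdir/acl_utf8.txt"
```

Writing the mark explicitly and asking iconv for UTF-16LE keeps the sample the same on every machine; plain -t UTF-16 can pick a different byte order depending on the iconv implementation.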

Comments

3 responses to “Convert Little-endian UTF-16 to ASCII”

  1. […] Convert Little-endian UTF-16 to ASCII published February 27, 2019 at […]

  2. psychocod3r

    What exactly is the difference between UTF-8 and UTF-16? I’ve never understood this.

    1. Ezra S F

      UTF-8 encodes each character in one to four 8-bit units, where UTF-16 uses one or two 16-bit units. 8 bits is a byte. So, for plain English text, UTF-16 files are about twice as large. The character set is the same for both: all of Unicode. In UTF-8, a character outside the ASCII range just takes an extra byte or more, rather than every character spending two bytes whether it needs them or not.

      Windows prefers UTF-16 internally, while Linux and the web mostly use UTF-8.
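
The size difference can be checked with a rough sketch that counts bytes per character in each encoding. The helper name bytes_in and the four sample characters are assumptions chosen for illustration; iconv, wc, and tr are standard tools.

```shell
# Count the bytes of stdin (assumed to be UTF-8) after converting to the encoding in $1.
bytes_in() {
    iconv -f UTF-8 -t "$1" | wc -c | tr -d ' '
}

# Samples are written as octal byte escapes so the script itself stays plain ASCII:
# "A", e with acute accent (U+00E9), euro sign (U+20AC), and an emoji (U+1F600).
# UTF-16LE is used so that no byte-order mark is added to the count.
for sample in 'A' '\303\251' '\342\202\254' '\360\237\230\200'; do
    utf8=$(printf "$sample" | bytes_in UTF-8)
    utf16=$(printf "$sample" | bytes_in UTF-16LE)
    echo "UTF-8: $utf8 bytes, UTF-16: $utf16 bytes"
done
```

The four lines come out as 1 and 2, 2 and 2, 3 and 2, then 4 and 4 bytes, so UTF-16 is only twice the size for text that is mostly ASCII.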
