Filename encoding issues

The Problems

Music Collection

My wife has a large collection of Hebrew music. Since we imported it from some ancient MP3 CDs (I think it was burned on an old OS 9 Mac or something like that), we’ve always had filename and tag character encoding issues, so that the titles come out looking like àøé÷ àééðùèééï, éöç÷ ÷ìôèø. I keep saying I’ll get around to fixing it…

Photo Collection

The other day, our landlord got a new Windows XP system to replace his Windows 98 one. He had a large collection of photos on it, many with Hebrew titles. I wanted to just transfer the files across the network to his Windows computer, but I couldn’t get them to talk. Instead of debugging that, I just used secure shell to copy the files directly to my Linux system, from which I intended to burn a CD. Unfortunately, when I got to my computer, I saw that all his files had an “Invalid encoding” message.

Explanation

Linux (or at least my Ubuntu system, I can’t speak authoritatively here) stores filenames in UTF-8 character encoding. Many legacy systems, like Windows 98, stored files in language specific character encodings. In the case of Hebrew, it’s called WINDOWS-1255. This is a single-byte character set, meaning the first 128 possible values are the same as ASCII, and the next 128 are language-specific. Unfortunately, there are many encodings like this, and there is no way to tell them apart without outside information. The most common of these is Latin-1, which includes a lot of vowels with funny marks over them (see the music collection sample above).

So, when importing the music collection, my Linux box attempted to convert from the legacy character encoding to UTF-8. (I actually don’t remember at which point this conversion happened, it could have been earlier. It’s irrelevant in any event.) Unfotunately, it didn’t know it was dealing with Hebrew, and so took a guess that it was Latin-1. Since, for example, the Hebrew letter Alef has a hex code of E0 in Windows-1255, which is à in Latin-1, all of the Hebrew looks like I fell asleep doing my Spanish homework.

With the photo collection, the secure shell transfer never attempted to do the Latin-1 to UTF-8 conversion, and thus the files on the Linux box showed up with the original Windows-1255 encoding. This is actually slightly easier to deal with.

The Solution

Below is the code I used to fix this whole thing up. I’ll appreciate any critiques that are available. I’m not sure if this is a common problem for people or not; if people want it, I’ll package this up and put it on Hackage.

The basic code flow is: for each file in the source directory, convert the directory and file name to UTF-8 encoding, create the destination directory, and create a hard link. The caveats: if you specify that you want to convert back to Latin-1 (which was necesary for the music collection), the conversion process will go from UTF-8 to Latin-1 and then your specified encoding (mine was Windows-1255) back to UTF-8. If you do not wish that step (as in the photo collection), it only does the second conversion.

Additionally, it seems that Haskell- or at least the directory package- does not properly convert Strings to UTF-8 when making system calls. Thus I have an ugly function (utf8StringHack) to address this. I hope that in the future this won’t be necesary.

The Code

import System.Directory
import Codec.Text.IConv
import qualified Data.ByteString.Lazy as B
import Data.ByteString.Class
import qualified System.UTF8IO as U
import Control.Monad
import Data.List
import System.Posix.Files
import System.Environment

usage :: String
usage = "<convert to latin-1 first> <source encoding> <input dir> " ++
        "<output dir>"

main :: IO ()
main = do
    args <- getArgs
    when (length args /= 4) $ error usage
    let [toLatin1Str, encoding, input, output] = args
    -- convert the string version to a Bool version
    -- this variable specifies whether we need to convert from
    -- UTF-8 to Latin-1 first (see comments below)
    let toLatin1 = case toLatin1Str of
                    ('y':_) -> True
                    ('Y':_) -> True
                    _ -> False
    allFiles <- getTree input
    mapM_ (fixFile toLatin1 encoding input output) allFiles

-- | Convert the filename encoding of a single file.
--
-- Creates necesary directories and uses hard links.
fixFile :: Bool -- ^ whether to first convert to Latin-1 from UTF-8
        -> String -- ^ encoding
        -> FilePath -- ^ top of source directory
        -> FilePath -- ^ top of destination directory
        -> [String] -- ^ subpath of the file to fix
        -> IO ()
fixFile toLatin1 encoding input output path = do
    -- Fix the encoding of the subpath.
    let path' = map (convertName toLatin1 encoding) path
    -- The name of the directory which must be created.
    let destdir = utf8StringHack
                $ output ++ "/" ++ intercalate "/" (init path')
    -- The ultimate file destination.
    let destfile = utf8StringHack
                 $ output ++ "/" ++ intercalate "/" path'
    -- And the current filename, in all its badly-encoded glory.
    let srcfile = input ++ "/" ++ intercalate "/" path
    createDirectoryIfMissing True destdir
    createLink srcfile destfile

-- | I hope that this function will not be necesary in the future.
-- This takes a sequence of Unicode characters, encodes them to bytes
-- using UTF-8 encoding, and then puts those bytes back into a string.
--
-- This is needed for passing off to the System.Directory calls
-- like createDirectoryIfMissing and createLink.
--
-- In theory, all functions touching the outside world could properly
-- do the character encoding/decoding themselves.
utf8StringHack :: String -> String
utf8StringHack = map (toEnum . fromIntegral) . B.unpack . toLazyByteString

-- | Simply determine if the filename begins with a period.
notHidden :: String -> Bool
notHidden ('.':_) = False
notHidden _ = True

-- | Get all of the files in the given path.
getTree :: FilePath -> IO [[String]]
getTree f = getTree' f []

getTree' :: FilePath -- ^ containing path for the directory currently worked on
         -> [String] -- ^ current subpath
         -> IO [[String]]
getTree' dir prev = do
    -- Immediate children.
    contents <- getDirectoryContents dir
    -- Unhidden children.
    let contents' = filter notHidden contents
    -- Generate the full path for a file here.
    let addDir :: String -> String
        addDir s = dir ++ "/" ++ s
    files <- filterM (doesFileExist . addDir) contents'
    dirs <- filterM (doesDirectoryExist . addDir) contents'
    -- Tack the current filename onto the running subpath.
    let files' = map ((++) prev . return) files
    -- Recursive part.
    dirs' <- mapM helper dirs
    -- Stick together current files and files in subdirs.
    return $! files' ++ concat dirs'
    where
        -- Recursively call getTree' for a subdir here.
        helper :: FilePath -> IO [[String]]
        helper dirPart = do
            let dir' = dir ++ "/" ++ dirPart ++ "/"
                prev' = prev ++ [dirPart]
            getTree' dir' prev'

-- | Convert an incorrectly encoded file name to a proper Unicode string.
--
-- Often times the filename will be incorrectly translated at some point
-- from Latin-1 to UTF-8. This is all well and good- if you're dealing
-- with LATIN1. Otherwise, you now need to do two things: convert from
-- UTF-8 to LATIN1 to undo the incorrect conversion, and convert from
-- your real encoding to UTF-8. That is the purpose of the first parameter.
convertName :: Bool -- ^ convert from UTF-8 to Latin-1 first
            -> String -- ^ character encoding of the filename
            -> String -- ^ incorrectly encoded filename
            -> String -- ^ corrected filename
convertName toLatin1 encoding =
    fromLazyByteString .
    convert encoding "UTF-8" .
    (if toLatin1 then convert "UTF-8" "LATIN1" else id) .
    B.pack .
    map (toEnum . fromEnum)
Advertisements

Tags:

7 Responses to “Filename encoding issues”

  1. ephemient Says:

    My shell history is full of lines similar to

    for i in *; do ln “$i” “../fixed/$(iconv -fGBK -tUTF-8 <<<“$i”)”; done

    as I’m often dealing cleaning up after extracting zips created on Windows machines from certain Asian countries… sometimes something like “iconv -fUTF-8 -tiso8859-1 | iconv -fGBK -tUTF-8” is needed when there have been intermediaries introducing their own corruptions, like your music collection.

    I’m a little surprised that you’d go all the way to Haskell to solve this problem, but that works too 🙂

  2. zong_sharo Says:

    convmv

  3. Shimuuar Says:

    BTW there is a nice program “convmv” which convert filename encodings. I found it very useful.

    Not sure about ubuntu, but it’s in debian repos.

  4. Elad Says:

    Hi,

    My problem wasn’t in Linux but it’s almost the same one. reading CDDB info for a CD I wanted to rip to my library I got back the all the song names in (you guessed it right) Latin-1 encoding instead of Hebrew.
    Windows shell being less powerful than the linux equivalent I turned to C# .Net.
    Basically all I did was read each filename as a unicode string, encode that string using the Latin-1 codepage in to a byte array, then, decode the byte array back into unicode using the Hebrew codepade. And copy the file using the new name (wishing i could hardlink instead).

    Thank for your explanation and solution it steered me in the right direction.

  5. moshe beeri Says:

    convmv -r –notest -f windows-1255 -t UTF-8 *
    please note that omitting –notest will let you test prior to changing.

  6. tur Says:

    any chance you have any clue how to get this done in windows? (@elad especially)

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s


%d bloggers like this: