Archive for June, 2009

Hack Introduction

June 28, 2009

There’s been some noise, and some confusion, recently about Hack. Hopefully this post can address some of the issues.

What it is

Hack is a webserver interface. This means it defines a protocol that lets web applications talk to different web servers. For example, I can write a web application against the Hack protocol and then easily switch backends from CGI to FastCGI to Happstack.

Hack is authored by Jinjing Wang.

What it isn’t

  • A web server. This is just a protocol for talking to web servers (see handlers later on)
  • A framework. If you’re looking for a Rails replacement, you’re looking in the wrong place. However, if you want to write a Rails replacement, I would recommend Hack as a good base for it.
  • A coffee maker.

Architecture

The architecture is very simple. Hack defines the following:

Env

The Env data type is essentially the request object. It has the query string, the POST body, HTTP headers, etc. Notice I said query string and not GET parameters. In an effort to keep the protocol as lightweight as possible, Hack does no query string processing, POST parameter processing, cookie handling, etc. The application must handle it all.

That said, there are a few options:

  • Write all the processing code yourself.
  • Use my web-encodings package, which handles processing of those fields.
  • Use a hack frontend library (see below).
  • Use a framework. None are available right now, but I’m working on a Restful front controller. That’s what I currently use for a few sites.
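To give a feel for the first option, here is a minimal sketch of query string processing done by hand. The names parseQuery, urlDecode and splitOn are made up for this illustration and are not part of Hack or web-encodings. The sketch maps each decoded byte straight to a Char, so it does no UTF-8 decoding.

```haskell
import Data.Char (chr, digitToInt, isHexDigit)

-- Split a string on a separator character, keeping empty chunks.
splitOn :: Char -> String -> [String]
splitOn c s = case break (== c) s of
    (chunk, [])       -> [chunk]
    (chunk, _ : rest) -> chunk : splitOn c rest

-- Undo URL encoding: '+' becomes a space and %XX becomes the byte XX.
-- Bytes are mapped straight to Chars; no UTF-8 decoding is attempted.
urlDecode :: String -> String
urlDecode ('+' : rest) = ' ' : urlDecode rest
urlDecode ('%' : a : b : rest)
    | isHexDigit a && isHexDigit b =
        chr (16 * digitToInt a + digitToInt b) : urlDecode rest
urlDecode (c : rest) = c : urlDecode rest
urlDecode [] = []

-- Turn "a=1&b=two" into [("a","1"),("b","two")].
parseQuery :: String -> [(String, String)]
parseQuery = map pair . filter (not . null) . splitOn '&'
  where
    pair s =
        let (key, value) = break (== '=') s
        in (urlDecode key, urlDecode (drop 1 value))

main :: IO ()
main = print (parseQuery "item=tooth+brush&qty=2")
```

An application would call something like parseQuery on the raw query string it finds in the Env.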

Response

Response is simply the output of an application for a single request. It is the status code, HTTP headers and body. Remember, we’re talking low level here: you don’t have any high level templates or Haskell-to-Javascript converters at this level. That’s where a framework would come in.

Application

An application is just an “Env -> IO Response”. It takes a single request and generates a response. As a little piece of advice, if you want to have long-running processes (like with FastCGI) and don’t want to have to reload your data every time, use currying: load the data once and partially apply your function to it. (Hopefully, my next post will be a sample Hack application which will do just that. I apologize for the lack of examples here, but I’m trying to just give an overview.)
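As a rough sketch of the currying advice: load the data once, then partially apply a function to it to get an Application. The Env and Response types below are simplified stand-ins defined just for this example (Hack's real records have more fields), and Prices, loadPrices, respond and priceApp are hypothetical names.

```haskell
-- Simplified stand-ins for Hack's Env and Response (assumed shapes).
data Env = Env
    { pathInfo    :: String
    , queryString :: String
    }

data Response = Response
    { status  :: Int
    , headers :: [(String, String)]
    , body    :: String
    } deriving (Eq, Show)

type Application = Env -> IO Response

-- Hypothetical data that is expensive to load: a price list in cents.
type Prices = [(String, Int)]

loadPrices :: IO Prices
loadPrices = return [("/toothbrush", 125)]

-- Pure core: look the requested path up in the preloaded data.
respond :: Prices -> Env -> Response
respond prices env = case lookup (pathInfo env) prices of
    Just cents -> Response 200 plain (show cents)
    Nothing    -> Response 404 plain "unknown item"
  where
    plain = [("Content-Type", "text/plain")]

-- Partially applying priceApp to the data yields an Application.
priceApp :: Prices -> Application
priceApp prices = return . respond prices

main :: IO ()
main = do
    prices <- loadPrices          -- runs once, at startup
    let app = priceApp prices     -- app :: Application
    app (Env "/toothbrush" "") >>= print
    app (Env "/toothpaste" "") >>= print
```

The point is that loadPrices runs once, while app can be handed to a long-running handler and called for every request.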

Middleware

Some tasks are going to be performed by many applications, and thus it would be a waste to force each application to reimplement that functionality. For example, do you want to have to write gzip compression into every application you write? I thought not. Therefore, middleware just takes an existing application and wraps it with extra functionality. Two notes:

  1. You can use multiple middlewares at once. I use, for example, cleanpath, clientsession, gzip and jsonp.
  2. The order in which you apply these matters. (Again, hopefully more details on this in the next post.)
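Here is a small sketch of why the order matters, again using simplified stand-in types rather than Hack's real ones. Both middlewares are invented for the illustration: padBody wraps the body in a JSONP-style callback, and setLength records the body length in a header.

```haskell
-- Simplified stand-ins for Hack's types (assumed shapes).
data Env = Env { pathInfo :: String }

data Response = Response
    { status  :: Int
    , headers :: [(String, String)]
    , body    :: String
    } deriving (Eq, Show)

type Application = Env -> IO Response
type Middleware  = Application -> Application

-- Build a middleware from a pure function on responses.
onResponse :: (Response -> Response) -> Middleware
onResponse f app env = fmap f (app env)

-- Wrap the body in a fixed callback, roughly what a JSONP middleware does.
padBody :: Response -> Response
padBody r = r { body = "cb(" ++ body r ++ ")" }

-- Record the current body length, replacing any older value.
setLength :: Response -> Response
setLength r = r { headers = ("Content-Length", show (length (body r))) : others }
  where
    others = filter ((/= "Content-Length") . fst) (headers r)

jsonApp :: Application
jsonApp _ = return (Response 200 [] "{}")

main :: IO ()
main = do
    -- padBody runs first, then setLength sees the final body.
    good <- (onResponse setLength . onResponse padBody) jsonApp (Env "/")
    -- setLength runs first, so the header is stale after padding.
    bad  <- (onResponse padBody . onResponse setLength) jsonApp (Env "/")
    print good
    print bad
```

In the first stacking the length header matches the padded body; in the second it still describes the unpadded one.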

Handler

A handler is simply a function with the type signature “Application -> IO ()” (or something similar enough). Basically, it’s what “runs” your application. Jinjing has written a number of handlers, but I’m not very familiar with those. I’ve written three which I use on a regular basis, so I’ll describe them here.

hack-handler-cgi

Run your application as a regular old CGI application. If you don’t know about CGI, you probably should do a little more research into web programming before attempting Hack.
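To make the idea of a handler concrete, here is a toy CGI-style handler. It is only a sketch of the “Application -> IO ()” shape and not the code in hack-handler-cgi: the types are simplified stand-ins, it reads just two CGI environment variables, and it ignores the request body and request headers entirely.

```haskell
import Data.Maybe (fromMaybe)
import System.Environment (lookupEnv)

-- Simplified stand-ins for Hack's types (assumed shapes).
data Env = Env
    { pathInfo    :: String
    , queryString :: String
    }

data Response = Response
    { status  :: Int
    , headers :: [(String, String)]
    , body    :: String
    }

type Application = Env -> IO Response

-- Render a response in simplified CGI style: header lines, blank line, body.
renderResponse :: Response -> String
renderResponse r =
    concatMap line (("Status", show (status r)) : headers r) ++ "\r\n" ++ body r
  where
    line (key, value) = key ++ ": " ++ value ++ "\r\n"

-- The toy handler: build an Env from the process environment,
-- run the application once, and write the result to stdout.
run :: Application -> IO ()
run app = do
    path  <- fmap (fromMaybe "") (lookupEnv "PATH_INFO")
    query <- fmap (fromMaybe "") (lookupEnv "QUERY_STRING")
    response <- app (Env path query)
    putStr (renderResponse response)

hello :: Application
hello env = return $
    Response 200 [("Content-Type", "text/plain")] ("You asked for " ++ pathInfo env)

main :: IO ()
main = run hello
```

Swapping run for a FastCGI or standalone-server handler would leave hello untouched, which is the whole point of the interface.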

hack-handler-fastcgi

Simply wraps up hack-handler-cgi with the FastCGI C library, in the same way that the fastcgi package wraps up cgi.

hack-handler-simpleserver

This is a little standalone HTTP server that I wrote. It is not meant to be production quality. I only use this for debugging purposes (ie, so I don’t have to set up Apache on my local system). Caveat emptor.

Frontend

I wrote a monadcgi frontend for kicks, and now looking at Hackage I see Jinjing also wrote one for happstack. Not being familiar with that package or Happstack, I’ll just address the monadcgi one.

Basically, there has been a CGI library around for a while that defines a CGI monad. There are two problems with this:

  1. Some people (including me) think that the approach chosen for the library is too “object oriented”.
  2. If you write code for this library, you’re stuck with CGI (or FastCGI with the fastcgi package).

Using the monadcgi frontend for Hack, you can take any application written for the old CGI monad and make it work with any Hack handler.

Conclusion

Hack is in its infancy right now; don’t let the large number of Hack packages on Hackage make you think otherwise. Nonetheless, some of us are using it in production settings now with great success. The documentation is lacking, but on the other hand, Hack is so incredibly simple that it doesn’t really need documentation. In any event, I hope to rectify the documentation issue with some code samples soon.

Also, I’d like to address some potential criticism: Hack does not solve many problems. I’ve heard that people are concerned with leaving file handles open, database locking, etc. These are real issues that plague us all in web development. However, this is not Hack’s concern. Hack simply lets your application talk to a handler. Period. You still need to figure out if you want to use HSP or the html library, if you’ll use jquery or HJScript, or if you’ll go the HDBC, Takusen or happstack-state route.

No. Hack ignores all these issues, and hopefully will allow people around the Haskell community to begin to standardize our web development practices in at least one arena.

Filename encoding issues

June 11, 2009

The Problems

Music Collection

My wife has a large collection of Hebrew music. Since we imported it from some ancient MP3 CDs (I think it was burned on an old OS 9 Mac or something like that), we’ve always had filename and tag character encoding issues, so that the titles come out looking like àøé÷ àééðùèééï, éöç÷ ÷ìôèø. I keep saying I’ll get around to fixing it…

Photo Collection

The other day, our landlord got a new Windows XP system to replace his Windows 98 one. He had a large collection of photos on it, many with Hebrew titles. I wanted to just transfer the files across the network to his Windows computer, but I couldn’t get them to talk. Instead of debugging that, I just used secure shell to copy the files directly to my Linux system, from which I intended to burn a CD. Unfortunately, when I got to my computer, I saw that all his files had an “Invalid encoding” message.

Explanation

Linux (or at least my Ubuntu system, I can’t speak authoritatively here) stores filenames in UTF-8 character encoding. Many legacy systems, like Windows 98, stored filenames in language-specific character encodings. In the case of Hebrew, it’s called Windows-1255. This is a single-byte character set, meaning the first 128 possible values are the same as ASCII, and the next 128 are language-specific. Unfortunately, there are many encodings like this, and there is no way to tell them apart without outside information. The most common of these is Latin-1, which includes a lot of vowels with funny marks over them (see the music collection sample above).

So, when importing the music collection, my Linux box attempted to convert from the legacy character encoding to UTF-8. (I actually don’t remember at which point this conversion happened; it could have been earlier. It’s irrelevant in any event.) Unfortunately, it didn’t know it was dealing with Hebrew, and so took a guess that it was Latin-1. Since, for example, the Hebrew letter Alef has a hex code of E0 in Windows-1255, which is à in Latin-1, all of the Hebrew looks like I fell asleep doing my Spanish homework.
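The byte mix-up can be sketched in a few lines. This is only an illustration, not the approach used in the code further down (which goes through iconv): it relies on the fact that Windows-1255 places the letters Alef through Tav at bytes 0xE0 to 0xFA, matching Unicode U+05D0 to U+05EA, and it leaves every other character alone. The names fixChar and fixMojibake are made up here.

```haskell
import Data.Char (chr, ord)

-- Windows-1255 stores the Hebrew letters Alef..Tav as bytes 0xE0..0xFA,
-- which correspond to Unicode U+05D0..U+05EA. A Latin-1 guess turns those
-- same bytes into accented Latin letters instead.
fixChar :: Char -> Char
fixChar c
    | n >= 0xE0 && n <= 0xFA = chr (n - 0xE0 + 0x5D0)
    | otherwise              = c
  where
    n = ord c

-- Repair a string whose Hebrew letters were misread as Latin-1.
-- Only the letters are handled; other Windows-1255 symbols are not.
fixMojibake :: String -> String
fixMojibake = map fixChar

main :: IO ()
main = do
    -- The first four garbled characters of the sample title, as escapes.
    let garbled = "\xE0\xF8\xE9\xF7"
    print (map ord garbled)
    print (map ord (fixMojibake garbled))
```

The program prints code points rather than the text itself so that it does not depend on the terminal's encoding.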

With the photo collection, the secure shell transfer never attempted to do the Latin-1 to UTF-8 conversion, and thus the files on the Linux box showed up with the original Windows-1255 encoding. This is actually slightly easier to deal with.

The Solution

Below is the code I used to fix this whole thing up. I’d appreciate any critiques. I’m not sure if this is a common problem for people or not; if people want it, I’ll package this up and put it on Hackage.

The basic code flow is: for each file in the source directory, convert the directory and file name to UTF-8 encoding, create the destination directory, and create a hard link. The caveats: if you specify that you want to convert back to Latin-1 first (which was necessary for the music collection), the conversion goes from UTF-8 to Latin-1, and then from your specified encoding (mine was Windows-1255) back to UTF-8. If you do not need that step (as in the photo collection), it only does the second conversion.

Additionally, it seems that Haskell, or at least the directory package, does not properly convert Strings to UTF-8 when making system calls. Thus I have an ugly function (utf8StringHack) to address this. I hope that in the future this won’t be necessary.
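To show what such a hack amounts to, here is a dependency-free sketch: encode each character to its UTF-8 bytes and store every byte as its own Char. The real utf8StringHack in the code goes through a lazy ByteString instead; encodeChar and utf8Bytes are names invented for this illustration. As far as I know, newer GHC versions encode file paths themselves, so there this trick would likely double-encode.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (chr, ord)

-- Encode one Unicode character as its UTF-8 byte values.
encodeChar :: Char -> [Int]
encodeChar c
    | n < 0x80    = [n]
    | n < 0x800   = [0xC0 .|. (n `shiftR` 6), cont n]
    | n < 0x10000 = [0xE0 .|. (n `shiftR` 12), cont (n `shiftR` 6), cont n]
    | otherwise   = [ 0xF0 .|. (n `shiftR` 18)
                    , cont (n `shiftR` 12)
                    , cont (n `shiftR` 6)
                    , cont n
                    ]
  where
    n = ord c
    -- A continuation byte carries the low six bits behind a 10 prefix.
    cont x = 0x80 .|. (x .&. 0x3F)

-- Put each UTF-8 byte into its own Char, so that a system call which
-- truncates Chars to bytes ends up writing valid UTF-8.
utf8Bytes :: String -> String
utf8Bytes = map chr . concatMap encodeChar

main :: IO ()
main = print (map ord (utf8Bytes "\x5D0.mp3"))
```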

The Code

import System.Directory
import Codec.Text.IConv
import qualified Data.ByteString.Lazy as B
import Data.ByteString.Class
import qualified System.UTF8IO as U
import Control.Monad
import Data.List
import System.Posix.Files
import System.Environment

usage :: String
usage = "<convert to latin-1 first> <source encoding> <input dir> " ++
        "<output dir>"

main :: IO ()
main = do
    args <- getArgs
    when (length args /= 4) $ error usage
    let [toLatin1Str, encoding, input, output] = args
    -- convert the string version to a Bool version
    -- this variable specifies whether we need to convert from
    -- UTF-8 to Latin-1 first (see comments below)
    let toLatin1 = case toLatin1Str of
                    ('y':_) -> True
                    ('Y':_) -> True
                    _ -> False
    allFiles <- getTree input
    mapM_ (fixFile toLatin1 encoding input output) allFiles

-- | Convert the filename encoding of a single file.
--
-- Creates necessary directories and uses hard links.
fixFile :: Bool -- ^ whether to first convert to Latin-1 from UTF-8
        -> String -- ^ encoding
        -> FilePath -- ^ top of source directory
        -> FilePath -- ^ top of destination directory
        -> [String] -- ^ subpath of the file to fix
        -> IO ()
fixFile toLatin1 encoding input output path = do
    -- Fix the encoding of the subpath.
    let path' = map (convertName toLatin1 encoding) path
    -- The name of the directory which must be created.
    let destdir = utf8StringHack
                $ output ++ "/" ++ intercalate "/" (init path')
    -- The ultimate file destination.
    let destfile = utf8StringHack
                 $ output ++ "/" ++ intercalate "/" path'
    -- And the current filename, in all its badly-encoded glory.
    let srcfile = input ++ "/" ++ intercalate "/" path
    createDirectoryIfMissing True destdir
    createLink srcfile destfile

-- | I hope that this function will not be necessary in the future.
-- This takes a sequence of Unicode characters, encodes them to bytes
-- using UTF-8 encoding, and then puts those bytes back into a string.
--
-- This is needed for passing off to calls that touch the file system,
-- like createDirectoryIfMissing and createLink.
--
-- In theory, all functions touching the outside world could properly
-- do the character encoding/decoding themselves.
utf8StringHack :: String -> String
utf8StringHack = map (toEnum . fromIntegral) . B.unpack . toLazyByteString

-- | Simply determine if the filename begins with a period.
notHidden :: String -> Bool
notHidden ('.':_) = False
notHidden _ = True

-- | Get all of the files in the given path.
getTree :: FilePath -> IO [[String]]
getTree f = getTree' f []

getTree' :: FilePath -- ^ containing path for the directory currently worked on
         -> [String] -- ^ current subpath
         -> IO [[String]]
getTree' dir prev = do
    -- Immediate children.
    contents <- getDirectoryContents dir
    -- Unhidden children.
    let contents' = filter notHidden contents
    -- Generate the full path for a file here.
    let addDir :: String -> String
        addDir s = dir ++ "/" ++ s
    files <- filterM (doesFileExist . addDir) contents'
    dirs <- filterM (doesDirectoryExist . addDir) contents'
    -- Tack the current filename onto the running subpath.
    let files' = map ((++) prev . return) files
    -- Recursive part.
    dirs' <- mapM helper dirs
    -- Stick together current files and files in subdirs.
    return $! files' ++ concat dirs'
    where
        -- Recursively call getTree' for a subdir here.
        helper :: FilePath -> IO [[String]]
        helper dirPart = do
            let dir' = dir ++ "/" ++ dirPart ++ "/"
                prev' = prev ++ [dirPart]
            getTree' dir' prev'

-- | Convert an incorrectly encoded file name to a proper Unicode string.
--
-- Often times the filename will be incorrectly translated at some point
-- from Latin-1 to UTF-8. This is all well and good- if you're dealing
-- with LATIN1. Otherwise, you now need to do two things: convert from
-- UTF-8 to LATIN1 to undo the incorrect conversion, and convert from
-- your real encoding to UTF-8. That is the purpose of the first parameter.
convertName :: Bool -- ^ convert from UTF-8 to Latin-1 first
            -> String -- ^ character encoding of the filename
            -> String -- ^ incorrectly encoded filename
            -> String -- ^ corrected filename
convertName toLatin1 encoding =
    fromLazyByteString .
    convert encoding "UTF-8" .
    (if toLatin1 then convert "UTF-8" "LATIN1" else id) .
    B.pack .
    map (toEnum . fromEnum)

Functors and Monads (containers)

June 2, 2009

Introduction

In a philosophy class in college, I remember learning the idea that in order to understand something, you can’t simply study it. For example, if you study the heart, you’ll understand the function of the valves, what causes the muscle to relax and contract, and so on, but you’ll have no idea what a heart is and what its purpose is. In order to learn that, you need to study the human body.

I’m going to take the same approach to Monads and Functors (and hopefully later Applicative). I hope this doesn’t turn into another Monads are Burritos. I’ll avoid type signatures and such as much as possible so as not to obscure the main point.

Containers

Two scary concepts in Haskell, Functors and Monads, both deal with containers. These aren’t as limited as containers from a language like Java. There, containers are linked lists, arrays, maps, sets; some way of managing collections of stuff. In Haskell, we use containers to represent actions, logging, and (our topic today) things which might exist. I’m speaking about Maybe.

I’ve chosen Maybe since I find it easy to think about. You either have Just a result, or Nothing. To further simplify, we’ll only deal with Maybe Int; a number that might exist.

How can such a thing occur? Let’s say these numbers represent the price for an item, given in cents. We might have a getPrice function which takes an Item as an argument. If a Toothbrush costs $1.25, then the function would return Just 125.

On the other hand, say you don’t have a price for Toothpaste. Then the function would return Nothing. (Think for a moment how you’d deal with this in Java: should a 0 represent unknown? What if you have a buy-one-get-one-free sale? Use -1? You’ll have to remember to check for it everywhere. But I digress.)

Monads

Anyway, let’s now say that you have another function to add tax onto the purchase price. It doesn’t deal with Maybes; it adds 5% to an Int. But our getPrice function returns a Maybe Int. How can we stick these two functions together (ie, compose them)?

This is where we can use the trusty old Monad. Let’s start with do notation:

getPriceWithTax item = do
    price <- getPrice item
    return $ addTax price

That’s a little verbose for something so simple. Let’s drop the do:

getPriceWithTax item = getPrice item >>= return . addTax
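For anyone who wants to run the two versions side by side, here is a self-contained sketch. The Item type and the bodies of getPrice and addTax are filled in as assumptions based on the description above (prices in cents, 5% tax, rounded down), and the two versions get different names so that they can live in one file.

```haskell
data Item = Toothbrush | Toothpaste

-- Assumed price table: only the toothbrush has a known price.
getPrice :: Item -> Maybe Int
getPrice Toothbrush = Just 125
getPrice Toothpaste = Nothing

-- Add 5% to a price in cents, rounding down.
addTax :: Int -> Int
addTax price = price + (price * 5) `div` 100

-- The do-notation version.
getPriceWithTaxDo :: Item -> Maybe Int
getPriceWithTaxDo item = do
    price <- getPrice item
    return $ addTax price

-- The same thing with an explicit bind.
getPriceWithTaxBind :: Item -> Maybe Int
getPriceWithTaxBind item = getPrice item >>= return . addTax

main :: IO ()
main = do
    print (getPriceWithTaxDo Toothbrush, getPriceWithTaxBind Toothbrush)
    print (getPriceWithTaxDo Toothpaste, getPriceWithTaxBind Toothpaste)
```

Both versions agree on every item: a known price comes back taxed inside a Just, and an unknown price stays Nothing.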

Great… except why the return? We have two types at play: Int and Maybe Int. When dealing with one-argument functions, you get four possible function signatures:

  1. Int -> Int
  2. Int -> Maybe Int
  3. Maybe Int -> Int
  4. Maybe Int -> Maybe Int

The third option is unwrapping a contained value; it is not a topic for Monads or Functors, so ignore it for now. (If you care, look up fromJust.) The first option is a regular old function, like addTax. Option 2 is like getPrice: it ends up wrapping a container. (We’ll address option 4 later.)

Composition

In order to compose two functions, the output of one must be the same type as the input of the other. With containers, we in general can only add containers, not take them away. So in our getPriceWithTax function above, we have getPrice returning a Maybe Int, and addTax taking a plain Int.

To make them compatible, one of them will have to change. Guess which? That’s right, we need to make addTax take a Maybe Int instead of an Int. Also, since we can’t remove the container in the middle, we end up returning a Maybe Int as well, which leads us to option 4 above.

Creating an option 4 function is where the Monadic bind function (>>=) comes in handy. It converts an uncontained -> contained function (option 2) into a contained -> contained one (option 4). This gives us back composability! But wait: addTax returns an uncontained value. No problem: we just add a return to make addTax return contained values.
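Spelled out step by step, with the option numbers from the list above, the conversion might look like the sketch below. The names addTaxWrapped and addTaxLifted are made up for the illustration, and addTax is assumed to round down.

```haskell
-- Option 1: a plain function (assumed to add 5%, rounding down).
addTax :: Int -> Int
addTax price = price + (price * 5) `div` 100

-- Option 2: adding return makes the result contained.
addTaxWrapped :: Int -> Maybe Int
addTaxWrapped = return . addTax

-- Option 4: bind feeds a contained value into an option 2 function.
addTaxLifted :: Maybe Int -> Maybe Int
addTaxLifted price = price >>= addTaxWrapped

main :: IO ()
main = do
    print (addTaxLifted (Just 125))
    print (addTaxLifted Nothing)
```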

Functors FTW

You might be thinking that it’s kind of silly to do business this way, and I’ll agree with you. What we really want is a way to convert totally uncontained functions (option 1) to totally contained ones (option 4). This is what Functors are for. A Functor has a single function, fmap, which does exactly that. So our code becomes:

getPriceWithTax3 item = addTax `fmap` getPrice item
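As a quick sanity check, the fmap version can be compared against the bind version. The Item type and the bodies of getPrice and addTax are again assumptions made up for the example.

```haskell
data Item = Toothbrush | Toothpaste

-- Assumed price table: only the toothbrush has a known price.
getPrice :: Item -> Maybe Int
getPrice Toothbrush = Just 125
getPrice Toothpaste = Nothing

-- Add 5% to a price in cents, rounding down.
addTax :: Int -> Int
addTax price = price + (price * 5) `div` 100

-- Bind plus return, as before.
getPriceWithTaxBind :: Item -> Maybe Int
getPriceWithTaxBind item = getPrice item >>= return . addTax

-- The Functor version: fmap lifts addTax over the Maybe.
getPriceWithTax3 :: Item -> Maybe Int
getPriceWithTax3 item = addTax `fmap` getPrice item

main :: IO ()
main = do
    print (getPriceWithTax3 Toothbrush == getPriceWithTaxBind Toothbrush)
    print (getPriceWithTax3 Toothpaste == getPriceWithTaxBind Toothpaste)
```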

Monads are still good

Now, there are times, many of them, when you’ll need the full power of Monads to deal with things. I hope to address that in a future post. But for now, the moral is: if you are using return with Monadic bind, you might want to consider fmap instead.