Equals != Equality

posted by Christian in Programming

The other day, I read the article How to Write an Equality Method in Java. Which reminded me that I wanted to write an entry about equality in programming. I cannot hope to cover this topic adequately in a blog post. My goal is merely to show that equality is very hard and that we often have a sever misconception about it.

Most programming languages give us several ways to check equality. Java has the operator == to test equality on primitives and references and the method boolean Object.equals(Object o) to test semantic equality of objects (we will discuss later what that means). Common Lisp has a lot more (eq, eql, equal, equalp, =, string=, ...). And did you know about the === operator in Javascript?

Semantics of Equality

What does it mean that two Objects are the same? In the OO world a naive approach is to say, that two objects are the same when they are instances of the same class and all members are the same. This is a nice recursive approach, and it is also not to hard to create a terminating implementation, if you take a bit care of circular references.

However, let us consider the numerical tower. Is the integer 2 the same as the float 2.0? If you interpret class in a Java sense, then they are not the same according to the above definition, because they are not instances of the same class. You might argue that integers can be seen as a super-class of floats (meaning that integers are a subset of floats, i.e., every integer is a float). So you could let the more specialized class decide the equality.

This approach does not help us with a different problem: Two strings, according to our definition, are equal if both strings consist of the same characters (in the same order). So the strings "FooBar" and "foobar" are not equal. But what if we do not care fore case sensitivity, e.g. because our file system does not care for it? Building up a set with all file names is a reasonable use-case. A standard Java HashSet<String> would not be very helpful here.

As a matter of fact, the Java String acknowledges this problem and provides equalsIgnoreCase(), but I cannot tell HashSet to use this method instead of equals() (without extending String).

Object relational mappers (ORM) have a similar, but slightly different, problem. They have to make sure that objects in the data base and in the program are matched up correctly. Here the instances of the class Person with name Maria Smith and Maria Jones might be the same object, just at a different point of time. This is usually mitigated by surrogate keys. So these objects are the same with respect to their ID (the surrogate key), but not with respect to the data they hold. A different distinction for the ORM system when it needs to synchronize objects with the persistency context.

Equivalence Relations

The equals() methods should of course implement an equivalence relation. However, very often they do not. An equivalence relation is reflexive (a = a), symmetrical (if a = b, then b = a), and transitive (if a = b and b = c, then a = c). It is quite easy to break the symmetry the class structure is not considered correctly. Let us assume you have a class A and a class B that is a subclass of A (i.e., it extends A in Java).

class A {
  int x,y;
  boolean equals(Object o) {
    if (o instanceof A) {
      A that = (A)o;
      return that.x == this.x && that.y == this.y;
    }
    return false;
  }
}

class B {
  int z;
  boolean equals(Object o) {
  if (o instanceof B) {
    B that = (B)o;
    return that.z == this.z && super.equals(o);
  }
}

If we have an instance a of A with x=1 and y=2 and an instance b of B with x=1, y=2, and z=3, then a.equals(b) will be true, but b.equals(a) will not. Just saying, only instances of the same class may be equal, is not always a viable solution, as explained above.

There Cannot be Just One

The real problem is, that there is more than one equivalence relation. Java (and many other OO languages) forces us to come up with one (more or less) canonical equivalence relation, rather than being able to define the real equivalence relation one currently needs.

A map implementation where you can define the equivalence relation would help. A Java interface for such a definition could look like this:

public interface Equality {
  boolean equals(Object o1, Object o2);
  int hash(Object o);
}

The problem is, that this equality definition needs to know about all classes whose instances might end up in the map. This problem can be solved by open classes and multiple dispatch (Java supports neither), but also by other techniques.

Mutability

The article I mentioned in the first paragraph mentions defining equality on mutable fields as a pitfall. IMHO the final keyword on variables and fields is underutilized in Java (and final should be the default) but this is beside the point. However, it is not so important that the fields used in the equals method are immutable, but that they do not change as long as the object is in the map. The way a lot of libraries and frameworks (e.g., Spring and Hibernate) work, you must give them mutable fields, but these fields might not change after the initialization is done (which is more than just calling the constructor).

Duckprxy now with Javassist Implementation

posted by Christian in Programming

I just released version 0.2 of duckprxy. It now comes with a duck proxy implementation that uses Javassist: JavassistDuckPrxy.

Proxies created with JavassistDuckPrxy are not using runtime reflection, but dynamically created code that is compiled to byte code by Javassist. This should reduce the overhead to that of normal delegation, hence, very little. I did not do any performance tests yet, though.

Duck Typing for Java

posted by Christian in Programming

Every now and then I need to implement some interface, but only one (or very few) methods of it. Usually I need something like this in test code, but it is useful elsewhere, too. The mocking libraries I have seen so far offer something similar, but not quite in the way I want to have it.

I had the idea to use something like duck typing for it for a while now. Yesterday I finally found some time to implement it. So I present to you duckprxy. You give it an interface to implement and an object to delegate to you and duckprxy will call the delegate according to the name of the method.

The service interface for this service is defined by DuckProxy. Currently there are two different implementations, but more on that later. Let us assume that the field duckProxy holds such an implementation of DuckProxy (e.g., via injection). Then we can do the following:

interface MyInterface {
    void foo();
    int bar(int x, int y);
}
public class Delegate {
    public int bar() { return 42; }
}
...
MyInterface prxy = duckProxy.makeProxy(MyInterface.class, new Delegate());
prxy.bar(2, 3); // Will return 42.

The standard behavior if an unimplemented method is called is to throw an exception. However, the delegate class can specify a fallback method to call.

public class Delegate {
    ...
    @DuckMethod(fallback = true)
    public int defaultMethod(
            @DuckArg(DuckArgType.NAME) String name) {
        log.info("called method " + name);
    }
}

Now every time an unspecified method is called, it will simply be logged.

Other features:

  • Mapping calls to methods with patterns.
  • Defining sub-delegates to allow easy overriding of a few methods, while the rest is handled by the sub-delegate.

Right now you need to define a class to use the extra-features with annotations. But I am thinking about also allowing to annotate the delegate variable instead of the delegate class.

As mentioned before, there are currently two implementations of DuckProxy. The first is DuckPrxyImpl. This one basically looks what method to call when the call actually happens. The second is DuckPrxyPreCompImpl. This one pre-computes the mapping of the methods in the given interface to the delegate.

I am thinking about using something like javassist for a third implementation that creates a class instead of a proxy. This way the performance impact should be negligible.

The Visitor Pattern in Differen Languages

posted by Christian in Programming

In this post I want to explore the usage of the visitor pattern in different programming languages.

The goal is always to represent expression trees, i.e., to have the abstract syntax tree (AST) of an algebraic expression like 2 * 5 - 3.

Let us start with a simple implementation in Java:

import java.util.HashMap;
import java.util.Map;

public class ExpressionTree {
    
    public static interface Operator {
        int eval(int ... args);
    }
    
    public static Map<String, Operator> OPERATOR_MAP
    = new HashMap<String, Operator>();
    
    static {
        OPERATOR_MAP.put("+", new Operator() {
            public int eval(int ... args) {
                int result = 0;
                for (int i : args) {
                    result += i;
                }
                return result;
            }
        });
        OPERATOR_MAP.put("*", new Operator() {
            public int eval(int ... args) {
                int result = 1;
                for (int i : args) {
                    result *= i;
                }
                return result;
            }
        });
        OPERATOR_MAP.put("-", new Operator() {
            public int eval(int ... args) {
                if (args.length != 2) {
                    throw new IllegalArgumentException("wrong arity");
                }
                return args[0] - args[1];
            }
        });
        OPERATOR_MAP.put("/", new Operator() {
            public int eval(int ... args) {
                if (args.length != 2) {
                    throw new IllegalArgumentException("wrong arity");
                }
                return args[0] / args[1];
            }
        });
    }
    
    public static interface Visitor {
        void apply(Node node);
    }
    
    public static class Node {
        public void visit(Visitor visitor) {
            visitor.apply(this);
        }
    }
    
    public static class Literal extends Node {
        private int value;
        public Literal(int value) {
            this.value = value;
        }
        public int getValue() {
            return value;
        }
    }
    
    public static class Operation extends Node {
        private Node[] children;
        private String name;
        public Operation(Node[] children, String name) {
            this.children = children;
            this.name = name;
        }
        public String getName() {
            return name;
        }
        public Node[] getChildren() {
            return children;
        }
    }
    
    public static class EvalVisitor implements Visitor {
        private int value;
        public int evaluate(Node node) {
            apply(node);
            return this.value;
        }
        public void apply(Literal node) {
            value = node.getValue();
        }
        public void apply(Operation operation) {
            Node[] children = operation.getChildren();
            int[] values = new int[children.length];
            for (int i=0; i<children.length; i++) {
                EvalVisitor visitor = new EvalVisitor();
                children[i].visit(visitor);
                values[i] = visitor.getValue();
            }
            String name = operation.getName();
            Operator op = OPERATOR_MAP.get(name);
            if (op == null) {
                throw new RuntimeException("unknown operation " + name);
            }
            this.value = op.eval(values);
        }
        public void apply(Node node) {
            if (node instanceof Literal) {
                apply((Literal) node);
            }
            else if (node instanceof Operation) {
                apply((Operation) node);
            }
            else {
                throw new RuntimeException("unknown node instance");
            }
        }
        public int getValue() {
            return value;
        }
    }

    public static void main(String[] args) {
        System.out.println(
                new EvalVisitor().evaluate(
                        new Operation(new Node[] {
                                new Operation(new Node[] {
                                        new Literal(2),
                                        new Literal(5)},
                                        "*"),
                                new Literal(3)},
                                "-")));
    }

}

Two things can be observed here: 1) The base class Node has a method visit() which simply calls the visitor with itself as the argument. 2) The visitor uses instanceof to delegate to the according apply() methods. Hence, run time type information is used. Java's OO system uses single dispatch. Multiple dispatch would make this delegation apply() method superfluous, as we shall see later.

Not all languages allow us to use run time type information and we can get rid of it by modifying the the Visitor interface as follows:

    public static interface Visitor<T> {
        T apply(Literal node);
        T apply(Operation node);
    }

This, however, forces us to make the visit() method in Node abstract and to implement it in Literal and Operation. The code is always exactly the same. Thus, we have the same method in every concrete subclass of Node instead of having it just once. The approach would be required in OCaml if you want to use objects. Another option would be to use algebraic data structures, which are very well supported in OCaml.

Let's move on to a language that has multiple dispatch: Groovy.

def OPERATOR_MAP = [
    "+": { args ->
        int result = 0;
        for (int i : args) {
           result += i;
        }
        return result;
    },
    "*": { args ->
        int result = 1;
        for (int i : args) {
            result *= i;
        }
        return result;
    },
    "-": { args ->
        if (args.length != 2) {
            throw new IllegalArgumentException("wrong arity");
        }
        return args[0] - args[1];
    },
    "/": { args ->
        if (args.length != 2) {
            throw new IllegalArgumentException("wrong arity");
        }
        return args[0] / args[1];
    }]

public interface Visitor {
    void apply(Node node);
}

public class Node {
    public void visit(Visitor visitor) {
        visitor.apply(this);
    }
}

public class Literal extends Node {
    private int value;
    public Literal(int value) {
        this.value = value;
    }
    public int getValue() {
        return value;
    }
}

public class Operation extends Node {
    private def children;
    private String name;
    public Operation(def children, String name) {
        this.children = children;
        this.name = name;
    }
    public String getName() {
        return name;
    }
    public def getChildren() {
        return children;
    }
}

public class EvalVisitor implements Visitor {
    private int value;
    private def operatorMap;
    public EvalVisitor(def operatorMap) {
        this.operatorMap = operatorMap;
    }
    public int evaluate(Node node) {
        apply(node);
        return this.value;
    }
    public void apply(Literal node) {
        value = node.getValue();
    }
    public void apply(Operation operation) {
        Node[] children = operation.getChildren();
        int[] values = new int[children.length];
        for (int i=0; i<children.length; i++) {
            EvalVisitor visitor = new EvalVisitor(operatorMap);
            children[i].visit(visitor);
            values[i] = visitor.getValue();
        }
        String name = operation.getName();
        def op = operatorMap[name];
        if (op == null) {
            throw new RuntimeException("unknown operation " + name);
        }
        this.value = op(values);
    }
    public void apply(Node node) {
        throw new RuntimeException("unknown node instance");
    }
    public int getValue() {
        return value;
    }
}

println(new EvalVisitor(OPERATOR_MAP).evaluate(
    new Operation([
        new Operation([ new Literal(2), new Literal(5)], "*"),
        new Literal(3)],
        "-")));

The Java code was taken as the basis for the Groovy code, as one can clearly see. As mentioned before, the method apply() of the visitor that delegated.

The method visit() is not really needed, since we use run time type information, anyway. But let us first look at an implementation in Common Lisp.

(defpackage :expression-tree
  (:use :common-lisp))

(in-package :expression-tree)

(defparameter *operator-map*
  `((+ . ,#'+)
    (* . ,#'*)
    (- . ,#'-)
    (/ . ,#'/)))

(defclass visitor () ())

(defgeneric visitor-apply (visitor node))


(defclass node () ())

(defgeneric visit (node visitor))

(defmethod visit ((node node) (visitor visitor))
  (visitor-apply visitor node))


(defclass literal (node)
  ((value :reader literal-value :initarg :value)))

(defclass operation (node)
  ((name     :reader operation-name     :initarg :name)
   (children :reader operation-children :initarg :children)))


(defclass eval-visitor (visitor)
  ((value :accessor eval-visitor-value :initform 0)))

(defmethod visitor-apply ((visitor visitor) (node literal))
  (setf (eval-visitor-value visitor) (literal-value node)))

(defmethod visitor-apply ((visitor visitor) (node operation))
  (let ((operation (cdr (assoc (operation-name node) *operator-map*))))
    (when (null operation)
      (error "unknown operation ~A" (operation-name node)))
    (setf (eval-visitor-value visitor)
          (apply operation
                 (mapcar (lambda (child)
                           (let ((sub-visitor (make-instance 'eval-visitor)))
                             (visit child sub-visitor)
                             (eval-visitor-value sub-visitor)))
                         (operation-children node)))))

(defun expression-evaluate (node)
  (let ((visitor (make-instance 'eval-visitor)))
    (visit node visitor)
    (eval-visitor-value visitor)))

(expression-evaluate
 (make-instance 'operation :name '-
                :children
                (list
                 (make-instance 'operation
                                :name '*
                                :children
                                (list
                                 (make-instance 'literal :value 2)
                                 (make-instance 'literal :value 5)))
                 (make-instance 'literal :value 3))))

The same approach is used here with the visitor class and the visit method. The fact that visit really just calls visitor-apply with swapped arguments, makes it even more obvious that this method is not really needed in dynamic typed languages. So let us get rid of this extra step. And while we are at it, let us define a function sexp-to-ast that transforms an s-expression into a node graph.

(defpackage :expression-tree
  (:use :common-lisp))

(in-package :expression-tree)

(defparameter *operator-map*
  `((+ . ,#'+)
    (* . ,#'*)
    (- . ,#'-)
    (/ . ,#'/)))

(defclass visitor () ())

(defclass node () ())

(defgeneric visit (node visitor))

(defclass literal (node)
  ((value :reader literal-value :initarg :value)))

(defclass operation (node)
  ((name     :reader operation-name     :initarg :name)
   (children :reader operation-children :initarg :children)))


(defclass eval-visitor (visitor)
  ((value :accessor eval-visitor-value :initform 0)))

(defmethod visit ((node literal) (visitor eval-visitor))
  (setf (eval-visitor-value visitor) (literal-value node)))

(defmethod visit ((node operation) (visitor eval-visitor))
  (let ((operation (cdr (assoc (operation-name node) *operator-map*))))
    (when (null operation)
      (error "unknown operation ~A" (operation-name node)))
    (setf (eval-visitor-value visitor)
          (apply operation
                 (mapcar (lambda (child)
                           (let ((sub-visitor (make-instance 'eval-visitor)))
                             (visit child sub-visitor)
                             (eval-visitor-value sub-visitor)))
                         (operation-children node))))))

(defun expression-evaluate (node)
  (let ((visitor (make-instance 'eval-visitor)))
    (visit node visitor)))
    (eval-visitor-value visitor)))

(defclass eval-visitor (visitor) ())

(defmethod visit ((node literal) (visitor visitor))
  (literal-value node))

(defmethod visit ((node operation) (visitor visitor))
  (let ((operation (cdr (assoc (operation-name node) *operator-map*))))
    (when (null operation)
      (error "unknown operation ~A" (operation-name node)))
    (apply operation
           (mapcar (lambda (child)
                     (let ((sub-visitor (make-instance 'eval-visitor)))
                       (visit child sub-visitor)))
                   (operation-children node)))))



(defun sexp-to-ast (sexp)
  (if (listp sexp)
      (let ((children (mapcar #'sexp-to-ast (cdr sexp))))
        (make-instance 'operation :name (car sexp) :children children))
      (make-instance 'literal :value sexp)))

(expression-evaluate (sexp-to-ast '(- (* 2 5) 3)))

Visitors can have state. We exploited that fact, but let us redefine visit to return a value. This way we can get rid of the state and reuse our visitor.

(defpackage :expression-tree
  (:use :common-lisp))

(in-package :expression-tree)

(defparameter *operator-map*
  `((+ . ,#'+)
    (* . ,#'*)
    (- . ,#'-)
    (/ . ,#'/)))

(defclass node () ())

(defgeneric visit (node visitor))

(defclass literal (node)
  ((value :reader literal-value :initarg :value)))

(defclass operation (node)
  ((name     :reader operation-name     :initarg :name)
   (children :reader operation-children :initarg :children)))


(defclass eval-visitor () ())

(defmethod visit ((node literal) (visitor eval-visitor))
  (literal-value node))

(defmethod visit ((node operation) (visitor eval-visitor))
  (let ((operation (cdr (assoc (operation-name node) *operator-map*))))
    (when (null operation)
      (error "unknown operation ~A" (operation-name node)))
    (apply operation
           (mapcar (lambda (child) (visit child visitor))
                   (operation-children node)))))

(defun expression-evaluate (node)
  (let ((visitor (make-instance 'eval-visitor)))
    (visit node visitor)))


(defun sexp-to-ast (sexp)
  (if (listp sexp)
      (let ((children (mapcar #'sexp-to-ast (cdr sexp))))
        (make-instance 'operation :name (car sexp) :children children))
      (make-instance 'literal :value sexp)))

(expression-evaluate (sexp-to-ast '(- (* 2 5) 3)))

This, of course, can also be done in Groovy (and in Java). Here it is (with OPERATOR_MAP omitted):

public interface Visitor {
    Object apply(Node node);
}

public class Node {
}

public class Literal extends Node {
    private int value;
    public Literal(int value) {
        this.value = value;
    }
    public int getValue() {
        return value;
    }
}

public class Operation extends Node {
    private def children;
    private String name;
    public Operation(def children, String name) {
        this.children = children;
        this.name = name;
    }
    public String getName() {
        return name;
    }
    public def getChildren() {
        return children;
    }
}

public class EvalVisitor implements Visitor {
    private def operatorMap;
    public EvalVisitor(def operatorMap) {
        this.operatorMap = operatorMap;
    }
    public int apply(Literal node) {
        return node.getValue();
    }
    public int apply(Operation operation) {
        Node[] children = operation.getChildren();
        int[] values = new int[children.length];
        for (int i=0; i<children.length; i++) {
            values[i] = apply(children[i]);
        }
        String name = operation.getName();
        def op = operatorMap[name];
        if (op == null) {
            throw new RuntimeException("unknown operation " + name);
        }
        return op(values);
    }
    public Object apply(Node node) {
        throw new RuntimeException("unknown node instance");
    }
}


println(new EvalVisitor(OPERATOR_MAP).apply(
    new Operation([
        new Operation([ new Literal(2), new Literal(5)], "*"),
        new Literal(3)],
        "-")));

Here it is in Java (with OPERATOR_MAP omitted):

import java.util.HashMap;
import java.util.Map;

public class ExpressionTree {
    
    public static interface Operator {
        int eval(int ... args);
    }
    
    public static interface Visitor {
        void apply(Node node);
    }
    
    public static class Node {
        public void visit(Visitor visitor) {
            visitor.apply(this);
        }
    }
    
    public static class Literal extends Node {
        private int value;
        public Literal(int value) {
            this.value = value;
        }
        public int getValue() {
            return value;
        }
    }
    
    public static class Operation extends Node {
        private Node[] children;
        private String name;
        public Operation(Node[] children, String name) {
            this.children = children;
            this.name = name;
        }
        public String getName() {
            return name;
        }
        public Node[] getChildren() {
            return children;
        }
    }
    
    public static class EvalVisitor implements Visitor {
        private int value;
        public int evaluate(Node node) {
            apply(node);
            return this.value;
        }
        public void apply(Literal node) {
            value = node.getValue();
        }
        public void apply(Operation operation) {
            Node[] children = operation.getChildren();
            int[] values = new int[children.length];
            for (int i=0; i<children.length; i++) {
                EvalVisitor visitor = new EvalVisitor();
                children[i].visit(visitor);
                values[i] = visitor.getValue();
            }
            String name = operation.getName();
            Operator op = OPERATOR_MAP.get(name);
            if (op == null) {
                throw new RuntimeException("unknown operation " + name);
            }
            this.value = op.eval(values);
        }
        public void apply(Node node) {
            if (node instanceof Literal) {
                apply((Literal) node);
            }
            else if (node instanceof Operation) {
                apply((Operation) node);
            }
            else {
                throw new RuntimeException("unknown node instance");
            }
        }
        public int getValue() {
            return value;
        }
    }

    public static void main(String[] args) {
        System.out.println(
                new EvalVisitor().evaluate(
                        new Operation(new Node[] {
                                new Operation(new Node[] {
                                        new Literal(2),
                                        new Literal(5)},
                                        "*"),
                                new Literal(3)},
                                "-")));
    }

}

For the sake of completeness here is the mentioned variant in Java that does not use dynamic type features (with OPERATOR_MAP omitted):

import java.util.HashMap;
import java.util.Map;

public class ExpressionTree {
    
    public static interface Operator {
        int eval(int ... args);
    }
    
    public static interface Visitor<T> {
        T apply(Literal node);
        T apply(Operation node);
    }
    
    public static abstract class Node {
        public abstract Integer visit(Visitor<Integer> visitor);
    }
    
    public static class Literal extends Node {
        private int value;
        public Literal(int value) {
            this.value = value;
        }
        public int getValue() {
            return value;
        }
        public Integer visit(Visitor<Integer> visitor) {
            return visitor.apply(this);
        }
    }
    
    public static class Operation extends Node {
        private Node[] children;
        private String name;
        public Operation(Node[] children, String name) {
            this.children = children;
            this.name = name;
        }
        public String getName() {
            return name;
        }
        public Node[] getChildren() {
            return children;
        }
        public Integer visit(Visitor<Integer> visitor) {
            return visitor.apply(this);
        }
    }
    
    public static class EvalVisitor implements Visitor<Integer> {
        public Integer apply(Literal node) {
            return node.getValue();
        }
        public Integer apply(Operation operation) {
            Node[] children = operation.getChildren();
            int[] values = new int[children.length];
            for (int i=0; i<children.length; i++) {
                values[i] = children[i].visit(this);
            }
            String name = operation.getName();
            Operator op = OPERATOR_MAP.get(name);
            if (op == null) {
                throw new RuntimeException("unknown operation " + name);
            }
            return op.eval(values);
        }
    }

    public static void main(String[] args) {
        System.out.println(
                new EvalVisitor().apply(
                        new Operation(new Node[] {
                                new Operation(new Node[] {
                                        new Literal(2),
                                        new Literal(5)},
                                        "*"),
                                new Literal(3)},
                                "-")));
    }

}

So, what does all this tell us? Two things are obvious: 1) To get rid of the method visit() in the target object structure we need dynamic typing. 2) To get rid of the manual dispatching we need either multiple dispatch or we have to implement the visit() in every concrete sub-class of the target object structure (here all sub-classes of Node).

But there is still more to it. There is a subtle difference between the Groovy and the Common Lisp version. To emphasise this, in CL the visiting method is visit, but in Groovy it is apply. The method apply belongs to the visitor class. In CL a method does not belong to any class, but it is just a question of specialisation.

Character Encoding --- Part 2: HTTP Auth

posted by Christian in Programming

My last blog entry was about the character encoding problems I had with Klatschbase. Of course, there has to be a follow-up, since encoding problems are so ubiquitous.

I did a change to hunchentoot's basic authorization method to allow utf-8 login name and password. As Edi Weitz pointed out in the hunchentoot mailinglist, this change is not standard compliant. Thinking about it, my assumptions were quite stupid. Why should the header, that is sent by the client, care about the standard encoding of the server. If the encoding could depend on anything, than it would be the content encoding. But this is for the body, and certainly not for a header field that might even appear before the content encoding definition.

Anyway, Edi Weitz's post showed up the interesting parts of the concerned RFCs. RFC 2617 defines the basic auth itself. However, RFC 2616 defines how words of TEXT must be encoded, which basically says, that non-ISO-8859-1 characters must use RFC 2047 encoding.

The consequences are disturbing. RFC 2047 defines an encoding scheme, that itself might use Base 64. So you have a Base 64 encoded string inside a Base 64 encoded string. It's a bit overhead, but that's not the bad part. A big problem in RFC 2047 IMHO is the restriction that encoded words have to be no longer than 75 charcters. If you want to encode more, you must use multi-line fields. Combined with a variable length encoding, this is messy. If you want to use some other library that takes care of the actual encoding of the characters to octets, you only have the choice to assume the worst case, i.e., that each character takes up the maximum number of bytes it can (4 for UTF-8 and UTF-16). In sum, this yields a lot of overhead.

And I somehow doubt that a lot of HTTP clients and servers implement this correctly or at all. Most HTTP software will probably simply assume ISO-8859-1 encoded credentials. But with the rising of more and more REST APIs this will become a real issue.

I have started an ASDF-installable library called CL-RFC2047 to handle this de- and encoding according to the RFC. It only works on strings, but you probably don't want to handle data with such an encoding that does not easily fit into memory. I also ported the encoding function to javascript, so I can use it in Klatschbase on the client side.

For the server side, I once again patched Hunchentoot's authentication method:

(defun authorization (&optional (request *request*))
  "Returns as two values the user and password \(if any) as encoded in
the 'AUTHORIZATION' header.  Returns NIL if there is no such header."
  (let* ((authorization (header-in :authorization request))
         (start (and authorization
                     (> (length authorization) 5)
                     (string-equal "Basic" authorization :end2 5)
                     (scan "\\S" authorization :start 5))))
    (when start
      (destructuring-bind (&optional user password)
	  (split ":" (base64:base64-string-to-string
		      (subseq authorization start)))
	(labels ((decode (str)
		   (if (and (> (length str) 2)
			    (string= "=?" str :end2 2))
		       (cl-rfc2047:decode str)
		       str)))
	  (values (decode user) (decode password)))))))

There is still a problem here: The RFC defines username and password as *TEXT, which means, they could be split up, meaning that encoded and non-encoded text can be mixed. The above method just assumes that the whole text (i.e., username or password) is either unecoded or encoded.

And by the way: Another consequence is that the username must not contain a collon. I think that could have been avoided if username and password would have been two colon seperated fields that are Base 64 encoded, instead of having the whole thing Base 64 encoded.

I do not want to rant about the HTTP protocol here. It is a really great protocol. I just wanted to comment on a problem I had here.

Character Encoding --- a Never Ending Story

posted by Christian in Programming

Character encoding really is a never ending story. I am a huge fan of Unicode and UTF-8. But it is still not the default everywhere. This blog entry is about the problems I encountered while making Klatschbase working correctly with UTF-8.

First of all, SBCL - my common lisp implementation of choice - supports UTF-8 out of the box. So, no problem here.

Hunchentoot, Edi Weitz's wonderful http server, supports UTF-8, but per default uses ISO-8859-1. It is easy to change the default behaviour, but it is also easy to overlook this.

(defparameter hunchentoot:*default-content-type* "text/html; charset=utf-8")
(defparameter hunchentoot:*hunchentoot-default-external-format*
  (flexi-streams:make-external-format :utf8))

Now at least the http body content is set correctly. But the next pitfall for me was the hashing of passwords. I use Ironclad for this. But Ironclad's helper functions only support ASCII. I somehow even failed to notice this problem until reading Leslie Polzer's blog and made a blatantly stupid comment. However, first I wrote a solution for the problem using the encoding translation library Babel which is easy to use. But I did not wanted to introduce another dependency, so I rewrote it to use Flexi Streams.

(defun sha256 (str)
  (let* ((utf8 (flexi-streams:make-external-format :utf-8))
	 (str* (flexi-streams:string-to-octets str :external-format utf8)))
    (ironclad:digest-sequence :sha256 str*)))

Now with this working, I stumbled upon another interesting fact: the JavaScript function escape uses a non-standard approach to convert non-ascii unicode characters. Wikipedia's percent encoding entry is enlightening. This is not so important, because jQuery takes care of these details. Just for the record: I don't know why the encoding used by the escape function was rejected by the W3C, but I think it make sense to reject it, because it uses a fix size of 16 bits. Since Unicode already needs more and grows further, this is not sensible.

But still, the characters where not decoded correctly on the server side when send via http auth header. The problem this time was in Hunchentoot's authorize method. It uses cl-base64's base64-string-to-string which does not respect the used encoding (it does not even seem to have a way to specify it). So, I fixed this, again with flexi streams.

(defun authorization (&optional (request *request*))
  "Returns as two values the user and password \(if any) as encoded in
the 'AUTHORIZATION' header.  Returns NIL if there is no such header."
  (let* ((authorization (header-in :authorization request))
         (start (and authorization
                     (> (length authorization) 5)
                     (string-equal "Basic" authorization :end2 5)
                     (scan "\\S" authorization :start 5))))
    (when start
      (let* ((auth-octets (base64:base64-string-to-usb8-array
			   (subseq authorization start)))
	     (auth (octets-to-string auth-octets
				     :external-format
				     *hunchentoot-default-external-format*)))
	(destructuring-bind (&optional user password)
	    (split ":" auth)
	  (values user password))))))

It is always interesting how much trouble character encodings are. I just hope the Unicode and UTF-8 become much more common. Just as a side note: a colleague just stumbled over a problem with a version of sed that did not respect the UTF-8 setting of the system, leading to umlauts beeing interpreted as two characters.