Monday, September 28, 2009

Funny Business with the Billion Dollar Mistake

First, null often gets a bad rap, and it deserves what it gets. Unfortunately, it can be quite handy, and it is pretty well understood by the average programmer, so I'm keeping it anyway, at least for now. Fortunately almost no one will use Concrete Mix, so it won't be an expensive mistake. Sidenote: I had to look this up while writing the above.

So, last time I said that null acts a bit strangely. To understand the how and the why, let's look at some code, and see how we would typecheck it:

main()
{
  var n = null;
  n.Foo();
  
  var m = null;
  m = new Bar();
  m.Bleck();
  
  return 0;
}

Some totally useless code, but it will prove illustrative, so bear with me. First, let's assume we're playing with the iterative algorithm I've been harping on about. So, after iterating to a fixed point, what does the typeset of n contain? Let's say that it contains nothing (that is, null indicates no type, and not an empty type, like void might indicate): {}. Then, should the call n.Foo() be deemed correct by the typechecker?

It turns out that classifying this call as incorrect has some unforseen consequences. I noticed the problem first when playing with a simple singly-linked list implementation. Consider the following code for a naive Length method:

class List
{
  private var m_Head;
  public List() {...};
  public Append(element) {...};
  public Length()
  {
    var count = 0;
    var current = this.Head;
    
    while(current is not null)
    {
      count += 1;
      current = current.Next();
    }
    return count;
  }
}

(Syntax note: as operator== can be overridden, Mix offers the is and is not operators for referential equality and inequality, respectively).

This code is a tiny bit more useful, though it's only a sketch. I'm sure you get the idea. The point is that, if we call Length on a list that is always empty, and we treat null as described above, we have a problem: the typeset of current is always empty, and so the call current.Next() would be flagged as incorrect by the typechecker. Now, you might suggest we perform some kind of nullness analysis and thereby determine that the body of the while loop cannot be executed, but this brings its own issues, so let's leave it alone for now.

So maybe our initial decision was wrong, and null should indicate the empty type. But this is no good, for consider m in the first code listing: its typeset would be something like {Null[]; Bar[...]}, and no method call on that typeset could ever succeed, as Null presumably implements no methods.

So it seems like the only solution (until we take a harder look at nullness analysis) is to classify calls on the empty typeset as correct. And in fact, this doesn't violate our definition of correctness, because the invocation n.Foo() will raise a null pointer exception, not a "method does not exist" exception.

Thursday, September 24, 2009

Recursion: See "Recursion"

So, taking a brief break from history, let's take yet another look at type inference in Mix. Way back when I mentioned that one of the restrictions we needed to place on the subset of Mix for which the algorithm worked was that no recursive calls we allowed. First, let's recall why; consider the following code:

Factorial(i)
{
  if(i == 0)
    return 1;
  else
    return i * Factorial(i - 1);
}

main()
{
  var i = Factorial(7);
}

A simple factorial, no big deal. So, what should they type of i be? Well, first let's get our terminology straight: what concrete types should be in the typeset of i?

Just by thinking about it we should conclude that it should hold only Int[...]. But, if we run the original, or even the iterative algorithm on this code, what happens? Infinite loopiness. The terror!

(Incidentally; if the call to Factorial in main were Factorial(7.0f), i would hold both Int[...] and Float[...] in it's typeset; the base case should tell you why).

The problem is that, when type checking the right hand side of the assignment in main, we are obligated to check the type of Factorial with an argument of Int[...]. In so doing, we check that Int[...] implements operator==, operator*, and operator-. And then we check Factorial with an argument of Int[...]. In so doing we check that... And on and on.

However, this is easy enough to fix (or at least, it's easy enough to partially fix): let's keep track of the "call chain", and whenever we start checking a new call, we'll search the call chain for said call. If it is not in the call chain, we'll add it to the call chain and check the type like normal. But if it is in the call chain, we get to stop type checking: we know that checking the call will only get us right back where we started.

Pause for a second: what exactly is a "call", and how do we represent a "call chain"? A call is just the (unique) name of a function or method, along with the type of each argument. This last bit is important, because otherwise we might realize that a previous call to AddTwoNumbers succeeded with integers, and imagine that it would succeed on files. A call chain is a list of such calls, in the order we've invoked them.

But, is finding a call in the call chain an error, or success? And anyway, what does it mean? Second question first: it means that there is a loop in the call graph, just like we saw in the code above, and furthermore that the recursive call is at the same types: the function is being used in a monomorphically recursive way. First question second: we'll treat this as a success, because it indicates that along this call chain all possible method invocations succeed (possibly while looping infinitely). If they succeed, it meets the correctness requirement we've already talked about. If it loops infinitely, it still meets this requirement. (Side note: we're essentially saying "it's correct, unless this other runtime error comes up, but that error will hide the incorrectness"; we'll see this with respect to null next time).

Side note: this is rather strange, in that it somehow means that we are interested in a greatest fixed point, whereas normally we are interested in least fixed points. We're taking the view (in a handwavish sort of way) that, if when trying to prove P, we eventually have to prove P, we'll consider this to be true, or provable.

So, if you take the above as correct, are we back to a terminating algorithm? Well... not necessarily. Once again we would have termination "for free" if only we had a finite number of types to deal with. Unfortunately, we don't, and so this doesn't terminate so easily. In fact we could force this to terminate by requiring that all recursive calls to a particular function have the same type (that is, make all functions monomorphically recursive), but that's no good -- check out this simple function that isn't monomorphically recursive:

identity(arg)
{
  if(identity(rand() == 42))
    return identity(arg);
  return arg;
}

It's just the identity function, but with a little twist: no matter what the type of arg is, the function might recursively called with a Bool[...] argument. In ML the type of the function is 'a -> 'a, but it is used in a polymorphically recursive way. Anyway, a strange function, but they can be found in "real life" (especially when mutual recursion comes into play); Henglein had an interesting (and approachable) paper on inference in the presence of generalized polymorphic recursion in an ML-like settings -- check it out. If you don't bother, let me just mention that inference for such a language can be shown to be undecidable (indeed that's a contribution of the paper) by showing that it is equivalent to semi-unification. And that means that, in what I'm trying to do with Mix, I will inevitably face some tough choices to reign in this undecidability somehow; sounds fun!

Friday, September 11, 2009

Been There

So, I'm not the first idiot to think about doing type inference in object-oriented languages, obviously. There's plenty of prior work in this area. Here's a list of some relevant research, by project.

  • O'Caml: that's what the O stands for, maybe? Anyway, it seems like O'Caml's object layer doesn't get much love, but the type system behind it is pretty interesting. It implements structural typing via row variables. "The Essence of Type Inference in ML", a chapter of Advanced Topics in Types and Programming Languages, is a very readable introduction to this (and indeed many interesting aspects of type inference in ML-like languages), written by Francois Pottier and Didier Remy.
  • Self: there's been a lot of work on type inference for Self, generally it seems for the purpose of efficient implementation. Ole Agesen's work on the Cartesian Product Algorithm (CPA) has been the basis for a number of different lines of research; for instance, Olin
  • Python: I haven't heard much on this front recently, but Mike Salib's thesis describes an application of the Cartesian Product Algorithm to Python which is pretty neat, though the subset of Python it is implemented for is a bit smaller than some might like.
  • Java: An extension of the CPA for dealing with data polymorphism (more on what that means in a second) called the Data-polymorphic CPA has been applied to Java in the context of detecting invalid downcasts.
  • More: we'll get to others at some point.

So, why don't I just implement some of this previous work and be done with it? Well, other than the fact that it isn't as fun as trying something new, each of the previously described approaches leaves something to be desired. So, let's first examine what we (ok, what I) want from the inference algorithm and language, and then see which criteria each of the aforementioned approaches meets.

  1. Structural Typing: If an expression implements a particular method, we should be able to call that method (subject to the "correctness" constraints we've already talked about).
  2. Recursive Types and Functions: A class C should be able to contain members that are instances of class C, see? And members should be able to call themselves recursively. Oh, and make that mutually in both cases.
  3. Data polymorphism: Variables should be able to hold objects of more than one type, again subject to the constraints already mentioned. This property is critical to useful data structures (we want lists that hold variables of several types, and that can be accessed in an intelligent way, for example. Or maps from strings to different kinds of entities in a super awesome MMOORRPPGG).

First, row variables a la O'Caml. They handle the structural typing end of things, and recursive types and functions, just fine. But row variables don't say anything about mutable data (and therefore data polymorphism), and O'Caml (like most ML-like languages) doesn't treat mutable data in the way that I wish it would. In O'Caml, a mutable data 'cell' is stored in a reference. When the reference created it is filled with some initial data, which can later be updated. However, the type that the reference holds cannot change; if the reference is first filled with a string, it cannot later be filled with an int.

In addition, the structural typing supported by row variables isn't quite as strong as we might like. It can deal with inferring the types of arguments to a function by the way that they are used, but some simple constructs throw the algorithm for a loop. For instance:

let someFun b =
  if b then
    new Foo
  else
    new Bar

The above will not compile: there is no implicit subsumption rule in O'Caml. The algorithm will not find a subtype of Foo and Bar, and use that in each branch of the if expression. It will only work if you explicitly "cast" to such a type (note that the cast is safe, and not a downcast). This is exactly the kind of code we will be writing (and be expecting to work!), so row variables as implemented in O'Caml aren't going to cut it.

I was going to continue with the rest of my bullets, but I've run out of ammo (and time), so you'll have to wait until next time.

Tuesday, September 8, 2009

Gone

So, I haven't written anything in the last week or two, my apologies. I was in Los Angeles visiting my good friend Anthony (1817). Went surfing every day, it was pretty much perfect. But now I'm back, and shall update with haste.