Thursday, August 20, 2009

This Big: Infinity

So, does the iterative algorithm we last discussed actually terminate? It's easy enough to see that it would if there were only a finite number of types, but last time I claimed that there are in fact an infinite number of concrete types. Let's see why:

class Pointer
{
  private var obj;
  public new(o) {this.obj = o;}
  public Get() {return this.obj;}
  public Set(o) {this.obj = o;}
}

Using just this class (and not even the "built-in" classes like Int and so on) we can construct an infinite number of concrete types. We have Pointer[obj: {}; ...], and Pointer[obj: {Pointer[obj: {}; ...]}; ...], and so on (the fields Get and Set have been elided). That is, we could have a pointer to the empty typeset, or a pointer to a pointer to the empty typeset, or a pointer to a pointer to a pointer, and on and on.

So clearly we can't just rely on there being a finite number of concrete types when iterating to ensure the algorithm eventually terminates. We need to be a bit more clever. First, let's write some Mix code that leads to an infinite number of types as described above:

main()
{
  var p = new Pointer(null);
  for(...)
  {
    p = new Pointer(p);
  }
}

So, what happens here? First, p's typeset just has a Pointer[...] to an empty typeset. After one iteration, its typeset has two elements: Pointer[...], as before, and another Pointer to its own typeset. That is, if we could name p's typeset (let's call it p), then p = {Pointer[obj:{}; ...], Pointer[obj: p; ...]}.

So, what happens after another iteration? In fact, we try to add the same concrete type to the typeset again: a Pointer to p. As long as we can recognize that this is the same concrete type as last time, we can stop iterating as no typeset will have changed.
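To make "recognize that this is the same concrete type" a bit more concrete, here is a minimal Python sketch of comparing two possibly-cyclic concrete types. It is purely my illustration (the names ConcreteType and same_type are made up, and this is not how the analyser is actually written), but it shows the essential trick: remember which pairs of types you are already in the middle of comparing, and treat those pairs as equal instead of recursing forever.

# Illustration only: a concrete type is a class name plus a mapping from
# field names to typesets, and a typeset is just a Python set of concrete
# types (so a typeset can contain a type whose fields point back at it).
class ConcreteType:
    def __init__(self, name, fields=None):
        self.name = name            # e.g. "Pointer"
        self.fields = fields or {}  # e.g. {"obj": set()}

def same_type(a, b, assumed=frozenset()):
    # If we're already in the middle of comparing this pair, assume equality
    # rather than looping forever on a cycle.
    if (id(a), id(b)) in assumed:
        return True
    if a.name != b.name or a.fields.keys() != b.fields.keys():
        return False
    assumed = assumed | {(id(a), id(b))}
    # Two typesets match if every element of each has a match in the other.
    for field in a.fields:
        xs, ys = a.fields[field], b.fields[field]
        if not all(any(same_type(x, y, assumed) for y in ys) for x in xs):
            return False
        if not all(any(same_type(x, y, assumed) for x in xs) for y in ys):
            return False
    return True

With a check like this in hand, adding "a Pointer to p" a second time can be detected as a no-op, so the typeset stops growing.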

So, that takes care of this example, but what about in general? Is there any way to construct an infinite number of concrete types, without them being cyclic? Clearly it is possible to define a non-cyclic concrete type of arbitrary size, but is it possible to write code that leads to one? We'll see.

Wednesday, August 19, 2009

Loopiness and Stringiness

Actually, there will be no strings here; right now I'm just going to be talking about loops. Earlier I mentioned that the simple algorithm for inferring types couldn't handle loops (amongst other things), so let's see more clearly why they are an issue, and try to solve it.

I gave a simple example last time:

main()
{
  var i = "Hello!";
  for(...)
  {
    i.SomeStringMethod();
    i = new Foo();
  }
  i.SomeFooMethod();
}

The problem is that the call to SomeStringMethod (which is presumably valid when i is a string) is valid if the loop runs only once, but isn't if the loop runs more than once. So, we'll assume that the loop runs more than once; now how do we build this assumption into the algorithm?

At first glance we might simply analyse the body of the loop twice; the first time, when we check the call to SomeStringMethod, i is known only to have type String[], and so the call succeeds. The second time, i's typeset contains both String[] and Foo[...], and so the call fails (assuming that Foo doesn't implement SomeStringMethod, which is rather the point of that name).

That sounds good, and it's simple, right? Unfortunately, it's not right, either. Consider this more convoluted piece of code:

main()
{
  var a = new SomeClass();
  var b = new SomeOtherClass();
  var c = new YetAnotherClass();
  
  for(...)
  {
    a = b;
    b = c;
    c = a;
  }
  a.SomeMethod();
}

It's pretty obvious what's going on here: depending on how many times the loop runs, this call could be valid, or it could be invalid. Let's say that both SomeClass and SomeOtherClass implement SomeMethod, but YetAnotherClass does not. Then after 0 iterations the call would succeed, and after 1 it would succeed, but after 2 it would fail. So, let's analyse the loop body 3 times; 3 is enough for anybody, right?

But wait, you say: we could keep adding variables (d, e, and so on) to cause a greater number of iterations to succeed before failure. Right you are. Instead of hardcoding the number of times we should check the body of the loop, we need a way to determine that during analysis.

One way we could approach this is by analysing the code in a separate pass and calculating the number of times, perhaps by checking the number of variables. That would work in the above case, but what about in more complicated cases? Trust me, there are cases in which that sort of simple analysis won't work. Maybe there's a more complicated analysis? I don't know it.

Another way, and indeed the way I will choose, is to iterate to a fixed point. That is, we keep on analysing the body of the loop until no typeset changes; at that point every typeset should be complete. Fixed-point analyses are really common in Computer Science, and you seem to bump into them quite frequently when working with compilers. The first time I ran into one (I think) was when I was implementing a compiler for Tiger (a simple imperative language) based on Andrew Appel's book Modern Compiler Implementation in ML, which was, incidentally, also the first time I programmed in Standard ML. The specific usage was during liveness analysis, solving dataflow equations.

Anyway, to see how this works, let's run through the last example.

At the outset, each typeset (for a, b, and c) has a single element in it. After the first iteration, a has SomeClass[] and SomeOtherClass[], b has SomeOtherClass[] and YetAnotherClass[], and c has YetAnotherClass[] and SomeClass[].

After the second iteration, each typeset has all three classes; we could stop at this point, but how would we know to do so in the general case? Instead we continue.

After the third iteration, each typeset still has all three classes; no typeset has changed. Now we can stop; we know that further iterations will not yield additional information.
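If it helps to see the fixed-point loop itself spelled out, here is a tiny Python sketch of this example. It is purely illustrative: the real analyser is obviously not three hard-coded set unions, I'm modelling each typeset as just a set of class names, and the name analyse_loop is my own. The shape of the loop (do a pass, compare against the snapshot taken before the pass, stop when nothing changed) is the important part.

def analyse_loop():
    typesets = {"a": {"SomeClass"},
                "b": {"SomeOtherClass"},
                "c": {"YetAnotherClass"}}
    while True:
        before = {v: set(ts) for v, ts in typesets.items()}
        # One abstract pass over the loop body: each assignment unions the
        # right-hand side's typeset into the left-hand side's. For simplicity
        # each pass reads the typesets as they were when the pass started;
        # reading the freshly updated sets instead only reaches the fixed
        # point in fewer passes.
        typesets["a"] |= before["b"]  # a = b
        typesets["b"] |= before["c"]  # b = c
        typesets["c"] |= before["a"]  # c = a
        if typesets == before:        # nothing changed: fixed point reached
            return typesets

print(analyse_loop())  # every typeset ends up with all three class names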

There are a number of questions that remain to be answered. First, how do we know that we'll eventually stop adding concrete types to the typesets, and thereby terminate? Clearly there are an infinite number of concrete types, so we could conceivably continue to add types indefinitely.

Second, how do we know that the collection of types in the typesets is accurate after we reach a fixed point? With the simple algorithm I didn't talk about correctness because it was obviously correct (for the ridiculously limited subset of Mix to which it applied), but now correctness is less obvious.

Stay with me, and I'll get to answering these questions (or at least trying to) soon enough.

Tuesday, August 18, 2009

Static is only missing an N

So, I said that some folks might not be interested in static analyses like type correctness, or const correctness, or god knows what. And, to be honest, I can't blame you (too much), because in a lot of popular languages the available static analysis (the type system, in most cases) isn't that great. Maybe it isn't powerful enough to encode useful properties (C, anybody?), or maybe it is powerful enough if you really want it to be, but everyone sort of ignores it because it's more trouble than it's worth. Anyway, lots of people aren't all about static typing, are all about dynamic typing, and will kill you with an ASCII blade if you disagree. On the other hand, some people love static typing, and static analysis in general, more than they love their own aged grandmothers. They will fight you (and any passing dragons) to the death with the dual weapons of dataflow analysis and syntax-directed translation. Or something.

Anyway, the point is that I'm about to jump into the fray, so to speak, and try to convince you that thinking about static typing for a language like Mix is really a good idea. So, away we go!

First, there are a number of reasonable complaints about the static type systems of popular existing languages (clearly this list isn't even close to exhaustive, but it hits a few):

  1. They require a ton of extra code (everyone say "Java" with me).
  2. They can be subverted, making them not particularly useful (C's type system accepts code that will set you on fire; CCured is an interesting attempt to provide you with a fire extinguisher by translating flammable C to safe(r) C).
  3. They limit your freedom as a programmer, making programming slower.
  4. Their usefulness is subsumed by runtime tests.

Ok, lots of arguments against static typing. The first one I'll give away: lots of languages have got it "wrong" according to those voicing this particular complaint. But, others have gotten it right (I'm thinking O'Caml, Haskell, F#): type inference can save a language from failing this early in the game.

The second one we can give away in the same fashion: Java, C, C++, and all kinds of others share this problem, to varying degrees: in Java you can get a ClassCastException, and in C you can cause a core dump. But again, there are plenty of languages whose type system you cannot subvert to cause runtime problems, so stronger static typing could save us from this failing as well.

The third one is where the complaints start "getting real", to quote someone or another. Basically, the idea is this: when you're programming, especially during the early phases of a project, or when you're really just "exploring the solution space" (buzzword!), or doing rapid development, types can make your code less flexible. It can take longer to twist things around, to experiment. There's all kinds of inertia to the code, damnit. Well, again this can be true. Here's my argument: if there isn't any manifest typing (in the form of type annotations), and you have a "good" type system, then changing your program around in this manner should only ever irritate the type checker when you've actually done something wrong, which is exactly what you want, right? Of course, the problem becomes defining "good"; that's what this blog is about.

The fourth one is tricky as well. I read about this one in a book edited by Joel Spolsky called The Best Software Writing, in a piece by Bruce Eckel; go read this if you haven't (only takes a minute). Back already?

Ok, so the argument is that you're already writing tests (right?), and they already check all of the things that the type system checks, and more, so why bother with static typing? Good point, certainly. But, this seems to rely on two basic assumptions, at least one of which must be true.

  • Using a language with static typing costs something.
  • Dynamic typing saves that same something (time, or money, or whatever).

Hopefully the above (points 1 through 3) has convinced you that, while the first bit can be true, it doesn't have to be. As for the second bit, it's predicated on the assumption that you can't have a statically typed language that saves you the same something in most cases; we'll see about that!

So, if you take all my claims as true, you might be wondering why I went through all that trouble only to basically show that static and dynamic typing are more or less equivalent. Because that's the result of point 4 above, right? Well, here's where I'm going out on a limb to give static typing the edge (under certain conditions, but don't worry, dynamic typing gets to win, too; we're all winners here). In my personal experience, most of the features of dynamic languages that I use regularly could (probably) be incorporated in a suitably designed static type system (of course, things like eval aren't in the "most" I'm talking about; if you need stuff like that then dynamic takes the game).

Not much content here, but maybe if you were on the fence before you'll hear me out going forward.

Monday, August 17, 2009

Algorithmics Simplistic

Last time I said that my notion of correctness is simple: you should never invoke a method on an object that doesn't implement it. To start off discussion, let's look at a very simple algorithm that statically determines this. The subset of Mix that it handles is one without loops, without recursive calls, and with no inheritance; basically a fairly useless subset of Mix, but let's go with it.

The idea is simple: starting from the program's entry point (a free function I'll call main), we'll interpret the program, but instead of using regular values like ints, and strings, and objects, we'll use types; that is, we'll perform an abstract interpretation. Abstract interpretation, invented and first formalized by Cousot and Cousot, has a long history in static analysis; I'll be more or less ignoring the formalization from here on in, and just going with an informal approach. Check this out for a fairly readable introduction to "real" abstract interpretation.

Moving on to the actual algorithm, consider the following trivial code:

class A
{
  public var Member;
  
  public new(){ this.Member = 1; };
  
  public Method1(){};
  public Method2(){ this.Method1() };
};

main()
{
  var a = new A();
  a.Method2();
  a.Member;
  return;
}

So, starting in main, let's interpret the program. At the outset, we have a variable declaration; let's pretend that it was really written as a variable declaration followed by an assignment:

var a;
a = new A();

That lets us look at the two steps separately, instead of all at once. First we see the variable declaration, so we add a to the environment; we associate with a the empty typeset. A typeset can be thought of as a set containing all types that the variable is known to hold; by looking at how all of the types in a typeset intersect, we can understand what operations on a are guaranteed to be legal.

Next, there's an assignment: the effect of the assignment under the interpretation is that the typeset associated with a should be augmented by the right side's value; but how can we determine the right side's value? Let's continue interpreting. new (with a particular class name and constructor arguments, of which there are none in this case) corresponds to constructing a concrete type based on the class and then invoking the class's construction operator.

Concrete types are values that represent the properties of an object, namely its class name and its fields (both members and methods). Concrete types are what go into typesets. From now on, we can think of the value of a particular expression (under the abstract interpretation) as a typeset containing 0 or more concrete types.

So, first we construct the concrete type: it has name A, and it has 1 member and 2 methods, namely Method1 and Method2; for now we'll indicate this concrete type as follows: A[Member: {}; Method1: {function()}; Method2: {function()}]. This should be read as a concrete type "A with field 'Member' having the empty typeset, 'Method1' having a typeset of 1 element, and 'Method2' having a typeset of 1 element." The elements in the typesets for Method1 and Method2 are callable; I'll talk about callables in some other post, but for now the name should give you enough information to go on.
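If it helps to picture how concrete types and typesets fit together, here is one possible representation written as a small Python sketch. It is only an illustration of the data shapes described above (the names ConcreteType and Callable are mine, not the analyser's), not the actual implementation.

# Illustration only: a concrete type is a class name plus a mapping from
# field names to typesets; a typeset is just a set of values.
class Callable:
    """Stands in for the callable values mentioned above."""
    def __repr__(self):
        return "function()"

class ConcreteType:
    def __init__(self, name, fields):
        self.name = name      # the class name, e.g. "A"
        self.fields = fields  # field name -> typeset (a set of values)
    def __repr__(self):
        body = "; ".join(f"{k}: {v}" for k, v in self.fields.items())
        return f"{self.name}[{body}]"

# The concrete type constructed for `new A()`:
a_type = ConcreteType("A", {"Member": set(),
                            "Method1": {Callable()},
                            "Method2": {Callable()}})
print(a_type)  # A[Member: set(); Method1: {function()}; Method2: {function()}]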

Now that we've constructed the concrete type, we interpret the constructor. This means we analyze the body of the new operator for class A, using an environment in which this is bound to a typeset consisting of only the just-constructed concrete type. So, we look inside the operator's definition, and the first thing we see is another assignment.

Once again we add the value of the thing on the right to the typeset of whatever's on the left, and this time the thing on the right is just an integer; for now we'll say that the concrete type of an integer is Int[], therefore the value of the right side is a typeset containing only 1 concrete type. Essentially we're setting the typeset of the thing on the left to be the union of itself and the typeset of the thing on the right.
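In terms of the sketch just above (again, purely an illustration), that union is a one-liner:

# Continuing the sketch above: `this.Member = 1` unions the right-hand side's
# typeset (here just {Int[]}) into whatever Member's typeset already holds.
int_type = ConcreteType("Int", {})
a_type.fields["Member"] |= {int_type}
print(a_type)  # A[Member: {Int[]}; Method1: {function()}; Method2: {function()}]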

So, the result is that the concrete type we constructed above is now the following: A[Member: {Int[]}; Method1: {function()}; Method2: {function()}]. Finally we return to main; the important thing to notice now is that we know that the concrete type has a member Member that at some point could hold an Int[], and that a's intersection contains this concrete type.

Next we come to a method invocation:

  ...
  a.Method2();
  ...

It actually consists first of a field lookup and then a call. In particular, we first check that all of the types that a could have implement Method2, and then we check that each of the types in that field's typeset is callable. Again, a can have only 1 type, which does indeed implement Method2, and it is callable. So, we interpret the call, by looking in the definition of Method2, passing the relevant concrete type in a's intersection as this.

In Method2 we follow a similar process to check that this implements Method1, which it does. We go into the (vacuous) body of Method1, and eventually come all the way back to main.

Next we check that all of the types in a's intersection have a field named Member; again they do, and in this case its type is dropped on the floor (just as the value of a.Member would be dropped on the floor at runtime). Finally, we return, and that's that: all of the code has been interpreted, and no invocation failed, so the code is deemed correct.
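Spelled out in the Python sketch from above (still just an illustration, reusing ConcreteType, Callable, and a_type; the name check_invocation is mine), the check performed at a call site looks something like this:

# Illustration only: the check performed at a call site like a.Method2().
# Every concrete type in the receiver's typeset must have the named field,
# and every value in that field's typeset must be callable.
def check_invocation(receiver_typeset, method_name):
    for t in receiver_typeset:
        if method_name not in t.fields:
            raise TypeError(f"{t.name} does not implement {method_name}")
        if not all(isinstance(v, Callable) for v in t.fields[method_name]):
            raise TypeError(f"{t.name}.{method_name} is not callable")

check_invocation({a_type}, "Method2")     # succeeds silently
# check_invocation({a_type}, "ToString")  # would raise: A does not implement ToString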

That got really long winded, and we didn't even seem to do much: there were no errors, and a only ever had 1 type in its typeset, so it wasn't too interesting. So, let's briefly look at a slightly more interesting example; we'll ignore some aspects (like where rand comes from) to keep it simple.

class C
{
  public new(){};
  public Foo(){ if(rand() == 0) return 1; else return new A(); };
}

main()
{
  var c = new C();
  var i = c.Foo();
  i.ToString();
}

Interpreting main we see that, after the call to c.Foo(), i's typeset should have 2 concrete types, one corresponding to an integer, and the other to an instance of A. When we check to see if all of these types have a field named ToString in the last line of main we encounter an error: A doesn't implement ToString.
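In terms of the earlier sketch (purely an illustration; Int[] is modelled with no fields at all just to keep it short), the failing check would look like this:

# i's typeset after the call to c.Foo(): an Int[] and the A[...] from before.
i_typeset = {ConcreteType("Int", {}), a_type}
# check_invocation(i_typeset, "ToString")  # raises: does not implement ToString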

Hopefully that was pretty clear, and also pretty obvious. You might recall that I pretty severely limited the language at the outset, and now you might be able to guess why: if we allow recursion with this simple algorithm, it will go into an infinite loop, following recursive invocations over and over. If we allow loops, the problem is only slightly more subtle. Consider the following code:


main()
{
  var i = "Hello!";
  for(...)
  {
    i.ToString();
    i = new A();
  }
}

Our simple algorithm, with no modification, would not see an error here, though there clearly is one if the loop runs more than once. It will check the invocation i.ToString() when i's typeset contains only String[], and not when it also contains A[...].

So, lots of new stuff (not like there was much beforehand). I'll talk more about typesets, concrete types, callables, and I'll start to address both of these algorithmic issues in the next few posts.

Introductions

I said that Mix is an object-oriented language after the tradition of C# and Java, which is true enough. There are classes, and the syntax is C-like, and the operational semantics of the language are exactly what the average C# or Java programmer might expect. So, sample code:

class SomeClass
{
  private var someMember;
  
  public new(anArg, anotherArg)
  {
    if(anArg)
    {
      this.someMember = anotherArg.M(anArg);
      anotherArg.N();
    }
    else
    {
      this.someMember = new List();
    }
  }
  
  public GetSomething()
  {
    for(var i = 0; i < 42; i++)
    {
      this.someMember += i / 6;
    }
    return this.someMember;
  };
};

This class really doesn't do anything; it's here to highlight that Mix code really looks like some code you might have written in a language like C#. It has if and for, members and methods, and all kinds of normal stuff; if you don't like that everyday stuff you're not going to like Mix's syntax, unfortunately. In my opinion it is a reasonably usable and certainly widely known syntax, so I went with it.

The only interesting bit is that there aren't any types in the usual places. That is, you normally expect, in a statically typed language (unless your boss lets you write in O'Caml, or F#, or some other language with inference), to have to write types all over the place. If it were Java we'd need them for each member, for each method (return and argument types), and each local variable. In C# these days you can drop the types on locals. But here we haven't any at all, unless you count SomeClass itself, which I won't: it's more like a prototype, in that we use it to create an object, but not to classify objects later on, as we'll see.

So, this is the language in which I am interested: no types on members, methods, or local variables, looks somewhat like your everyday object-oriented language, simple enough. I could compile this down to Python, say, or Javascript, easy as that (where "that" is snapping your fingers). But I don't only care about compiling and running Mix programs, I care about analysing them for correctness at compile time (and really, all I would have done was give a different syntax to a fragment of one of those languages; that's no fun). Now, you might be the sort of person that isn't interested in static analysis; you probably won't care for Mix, either, so you might want to hang out with the folks that left when I mentioned braces.

Still here? Ok, so, what do I mean by correctness? I'm taking the very simple criterion of "doesn't call methods on objects that don't support those methods" as meaning "correct" with respect to the type system. So, I don't want to call Append on an Int, for example. We'll (eventually) see how this criterion can be used in some interesting and (hopefully) powerful ways, but for now let's leave it (my "correctness" criterion) at that.

To see what I mean, let's look at a (very slightly) more reasonable example, with PETS! (I like pets, in particular my cat, Mortimer). Here we go:

class Cat
{
  public Speak() { return "meow!"; };
  public Pounce(target) { ... };
}

class Dog
{
  public Speak() { return "arff!"; };
  public SearchForBombs() { ... };
};

main()
{
  var pet = null;
  if(somethingOrAnother)
  {
    pet = new Cat();
  }
  else
  {
    pet = new Dog();
  }
  print(pet.Speak());
};

So, super simple example, but you probably already get the idea: pet could be either a Dog or a Cat, but because both of them can Speak, we're all good. What if we tried to call Pounce on pet? According to Mix (that is, according to me), we'll treat it as an error: there is some path through main that leads to pet having an instance (of the class Dog in this case) that cannot Pounce. Even if the programmer knows, somehow, that somethingOrAnother is always true, the type system will treat this as an error.

One way to think of this, which I'll make more concrete in future posts, is that every target of a method invocation must always (syntactically, in the source) be an instance of a class that implements the method. So, in the above example, in the expression pet.Speak(), pet will, no matter what, be set to an instance of a class that implements Speak, and the call is therefore correct.

Anyway, that should give you a bit of the flavor of what's to come.

Friday, August 14, 2009

Concrete Mix

Concrete Mix (or just Mix, if you prefer a shorter handle) is a statically typed object oriented language. The language itself follows the tradition of languages like Java and C#, with an important difference: Mix is intended to be an experiment in complete type inference for such languages (that is, whole program type inference without any annotations). This blog will mostly detail the language and my experiments with static analyses on Mix, though I might occasionally mention some programming languages- and type theory-related items.