Transactional Data Versioning

One of my goals for setting up stmts was to provide an avenue to talk about the reasons why, every now and again, I feel like building my own language. These reasons tend to boil down to ideas and concepts, and I often find myself talking about them with my friend Justin White, whose impressive work I will link to later. (I once said “I didn’t think of you, but I thought of particle effects, which is close enough”, which he wore as a status message for the better part of the following week.)

Just last week, we were talking about the difficulty of working with shared data under concurrency. Software Transactional Memory models tend to rely on a brute-force “keep retrying” approach, which feels unnerving, even though in a situation where actual transactionality (like account balances) is paramount, the access would instead be serialized and the state never shared as such. It comes down to a trade-off: either you delay your execution by queuing or locking (and get to perform the operation on fresh data), or you work with data that might go stale as you go.
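To make the “keep retrying” flavor concrete, here’s a minimal sketch in the same optimistic spirit (a compare-and-swap loop rather than a full STM, and the class name is mine): read a snapshot, compute against it, and publish only if nobody changed the value in the meantime; otherwise, retry with fresh data.

```java
import java.util.concurrent.atomic.AtomicLong;

class RetryingCounter {
    private final AtomicLong balance = new AtomicLong(0);

    long add(long delta) {
        while (true) {
            long old = balance.get();   // read a snapshot
            long next = old + delta;    // compute against the snapshot
            // commit only if the snapshot is still current;
            // if another thread got there first, loop and retry
            if (balance.compareAndSet(old, next)) {
                return next;
            }
        }
    }

    long get() { return balance.get(); }
}
```

The unnerving part is right there in the `while (true)`: under contention, the work in the loop body can be thrown away and redone any number of times.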

Functional languages, and increasingly the rest of the world, solve this by treating everything as values. Every change is a new value; you never change an object, you build a new one with your modifications applied. So Justin had an idea: what if objects were still “just” objects, but their mutability were batched into versions? (I’ll call this model TDV for Transactional Data Versioning because that’s what I named my short-lived Tamagotchi.)
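The “everything is a value” approach looks like this (a toy example of my own): you never mutate an object, you build a new one with the modification applied, and anyone still holding the old one keeps seeing the old data.

```java
// An immutable value type: "changing" x produces a new object.
final class Point {
    final int x, y;

    Point(int x, int y) { this.x = x; this.y = y; }

    // copy-with-modification instead of mutation
    Point withX(int newX) { return new Point(newX, y); }
}
```

With `Point p = new Point(1, 2)`, calling `p.withX(42)` yields a new point while `p.x` remains 1. TDV, by contrast, tries to keep a single mutable object identity and batch its mutations into versions.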

You’d enter a scope with an object and, through undisclosed methods, the current state of the object is stamped as the “previous” state. Any mutation to the object is recorded on top and remembered as the “next” state. If, when exiting the scope successfully, the object has any mutations, the “next” state becomes the current state. (We didn’t go into implementation details, but this last part is trivially doable by atomically exchanging the states.)

{ // TDV initiated somehow
    foo.x = 42;
    foo.z = 9;
}
foo.x == 42; // true here, but another thread wouldn't have
// seen 42 until the block above committed
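Here’s a hypothetical sketch of that commit step, using the atomic exchange mentioned above. All the names (`Foo`, `TdvCell`, `scope`) are mine, and this deliberately ignores the nested-scope questions that come next: mutations land on a private copy, and on normal exit the copy atomically replaces the current state, so no other thread ever observes half an update.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Consumer;

class Foo {
    int x, z;
    Foo(int x, int z) { this.x = x; this.z = z; }
    Foo copy() { return new Foo(x, z); }
}

class TdvCell {
    private final AtomicReference<Foo> current;

    TdvCell(Foo initial) { current = new AtomicReference<>(initial); }

    // readers always see a fully committed version
    Foo read() { return current.get(); }

    // the "scope": mutations land on a private copy ("next");
    // on normal exit, the copy atomically becomes the current state
    void scope(Consumer<Foo> body) {
        Foo next = read().copy(); // stamp "previous", work on "next"
        body.accept(next);
        current.set(next);        // atomic exchange on success
    }
}
```

So `cell.scope(f -> { f.x = 42; f.z = 9; })` commits both writes in one step. Note this sketch quietly picks an answer to the upcoming question: reads inside the scope (via the copy) see the in-progress version, while reads outside it see the previous one.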

At the outset, this seemed like a cool enough idea. The concept of object identity could be retained; it’d be more obvious that code could be fondling old data, and that would be okay because you couldn’t mistakenly receive “half an update”; and so on. But the more we dug, the more issues started popping up.

So if you write to the object and then read back that field, do you get the version in progress?

{ // TDV initiated somehow
    foo.x = 42;
    42 == foo.x; // ???
}

You might say yes. Well then, what happens with deeper scopes, or method calls?

assert(foo.x == 98);
{ // TDV initiated somehow
    foo.x = 42;
    var y = foo.x; // is y 42 (next') or 98 (prev)?
    { // TDV initiated somehow, again
        var i = foo.x; // is i 42 (next') or 98 (prev)?
        foo.x = 37;
        var j = foo.x; // is j 37 (next''), 42 (next') or 98 (prev)?
    }

    // at this point, if the previous block was a method call,
    // you could not rely on anything before the call being
    // retained.

    var k = foo.x; // is k 37 (next'' having been committed),
    // 98 (prev) or 42 (next')?
}
var q = foo.x; // is q 37 (next'' due to being the latest assignment)
// or 42 (next' due to being the latest scope to close)?

It turns out that you’d end up with an unmanageable tree of new versions, and it’s unclear when they split off and what they base their data on. We toyed with an explicit construct, advance foo, to introduce a commit point. But what if you advanced and then began editing in a new nested block? The block would presumably make its copy from the newly minted version, so would the new version be visible from other threads too? If you made three changes in succession, what would the previous version refer to? (The previous version remains notable because if you reject the changes, that’s what they will revert to.) Would they coalesce into one big new version?

Maybe this sounds academic, but it matters in a real way: Justin was heavily on the side of having every read return the result from the previous version and every write affect the future value, figuring that this would best preserve the semantics of the system, which is true. But it’s also true that doing so and then calling into a method leaves the method clueless about any pending changes that it should be helping prepare the new version for.

Let’s say the object represented a person with a set number of hours and an hourly rate, and the method was supposed to precalculate the pay for a week’s work. If the new values don’t carry over, you can’t precalculate the new pay; you can only recalculate the old pay. This is a textbook case of why you’d want a method in the first place: code reuse.

You could solve this by extracting the new values into local variables, which are then passed to the method (or closed over by a closure) and manipulated at will, and then written back at the end. (Being local, the variables are also naturally immune to the parallelism bugs that prompted the idea in the first place.)
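The workaround can be sketched like so, reusing the pay example (the `Person` fields and method names are illustrative): the reusable method receives plain values, so it computes against the pending values rather than the stale previous version, and the scope writes everything back at the end.

```java
class Person {
    int hours, rate, weeklyPay;
}

class Payroll {
    // the reusable bit: works on whatever values it is handed,
    // independent of any versioning scheme
    static int weeklyPay(int hours, int rate) {
        return hours * rate;
    }

    // the "scope": new values live in locals, get fed to the
    // shared method, and are written back in one place at the end
    static void precalculate(Person p, int newHours, int newRate) {
        int pay = weeklyPay(newHours, newRate); // the NEW pay, not the old one
        p.hours = newHours;
        p.rate = newRate;
        p.weeklyPay = pay;
    }
}
```

It works, but notice the cost: the versioning machinery no longer buys you anything inside the scope, since you’ve effectively reinvented the copy-then-commit discipline by hand.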

In the end, we fiddled some more and ended up putting away this idea for later. Every time I go back to look at this, I feel the onset of brain freeze. We don’t solve every problem, but we do have fun trying.

Maybe there’s a reason this doesn’t really exist. Or maybe it does exist with everything already worked out, but as a library or in an obscure Dutch paper. What do you think about the approach to the problem? Have you seen anything like it? What kind of conclusions did those people come to?