On the way to step functions: dreams of marshalable stacks
Lately, getting into the world of durable execution has been quite the challenge. Granted, in the world of web applications we often do not need it. However, anyone who has come in contact with electronic payments, money orders or verifications will likely encounter a need for it.
The juggernaut in the space is, of course, Temporal.io - with the recently emerged contender Restate following in its footsteps. Both are based on the premise of sagas, controlled by a separate service.
There is also DBOS, and now also absurd and Vercel Workflow.
As it happens, meeting Bouke when I joined Cheddar spurred my interest in Temporal. I knew someone who worked on the payments infrastructure at Uber (where Temporal came from), and intellectually the problems in the space are just very stimulating. However, as luck would have it, it would take another 2 years at Cheddar before a need arose for actually using durable execution, in actual features. And in the course of implementing it, a few things turned out to be very invigorating indeed. So, before we go into any libraries or solutions: let’s just contemplate.
However, if you are impatient (and a Rails user) - just head to the geneva_drive repo for the grand reveal.
🧭 I am currently available for contract work. Hire me to help make your Rails app better!
The premise is this. Imagine we have a payment workflow:
from_account = user.payment_accounts.first!
# The call to `.transfer` may retry or fail, and take an arbitrarily long
# amount of time
loop do
  payment_status = payment_provider.transfer(from_account:, to_account: recipient_account, amount:, idempotency_key:)
  break if payment_status != :still_processing
  sleep 30.minutes # We want to suspend our code, completely - and resume from here later
end
What one wants here is that the code being executed can actually be paused, and then resumed at arbitrary points. This is effectively the same thing as Terraform forcing its users to use a specifically designed declarative language - a declarative language does not impose a flow of execution, and externalises the places where the execution can be paused, retried or aborted. What Temporal tries desperately to do is to provide you an API which allows you to do this:
authorisation_result = run_remotely_and_asynchronously(:authorise_payment)
funds_check_result = run_remotely_and_asynchronously(:check_funds)
sleep until authorisation_result.received? && funds_check_result.received?
transfer_result = run_remotely_and_asynchronously(:transfer_funds)
run_remotely_and_asynchronously(:send_email) if transfer_result.ok?
sleep 10.days
mark_payment_as_final
Now, if you have a runtime with sufficiently light userspace threads or coroutines - be it Go, BEAM or even Node - you may get this to work fairly easily. When you enter one of those run_remotely_and_asynchronously sections, your workload gets shipped off to a separate coroutine - or, even, to a coroutine that performs an RPC call. Your coroutine waits as long as it needs to in a sleep state, before getting awakened by an IO reactor or another signaling primitive. It then returns its result to the calling coroutine, which can then choose to retry the action, perform a rollback, or take any other interesting steps.
If you have been following the async-io world in Ruby, you will know by now that it is absolutely possible to write such code today - and, provided you do not restart your program, it will work, with decent resource consumption.
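To make that concrete: here is a minimal sketch using the async gem - the step methods (authorise_payment and friends) are made up for illustration:

require "async"

Async do |task|
  # Each step runs concurrently in its own fiber
  authorisation_result = task.async { authorise_payment }
  funds_check_result = task.async { check_funds }

  # `wait` suspends this fiber until the child task completes, then returns its value
  if authorisation_result.wait && funds_check_result.wait
    transfer_funds
  end
end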
If your program never exits, and your server never crashes, and you never get network partitions - this is entirely possible to achieve. The challenge here lies in the fact that we are speaking about an “orchestrating program” - a body of instructions that should always, at any possible cost, run to completion.
That is, it should always terminate cleanly, or - at the very least - terminate of “its own volition”, from within its own call stack.
Now, if you are running a digital telephony device - which runs Erlang - and this device is meant for 24/7 operation, months on end, without reboots or software updates - you may very well be onto something. If you are dealing with a phone call, for instance, it does - after all - likely end in some reasonable timeframe - probably within 24 hours. During that time, if your appliance crashes - the call will be disconnected anyhow. Even then, the Erlang motto of “just let it crash” already hints at the fact that even when your “orchestrating program” is cheap to run, it comes with no guarantee that it will never crash - and you should be prepared for this eventuality.
We, with our cloud functions nonsense, have functions which spin up at the flick of a wand - and providers pride themselves on being able to bring one up in microseconds, and to terminate one just as quickly. Something that would be an acceptable call (sleep 30.minutes) in an always-on, long-lived system becomes a forbidding luxury in the cloud.
Moreover, some systems are just not fit for this, because in addition to the actual variables involved in the stack frame there is a whole context around the invocation - and it has a non-zero cost. A load balancer somewhere is holding a connection to the invocation. A browser is waiting for responses on the other end of that connection. Rails allocates a database connection to that invocation as soon as the first SELECT * FROM users WHERE ... gets done, and does not release that connection back into the pool until the invocation returns. Suspending that invocation for an arbitrary amount of time is not only costly in terms of keeping the actual values on the stack alive, but also in terms of holding on to all that context.
Electric dreams of marshalable stacks
All of those systems - Temporal, Restate and their ilk - are trying to create a runtime where just one thing would be different from the actual state of the various runtimes we already have. It would be a runtime where a function would be able to “snapshot” itself and put itself into “deep sleep”. When the time comes to “resume” it - usually because an external scheduler (or orchestrator, or another function call) knows “when” - the function would be allocated to a machine (or a VM, or an isolate, or a thread) and magically “revived”, upon which it would resume execution from the spot in the call stack where it left off.
All the APIs I’ve seen for durable execution try their damnedest to pretend that this is possible to achieve, yet in none of the runtimes is it actually possible. Therefore a sleep 30.minutes becomes fancy_context.pause_for(30.minutes). Local variables necessarily become fancy_context.store("authorisation_token", token) and the like.
Granted, modern operating systems support hibernation pretty well, with two caveats: on the same machine with the same hardware, and without changes to the program. There is some support for VM snapshotting, but those snapshots are big - they are the whole kit and kitchen sink: the OS, the libraries, and the entire contents of memory. But even then: a marshalable stack would be quite a feat to pull off.
The main obstacle with a marshalable stack - one which can also be relocated between machines - would be the handles. See, most things in programs that do useful stuff - like accessing files, sockets, databases, GPUs, interrupts - are what is called a “handle”: some “opaque reference to an externally provided resource”. And most of those are not marshalable. Remember how, by far, not everything in a Java program implements Serializable? Things which do not would simply get tossed when serializing - but then one would need a way to revive them. A language with truly serializable stacks would likely implement such a construct:
module SerializableDatabaseConnection
  # Reduce the live handle to plain, marshalable data
  def to_hibernated_handle
    connection_configuration.to_h
  end

  # Reconstruct a working handle from that data on revival
  def revive_from_hibernated_handle(connection_configuration)
    connect(connection_configuration)
  end
end
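Hibernation would then round-trip every handle through plain data, roughly like this (a sketch - db and stored_handle are made up for illustration):

# Before hibernation: reduce the live connection to marshalable data
stored_handle = db.to_hibernated_handle # => {host: "db.internal", port: 5432, ...}

# ...the invocation sleeps, possibly for months, possibly moving to another machine...

# On revival: reconstruct a working connection from the stored data
db.revive_from_hibernated_handle(stored_handle)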
But handles are only a part of the problem. Code changes too. If our hibernating invocations, which were stored a couple of months ago, are still expecting our system to have PaymentInterfaceV1 - there will be a loud “bang!” when they get resumed and discover there is no longer such a thing, as it has been replaced by PaymentInterfaceAdapterFactory. Every code change, every deployment would become an exercise in verifying whether any hibernating invocations “on file” will still find their requisite primitives when revived.
The closest point where a Rails developer would encounter such a problem would be when you are deploying a new version of an ActiveJob subclass, or some code that is using such a subclass. Once you have a job for PerformPaymentJob in your queue - by the time that job should run you better damn have a class for PerformPaymentJob defined in your application, or else.
I suspect this is mostly the reason hibernation (and thus - marshalable stacks) did not take hold in the cloud world. Even Smalltalk supported machine images, but these were exactly what they are called - machine images (the invocation and the kitchen sink), not method images: they were big, and would offer you the state of an entire system “ab initio”. I do not see how you would keep a Docker container in hibernation, with all of its 16GB of storage and memory, for every payment that you initiate. And besides, pulling off such a feat as marshalable invocations would mean that one needs a new language!
Not only because the primitives - or modules - of the host system could change between resumptions, but also because there are constructs which become very precarious if they live within one context of execution, yet that context gets suspended - and then resumed on another machine. What would the following code block produce?
t1 = Process.clock_gettime(Process::CLOCK_MONOTONIC)
# here we hibernate our invocation for 2 hours
delta = Process.clock_gettime(Process::CLOCK_MONOTONIC) - t1
Imagine that, by some feat of magic, we do have our distributed language that is a superset of Ruby and allows us to have those “hibernated invocations”. While we can (and will, of course) store t1 - by virtue of it being on the stack - we have no guarantee that the monotonic clock will continue exactly from where it froze at hibernation time. We could have such a guarantee - but again, at the cost of retaining more state (the value of the monotonic clock at first invocation + the wall clock at the same instant). That clock, in turn, must - of course - be of such sizing (width) as to survive getting suspended for days - or months - without losing precision.
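To make this concrete, here is a sketch of the extra bookkeeping such a runtime would have to perform - all hypothetical, of course:

# At first invocation: snapshot both clocks
monotonic_at_start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
wall_clock_at_start = Time.now.to_f

# ...hibernate, possibly relocate to another machine, resume...

# The new machine's monotonic clock starts from an unrelated value, so the
# elapsed monotonic time has to be reconstructed from the wall clock
elapsed = Time.now.to_f - wall_clock_at_start
resumed_monotonic = monotonic_at_start + elapsed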
Idempotency as a substitute
Pretending that we can have a marshalable stack (or a serializable continuation, if you wish) is just so appealing - but unachievable in practice. What can we do instead? Well, we could say that our invocation is idempotent. That is: if we start it, and at some arbitrary point we want to suspend it (or it crashes) - there is a reasonable checkpoint someplace that we can skip to. Not resume from - and this is a very important difference - skip to. Imagine our library for durable execution wants to support the same bit of code we wrote out earlier:
authorisation_result = run_remotely_and_asynchronously(:authorise_payment)
funds_check_result = run_remotely_and_asynchronously(:check_funds)
sleep until authorisation_result.received? && funds_check_result.received?
transfer_result = run_remotely_and_asynchronously(:transfer_funds)
run_remotely_and_asynchronously(:send_email) if transfer_result.ok?
sleep 10.days
mark_payment_as_final
Let’s focus on just a small fragment of it:
authorisation_result = run_remotely_and_asynchronously(:authorise_payment)
funds_check_result = run_remotely_and_asynchronously(:check_funds)
If we want to use this “as if it were idempotent” in a convenient manner, here is what we would likely do:
- At invocation of run_remotely_and_asynchronously we would check whether there is a saved checkpoint for this invocation, keyed with the invocation ID - and, possibly, its arguments. If it exists - we would skip the call altogether and look up the result of the call in some form of cache.
- If there is no checkpoint - we would create one, and spin up the run_remotely_and_asynchronously - either remotely or locally.
Then, on the first invocation, our run_remotely_and_asynchronously(:authorise_payment) may succeed, but the program may crash before having reached run_remotely_and_asynchronously(:check_funds). When we invoke our invocation again, it would see that there already is a checkpoint (and a saved return value!) for run_remotely_and_asynchronously(:authorise_payment) and would allow the program to skip directly to run_remotely_and_asynchronously(:check_funds). With any luck (and a lot of grease and duct tape) a somewhat sufficient API can be put together that can “paper over” the absence of marshalable stacks.
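A minimal sketch of such a checkpointing wrapper - assuming a hypothetical checkpoint_store, a workflow_invocation_id, and a perform_step helper that does the actual work:

def run_remotely_and_asynchronously(step_name, **arguments)
  checkpoint_key = [workflow_invocation_id, step_name].join("/")

  # An existing checkpoint means this step already completed: skip it
  # entirely and replay the return value it produced back then
  saved = checkpoint_store.fetch(checkpoint_key)
  return saved.return_value if saved

  # No checkpoint yet: actually perform the step, then persist its result
  # so that a crash-and-rerun will skip over it next time
  result = perform_step(step_name, **arguments)
  checkpoint_store.save(checkpoint_key, return_value: result)
  result
end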
But not all is roses here. A single non-deterministic (or non-idempotent) argument passed to run_remotely_and_asynchronously would make the checkpoint useless:
authorisation_result = run_remotely_and_asynchronously(:authorise_payment, params: {idempotency_key: SecureRandom.uuid})
Every time we invoke our run_remotely_and_asynchronously here upon reentering the invocation, a new idempotency_key value would be generated. If we ignore it - our checkpoint may return values which will not belong to this invocation proper. If we use it as part of our cache key - every time it changes it is going to produce a new authorisation result, thereby not allowing us to “skip forward”.
A likely workaround for this would be to implement yet more distinctions between, say, “transient context parts” and “persisted context parts” - that way the payment_id would be considered contributing to the cache key / checkpoint, but the idempotency_key would not. Yet more things to take into account.
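In a hypothetical API (none of the libraries mentioned spells it exactly this way) it could look like so:

authorisation_result = run_remotely_and_asynchronously(
  :authorise_payment,
  persisted: {payment_id: payment.id},            # contributes to the checkpoint key
  transient: {idempotency_key: SecureRandom.uuid} # regenerated on every attempt, never keyed
)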
With more sophisticated VM control - or source code analysis - we could even go as far as placing those “checkpoints” on the level of source lines. This is incredibly brittle, of course - even more brittle than “context keying” - because a mere addition of some comments, or a split of a statement into multiple lines, could render the checkpoint useless.
Effectively, we are getting to the same issue that we had with the absence of marshalable stacks - but on the level of source code. The code may change while the context is retained, and the context will no longer make sense in relation to the code.
If you look closely at Temporal documentation (and at the various clients, and the distinction between Workflows and Activities) you will see this “longing for marshalable stacks” everywhere. The longing which never gets answered.
There is more, of course. Imagine we need to supply an authentication token when we call our services. The token is time-limited (it has an exp claim), and it also has a random component (in the form of a nonce). It may not be reused. We could place it on the workflow level (“the body of code where we can skip”):
token = ExternalClient.generate_auth_token
authorisation_result = run_remotely_and_asynchronously(:authorise_payment, auth: token)
but when we reattempt the run_remotely_and_asynchronously we will reuse the token that got persisted. And the remote service will (rightfully) refuse the token - either because it will have expired by then, or because it still remembers the nonce from the previous call. The solution? Well, placing the ExternalClient.generate_auth_token inside of our run_remotely_and_asynchronously, of course:
authorisation_result = run_remotely_and_asynchronously(:authorise_payment) do
  # Generated afresh on every attempt of this step - never persisted in a checkpoint
  token = ExternalClient.generate_auth_token
end
This is not an issue of “having closures be used as step functions” - closures are great, and this is exactly why having them is so useful and desirable (albeit the _remotely_ part becomes difficult that way). The issue is always remembering what should go into the “outer” program - the orchestrator - and what must be in the “inner” programs, the tasks. And ensuring that both the tasks and the outer workflow - our main invocation - are idempotent.
So: papering over the absence of stack marshaling with idempotency does work, but it begets a lot of things you need to worry about, and they can be very intricate. Very, very intricate indeed. In other words: “ambient idempotency and it will just work” is a leaky abstraction.
And we haven’t even discussed recovering from exceptions yet.
Bring it back, sing it back
Orchestrated rollbacks (reverting partially completed actions) are something one may or may not encounter. The case Restate brings up is very apt, although not necessarily very common. The issue is, again, that for proper recovery we need to know where to start recovering from - from which point in the program. We found this out the hard way with gouda: if a server crashes, the background job - which you could see as one such “unit of work” - may get interrupted simply because the container gets killed by the OOM killer.
Or there is a deploy, and the instance group manager gets sick of waiting for the machine to shut down cleanly. The point is that this InterruptException would have to be “synthesized” after the fact, judging by the workload not having had heartbeats for some time. There would not be a piece of code that would magically “raise” it - we had to make such a piece of code.
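A sketch of what such after-the-fact synthesis could look like - assuming jobs record a last_heartbeat_at timestamp while executing (the names here are made up, not gouda’s actual API):

STALE_AFTER = 5 * 60 # seconds without a heartbeat before we presume death

# Nobody inside the dead container will ever `raise` anything - so we walk
# the executing jobs from the outside and record the interruption ourselves
jobs.each do |job|
  next unless job.state == :executing
  next unless Time.now - job.last_heartbeat_at > STALE_AFTER

  job.mark_interrupted!(error: "No heartbeat for #{STALE_AFTER} seconds")
end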
The rollback performed afterwards is not impossible - but also something to think about. I have tried (with all my might) to be very minimal in “non-atomicity” of operations that may need staged or partial rollbacks, because I know how tricky those can be. Could even say I was lucky. But we should not discount the fact that they are a part of durable execution.
Conclusion
A lot of the “spirit” of durable execution comes from the desire to have a revivable call stack, or low-cost VM snapshots - with such snapshots being marshalable, and revivable on different hosts than the ones that created them. Most current “cloud” durable execution engines pretend that this is kind-of-how-it-works, but actually replace functioning snapshotting with forced idempotency. And that idempotency is largely on the developer - that is, on you. It is not bad per se, but it is a limitation.
Also: they could have spent the VC millions actually developing a good language/VM for this instead of trying to pretend Go and Java are a good fit, but who am I to recommend.
Stay tuned for Part 2, where we will examine how this can be managed by divorcing the “workflow” code from the “activity” code - and how systems from non-cloud domains implement it.