Data Over Time
Since 2023 I have been working at Cheddar Payments, a fledgling fintech startup in the UK. It was a substantial change from WeTransfer in terms of the problem domain, but also in terms of scale.
The scale at a B2C fintech is smaller, but the challenges are, in some ways, much harder. And the biggest challenge - engineering-wise - is “data over time”. I’ve learned more about data over time than I would have liked, and I think it can be useful to share that experience.
The Rails Way™ is harmful for data over time
Rails assumes you build systems around a “current state of the world” database, filled with rows that get mutated in place. For example, this would be a very standard pattern in Rails:
class CustomerAccount < ActiveRecord::Base
# ...
end
account.update!(balance: account.balance - purchase_amount)
While obvious and very quick to execute, this has a multitude of tradeoffs which will bite you down the road (and fairly quickly - in a fintech situation):
- You don’t have visibility of specific account balance changes - only the aggregate
- You don’t have the ability to compute the balance at arbitrary points in time (what the balance “used to be”)
- You can’t easily make balance adjustments that get linked to the change they correct
If you still want to maintain the Rails Way to some degree, start with a ledger-like accounting system like the one I’ve outlined earlier:
class CustomerAccount < ActiveRecord::Base
  has_many :mutations

  def balance(at: Time.now)
    mutations.where("created_at < ?", at).sum(:amount)
  end
end

class Mutation < ActiveRecord::Base
  belongs_to :customer_account
end
account.mutations.create!(amount: -purchase_amount) # Negative amounts indicate debits
This gives you a large set of conveniences - you can audit specific mutations, monitor the balance of the account at a specific point in time, make predictions and accurate corrections.
Caveat: the balance calculation above assumes a single-threaded environment. In reality, balance calculations may be taking place at the same time as INSERT operations, and the result could depend on your database’s transaction isolation level.
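One way to hedge against that without leaving ActiveRecord - a minimal sketch, assuming a per-account row lock is acceptable for your write volume, and with InsufficientFunds being a hypothetical error class - is to serialize the balance check and the insert on the account row:

account.with_lock do
  # with_lock reloads the account row under SELECT ... FOR UPDATE, so concurrent
  # balance checks and mutation inserts for this account queue up behind us
  raise InsufficientFunds if account.balance < purchase_amount # hypothetical error class
  account.mutations.create!(amount: -purchase_amount)
end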
There are double-entry ledger systems for Ruby and for Rails but we found most of them do “too much of what we don’t need and not enough of what we do need”.
This applies to many models, not only some sort of Account
If you have, for example, a cashback someone has received - or a running mortgage - and these get adjustments over time, it is much (and I mean: orders of magnitude) better if you do this via “mutations” that you append, and not via changes to attributes.
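As an illustration, here is a hypothetical Cashback modelled the same way - the model, column and reason values are made up, but the shape is the same as the ledger above:

class Cashback < ActiveRecord::Base
  has_many :adjustments

  # The current value is always derived from the adjustments, never stored
  def amount(at: Time.now)
    adjustments.where("created_at < ?", at).sum(:amount)
  end
end

class Adjustment < ActiveRecord::Base
  belongs_to :cashback
end

cashback.adjustments.create!(amount: 500, reason: "initial grant")
cashback.adjustments.create!(amount: -150, reason: "order partially refunded")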
Attribute change logs only get you so far
Yes, you can record changes to the attributes of a particular record using something like paper_trail or logidze - but they all have a shortcoming you would want to consider. While debugging does get easier because you know the 3 key elements - what changed, from and to what, and when - you do not have a clear description of why it changed.
Yes, the balance was decremented from 10 to 8. Was it because there was a mistake in the date and time calculation? Was it because some rate changed during that time? Was it because the person received a discretionary payout of some kind which got returned?
So what I believe is a better approach - and every time we used it at Cheddar, it turned out to be the right choice - is to apply the “mutation records” and aggregate the “state at T” from them. It is, in fact, a limited form of event sourcing.
If you want to discover more about event sourcing, Ismael Celis has a growing corpus of articles on the topic.
Intervals are tricky
Another pervasive concept in “data-over-time” systems is the interval. Intervals are pretty much always “start-inclusive and end-exclusive”, and most frequently run between times - but sometimes also between dates. A good example of an “interval-like” item in other systems would be a calendar entry.
Usually you would have something like:
class ScopeGrantInterval < ActiveRecord::Base
  # datetime :starts_at
  # datetime :ends_before

  scope :covering, ->(t) {
    where("starts_at <= :t AND (ends_before > :t OR ends_before IS NULL)", {t:})
  }

  def create_next!
    t = Time.now
    update!(ends_before: t)
    self.class.create!(starts_at: t)
  end

  def covers?(time)
    time >= starts_at && (ends_before.nil? || time < ends_before)
  end
end
Intervals are tricky because they can utterly break down unless you hold them “just right”:
- You want to place mnemonic hints everywhere you can as to whether a timestamp indicates inclusive or exclusive containment. We settled on suffixing columns which are supposed to be queried non-inclusively with _before and _after instead of _at - this helps a ton.
- Intervals can be tricky to query - you want WHERE starts_at <= ? AND ends_before > ?, not WHERE ? BETWEEN starts_at AND ends_before
- Use a sophisticated-enough DB that has range-type indexes and can create related constraints
- Intervals can - and will - have gaps, and your system should be able to deal with that. Both business-logic-wise and in terms of performance.
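On the third point: if you are on PostgreSQL, a range-based exclusion constraint can reject overlapping intervals at the database level. A minimal sketch, assuming the scope_grant_intervals table from above plus a hypothetical customer_account_id column to scope the constraint per account:

# ..._migration.rb
enable_extension "btree_gist" # lets the constraint mix an equality check with a range check

execute <<~SQL
  ALTER TABLE scope_grant_intervals
    ADD CONSTRAINT scope_grant_intervals_no_overlap
    EXCLUDE USING gist (
      customer_account_id WITH =,
      tsrange(starts_at, ends_before) WITH &&
    )
SQL

The default bounds of tsrange are start-inclusive and end-exclusive, which matches the convention above, and an open-ended interval (NULL ends_before) simply becomes an unbounded range.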
Intervals are likely to emerge with exchange rates, mortgage rates, limited discounts or boosts - any temporal entities which start at a certain point in time, end at a certain point in time, or both.
Time zones are also intervals
Time zones change over time. A timezone definition is not just a name and UTC offset - it’s also “in effect since” and “superseded at”.
If you want to convert “14:38 on 1st March 2009 in London” to UTC, you need the timezone rules that were in effect at that time, not the rules in effect now. The timezone database is a collection of intervals, not a static lookup table.
Treat it as queryable intervals and you’ll have a much better time (pun intended).
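The tzinfo gem (which ActiveSupport time zones wrap) exposes exactly this - you ask it for the period, i.e. the interval of rules, covering a given wall-clock time. A small sketch:

require "tzinfo"

london = TZInfo::Timezone.get("Europe/London")
wall_clock = Time.utc(2009, 3, 1, 14, 38) # used as "14:38 in London", not as UTC

# The period is the interval of rules in effect at that moment -
# on 1 March 2009 London was still on GMT, so the offset is zero
london.period_for_local(wall_clock).utc_total_offset # => 0
london.local_to_utc(wall_clock) # => 2009-03-01 14:38:00 UTC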
Timestamps are not great as cache keys
With data over time, you pretty much always want read-after-write consistency. If you have performed a payment, you want all the resources you load after to reflect that payment having taken place - your list of payments, your account balance…
Rails, by default, uses the updated_at of a model - combined with the model class name and ID - as the cache key. This kind of works (example from logik-matchbook):
matches(dev)> MatchboxShader.first!.cache_key
MatchboxShader Load (0.1ms) SELECT "matchbox_shaders".* FROM "matchbox_shaders" ORDER BY "matchbox_shaders"."id" ASC LIMIT ? [["LIMIT", 1]]
=> "matchbox_shaders/2-20131224152413220104"
but is not good enough. Imagine the following scenario:
# Process 1
wallet.update!(balance: Money.new(100, "USD"))
# Process 2
wallet.update!(balance: Money.new(500, "USD"))
Just incidentally, these two processes happen to perform their operation at exactly the same time. What is the cache_key that is going to be used for caching the model afterwards, and what is the state of the database going to be? The cache key will change, but it is going to reflect the change applied by just one of the processes - depending on when the cached snapshot of the Wallet gets generated and stored. Observe:
# Process 1
wallet.update!(balance: Money.new(100, "USD"))
# Process 3
wallet.cache_key #=> Returns the cache key based on the timestamp of the update from process 1
# Process 2
wallet.update!(balance: Money.new(500, "USD")) # Sets updated_at again, but to the same timestamp value process 1 already wrote
# Process 4
wallet.cache_key #=> Returns the same cache key as process 3 did. The snapshot cached back then - still showing
#   the $1 balance from process 1 - gets served, even though the wallet now holds $5.
While update! does set the updated_at timestamp to the current time, the current time itself may be identical across both processes if they execute within the same second (or even millisecond, depending on your database’s timestamp precision). This is why relying solely on timestamps for cache keys can be problematic.
So, if you want meaningful conditional get or meaningful cache keys, you need one of the two possible additions:
- Use optimistic locking and integrate the lock version into your cache key, or
- Use content-addressable hashes which capture the entire state of the record instead of just the update timestamp
For the former, the following works well enough:
# Appends the optimistic locking version to the default Rails cache key
def cache_key
  without_version = super
  return without_version if new_record?
  return without_version unless locking_enabled?
  [without_version, public_send(locking_column)].join("/")
end
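(Rails enables optimistic locking automatically once the table has an integer lock_version column, at which point locking_enabled? returns true without further configuration.) For the latter, a minimal sketch - hypothetical, digesting the full attribute set instead of just the timestamp - could look like this:

require "digest"

def cache_key
  # Any change to any attribute - not just updated_at - yields a different key
  digest = Digest::SHA256.hexdigest(attributes.sort.to_h.to_json)
  "#{model_name.cache_key}/#{id}-#{digest[0, 16]}"
end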
A heap is not a log, and vice-versa
Webhooks are great, and event sourcing is great. There is a large caveat, however: most event sourcing systems are not strictly ordered. Even with a sequence of Mutation records on a CustomerAccount from earlier:
class CustomerAccount < ActiveRecord::Base
  has_many :mutations, -> { order(created_at: :desc) }

  def balance(at: Time.now)
    mutations.where("created_at < ?", at).sum(:amount)
  end
end

class Mutation < ActiveRecord::Base
  belongs_to :customer_account
end
we can’t reliably ascertain the state of the account between two or more Mutation events if they occur at the same time:
results = account.mutations.pluck(:amount, :created_at)
# => [
#   [-120, "2024-05-12 00:00:01"],
#   [+30,  "2024-05-12 00:00:01"]
# ]
You can’t really say which one of these came first. In general, this is not a very big problem if you pay attention to your cutoff. But there are situations where events depend on each other and on the ordering. For example, if you need to do a balance check before adding 30 cents to the account, you probably want to do it after removing the 120 cents.
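A toy illustration of why the order matters even when the final balance is the same - the intermediate balances (which is what an overdraft or limit check sees) differ:

opening_balance = 100
[[-120, +30], [+30, -120]].each do |ordering|
  running = opening_balance
  intermediate = ordering.map { |amount| running += amount }
  puts "#{ordering.inspect} -> intermediate balances #{intermediate.inspect}"
end
# [-120, 30] -> intermediate balances [-20, 10]
# [30, -120] -> intermediate balances [130, 10]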
Different systems approach this differently. For example, in OpenBanking the transactions you can download do not specify exact ordering. But they do, usually, contain a Balance structure which specifies the balance before and after the transaction took place.
If we want to do strict ordering in our mutations, we can add a sequence number when inserting them. This is not as robust as using a trigger, but it will provide ordering:
# ..._migration.rb
add_column :mutations, :seq, :bigint, null: true
add_index :mutations, [:customer_account_id, :seq], unique: true

class Mutation < ActiveRecord::Base
  belongs_to :customer_account

  after_create do |m|
    # The freshly inserted row is already visible to this connection and is included
    # in COUNT(*), so the count itself is this mutation's sequence number
    m.class.connection.execute(<<~SQL)
      UPDATE #{m.class.table_name} SET seq = (
        SELECT COUNT(*) FROM #{m.class.table_name}
        WHERE customer_account_id = #{m.customer_account_id.to_i}
      ) WHERE id = #{m.id.to_i}
    SQL
  end
end
You can use something more sophisticated like PostgreSQL sequences, but you probably only care about sequencing within a customer account anyway. The above solution is not ideal but will already give you more confident sequencing.
Webhooks are usually a heap
Imagine we have these events for a hypothetical Payment with ID=123:
payment_id | created_at | event_type
------------|------------------|------------
123 | 2024-05-12 00:00 | charged
123 | 2024-05-12 00:01 | fraud_reported
123 | 2024-05-12 00:02 | chargeback
123 | 2024-05-12 00:03 | refund
The peculiarity here is that created_at is not the timestamp of when the event was generated. These events come from an external system via webhooks, and the created_at is a timestamp we generated ourselves on reception. We can try and derive a sequence number “in order of reception” of those webhooks - where the seq value is guaranteed to be unique in scope of payment_id:
payment_id | seq | created_at | event_type
------------|-----|------------------|------------
123 | 1 | 2024-05-12 00:00 | charged
123 | 2 | 2024-05-12 00:01 | fraud_reported
123 | 3 | 2024-05-12 00:01 | chargeback
123 | 4 | 2024-05-12 00:02 | refund
But the issue here is that this ordering has been divined by us - not set by the sender of the webhooks. In actuality, the provider may have generated these events in the following order:
payment_id | event_type
------------|------------
123 | charged
123 | chargeback
123 | refund
123 | fraud_reported
but due to their sending HTTP client being slow, our receiving HTTP endpoint being slow, or our database introducing a bit of delay to serialize transactions, their order does not match ours. Their ordering may not even be defined. The problems will come from us assuming their events are ordered while they are not.
Here is a simple exercise I like to do: grab an array of all the events you may have - for example, [:charged, :chargeback, :refund, :fraud_reported]. Use the following call to see how many possible orderings you can have for them:
[:charged, :chargeback, :refund, :fraud_reported].permutation.to_a.length # => 24 possible orderings
If you are finding yourself in a situation where you do not have a meaningful ordering of incoming events, here is what you can do in your tests:
[:charged, :chargeback, :refund, :fraud_reported].permutation.each do |possible_ordering|
  description = possible_ordering.join(" -> ")
  test "with events in order of #{description}" do
    # Feed the events in this particular order (ingest_event being your own ingestion entry point)
    possible_ordering.each { |event_type| ingest_event(event_type) }
    assert some_invariant1
    assert some_invariant2
    # ...any other assertions that must hold for all orderings
  end
end
This does not handle the situations where events are missed (skipped), but at least it gives you some coverage. Anyway - unless your sender of webhooks/events specifies certain ordering invariants or gives you a sequence number/vector clock for the events they send, assume the events are unordered. Your event log is not a log anymore - it’s a heap, where you need to exercise caution to establish some basic invariants. For example:
- If the webhook sender gives you a timestamp, you can assume that the webhooks are ordered within certain large time buckets, or in clusters. So while you can’t have exact ordering, you can at least make guesses.
- If you are receiving, say, events from our previous example - you can assume that the charged event should always be present and that it should always be considered first. To that end, you can buffer the other events and not proceed to processing them until you have recorded the charged event for that payment_id (see the sketch right after this list).
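A sketch of that buffering - ReceivedEvent (a table of raw webhook payloads with an applied flag) and apply (which folds an event into your domain model) are both hypothetical:

def handle_webhook(payment_id, event_type, payload)
  ReceivedEvent.create!(payment_id: payment_id, event_type: event_type, payload: payload)

  # Keep everything buffered until the anchor event has been recorded for this payment
  return unless ReceivedEvent.exists?(payment_id: payment_id, event_type: "charged")

  ReceivedEvent.where(payment_id: payment_id, applied: false).order(:created_at).each do |event|
    apply(event) # hypothetical: folds the event into the Payment
    event.update!(applied: true)
  end
end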
And most providers - even Stripe - do not report event sequencing information in their webhooks. If you use webhooks between your own systems, even having a sequence number for the webhooks you send - which could be as simple as SELECT COUNT(*) + 1 FROM generated_webhooks WHERE related_model_id = ? - will help establish causality on the receiving side.
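On the sending side that could look like the sketch below - GeneratedWebhook, the unique index on [related_model_id, seq] and deliver_later are all made-up names:

webhook = GeneratedWebhook.transaction do
  seq = GeneratedWebhook.where(related_model_id: payment.id).count + 1
  # A unique index on [related_model_id, seq] turns a concurrent double-send into
  # a constraint violation instead of two webhooks sharing the same seq
  GeneratedWebhook.create!(related_model_id: payment.id, seq: seq, body: payload)
end
deliver_later(webhook) # however you actually ship the webhook out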
Conversely, if you are only receiving and there is no sequencing - frequently the best course of action is to not use the data from the webhook at all - but to use the webhook as a signal to re-read the remote resource. Yes, it will be slower, but at the very least your local representation of the changed resource is going to be reasonably up-to-date.
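That pattern can be as small as the following sketch (the controller, job and parameter names are made up):

class PaymentWebhooksController < ApplicationController
  def create
    # Use the payload only to learn which resource changed, then re-read
    # the authoritative state from the provider asynchronously
    RefreshPaymentJob.perform_later(params.require(:payment_id))
    head :ok
  end
end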
Know when you are dealing with a heap versus an ordered log and act accordingly. And do not let the transport deceive you. A Kafka topic or an EventEmitter will be delivering you messages serially and sequentially - sure, but do they get placed into that topic or server-sent-events stream in the order they got generated by the system? If there is no producer-defined ordering which is business-logic specific, the fact that the events arrive via a serial channel does not matter.
What I didn’t cover
This article focuses on correctness over performance. I didn’t cover transaction isolation levels or idempotency keys - those get drilled into programming courses while proper data-over-time architecture gets left by the wayside. But yes, you’ll need both. And yes, the ledger approach scales poorly without materialized views or snapshots. But first, get the data model right.