The little Random that could
Sometimes, after a few pints in a respectable gathering of Rubyists, someone will ask me “what is the most undervalued module in the Ruby standard library?”
There are many possible answers, of course, and some favoritism is to be expected. Piotr Szotkowski, who sadly passed away this summer, did a wonderful talk on the topic a wee while back.
My personal answer to that question, however, would be Random. To me, Random is an unsung hero of a very large slice of the work we need to do in web applications, especially when we need things to be deterministic and testable. So, let’s examine this little jewel a bit closer.
🧭 I am currently available for contract work. Hire me to help make your Rails app better!
The various random sources
A Random and its far-removed cousin SecureRandom are interfaces to random number generators, or sources of entropy. I won’t bother you with too many details, but the simplest use for an RNG is simulating a dice roll:
random_side = Random.rand(1..6)
You can also do the same using SecureRandom, of course (there is quite a bit of overlap between Random and SecureRandom, though not a complete one):
random_side = SecureRandom.rand(1..6)
There is a substantial difference between them, though. SecureRandom latches onto the entropy source of the operating system and reads random bytes from that source (usually a file). Therefore, SecureRandom is not deterministic and not repeatable. Random, however, is - which is where it becomes useful. It is an implementation of something known as the Mersenne Twister, built to produce a sequence of pseudorandom values through various techniques of bit mixing and twiddling.
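A quick way to convince yourself of that determinism: two generators created with the same seed march in lockstep.
a = Random.new(42)
b = Random.new(42)
a.rand(1..6) == b.rand(1..6) # => true, for this and every subsequent pair of calls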
A Random is stateful
It is, after all, a sequence. A seeded Random object is able to produce pseudo-random numbers (or output) indefinitely, by revolving its internal byte bag - starting at the seed. You can think of it as an infinite Range:
counter = (0..).each
counter.next # First call returns 0, second - 1, third - 2 and so forth
except that the values will be wildly different and uniformly distributed, and will also be appropriately scaled. For example, if we want to obtain random numbers in the range from 0 to 0xFFFF, we can create the same kind of Enumerator with a Random:
emitter = Enumerator.new do |y|
  rng = Random.new(42)
  loop { y.yield(rng.rand(0xFFFF)) }
end
emitter.next # => 56422
emitter.next # => 15795
Try it - it’s going to generate exactly the same values on your machine as it does on mine. This highlights another important trait of a Random - it has internal state, and that state can’t be manipulated directly. When you make Random emit data, that state changes irreversibly, and it only ever changes “forward”. On one hand, statefulness may be undesirable. The state mutations (twists) depend on how much data is requested from Random in sequence. For example:
rng = Random.new(42)
rng.rand(0xFF)
rng.rand(0xFFFF) # => 15795
rng = Random.new(42)
rng.rand(0xFFFF)
rng.rand(0xFFFF) # => 15795, same value
but when we first request a “sizeable” blob of bytes:
rng = Random.new(42)
rng.bytes(1024*1024); nil
rng.rand(0xFFFF) # => 2321, different
so while Random sequences are repeatable, they are not guaranteed to produce the same value at the same count of invocations. Rather, the output depends on the amount of generated “entropy” consumed so far.
When “fast” is better than “secure”
Random is also much, much faster than SecureRandom. For example, if you want to generate some random bags of bytes:
require "securerandom" # This adds uuid methods to Random
def timed(what)
  t = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  yield
  dt = Process.clock_gettime(Process::CLOCK_MONOTONIC) - t
  puts("%s - %0.6f seconds" % [what, dt])
end
timed("SecureRandom") do
10_000.times { SecureRandom.uuid }
end
timed("Random") do
10_000.times { Random.uuid }
end
timed("Random obj") do
rng = Random.new
10_000.times { rng.uuid }
end
gives:
SecureRandom - 0.025562 seconds
Random - 0.018445 seconds
Random obj - 0.010401 seconds
Using a pre-seeded Random is 2.5x faster than using SecureRandom. Moreover, in many situations it will be more appropriate, too! Depending on the speed of your system, using Random#bytes can be much faster than SecureRandom.bytes as well.
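If you want to verify that on your own machine, the timed helper from above can be reused - a small sketch, with arbitrary sizes and counts:
timed("SecureRandom.bytes") { 1_000.times { SecureRandom.bytes(64 * 1024) } }
timed("Random#bytes") do
  rng = Random.new
  1_000.times { rng.bytes(64 * 1024) }
end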
Note that requiring securerandom adds several compatibility methods to Random (like uuid, hex, base64, etc.) that make it more interchangeable with SecureRandom while maintaining the deterministic behavior of Random.
Where can you use a Random?
A couple of methods in the Ruby stdlib actually accept a random: keyword argument, which can be either an instance of Random or SecureRandom (the module). Those are Array#sample and Array#shuffle, and they are mighty useful:
(0..999).to_a.sample(random: Random.new(42)) #=> 102
(0..999).to_a.sample(random: SecureRandom) #=> something truly random
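Array#shuffle works the same way - given a seeded Random it produces the same permutation on every run:
(1..10).to_a.shuffle(random: Random.new(42)) #=> the same permutation, every run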
On its own, you can use it to obtain a random value between 0 and 1.0:
rng = Random.new(12)
rng.rand #=> 0.3745401188473625
which is useful for sampling a uniform distribution (if a value ends up below a threshold, it gets picked; otherwise it doesn’t).
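For example, here is a deterministic way to pick roughly 10% of a collection - a small sketch with made-up inputs:
rng = Random.new(42)
letters = ("a".."z").to_a
picked = letters.select { rng.rand < 0.1 } # roughly 10% of the letters, the same subset every run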
Seeds, seeds everywhere
When you have flaky tests, usually you will look for the --seed=872771 printed to the console, and try to reproduce your failure with that --seed value. What does it do?
Well, when tests are order-independent, most likely it does something like this:
all_tests = collect_test_cases.sort_by(&:name)
all_tests.shuffle!(random: Random.new(seed_arg))
This deterministically shuffles the tests so that they run in random order, and will shuffle them into the same order every time for a given seed. This seed is actually available inside of your tests, and using it is an absolute requirement for making your tests reliable. For example, imagine you are testing some kind of event processing, and you need to test events arriving out of order. In your test, you may do this:
def test_does_not_raise_when_events_arrive_out_of_order
  assert_nothing_raised do
    events = [:payment_received, :payment_chargeback, :payment_adjustment]
    events.shuffle!
    event_processor.process_event_stream(events)
  end
end
You now have a flaky test - you can’t say what the ordering of events under test is going to be, and if your process_event_stream does, indeed, raise - you won’t be able to quickly reproduce the same ordering that caused the failure. And there are 6 possible orderings here! The fix?
events.shuffle!(random: Random.new(Minitest.seed))
This feeds Array#shuffle! a deterministic random number generator which is guaranteed to produce the same sequence of values given the same seed. And if your ordering comes out a particular way with --seed=123, you will be able to reproduce your failing test 100% accurately. The same facility is available in RSpec under RSpec.configuration.seed, by the way.
That predictable randomness goes further. For example, imagine you are testing something like ActiveStorage and need to ensure that your checksum calculation routine somewhere far down in your library stores your data correctly. You can, of course, have a fixture like image.png in your fixtures/files/ and “fake-upload” it inside of your test. But a much faster way - which is also not going to have an adverse effect on your checkout speeds - would be:
require "digest"
require "securerandom" # brings in Random#hex via Random::Formatter

rng = Random.new(Minitest.seed)
key = rng.hex(12)
random_binary_data = rng.bytes(33*1024)
sha2 = Digest::SHA2.hexdigest(random_binary_data)
uploader.upload(key: key, io: StringIO.new(random_binary_data))
blob = StoredBlob.where(key: key).first!
assert_equal sha2, blob.sha2
If you have a feature flag which is deployed to 20% of your userbase, you can find out whether a given user should receive the feature or not:
rng = Random.new(user.id)
rng.rand <= 0.2 # => true or false
This shows another specific feature of Random: it accepts a seed which can be a large integer. While Random doesn’t natively support string seeds, you can convert a string to an integer seed using a simple approach:
# Convert string to a deterministic integer seed
seed_value = seed_string.bytes.inject(0) { |acc, byte| (acc << 8) | byte }
Random.new(seed_value)
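Another approach - my own suggestion, not something Random offers natively - is to hash the string first, which keeps the seed bounded in size regardless of the string length:
require "digest"
# Derive a stable integer seed from, say, a flag name plus a user id (hypothetical input)
seed_value = Digest::SHA2.hexdigest("new-dashboard-#{user.id}").to_i(16)
Random.new(seed_value)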
Fractal sequencing and Faker
Nothing lends itself better to Random application than generating all sorts of fake data. At Cheddar we had a fake bank, used for internal testing, where entire graphs of user data would be generated using Random combined with proper Faker use. This worked using seed sets. Imagine you need to generate a number of fake users, where every user has a certain number of fake accounts. These accounts, in turn, have transactions in them.
Since we know that a Random produces a deterministic sequence, we can use a derivation process to chain Random generators together. A single RNG would be initialized for the entire seed set. Then, for every user, we would make our “root” RNG generate a seed, which would be used to seed the RNG for that user. Inside the user we would get a seed from that RNG to generate the list of accounts, and inside of those accounts - transactions. Observe:
def with_faker_random(rng)
  r = Faker::Config.random
  Faker::Config.random = rng
  yield
ensure
  Faker::Config.random = r
end

def with_faker_seed(seed, &blk) = with_faker_random(Random.new(seed), &blk)

SEED_MAX = 0xFFFFFFFFFFFFFF

def generate_user(seed)
  user = User.create!(
    name: with_faker_seed(seed) { Faker::Name.name },
    email: with_faker_seed(seed) { Faker::Internet.email }
  )
  seed_for_accounts = Random.new(seed).rand(SEED_MAX)
  generate_accounts(user, seed_for_accounts)
end

def generate_accounts(user, accounts_seed)
  rng = Random.new(accounts_seed)
  with_faker_random(rng) do
    account = user.accounts.create!(iban: Faker::Bank.iban)
    generate_transactions(account, rng.rand(SEED_MAX))
  end
end

def generate_transactions(account, transactions_seed)
  amount_range_cents = 1..2500_00
  time_range = Time.utc(2021, 1, 25)..Time.now.utc
  multipliers = [1, -1] # For debit/credit
  rng = Random.new(transactions_seed)
  n_txns = rng.rand(50..3000)
  n_txns.times do
    amount = rng.rand(amount_range_cents) * multipliers.sample(random: rng)
    created_at = rng.rand(time_range)
    account.transactions.create!(amount:, created_at:)
  end
end
If you read this closely you will see that the entire output graph (User -< Account(N) -< Transaction(N)) gets produced from a single random seed, percolating downwards.
There is one tricky bit here: note the difference between with_faker_random and with_faker_seed. The with_faker_seed variant is useful when we need to output multiple pieces of data which should be deterministic regardless of order. Specifically, imagine we want our User to be generated with the same, predictable email and the same, predictable name. If we set the RNG for Faker and “leave it be”, the generated values depend on call order - which is to be expected, because Faker repeatedly ratchets our RNG to obtain entropy out of it, and the RNG is stateful:
irb(main):033> Faker::Config.random = Random.new(42)
=> #<Random:0x00000001395b3fa8>
irb(main):034> Faker::Name.name_with_middle
=> "Brittany Klocko Prohaska"
irb(main):035> Faker::Internet.email
=> "lyndon_rempel@schumm-jaskolski.test"
But if we do the calls in reverse, the output changes!
irb(main):036> Faker::Config.random = Random.new(42)
=> #<Random:0x000000012ea9dfd8>
irb(main):037> Faker::Internet.email
=> "ruby_ebert@hamill.test"
irb(main):038> Faker::Name.name_with_middle
=> "Zachery Weimann McGlynn"
If that is OK for your use case - that’s fine, and it will be faster too. But if you want output that is deterministic regardless of call order, the correct way is to assign a “fresh” RNG before every Faker invocation:
irb(main):039> Faker::Config.random = Random.new(42)
=> #<Random:0x0000000129411458>
irb(main):040> Faker::Name.name_with_middle
=> "Brittany Klocko Prohaska"
irb(main):041> Faker::Config.random = Random.new(42)
=> #<Random:0x00000001396155a0>
irb(main):042> Faker::Internet.email
=> "ruby_ebert@hamill.test"
It can be tricky to twist your head around this (pun intended!), but the technique can create fairly nice fractal graphs of objects.
Reproducing distributions
Imagine you want to simulate some background jobs. Your APM gives you a distribution of the job run time. You know that the mean time for the job to complete is 433ms, but the p90 can go up to 1.56s. You want to simulate a queue which is saturated with those jobs, and you want them to have a similar duration profile. You can do that, and you can do it in a predictable way:
# This class takes the mean and the p90 (which will normally be durations of a task taken from Appsignal)
# and uses those to generate a random job duration. It should allow us to have reasonable simulated
# job durations. It samples a normal distribution via the
# https://en.wikipedia.org/wiki/Box%E2%80%93Muller_transform
class Distribution
  def initialize(mean:, p90:, random: Random.new)
    @p90 = p90.to_f
    @mean = mean.to_f
    # Approximate the standard deviation from the mean and the 90th percentile
    # (the z-score of the 90th percentile of a normal distribution is about 1.28)
    @std_dev = (@p90 - @mean) / 1.28
    @random = random
  end

  def value
    # Box-Muller transform: turn two uniform samples into a normally distributed one
    u1 = 1.0 - @random.rand # avoid log(0), since rand may return 0.0
    u2 = @random.rand
    r = Math.sqrt(-2 * Math.log(u1))
    theta = 2 * Math::PI * u2
    z = r * Math.cos(theta)
    value = @mean + @std_dev * z
    value < 0 ? @mean - value : value # reflect negative durations back into positive territory
  end
end
and then, to create your job durations:
rng = Random.new(42)
dist = Distribution.new(mean: 433.0 / 1000, p90: 1.56, random: rng)
2000.times do
  simulator.enqueue(SimulatedJob.new(duration: dist.value))
end
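Since the whole thing is seeded, you can also sanity-check the generator - sample it a few thousand times and compare the empirical mean and p90 against the targets. A quick sketch:
rng = Random.new(42)
dist = Distribution.new(mean: 0.433, p90: 1.56, random: rng)
samples = Array.new(10_000) { dist.value }
empirical_mean = samples.sum / samples.size
empirical_p90 = samples.sort[(samples.size * 0.9).floor]
# Both should land reasonably close to 0.433 and 1.56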
This also allows you to create very large datasets, which are perfect for performance testing. Need a few million bank transactions to test out a hypothesis? Grab your production distributions to have a plausible corpus, arm yourself with some wicked-fast SQLite INSERT mojo and Faker, and you can have a 100% reproducible corpus generator that runs in seconds!
To summarize
Mersenne twisters are awesome, and Random is, in my view, the most undervalued Ruby class in the standard library. It can generate mock data sets for you, it can drive entire graphs of output from Faker, and it can make your test suites super-reliable. And it is wicked fast.
Also, it’s not really random - and I love cheating that way.