Julik Tarkhanov

The little Random that could

Sometimes, after a few pints in a respectable gathering of Rubyists, someone will ask me “what is the most undervalued module in the Ruby standard library?”

There are many possible answers, of course, and some favoritism is to be expected. Piotr Szotkowski, who sadly passed away this summer, did a wonderful talk on the topic a wee while back.

My personal answer to that question, however, would be Random. To me, Random is an unsung hero of a very large slice of the work we do in web applications, especially when we need things to be deterministic and testable. So, let’s examine this little jewel a bit closer.

🧭 I am currently available for contract work. Hire me to help make your Rails app better!

The various random sources

A Random and its far-removed cousin SecureRandom are interfaces to random number generators, or sources of entropy. I won’t bother you with too many details, but the simplest use for an RNG is simulating a dice roll:

random_side = Random.rand(1..6)

You can also do the same using SecureRandom of course (there is quite a bit of overlap between Random and SecureRandom, though they are not fully interchangeable):

random_side = SecureRandom.rand(1..6)

There is a substantial difference between them, though. SecureRandom latches onto the entropy source of the operating system, and reads the random bytes from that source (it is a file, usually). Therefore, SecureRandom is not deterministic and not repeatable. Random, however, is - which is where it becomes useful.
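A tiny sketch of that difference - run it twice and compare:

# Two Randoms created with the same seed agree with each other, every time
Random.new(42).rand(1..6) == Random.new(42).rand(1..6) # => true, always

# SecureRandom, by design, cannot be replayed
SecureRandom.rand(1..6) # => different across runs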

Random is an implementation of something known as the Mersenne Twister, and is built to produce a sequence of pseudorandom values through various techniques of bit mixing and twiddling.

A Random is stateful

It is, after all, a sequence. A seeded Random object is able to produce pseudo-random numbers (or output) indefinitely, by revolving its internal byte bag - starting at the seed. You can think of it as an infinite Range:

counter = (0..).each
counter.next # First call returns 0, second - 1, third - 2 and so forth

except that the values will be wildly different and uniformly distributed, and will also be appropriately scaled. For example, if we want to obtain random numbers in the range from 0 to 0xFFFF, we can create a similar Enumerator with a Random:

emitter = Enumerator.new do |y|
  rng = Random.new(42)
  loop { y.yield(rng.rand(0xFFFF)) }
end
emitter.next # => 56422
emitter.next # => 15795

Try it - it’s going to generate exactly the same values on your machine as it does on mine. This highlights another important trait of a Random - it has internal state, and that state can’t be manipulated directly. When you make a Random emit data, that state changes irreversibly, and it only ever changes “forward”. This statefulness may be undesirable: the state mutations (twists) depend on how much data has been requested from the Random in sequence. For example:

rng = Random.new(42)
rng.rand(0xFF)   # Request a small value first...
rng.rand(0xFFFF) # => 15795

rng = Random.new(42)
rng.rand(0xFFFF) # ...or a bigger one - both consume the same amount of internal state
rng.rand(0xFFFF) # => 15795, same value

but when we first request a “sizeable” blob of bytes

rng = Random.new(42)
rng.bytes(1024*1024); nil
rng.rand(0xFFFF) # => 2321, different

so while Random sequences are repeatable, a given call is not guaranteed to produce the same value based on the count of invocations alone. Rather, it depends on the amount of generated “entropy” consumed so far.
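And since the state only ever moves forward, the only way to “rewind” a Random is to recreate it from its seed - a small sketch:

rng = Random.new(42)
first = rng.rand(0xFFFF)
# There is no rewind - but a fresh generator with the same seed replays the sequence
rng = Random.new(42)
rng.rand(0xFFFF) == first # => true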

When “fast” is better than “secure”

Random is also much, much faster than SecureRandom. For example, here is how they compare when generating a bunch of random UUIDs:

require "securerandom"  # This adds uuid methods to Random

def timed(what)
  t = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  yield
  dt = Process.clock_gettime(Process::CLOCK_MONOTONIC) - t
  puts("%s - %0.6f seconds" % [what, dt])
end

timed("SecureRandom") do
  10_000.times { SecureRandom.uuid }
end

timed("Random") do
  10_000.times { Random.uuid }
end

timed("Random obj") do
  rng = Random.new
  10_000.times { rng.uuid }
end

gives:

SecureRandom - 0.025562 seconds
Random - 0.018445 seconds
Random obj - 0.010401 seconds

Using a pre-seeded Random is 2.5x faster than using SecureRandom. Moreover, in many situations it will be more appropriate too! Depending on the speed of your system, using Random#bytes can be much faster than SecureRandom.bytes as well.
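For a rough comparison, reusing the timed helper from above (exact numbers will of course vary per system):

timed("SecureRandom.bytes") do
  10_000.times { SecureRandom.bytes(1024) }
end

timed("Random#bytes") do
  rng = Random.new
  10_000.times { rng.bytes(1024) }
end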

Note that requiring securerandom adds several compatibility methods to Random (like uuid, hex, base64, etc.) that make it more interchangeable with SecureRandom while maintaining the deterministic behavior of Random.
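Crucially, those added methods stay deterministic when called on a seeded instance - a small sketch:

require "securerandom"

rng = Random.new(42)
rng.hex(8)  # => the same hex string on every run, given the same seed
rng.uuid    # => a stable, repeatable "UUID" - handy for fixtures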

Where can you use a Random?

A couple of methods in the Ruby stdlib actually accept a random: argument, which can be either an instance of Random or the SecureRandom module itself. Those are Array#sample and Array#shuffle, and they are mighty useful:

(0..999).to_a.sample(random: Random.new(42)) #=> 102
(0..999).to_a.sample(random: SecureRandom) #=> something truly random

On its own, you use a Random to obtain a random float between 0 and 1.0:

rng = Random.new(12)
rng.rand #=> 0.3745401188473625

which is useful for sampling a uniform distribution (if something ends up below a threshold value - it gets picked, otherwise - not).
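A tiny sketch of that (items being a hypothetical collection):

rng = Random.new(42)
# Keep roughly 30% of the items - and exactly the same ones on every run
sampled = items.select { rng.rand < 0.3 }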

Seeds, seeds everywhere

When you have flaky tests, you will usually look for the --seed=872771 printed to the console, and try to reproduce your failure with that --seed value. What does it do?

Well, when your test framework randomizes test order, most likely it does something like this:

all_tests = collect_test_cases.sort_by(&:name)
all_tests.shuffle!(random: Random.new(seed_arg))

This deterministically shuffles the tests: they run in a random order, but the same seed produces the same order every time. That seed is actually available inside of your tests, and using it is essential for making randomized tests reliable. For example, imagine you are testing some kind of event processing, and you need to test events arriving out of order. In your test, you may do

def test_does_not_raise_when_events_arrive_out_of_order
  assert_nothing_raised do
    events = [:payment_received, :payment_chargeback, :payment_adjustment]
    events.shuffle!
    event_processor.process_event_stream(events)
  end
end

You now have a flaky test - you can’t say what the ordering of events under test is going to be, and if your process_event_stream does, indeed, raise - you won’t be able to quickly reproduce the same ordering that caused the failure. And there are 6 possible orderings here! The fix?

events.shuffle!(random: Random.new(Minitest.seed))

This feeds Array#shuffle! a deterministic random number generator which is guaranteed to produce the same sequence of values given the same seed. And if your ordering comes out a particular way with --seed=123, you will be able to reproduce your failing test, 100% accurately. The same facility is available in RSpec under RSpec.configuration.seed, by the way.
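So in RSpec the same fix would look like this:

events.shuffle!(random: Random.new(RSpec.configuration.seed))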

That predictable randomness goes further. For example, imagine you are testing something like ActiveStorage and need to ensure that your checksum calculation routine somewhere far down in your library stores your data correctly. You can, of course, have a fixture like image.png in your fixtures/files/ and “fake-upload” it inside of your test. But a much faster way - which is also not going to have an adverse effect on your repository checkout speeds - would be:

require "digest"

rng = Random.new(Minitest.seed)
key = rng.hex(12) # hex is available on Random after requiring "securerandom"
random_binary_data = rng.bytes(33*1024)
sha2 = Digest::SHA2.hexdigest(random_binary_data)

uploader.upload(key: key, io: StringIO.new(random_binary_data))
blob = StoredBlob.where(key: key).first!
assert_equal sha2, blob.sha2

If you have a feature flag which is deployed to 20% of your userbase, you can determine whether a given user should receive the feature:

rng = Random.new(user.id)
rng.rand <= 0.2 # => true or false
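One caveat, with a hedged sketch of a fix: seeding off user.id alone means every 20% flag would select the same slice of users. Mixing a per-flag value into the seed (FLAG_SALT here is a hypothetical constant you would define per feature) decorrelates them:

# FLAG_SALT is a hypothetical per-feature integer, e.g. derived from the flag name
rng = Random.new(user.id ^ FLAG_SALT)
rng.rand <= 0.2 # => true or false, stable per user *and* per flag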

This shows another specific feature of Random: it accepts a seed which can be an arbitrarily large integer. While Random doesn’t natively support string seeds, you can convert a string to an integer seed using a simple approach:

# Convert string to a deterministic integer seed
seed_value = seed_string.bytes.inject(0) { |acc, byte| (acc << 8) | byte }
Random.new(seed_value)
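An alternative sketch - hashing the string first keeps the seed bounded for long strings (assuming the digest stdlib is available):

require "digest"

# A SHA256 digest gives a stable 256-bit integer seed for any string
seed_value = Digest::SHA256.hexdigest(seed_string).to_i(16)
Random.new(seed_value)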

Fractal sequencing and Faker

Nothing lends itself better to Random than generating all sorts of fake data. At Cheddar we had a fake bank, used for internal testing, where entire graphs of user data would be generated using Random combined with proper use of Faker. This worked using seed sets. Imagine you need to generate a number of fake users, and every user has a certain number of fake accounts. These accounts, in turn, have transactions in them.

Since we know that a Random produces a deterministic sequence, we can use a derivation process to chain Random generators together. A single RNG would be initialized for the entire seed set. Then, for every user, we would make our “root” RNG generate a seed, which would be used to seed the RNG for the user. Inside the user we would get a seed from that RNG to generate the list of accounts, and inside of those accounts - transactions. Observe:

def with_faker_random(rng)
  r = Faker::Config.random
  Faker::Config.random = rng
  yield
ensure
  Faker::Config.random = r
end

def with_faker_seed(seed, &blk) = with_faker_random(Random.new(seed), &blk)

SEED_MAX = 0xFFFFFFFFFFFFFF

def generate_user(seed)
  user = User.create!(
    name: with_faker_seed(seed) { Faker::Name.name },
    email: with_faker_seed(seed) { Faker::Internet.email }
  )
  seed_for_accounts = Random.new(seed).rand(SEED_MAX)
  generate_accounts(user, seed_for_accounts)
end

def generate_accounts(user, accounts_seed)
  rng = Random.new(accounts_seed)
  with_faker_random(rng) do
    account = user.accounts.create!(iban: Faker::Bank.iban)
    generate_transactions(account, rng.rand(SEED_MAX))
  end
end

def generate_transactions(account, transactions_seed)
  amount_range_cents = 1..2500_00
  time_range = Time.utc(2021, 1, 25)..Time.now.utc
  multipliers = [1, -1] # For debit/credit

  rng = Random.new(transactions_seed)
  n_txns = rng.rand(50..3000)
  n_txns.times do |n|
    amount = rng.rand(amount_range_cents) * multipliers.sample(random: rng)
    created_at = rng.rand(time_range)
    account.transactions.create!(amount:, created_at:)
  end
end

If you read this closely you will see that the entire output graph (User -< Account(N) -< Transaction(N)) gets produced from a single random seed, percolating downwards.

There is one tricky bit here: note the difference between with_faker_random and with_faker_seed. The with_faker_seed variant is useful when we need to output multiple pieces of data which should be deterministic regardless of order. Specifically, imagine we want our User to be generated with the same, predictable email and the same, predictable name. If we set the RNG for Faker and “leave it be”, the values generated depend on call order - which is to be expected, because Faker repeatedly ratchets our RNG to extract entropy from it - and the RNG is stateful:

irb(main):033> Faker::Config.random = Random.new(42)
=> #<Random:0x00000001395b3fa8>
irb(main):034> Faker::Name.name_with_middle
=> "Brittany Klocko Prohaska"
irb(main):035> Faker::Internet.email
=> "lyndon_rempel@schumm-jaskolski.test"

But if we do the calls in reverse, the output changes!

irb(main):036> Faker::Config.random = Random.new(42)
=> #<Random:0x000000012ea9dfd8>
irb(main):037> Faker::Internet.email
=> "ruby_ebert@hamill.test"
irb(main):038> Faker::Name.name_with_middle
=> "Zachery Weimann McGlynn"

If that is OK for your use case - that’s fine, and it will be faster too. But if you want output that is deterministic regardless of call order, the correct way is to assign a “fresh” RNG before every Faker invocation:

irb(main):039> Faker::Config.random = Random.new(42)
=> #<Random:0x0000000129411458>
irb(main):040> Faker::Name.name_with_middle
=> "Brittany Klocko Prohaska"
irb(main):041> Faker::Config.random = Random.new(42)
=> #<Random:0x00000001396155a0>
irb(main):042> Faker::Internet.email
=> "ruby_ebert@hamill.test"

It can be tricky to twist your head around this (pun intended!) but it can create fairly nice fractal graphs of objects.

Reproducing distributions

Imagine you want to simulate some background jobs. Your APM gives you a distribution of the job run time. You know that the mean time for the job to complete is 433ms, but the p90 can go up to 1.56s. You want to simulate a queue which is saturated with those jobs, and you want them to have a similar profile. You can do that, and you can do it in a predictable way:

# This class takes the mean and the p90 (which will normally be durations of a task taken from Appsignal)
# and uses those to generate a random job duration. It should allow us to have reasonable simulated
# job durations. It samples a normal distribution using the
# https://en.wikipedia.org/wiki/Box%E2%80%93Muller_transform.
class Distribution
  def initialize(mean:, p90:, random: Random.new)
    @p90 = p90.to_f
    @mean = mean.to_f
    # The 90th percentile of a normal distribution lies roughly 1.28
    # standard deviations above the mean
    @std_dev = (@p90 - @mean) / 1.28
    @random = random
  end

  def value
    # Box-Muller transform: turn two uniform samples into one
    # normally-distributed sample
    u1 = @random.rand
    u2 = @random.rand
    r = Math.sqrt(-2 * Math.log(u1))
    theta = 2 * Math::PI * u2
    z = r * Math.cos(theta)
    value = @mean + @std_dev * z
    value < 0 ? @mean - value : value # Fold negative durations back above zero
  end
end

and then, to create your job durations:

rng = Random.new(42)
dist = Distribution.new(mean: 433.0 / 1000, p90: 1.56, random: rng)
2000.times do
  simulator.enqueue(SimulatedJob.new(duration: dist.value))
end

This approach allows you to create very large datasets, which are perfect for performance testing. Need a few million bank transactions to test out a hypothesis? Grab your production distributions for a plausible corpus, arm yourself with some wicked-fast SQLite INSERT mojo and Faker, and you have a 100% reproducible corpus generator that runs in seconds!

To summarize

Mersenne twisters are awesome, and Random is, in my view, the most undervalued Ruby class in the standard library. It can generate mock data sets for you, it can drive entire graphs of output from Faker, and it can make your test suites super-reliable. And it is wicked fast.

Also, it’s not really random - and I love cheating that way.