Supercharge SQLite with Ruby Functions

An interesting twist in my recent usage of SQLite was the fact that I noticed my research scripts and the database intertwine more. SQLite is unique in that it really lives in-process, unlike standalone database servers. There is a feature to that which does not get used very frequently, but can be indispensable in some situations.

By the way, the talk about the system that made me me to explore SQLite in anger can now be seen here.

Normally it is your Ruby (or Python, or Go, or whatever) program which calls SQLite to make it “do stuff”. Most calls will be mapped to a native call like sqlite3_exec() which will do “SQLite things” and return you a result, converted into data structures accessible to your runtime. But there is another possible direction here - SQLite can actually call your code instead.

There is no support for stored procedures in SQLite (which is imaginable, since the database already lives inside of a very large stored procedure - your program, basically). But: in addition to stored procedures, databases sometimes support something called UDFs, which stands for User-Defined Functions. Most SQL databases have some functions built in:

SELECT RANDOM()

calls a built-in SQL function called RANDOM, which will accept no arguments and return you a random integer. A call like this:

SELECT LOWER('HELLO')

calls a built-in SQL function called LOWER, which accepts one argument and returns that argument converted to lowercase. The beauty of it is, of course, that it does not have to be a literal - it can be a column resulting from a different part of the SQL statement:

SELECT id, UPPER(legal_name) FROM users

will select you the user names converted to uppercase. Important to note: these functions are what’s called scalar functions - they map a piece of data from a single result row to a piece of output. Another family of SQL functions are aggregate functions, which accept the entire result set and apply to all values in that dataset. These are your MIN and MAX functions:

SELECT MAX(age) FROM users

They map values from multiple rows to one piece of output. Same for SUM.

The builtin functions in SQLite are pretty neat, but the menu is not that large, and sometimes exactly the function we need is not on that menu, sadly. In most DB systems you can define functions in the language the database supports - usually it will be some kind of superset of SQL the database supports, or one of the scripting languages the database embeds. With SQLite, however, something much nicer is possible: you can expose your own code from your “main” program and make it available to your SQL queries. And in some situations it can be very handy.

Creating your own scalar UDF

For instance: at Cheddar we use a small library called tou - which is a flavour of time-ordered UUIDs. Let’s make it available to our database:

CREATE TABLE users (
  uid VARCHAR NOT NULL PRIMARY KEY,
  name VARCHAR NOT NULL
)

Then, let’s create an enumerator which will issue us a deterministic sequence of Tou IDs, all sorted to the same timestamp:

id_gen = Enumerator.new do |yielder|
  time = Time.at(1737665388.5792)
  random = Random.new(42)
  loop do
    yielder.yield(Tou.uuid(random:, time:))
  end
end

The nice thing about this Enumerator is that we can now grab us a fresh Tou UUID by calling next:

[13] pry(main)> id_gen.next
=> "062c65c0-3989-4806-9ce1-5fb33deacb5c"
[14] pry(main)> id_gen.next
=> "062c65c0-3989-480e-95f5-2e6af463bb47"

We can insert users into our table with those IDs like so:

db.query("INSERT INTO users (uid, name) VALUES (?, ?)", [id_gen.next, "John"])

and it will work, but sometimes we would want those values to be provided by SQLite itself. A good example is when doing a UNION ALL select to combine multiple datasets:

SELECT 'Legal drinking age' AS desc, COUNT(1) AS cnt FROM users WHERE age >= 21
UNION ALL SELECT 'Below legal drinking age' AS desc, COUNT(1) AS cnt FROM users WHERE age < 21

If this is for a report this is fine, but if we want to retain our result rows and address them, we may want to give them some kind of ID. Because we are letting the database produce us a dataset - using placeholders here is not possible. However, we can tell SQLite to generate those IDs by itself, for every row of the SELECT result sets:

SELECT TOU() AS id, 'Legal drinking age' AS desc, COUNT(1) AS cnt FROM users WHERE age >= 21
UNION ALL
SELECT TOU() AS id, 'Below legal drinking age' AS desc, COUNT(1) AS cnt FROM users WHERE age < 21

To be able to do that, we need to expose our little ID generator to SQLite. It is done like this:

db.create_function( "TOU", _arity = 0) do |func|
  func.result = id_gen.next
end

We create a closure which addresses our ID generator and ratchets it to a new value. The value then gets assigned to the result of the function.

[22] pry(main)> db.get_first_value("SELECT TOU()")
=> "062c65c0-3989-480c-ae41-99142ccb9866"

and now we can auto-ID our rows.

Another great use for scalar functions: doing some operation that you need to happen in your main runtime, and examining the results of this operation in bulk. Imagine we want to do some regex matching, but either SQLite does not have the regexp extension (is of an older version), or - which is something that actually happened to me - you want to test a Ruby regexp against all the items in your dataset. Regex engines are different in implementation and in their quirks. If you are designing regexes and you intend to run them in your Ruby runtime later, it can be very useful to test them against a dataset that you already have. So, let’s design a function which will just tell us whether a passed string matches a Ruby regexp:

db.create_function( "TEXT_MATCHES_RE", _arity = 2) do |func, text, regexp_str|
  re = Regexp.new(regexp_str)
  func.result = text&.match?(re) ? 1 : 0 # SQLite has no boolean type
end

Note how I word the name of the function in such a way that would hint the order of arguments (string to match first, regex pattern second). Let’s run it:

[30] pry(main)> db.get_first_value("SELECT TEXT_MATCHES_RE('john', '[a-z]')")
=> 1
[31] pry(main)> db.get_first_value("SELECT TEXT_MATCHES_RE('123', '[a-z]')")
=> 0

Since SQLite treats 1 and 0 as booleans we can use our new function in a WHERE clause (which is the main spot this kind of UDF will be useful):

db.query("CREATE TABLE users (name TEXT NOT NULL)");
db.query("INSERT INTO users (name) VALUES (?), (?), (?), (?)", ["Jane", "Bobby", "Jake", "Peter"]);
db.query("SELECT * FROM users WHERE TEXT_MATCHES_RE(name, '^Ja')").to_a # => [["Jane"], ["Jake"]]

Rolling it up with aggregates

When I was working on Sked - the scheduling simulator - one of the most useful things was rapidly doing statistical queries after the simulation ran. Doing things like computing the percentiles, the standard deviation and the like. While SQLite does have MIN and MAX, it does not have the more advanced functions that may be useful in those scenarios - like percentile_cont - luckily, though, these can be implemented right inside the script used for experimenting.

An aggregate function in SQLite can be defined by supplying callbacks, but the Ruby SQLite library offers another option - defining your aggregate function as a class which gets instantiated. The instance holds the state that needs to be kept while the aggregate function gets called (the so-called “step”) for every value in the resultset.

class StddevSampHandler
  def self.arity = 1
  def self.name = "stddev_samp"

  def initialize
    @sample = []
  end

  def step(ctx, value)
    @sample << value.to_f if value
  end

  def finalize(ctx)
    if @sample.length > 1
      mean = @sample.sum.to_f / @sample.length
      sd = @sample.map { |v| (v - mean)**2 }.sum
      ctx.result = Math.sqrt(sd / (@sample.length - 1))
    else
      ctx.result = 0
    end
    @sample.clear
  end
end

class PercentileHandler
  def self.arity = 1
  def self.name = "p90"

  def initialize
    @sample = []
    @fraction = 0.9
  end

  def step(ctx, value)
    @sample << value.to_f if value
  end

  def finalize(ctx)
    if @sample.length > 1
      @sample.sort!
      k = (@fraction * (@sample.length - 1) + 1).floor - 1
      f = (@fraction * (@sample.length - 1) + 1).modulo(1)
      ctx.result = @sample[k] + (f * (@sample[k + 1] - @sample[k]))
    else
      ctx.result = 0
    end
    @sample.clear
  end
end

db.create_aggregate_handler(StddevSampHandler)
db.create_aggregate_handler(PercentileHandler)

Then we can create our table of samples:

db.query("CREATE TABLE metric_samples (name VARCHAR NOT NULL, value FLOAT)")
rng = Random.new(42)
values = [1, 2, 3, 4, 4, 4, 5, 5, 6, 7, 7, 7, 8, 9, 9, 11, 12, 12, 25]
metrics = %w( latency duration )
199.times do
  db.query("INSERT INTO metric_samples (name, value) VALUES (?, ?)", [metrics.sample(random: rng), values.sample(random: rng)])
end

and query it:

db.query(<<~SQL).to_a
SELECT
  COUNT(1) AS sample_size,
  name,
  printf("%.2f", MIN(value)) AS min,
  printf("%.2f", MAX(value)) AS max,
  printf("%.2f", AVG(value)) AS avg,
  printf("%.2f", stddev_samp(value)) AS stddev,
  printf("%.2f", p90(value)) AS p90
FROM metric_samples GROUP BY name
SQL

which gives:

[
  [115, "duration", "1.00", "25.00", "6.10", "4.21", "12.00"],
  [84, "latency", "1.00", "25.00", "7.65", "5.78", "12.00"]
]

This turned out to be very useful when printing reports about the various simulations we ran for our queue workloads.

A stub is a stub

There is a caveat though. A SQL function can, normally, be used both in queries and in SQL statements run by the database itself. That may be triggers - which run on various changes to the tables, or the default values for columns. For example, nothing precludes us from creating a view with a TOU() value:

CREATE VIEW single_tou AS SELECT TOU()

and in our process, this will work just fine. However, an unpleasant surprise will await you if you try to open the database “outside” of your process - and portability is one of the great qualities of SQLite databases. When you open a database in sqlite or using your favourite GUI editor (I use SQLPro) you will be greeted with an error if you try to SELECT from that view or examine it. Unlike other RDBMSes, the function in SQLite is just a “callback” which delegates work to your host application, and cannot be stored with the database - or executed without your application having defined the function.

When defining a function, there is a flag you can supply - as per SQLite documentation - which is called SQLITE_DIRECTONLY - which hints SQLite that your function should not be used in “retained” SQL statements, and may only be used via a direct call. The way to set that is a bit obtuse, but it is available. There is an argument called text_rep that these flags may be OR-d with. The default value tells SQLite that it is getting text data in UTF-8. Let’s define a TOU2() which will be direct-only:

SQLITE_DIRECTONLY = 0x000080000
text_rep_and_flags = SQLite3::Constants::TextRep::UTF8 | SQLITE_DIRECTONLY
db.create_function("TOU2", _arity = 0, text_rep_and_flags) do |func|
  func.result = id_gen.next
end
db.get_first_value("SELECT TOU2()") #=> "062c65c0-3989-480a-91de-0eca55917557"

And we can create a view with that function, but if we try to SELECT from it, we will get an error:

db.query("CREATE VIEW no_tou AS SELECT TOU2()");

db.get_first_value("SELECT * FROM no_tou")
> SQLite3::SQLException: unsafe use of TOU2():
> SELECT * FROM no_tou

This is at least better than having this view hang around - but something to keep in mind. SQLITE_DETERMINISTIC also seems handy.

To SQLite and back again

Wiring Ruby methods into SQLite can give you some very interesting capabilities (doing HTTP requests from your SQLite calls, parsing HTML with Nokogiri, tokenizing and matching data) and are a great tool when exploring datasets. If you use a Ruby program to generate a SQLite database for later inspection, like I do - just mark them as SQLITE_DIRECTONLY to not be unpleasantly surprised. It would, of course, be wonderful to be able to “talk” to the SQLite database with the custom UDFs attached using a familiar GUI – but since the functions live in Ruby-land, it will be impractical.

Now go and make your SELECTions.