Reviving zip_tricks as zip_kit
Well-made software has a lifetime, and the lifetime is finite. However, sometimes software becomes neglected way before its lifetime comes to an end. Not obsoleted, not replaced - just… neglected. Recently I have decided to resurrect one such piece of software.
See, zip_tricks holds a special place in my heart. It was quite difficult to make, tricky, but exceptionally rewarding. It also went through a number of iterations, and working on it taught me a great deal. How short methods are not always a good thing. How it is important to provide defaults. How over-reliance on teensy-tiny objects can make software hard to read and understand (as is the case with Rubyzip). And how open source might work in a corporate setting.
What follows is the story of how zip_tricks became zip_kit and what I have learned along the way.
There was a time in my life when I was trying to do a startup. As a matter of fact, I tried to do the same startup twice and also failed twice at that – if I ever get over my anxieties and regrets I might write a story about that too. Anyway, during that second attempt at it, I wanted to implement some functionality so that you could download a pack of files all-packaged into a neat ZIP.
And since I built a gem exactly for that - what could be more obvious than to use it? So we did. Except that… it didn’t stream. I have put resolving this into our backlog, as there was just… so much other stuff to do. A couple of years after, I joined Cheddar - and again, found that we had a few report-downloading endpoints which were generating ZIPs. And, oh-horror, they were using Rubyzip and they were generating tempfiles, and it was slow and unwieldy. How could I walk past this and leave this untouched? Again: some code replaced, some tweaks made… and the thing is not streaming. It just wouldn’t! And I could not really pinpoint why.
This led me to investigate what really went wrong, and where. But I didn’t like the idea that I have built a library, and I know it is broken, and I would make a fix but would not be able to share it. So I went over to the zip_tricks repo on Github and found that… no updates to it have taken place, since the time I left. This led me to ponder…
Story of zip_tricks (and zip_kit)
zip_tricks got started as a set of hacks on top of Rubyzip, which I basically copied from zipline. We wanted to assemble ZIPs on the fly, and Ruby did not, at the time, have any libraries which would do that. There was a gem which did work, to an extent, but it did not support Zip64 (large files), which we absolutely required to work. Once it became clear that the very fabric of Rubyzip does not allow us to build a streaming solution, zip_tricks (which used to be just an internal library) dropped Rubyzip as a dependency, and started providing a ZIP writer of its own. Later on we needed to do a large migration, which entailed “unpacking” the existing ZIP transfers into their constituent files - so that they could afterwards get re-assembled on the fly. The number of those ZIPs - which were pre-generated for every transfer - neared a dozen million, if not more. Most of them - very large. Some of them - having corrupted structure. It would simply not be possible to “expand” those ZIPs by downloading them onto the EC2 instances first. The filesystem on EC2 is severely throttled, and some of the files would simply not fit on the EC2 instances. So the second application of zip_tricks was to permit this “lazy unpacking” of ZIP files which could be done directly via HTTP.
Then we have overhauled our download servers and implemented resumable ZIP downloads with dynamic addressing. To my knowledge, this is something even Google Drive does not support to this day - remember the “Preparing ZIP” popover? How it works is described in this GH issue and this blog article, and it worked perfectly. This was actually the main use-case for zip_tricks once “the big migration” was completed. We would generate a “manifest” containing URLs for edge-includes and some ZIP file bytes for the central directory and local headers, and our download servers would stream those manifests out, substituting actual files from S3.
After I have left WeTransfer, zip_tricks became abandoned. For a number of reasons, really. The strong Ruby clique at WT has mostly disintegrated with myself leaving for Cheddar, Fabio leaving for Booking - and a few other folks being gone too. To much of my chagrin, TypeScript and Go have won at WT without much hope for recourse for folks still writing Ruby, even though there was a lot of potential to realize. As a matter of policy, myself and others who left lost committer rights to all the OSS repositories they were maintaining.
That was unfortunate. I had no grudges and was, in principle, open to maintaining stuff I used to maintain indefinitely - particularly the things that I was using myself. It was not to be, however. The few PRs I have opened stayed open for a long while, and there wasn’t a single release of zip_tricks since I have left.
And zip_tricks holds a special place in my heart. I really wanted to fix it, to “own” it again (“ownership” in the context of teamwork and organisations in general is something I still haven’t been able to wrap my head around). Since taking the software over turned out not to be an option, I decided to take the “measure 0” and fork zip_tricks under a different name.
And so came zip_kit.
A few changes came with that.
Preventing corporate open-source abandonware
What happened as part of WT’s open source policy actually led me to adopt a completely different approach at Cheddar. Everybody who left the company on friendly terms stays a contributor to the open-source libs. We hardly had a moment when this was objectionable.
It might just be that the nature of open-source - which makes forking possible - has made resurrecting zip_kit viable. And I am very grateful for that.
And there is something we are not on the ball with - maintainership-post-leave. Quite a few corporate OSS projects are not actually corporate-backed. They are backed insofar as the maintainer gets paid to work on that software within their usual work responsibilities - because it is needed by the business, but the fact that the software is open-source is often an afterthought. The company does get its profit by having a better public image (“we are supporting open-source”), the maintainer gets profit by having creative control (“look how I am releasing software without being subjected to the feature-factory-waterfall-scrumgile-theater”). But while a company might not help the maintainer with any marketing or publicity - something that is very much necessary for an OSS project to succeed - it will often claim copyright and control over that OSS. This means that once the maintainer leaves the company, it is not very likely they will be able to “carry” the software with them and keep maintaining it. “Why is that weird person still part of our Github organization and costing us 4 bucks a month?” “-Oh, that’s Jake and they used to work here 3 years ago and they wrote libfoo, which we are not even using anymore, but they still need access”.
I would posit that 99% of what we consider “corporate-backed OSS” at this point is something like libfoo, and we – as an industry – do not necessarily have a good story in place for assuring maintenance. That provided a maintainer is willing to step up and do it, which might not even be the case. People change, their incentives and priorities change, people become caregivers, people burn out, people switch stacks just because they feel like it, people… get ill and die, too.
What the OSS policy has led to in case of zip_tricks was that even though I was willing to continue maintaining and have been sending PRs, they went unmerged for months.
A good open source policy for your org might be “no policy”
Back at WT, at one point one of the team members has raised a point that “we need an open-source policy”. In retrospect, this was a premature call. What came out of that was:
- We needed a CLA policy. That was probably a good call, but people were not allowed to implement automation for this - and so it didn’t come to pass.
- Managing GH repos would be in purview of one of the teams (which never published its own open-source software). People authoring OSS would not be permitted to allow external contributors
- Every time an OSS library needed, say, cloud resources, there would be a negotiation process to get them. Since a lot of the software had to do with the cloud (AWS for us), this hampered the flexibility considerably as well
- The “spicy question” of “what to do if a contributor no longer works here” was left unanswered
Most importantly - it created too much process, and it created that process too soon. In retrospect, I should have resisted more on this, and reduced that policy to an absolute minimum. The triangle of agency, autonomy, responsibility was not shaped well there.
Having seen this first-hand, I would say that the only thing that truly makes sense is a good CLA process. A proper provision with a CLA would be that the contributors do not object to you changing the licensing terms on the open-source software you are producing.
And that’s it. No - really - that’s it. For 99% of the joints out there, this will be more than adequate.
Forking as last resort
There was, however, some possibility for a revival. Older versions of zip_tricks did not carry the Hippocratic license, and while the name could be considered copyrighted, the actual code could not. And so the decision came about fairly organically: fork the library based on the MIT-licensed branch (which we did keep in place, to not pull the rug from under people who could not permit themselves to use a Hippocratic-licensed library), backport useful changes (most of them were mine anyway), and release it under a different name. And that’s how zip_kit came to be.
The funny part here is that a lot of the “hardliner” provisions of “true” open-source (forking is always an option, a license cannot be turned into a more restrictive license willy-nilly, code can’t be copyrighted…) actually permitted the library to survive.
Bidding farewell to the Hippocratic license
When WT went all-in on the Hippocratic license, it seemed to be a good experiment. I, myself, was mildly curious to try it - who doesn’t want to “do the good thing” and “hold back those who do bad things”? Yet: while the emotional message of the Hippocratic license is just, I no longer believe it to be the right call. The problem with the “do no harm” licenses is that they impose a responsibility which is the hardest to enforce when one is in the business of tool making - the responsibility to forbid certain people (or parties) from using your tools. There are instruments of comparable purpose, such as economic sanctions, but any instrument of that nature is only as good as it is enforceable. The Hippocratic license is not enforceable for an even moderately-sized company, and much less so - for an individual. Moreover, most folks writing Ruby applications are creating commercial solutions. A commercial solution implies usage by parties you may not even know. It might be used by a company that acquires yours. It might be used by a contractor of a company you have sold to. A company you have sold to may have an “evil” other company amongst its hundreds (or thousands) of clients.
In practice, what the Hippocratic license meant for the software I have built at WT, was this:
- Most joints would just avoid using that software - exactly because the license imposed liabilities on them they would not be able to hold themselves accountable for managing, and sometimes would not even bother to understand
- zip_tricks did not become a viable alternative for rubyzip partially because its license was not liberal
- These things were going “to market” in an already minuscule ecosystem (the Ruby/Rails ecosystem), where - even if they had a chance of being popular - they would always be dangerous goods.
And have no doubt about it - the OSS software market is a market, only the currency there is the expertise and reputation of the author. While you can’t necessarily “sell” OSS software as a product - although some are trying - you absolutely do market yourself as the maker. Make your bets right and you can count on more interesting gigs, better employment opportunities, and other perks that a lot of… ahem… exposure can get you. Of course, in OSS a lot can get done (and does get done) out of altruistic motives. But pretending we don’t want our OSS software to be popular would be… hypocritical.
The more practical “fallout” of the license was this:
- The first dependent library, zipline had to accommodate changes to allow older zip_tricks versions to be used if people are not OK with using the Hippocratic license. Those older versions did not include important bug fixes and tweaks, and code had to be backported into zipline to replicate changes done to zip_tricks itself just to support that capability
- The second dependent library, xlsxtream had to carry an alternative ZIP writer adapter which would allow people to keep using Rubyzip if they do not agree to the license
- The third dependent library… never happened. Plenty of libraries use Rubyzip, but with a restrictive license in tow it was impossible for zip_tricks to sway them over. I don’t have quantitative evidence, but I do have a strong conviction that the fact that zip_tricks remained unpopular was - for the most part - the Hippocratic license it started to carry.
Knowing that the Hippocratic license was not working, I never moved any of my own libraries to it - but at the end of my tenure I also have decided not to release libraries via WT, for that reason among others (the other was that managing the Github configuration became a huge nuisance).
With the decision to fork, I was facing a bit of a conundrum. How would it look if I were to take the code that others wrote - assuming we were committed to the Hippocratic license - and remerge it under the MIT license zip_tricks had previously? This is actually something that can be scoffed at. I am, after all, doing something contributors likely haven’t signed up for - making their contributions available to all those Evil Enterprises and Oppressive Government Agencies, right?
In the end the following solution presented itself: I made most of the contributions to zip_tricks. There were a few other contributors, and one of them stood out. If I were to contact them all and wait for their permission, forking under a more permissive license would have taken a very long time indeed. Instead, I have decided to go about it like this:
- I would contact the most prolific contributors only, and ask them whether they have objections to the license change
- I would allow anyone’s code to be backed out if they did not agree to the licensing policy.
And so it went. And I’ll mention here: if, by any chance, you have contributed to zip_tricks and you object to the license change in zip_kit, please do one of the following:
- Contact me (the contact details are on the site as well as in the zip_kit repo) and ask me to back out your contributions, pointing to them specifically
- Continue using zip_tricks instead of zip_kit. You are on your own though regarding upstreaming fixes.
With the above, I would now not recommend the MIT-Hippocratic license for new projects. Or rather: I would not recommend it if you value adoption of your software. It is a great idea, but I can’t see it work well.
Of Rack and streaming
Next step was to figure out why streaming was broken. Remember the issue I started the article with? Well, it turned out that it was quite a peculiar thing. See, zipline did not have that problem - but zip_tricks did! So there was something in there that was causing buffering, and consequently - made zip_tricks useless for exactly the purpose it was made for.
Once I started digging, I was surprised to find that I was not the only one who started to experience buffering. The issue turned out to be the Rack::ETag middleware. This middleware started getting included in the default Rails middleware stack, and what it does is… strange. It computes a “weak” ETag if the underlying Rack app did not compute one, and if the response may be cached. I am, personally, not a fan of solutions like this. It is not wise to buffer (or to iterate over) responses of arbitrary size.
For example, while idempo does buffer and cache responses, it does so very carefully. It tries to buffer only in a few, very specific cases: when the response can be sized upfront, when the response already is sized (by a supplied Content-Length HTTP header), and when that size is below a certain threshold. Rack::ETag, however, goes about it in a much less gentle manner. Any response which can be cached (or does not have a Last-Modified or ETag header) is going to be checksummed in full before serving starts. The absence of those two headers is what actually caused zip_tricks not to stream inside Rails. This reminded me that while Rack is (in my view) a very decent interface, its semantics are sometimes hard to nail down - and all it takes is one misbehaved middleware to make the whole thing buffer.
It gets even more interesting though. It turned out that in a standard Rack package you would have not one, but two pieces of middleware which could buffer. The other offender is Rack::ContentLength. Now, Rails does not include ContentLength - and for a good reason, because it is just not a very well-designed piece of kit, sadly. But there is a fairly common piece of infrastructure which does add the ContentLength middleware, and it is a fairly unexpected one - the rackup binary. I tend to use rackup as the default “serve-this-app-via-Rack” command, and to my surprise rackup does forcibly add ContentLength even though nobody asks it to! Luckily this has been removed in recent rackup versions (moreover, rackup has become a separate library at this point).
So, what’s the verdict? Well, to fix streaming in (now-renamed) zip_kit, I had to do two things.
- For the Rack::ETag fix, the ZIP archive had to be output with headers containing either a recent Last-Modified or ETag header. This was relatively easy to fix.
- Rack::ContentLength is a harder problem. Since Rack is a “diamond dependency” for so many pieces of software (including Rails itself), and can’t thus be easily updated, I needed a way to “switch” Rack::ContentLength off. There is no “trigger header” implemented, so the only way to do this in a “legitimate” fashion was to pre-apply the chunked transfer-encoding. That was also fairly easy, given that a middleware for this encoding used to be a part of Rack all the way along, and got even copied into recent Rails.
I did get a recommendation from Samuel (of async fame) to not encode in chunked forcibly, but to leave it to the webserver. However, given that the Rack::ContentLength may get introduced into the stack without the user’s consent/knowledge, it was wise to make this an option which can be enabled explicitly. Just for “those” cases.
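Pre-applying the chunked encoding boils down to framing every piece of the body as "size in hex, CRLF, data, CRLF" and finishing with a zero-sized chunk. The sketch below shows that framing in plain Ruby. `ChunkedBody` and `streaming_headers` are hypothetical names made up for illustration, not the zip_kit API, and the lowercase header names again assume the Rack 3 convention.

```ruby
require "time"

# Wraps any body responding to #each and frames its output as HTTP/1.1 chunks.
# A hypothetical illustration of the technique, not code taken from zip_kit.
class ChunkedBody
  TERMINATOR = "0\r\n\r\n"

  def initialize(body)
    @body = body
  end

  def each
    return enum_for(:each) unless block_given?

    @body.each do |chunk|
      # A zero-sized chunk means "end of response", so empty writes are skipped.
      next if chunk.empty?

      yield "#{chunk.bytesize.to_s(16)}\r\n#{chunk}\r\n"
    end
    yield TERMINATOR
  end
end

# Headers that signal "already encoded, already has a validator" to the middleware
# discussed above.
def streaming_headers(headers)
  headers.merge(
    "transfer-encoding" => "chunked",
    "last-modified" => Time.now.httpdate
  )
end

ChunkedBody.new(["hello", "", "zip"]).each { |framed| print framed.inspect, "\n" }
```

The trade-off Samuel pointed at is visible here: once the application frames the body itself, the webserver must not apply the encoding a second time, which is a good reason to keep this behaviour opt-in.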
Once those were fixed, we were back in business - zip_kit was streaming from Rails again. Without using ActionController::Live, too!
Maybe you should try it
This has turned into a long rant - but this is about stuff that matters deeply to me. It is considered unhealthy to have a meaningful connection to your work in software, but despite all of my attempts I just can’t seem to let go of it. Maybe you should give zip_kit a whirl.