Show HN: Retry a command with exponential backoff and jitter (+ Starlark exprs)

stevekemp · 2 days ago

I have a collection of small sysadming/scripting utilities distributed as a single binary here:

One of those is "splay" to sleep a random amount of time, before running a command. Very useful to avoid lots of things running across a fleet at the same time.

reply

networked · 2 days ago

I actually had SysBox starred, but `splay` wasn't on the list of alternatives. This has been fixed.

reply

broken_broken_ · 2 days ago

Hey, that’s funny, I wrote a blog post about the many ways you can implement such a program, and it was discussed on HN: https://news.ycombinator.com/item?id=42103200

reply

Terretta · 2 days ago

For Python, consider Tenacity: https://tenacity.readthedocs.io/en/latest/

At the CLI, this is nice for not depending on Node.

reply

ddorian43 · 2 days ago

Stamina for good defaults on Tenacity: https://github.com/hynek/stamina

reply

derhuerst · 2 days ago

for Node.js, consider retry: https://www.npmjs.com/package/retry

reply

yodon · 2 days ago

>For Python, consider Tenacity

>>For node.js, consider retry

For .NET, consider Polly[0]

[0]https://www.pollydocs.org/

reply

derr1 · 2 days ago

Would also recommend opnieuw: https://github.com/channable/opnieuw

reply

maleldil · 2 days ago

backoff is good, too: https://github.com/litl/backoff

I moved away from tenacity because of type-checking issues. I might check out stamina next time.

reply

lordswork · 2 days ago

Does recur depend on node?

reply

jstanley · 2 days ago

My view is that you basically never want exponential backoff.

The only time exponential backoff is useful is if the failure is due to a rate limit and you specifically need a mechanism to reduce the rate at which you are attempting to use it.

In the common case that the thing you're trying to talk is just down, exponential backoff with base N (e.g. wait 2x longer each time) increases your expected downtime by a factor of N (e.g. 2), because by the time your dependency is working again, you may be waiting up to the same amount of time again before you even retry it! Meanwhile, your service is down and your customers can't use it and your program is doing nothing but sleeping for another 30 minutes before it even checks to see if it can work.

And for what? What is the downside to you if your program retries much more frequently?

I much prefer setting a fixed time period to wait between retries (would you call that linear backoff? no backoff?), so for example if the thing fails you just sleep 1 second and try again, forever. And then your service is working again within 1 second of your dependency coming back up.

If you really must use exponential backoff then pick a quite-low upper bound on how long you'll wait between retries. It is extremely frustrating to find out that something wasn't working just because it was sleeping for a long time because the previous handful of attempts failed.

reply

dragonwriter · 2 days ago

> The only time exponential backoff is useful is if the failure is due to a rate limit and you specifically need a mechanism to reduce the rate at which you are attempting to use it.

Exponential backoff is applicable to any failure where the time it has so far gone unresolved is the primary piece of available data on how lilong it is likely to take before being resolved, which is a very common situation, which is why it is a good default for most situations where you don’t have a better knowable-in-advance information at hand and the probability distribution ofn time to resolve, and where delays aren't super costly (though knowledge of when delays become costly can be used to set a cap on exponential backoff, too.)

reply

jstanley · 2 days ago

> and where delays aren't super costly

This is key.

If delays aren't costly, sure, any algorithm is fine.

But if you want your service up as soon as possible, why spend pointless minutes calling sleep() when you could be getting things working sooner?

reply

dragonwriter · 2 days ago

> If delays aren't costly, sure, any algorithm is fine.

Not true, delays aren't the only costs. Compute, network, and developer time digging through logs all cost money, and hammering a service fruitlessly when it is down adds to all of those.

(It also doesn't help that “system is overloaded” may sometimes be communicated clearly, but also in many systems can manifest in... just about any other kind of error, too.)

reply

mplewis · 2 days ago

I've used a fixed-time retry before. It DOSed the target server. Now, I use exponential backoff.

reply

thegrim33 · 2 days ago

The jitter part is important too, to spread the retries out more in time.

reply

jperras · 2 days ago

> The only time exponential backoff is useful is if the failure is due to a rate limit and you specifically need a mechanism to reduce the rate at which you are attempting to use it.

That's what you should be using exponential backoff for. In actuality, the new latency introduced by the backoff should be maintained for some time even after a successful request has been received, and gradually over time the interval reduced.

> I much prefer setting a fixed time period to wait between retries (would you call that linear backoff? no backoff?)

I've heard it referred to as truncated exponential backoff.

reply

dragonwriter · 2 days ago

That's only truncated exponential backoff if you do exponential backoff to some point.

If its just a fixed retry interval, then its... a fixed retry interval.

reply

Gasp0de · 2 days ago

Instead of using a fixed backoff, just use a token-bucket algorithm. Try a few requests every now and then and have each successful request enable another retry.

reply

throwaway314155 · 2 days ago

> basically never want exponential backoff.

> [unless] due to a rate limit

Pretty common use-case for automatic retries...

reply

jstanley · 2 days ago

That's fine!

If it's for a rate limit, sure, use exponential backoff, and put a known upper bound on how long you'll back off for.

But don't make your application wait for 50 minutes to come back up just because the database was down for 25 minutes. That's the kind of thing I am protesting here.

reply

itslennysfault · 2 days ago

If I had a nickel for every time I've written exponential backoff with jitter I'd have like several nickels.

reply

evgpbfhnr · 2 days ago

Looks similar to https://github.com/rye/eb

reply

networked · 2 days ago

Thanks. I have added eb to https://github.com/dbohdan/recur#alternatives.

reply

mariusor · 2 days ago

I created a library for Go that includes retry mechanisms. I wouldn't call it an alternative, but it can be used to do similar things: https://pkg.go.dev/git.sr.ht/~mariusor/ssm#example-Retry

reply

chubot · 2 days ago

Great list! I appreciate whenever an open source project links to similar ones

(FWIW my own contribution is https://github.com/oils-for-unix/oils/wiki/Alternative-Shell... )

reply

iamjackg · 2 days ago

Very cool project! Just a suggestion: since you do have pre-built releases on GitHub, you should mention that in the Installation section of your readme.

reply

networked · 2 days ago

Thank you. It is a good suggestion. I have mentioned the binaries in the readme and linked to https://github.com/dbohdan/recur/releases.

reply

netvarun · 2 days ago

This looks cool - will give it a try (hah!) Curious on why you picked starlark instead of cel for the conditional scripting part?

reply

networked · 2 days ago

All right, let me tell you the history of recur to explain this choice. :-)

I wrote the initial version in Python in 2023 and used simpleeval [1] for the condition expressions. The readme for simpleeval states:

> I've done the best I can with this library - but there's no warranty, no guarantee, nada. A lot of very clever people think the whole idea of trying to sandbox CPython is impossible. Read the code yourself, and use it at your own risk.

In early 2024, simpleeval had a vulnerability report that drove the point home [2]. I wanted to switch to a safer expression library in the rare event someone passed untrusted arguments to `--condition` and as a matter of craft. (I like simpleeval, but I think it should be used for expressions in trusted or semi-trusted environments.)

First, I evaluated cel-python [3]. I didn't adopt it over the binary dependency on PyYAML and concerns about maturity. I also wished to avoid drastically changing the condition language.

Next in line was python-starlark-go [4], which I had only used in a project that ultimately didn't need complex config. I had been interested in Starlark for a while. It was an attractive alternative to conventional software configuration. I saw an opportunity to really try it.

A switch to python-starlark-go would have made platform-independent zipapps I built with shiv [5] no longer an option. This was when I realized I might as well port recur to Go, use starlark-go natively, and get static binaries out of it. I could have gone with cel-go, but like I said, I was interested in Starlark and wanted to keep expressions similar to how they were with simpleeval.

[1] https://github.com/danthedeckie/simpleeval

[2] https://github.com/danthedeckie/simpleeval/issues/138

[3] https://github.com/cloud-custodian/cel-python

[4] https://github.com/caketop/python-starlark-go

[5] https://github.com/linkedin/shiv#gotchas

reply

netvarun · 2 days ago

Wow that had been quite a journey - thanks for the detailed response. We’ve been using cel internally in a golang codebase and been pretty happy with it. I’ve only know about starlark in the Bazel context - I’ve learned a couple of things from your post. Thanks :)

reply

westurner · 2 days ago

There's not yet a Python implementation of Starlark (which, like Bazel, is a fork of Skylark FWIU)?

All that have tried to sandbox Python with Python have failed. E.g. RestrictedPython and RPython

reply

networked · 1 day ago

No, there is no pure-Python Starlark. So far there are only Python bindings for the Go and the Rust implementation: https://github.com/laurentlb/awesome-starlark#getting-starte.... I thought about porting the Go implementation to Python. Doing it as a subgoal for porting recur seemed a little like scope creep. (Tokei says there are 16792 SLOC in the latest d4d7611 commit of starlark-go and 899 in recur 67b38c1.)

reply

morcus · 2 days ago

I believe Starlark was renamed Skylark, even internally. Bazel is a build system that uses Starlark as a configuration language, not a fork of Starlark.

reply

westurner · 2 days ago

Bazel is an open source rewrite of Blaze (which introduced Skylark)

reply

adonovan · 14 hours ago

Less a rewrite, more a variant: a majority of the source is common to both.

reply

westurner · 2 days ago

Systemd does exponential retry but IDK about jitter?

reply

westurner · 2 days ago

A systemd unit file with RestartMaxDelaySec= for exponential back off:

  [Unit]
  Description="test exponential retry"
  
  [Service]
  ExecStart=-/bin/sh -c "date -Is"
  Type=simple
  Restart=on-failure
  RestartSec=60ms
  RestartMaxDelaySec=¶
  RestartSteps=
  
  # StartLimitBurst=2
  # StartLimitIntervalSec=30
  
  # OnFailure=
   
  # TimeoutStartSec=
  # TimeoutStartFailureMode=
   
  # WatchdogSec=
   
  # RestartPreventExitStatus=
  # RestartForceExitStatus=
  
  # OOMPolicy=
   
  [Install]
  WantedBy=default.target

man systemd.service https://www.freedesktop.org/software/systemd/man/latest/syst...

man systemd.unit https://www.freedesktop.org/software/systemd/man/latest/syst...

man systemd.resource-control https://www.freedesktop.org/software/systemd/man/latest/syst...

man systemd.service > "Table 1. Exit causes and the effect of the Restart= settings" https://www.freedesktop.org/software/systemd/man/latest/syst...

"RFE: Exponentially increasing RestartSec=" https://github.com/systemd/systemd/issues/6129#issuecomment-...

"core: add RestartSteps= and RestartSecMax= to dynamically increase interval between auto restarts" https://github.com/systemd/systemd/pull/26902

> RestartSteps= accepts a positive integer as the number of steps to take to increase the interval between auto-restarts from RestartSec= to RestartMaxDelaySec=

/? RestartMaxDelaySec https://www.google.com/search?q=RestartMaxDelaySec

Is this where to add jitter?

"core/service: make restart delay increase more smoothly" https://github.com/systemd/systemd/pull/28266/files

reply

greatgib · 2 days ago

Typically the kind of library that is useless and root cause of the dependency hell we are now living in.

That kind of simple things should be a basic inside once program or at worse a simple snipper copied from stack overflow or anything like that

reply

yoavm · 2 days ago

It does not seem to be a library at all, so very little to do with dependency hell. It's something you prepend to your commands if you want them to retry until they succeed. Seems pretty useful to me.

reply

bobnamob · 2 days ago

Also, retries are more nuanced than most people expect, see [1][2]. Getting them right is exactly something I’d appreciate in a library and not something I’d want to reimplement per project/service.

[1] https://brooker.co.za/blog/2022/02/28/retries.html

[2] https://medium.com/yandex/good-retry-bad-retry-an-incident-s...

reply

crest · 2 days ago

A non-trivial application should not add a dependency for just exponential backoff + proportional jitter, an easy to use wrapper to put around a quick script is a good idea that and the lack of such a "basic" defensive programming technique has made untold initially small problems a lot worse by creating a thundering herd.

reply