I feel this benchmark compares apples to oranges in some cases.
For example, for node, the author puts a million promises into the runtime event loop and uses `Promise.all` to wait for them all.
This is very different from, say, the Go version, where the author creates a million goroutines and defers `waitgroup.Done()` in each one.
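For Go, that pattern looks roughly like this (a minimal sketch based on the description above, not necessarily the author's exact code):

```go
package main

import (
	"os"
	"strconv"
	"sync"
	"time"
)

func main() {
	// Number of concurrent tasks, taken from the command line.
	numTasks, _ := strconv.Atoi(os.Args[1])

	var wg sync.WaitGroup
	for i := 0; i < numTasks; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done() // signal completion when this goroutine returns
			time.Sleep(10 * time.Second)
		}()
	}
	wg.Wait() // block until every goroutine has called Done
}
```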
While this might be the idiomatic way to do concurrency in the respective languages, it does not account for how goroutines are fundamentally different from promises, and for how the runtimes do things differently. For JS there's a single event loop; between the JS execution thread, the event loop, and whatever else the runtime uses for async I/O, the execution model is fundamentally different from Go's. Go (unless you change `GOMAXPROCS`) spawns an OS thread for every hardware thread your machine has, and then uses a userspace scheduler to distribute goroutines across those threads. It may spawn more OS threads to account for OS threads sleeping in syscalls, although I don't think the runtime will spawn extra threads in this case.
It also depends on what the "concurrent tasks" (I know, concurrency != parallelism) are. Tasks such as reading a file or doing a network call are better done with something like promises, but CPU-bound tasks are better done with goroutines or Node worker_threads. It would be interesting to see how the memory usage changes when doing async I/O vs CPU-bound tasks concurrently in different languages.
Actually, I think this benchmark did the right thing, something I wish more benchmarks would do. I'm much less interested in what the differences between compilers
are than in what the actual output will be if I ask a professional Go or Node.js dev to solve the same task. (TBF, it would've been better if the benchmarked task were something useful, e.g. handling an HTTP request.)
Go heavily encourages a certain kind of programming; JavaScript heavily encourages a different kind; and the article does a great job at showing what the consequences are.
As far as I know there is no way to do Promise-like async in Go; you HAVE to create a goroutine for each concurrent async task. If this is really the case, then I believe the submission is valid.
But I do think that spawning a goroutine just to do a non-blocking task and get its return value is kinda wasteful.
The requirement is to run 1 million concurrent tasks.
Of course each language will have a different way of achieving this, each with its own pros/cons. That's why we have these different languages to begin with.
This depends a lot on how you define "concurrent tasks", but the article provides a definition:
> Let's launch N concurrent tasks, where each task waits for 10 seconds and then the program exits after all tasks finish. The number of tasks is controlled by the command line argument.
Leaving aside semantics like "since the tasks aren't specified as doing anything with side effects, the compiler can remove them as dead code", all you really need here is a timer and a continuation for each "task" -- i.e. 24 bytes on most platforms. Allowing for allocation overhead and a data structure to manage all the timers efficiently, you might use as much as double that; with some tricks (e.g. function pointer compression) you could get it down to half that.
Eyeballing the graph, it looks like the winner is around 200MB for 1M concurrent tasks, so about 4x worse than a reasonably efficient but not heavily optimized implementation would be.
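(For concreteness, presumably that 24 bytes is an 8-byte deadline plus a 16-byte continuation -- code pointer and context pointer -- so 1M tasks is ~24 MB, or ~48 MB after doubling for overhead, against ~200 MB for the winner; hence the ~4x.)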
I have no idea what Go is doing to get 2500 bytes per task.
> I have no idea what Go is doing to get 2500 bytes per task.
TFA creates a goroutine (green thread) for each task (using a waitgroup to synchronise them). IIRC goroutines default to 2 KiB stacks, so 1M of them is ~2 GiB of stack alone before any runtime bookkeeping; that's about right.
One could argue it’s not fair and it should be timers which would be much lighter. There’s no “efficient wait” for them but that’s essentially the same as the appendix rust program.
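Roughly, a timer-based Go variant could look like this (my own sketch, not code from the article); during the 10-second wait each pending task is then just a runtime timer rather than a goroutine with its own stack:

```go
package main

import (
	"os"
	"strconv"
	"sync"
	"time"
)

func main() {
	numTasks, _ := strconv.Atoi(os.Args[1])

	var wg sync.WaitGroup
	wg.Add(numTasks)
	for i := 0; i < numTasks; i++ {
		// time.AfterFunc registers a runtime timer and only spins up a
		// short-lived goroutine to run the callback when the timer fires,
		// so no 2 KiB stack sits around for the whole 10 seconds.
		time.AfterFunc(10*time.Second, wg.Done)
	}
	wg.Wait()
}
```

Whether waiting on a WaitGroup fed by timer callbacks counts as an "efficient wait" is debatable, but memory-wise it is much closer to the timer-plus-continuation model than to one stack per task.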
It would be nice if the author also compared different runtimes (e.g. Node.js vs Deno, or CPython vs PyPy) and core language engines (e.g. V8 vs SpiderMonkey vs JavaScriptCore).
> high number of concurrent tasks can consume a significant amount of memory
Note the absolute numbers here: in the worst case, 1M tasks consumed 2.7 GB of RAM, with ~2700 bytes of overhead per task. That'd still fit in the cheapest server with room to spare.
My conclusion would be the opposite: as long as per-task data is more than a few KB, the memory overhead of the task scheduler is negligible.
I write (async) Rust regularly, and I don't understand how the version in the appendix doesn't take 10x1,000,000 seconds to complete. In other words, I'd have expected no concurrency to take place.
Am I wrong?
UPDATE: From the replies below, it looks like I was right that no concurrency takes place, but I was wrong about how long it takes, because `tokio::time::sleep()` keeps track of when the future was created (i.e. when `sleep()` was called) instead of when the future is first `.await`ed (which was my unsaid assumption).
The implementation of `sleep` [1] decides the wake-up time by when `sleep` is called, rather than when its future is polled. So the first task waits ten seconds, then the remaining tasks see that they have already passed their wake-up time and so return instantly.
I think the points people made in other replies make sense, but "tokio::sleep is async" by itself is not enough of an explanation. If it were the case that `tokio::time::sleep()` tracked the moment `.await` was first called as its start time, I believe it would indeed take 10x1,000,000 seconds, _even if it's async_.
Yeah, I think you're wrong. It should only take ~10s. tokio::time::sleep records the time it was called before returning the future [1]. So, all 1 million tasks should be stamped with +/- the same time (within a few milliseconds).
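For anyone more comfortable with Go than with Rust futures, the same "clock starts when the timer is created, not when you wait on it" behaviour is easy to reproduce with plain Go timers (my own analogy, unrelated to the article's code):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	start := time.Now()

	// Each time.After starts its timer immediately, much like
	// tokio::time::sleep records its deadline at the moment it is called.
	timers := make([]<-chan time.Time, 0, 1000)
	for i := 0; i < 1000; i++ {
		timers = append(timers, time.After(10*time.Second))
	}

	// Receiving from them one after another still finishes ~10s after
	// start: the first receive blocks for the full duration, and by then
	// every other timer has already fired.
	for _, t := range timers {
		<-t
	}

	fmt.Println("elapsed:", time.Since(start)) // ≈10s, not 1000 × 10s
}
```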
The referencesource repository is only relevant if you are using the legacy .NET Framework. Modern .NET has a special case for passing a List<Task> and avoids the allocation:
It doesn't take that much space, and not all languages have the option to easily map an initial range onto an iterator that produces tasks. Most are dominated by the size of the state machines/virtual threads.
Please note that the link above leads to old code from .NET Framework.
This benchmark is nonsense. Apart from the fact that Go has an average goroutine overhead of a 4 kB stack (meaning an average usage of 3.9 GB for 1M tasks), the code is also written in a closure, and a 2nd goroutine's worth of work is scheduled in the wg.Done(), so unlike some of the others it had at least 2M function calls on the scheduler in addition to at least 1M closure references. So yeah, it’s a great example of bad code in any language.
I'm not sure what "memory efficient" means here. But Go sprang up as a competitor to Java (portability, language stability, corporate language support/development) and C++ (faster compile times). You can't beat C++ in terms of memory management (performance, guys, not safety) by much. But you can fare well against the JVM, I'm guessing.
In this benchmark, actually, no, Go doesn't fare well: the static overhead per goroutine is higher than per JVM VirtualThread.
I presume this is because of a larger initial stack size, though.
This probably doesn't matter in the real world, as you will actually use the tasks to do some real work, which should dwarf the static overhead in almost all cases.
Note 1: The gist is in Ukrainian, and the blog post by Steve does a much better job, but hopefully you will find this useful. Feel free to replicate the results and post them.
Note 2: The absolute numbers do not necessarily imply good/bad. Both Go and BEAM focus on userspace scheduling and its fairness. Stackful coroutines have their own advantages. I think where the blog post's data is most relevant is in understanding the advantages of stackless coroutines when it comes to "highly granular" concurrency - dispatching concurrent requests, fanning out to process many small items at once, etc. In any case, I did not expect sibling comments to go on to praise Node.js; is it really that surprising for event-loop-based concurrency? :)
Also, if you are an Elixir aficionado and were impressed by C#'s numbers - know that they translate ~1:1 to F# now that it has task CE, just sayin'.