Running the benchmarks locally (zig is fastest, but that's probably expected)
zig (~1 week older than HEAD):
./zig-out/bin/qsort
filling
shuffling
running
took 177.46ms
rust nightly:
cargo run --release --bin qsort
warning: the cargo feature `edition2021` has been stabilized in the 1.56 release and is no longer necessary to be listed in the manifest
See https://doc.rust-lang.org/nightly/cargo/reference/manifest.html#the-edition-field for more information about using this feature.
Finished release [optimized] target(s) in 0.02s
Running `target/release/qsort`
filling
shuffling
running
took 896.91656ms
cargo run --release --bin rsort
Finished release [optimized] target(s) in 0.02s
Running `target/release/rsort`
filling
shuffling
running
took 212.190694ms
go 1.16.3:
go run qsort.go
filling
shuffling
running
took 222.90356ms
on macOS 11.5.2 with a 2.4 GHz 8-Core Intel Core i9
The fact that the Rust tokio version (the one that uses tokio tasks instead of threads) is slow is to be expected. tokio tasks aren't appropriate for running quicksort; they will have overhead compared to a regular threadpool because of I/O reactor, waker etc code that will never get used.
rayon or crossbeam are more appropriate since they are actually designed for general-purpose thread work. Using rayon's scopes will also get rid of the `unsafe` that's being (ab)used to create the fake `&'static mut` slices.
Though for some reason, the rayon version (rayon_qsort.rs) uses scopes, but still uses `unsafe` to create `&'static mut` slices... ( https://github.com/kprotty/zap/pull/3 )
I don't think tokio's slowness here is "to be expected". There isn't much reason for tokio tasks to have that much overhead over normal thread pools. The I/O driver shouldn't be called in such a benchmark given there's no I/O work happening. Wakers only add a reference count inc/dec plus an atomic CAS for wake(), which should only happen on the JoinHandle `await`s [4], compared to just an atomic swap on the join handle in the Zig case.
Golang doesn't poll for I/O under such cases [0], and tokio should be using the `ParkThread` parker for putting threads to sleep [1] given `net` features aren't enabled (not sure if this is actually the case), which you can force with a custom Runtime initialization instead of `#[tokio::main]` as an exercise.
`crossbeam-deque` requires heap allocation for the run queues, heap allocates on growth, and garbage collects said memory. This is overhead I wished to avoid, and something tokio has been making improvements to avoid as well [2].
`rayon` isn't a good comparison here given `rayon::join` is optimized to hook directly into the scheduler and run the caller only until the other forked section completes [3]. This isn't general purpose, and it takes advantage of unbounded stack allocation which can technically cause a stack overflow. Zig could do this and also take advantage of batch scheduling, but it complicates the code and is unfair here given `async` usage. Tokio, golang, and the Zig benchmarks require heap allocation on spawn, so I believe it makes for a fairer comparison. This is also why I used rayon scopes instead of join(): less specialization, and it reintroduces the heap allocation from the unbounded concurrency.
The `unsafe` there is from me copying the benchmark code from the tokio version to the rayon version and forgetting to remove the hack. In tokio, ownership of the array needed to be passed into the function given the lifetime was no longer tied to the spawn(), I assume (correct me if I'm wrong here, but this is what the compile error hinted at). So I needed to recreate the array after the function, hence unsafe. If there's a better way for the tokio version, please send a PR. I see you've done so for the rayon version and I gladly merged it.
>I don't think tokio's slowness here is "to be expected".
And yet on my machine the rayon version takes ~160ms and the tokio version takes ~1350ms. This isn't at the level of some minor missed performance optimization.
>There isn't much reason for tokio tasks to have that much overhead over normal thread pools.
tokio is an async runtime. tokio tasks are meant to be for distributing I/O-bound work. It would be at least a little more correct to use spawn_blocking for CPU-bound tasks, but that still doesn't work for your recursive calls because that's not what it's meant for.
In general, if you have CPU-bound work in your tokio-using application, you run it on a different threadpool - tokio's blocking one, or a completely different one.
>`rayon` isn't a good comparison here given `rayon::join` is optimized to hook directly into the scheduler and run the caller only until the other forked-section completes [3]. [...] This is also why I used rayon scopes instead of join(): less specialization and reintroduced the heap allocation from the unbounded concurrency.
My original comment was also talking about scopes, not `rayon::join`. So yes, `rayon` is absolutely a good comparison.
This actually can be at the level of a missed optimization. A run queue implemented as a single locked queue shared among all the threads scales even worse than the tokio version. Sharding the run queues and changing the notification algorithm, even while keeping locks on the sharded queues, improves throughput drastically.
Tokio is an async runtime, but I don't see why being an async runtime should make it worse from a throughput perspective for a thread pool. I actually started on a Rust version [0] to test out this theory of whether async-rust was the culprit, but realized that I was being nerd-sniped [1] at this point and I should continue my Zig work instead. If you're still interested, I'm open to receiving PRs and questions on that if you want to see that in action.
It's still correct to benchmark and compare tokio here given the scheduler I was designing was meant to be used with async tasks: a bunch of concurrent, small-executing work units. I mention this in the second paragraph of "Why Build Your Own?".
The thread pool in the post is meant to be used to distribute I/O-bound work. A friend of mine hooked up cross-platform I/O abstractions to the thread pool [2] and benchmarked it against tokio to have greater throughput and slightly worse tail latency under a local load [3]. The thread pool serves its purpose, and the quicksort benchmark is there to show how schedulers behave under relatively concurrent workloads. I could've used a benchmark with smaller tasks than the cpu-bound partition()/insertion_sort(), but this worked as a common example.
I've already mentioned why rayon isn't a good comparison: 1. It doesn't support async root concurrency. 2. scoped() waits for tasks to complete by either blocking the OS thread or using similar inline-scheduler-loop optimizations. This risks stack overflow and isn't available as a use case in other async runtimes due to primarily being a fork-join optimization.
I'm not an expert on the go scheduler, but my perception is that it is more of a focused, single-purpose component, whereas tokio seems like a sprawling swiss-army-knife of a library if you browse the source.
The tokio scheduler and the go scheduler are roughly equivalent. Much of tokio's bulk is reimplementing much of the std lib in an async-compatible way (io, os, blocking).
If you browse the source, the go scheduler has complexities to deal with that tokio doesn't as well. The thread pool is unified between worker threads & blocking threads. Go also does goroutine preemption via signaling/SuspendThread plus a background monitor thread called sysmon. Go does garbage collection, and the tracing/race-detection/semantics are tightly coupled to both its allocator and its scheduler. Go also exposes and maintains its entire standard library, which includes an http client/server (the tokio org maintains their own as hyper, but it's separated from tokio the runtime). It can be fair to argue that Go is just as "swiss-army-knife" of a system as tokio.
It depends on what you want to do. If you are doing io-bound work, Tokio would be what you want -- you would use it as a runtime for the async capabilities in Rust. If you have cpu-bound work, then rayon is what you want to use. Rayon is a work-stealing parallelism crate -- it will schedule work to be done, and different threads will pick up portions of it as they become available to do work. It's very easy to get 100% CPU utilization across all cores using Rayon if your work is naturally parallelizable, and the interface is dead simple: anywhere you do a .iter(), you can turn it into a .par_iter(), and Rayon will parallelize the work.
Note there is some overhead to using Rayon -- you normally will be better off doing your work on a single thread unless you have a large number of elements in your collection... I found for my application I needed more than 1e6 elements before I saw an appreciable performance benefit to using Rayon.
As others said, Crossbeam is for sending and receiving messages across threads. I use it alongside of tokio and rayon.
Crossbeam is a library that provides various concurrency primitives, such as a queue type that allows stealing of work.
Rayon is a library for running code in parallel. Typically you'll give it a (parallel) iterator of sorts, and it will distribute the work across a pool of threads.
Erm, could somebody explain to me why I shouldn't take this as an argument in favor of Go, given that it is about as fast as the fully optimized Zig and Rust versions?
Go's value proposition is that it has good bang-for-the-buck. It's fairly easy to write in. There's easier, but it's fairly easy. It performs fairly well. There are things that perform better, but it's fairly good. It scales fairly well. There's things that scale better, but it's pretty good.
If you draw all this out on the programming landscape, I think one of the reasons Go has succeeded as well as it did is that this was a poorly covered part of the landscape. There were languages that were much easier, but performed much worse, and languages that performed much better, but generally required a lot more developer effort.
I don't expect Go to ever top the charts on performance on a task like this, but it's often hanging around at only a bit slower, and it does that across a wide variety of tasks and on a wide variety of dimensions. It's not the best at much of anything, but it's pretty darned good for pretty cheap on a lot of measures.
Well, working with several tasks is THE thing with Go, right? If it were much worse at it, Go would fail.
In situations like this, it's important to remember that Rust/Zig are (systems) languages you could use to MAKE "Go", but also a million other different things with different goals.
So another way to look at this is: "If I wanted to design a language with a runtime like Go, both of these get that close!"
I'm not sure how this implies it is flawed. It benchmarks thread pools so on a system which allowed more parallel concurrent tasks (i.e. 32t amd cpu) the throughput is expected to scale somewhat, and you see this in your results.
Also, `zig -Dc` links to libc (`-Dc`) but builds in debug mode by default, similar to Rust and C/C++ compilers. The readme contains instructions to compile it with optimizations (`-Drelease-fast`), the rust versions do so as well (`--release`) so you should re-evaluate your results.
See one of my other comments in this post on why I didn't use rayon::join() by default.
> Also, `zig -Dc` links to libc (`-Dc`) but builds in debug mode by default, similar to Rust and C/C++ compilers
Right! This was a misunderstanding of mine. Updated results:
8-thread Intel laptop machine:
zig 0.9.0-dev.958: 230
rust 1.5.0-nightly 21-09-12/qsort: 545
rust 1.5.0-nightly 21-09-12/rsort: 255
go 1.16.8: 355
32-thread AMD workstation:
zig 0.9.0-dev.958: 125
rust 1.5.0-nightly 21-09-12/qsort: 780
rust 1.5.0-nightly 21-09-12/rsort: 135
go 1.16.8: 135
This is why I think conclusions should not be drawn: there's excessive variability between systems.
On the parent poster's system, zig is faster by a much larger margin than the two systems I've tried. And there's a ~45% difference in Go's performance compared to the fastest runner for the respective system.
The results will vary on different systems given how the combination of the CPU and OS handle thread synchronization + scheduling. On one of my desktops running Windows 10 with an i7-4790k, the Go qsort runs slower (~600ms) than the Rust qsort (~490ms) while on Linux systems the results are generally the opposite.
The zig thread pool appears consistently faster for this benchmark on different systems, and rayon::join appears consistently faster than the zig thread pool on different systems too. I believe you can somewhat conclude the advantages of their designs from this relative to each other rather than in an absolute sense.
At first glance, this looks great, almost 20% speedup!
But genuine question, this looks to be how almost every new tech begins. Initially it's FAST because it keeps things as barebones as possible, and everything else other than the specific module can be ignored.
But no feature works in isolation, so as more features get added – scheduling constraints, safety quirks, or random features that cut across the language – the 20% looks like nothing.
I did go through the blog, but how can one be sure that this 20% gap is a guaranteed design advantage, and not just a temporary deferral of everything else?
fwiw Zig is taking its role as a low-level, bare-bones "C replacement" as a core design constraint. For instance, Zig has eschewed even fairly tame features like operator overloading because they're not considered in line with Zig's identity as a minimal, explicit language.
So feature creep is always a concern, but from what I understand Zig has a better chance at avoiding this than other projects due to the language's guiding philosophy.
> how can one be sure that this 20% gap is a guaranteed design advantage
It's no guarantee, but in my experience, Zig is a language that lends itself very well to composability. I imagine that instead of adding features to stdlib, anyone who needs "extra features" in their threading system will find it quite easy to implement their own as a library outside of stdlib that can easily be swapped in and out in A/B tests by the library consumer. If some feature or another is REALLY wanted in the stdlib for some reason, it would probably be easy to put it behind a comptime flag as part of a type constructor function, which won't compromise (runtime) performance, with only a teeny tiny hit to comptime performance.
> this looks to be how almost every new tech begins
On second thought, do you think this is true? The level of detail shown in this post is fairly extraordinary for event loops just getting started.
> how can one be sure that this 20% gap is a guaranteed design advantage
I would say because the design is solid and recognizes principles of mechanical sympathy that are fundamental and unlikely to change for quite some time.
Nobody can predict the future, but one thing that might help is that Protty is in the Zig core dev team and, as the post shows, he really cares about efficiency.
it seems like the design philosophy of zig is such that it's not going for feature creep/bloat, it's keeping the language very simple and eschewing unexpected complexity
Does this already use `target-cpu=native`? Not sure if that would be apples-to-apples with the Zig implementation (which is what's important here), but I'd be surprised if it didn't yield some interesting wins.
Rust is such a bloated, insane design-by-committee language. It's actually still surprising to me that corporations are defending it and using it. I sure won't.
It's also just so dang ugly to read. If Perl and C++ had a mutant offspring, I think it'd be Rust. And that's just not respectful to Perl!
I'm really happy that Zig is doing so well. I'm planning on contributing a little chunk of money to the ZSF whenever they get a 1.0 landed. I do think there are some lessons to learn from Rust about formal methods as foundational building blocks in a language. I'm eager to see Zig mature more around the memory-safety side of things.
I'm increasingly skeptical that any language with lifetimes can be anything other than "ugly to read". People complain about syntax, but what they're really complaining about is semantics.
Was? New junk is added to Rust every few weeks, using past tense seems intentionally deceitful, as does re-characterizing my comment. I really do mean ugly as in ugly. That's why I wrote that.
One of the strongest factors behind a lot of people choosing to stay away from Rust is that so many conversations with advocates are met with toxic and subtly toxic behavior. They demean software that's not memory safe the way that politicians use their words to sow anger. C has won, and Rust blew its shot aiming at C++ instead.
As Amazon swallows Rust and spits all the refuse out, I can't help but smile, because it's a community that's been begging for obsolescence via their attitudes and behaviors since day one.
It's not just the function's body, it's any block. Blocks in rust are expressions and return their last statement's value. It's nothing like Javascript's ASI where it's just inserting ; implicitly and only `return` can return a value.
It's also nothing like Javascript because Rust is statically typed. If you accidentally insert a semi-colon (or not), you're almost guaranteed to get a compile error.