Running the benchmarks locally (zig is fastest, but that's probably expected)
zig (~1 week older than HEAD):
./zig-out/bin/qsort
filling
shuffling
running
took 177.46ms
rust nightly:
cargo run --release --bin qsort
warning: the cargo feature `edition2021` has been stabilized in the 1.56 release and is no longer necessary to be listed in the manifest
See https://doc.rust-lang.org/nightly/cargo/reference/manifest.html#the-edition-field for more information about using this feature.
Finished release [optimized] target(s) in 0.02s
Running `target/release/qsort`
filling
shuffling
running
took 896.91656ms
cargo run --release --bin rsort
Finished release [optimized] target(s) in 0.02s
Running `target/release/rsort`
filling
shuffling
running
took 212.190694ms
go 1.16.3:
go run qsort.go
filling
shuffling
running
took 222.90356ms
on macOS 11.5.2 with a 2.4 GHz 8-Core Intel Core i9
The fact that the Rust tokio version (the one that uses tokio tasks instead of threads) is slow is to be expected. tokio tasks aren't appropriate for running quicksort; they will have overhead compared to a regular threadpool because of I/O reactor, waker etc code that will never get used.
rayon or crossbeam are more appropriate since they are actually designed for general-purpose thread work. Using rayon's scopes will also get rid of the `unsafe` that's being (ab)used to create the fake `&'static mut` slices.
Though for some reason, the rayon version (rayon_qsort.rs) uses scopes, but still uses `unsafe` to create `&'static mut` slices... ( https://github.com/kprotty/zap/pull/3 )
I don't think tokio's slowness here is "to be expected". There isn't much reason for tokio tasks to have that much overhead over normal thread pools. The I/O driver shouldn't be called in such a benchmark given there's no I/O work happening. Wakers only add a reference count inc/dec plus an atomic CAS for wake(), which should only happen on the JoinHandle `await`s [4], compared to just an atomic swap on the join handle in the Zig case.
Golang doesn't poll for I/O under such cases [0], and tokio should be using the `ParkThread` parker for putting threads to sleep [1] given `net` features aren't enabled (not sure if this is actually the case), which you can force with a custom Runtime initialization instead of `#[tokio::main]` as an exercise.
`crossbeam-deque` requires heap allocation for the run queues, heap allocates on growth, and garbage collects said memory. This is overhead I wished to avoid, and something tokio has been making improvements to avoid as well [2].
`rayon` isn't a good comparison here given `rayon::join` is optimized to hook directly into the scheduler and run the caller only until the other forked section completes [3]. This isn't general purpose, and it takes advantage of unbounded stack allocation which can technically cause a stack overflow. Zig could do this and also take advantage of batch scheduling, but it complicates the code and is unfair here given `async` usage. Tokio, golang, and the Zig benchmarks require heap allocation on spawn, so I believe it makes for a fairer comparison. This is also why I used rayon scopes instead of join(): less specialization, and it reintroduces the heap allocation from the unbounded concurrency.
The `unsafe` there is from me copying the benchmark code from the tokio version to the rayon version and forgetting to remove the hack. In tokio, ownership of the array needed to be passed into the function given the lifetime was no longer tied to the spawn(), I assume (correct me if I'm wrong here, but this is what the compile error hinted at). So I needed to recreate the array after the function, hence unsafe. If there's a better way for the tokio version, please send a PR. I see you've done so for the rayon version and I gladly merged it.
>I don't think tokio's slowness here is "to be expected".
And yet on my machine the rayon version takes ~160ms and the tokio version takes ~1350ms. This isn't at the level of some minor missed performance optimization.
>There isn't much reason for tokio tasks to have that much overhead over normal thread pools.
tokio is an async runtime. tokio tasks are meant to be for distributing I/O-bound work. It would be at least a little more correct to use spawn_blocking for CPU-bound tasks, but that still doesn't work for your recursive calls because that's not what it's meant for.
In general, if you have CPU-bound work in your tokio-using application, you run it on a different threadpool - tokio's blocking one, or a completely different one.
>`rayon` isn't a good comparison here given `rayon::join` is optimized to hook directly into the scheduler and run the caller only until the other forked-section completes [3]. [...] This is also why I used rayon scopes instead of join(): less specialization and reintroduced the heap allocation from the unbounded concurrency.
My original comment was also talking about scopes, not `rayon::join`. So yes, `rayon` is absolutely a good comparison.
This actually can be at the level of a missed optimization. A run queue implemented as a single locked queue shared among all the threads scales even worse than the tokio version. Sharding the run queues and changing the notification algorithm, even while keeping locks on the sharded queues, improves throughput drastically.
Tokio is an async runtime, but I don't see why being an async runtime should make it worse from a throughput perspective for a thread pool. I actually started on a Rust version [0] to test out this theory of whether async-rust was the culprit, but realized that I was being nerd-sniped [1] at this point and I should continue my Zig work instead. If you're still interested, I'm open to receiving PRs and questions on that if you want to see that in action.
It's still correct to benchmark and compare tokio here given the scheduler I was designing was meant to be used with async tasks: a bunch of concurrent, small-executing work units. I mention this in the second paragraph of "Why Build Your Own?".
The thread pool in the post is meant to be used to distribute I/O-bound work. A friend of mine hooked up cross-platform I/O abstractions to the thread pool [2] and benchmarked it against tokio to have greater throughput and slightly worse tail latency under a local load [3]. The thread pool serves its purpose, and the quicksort benchmark is there to show how schedulers behave under relatively concurrent workloads. I could've used a benchmark with smaller tasks than the cpu-bound partition()/insertion_sort(), but this worked as a common example.
I've already mentioned why rayon isn't a good comparison: 1. It doesn't support async root concurrency. 2. scoped() waits for tasks to complete by either blocking the OS thread or using similar inline-scheduler-loop optimizations. This risks stack overflow and isn't available as a use case in other async runtimes due to primarily being a fork-join optimization.
I'm not an expert on the go scheduler, but my perception is that it is more of a focused, single-purpose component, whereas tokio seems like a sprawling swiss-army-knife of a library if you browse the source.
The tokio scheduler and the go scheduler are roughly equivalent. Much of tokio's bulk is reimplementing much of the std lib in an async-compatible way (io, os, blocking).
If you browse the source, the go scheduler has complexities to deal with that tokio doesn't as well. The thread pool is unified between worker threads & blocking threads. Go also does goroutine preemption via signaling/SuspendThread plus a background monitor thread called sysmon. Go does garbage collection, and the tracing/race-detection/semantics are tightly coupled to both its allocator and its scheduler. Go also exposes and maintains its entire standard library, which includes an http client/server (the tokio org maintains their own as hyper, but it's separated from tokio the runtime). It can be fair to argue that Go is just as "swiss-army-knife" of a system as tokio.
It depends on what you want to do. If you are doing io-bound work, Tokio would be what you want -- you would use it as a runtime for the async capabilities in Rust. If you have cpu-bound work, then rayon is what you want to use. Rayon is a work-stealing parallelism crate -- it will schedule work to be done, and different threads will pick up portions of it as they become available to do work. It's very easy to get 100% CPU utilization across all cores using Rayon if your work is naturally parallelizable, and the interface is dead simple: anywhere you do a .iter(), you can turn it into a .par_iter(), and Rayon will parallelize the work.
Note there is some overhead to using Rayon -- you normally will be better off doing your work on a single thread unless you have a large number of elements in your collection... I found for my application I needed more than 1e6 elements before I saw an appreciable performance benefit to using Rayon.
As others said, Crossbeam is for sending and receiving messages across threads. I use it alongside of tokio and rayon.
Crossbeam is a library that provides various concurrency primitives, such as a queue type that allows stealing of work.
Rayon is a library for running code in parallel. Typically you'll give it a (parallel) iterator of sorts, and it will distribute the work across a pool of threads.
Erm, could somebody explain to me why I shouldn't take this as an argument in favor of Go, given that it is about as fast as the fully optimized Zig and Rust versions?
Go's value proposition is that it has good bang-for-the-buck. It's fairly easy to write in. There's easier, but it's fairly easy. It performs fairly well. There are things that perform better, but it's fairly good. It scales fairly well. There's things that scale better, but it's pretty good.
If you draw all this out on the programming landscape, I think one of the reasons Go has succeeded as well as it did is that this was a poorly covered part of the landscape. There were languages that were much easier, but performed much worse, and languages that performed much better, but generally required a lot more developer effort.
I don't expect Go to ever top the charts on performance on a task like this, but it's often hanging around at only a bit slower, and it does that across a wide variety of tasks and on a wide variety of dimensions. It's not the best at much of anything, but it's pretty darned good for pretty cheap on a lot of measures.
Well, working with several tasks is THE thing with Go, right? If it were much worse at it, Go would fail.
In situations like this, it's important to remember that Rust/Zig are (systems) languages you could use to MAKE "Go", but also a million other different things with different goals.
So another way to look at this is: "If I wanted to design a language with a runtime like Go, both of these get that close!"
I'm not sure how this implies it is flawed. It benchmarks thread pools so on a system which allowed more parallel concurrent tasks (i.e. 32t amd cpu) the throughput is expected to scale somewhat, and you see this in your results.
Also, `zig -Dc` links to libc (`-Dc`) but builds in debug mode by default, similar to Rust and C/C++ compilers. The readme contains instructions to compile it with optimizations (`-Drelease-fast`), the rust versions do so as well (`--release`) so you should re-evaluate your results.
See one of my other comments in this post on why I didn't use rayon::join() by default.
> Also, `zig -Dc` links to libc (`-Dc`) but builds in debug mode by default, similar to Rust and C/C++ compilers
Right! This was a misunderstanding of mine. Updated results:
8-thread Intel laptop machine:
zig 0.9.0-dev.958: 230
rust 1.5.0-nightly 21-09-12/qsort: 545
rust 1.5.0-nightly 21-09-12/rsort: 255
go 1.16.8: 355
32-thread AMD workstation:
zig 0.9.0-dev.958: 125
rust 1.5.0-nightly 21-09-12/qsort: 780
rust 1.5.0-nightly 21-09-12/rsort: 135
go 1.16.8: 135
This is why I think conclusions should not be drawn: there's excessive variability between systems.
On the parent poster's system, zig is faster by a much larger margin than the two systems I've tried. And there's a ~45% difference in Go's performance compared to the fastest runner for the respective system.
The results will vary on different systems given how the combination of the CPU and OS handle thread synchronization + scheduling. On one of my desktops running Windows 10 with an i7-4790k, the Go qsort runs slower (~600ms) than the Rust qsort (~490ms) while on Linux systems the results are generally the opposite.
The zig thread pool appears consistently faster for this benchmark on different systems, and rayon::join appears consistently faster than the zig thread pool on different systems too. I believe you can somewhat conclude the advantages of their designs from this relative to each other rather than in an absolute sense.
At first glance, this looks great, almost 20% speedup!
But genuine question, this looks to be how almost every new tech begins. Initially it's FAST because it keeps things as barebones as possible, and everything else other than the specific module can be ignored.
But no feature works in isolation, so as more features get added – scheduling constraints, safety quirks, or random features that cut across the language – the 20% looks like nothing.
I did go through the blog, but how can one be sure that this 20% gap is a guaranteed design advantage, and not just a temporary deferral of everything else?
fwiw Zig is taking its role as a low-level, bare-bones "C replacement" as a core design constraint. For instance, Zig has eschewed even fairly tame features like operator overloading because they're not considered in line with Zig's identity as a minimal, explicit language.
So feature creep is always a concern, but from what I understand Zig has a better chance at avoiding this than other projects due to the language's guiding philosophy.
> how can one be sure that this 20% gap is a guaranteed design advantage
It's no guarantee, but in my experience, Zig is a language that lends itself very well to composability. I imagine that instead of adding features to stdlib, anyone who needs "extra features" in their threading system will find it quite easy to implement their own as a library outside of stdlib that can easily be swapped in and out in A/B tests by the library consumer. If some feature or another is REALLY wanted in the stdlib for some reason, it would probably be easy to put it behind a comptime flag as part of a type constructor function, which won't compromise (runtime) performance, with only a teeny tiny hit to comptime performance.
> this looks to be how almost every new tech begins
On second thought, do you think this is true? The level of detail shown in this post is fairly extraordinary for event loops just getting started.
> how can one be sure that this 20% gap is a guaranteed design advantage
I would say because the design is solid and recognizes principles of mechanical sympathy that are fundamental and unlikely to change for quite some time.
Nobody can predict the future, but one thing that might help is that Protty is in the Zig core dev team and, as the post shows, he really cares about efficiency.
it seems like the design philosophy of zig is such that it's not going for feature creep/bloat, it's keeping the language very simple and eschewing unexpected complexity
Does this already use `target-cpu=native`? Not sure if that would be apples-to-apples with the Zig implementation (which is what's important here), but I'd be surprised if it didn't yield some interesting wins.
Rust is such a bloated, insane design-by-committee language. It's actually still surprising to me that corporations are defending it and using it. I sure won't.
It's also just so dang ugly to read. If Perl and C++ had a mutant offspring, I think it'd be Rust. And that's just not respectful to Perl!
I'm really happy that Zig is doing so well. I'm planning on contributing a little chunk of money to the ZSF whenever they get a 1.0 landed. I do think there are some lessons to learn from Rust about formal methods as foundational building blocks in a language. I'm eager to see Zig mature more around the memory-safety side of things.
I'm increasingly skeptical that any language with lifetimes can be anything other than "ugly to read". People complain about syntax, but what they're really complaining about is semantics.
Was? New junk is added to Rust every few weeks, using past tense seems intentionally deceitful, as does re-characterizing my comment. I really do mean ugly as in ugly. That's why I wrote that.
One of the strongest factors behind a lot of people choosing to stay away from Rust is that so many conversations with advocates are met with toxic and subtly toxic behavior. They demean software that's not memory safe the way that politicians use their words to sow anger. C has won, and Rust blew its shot aiming at C++ instead.
As Amazon swallows Rust and spits all the refuse out, I can't help but smile, because it's a community that's been begging for obsolescence via their attitudes and behaviors since day one.
It's not just the function's body, it's any block. Blocks in rust are expressions and return their last statement's value. It's nothing like Javascript's ASI where it's just inserting ; implicitly and only `return` can return a value.
It's also nothing like Javascript because Rust is statically typed. If you accidentally insert a semi-colon (or not), you're almost guaranteed to get a compile error.